2 min read

It’s Not Brain Science: Cassandra, Spark, and Microsoft SQL Server

Featured Image

There are many different parts involved in building a custom application that integrates with many different 3rd party and legacy systems. A big aspect is building out a centralized database, and figuring out how to move that data between all the applications. Senior Systems Engineer Judge Hiciano documents some of his experiences going through that process.

In this project, Apache Spark was used to connect to the data sources, in this case MS SQL server, and then copy the data to a Datastax Enterprise version of Cassandra. It still works if you’re using straight Cassandra or DSE Community with the free Datastax driver to connect though. We had many third-party applications that we wanted to tie to a new mobile/web application, so this project involved more than simply migrating the data over from our existing systems, and strictly using Cassandra.

We also didn’t want to build our API to connect directly into the third-party applications via their respective API/Web service. This was due to the fact that we are in a very dynamic environment, with applications changing all the time. It was easier for us to change the extract, transform, load (ETL) process from the existing databases to Cassandra, versus rewriting the API each time we wanted to introduce or move to a new application.

We used Lightweight Directory Access Protocol (LDAP) authentication (currently working on introducing in next project phase). In addition, having Apache Spark and Solr already configured in with DSE,  made our lives much easier.

We went with two datacenters for our Cassandra instance; a Spark D.C., which we used for ETL, and then a Solr D.C. for indexing, and as a source for our API/application to read from.

Getting data from the SQL Server to DSE Cassandra is pretty straightforward once you have the groundwork done. To make things easier for ourselves, we configured views in the MS SQL Server to match what our Cassandra schema would like. We used Talend OpenStudio (Free for download: Here) to create our base Cassandra schematic from MS SQL Server view, and then used Spark to perform our ETL process. You could do all of this in Spark if you really wanted to, but it’s much faster to use OpenStudio, especially if you’re trying out the schema along the way.