Randy Zwitch
Dec 13, 2019

Accelerated Extract-Load-Transform Data Pipelines


As a columnar database with both strong CPU and GPU performance, the OmniSci platform is well suited for Extract-Load-Transform (ELT) pipelines (as well as the data science workloads we more frequently demonstrate). In this blog post, I’ll demonstrate an example ELT workflow, along with some helpful tips for merging files with drifting data schemas. If you’re not familiar with the two major data processing workflows, the next section briefly outlines the history of and reasoning behind ETL vs. ELT; if you’re just interested in the mechanics of doing ELT in OmniSci, you can skip to the “Baywheels Bikeshare Data” section.

A Brief History of ETL vs. ELT for Loading Data

From the first computerized databases in the 1960s, the Extract-Transform-Load (ETL) data processing methodology has been an integral part of running a data-driven business. Historically, storing and processing data was too expensive to be accumulating data without knowing what you were going to do with it, so a process such as the following would occur each day:

  1. Transfer yesterday’s data from FTP to a staging area
  2. Read each data file, keep only the fields deemed “useful” by a business process, and load that data into a staging database environment
  3. Run a series of stored procedures or views to transform the raw data, enriching and/or summarizing it and storing the result in another table
  4. Load the data from the staging table into the production database table
  5. Re-calculate the production database table indices
  6. Retain the source data for a short period in case the pipeline fails; otherwise, migrate it to cold storage or delete it

While the ETL method of data processing can be effective when the business goals are well understood, it has two major flaws: inflexibility and latency.

From a flexibility standpoint, once you complete the ETL process, making changes requires developer time (and might not be possible at all, depending on how tightly other systems are integrated with the database). If your business rules or business environment changes rapidly, it will be difficult, if not impossible, to keep the production database up-to-date. Even if you could get a developer to make changes to the ETL process, re-processing the data would take days or weeks at best (assuming the source data was still available to begin with).

Fast-forward to the early-to-mid-2010s: the Hadoop project showed that storing data on commodity hardware could be cost effective, and the concept of “data lakes” started to take shape. Businesses began storing data well in advance of knowing what they wanted to do with it. However, having huge volumes of data in a data lake doesn’t fit well with the ETL view of the world; rather, with newer technologies like distributed clusters of machines, columnar databases, and GPUs, Extract-Load-Transform often makes more sense. Instead of building elaborate processing pipelines with pre-defined business rules and aggregations, data scientists need to take a bit more time upfront to create their dataset, but with that extra work comes the flexibility to create the exact transformations desired at whatever data fidelity is required. Luckily, with millisecond response times for OmniSciDB queries, the pain of doing data preparation is significantly reduced.

Baywheels Bikeshare Data

I’ve previously written about analyzing bikeshare data in OmniSci; in the year since I wrote that post, the bikeshare program rebranded from FordGoBike to Baywheels, but luckily, they still provide the aggregated system-level dataset on a monthly basis for all of their rides. The data is provided in an S3 bucket and appears to have a very similar size and format; however, once you start trying to load the data, you’ll notice that, unfortunately, the files have different schemas.

Create Table and Copy From S3 Into OmniSci

For the 2017 FordGoBike trips, we can use a `create table` statement, then use `copy from` via omnisql to load the data:
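Here’s a minimal sketch of what those two statements look like; the column names and types below are assumptions based on the 2017 trip file layout, as is the exact file key in the public S3 bucket:

-- table layout assumed from the 2017 FordGoBike trip file
CREATE TABLE baywheels_tripdata (
  duration_sec INTEGER,
  start_time TIMESTAMP,
  end_time TIMESTAMP,
  start_station_id INTEGER,
  start_station_name TEXT ENCODING DICT,
  start_station_latitude FLOAT,
  start_station_longitude FLOAT,
  end_station_id INTEGER,
  end_station_name TEXT ENCODING DICT,
  end_station_latitude FLOAT,
  end_station_longitude FLOAT,
  bike_id INTEGER,
  user_type TEXT ENCODING DICT,
  member_birth_year SMALLINT,
  member_gender TEXT ENCODING DICT
);

-- load the 2017 trips directly from the public S3 bucket (file key assumed)
copy baywheels_tripdata from 's3://baywheels-data/2017-fordgobike-tripdata.csv';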

However, once you attempt to load the January 2018 file, you’ll get the following message:

omnisql> copy baywheels_tripdata from 's3://baywheels-data/201801-fordgobike-tripdata.csv.zip';
Result
Loaded: 0 recs, Rejected: 94802 recs in 5.752000 secs

When you see a message such as this one, where 0 records are loaded and 100% of the records are rejected, the first question to ask is “Has the schema changed?” In this case, an additional column `bike_share_for_all_trip` has been added to the end of the January 2018 file, and since OmniSci isn’t expecting the extra column, it rejects every record.

Adding a Column Using ALTER TABLE

To handle the additional column, we can use `ALTER TABLE` with `ADD COLUMN` to modify the table schema:

--add column bike_share_for_all_trip, which starts with Jan 2018 file
ALTER TABLE baywheels_tripdata
ADD COLUMN bike_share_for_all_trip TEXT ENCODING DICT;

This is as close to a “free” operation as you can get in OmniSci; `ALTER TABLE ... ADD COLUMN` only changes the table metadata, so it doesn’t copy data or do any other processing. Without running any additional commands, the values in the `bike_share_for_all_trip` column will be NULL, which is appropriate since we don’t know whether the 2017 rides would’ve qualified for this program.
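As a quick sanity check in omnisql, a simple aggregation (using the table and column names from above) should show a single NULL group covering all of the 2017 rows loaded so far:

-- every row loaded before the column existed falls into the NULL group
SELECT bike_share_for_all_trip, COUNT(*) AS rides
FROM baywheels_tripdata
GROUP BY bike_share_for_all_trip;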

Going through a series of `COPY FROM` commands, we can load the 201801 - 201905 monthly files, until the following happens:

omnisql> copy baywheels_tripdata from 's3://baywheels-data/201905-baywheels-tripdata.csv.zip';
Result
Loaded: 182163 recs, Rejected: 0 recs in 2.440000 secs

omnisql> copy baywheels_tripdata from 's3://baywheels-data/201906-baywheels-tripdata.csv.zip';
Result
Loader truncated due to reject count.  Processed : 0 recs, Rejected: 191772 recs in 8.323000 secs

Like before, we get a message that 0 records were loaded and all were rejected, so we know that something is wrong. And again, after inspecting the file, we see a new column, `rental_access_method`. This column has the strange property of only being populated when `bike_share_for_all_trip` is not, which might give us a clue to its meaning. As it stands, it seems more like a mistake than a new data element, so rather than use `ALTER TABLE` again, I’m going to make a staging table to do some additional processing.

Loading Data to Staging Table, Validate With Immerse

Both the 201906 and 201907 files have an extra column that is only populated when `bike_share_for_all_trip` is blank, so I’ll load both of these months into a separate staging table:
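Here’s a sketch of that staging load; the staging table name, its column list (the main table’s columns plus `rental_access_method`), and which of the two files is the semicolon-delimited one are all assumptions on my part:

-- staging table mirrors baywheels_tripdata, plus the extra rental_access_method column
-- (column list assumed; match it to the header row of the 2019 files)
CREATE TABLE baywheels_tripdata_staging (
  duration_sec INTEGER,
  start_time TIMESTAMP,
  end_time TIMESTAMP,
  start_station_id INTEGER,
  start_station_name TEXT ENCODING DICT,
  start_station_latitude FLOAT,
  start_station_longitude FLOAT,
  end_station_id INTEGER,
  end_station_name TEXT ENCODING DICT,
  end_station_latitude FLOAT,
  end_station_longitude FLOAT,
  bike_id INTEGER,
  user_type TEXT ENCODING DICT,
  member_birth_year SMALLINT,
  member_gender TEXT ENCODING DICT,
  bike_share_for_all_trip TEXT ENCODING DICT,
  rental_access_method TEXT ENCODING DICT
);

copy baywheels_tripdata_staging from 's3://baywheels-data/201906-baywheels-tripdata.csv.zip';

-- this file uses semicolons instead of commas, so override the default delimiter
copy baywheels_tripdata_staging from 's3://baywheels-data/201907-baywheels-tripdata.csv.zip' WITH (delimiter=';');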

Just to up the level of difficulty, you’ll notice the second `COPY FROM` statement has an extra argument to set the delimiter to `;`...always great to see that rare “SEMI-COLON separated comma-separated-values” file in the wild!

With the data loaded into a staging table, I can now use OmniSci Immerse to validate a hunch: if the extra column has the same distribution as the `bike_share_for_all_trip` column, then I’m going to assume that the extra column is actually used to derive the `bike_share_for_all_trip` column:

The chart on the left is the distribution of `bike_share_for_all_trip` in the main table with nulls removed (remember, we have nulls because this column didn’t exist in 2017). The middle chart shows the distribution of the `bike_share_for_all_trip` column in the staging table (without nulls), and the chart on the right shows the `rental_access_method` column in the staging table (without nulls). While the distributions aren’t exactly equal, they look pretty close to me, and given that the two columns have the same cardinality (two values plus NULL) and one is populated only when the other isn’t, I’m comfortable merging the two columns together.
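The same hunch can also be checked in omnisql rather than Immerse; a pair of GROUP BY queries over the staging table (names assumed from the sketch above) gives the two distributions directly:

-- compare the value distributions of the two candidate columns
SELECT bike_share_for_all_trip, COUNT(*) AS rides
FROM baywheels_tripdata_staging
GROUP BY bike_share_for_all_trip;

SELECT rental_access_method, COUNT(*) AS rides
FROM baywheels_tripdata_staging
GROUP BY rental_access_method;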

Inserting Into an OmniSci Table With a Query

The final step of this ELT process is to insert the data from the staging table into the main table:
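Here’s a sketch of that final `INSERT INTO ... SELECT`; the select list has to match the main table’s column order, and the literal `rental_access_method` codes in the CASE expression ('app' and 'clipper') are assumptions you’d need to confirm against the staging data:

-- re-code rental_access_method into the Yes/No convention of bike_share_for_all_trip,
-- keeping bike_share_for_all_trip whenever it is already populated
INSERT INTO baywheels_tripdata
SELECT
  duration_sec,
  start_time,
  end_time,
  start_station_id,
  start_station_name,
  start_station_latitude,
  start_station_longitude,
  end_station_id,
  end_station_name,
  end_station_latitude,
  end_station_longitude,
  bike_id,
  user_type,
  member_birth_year,
  member_gender,
  CASE
    WHEN bike_share_for_all_trip IS NOT NULL THEN bike_share_for_all_trip
    WHEN rental_access_method = 'app' THEN 'No'       -- assumed mapping
    WHEN rental_access_method = 'clipper' THEN 'Yes'  -- assumed mapping
    ELSE NULL
  END AS bike_share_for_all_trip
FROM baywheels_tripdata_staging;

-- staging table no longer needed
DROP TABLE baywheels_tripdata_staging;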

Using a case statement, we can re-code the values from the `rental_access_method` column to ‘Yes’ or ‘No’, as well as conditionally use those values only when `bike_share_for_all_trip` is NULL. Within that same operation, we can use `INSERT INTO` to insert that data into the main `baywheels_tripdata` table, and then for database cleanliness we can drop the staging table.

ELT: One More Tool in the Data Science Toolbox

In this post I’ve demonstrated that you can do all the data pre-processing for a publicly-available dataset without leaving the OmniSci platform. With the high-performance, columnar nature of OmniSci, complemented by modern GPU hardware, you can remove the need to build elaborate data processing pipelines before analyzing data. Instead, an ELT workflow can help your business be more agile in its data science explorations.

Questions or comments? Stop by the OmniSci Community Forum to let us know what you think!


Randy Zwitch

Randy Zwitch is a Senior Director of Community at HEAVY.AI, enabling customers and community users alike to utilize HEAVY.AI to its fullest potential. With broad industry experience in Energy, Digital Analytics, Banking, Telecommunications and Media, Randy brings a wealth of knowledge across verticals as well as an in-depth knowledge of open-source tools for analytics.