
Choosing the Right Tools for Data Ingestion, Part 2

This is the second article in our series on how we chose a data ingestion tool for building WyzProfile, our powerful, user-friendly business intelligence tool for direct marketing organizations. The first article covered Amazon Redshift, Databricks, and Talend. This article applies the same criteria to three more tools and reports on our experience with each.

In our previous article, we compared several data ingestion tools, looking for one that could handle big datasets in real time for on-the-fly analysis of direct marketing data.

As mentioned, we chose Amazon Redshift for its real-time processing, its easy integration with AWS, and its relative cost-effectiveness. We are aware of its limitations, however, which include an inability to detect duplicates or pre-existing data structures. For this reason, we will share our experience with a few other data ingestion tools and discuss their strengths and weaknesses.

These articles are not designed to serve as full product reviews; they are reports of our direct experience evaluating ingestion software for our specific purposes.

 

What Criteria We Chose

As discussed in the previous article, our primary question was whether Big Data could be processed and analyzed in real time. Many other factors also came into play: the ability to handle multiple file types, whether a tool could detect correct data types without custom programming, whether it could detect duplicates, and finally cost. We found that while some tools handled all of this functionality extremely well, their expense was prohibitive for our purposes.

 

Tools Examined

  • Matillion
  • Unifi
  • Dataiku

Matillion

Matillion is a powerful ETL tool designed for working with big datasets in Amazon Redshift, BigQuery, and Snowflake. Its user-friendly interface lets users drag and drop nodes onto a workspace, which enables people who are not data scientists to create workable data pipelines. This makes it very suitable for business users.

What is particularly appealing about Matillion is that it is platform agnostic. Redshift, for instance, works best in the AWS environment simply because it is an Amazon product. Matillion has no such restriction and can work with Google's BigQuery, Snowflake, and others without special customization.

 

Matillion provides an easy-to-use interface. It uses a workspace (similar to KNIME, but at a more enterprise level) onto which you drag and drop nodes to specify and diagram data ingestion flows. Below is an example of a simple data pipeline:

Real Time Processing

Matillion can perform real-time ETL processing. The drawback is that doing so requires a considerable amount of processing power and time, and because Matillion charges by usage hours, the cost grows rapidly. Heavy real-time data processing, which in our case was essentially the product itself, could become exceedingly expensive, particularly since this would be on top of any expenses already accrued using Redshift or another big data engine.
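To make that cost dynamic concrete, here is a back-of-the-envelope sketch in Python. The $2.84/hr rate is taken from the comparison chart in the appendix; the daily usage figures are hypothetical, chosen only to contrast a batch workload with a near-real-time one:

```python
def monthly_etl_cost(hourly_rate: float, hours_per_day: float, days: int = 30) -> float:
    """Estimate a usage-based ETL bill: a flat rate charged per active hour."""
    return hourly_rate * hours_per_day * days

# A nightly batch job (2 hrs/day) vs. near-real-time processing (20 hrs/day)
batch_cost = monthly_etl_cost(2.84, 2)      # ~$170/month
realtime_cost = monthly_etl_cost(2.84, 20)  # ~$1,700/month, before Redshift costs
```

The point is not the exact figures but the shape: under hourly billing, near-real-time ingestion multiplies cost roughly tenfold compared with a nightly batch run.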

 

File Formats

Matillion works well with a number of file formats. It can import CSV and JSON files, and it has good connectors for reading directly from CRMs.
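As an illustration of what handling multiple file formats means in practice (this is a generic sketch of our own, not Matillion's internals), an ingestion step typically normalizes each format into one common record shape:

```python
import csv
import io
import json

def load_records(payload: str, fmt: str) -> list[dict]:
    """Parse CSV or JSON text into a uniform list-of-dicts shape."""
    if fmt == "csv":
        return [dict(row) for row in csv.DictReader(io.StringIO(payload))]
    if fmt == "json":
        data = json.loads(payload)
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported format: {fmt}")

rows = load_records("name,zip\nAda,10001", "csv")
# rows == [{"name": "Ada", "zip": "10001"}]
```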

 

Data Type Detection

Matillion cannot detect specific data types (email addresses, etc.) without a custom component programmed for that purpose.
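To give a sense of what such a custom component would have to do (this is a minimal sketch of our own, not Matillion code), detection usually comes down to pattern matching on values:

```python
import re

# Simplified patterns for illustration; production detectors need stricter rules
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
US_ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def detect_type(value: str) -> str:
    """Classify one value into a coarse semantic type via regex patterns."""
    if EMAIL_RE.match(value):
        return "email"
    if US_ZIP_RE.match(value):
        return "zip"
    return "text"

detect_type("jane@example.com")  # "email"
detect_type("10001")             # "zip"
```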

 

Documentation/APIs

Matillion provides a robust, well-documented REST API with many helpful examples. The product also has a strong community, and helpful support was easy to find.

 

Custom Coding Needed

For most purposes (primarily serving as an ETL pipeline), Matillion does not need custom coding; its basic functionality takes a largely code-free approach. However, higher-level analysis or data type detection, as mentioned above, would require custom development. That said, most functions can be performed without resorting to specialized development.

 

Duplicate File Checking

In our trials, we found no built-in functionality to prevent loading duplicates.

 

Trial Version

Matillion provides a trial for the basic-level version.

 

Real Time Reporting

Matillion provides the ability to extract basic data quality reports.   

 

Summary

We found that Matillion may be ideal for companies that lack the technical expertise or developers to work directly with Redshift. It complements such tools well, performing functions they cannot handle themselves (such as reading data directly from a CRM). In our case we did not choose it, for two main reasons: a) its cost would be added to what we were already paying, and b) we already had the in-house expertise to handle the complex functions we needed from Redshift.

Unifi

Unifi presents itself as a top-tier data ingestion tool, and it boasted the most features of any application we examined. It met almost every criterion on our checklist, including some features not available in the other tools: data recognition, identification of existing data structures, and identification of potentially duplicate data.

 

Real Time Processing

Unifi provides strong real-time processing and allows integration with multiple data sources simultaneously. Below is a screenshot of its dataset explorer tool, demonstrating how easily one can bring in large amounts of data.

File Formats

Unifi provides support for all needed file formats, including CSV, ZIP, and JSON formats.  It also provides connectors for bringing in data from CRM systems.

 

Data Type Detection

One feature Unifi provides that most others do not (at least without custom programming) is automatic data type detection, including recognition of email addresses, gender, and more.

Documentation/APIs

Unifi does provide robust APIs for integrating data into further analysis packages. Documentation, however, was a weak spot, at least while we were trying to set up a trial. As an enterprise-level solution, its business model appears focused on large corporations willing to provide full funding up front, not on smaller organizations still in an evaluation period. Support was unfortunately weak; the trial took four months to set up.

   

Custom Coding Needed

Unifi is a robust, fully developed tool. Most features one might otherwise write custom code for have already been built, so developer preparation is minimal for many of the functions we wanted to use. In most cases, the functionality already exists.

 

Duplicate File Checking

Another feature Unifi provides that the others do not is duplicate file checking: it automatically identifies datasets that appear similar to ones already created. This could be a huge benefit in preventing false positives down the line if data is accidentally loaded twice.
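Unifi does not publish how its similarity check works, but the idea can be sketched with a simple column-overlap (Jaccard) score between an incoming dataset's schema and existing ones. This is our own illustration, not Unifi's algorithm:

```python
def schema_similarity(cols_a: set[str], cols_b: set[str]) -> float:
    """Jaccard overlap of column names: a crude proxy for 'same dataset'."""
    if not cols_a and not cols_b:
        return 1.0
    return len(cols_a & cols_b) / len(cols_a | cols_b)

existing = {"email", "first_name", "last_name", "zip"}
incoming = {"email", "first_name", "last_name", "zip", "gender"}
schema_similarity(existing, incoming)  # 0.8: likely the same dataset plus one column
```

A score near 1.0 would flag the upload for review before it is loaded a second time.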

 

Below is a “knowledge graph” demonstrating information about data that has been loaded into the system, showing all existing metadata and making it easy to spot possible problems.

Trial Version

Unifi does not, as a matter of course, provide a free trial. We were able to get a trial of the Unifi Data Catalog, their smaller product, which is more of a BI tool (mostly used for providing data descriptions; it is not really an ETL tool). This gave us a sense of the functionality the platform provides, but we were unable to do any real load testing with real-time big datasets.

 

Real Time Reporting

Unifi provides a few useful real time reports on your data as it is being ingested:

It also provides real time logs tracking each activity as it occurs:

Summary

Unifi is a powerful data ingestion product. It provides almost every feature we were seeking in this sort of tool (save the free trial, which was somewhat frustrating). Its real drawback, however, is cost. Unifi charges per user, which made using it under the hood of another product (as Wyzoo is doing) infeasible: it has no licensing structure for that scenario, since identifying the users of a SaaS application is difficult, and the expense would increase with each additional subscriber. Unifi's business model unfortunately placed it out of the running as a tool for Wyzoo.

Dataiku

Another powerful data ingestion tool that we examined was Dataiku. Like Matillion, it can create workflow pipelines using an easy drag-and-drop interface.

 

This is handled by creating a series of "recipes" that follow the standard flow we saw in many other ETL tools, but specifically for the ingestion process.

Real Time Processing

Dataiku showed that it was capable of processing big data in real time.

 

File Formats

Dataiku proved that it could handle multiple datatypes without trouble.  It was able to handle ZIP, JSON, and CSV files, and integrated CRM data well.

 

Dataiku provides some good code integration tools, built directly into the interface.  It supports development in many different programming languages, making it relatively easy for experts to make modifications.

Data Type Detection

Dataiku could provide some basic information about datasets. It has the field detection mechanisms needed for matching (email, address, name, surname, zip), as we can see below:

In our trials, however, we found that data recognition was weak. Dataiku's functionality here is mostly incorporated into plugins; we tried a few, but none were as good as the internal detection we were able to develop on our own for Wyzoo.

 

Documentation/APIs

Dataiku provides several good APIs for accessing the data from other applications. Good documentation exists, and support is excellent. However, the product itself is quite complex; to truly get the most out of Dataiku, one needs professional data science or technology skills. Support for a complex product can only be as good as the skill set of the user.

 

Custom Coding Needed

As mentioned, Dataiku is a complex product designed for data science professionals. While much of the application runs without coding, it is designed as a tool to be modified to specific needs. So while custom coding may not be strictly needed, it is recommended to really get the most out of the tool.

 

Duplicate File Checking

Dataiku does not have the ability to check for already-uploaded datasets; we were able to load duplicates without any warnings, which could lead to trouble down the road.

 

Trial Version

Dataiku provided a useful trial version for evaluating the functionality of the application, but the trial was not designed for Big Data processing, so actual performance could not be determined.

 

Real Time Reporting

One of Dataiku’s strengths is its robust reporting.   They provide real-time data logging, and some excellent visual tools.   For instance, here is an image of a data visualization dashboard:

 

One can also create one’s own custom reporting, as we can see below:

Summary

Overall, Dataiku provides a considerably high level of quality. The tool is not, however, designed specifically for business users: getting the most out of its functionality, making modifications, and using the plugins requires a level of expertise typically found among technology professionals.

Conclusion

Each of the tools examined has real strengths and could be an ideal solution in circumstances different from ours. Though they provide similar core functionality, they vary widely in audience. Matillion, for instance, may be ideal for business users who lack deep technical human capital but are willing to pay for software that can accomplish many of the functions a data professional might perform.

At the other end of the spectrum, Dataiku is designed primarily for data professionals. It provides strong flexibility, but its complexity may make it difficult to use for those without technical strength.

Finally, Unifi provides a high-quality product that can accomplish nearly every function one would wish for in the data ingestion process, from high-speed real-time processing to identification of data types and duplicates. Its drawbacks lie specifically in organizational compatibility and price: unless one has considerable capital to invest, access to Unifi is prohibitive.

 

Please stay tuned for our next article in this series!

Appendix: Comparison Chart

 

| Main requirements | Matillion | Unifi | Dataiku |
|---|---|---|---|
| Trial version | Yes | No | Yes |
| Streaming ETL | Yes, but with cost | Yes | Yes |
| Real-time reporting | Yes | Yes | Yes |
| Integrate ZIP files | Yes | Yes | Yes |
| Integrate CSV, JSON files | Yes | Yes | Yes |
| Integrate with CRM system | Yes | Yes | With coding |
| Can detect email, address, name, surname, zip | No | Yes | No |
| Detect other column sources (gender, etc.) | No | Yes | No |
| Data quality report (# columns, types, distribution) | Yes | Yes | Yes |
| Documented integration API | Yes | Yes | Yes |
| Minimal coding | Yes | Yes | Yes |
| Check previously uploaded files | No | Yes | No |
| Check for existing similar tables | No | Yes | No |
| Pricing model | $2.840/hr | $4,500 Data Catalog / $50,000 Full Platform | N/R |
| Approx. price per year (min. 5 connections) | $20K | $50K | $40K |

Resources

Other Articles

ETL Tool Comparison: KNIME vs Alteryx

How We Chose a Tool for Data Ingestion