At Wyzoo, we have worked hard to create a state-of-the-art, user-friendly tool which combines artificial intelligence and human intelligence to create better market intelligence for consumer marketers.
Creating this tool has involved a great deal of research and analysis to identify the best available resources for handling existing data, transforming it, and mining it for real information that you can use for your business.
As we began the process, we spent quite a bit of time analyzing and evaluating various existing products and tools which could help us create our powerful engine.
This is the first in a series of articles discussing this process. In this article we will show how we evaluated some of the alternatives for data ingestion: how we extract data for analysis and convert it into useful information.
Why Data Ingestion Tools?
Traditionally, data scientists would work individually with datasets. We’d program our own models, do our own analysis, and come up with some sophisticated analyses that would often help our clients better target messaging to their customers and prospects.
Clients typically come to us with data in a wide range of formats, shapes, and sizes. Often, we are addressing a direct marketer's need to take an existing list of clients or potential customers and segment it for improved campaign performance.
While in some cases, we can work with a seemingly small dataset (10,000 or so records), in many cases we process client data and enrich it with much larger and more valuable datasets which can have billions of records of transactions, user behavior, location, and more.
Handling this data manually, while possible, is unwieldy. Modern data ingestion tools have made the process of handling vast, disparate datasets much easier. Beyond the ease and speed of combining large amounts of data, functionality now exists to spot patterns and segment datasets so as to extract the highest-quality information. We can increase the signal-to-noise ratio considerably simply by using data ingestion, or "ETL" (Extract, Transform, and Load), tools.
What criteria we chose
ETL tools extract data from multiple locations and file types, prepare and transform it into a usable format, identify important variables and factors, and load it into cloud-based databases, enabling not only further analysis but also easy comparison against new data as it comes in.
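The three stages above can be sketched in a few lines. This is a toy illustration only, with invented data and an in-memory list standing in for a cloud database table; a real pipeline would read from files or APIs and load via bulk-insert statements.

```python
import csv
import io

# Toy ETL pass: extract rows from a CSV source, transform them
# (normalize emails, cast numbers), and "load" them into a list
# standing in for a database table. All data here is made up.
raw = "name,email,spend\nAna,ANA@EXAMPLE.COM,120\nBo,bo@example.com,85\n"

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    for row in rows:
        row["email"] = row["email"].strip().lower()
        row["spend"] = int(row["spend"])
    return rows

def load(rows, table):
    table.extend(rows)  # a real loader would issue INSERT/COPY statements
    return len(rows)

table = []
loaded = load(transform(extract(raw)), table)
print(loaded, table[0]["email"])  # → 2 ana@example.com
```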
There are many ETL tools available on the market; however, we had a specific set of criteria for choosing them when it came to the creation of our free WyzProfile buyer persona reporting system. Our goal was to build what we call a "frictionless" user experience. This requires real-time processing of very large data files, and it requires a user experience that can auto-detect data fields and doesn't require traditional steps such as identifying a record layout.
Our number one question was, “can this tool provide analysis on Big Data in real time?”
We examined each tool's ability to handle real-time processing, the file formats supported, data-type detection, API documentation, whether custom coding was needed or warranted, whether it had built-in checking for duplicate data, whether it checked existing table structure, whether it offered a free/trial version, and its real-time reporting capabilities. Expense, as always, was also a factor.
For the purposes of this evaluation, we examined the following three tools: Amazon Redshift, Databricks, and Talend.
While the first two are obvious choices for large-scale data ingestion, we also chose to include Talend in this analysis due to its relative ease of use. We will go over each of these tools one at a time and examine how each met the criteria we specified.
Using AWS Redshift was straightforward. All queries of data used a PostgreSQL style of syntax, so understanding basic SQL made it possible to easily grab and transfer data without a lot of specialized knowledge.
We found that Redshift works fast with large data sets, and like many of Amazon’s products, integrates well with AWS hosted files.
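As an example of how simple this can be, the snippet below builds the PostgreSQL-style statement one would send to Redshift to bulk-load a CSV file from S3 using the COPY command. The table, bucket, and IAM role names are placeholders, and the code only constructs the SQL string; actually executing it would require a database client such as psycopg2 and real credentials.

```python
# Sketch of a Redshift bulk-load statement. All identifiers below
# (table, S3 path, IAM role) are hypothetical placeholders.
table = "prospects"
s3_path = "s3://example-bucket/prospects.csv"
iam_role = "arn:aws:iam::123456789012:role/example-redshift-role"

copy_sql = (
    f"COPY {table} FROM '{s3_path}' "
    f"IAM_ROLE '{iam_role}' "
    "FORMAT AS CSV IGNOREHEADER 1;"
)
print(copy_sql)
```

Anyone comfortable with basic SQL can read and adapt a statement like this, which is much of what made Redshift feel approachable to us.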
One of the characteristics of how Redshift handles data is that, unlike the traditional row-based organizational structure, it uses columns. While this approach may not be as strong for transactional querying, for the purpose of data ingestion it resulted in extremely fast and efficient queries. As a result, column-based storage is particularly well-suited to data analytics, particularly analytics that must be run in real time against large datasets.
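The row-versus-column distinction can be illustrated with a toy model in plain Python. Summing one field over a column-oriented layout touches only that field's contiguous list, while a row-oriented layout forces a walk over every full record; real columnar engines amplify this with compression and vectorized scans. The data below is synthetic.

```python
# Toy comparison of row-oriented vs column-oriented layouts for one
# analytic query: total spend. Synthetic data, illustration only.
rows = [{"id": i, "spend": i % 100, "state": "TX"} for i in range(10_000)]

# Column-oriented layout: one list per field.
cols = {
    "id": [r["id"] for r in rows],
    "spend": [r["spend"] for r in rows],
    "state": [r["state"] for r in rows],
}

row_total = sum(r["spend"] for r in rows)  # must visit every whole record
col_total = sum(cols["spend"])             # scans one contiguous list
print(row_total == col_total)  # → True
```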
Real Time Processing
The ability to run in real time varied somewhat, depending on the size of the dataset being imported. One issue with AWS is that its pricing structure is based directly on processing time. Everything is scaled, so with relatively small datasets this does not become much of an issue; however, real-time processing of massive amounts of data can be taxing on any system. As a result, the overall cost of handling big data in Redshift can skyrocket quickly unless one keeps a handle on the data flow.
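A quick back-of-the-envelope model shows why cost scales this way: billing is per node-hour while a cluster is up, so both cluster size and run time multiply into the bill. The hourly rate below is a made-up placeholder, not a quoted AWS price.

```python
# Rough cost model for a cluster billed by node up-time.
# HOURLY_RATE is hypothetical; real rates depend on node type and region.
HOURLY_RATE = 0.25  # $/node-hour (placeholder)

def monthly_cost(nodes, hours_per_day, days=30):
    return nodes * hours_per_day * days * HOURLY_RATE

small = monthly_cost(nodes=2, hours_per_day=4)   # modest batch workload
big = monthly_cost(nodes=16, hours_per_day=24)   # always-on real-time cluster
print(f"${small:.2f} vs ${big:.2f}")  # → $60.00 vs $2880.00
```

The same workload run continuously on a large cluster costs an order of magnitude more than a small scheduled batch, which is exactly the dynamic we saw with real-time big data.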
Redshift handles data in CSV and JSON file formats, which is fine for most situations. One drawback is that Redshift cannot handle data from CRM systems directly. To work with these sources, data needs to be extracted and converted before use, which can impact Redshift's ability to handle this information in real time.
Data type detection
One feature we hoped to find was built-in machine-learning analytics for detecting data types. While most systems can identify basic data types such as strings, integers, and dates, it would be more helpful, particularly when working with marketing datasets, to identify the content of strings, such as email and mailing addresses, or to classify names as first names and last names.
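A minimal sketch of the kind of content detection we were hoping to find built in: classify a string column beyond "string" by pattern-matching sampled values. The patterns below are deliberately crude (real marketing data needs far more robust rules, and true first/last-name classification needs a learned model rather than a regex).

```python
import re

# Naive content-type detection for string columns. Patterns are
# simplified illustrations, not production-grade validators.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def detect_content_type(values):
    """Return the first label whose pattern matches every sampled value."""
    for label, pattern in PATTERNS.items():
        if all(pattern.match(v) for v in values):
            return label
    return "string"

print(detect_content_type(["ana@example.com", "bo@example.org"]))  # → email
print(detect_content_type(["78701", "10001-0001"]))                # → us_zip
```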
We found that, like many of Amazon’s products, Redshift provided extensive documentation. However, as Redshift uses a fairly common SQL syntax for querying data sources, we found it to be mostly unnecessary to spend much time reading the manuals. Typically, we were able to operate without much trouble by relying on our existing knowledge of this language.
Custom Coding Needed
Redshift contains no AI component, so data and field content analysis needed to be handled via external code. As a result, we built our own ML tools for this purpose.
Duplicate File Checking
Working with large amounts of data requires consistency and non-redundancy of information, especially when working with real-time data from multiple sources, which may overlap. As data leakage and false positives can be a real problem in predictive analytics, it becomes crucial to identify and eliminate any obviously redundant data.
Amazon Redshift unfortunately does not provide this as a built-in feature.
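Since the check is left to the user, one pragmatic workaround is to fingerprint each incoming file's bytes and skip any fingerprint already seen. This is an illustrative sketch of the approach, not part of Redshift itself; the seen-set would need to live in persistent storage in practice, and byte-level hashing only catches exact duplicates, not re-exports with trivial differences.

```python
import hashlib

# Catch exact-duplicate input files by content hash. The file
# payloads below are invented, and `seen` would normally be a
# persistent store (e.g. a metadata table), not an in-memory set.
seen = set()

def is_duplicate(payload: bytes) -> bool:
    digest = hashlib.sha256(payload).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False

print(is_duplicate(b"id,spend\n1,120\n"))  # → False (first sight)
print(is_duplicate(b"id,spend\n1,120\n"))  # → True  (same bytes again)
```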
Redshift provides a free trial version. Given Amazon's pricing structure, one can also run relatively small trials without exorbitant cost; expense is tied directly to the amount of processing used, so one can get a fairly good idea of what operating on larger datasets would cost (though mileage may vary depending on whether sudden bursts of processing power are needed to handle large groupings of real-time data).
Real Time Reporting
For marketing campaigns, understanding exactly what is happening at a given time can be valuable. Events can occur which can immediately impact a campaign, such as a weather event or social activity (e.g. a regional team wins a championship, resulting in a sudden increase in demand for commemorative t-shirts or memorabilia). For marketing companies to be able to move quickly and take advantage of real information, real-time reporting can be a key factor in being able to make quick decisions.
Redshift does provide real-time reporting.
Overall, while Redshift does not provide every necessary feature available, its ability to handle real time data, both during the processing and reporting phases, made it a strong contender as a data ingestion tool. Ease of use (using a standard SQL syntax) also served as a strong selling point. Where it does not perform some important functions (such as machine learning), it is extensible, making it possible to integrate with other external resources.
The second product we examined for the purpose of data ingestion is Databricks. Unlike Amazon Redshift, which uses a SQL type format, Databricks uses Apache Spark.
The advantage of this is that Spark is faster than SQL engines on both structured and unstructured data. Since a significant percentage of big data collections are unstructured by nature, real-time analysis may require querying JSON-style data, which is often loosely structured; SQL requires structured data to work well.
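To see why loosely structured data is awkward for plain SQL, consider the two JSON events below: they do not share a schema, so before a SQL engine can touch them they must be flattened into uniform columns. The events are invented examples, and this plain-Python flattener only stands in for what engines like Spark handle natively at scale.

```python
import json

# Semi-structured JSON events with inconsistent, nested fields.
events = [
    '{"user": "ana", "clicks": 3}',
    '{"user": "bo", "clicks": 1, "geo": {"state": "TX"}}',
]

def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names."""
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out

rows = [flatten(json.loads(e)) for e in events]
columns = sorted({k for r in rows for k in r})
print(columns)  # → ['clicks', 'geo.state', 'user']
```

Each record yields a different column set, so a tabular view must union the columns and leave gaps (nulls) where fields are missing, which is the mismatch with rigidly structured SQL tables.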
Databricks is based on an open source framework and was developed by the same people who created Apache Spark. They have, however, built a large customized scalable infrastructure to allow a large range of custom methods for handling data.
Databricks does provide real-time streaming processing; however, it requires a large initial commitment of coding time. As a result, setting it up for this purpose comes at a high cost.
Databricks allows importing multiple data formats, can ingest S3 or DBFS files, and can upload ordinary CSV files. However, when we tried uploading some basic data ourselves, we immediately found that it cannot read ZIP files; the data came in garbled. This can be problematic when dealing with large datasets that do not already exist in the cloud in a native format.
Data Type Detection
Databricks allows previewing table structures upon import and enables managing of data types. It does not have any built-in AI/ML like functionality for special text structures, such as email addresses.
Databricks provides detailed, well-documented APIs.
Custom Coding Needed
Because Databricks does not sit on the AWS infrastructure, it requires a considerable amount of custom coding to work within that environment. Similar coding is presumably necessary for the Azure and Google clouds.
Duplicate File Checking
While Databricks will prevent creating tables with the same names, it cannot detect whether a file has been previously uploaded, so ensuring that duplicate data has not been incorporated into the final dataset is the responsibility of the end user.
As mentioned earlier, Databricks is based on an open-source framework, and Spark can be downloaded for free for local hosting. More advanced functionality, however, comes at a cost; given our needs, that cost would rise considerably.
The Databricks infrastructure provides superior real-time processing power to Redshift; however, due to the need for initial investment in infrastructure and customized coding, it is more expensive. One would need to start with a $30K annual investment for a minimum project. It does, however, include a high level of support, which AWS does not.
Talend is a popular suite of tools. They aim to provide a full top-to-bottom data analysis framework. Part of this platform includes an ETL tool. For our purposes, we examined the data ingestion, or “extraction” segment of its ETL functionality.
Unlike Redshift or Databricks, which do not provide user-friendly GUIs for non-developers, Talend offers an easy-to-use interface. Most functionality is handled by dragging and dropping nodes or modules into a workspace. One does not need to be a computer science or advanced data science professional to use Talend.
Unfortunately, for enterprise-level operations, at least in the trial version we experimented with, Talend did not work with big data, which is highly problematic given the nature of the information we wish to process for large direct marketing organizations. More functionality appears to be available for purchase, but we could not verify this with the evaluation copy.
Talend, at least at the basic levels, does not have the power to handle real-time data processing. While it is technically possible, it requires a level of expertise that runs counter to the user base Talend appears to target. Non-technical professionals will find dealing with real-time data in Talend unwieldy.
Talend offers strong compatibility with multiple file types, works with virtually any data source, and makes importing data straightforward.
Data Type Detection
While ingesting data, Talend has a component that can loosely identify certain field structures, such as email addresses. However, Talend has no AI component, so it cannot identify problems with previously uploaded files or file structures; as a result, it requires a great deal of tedious analysis and data cleaning.
One of Talend’s greatest strengths is an excellent trove of clearly written documentation, including many tutorials and case studies. There is no need to be a professional coder to use many of its features, and a decent community appears able to assist in many situations.
Custom Coding Needed
While coding is not necessary in many cases, handling large datasets, particularly in real time, requires an extensive amount of coding, as Talend is not fundamentally designed for this purpose.
Duplicate File Checking
Talend does not have duplicate file checking.
Talend does have a free trial version of its data extraction modules. Its “big data” tools require a subscription to their cloud.
Real Time Reporting
As for whether Talend provides real-time reporting, this is questionable. While the documentation claims that it does, our experience was that it would not work well without purchasing the $100K version.
Overall, we found that Talend can be good for small operations with limited-size datasets. It does not, however, perform well for larger operations with real time data, without investing a considerable amount of time and money.
Given these three tools, we eventually chose to work with Amazon Redshift. The reasons behind this were that it worked well with the AWS platform, as well as with other services. Its ability to handle real-time processing in an easy-to-use structure made it a strong choice. While Amazon’s pricing structure can quickly get expensive, Redshift turned out to be much more cost-efficient than the alternatives we examined, and it was far easier to estimate how much it would cost.
In comparison, Databricks had some extremely fast processing power, but it was considerably more expensive than Redshift and required more custom coding. Talend, while an easy-to-use tool, simply was not powerful enough without adopting the most expensive plan.
Features that we wished to have that were not part of the data ingestion process could be handled with either external software packages or custom development. We will cover some of this in future articles in this series.
Appendix: Comparison Chart
| Feature | Amazon Redshift | Databricks | Talend |
| --- | --- | --- | --- |
| Integrate ZIP files | Yes | No | Yes |
| Integrate CSV, JSON files | Yes | Yes | Yes |
| Integrate with CRM system | No | No | Yes |
| Can detect email, mailing address, name, surname, ZIP | No | No | No |
| Detect other column sources (gender, etc.) | No | No | No |
| Data quality report (# columns, types, distribution) | With code | With code | No |
| Documented integration API | Yes | Yes | Yes |
| Check previously uploaded files | No | No | No |
| Check for existing similar tables | No | No | No |
| Pricing model | Cluster up-time | From $30,000 | 1st year: $88,190; 2nd year: $78,000; 3rd year: $102,000 |
| Approx. price per year (min. 5 connections) | $6,400 | $30-40K | $90K |