HPCC Systems

A review of the free HPCC Systems open source data analytics platform for use in marketing data operations.

HPCC Systems is a powerful enterprise-level big data analytics platform developed by LexisNexis. HPCC, or “High Performance Computing Cluster,” was released as open source and is available as a reusable framework that gives data scientists a way to handle enormous datasets by combining the computing power of every machine running as a node in the cluster. Because it is built from nodes, the platform is highly scalable: a small test environment can be created first and then easily moved into production in the cloud.

HPCC Systems is divided into two main cluster types. “Thor” is designed for handling big data workflows, including ETL processes, indexing, data cleansing, and more. “Roxie” is the name of its data delivery engine, which is designed to handle high volumes of online data processing and data warehousing functions. Small test environments can be scaled up easily, as each node handles both server and agent processes.

Using HPCC Systems is a bit different from using the other analytics tools reviewed on this site. It contains no GUI component. Instead, every step of the ETL process is handled by writing small queries in ECL (Enterprise Control Language), an internally developed query/ETL language designed specifically for processing big data.

Features

HPCC Systems includes an ETL process and a set of data management and analytics features, including data profiling, data cleansing, snapshot data updates, and a scheduling component. The Roxie cluster contains a powerful query engine with built-in search, enabling extremely fast processing of billions of records.

HPCC Systems also boasts a wide range of predictive modeling tools, including linear regression, logistic regression, decision trees, and random forests.

While this system is quite large, we’ll show you a few basic aspects that may be of use to database marketers.

ECL

ECL is a very high-level language; it mostly consists of a series of definitions that describe the results you want from your data. It is non-procedural. There are no assignment statements and no variables. Each definition is defined exactly once and cannot be redefined. You do not write the executable code yourself; the compiler generates it from your definitions. An ECL program is essentially just a description of the result you are asking for.

Example definition:

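// Boolean definition: true for records where Age is 65 or over
IsSeniorCitizen := People.Age >= 65;
// Average age over only the records that satisfy the definition above
SeniorAvgAge := AVE(People(IsSeniorCitizen), Age);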

ECL Watch Console

Getting it running requires installing a virtual machine (we used Oracle’s VirtualBox, as recommended) to set up a node. The node is then accessed through a web browser, where one can manage all aspects of the system, including loading and managing data sources, publishing services, event scheduling, and more.

Data sets can be loaded and “sprayed” into a node for processing. HPCC also includes the ECL IDE (Integrated Development Environment), which is where one creates workflows for managing data.

A simple workflow with an inline data set might look something like this:
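As a minimal sketch of such a workflow: the record layout, field names, and sample values below are hypothetical and chosen only for illustration, not taken from the review's own example.

```ecl
// Hypothetical record layout; the field names are assumptions for illustration
PersonRec := RECORD
  STRING20  FirstName;
  STRING20  LastName;
  UNSIGNED1 Age;
  STRING5   Zip;
END;

// Inline data set: each {...} group supplies one record, in field order
People := DATASET([{'Ann',  'Lee',   70, '33487'},
                   {'Bob',  'Ray',   42, '10001'},
                   {'Cara', 'Stone', 67, '33487'}], PersonRec);

// Actions: return every record, then the same records sorted by last name
OUTPUT(People, NAMED('AllPeople'));
OUTPUT(SORT(People, LastName), NAMED('SortedByLastName'));
```

Everything above the two OUTPUT lines is a definition; only the OUTPUT actions cause any work to be done when the query is submitted.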

Data can be displayed in a graphical format as well:

You’ll note that all of the data processing is handled in a code window. All data preparation and processing is handled this way. For example, converting all names in a data file into upper case would look something like this:
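One way to express this is a PROJECT with a TRANSFORM, using the standard library's Std.Str.ToUpperCase. This is a sketch under assumptions: the PersonRec layout, the ToUpper transform name, and the inline sample data are hypothetical, and a real job would read a sprayed file rather than an inline data set.

```ecl
IMPORT Std;  // standard library; provides Std.Str.ToUpperCase

// Hypothetical layout and sample data, for illustration only
PersonRec := RECORD
  STRING20 FirstName;
  STRING20 LastName;
END;

People := DATASET([{'Ann', 'Lee'}, {'Bob', 'Ray'}], PersonRec);

// TRANSFORM describes how one output record is built from one input record
PersonRec ToUpper(PersonRec L) := TRANSFORM
  SELF.FirstName := Std.Str.ToUpperCase(L.FirstName);
  SELF.LastName  := Std.Str.ToUpperCase(L.LastName);
END;

// PROJECT applies the transform to every record in the data set
UpperPeople := PROJECT(People, ToUpper(LEFT));
OUTPUT(UpperPeople);
```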

The output of this code is shown below:

Querying Data

Getting information from a data file is also relatively straightforward. Here is an example of querying data with a specific zip code, and the results:
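A query of this kind can be sketched as a record filter. The layout, the sample records, and the zip code value are hypothetical.

```ecl
// Hypothetical layout and sample data, for illustration only
PersonRec := RECORD
  STRING20 LastName;
  STRING5  Zip;
END;

People := DATASET([{'Lee',   '33487'},
                   {'Ray',   '10001'},
                   {'Stone', '33487'}], PersonRec);

// A filter in parentheses keeps only records for which the expression is true
InZip := People(Zip = '33487');

OUTPUT(InZip, NAMED('MatchingRecords'));
OUTPUT(COUNT(InZip), NAMED('MatchCount'));
```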

Web Service Forms

In many cases you will not want to write out a query in code each time, so HPCC lets you create basic web services for querying this information. The query can be changed to take an input, like so:
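A minimal sketch of a query that takes an input, using ECL's STORED workflow service. The ZipValue name, the layout, and the sample data are assumptions for illustration.

```ecl
// Hypothetical layout and sample data, for illustration only
PersonRec := RECORD
  STRING20 LastName;
  STRING5  Zip;
END;

People := DATASET([{'Lee',   '33487'},
                   {'Ray',   '10001'},
                   {'Stone', '33487'}], PersonRec);

// STORED gives the definition a named storage slot, so its value can be
// supplied from outside the code; the default here is an empty string
STRING5 ZipValue := '' : STORED('ZipValue');

// The filter now depends on the supplied value instead of a literal
OUTPUT(People(Zip = ZipValue), NAMED('MatchingRecords'));
```

Once a query like this is published, the stored name is what appears as the input field of the web service.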

This will enable the web service to create a form like this:

Submitting the form will then generate readable output.

Wyzoo Star Ratings

Overall functionality useful to a direct marketer: 4.0

HPCC is an impressive and powerful data transformation and analysis tool for massive datasets. While it can work with smaller amounts of data, it scales rapidly; all that is required is spinning up a few more nodes. That said, from the point of view of most direct marketing operations, this might be a bit of overkill; the trade-off to weigh is the time invested versus the output gained. If one has a very large set of data that is expected to grow rapidly, and which needs massive real-time processing, this could be a very useful tool. However, getting it up and running may take some time, so the value depends on the organization.

If you are working with billions of data points, it can be helpful and quite powerful, and its flexibility and scalability make it appealing for organizations that expect to work at this level. For smaller organizations, the learning curve could be prohibitive.

2.0

There is virtually nothing “out of the box” in HPCC Systems. While getting it up and running isn’t much more time-consuming than with other options, you should not expect to be productive with it on day one. Even data science professionals should expect to spend some time familiarizing themselves with the environment and the manuals.

3.0

The official HPCC Systems forums are available directly on the project’s website. While there is a presence, activity seems to be slow (there had been no new posts for the past month at the time of this writing).

The GitHub repository itself appears to be fairly active (there were commits within the past week). There is no noticeable presence on Stack Overflow for either HPCC or ECL, other than a few sparse questions.

HPCC Systems Official Forum: hpccsystems.com

GitHub: github.com/hpcc-systems

  • Commits: 23954
  • Contributors: 42
  • Releases: 517
  • Watch: 44
  • Star: 394
  • Fork: 222
  • Commits/Contributors: 570

2.0

Unless you are familiar with programming, this is not something you can jump into right away. As mentioned, there is no GUI component; each process needs to be defined separately. Just to create components, you need to learn an entirely new programming language.

That said, ECL is a very high-level language; it consists mostly of a series of definitions describing what you want to know about your data. It is declarative rather than procedural: you ask it questions and it gives answers, so it is most certainly something that can be learned. However, if you are not technically inclined, this tool may not suit your needs.

Summary: Key takeaway

HPCC Systems is one of the most powerful data analytics engines available. LexisNexis made its name as a source for finding information about anything and anybody very quickly, and this tool gives other organizations access to the same processing power. If, as a direct marketing organization, you need to process billions of data points regularly and retrieve the information you need quickly, it could be highly effective.

However, the learning curve involved with using HPCC Systems could be a barrier for many companies. Those who do choose to go with this tool will be very impressed by its speed, power, and flexibility. The ECL language itself is intuitive and not difficult to learn, and if you are looking for tools to build your own systems, it could be an excellent choice.

Integrations

  • Spark
  • Pentaho
  • R
  • ECL for VS Code
  • JDBC Driver
  • Java API
  • ODBC Driver
