Where Red Sqirl lies in the Big Data landscape


Today, Big Data is the platform of choice for storing, exploring, visualising and modelling your data.

Big Data has reached this point through a number of distinct generations, each building on the one before it. The first generation simply gave us the ability to analyse petabytes of data with tools like MapReduce. The next generation gave us more responsive tools for analysing data, such as Spark, Impala and PrestoDB. Now a third generation is seeing the emergence of tools for moving data around, such as Kafka and Kudu.

The Data Lake & Real Time Analytics

These tools have changed the way that data is analysed and how data solutions are now built. The emergence of Big Data has brought with it many new concepts, one of them being the Data Lake.

A Data Lake is a massive enterprise-wide data repository to which analysts can contribute, and from which they can cherry-pick the data they need, stored in a format best suited to the data. The Data Lake looks to solve the problem of data silos, eliminating dozens of independently managed data collections and creating one combined data collection. Data Lakes have become essential to Big Data projects due to an increasing demand for data to be accessible and agile.

Another term becoming popular right now is real-time analytics. Essentially, it means triggering an action in real time when an event fulfils a prerequisite. The term is somewhat misleading, though: you can’t actually analyse Big Data in real time, only act on it. Rather than analysing an entire data set, real-time analytics relies on intelligently interacting with parts of the Data Lake in order to perform actions on a user-by-user basis.

Real-time analytics relies heavily on periodic batch analyses to continuously evaluate the impact of new data and to let the action taken for each user evolve over time. Without these batch processes, the analytics would not be running on up-to-date data.
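The split between the batch side and the real-time side can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the event format (user, amount) and the threshold of 100 are all hypothetical and do not come from any particular tool.

```python
from collections import defaultdict


def batch_recompute_scores(events):
    """Periodic batch step: aggregate the full event history into per-user totals."""
    totals = defaultdict(float)
    for user_id, amount in events:
        totals[user_id] += amount
    return dict(totals)


def on_event(user_id, scores, threshold=100.0):
    """Real-time step: look up one user's precomputed score and pick an action.

    No analysis of the whole data set happens here, only a single lookup.
    """
    score = scores.get(user_id, 0.0)
    return "send_offer" if score >= threshold else "no_action"


# Hypothetical event history as (user, amount) pairs.
events = [("alice", 80.0), ("bob", 20.0), ("alice", 40.0)]

# The batch job would run on a schedule; the real-time step reuses its output.
scores = batch_recompute_scores(events)
print(on_event("alice", scores))
print(on_event("bob", scores))
```

The real-time step is only as current as the last batch run, which is why the periodic batch processes matter so much.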

Data Pipelines

The key to any Big Data analytics job, and to an up-to-date Big Data warehouse, is these periodic background processes. Done well, they make it possible to build a huge range of services.

For example, you can perform ad-hoc analyses easily, and you can maintain analytics jobs and upgrade or update them quickly. The method for creating these background processes is known as building data pipelines, and it is an essential part of analysing data using Big Data techniques.

It's for this reason that the most popular Hadoop distributions (Cloudera, Hortonworks and MapR) all include a tool for periodic processing: Apache Oozie.

Apache Oozie is a tool that triggers processes based on time and data availability. It supports any data format and language, and it is fault tolerant. Apache Oozie is, however, very difficult to use, as there is a lot of overhead between implementing a process to run once and running it on a regular basis.
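To make "time and data availability" concrete, here is a minimal Python sketch of the idea. It is not Oozie's API: Oozie is configured through its own XML definitions, and every name below (due_runs, input_ready, runs_to_trigger), the dated folder layout and the _SUCCESS marker file are assumptions chosen for illustration.

```python
import tempfile
from datetime import datetime, timedelta
from pathlib import Path


def due_runs(start, end, frequency):
    """Time condition: yield nominal run times from start (inclusive) to end (exclusive)."""
    current = start
    while current < end:
        yield current
        current += frequency


def input_ready(base_dir, run_time):
    """Data condition: the run's dated folder must contain a _SUCCESS marker file."""
    marker = Path(base_dir) / run_time.strftime("%Y-%m-%d") / "_SUCCESS"
    return marker.exists()


def runs_to_trigger(base_dir, start, end, frequency=timedelta(days=1)):
    """Return the scheduled runs whose input data has already arrived."""
    return [t for t in due_runs(start, end, frequency) if input_ready(base_dir, t)]


with tempfile.TemporaryDirectory() as base_dir:
    # Simulate a landing area where only the first day's data has arrived.
    day_one = Path(base_dir) / "2024-01-01"
    day_one.mkdir()
    (day_one / "_SUCCESS").touch()

    triggered = runs_to_trigger(base_dir, datetime(2024, 1, 1), datetime(2024, 1, 4))

print(triggered)
```

Three daily runs are due in this example, but only the one whose input has landed is triggered. A real scheduler also has to deal with retries, late data and failures, which is part of the overhead described above.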

So we built Red Sqirl

Red Sqirl is a drag-and-drop analytics tool that can also build Oozie workflows in the background. With Red Sqirl you can build, deploy and maintain data pipelines more easily than ever before, using an intuitive drag-and-drop interface.