Data Lake

2018-04-02T12:11:32+02:00

Making sense of data.

A data lake is a central repository that holds an enormous amount of raw or refined data in its native format until it is accessed. The term is usually associated with Big Data: data is loaded into the Big Data system and data science techniques are then applied to it. In this definition the data lake is not too dissimilar to a staging area in a data warehouse. In a staging area we make a copy of the data from the source systems, then transform and integrate this data downstream to populate our data warehouse.

Traditionally, a staging area has only one consumer: the downstream processes that populate the data warehouse. A data lake used as a raw data reservoir, on the other hand, has multiple consumers, including the ETL that populates the data warehouse. Other consumers could be sandboxes for self-service and advanced analytics, enterprise search engines, an MDM hub, etc. One of the benefits of making the raw data available to more consumers is that we don’t hit our source systems multiple times. We can optionally audit and version the raw data by keeping an immutable history of changes. This might be useful for compliance reasons (GDPR, PCI DSS, HIPAA, SEPA, the handling of PII and PHI, and many others).
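The idea of an immutable, versioned history of raw data read by several consumers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RawRecord` and `RawDataStore` names, the in-memory list, and the sample CRM data are all hypothetical, and a real data lake would sit on durable storage rather than in a Python process.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class RawRecord:
    """One stored version of a source record (hypothetical structure).

    frozen=True means the fields cannot be reassigned after creation.
    """
    source: str      # name of the source system
    key: str         # business key within that source
    version: int     # 1 for the first copy, incremented on each change
    payload: Any     # the data exactly as delivered by the source


class RawDataStore:
    """Append-only store: records are added, never updated or deleted."""

    def __init__(self) -> None:
        self._log: list[RawRecord] = []

    def ingest(self, source: str, key: str, payload: Any) -> RawRecord:
        # Pull from the source once; every consumer reads this copy afterwards.
        version = len(self.history(source, key)) + 1
        record = RawRecord(source, key, version, payload)
        self._log.append(record)
        return record

    def history(self, source: str, key: str) -> list[RawRecord]:
        # Full audit trail for one record, oldest version first.
        return [r for r in self._log if r.source == source and r.key == key]

    def latest(self, source: str, key: str) -> RawRecord:
        # The current state, as a warehouse ETL job might read it.
        return self.history(source, key)[-1]


store = RawDataStore()
store.ingest("crm", "customer-1", {"email": "a@example.com"})
store.ingest("crm", "customer-1", {"email": "b@example.com"})

etl_view = store.latest("crm", "customer-1")      # consumer 1: warehouse load
audit_view = store.history("crm", "customer-1")   # consumer 2: compliance audit
```

Both consumers read from the same stored copy, so the source system is queried only at ingestion time, and earlier versions stay available for audits.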

Depending on its requirements, a typical organization will need both a data warehouse and a data lake, as they serve different needs and use cases.

Data lakes can also be used effectively without incorporating big data technologies. The term is increasingly used to describe any large data pool in which the schema and data requirements are not defined until the data is queried.
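Defining the schema only when the data is queried is often called schema-on-read. A minimal sketch using nothing but the Python standard library could look as follows; the sample records, the `read_with_schema` helper, and the field names are assumptions made for illustration only.

```python
import json

# Raw events stored exactly as they arrived (hypothetical sample data).
# Note that the records do not share one fixed set of fields.
raw_lines = [
    '{"customer": "c1", "amount": "19.90", "channel": "web"}',
    '{"customer": "c2", "amount": "5.00"}',
    '{"customer": "c1", "amount": "7.10", "coupon": "SPRING"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at query time: pick fields and cast their types."""
    for line in lines:
        record = json.loads(line)
        # Fields missing from a record become None instead of failing the load.
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# One consumer decides it needs customer and amount, and what types they have.
revenue_schema = {"customer": str, "amount": float}

revenue_per_customer = {}
for row in read_with_schema(raw_lines, revenue_schema):
    revenue_per_customer[row["customer"]] = (
        revenue_per_customer.get(row["customer"], 0.0) + row["amount"]
    )
```

Another consumer could read the same raw lines with a different schema, for example `{"channel": str}`, without any change to how the data was stored.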

Insight-driven companies make more money and develop more sustainable barriers to entry. They accomplish this by experimenting and continuously learning.

Vendors and analysts often combine the concept of the raw data reservoir with the idea of self-service analytics. A classic scenario for self-service is advanced analytics, where questions like “Which of our customers will churn?” come into play. This is a standard problem for predictive analytics. In the past, client tools such as Matlab, SPSS or even Excel were used to find answers. Employees pulled the required data onto their laptops and worked away. This approach is flawed for various reasons: it lacks collaboration features, it puts the security of enterprise data at risk (think of lost laptops without data encryption), and it does not scale beyond small volumes of data. With a sandbox we can pull the required data into a web-based environment and collaboratively go through the lifecycle of building, training, and productionising a predictive model.

The value of a data lake resides in the ability to develop solutions across data of all types – unstructured, semi-structured and structured.

Why Choose Us

Data Lakes are becoming more and more central to enterprise data strategies. Data Lakes address much greater data volumes and varieties, higher expectations from users, and the rapid globalization of economies.

  • Experience in different industries
  • Experts with many years of experience in data-related projects
  • Awesome data projects that we would like to share with you