Dr Mehmet Yildiz

How To Design Data Lakes

2021-01-10

Big Data touches every part of our lives, so knowing its tools and processes has become vital for data and information professionals. Data analytics is one of the primary use cases for data lakes.

Photo by Fabio Rodrigues on Unsplash

In this post, I provide an overview of data lakes, including their business value and design principles.

Data lakes can provide substantial business value propositions. One of the key business value propositions of data lakes is the ability to perform advanced analytics very quickly on data arriving from various real-time sources such as clickstreams, social media, and system logs.

Because data lakes are agile to deploy and easy to configure, they offer compelling business value for organisations aiming at agile and continuous service delivery frameworks for data consumption.

Effective use of data lakes for speedy and timely data processing and consumption can help business stakeholders identify opportunities rapidly, make informed decisions, act on them expeditiously, and bring their products and services to customers faster.

A data lake is a dynamic data store that keeps data for an extended period of time. It can be fed iteratively from multiple data sources as further clean data is discovered and transformed in various parts of the enterprise. For example, a data lake can store relational data from enterprise applications alongside non-relational data from IoT devices, social media streams, and mobile applications and devices.

A data lake can be a single store of transformed enterprise data kept in its native format. These data stores are usually reported on, visualised, and analysed using advanced analytics.
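
To make this concrete, here is a minimal sketch of how raw events from different sources might land in a lake in their native JSON format. It assumes an S3 bucket named enterprise-data-lake and uses the boto3 library; the bucket name, key layout, and events are illustrative assumptions, not a prescription.

```python
# Minimal sketch: land raw events in the lake in their native format,
# partitioned by source and date. Bucket name and key layout are hypothetical.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def land_raw_event(event: dict, source: str) -> None:
    """Write one event, untransformed, under raw/<source>/<year>/<month>/<day>/."""
    now = datetime.now(timezone.utc)
    key = f"raw/{source}/{now:%Y/%m/%d}/{now.timestamp()}.json"
    s3.put_object(
        Bucket="enterprise-data-lake",  # hypothetical bucket
        Key=key,
        Body=json.dumps(event).encode("utf-8"),
    )

# Relational-style and non-relational events can land side by side.
land_raw_event({"order_id": 1001, "amount": 49.90}, "erp")
land_raw_event({"device": "sensor-7", "temp_c": 21.4}, "iot")
land_raw_event({"user": 42, "page": "/pricing"}, "clickstream")
```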

Photo by Philipp Schneidenbach on Unsplash

A data lake can include structured, semi-structured, and unstructured data. Structured data is traditionally well managed, relatively straightforward, and not a major concern in the overall data management process. The challenge lies with semi-structured and, more importantly, unstructured data. I plan to cover these concerns in a different article.
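
As a purely illustrative sketch (the file names are hypothetical), the same lake can hold all three kinds of data side by side, each in its native format:

```python
# Illustrative only: three kinds of data in one lake, each kept as-is.
import json
import pandas as pd

orders = pd.read_csv("raw/erp/orders.csv")            # structured: fixed rows and columns
with open("raw/clickstream/events.json") as f:        # semi-structured: nested, flexible fields
    events = [json.loads(line) for line in f]
with open("raw/support/call-001.wav", "rb") as fb:    # unstructured: audio, no inherent schema
    call_audio = fb.read()
```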

There are several other use cases for data lakes. The primary one is self-service access to a clean data store without the need for technical data professionals. This use case serves data consumers within departments or across the enterprise who need to use data for various purposes without the help of a team of data platform and practice experts.

Real-time analysis of data arriving from various sources is another common use case.

In my recent digital transformation initiatives, auditing requirements for corporate compliance and the centralisation of data were frequently mentioned use cases for data lakes.

Another use case relates to the goal of having a complete view of customer data coming from multiple sources.

Architecting and designing data lakes require upfront planning for data types with substantial input from business stakeholders.

For example, if the purpose of the data is unknown and not clearly stated by the business stakeholders, we may consider keeping the data in raw format so that data professionals can use it in the future when it is needed.

In data lakes, instead of enforcing a schema, we usually store information using unique identifiers and metadata tags. If a schema is required for a data lake, it can only be applied on a read basis, rather than on write.

Schema-on-write is an essential requirement for a data warehouse, which I plan to cover in a different article. For a data lake, in contrast, the schema is created only when the data is read from its sources: a schematic structure is applied to the data at read time. Using a schema-on-read approach allows unstructured data to be stored in the lake.
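
As a minimal sketch of schema-on-read, assuming PySpark and an illustrative path and field list (not the author's implementation), the schema is declared only at the moment the raw JSON is read for analysis:

```python
# Schema-on-read sketch: the raw JSON in the lake carries no enforced schema;
# a structure is declared here, at read time, only for this analysis.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

clickstream_schema = StructType([
    StructField("user", LongType()),
    StructField("page", StringType()),
    StructField("ts", StringType()),
])

events = (
    spark.read
    .schema(clickstream_schema)  # applied on read, not on write
    .json("s3a://enterprise-data-lake/raw/clickstream/2021/01/10/")  # hypothetical path
)

events.groupBy("page").count().show()
```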

Photo by Yuyeung Lau on Unsplash

Another architectural consideration to keep in mind is that data in a data lake does not go through the ETL process before it is loaded; it lands in raw form and is transformed later, when it is consumed.

ETL stands for Extract, Transform, Load. ETL is a procedure to copy data from data sources to data destinations.

From a storage architecture perspective, data lakes are considered raw big data stores, not optimised or transformed for specific data consumers. The storage architecture for data lakes mandates that the data stores be based on low-cost storage units. This architectural consideration has a favourable impact from the business value point of view.
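
On an object store, this often translates into lifecycle rules that move ageing raw data onto cheaper storage tiers. The sketch below uses boto3 against a hypothetical bucket; the prefixes and day thresholds are assumptions, not recommendations.

```python
# Sketch: tier ageing raw data to lower-cost storage classes (thresholds assumed).
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="enterprise-data-lake",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```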

From the non-functional requirements perspective, the key architectural consideration for data lakes is scalability, because data growth is the main strategic focus of data lakes in business organisations. Hence, scalability coupled with capacity is a critical success factor for architecting effective data lake solutions.

One of the critical challenges of data lakes is maintaining security. As we know, data comes into the data lake in real time from multiple uncontrolled sources. To address this challenge, a well-governed data security architecture, specifically including access controls and semantic consistency, must be in place. As architects, we need to engage security specialists to take the required measures for addressing the security concerns of data lakes.
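
As one hedged illustration of an access-control measure (the account ID, role, bucket, and zone layout are all hypothetical), a bucket policy can allow an analytics role to read only the curated zone while the raw zone stays restricted to the ingestion pipeline:

```python
# Sketch: grant read access to the curated zone only (names and ARNs are made up).
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AnalystsReadCuratedOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/analytics-readers"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::enterprise-data-lake/curated/*",
        }
    ],
}

boto3.client("s3").put_bucket_policy(
    Bucket="enterprise-data-lake",
    Policy=json.dumps(policy),
)
```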

As architects, we don’t design data lakes ourselves. Data lake design is a specialist-level activity, usually conducted by an experienced data storage architect or product specialist leveraging the skills of multiple data management specialists. However, the design of data lakes must comply with the architectural framework, principles, and guidelines.

Thank you for reading my perspectives.

Ref: Architecting Big Data & Analytics Solutions

If you enjoyed this story, you may check my other technology articles on News Break.

Importance of Protocols And Standards For IoT Solutions

I Solve The Mystery of IoT and Explain It In Plain Language

Edge Computing Is Not As Complicated & Scary As You May Think

My View Of Blockchain Is Different Because I Design It.

How To Deal With Big Data For Artificial Intelligence?

An Overview of Business Architecture For Entrepreneurs

Remarkable Leadership Traits for Technology Executives
