No, data lake and data warehouse are not two terms for the same thing, nor is a data lake just an updated version of a data warehouse. The only thing they have in common is that both are data repositories.
Let’s explore what makes the two different.
Structure of Data
The data stored in a warehouse is structured and processed. The data stored in a lake can be structured, semi-structured, unstructured, or raw. A lake simply stores data in whatever form it arrives, whereas a warehouse follows a set of rules for how data is stored.
Storage Costs
Storage in Hadoop, which many data lakes are built on, is cheaper than storage in a data warehouse. Hadoop is open source, so there are no licensing fees and community support is free. Moreover, Hadoop is designed to run on low-cost commodity hardware. Although the cost of warehouse storage has fallen sharply over time, the labor required to structure the data is still expensive.
Processing Approach
To enter data into a warehouse, you need to follow a schema-on-write approach, which means the data has to be modeled and shaped before it is loaded. For a lake, though, you don’t need to think twice before dumping data in: just load it in whatever shape it comes. You structure or model it when you want to retrieve and use it; that is called the schema-on-read approach.
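To make the contrast concrete, here is a minimal sketch in plain Python. It is not how any particular warehouse or lake product works; the `Warehouse` and `Lake` classes, the `conform` helper, and the `user_id`/`amount` schema are all hypothetical names invented for this illustration. The only point is where the schema gets applied: at write time in one case, at read time in the other.

```python
import json

# Hypothetical schema for illustration: field name -> type to coerce to.
SCHEMA = {"user_id": int, "amount": float}


def conform(record, schema=SCHEMA):
    """Shape a raw record to the schema; raises if a field is missing or bad."""
    return {field: cast(record[field]) for field, cast in schema.items()}


class Warehouse:
    """Schema-on-write: records are shaped and validated before being stored."""

    def __init__(self):
        self.rows = []

    def write(self, record):
        # A malformed record is rejected here, at load time.
        self.rows.append(conform(record))

    def read(self):
        return list(self.rows)


class Lake:
    """Schema-on-read: raw payloads are stored as-is and shaped when queried."""

    def __init__(self):
        self.blobs = []

    def write(self, raw):
        # No validation: anything can be dumped in.
        self.blobs.append(raw)

    def read(self, schema=SCHEMA):
        rows = []
        for raw in self.blobs:
            try:
                rows.append(conform(json.loads(raw), schema))
            except (ValueError, KeyError, TypeError):
                continue  # payload does not fit this schema; skip it
        return rows
```

In this sketch, `Warehouse.write` refuses a record whose `user_id` cannot be read as an integer, while `Lake.write` accepts even a free-form log line and only `Lake.read` decides which payloads fit the schema being asked for. Passing a different `schema` to `Lake.read` reinterprets the same stored payloads without reloading anything.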
Flexibility
Since data lakes impose no preordained rules, any query, model, or app can be modified easily. In contrast, changing the structure of something stored in a data warehouse takes time and effort, since other business processes are tied to it.
Safety
Data warehouses are the traditional data repositories and have been around for decades, so their security practices are well established. Data lake technology is relatively new, and its data security capabilities are less mature than those of warehouses.
Users
For now, it is primarily data scientists who are flocking to data lakes, given how young the technology is. Data warehouses are not within reach for everyone because of the costs involved.
The two repositories serve different objectives, so make your choice with the end goal in mind.
Have ideas contrary to the ones we stated here? We’d love to hear your opinion. Share them in the comments section.