Updated: Aug 5
The Data Lake concept is no more than a centralised data repository: information is physically copied into a single store that acts as a single data source. It gives us a pure view of the data before any processing has been applied. This raw information is crucial for data scientists, who can use it in their analyses with predictive modelling and analytical tools.
However, this architecture has two major disadvantages. The first is the physical nature of the Data Lake, which requires a big investment. The second is the limited usage relative to that investment, since it was designed solely for data scientists.
Data Virtualisation is a huge asset in mitigating those disadvantages, both when a Data Lake already exists and when it does not.
In the first case, it allows us to use the raw data of the Data Lake to create views adapted to the needs of the full variety of users in the organisation, from traditional BI users to auditors and data scientists. These views can be separated by role, for instance by department or profile, guaranteeing that each role only has access to the data it really needs to create its analyses, reports or models.
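The idea of role-specific views over the same raw data can be sketched in a few lines of Python. This is only an illustration and not how any particular Data Virtualisation product works: the records in `RAW_ORDERS`, the role names in `ROLE_COLUMNS` and the helper `view_for` are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical raw records, standing in for data held in the Data Lake.
RAW_ORDERS: List[Row] = [
    {"order_id": 1, "customer": "Ana", "department": "retail", "amount": 120.0},
    {"order_id": 2, "customer": "Rui", "department": "online", "amount": 75.5},
]

# Hypothetical view definitions: the columns each role is allowed to see.
ROLE_COLUMNS: Dict[str, List[str]] = {
    "bi_user": ["order_id", "department", "amount"],
    "auditor": ["order_id", "customer", "amount"],
    "data_scientist": ["department", "amount"],
}


def view_for(role: str, records: List[Row]) -> List[Row]:
    """Project the raw records onto the columns defined for the given role."""
    if role not in ROLE_COLUMNS:
        raise PermissionError(f"no view defined for role {role!r}")
    columns = ROLE_COLUMNS[role]
    # The raw data is not copied or modified; each call builds the view on demand.
    return [{column: row[column] for column in columns} for row in records]
```

Calling `view_for("data_scientist", RAW_ORDERS)` returns only the `department` and `amount` fields, while an unknown role is rejected outright. A real platform would typically also filter rows and enforce this at the query layer, which this sketch leaves out.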
On the other hand, in organisations that have not invested in a Data Lake, Data Virtualisation allows us to create a Logical Data Lake. It integrates the raw data into a single virtual repository and creates a centralised catalogue of the various data sources, without a physical repository and without all the processes that copy data from the sources to the Data Lake, which reduces time and costs. This solution also has the advantage of always providing up-to-date information in real time, since we access the most recent version of the data directly at the source.
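A Logical Data Lake of this kind can be pictured as a catalogue that stores how to reach each source rather than a copy of its data. The sketch below is a simplified illustration under stated assumptions: the class `LogicalDataLake`, the source names `erp.orders` and `crm.customers`, and the in-memory SQLite database and Python list standing in for real systems are all hypothetical.

```python
import sqlite3
from typing import Callable, Dict, List

Row = Dict[str, object]


class LogicalDataLake:
    """A catalogue of data sources; data is fetched only when it is queried."""

    def __init__(self) -> None:
        self._catalogue: Dict[str, Callable[[], List[Row]]] = {}

    def register(self, name: str, fetch: Callable[[], List[Row]]) -> None:
        # Store the access method, not the data itself.
        self._catalogue[name] = fetch

    def sources(self) -> List[str]:
        return sorted(self._catalogue)

    def query(self, name: str) -> List[Row]:
        # Each query goes to the source, so it sees the current state.
        return self._catalogue[name]()


# Stand-in for an operational database (assumption: an in-memory SQLite table).
erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
erp.execute("INSERT INTO orders VALUES (1, 120.0)")


def fetch_orders() -> List[Row]:
    cursor = erp.execute("SELECT id, amount FROM orders ORDER BY id")
    return [{"id": row[0], "amount": row[1]} for row in cursor]


# Stand-in for a second system (assumption: a plain Python list).
crm_customers: List[Row] = [{"id": 10, "name": "Ana"}]

lake = LogicalDataLake()
lake.register("erp.orders", fetch_orders)
lake.register("crm.customers", lambda: list(crm_customers))
```

Because nothing is copied at registration time, a row inserted into the `orders` table after the catalogue is built appears in the next `lake.query("erp.orders")` call, with no load process in between.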
In summary, Data Virtualisation brings numerous benefits in both scenarios: exploring a physical Data Lake, and creating and exploring a Logical Data Lake. It is a very flexible and agile solution, because the metadata specifications to access, transform, integrate and clean the data can be defined once and reused many times by different roles, without any big developments or complex changes to supply the necessary data to each user.
by Carlos Calvão
Business Intelligence Consultant @Passio Consulting