Data Virtualisation for Data Scientists
With the growing use of descriptive and predictive methods to improve decision-making, data scientists need to be able to work faster. On average, they spend 80% of their time on data preparation tasks, such as connecting to multiple sources and extracting, transforming, integrating, and cleaning data, and only 20% on analytics.
Data Virtualisation solutions are not code-based, meaning the work can be done with drag-and-drop instead of by writing code.
This is where Data Virtualisation can help:
Connecting to multiple data sources: these solutions come with pre-built connectors for several sources, which makes this part of the process quite fast.
Extracting data: all the data is read in real time, so there is no need to store it anywhere. If real-time reads might impact the source systems, the data can be stored in the cache instead.
Transforming data: as with extraction, all transformations are done on the fly, so it is possible to test them before fully implementing them. In this step, it is also possible to remove outliers and to assign new values to records.
Integrating data: these solutions act as a virtual database, so integration operations such as join, minus, intersect, merge, and union are available. This makes it possible to join data between an Excel file and a SQL database, or between a web service and a NoSQL database.
Cleaning data: filters make it possible to consider only the data that matters at the moment and to come back later to make other changes, since one of the objectives of these solutions is to reuse as much as possible of the work done in the previous stages.
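For readers who think in code, the steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any Data Virtualisation product works internally: the two sources, the names `read_customers`, `make_orders_db`, and `virtual_view`, and the sample records are all hypothetical. A CSV text stands in for the Excel file and an in-memory SQLite database stands in for the SQL database.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a spreadsheet export, represented here as CSV text.
CUSTOMERS_CSV = """customer_id,name,country
1,Ana,PT
2,Ben,UK
3,Chloe,FR
"""


def read_customers():
    """Read the file source at query time; nothing is copied elsewhere."""
    return list(csv.DictReader(io.StringIO(CUSTOMERS_CSV)))


def make_orders_db():
    """Hypothetical source 2: a relational database (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 120.0), (11, 1, 80.0), (12, 2, 9999.0), (13, 3, 40.0)],
    )
    return conn


def virtual_view(conn, max_amount=None):
    """Join both sources on the fly and optionally filter out outliers."""
    # Index the file source by key so each order can be matched to a customer.
    customers = {int(row["customer_id"]): row for row in read_customers()}
    query = "SELECT order_id, customer_id, amount FROM orders ORDER BY order_id"
    rows = []
    for order_id, customer_id, amount in conn.execute(query):
        # Cleaning step: skip amounts above the chosen threshold.
        if max_amount is not None and amount > max_amount:
            continue
        customer = customers.get(customer_id)
        if customer is None:
            continue  # inner join: drop orders without a matching customer
        rows.append(
            {
                "order_id": order_id,
                "name": customer["name"],
                "country": customer["country"],
                "amount": amount,
            }
        )
    return rows
```

Calling `virtual_view(make_orders_db())` returns all four joined records, while `virtual_view(make_orders_db(), max_amount=1000)` leaves out the one unusually large order. The result is computed each time it is requested, which mirrors the on-the-fly behaviour described above. A Data Virtualisation tool offers the same kind of view through drag-and-drop, without any of this code being written by hand.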
With Data Virtualisation, all of these steps can take an average of 30% of the data scientist's time instead of 80%. That means 50 percentage points less time on data preparation tasks and 50 percentage points more on analytics, where most of the time should be spent.
Want to learn more about how Data Virtualisation can help you? Contact us.
by Mariana Pinto
Data Virtualisation Consultant @Passio Consulting