The first thing that comes to mind when people talk about data science is AI and machine learning, but data scientists spend much of their time on other kinds of work.
Here are the different jobs a data scientist does and how far each of them can be automated.
Conversation with the Clients
Level of Automation: Impossible
The most important job of a data scientist is talking to the client.
Most large enterprises have people who act as intermediaries between data scientists and clients. It still matters that data scientists identify the client’s needs themselves, because cutting out the intermediary reduces the risk of information getting lost in translation.
Practically none of this can be automated.
Talking to people, holding full conversations, and taking in their ideas is a fully human process.
We are a long way from machines that can hold a full conversation and understand the client’s ideas and interests.
The least technical part of data science is the most difficult to automate.
Data Preparation and Data Cleaning
Level of Automation: Partially automated
Data preparation and data cleaning are often said to take about 80 percent of a data scientist’s time.
Data preparation and data cleaning involve:
- Joining different types of data
- Removing the errors in the data
- Getting access to data
Most data preparation and cleaning chores can be completed by repeatedly applying simple heuristics to the data until all issues are resolved. For example:
- Check whether the distribution of dates in a table is influenced by weekends, holidays, or other regularly occurring events that may be relevant.
- If a table has category columns that were entered by hand, check them for mistakes and correct them.
- Check whether a numerical column ever takes a value outside a reasonable range.
These heuristics are mostly simple, but the problem is that there are thousands of them and they are very subjective.
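The three checks above can be sketched in a few lines. This is only a minimal illustration in plain Python, not a reference to any particular cleaning library: the records, the list of valid cities, the age range, and the function names are all hypothetical, and in practice the same checks would more likely be written with a data-frame library.

```python
from collections import Counter
from datetime import date
import difflib

# Hypothetical hand-entered records; every value is invented for illustration.
rows = [
    {"day": date(2024, 1, 1), "city": "London", "age": 34},
    {"day": date(2024, 1, 2), "city": "Lodnon", "age": 29},
    {"day": date(2024, 1, 3), "city": "Paris", "age": 210},
    {"day": date(2024, 1, 6), "city": "paris ", "age": 41},
    {"day": date(2024, 1, 8), "city": "London", "age": -3},
]

KNOWN_CITIES = ["London", "Paris"]  # assumed list of valid categories
AGE_RANGE = (0, 120)                # assumed plausible range for this column


def weekday_counts(records):
    """Count rows per weekday (0 = Monday) to expose weekend gaps or spikes."""
    return Counter(r["day"].weekday() for r in records)


def fix_category(value, known, cutoff=0.6):
    """Map a hand-typed value to the closest known category, or None if none is close."""
    cleaned = value.strip().title()
    match = difflib.get_close_matches(cleaned, known, n=1, cutoff=cutoff)
    return match[0] if match else None


def out_of_range(records, column, low, high):
    """Return the rows whose value in `column` falls outside [low, high]."""
    return [r for r in records if not low <= r[column] <= high]


print(weekday_counts(rows))
print([fix_category(r["city"], KNOWN_CITIES) for r in rows])
print(out_of_range(rows, "age", *AGE_RANGE))
```

Each function is simple on its own. The subjective part is choosing the valid categories, the similarity cutoff, and the acceptable range, which is why these decisions are hard to automate fully.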
Automating data cleaning and data preparation is possible, and the simpler tasks have already been automated by popular libraries.
Automation helps data scientists catch and correct errors quickly and saves time.
Some tasks cannot be automated because they involve interaction with clients.
Tools for Automating Data Cleaning
There are many tools and libraries that help data scientists clean data, and a quick web search will turn up more.
Data cleaning tools include:
- PrettyPandas
Data Exploration and Feature Engineering
Level of Automation: Partially automated, with very good scope for more
Data exploration and feature engineering are the next step after data cleaning and preparation.
They cover two aspects:
- Data exploration is about understanding the data.
- Feature engineering is about understanding the problem and shaping the data to fit it.
Data exploration can be automated very well, and many tools have been designed for that purpose.
Tools for Automating Data Exploration and Feature Engineering
For data exploration:
NumPy and pandas can be used to analyze data, and seaborn, matplotlib, and Plotly can be used for visualizations.
These libraries can combine and filter almost any kind of data with ease.
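To make the combine-and-filter idea concrete, here is a small sketch of the kind of summary an exploration step produces. It deliberately uses only the Python standard library so that it runs anywhere. The sales records and the function names are hypothetical, and a data-frame library such as pandas would normally do the same work in a line or two.

```python
from collections import defaultdict
import statistics

# Hypothetical sales records, invented for illustration.
sales = [
    {"region": "north", "amount": 120.0},
    {"region": "north", "amount": 80.0},
    {"region": "south", "amount": 200.0},
    {"region": "south", "amount": 100.0},
    {"region": "south", "amount": 300.0},
]


def summarize(values):
    """Basic summary statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


def mean_by_group(records, key, value):
    """Average `value` within each distinct value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {name: statistics.mean(vals) for name, vals in groups.items()}


overall = summarize([r["amount"] for r in sales])
per_region = mean_by_group(sales, "region", "amount")
print(overall)
print(per_region)
```

Summaries like these are mechanical to compute, which is why exploration lends itself to automation. Deciding which summary matters for the client is still a human call.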
For feature engineering:
Featuretools can be used to generate new features.
getML can also be used, and it addresses some shortcomings of Featuretools.
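Automated feature engineering tools largely work by aggregating related rows into new columns. The sketch below hand-rolls that idea in plain Python and does not use the Featuretools or getML API. The transactions table, the feature names, and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactions table of (customer_id, amount) pairs. Invented data.
transactions = [
    ("c1", 20.0),
    ("c1", 35.0),
    ("c2", 5.0),
    ("c1", 5.0),
    ("c2", 15.0),
]


def aggregate_features(rows):
    """Build one feature row per customer from that customer's transactions."""
    amounts = defaultdict(list)
    for customer_id, amount in rows:
        amounts[customer_id].append(amount)
    return {
        customer_id: {
            "n_transactions": len(values),
            "total_amount": sum(values),
            "mean_amount": sum(values) / len(values),
            "max_amount": max(values),
        }
        for customer_id, values in amounts.items()
    }


features = aggregate_features(transactions)
print(features)
```

A tool can generate many such aggregates mechanically. Knowing which of them actually relate to the problem being solved is the part that still needs an understanding of the problem.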
Modelling
Level of Automation: Fully automated
This is the part where data turns into results.
It is the part that data scientists spend the most time studying in college, and it is the first part to be automated.
This can be further split into
Tools for automating the modelling process include:
- DataRobot Automated Machine Learning
- H2O AutoML
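At its core, automated modelling is a loop that fits several candidate models and keeps the one that scores best on held-out data. The toy sketch below shows only that loop in plain Python. It is not the DataRobot or H2O API; the two candidate models, the data, and all function names are assumptions made for illustration, and real tools search far larger spaces of models and settings.

```python
import statistics


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = statistics.mean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate: least-squares straight line y = intercept + slope * x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return statistics.mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


def select_model(candidates, train, valid):
    """Fit every candidate on `train` and keep the one with the lowest error on `valid`."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mse(model, *valid)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical data that roughly follows y = 2x + 1.
train = ([0, 1, 2, 3, 4, 5], [1.0, 3.1, 4.9, 7.0, 9.1, 10.9])
valid = ([6, 7], [13.0, 15.0])

best_name, scores = select_model({"mean": fit_mean, "line": fit_line}, train, valid)
print(best_name, scores)
```

Because the loop needs nothing but data and a scoring rule, it is the easiest part of the workflow to hand over to a tool.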
Automation in data science is very much possible, and some parts have already been fully automated. It helps data scientists remove errors and compile data more easily. However, some parts of data science still depend on human conversation: dealing with clients and understanding their needs cannot be automated. Beyond that, automation is the future, and its incorporation into the rest of data science is inevitable. Used properly, it can be very useful.