Automation in Data Science

Anurag Rathod

4 years ago

Data science is often associated with AI and machine learning, but data scientists also perform essential human-focused tasks. The four stages of their work described below differ widely in how far they can be automated.

1. Conversation with Clients

Level of Automation: Impossible

A key task of a data scientist is direct client communication. This helps eliminate miscommunication and align projects with expectations.

Currently, no machine can fully replicate human conversation or understand a client’s ideas, goals, and preferences.

2. Data Preparation and Cleaning

Level of Automation: Partially Automated

Data preparation and cleaning are often estimated to consume as much as 80% of a data scientist’s time. This stage involves:

Joining different types of data

Removing errors

Gaining access to data

Many repetitive cleaning tasks can be automated using simple heuristics, for example:

Examine date distributions to check for weekends, holidays, or recurring events

Review manually entered category columns and correct errors

Inspect numerical columns to ensure values are within reasonable ranges and flag anomalies
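The three heuristics above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a production cleaner: the sample rows, the list of known categories, the 0.8 similarity cutoff, and the plausible amount range are all assumptions made up for the example.

```python
from datetime import date
from difflib import get_close_matches

# Hypothetical raw records: (order date, manually typed category, amount).
rows = [
    (date(2024, 3, 4), "Electronics", 120.0),
    (date(2024, 3, 9), "electroncs", 95.5),    # Saturday, misspelled category
    (date(2024, 3, 11), "Grocery", -15.0),     # negative amount
    (date(2024, 3, 12), "grocery ", 48000.0),  # stray whitespace, very large amount
]

KNOWN_CATEGORIES = ["Electronics", "Grocery"]  # assumed canonical labels
AMOUNT_RANGE = (0.0, 10_000.0)                 # assumed plausible range


def is_weekend(d):
    """Return True when the date falls on a Saturday or Sunday."""
    return d.weekday() >= 5


def fix_category(raw):
    """Map a typed label to the closest known category, or None if nothing is close."""
    lookup = {c.lower(): c for c in KNOWN_CATEGORIES}
    match = get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=0.8)
    return lookup[match[0]] if match else None


def in_range(value, bounds=AMOUNT_RANGE):
    """Return True when the value lies inside the plausible range."""
    low, high = bounds
    return low <= value <= high


# One report entry per row: weekend flag, corrected category, range check.
report = [
    {
        "date": d,
        "weekend": is_weekend(d),
        "category": fix_category(cat),
        "amount_ok": in_range(amount),
    }
    for d, cat, amount in rows
]
```

Rows with a weekend date, an unmatched category, or an out-of-range amount can then be routed to a person for review, which is why this stage is only partially automated.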

Tools for Automating Data Cleaning

Data scientists can speed up cleaning tasks with tools like:

DORA

DataCleaner

PrettyPandas

Tabulate

Scrubadub

Arrow

Beautifier

ftfy

3. Data Exploration and Feature Engineering

Level of Automation: Partially Automated, High Potential

After cleaning data, data scientists explore it and perform feature engineering. This step has two components:

Data Exploration: Understanding the data patterns

Feature Engineering: Applying domain knowledge to create meaningful features

Many aspects of exploration can be automated using libraries like NumPy and Pandas for analysis and Seaborn, Matplotlib, or Plotly for visualization.
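As a rough sketch of what this looks like with Pandas and NumPy, the snippet below summarizes a small table and then derives a few features from it. The `orders` table and its column names are hypothetical, and the derived features (revenue, a weekend flag, a log-scaled price) are only examples of the kind of domain knowledge a data scientist might encode.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned orders table (column names are assumptions).
orders = pd.DataFrame(
    {
        "order_date": pd.to_datetime(
            ["2024-03-04", "2024-03-09", "2024-03-11", "2024-03-12"]
        ),
        "category": ["Electronics", "Electronics", "Grocery", "Grocery"],
        "quantity": [1, 2, 5, 3],
        "unit_price": [120.0, 95.5, 4.0, 6.5],
    }
)

# Exploration: summary statistics and a per-category average.
summary = orders[["quantity", "unit_price"]].describe()
by_category = orders.groupby("category")["unit_price"].mean()

# Feature engineering: new columns derived from existing ones.
features = orders.assign(
    revenue=orders["quantity"] * orders["unit_price"],
    is_weekend=orders["order_date"].dt.dayofweek >= 5,  # Saturday=5, Sunday=6
    log_price=np.log1p(orders["unit_price"]),
)
```

The summary and group-by steps run the same way on any table, which is what makes exploration easy to automate. Deciding which features are worth building still depends on knowing the domain.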

4. Modeling

Level of Automation: Fully Automated

Modeling turns clean, structured data into predictions that can inform decisions. It includes:

Model construction

Validation

Hyperparameter optimization
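To make the three steps concrete, here is a deliberately tiny, hand-rolled sketch in plain Python: a k-nearest-neighbors regressor (construction), k-fold cross-validation (validation), and a search over candidate values of k (hyperparameter optimization). The synthetic data, the candidate list, and the function names are assumptions for illustration; automated modeling tools run the same kind of loop over far larger spaces of models and settings.

```python
import random
from statistics import mean


def knn_predict(train, x, k):
    """Model construction: predict y at x as the mean y of the k nearest points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(y for _, y in nearest)


def cv_error(data, k, folds=5):
    """Validation: mean squared error of k-NN under k-fold cross-validation."""
    errors = []
    for i in range(folds):
        test = data[i::folds]  # every folds-th point is held out
        train = [p for j, p in enumerate(data) if j % folds != i]
        errors.extend((knn_predict(train, x, k) - y) ** 2 for x, y in test)
    return mean(errors)


# Synthetic data (an assumption for illustration): y = 2x plus Gaussian noise.
rng = random.Random(0)
data = [(x, 2 * x + rng.gauss(0, 1)) for x in range(50)]
rng.shuffle(data)

# Hyperparameter optimization: score each candidate k, keep the lowest error.
candidates = [1, 3, 5, 9, 15]
scores = {k: cv_error(data, k) for k in candidates}
best_k = min(scores, key=scores.get)
```

Because every step is a mechanical loop over candidates and folds, nothing here needs human judgment once the data and the error metric are fixed.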

Tools for Automating Modeling

Popular automated modeling tools include:

Run:AI

AutoKeras

Auto-WEKA

DataRobot Automated Machine Learning

H2O AutoML

MLBox

auto-sklearn

These tools allow data scientists to quickly train, validate, and optimize models with minimal manual intervention.

Conclusion

Automation in data science is increasingly feasible and widely adopted. It helps data scientists reduce errors, clean data more efficiently, and streamline the modeling process, while tasks such as client communication remain firmly human.