Data Science workflow: how to structure your next data science project

So you have decided to walk the path of Data Science. That’s excellent news! Data Science can take your analytics to a new level, turning raw data into insights that help your business grow.

But where do you start?

First, set up a workflow that defines the phases of the project. A well-defined data science workflow gives every team member a simple reminder of the work each project involves.

The Data Science workflow has four well-defined phases:

  1. Preparation Phase
  2. Analysis Phase
  3. Reflection Phase
  4. Dissemination Phase


Preparation Phase

Before any analysis can be done, the data scientist must first acquire the data and then reformat it into a form that is compatible with the tools to be used.

The obvious first step in any data science workflow is to acquire the data to analyze. Data can be acquired from a variety of sources, such as:

  • Online repositories such as public websites (e.g., U.S. Census data sets).

  • On-demand from online sources via an API (e.g., the Bloomberg financial data stream).

  • Automatically generated by physical apparatus, such as scientific lab equipment attached to computers.

  • Generated by computer software, such as logs from a web server or classifications produced by a machine learning algorithm.

  • Manually entered into a spreadsheet or text file by a human.

Raw data is rarely in a convenient format for a programmer to run a particular analysis, often simply because it was formatted by somebody else without that programmer’s analysis in mind. A related problem is that raw data often contains semantic errors, missing entries, or inconsistent formatting, so it needs to be “cleaned” prior to analysis.

Programmers reformat and clean data either by writing scripts or by manually editing data in, say, a spreadsheet.
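As a rough illustration of the scripted approach, here is a minimal Python sketch that cleans a small table. The data, the column names (city, temperature_c) and the clean_rows helper are all made up for this example; a real project would apply its own rules to its own data.

```python
import csv
import io

# Hypothetical raw data (not from a real data set): inconsistent casing,
# stray whitespace, a missing entry, and a non-numeric value.
RAW_CSV = """city,temperature_c
 Madrid ,21.5
madrid,22.0
BERLIN,
Berlin,n/a
Lima,18.2
"""


def clean_rows(raw_text):
    """Return a list of (city, temperature) tuples, dropping unusable rows."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Normalize formatting: trim whitespace and use consistent capitalization.
        city = row["city"].strip().title()
        value = (row["temperature_c"] or "").strip()
        try:
            temperature = float(value)
        except ValueError:
            continue  # skip missing or malformed measurements
        cleaned.append((city, temperature))
    return cleaned


rows = clean_rows(RAW_CSV)
print(rows)
```

Dropping bad rows is only one possible policy. Depending on the analysis, it may be better to flag them or fill them in, and that decision is worth writing down so the rest of the team can follow it.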

Analysis Phase

The core activity of data science is the analysis phase: writing, executing, and refining computer programs to analyze and obtain insights from data. We will refer to these kinds of programs as data analysis scripts, since data scientists often prefer to use interpreted “scripting” languages such as Python, Perl, R, and MATLAB. However, they also use compiled languages such as C, C++, and Fortran when appropriate.
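To make the idea of a data analysis script concrete, here is a small Python sketch that groups measurements by city and averages them. The measurements and the mean_by_group helper are hypothetical and only meant to show the write, run, refine loop on a tiny scale.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cleaned measurements: (city, temperature in Celsius).
measurements = [
    ("Madrid", 21.5),
    ("Madrid", 22.5),
    ("Lima", 18.0),
    ("Lima", 19.0),
    ("Lima", 20.0),
]


def mean_by_group(records):
    """Average the values of (key, value) records for each key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


summary = mean_by_group(measurements)
for city, average in sorted(summary.items()):
    print(f"{city}: {average:.1f}")
```

A script like this is usually edited and rerun many times, for example to swap the mean for a median or to group by a different column, as the questions about the data evolve.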

Reflection Phase

Data scientists frequently alternate between the analysis and reflection phases while they work. The reflection phase involves thinking and communicating about the outputs of analyses. It may consist of taking notes and sharing them in meetings with other team members in order to compare results, consider alternatives, and organize the insights obtained in the process.

Dissemination Phase

The final phase of data science is disseminating results, most commonly in the form of written reports such as internal memos, slideshow presentations, business/policy white papers, or academic research publications. The main challenge here is how to consolidate all of the various notes, freehand sketches, emails, scripts, and output data files created throughout an experiment to aid in writing. It takes a very organized team to make this phase work properly, since a large amount of material accumulates in many different forms.
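One modest way to start the consolidation is to take inventory of what the project folder contains. The Python sketch below groups files by extension; the build_manifest helper and the file names are invented for the demonstration, and it assumes the team keeps its notes, scripts, and outputs under a single directory.

```python
import tempfile
from pathlib import Path


def build_manifest(project_dir):
    """Group every file under project_dir by extension, in sorted order."""
    manifest = {}
    for path in sorted(Path(project_dir).rglob("*")):
        if path.is_file():
            kind = path.suffix.lower() or "(no extension)"
            relative = path.relative_to(project_dir).as_posix()
            manifest.setdefault(kind, []).append(relative)
    return manifest


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes").mkdir()
    (root / "notes" / "meeting.txt").write_text("Compare model A and model B")
    (root / "analysis.py").write_text("print('hello')")
    (root / "results.csv").write_text("city,mean\n")
    manifest = build_manifest(root)

for kind, files in sorted(manifest.items()):
    print(kind, files)
```

An inventory like this does not write the report, but it gives the team a checklist of material to review before drafting.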

This is a very brief overview of the Data Science workflow. If you want to know more about this subject, we are more than happy to help you out. Let’s have a talk! Thank you for reading.

© emcor software all rights reserved 2021.