Data Science Process

First, the raw data needs to be pre-processed into a usable form

  • For traditional data
    • class labelling is done, i.e. each variable is tagged by type (categorical vs numerical)
    • data cleaning is done
    • missing values are dealt with
    • case-specific steps are also done, e.g. balancing and shuffling datasets, or changing the database schema
    • e.g. - basic customer data, historical stock price data
    • Tools:
      • Programming: R, Python, SQL, Matlab
      • Software: Excel, IBM SPSS
    • Who does this:
      • DATA ARCHITECT
      • DATA ENGINEER
      • DATABASE ADMINISTRATOR
  • For big data
    • preprocessing is done the same way as with traditional data
    • case-specific steps are also done, like text data mining
    • e.g. - social media, financial trade data
    • Tools:
      • Programming: R, Python, Java, Scala
      • Software: Hadoop, HBase, MongoDB
    • Who does this:
      • BIG DATA ARCHITECT
      • BIG DATA ENGINEER
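
The cleaning, missing-value, balancing and shuffling steps above can be sketched in a few lines. This is only a minimal illustration using the Python standard library: the customer records, the field names (age, plan, churned) and the choices of mean imputation and downsampling are all assumptions made for the example, not a prescribed method.

```python
import random
from statistics import mean

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "plan": "basic", "churned": 0},
    {"age": None, "plan": "premium", "churned": 0},
    {"age": 51, "plan": "basic", "churned": 1},
    {"age": 28, "plan": None, "churned": 0},
    {"age": 45, "plan": "premium", "churned": 0},
]

def fill_missing(rows):
    """Fill numeric gaps with the mean age and categorical gaps with 'unknown'."""
    age_mean = mean(r["age"] for r in rows if r["age"] is not None)
    return [
        {
            **r,
            "age": r["age"] if r["age"] is not None else age_mean,
            "plan": r["plan"] if r["plan"] is not None else "unknown",
        }
        for r in rows
    ]

def balance(rows, label, rng):
    """Downsample every class to the size of the smallest class."""
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label], []).append(r)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced

rng = random.Random(0)  # fixed seed so the run is repeatable
clean = fill_missing(records)
balanced = balance(clean, "churned", rng)
rng.shuffle(balanced)  # shuffle so row order carries no information
```

Downsampling throws data away, so on a real dataset one might prefer oversampling or class weights; the point here is only the order of the steps.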

Now that the data is processed into a usable form, we use it to create reports and dashboards to gain business insights

  • What is done?
    • data is analyzed, and information is extracted from it in the form of:
      • KPIs
      • metrics
      • reports
      • dashboards
  • e.g. - price optimization, inventory management
  • Tools:
    • Programming: R, Python, SQL, Matlab
    • Software: Excel, Power BI, SAS, Qlik, Tableau
  • Who does this:
    • BI ANALYST
    • BI CONSULTANT
    • BI DEVELOPER
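
As a rough sketch of what "extracting KPIs and metrics" can look like in code, the snippet below aggregates order rows into per-product units, revenue and profit margin. The order data, the product names and the choice of KPIs are hypothetical, and it assumes every product has non-zero revenue.

```python
from collections import defaultdict

# Hypothetical order rows: (product, units sold, unit price, unit cost)
orders = [
    ("widget", 10, 5.0, 3.0),
    ("gadget", 4, 20.0, 12.0),
    ("widget", 6, 5.0, 3.0),
]

def kpis_by_product(rows):
    """Aggregate units, revenue, profit and margin for each product."""
    totals = defaultdict(lambda: {"units": 0, "revenue": 0.0, "profit": 0.0})
    for product, units, price, cost in rows:
        t = totals[product]
        t["units"] += units
        t["revenue"] += units * price
        t["profit"] += units * (price - cost)
    for t in totals.values():
        # Margin = profit as a share of revenue (assumes revenue > 0).
        t["margin"] = t["profit"] / t["revenue"]
    return dict(totals)

report = kpis_by_product(orders)
for product, t in sorted(report.items()):
    print(f"{product:8s} units={t['units']:3d} "
          f"revenue={t['revenue']:7.2f} margin={t['margin']:.0%}")
```

In practice the same aggregation is usually done with SQL or a BI tool; the output of a step like this is what feeds a report or dashboard.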

Now we do predictive analysis on the data, using either traditional techniques or machine learning

  • Using traditional techniques (advanced statistical methods)
    • What is used:
      • Regression
      • Logistic Regression
      • Clustering
      • Factor Analysis
      • Time Series
    • e.g. - user experience, sales forecasting
    • Tools:
      • Programming: R, Python, Matlab
      • Software: Excel, IBM SPSS, EViews, Stata
    • Who does this:
      • DATA SCIENTIST
      • DATA ANALYST
  • Using Machine Learning
    • What is used:
      • Supervised Learning
        • support vector machines (SVM)
        • neural networks (NN)
        • deep learning
        • random forests
        • bayesian networks
      • Unsupervised Learning
        • K-means
        • deep learning
      • Reinforcement learning
    • e.g. - fraud detection, client retention
    • Tools:
      • Programming: R, Python, Matlab, Java, JavaScript, C, Scala, C++
      • Software: Microsoft Azure, RapidMiner
    • Who does this:
      • DATA SCIENTIST
      • MACHINE LEARNING ENGINEER
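
To make the two families above concrete, here is a small sketch with one technique from each: an ordinary least squares regression line (traditional) used for a toy sales forecast, and K-means (unsupervised learning) in one dimension. The sales figures, the transaction amounts and the starting centers are made up for illustration, and real work would normally rely on a statistics or machine learning library rather than hand-written loops.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def kmeans_1d(values, centers, iterations=10):
    """K-means (Lloyd's algorithm) on 1-D points from the given start centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly sales for months 1..5
months = [1, 2, 3, 4, 5]
sales = [110.0, 120.0, 130.0, 140.0, 150.0]
slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept

# Hypothetical transaction amounts forming two obvious groups
amounts = [9.0, 10.0, 11.0, 99.0, 100.0, 101.0]
centers, clusters = kmeans_1d(amounts, centers=[0.0, 50.0])
```

The regression needs labelled history (past sales) to predict a number, while K-means only groups unlabelled values, which is the practical difference between the forecasting and the unsupervised entries in the lists above.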