Data Science Process
- For traditional data
- class labelling is done (categorical vs numerical)
- data cleaning is done
- missing values are dealt with
- case-specific steps are also done, like balancing and shuffling datasets, changing the database schema, etc.
- e.g. - basic customer data, historical stock price data
- Tools:
- Programming: R, Python, SQL, Matlab
- Software: Excel, IBM SPSS
- Who does this:
- DATA ARCHITECT
- DATA ENGINEER
- DATABASE ADMINISTRATOR
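
A minimal sketch of the preprocessing steps listed above (handling missing values, balancing, shuffling), using only the Python standard library. The customer records, the column names, and the choice of mean imputation and undersampling are assumptions made for illustration; real projects often use other strategies.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "income": 52000.0, "churned": 0},
    {"age": None, "income": 61000.0, "churned": 0},
    {"age": 45, "income": None, "churned": 1},
    {"age": 29, "income": 48000.0, "churned": 0},
    {"age": 51, "income": 75000.0, "churned": 1},
    {"age": 38, "income": 58000.0, "churned": 0},
]


def fill_missing(rows, columns):
    """Replace None in each numeric column with that column's mean."""
    filled = [dict(row) for row in rows]  # copy so the input stays untouched
    for col in columns:
        col_mean = mean(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = col_mean
    return filled


def balance(rows, label, rng):
    """Undersample every class down to the size of the smallest class."""
    by_class = defaultdict(list)
    for r in rows:
        by_class[r[label]].append(r)
    smallest = min(len(group) for group in by_class.values())
    balanced_rows = []
    for group in by_class.values():
        balanced_rows.extend(rng.sample(group, smallest))
    return balanced_rows


rng = random.Random(42)  # fixed seed so the shuffle is reproducible
cleaned = fill_missing(records, ["age", "income"])
balanced = balance(cleaned, "churned", rng)
rng.shuffle(balanced)

for row in balanced:
    print(row)
```

Undersampling throws data away, so it is only one possible way to balance classes; oversampling the minority class is a common alternative.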
- For big data
- preprocessing is done the same way as with traditional data
- case specific steps are also done, like text data mining
- e.g. - social media, financial trade data
- Tools:
- Programming: R, Python, Java, Scala
- Software: Hadoop, HBase, MongoDB
- Who does this:
- BIG DATA ARCHITECT
- BIG DATA ENGINEER
- For business intelligence
- What is done:
- data is analyzed, and information is extracted from it in the form of:
- KPIs
- metrics
- reports
- dashboards
- e.g. - price optimization, inventory management
- Tools:
- Programming: R, Python, SQL, Matlab
- Software: Excel, Power BI, SAS, Qlik, Tableau
- Who does this:
- BI ANALYST
- BI CONSULTANT
- BI DEVELOPER
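
As a rough illustration of turning analyzed data into KPIs and a small report, here is a sketch in plain Python. The order records and the chosen KPIs (total revenue, average order value, revenue per product) are hypothetical examples, not a fixed definition of what a BI team tracks.

```python
from collections import defaultdict

# Hypothetical order data for illustration only.
orders = [
    {"product": "A", "units": 3, "unit_price": 10.0},
    {"product": "B", "units": 1, "unit_price": 25.0},
    {"product": "A", "units": 2, "unit_price": 10.0},
    {"product": "C", "units": 5, "unit_price": 4.0},
]


def order_revenue(order):
    """Revenue of a single order: units sold times unit price."""
    return order["units"] * order["unit_price"]


# KPI 1: total revenue across all orders.
total_revenue = sum(order_revenue(o) for o in orders)

# KPI 2: average order value.
average_order_value = total_revenue / len(orders)

# Metric: revenue broken down by product.
revenue_by_product = defaultdict(float)
for o in orders:
    revenue_by_product[o["product"]] += order_revenue(o)

# A tiny text report; a dashboard tool would visualize the same numbers.
print(f"Total revenue:       {total_revenue:.2f}")
print(f"Average order value: {average_order_value:.2f}")
for product, value in sorted(revenue_by_product.items()):
    print(f"Revenue for {product}:       {value:.2f}")
```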
Now we do predictive analysis on the data, using either traditional techniques or machine learning.
- Using traditional techniques (advanced statistical methods)
- What is used:
- Regression
- Logistic Regression
- Clustering
- Factor Analysis
- Time Series
- e.g. - user experience, sales forecasting
- Tools:
- Programming: R, Python, Matlab
- Software: Excel, IBM SPSS, EViews, Stata
- Who does this:
- DATA SCIENTIST
- DATA ANALYST
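
To make the regression item concrete, here is a minimal sketch of simple linear regression (ordinary least squares with one predictor) applied to a sales forecasting toy problem. The monthly figures are made up for illustration, and a straight-line trend is an assumption that real sales data may not satisfy.

```python
# Hypothetical monthly sales figures: month index and units sold.
months = [1, 2, 3, 4, 5]
sales = [110.0, 120.0, 135.0, 140.0, 155.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_line(months, sales)

# Forecast the next month by extending the fitted line.
forecast_month_6 = intercept + slope * 6

print(f"sales = {intercept:.1f} + {slope:.1f} * month")
print(f"forecast for month 6: {forecast_month_6:.1f}")
```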
- Using Machine Learning
- What is used:
- Supervised Learning
- SVM (support vector machines)
- NN (neural networks)
- deep learning
- random forests
- Bayesian networks
- Unsupervised Learning
- Reinforcement learning
- e.g. fraud detection, client retention
- Tools:
- Programming: R, Python, Matlab, Java, JavaScript, C, Scala, C++
- Software: Microsoft Azure, RapidMiner
- Who does this:
- DATA SCIENTIST
- MACHINE LEARNING ENGINEER
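
As a small illustration of supervised learning, here is a sketch of a perceptron, the simplest single-neuron form of a neural network, trained on a toy fraud detection dataset. The two features (a scaled transaction amount and a scaled transaction frequency), the labels, and the training settings are all hypothetical; production fraud models are far more involved.

```python
# Hypothetical, already scaled features: (amount, frequency).
samples = [
    (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4),    # legitimate
    (0.9, 0.8), (0.8, 0.9), (0.7, 0.95), (1.0, 0.7),   # fraudulent
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = fraud, 0 = legitimate


def predict(weights, bias, features):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_perceptron(samples, labels, epochs=100, learning_rate=1.0):
    """Classic perceptron rule: adjust weights only on misclassified points."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)
            if error != 0:
                mistakes += 1
                weights = [
                    w + learning_rate * error * x
                    for w, x in zip(weights, features)
                ]
                bias += learning_rate * error
        if mistakes == 0:  # every training point is classified correctly
            break
    return weights, bias


weights, bias = train_perceptron(samples, labels)

# Score two unseen, hypothetical transactions.
print(predict(weights, bias, (0.15, 0.15)))
print(predict(weights, bias, (0.85, 0.85)))
```

The perceptron only works when the classes can be separated by a straight line, which holds for this toy data by construction; the methods listed above (SVM, deep learning, random forests) handle harder cases.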