HAD7001H-F2

Introduction to Data Science

Course Description

Are you ready to start your journey to becoming a Data Scientist?

This course will employ Python to cover the entire end-to-end data science pipeline—from Data Acquisition to Data Preparation, Exploratory Data Analysis, Data Modeling and Evaluation, and Interpretation and Reporting of Findings. We will spend some time on the theoretical concepts related to data science. However, the majority of the course will focus on applying different techniques to real data and interpreting the results. The course is ideally suited for absolute beginners in Data Science and Machine Learning. While a background in Python programming and data science can help speed up learning, we will cover the basics of Python programming.

Learning Goals

  • Guide students, regardless of prior Python or statistics background, from a basic level to performing advanced data science techniques using Jupyter Notebook.
  • Equip students to use Python for performing various data analysis, visualization, and modeling tasks.
  • Introduce essential statistical and machine learning concepts in a practical manner, enabling students to apply these concepts to real-world data analysis and interpretation.
  • Provide students with a strong foundation in key data science techniques.
  • Enable students to determine the most suitable data science techniques to address their research questions, apply them to their data, and interpret the results.
  • Prepare students to contribute to the rapidly expanding field of data science by establishing a solid understanding of its core principles and foundations.
  • Prepare students to understand foundational machine learning methods for making data actionable in policy making.
  • Emphasize the mathematical and statistical foundations necessary for understanding data science.
  • Expose students to real-world datasets and their applications through practical assignments and projects.

Recommended (not required) Textbooks

  • Nawaz, M. W. Data Science Crash Course for Beginners: Fundamentals and Practices with Python.
  • Igual, L., & Seguí, S. (2024). Introduction to data science. In Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. Cham: Springer International Publishing.
  • Baig, M. R., Govindan, G., & Shrimali, V. R. (2021). Data Science for Marketing Analytics: A practical guide to forming a killer marketing strategy through data analysis with Python. Packt Publishing Ltd.
  • Ou, G., Zhu, Z., Dong, B., & Weinan, E. (2023). Introduction to data science. World Scientific.
  • McKinney, W. (2022). Python for data analysis. O’Reilly Media, Inc.
  • Grus, J. (2019). Data science from scratch: first principles with python. O’Reilly Media.

Instructor

Jude Kong

Accepting Students

Director of the Artificial Intelligence and Mathematical Modelling Lab (AIMM Lab)

Evaluation Breakdown

5% of final grade
35% of final grade
10% of final grade
10% of final grade
40% of final grade

Tentative Schedule

Dates | Week | Content
09/08-09/14 | Week 1 | 1. Introduction to Data Science and Decision Making
1.1. Applications of Data Science
1.2. What This Course Is About
1.3. The Data Science Pipeline
1.4. Python Installation and Libraries for Data Science

2. Data Acquisition
2.1. Loading Data into Memory
2.2. Sampling Data
2.3. Reading from Files
2.4. Getting Data from the Web/Social Media
09/15-09/21 | Week 2 | 3. Data Preparation
3.1. Pandas for Data Preparation
3.1.1. Putting Data Together
3.1.2. Concatenating and Merging Data
3.1.3. Combining Data
3.2. Data Transformation
3.2.1. Removing Unwanted Data and Duplicates
3.2.2. Handling Outliers
3.2.3. Handling Missing or Invalid Data
3.2.4. Data Mapping
3.2.5. Discretization and Binning
3.2.6. Selection of Data
 
09/22-09/28 | Week 3 | 4. Exploratory Data Analysis
4.1. Important Statistics for Data Science
4.2. Plots and Charts
4.2.1. Line Plot
4.2.2. Scatter Plot
4.2.3. Box Plots
4.2.4. Bar Chart
4.2.5. Pie Charts
4.3. Testing Assumptions about Data
4.3.1. Checking Assumption of Normal Distribution of Features
4.3.2. Checking Independence of Features
4.4. Selecting Important Features/Variables
09/29-10/05 | Week 4 | 5. Data Modeling and Evaluation Using Machine Learning
5.1. Basic Machine Learning Terminology
5.1.1. Supervised and Unsupervised Learning
5.1.2. Training and Test Data
5.1.3. Cross-Validation
5.1.4. Error, Loss Function, and Cost Function
5.1.5. Learning Parameters of a Model
5.1.6. Feature Selection and Extraction
5.1.7. Underfitting and Overfitting
 
10/06-10/12 | Week 5 | Thanksgiving Day/Reading week (University closed)
10/13-10/19 | Week 6 | 5.2. Supervised Learning: Regression
10/20-10/26 | Week 7 | 5.3. Supervised Learning: Classification
5.3.1. Logistic Regression
5.3.2. Nearest Neighbor Classification
5.3.3. Naïve Bayes Classification
10/27-11/02 | Week 8 | 5.3.4. Decision Trees
5.3.5. Ensemble Method: Random Forests
11/03-11/09 | Week 9 | 5.4. Unsupervised Learning
5.4.1. Clustering: K-Means Clustering
5.4.2. Dimensionality Reduction: Principal Component Analysis (PCA)
11/10-11/16 | Week 10 | 5.5. Evaluating the Performance of the Trained Model
5.5.1. Holdout
5.5.2. Cross-Validation
5.5.3. Model Evaluation Metrics: Classification Accuracy, Confusion Matrix, Area Under the Curve (AUC), Regression Metrics
11/17-11/23 | Week 11 | 6. Interpretation and Reporting of Findings
6.1. Confusion Matrix
6.2. Receiver Operating Characteristic (ROC) Curve
6.3. Precision-Recall Curve
6.4. Regression Metrics
7. Further Avenues
7.1. Sentiment Analysis
11/24-11/30 | Week 12 | Project presentations
12/01-12/07 | Week 13 | Project presentations
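
As a small taste of the evaluation topics listed in the schedule (cross-validation, classification accuracy, and the confusion matrix), the following sketch computes them by hand. It is an assumption-laden illustration rather than the course's own code: the labels are made up, the helper names (`confusion_matrix`, `classification_metrics`, `k_fold_indices`) are hypothetical, and only the standard library is used.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1
        elif pred == positive:
            counts["fp"] += 1
        elif truth == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def classification_metrics(cm):
    """Derive accuracy, precision, and recall from confusion-matrix counts."""
    total = sum(cm.values())
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total,
        # Guard against division by zero when a class never appears.
        "precision": cm["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": cm["tp"] / actual_pos if actual_pos else 0.0,
    }


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n rows.
    Folds differ in size by at most one; no shuffling is done here."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, test
        start = stop


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
print(classification_metrics(cm))

for train_idx, test_idx in k_fold_indices(10, 3):
    print("train:", train_idx, "test:", test_idx)
```

In cross-validation, a model would be fitted on each `train` index set and scored on the matching `test` set, with the scores averaged across folds.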