The course lectures and activities provided in the "Neural networks using Tensorflow" module (see the nn-tf tab) will help you learn all the background needed to complete this semester-long course project. The overall goal of the project is to learn the foundations of using Tensorflow/Keras to build, train, and evaluate feed-forward neural networks on a standard tabular dataset that can be framed as a classification or a regression problem. If you are learning machine learning for the first time, a binary classification problem will probably be easier than a regression problem. A problem is "binary classification" if your output column is 0 or 1. If your output column has continuous values, even if they are between 0 and 1, it is a regression problem, not classification. While working on the project you will compare the performance (accuracy, speed, etc.) of various neural network models, including a basic single-layer model. For a regression problem, the single-layer model is a linear regression model, and for a binary classification problem, the single-layer model is a logistic regression model. Once again, please note that "logistic regression" is a "classification" technique. You will also learn to investigate "what", "how", and "why" a model makes predictions.
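For concreteness, here is a minimal sketch (not a required implementation) of what these two single-layer baselines look like in Keras, assuming a hypothetical dataset with num_features input columns:

```python
import tensorflow as tf

num_features = 8  # hypothetical; use the number of input columns in your dataset

# Single-layer model for binary classification: logistic regression.
logistic_model = tf.keras.Sequential([
    tf.keras.Input(shape=(num_features,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
logistic_model.compile(optimizer="adam", loss="binary_crossentropy",
                       metrics=["accuracy"])

# Single-layer model for regression: linear regression (no activation).
linear_model = tf.keras.Sequential([
    tf.keras.Input(shape=(num_features,)),
    tf.keras.layers.Dense(1),
])
linear_model.compile(optimizer="adam", loss="mse")
```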
You will work on your projects individually (i.e., group submissions are not allowed). Also, your reports for all phases (including the final report) should be prepared using Overleaf. Non-Overleaf submissions will receive a 0 (zero). If you have accessibility needs, please email me and I will waive this Overleaf requirement. This semester-long project is divided into various phases (please see below). In each phase you are expected to submit the following three items: i) an HTML version of your Python Notebook (along with the outputs of the cells), ii) a PDF report generated from your Overleaf project, and iii) a link to your Overleaf project. If you are using Google Colab, please convert the notebook to an .html file and submit that file, for example using htmltopdf. The PDF report should describe your findings, addressing all the requirements of the project phases. The final report has a limit on the number of pages, but all other reports can be as long as you want. Below is the list of the phases along with the expectations in each phase.
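If you prefer a programmatic route, one common option (my suggestion, not a course requirement) is nbconvert; in a Colab or Jupyter cell you could run:

```python
# Assumes the downloaded notebook is named my_notebook.ipynb (hypothetical name).
# Produces my_notebook.html next to the notebook file.
!jupyter nbconvert --to html my_notebook.ipynb
```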
Before working on this phase, please practice "Activities 1 and 2" in the "Neural networks using Tensorflow" crash course (see the nn-tf tab). You may also find the short lecture on "how to clean a tabular dataset for machine learning" useful for learning how to clean your data. The most important task in this phase is to find a dataset for your project. You can pick a dataset from the UCI ML repository or Kaggle. Once you have found a dataset you like, the next step is to load the data in a Python Notebook and normalize it. In your report's introduction section, please discuss why you chose to work on this project and explain the problem you plan to solve. You should also mention the source of your dataset. In your notebook, please visualize/plot the distribution of each input feature and discuss the range of the values (min, max, mean, median, etc.). For example, plot histograms showing the distribution of each input feature. Selected visualizations should be included in the report.
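As a minimal sketch of this step, assuming your data is a CSV file (the file name below is hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; replace with the path to your own dataset.
df = pd.read_csv("my_dataset.csv")

# Min, max, mean, median (50%), etc. for every numeric column.
print(df.describe())

# One histogram per numeric column, showing each feature's distribution.
df.hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.show()
```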
You should also discuss the distribution of the output labels. Please check whether the data is imbalanced by calculating what percentage of the output labels are 0 and what percentage are 1. If your dataset is heavily imbalanced (for example, 1% vs. 99%), it may be a better idea to choose a different dataset. In the case of regression, check whether the values are uniformly distributed by plotting the distribution of the output variable. Also, in your notebook, you should normalize your data and discuss this in your report. Here is an example report. There are three restrictions when choosing a dataset. First, you are not allowed to pick a time-series dataset; an example of time-series data is a stock price dataset. You are also not allowed to select a dataset consisting of image inputs or text inputs (natural language processing datasets). Second, please do not choose the "Iris flower dataset", the "Pima diabetes dataset", or the "Wine quality dataset"; I already discuss these in my lectures. The last restriction is that your tabular dataset should have at least around 1,000 rows and at least 3 features (columns) in addition to the output column. If you use a dataset with more than 100,000 rows, please email me before the last week of the semester to claim an extra bonus point.
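A minimal sketch of the imbalance check and a simple min-max normalization, assuming the output column is named "label" (a hypothetical name) and df is the DataFrame loaded above:

```python
# Percentage of 0s and 1s in the output column ("label" is a hypothetical name).
print(df["label"].value_counts(normalize=True) * 100)

# Min-max normalization: rescale every input feature to the [0, 1] range.
features = df.drop(columns=["label"])
features_norm = (features - features.min()) / (features.max() - features.min())
```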
Before working on this phase, please practice "Activities 3 through 6" in the "Neural networks using Tensorflow" crash course (see the nn-tf tab). The main goal in this phase is to experiment and find what network size is needed to "overfit" the entire dataset at hand. In other words, we want to determine how big an architecture we need to overfit the data. The place to start is a "logistic regression" model, trained for as many epochs as needed to obtain as high an accuracy as possible. If, after training for hundreds of epochs, you observe that the accuracy is not increasing, it implies that the number of neurons in your model (only one) may not be sufficient for overfitting. The next step is to grow your model into a multi-layer model by adding a few neurons (say only 2) in a hidden layer. This way your model will have "2 + 1 = 3" neurons in total. If your accuracy still does not reach 100% or close to 100%, you can continue to increase the number of layers and the number of neurons. Once you have obtained 100% accuracy (or around 100%), your experiments for this phase are complete. The results of this experiment also inform us that our final model (in subsequent phases) should be smaller than this model, where "smaller" refers to the number of layers and the number of neurons in each layer.
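A sketch of one step of this experiment, meant to illustrate the "2 + 1 = 3" configuration rather than prescribe an architecture, assuming X and y hold your full normalized inputs and 0/1 labels:

```python
import tensorflow as tf

# X, y: full (normalized) inputs and 0/1 labels; no validation split here,
# because in this phase we deliberately want to overfit the whole dataset.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(2, activation="relu"),    # 2 hidden neurons ...
    tf.keras.layers.Dense(1, activation="sigmoid"), # ... + 1 output = 3 total
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X, y, epochs=500, verbose=0)
print("final training accuracy:", history.history["accuracy"][-1])
```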
Before working on this phase, please practice "Activities 7, 8 and 9" in the "Neural networks using Tensorflow" crash course (see the nn-tf tab). The main goal in this phase is to obtain the highest possible accuracy on the validation set after splitting your data into a training set and a validation set. Please make sure to shuffle your rows before splitting. As your baseline model, i.e., the model with minimum accuracy, you can test the accuracy on the validation set using a "logistic regression" model. Then you can gradually grow your model into a multi-layered model and investigate whether larger models deliver higher accuracy on the validation set. Please note that your model should be smaller than the model in the previous phase. As you explore various network architectures, please note the accuracies of these models to include in your report. You can summarize your findings in the form of a table; the table should contain the accuracy and loss on both the training set and the validation set. You can also include other parameters such as the number of epochs, the number of neurons, the total number of parameters, etc. Also remember to select one model as your best-performing model, i.e., the model that delivers the highest accuracy on the validation set. Your report should also include the learning curves of your experiments. Additionally, you should evaluate your models using other metrics besides accuracy; for example precision, recall, and F1 score. Here is an example report.
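A minimal sketch of the split-and-evaluate workflow, assuming scikit-learn is available and model is a compiled Keras classifier like the ones above:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# train_test_split shuffles the rows by default before splitting.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, verbose=0)

# Threshold the sigmoid outputs at 0.5 to get hard 0/1 predictions.
y_pred = (model.predict(X_val) > 0.5).astype(int)
print("precision:", precision_score(y_val, y_pred))
print("recall:   ", recall_score(y_val, y_pred))
print("F1 score: ", f1_score(y_val, y_pred))
```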
[FOR GRADUATE STUDENTS ONLY] In addition to the requirements above, graduate students are required to do the following two tasks to receive full points: 1) discuss what architecture (how big) you need to overfit when you include the output as an additional input feature, and 2) code a function that represents your model. Once you have finished coding your model, please build your own function/method that serves as a prediction model. Afterwards, please verify that the predictions you obtain are the same as the ones you obtained using your trained model. The lecture on linear regression with two input variables will be helpful for completing this task.
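One way to verify such a hand-coded prediction function (a sketch under the assumption that your model is a plain stack of Dense layers) is to re-implement the forward pass with NumPy and compare it against model.predict:

```python
import numpy as np

def manual_predict(model, X):
    """Re-implement the trained model's forward pass with plain NumPy."""
    a = np.asarray(X, dtype=np.float32)
    for layer in model.layers:  # assumes only Dense layers
        W, b = layer.get_weights()
        z = a @ W + b
        name = layer.activation.__name__
        if name == "relu":
            a = np.maximum(0.0, z)
        elif name == "sigmoid":
            a = 1.0 / (1.0 + np.exp(-z))
        else:  # "linear"
            a = z
    return a

# Both sets of predictions should match up to floating-point error.
assert np.allclose(manual_predict(model, X_val), model.predict(X_val), atol=1e-5)
```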
Before working on this phase, please practice "Activity 10" in the "Neural networks using Tensorflow" crash course (see the nn-tf tab). The key activity in this phase is to study the importance of the input features by iteratively removing them. In other words, you can first train various models with only one feature at a time to learn how predictive each feature is. Once the significance of each feature is known, you can remove the least important feature (i.e., remove the column), retrain the model, and observe the accuracy. You can repeat this process iteratively, removing more and more unimportant features. The overall objective is to identify non-informative input features and remove them from the dataset. Finally, you can compare your feature-reduced model with the original model with all input features and discuss the difference in accuracy. Here is an example report.
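A sketch of the one-feature-at-a-time step, assuming features_norm and y from earlier and a hypothetical helper build_model(num_features) that returns a compiled Keras classifier:

```python
from sklearn.model_selection import train_test_split

scores = {}
for col in features_norm.columns:
    # Train a model on this single feature to estimate how predictive it is.
    X_one = features_norm[[col]].values
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_one, y, test_size=0.2, random_state=42)
    m = build_model(num_features=1)  # hypothetical model-building helper
    m.fit(X_tr, y_tr, epochs=100, verbose=0)
    _, acc = m.evaluate(X_va, y_va, verbose=0)
    scores[col] = acc

# Features with the lowest validation accuracy are candidates for removal.
print(sorted(scores.items(), key=lambda kv: kv[1]))
```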
Please submit a PDF of your final report. It should contain the important findings from each phase of your project. Your report should not be very long; 10 to 12 pages at most. Your tables and figures should be numbered and captioned (labelled) appropriately. Please resize the figures appropriately and ensure that none of your figures flow outside of the page border. If you are copying images from a Notebook, please remember to turn off "dark mode" in the Notebook before you copy images/plots; in a dark-mode notebook, the labels and ticks in images are difficult to see. Your report should include an abstract and a conclusion (each 250 words minimum). Please also submit a link to your final Notebook. Optionally, you are welcome to host your project (and report) on GitHub (i.e., there are no extra points for hosting).
For your reference, I have listed below some final reports by students who took this course in earlier semesters. Please understand that these examples are only meant to be references, and your focus should be on meeting the requirements mentioned above instead of preparing a report similar to these example reports.