AI

Final Course Project

Resources

Hands-on Tutorial on Neural Networks
UCI ML database
An Overleaf Report Template
NN-SVG (for creating NN architecture drawings)
A template for writing your peer-review (and some tips)
UMSL Writing Center
Google Doc and Use of process Feedback

Overview and requirements

The course lectures and activities provided in the ‘Neural networks using Tensorflow’ tutorial module (see the nn-tf tab) will help you learn all the background needed to complete this course project. The overall goal of working on this course project is to learn the foundations of using Tensorflow/Keras to build, train, and evaluate feed-forward neural networks on a standard tabular dataset that can be framed as a classification or a regression problem. If you are learning machine learning for the first time, a binary classification problem will probably be easier than a regression problem. A problem is ‘binary classification’ if your output column is 0 or 1. If your output column has continuous values, even if they are between 0 and 1, it is a regression problem, not classification. While working on the project you will compare the performance (accuracy, speed, etc.) of various neural network models including a basic single layer model. For a regression problem, the single layer model is a linear regression model, and for a binary classification problem, the single layer model is a logistic regression model. Once again, please note that ‘logistic regression’ is a ‘classification’ technique. You will also learn to investigate “what” a model predicts, and “how” and “why” it makes those predictions.
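The classification-versus-regression rule above can be checked directly on the output column. The following is a minimal sketch, assuming the column is already numeric; `problem_type` is a hypothetical helper name, not part of the course materials. It only covers the two cases described here, so a column with more than two discrete classes is not handled.

```python
import numpy as np

def problem_type(y):
    """Apply the rule above: an output column of only 0s and 1s is binary classification."""
    values = np.unique(np.asarray(y, dtype=float))
    if values.size == 2 and set(values.tolist()) == {0.0, 1.0}:
        return "binary classification"
    # Continuous values, even if they lie between 0 and 1, mean regression
    return "regression"

print(problem_type([0, 1, 1, 0]))      # binary classification
print(problem_type([0.1, 0.5, 0.93]))  # regression
```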

You will work on your project individually (i.e., group submissions are not allowed). Also, your reports for all phases (including the final report) should be prepared using Overleaf. Non-Overleaf submissions will receive a 0 (zero). If you have accessibility needs please email me and I will waive this Overleaf requirement. For doing the experiments, you are NOT allowed to use existing libraries that directly build models for you including: XGBoost(), RandomForestClassifier(), LogisticRegression(). You are also discouraged from using external library methods such as “from sklearn.preprocessing import LabelEncoder”. You are welcome to use Google Docs and/or the free version of Grammarly to revise your writing.

Those who help others on the discussion board of each project phase may receive extra points.

The project is divided into various phases (please see below). For each phase you are expected to submit the following four items:

an HTML version of your ‘annotated’ Python Notebook (along with the outputs of the cells),
a PDF report generated from your Overleaf project,
a link to your Overleaf project, and
a short video recording, of around one minute, summarizing what you accomplished in the particular phase.

All four of these items will be visible to the rest of the class.

If you are using Google Colab, please convert the notebook to an .html file and submit that file, for example using htmltopdf or using the command below within a cell of your Google Colab notebook. The PDF report should describe your findings, addressing all the requirements of the project phase. The final report has a page limit but all other reports can be as long as you want.

%%shell
jupyter nbconvert --to html /content/Phase2.ipynb

Below is the list of the phases along with the expectations in each phase.

Phase 1: Data analysis & preparation

Before working on this phase, please practice “Activities 1 and 2” in the ‘Neural networks using Tensorflow’ crash course (see nn-tf link). You may also find this short lecture on ‘how to clean a tabular dataset for machine learning’ useful to learn how to clean your data. The most important task in this phase is to find a dataset for your project. You can pick a dataset from the UCI ML database or Kaggle. Once you have found a dataset you like, the next step is to load the data in a Python Notebook and normalize your data. In your report’s introduction section please discuss why you chose to work on this project and also explain the problem you plan to solve. You should also mention the source of your dataset. In your notebook please visualize/plot the distribution of each input feature and discuss the range of the values (min, max, mean, median, etc.). For example, plot histograms showing the distribution of each input feature. Selected visualizations should be included in the report.

You should also discuss the distribution of the output labels. Please check if the data is imbalanced by calculating what percentage of the output labels are 0 and what percentage are 1. If your dataset is heavily imbalanced (for example, 1% vs 99%) it may be a better idea to choose a different dataset. In the case of regression, check if the values are uniformly distributed or not by plotting the distribution of the output variable. Also, in your notebook, you should normalize your data and discuss this in your report.
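The checks described above (value ranges, label balance, and normalization) can be sketched with NumPy alone. This is an illustrative sketch, not the required method: the helper names are hypothetical, and z-score standardization is an assumption, since the text does not prescribe a particular normalization scheme.

```python
import numpy as np

def describe_features(X):
    """Return per-column min, max, mean and median as a dict of arrays."""
    X = np.asarray(X, dtype=float)
    return {
        "min": X.min(axis=0),
        "max": X.max(axis=0),
        "mean": X.mean(axis=0),
        "median": np.median(X, axis=0),
    }

def label_balance(y):
    """Return the percentage of rows labelled 0 and labelled 1 (binary labels only)."""
    y = np.asarray(y)
    pct_ones = float(100.0 * np.mean(y == 1))
    return {0: 100.0 - pct_ones, 1: pct_ones}

def standardize(X):
    """Z-score each column (an assumed choice); constant columns become all zeros."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - mean) / std

# Tiny made-up example: 4 rows, 2 features
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 0, 1])
print(describe_features(X))
print(label_balance(y))  # {0: 75.0, 1: 25.0}
print(standardize(X))
```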

There are three restrictions when choosing a dataset. First, you are not allowed to pick a sequence/time-series dataset. Examples of sequence data include stock prices and weather data. You are also not allowed to select a dataset consisting of image inputs or text inputs (natural language processing datasets). Second, please do not choose the “Iris flower dataset”, “Pima diabetes dataset”, or the “Wine quality dataset”. I already discuss these in my lectures. The last restriction is that your tabular data should have at least around 1,000 rows and at least 3 features (columns) in addition to the output column. If you use a dataset with more than 100,000 rows, please email me right before the last week of the semester to claim an extra bonus point.

Here is an example report.

Phase 2: Build a model to overfit the entire dataset

Before working on this phase, please practice “Activities 3 and 4” in the ‘Neural networks using Tensorflow’ crash course (see nn-tf link). The main goal in this phase is to experiment and find what network size is needed to ‘overfit’ the entire dataset at hand. In other words, we want to determine how big an architecture we need to overfit the data. For this phase, please do not split your data into training and validation sets. The place to start is to use a ‘logistic regression’ model and train for as many epochs as needed to obtain as high an accuracy as possible. If, after training for hundreds of epochs, you observe that the accuracy is not increasing, it implies that the number of neurons in your model (only one) may not be sufficient for overfitting. The next step is to grow your model into a multi-layer model and add a few neurons (say only 2) in the first layer. This way your model will have ‘2 + 1 = 3’ neurons in total. If your accuracy still does not reach 100% or close to 100%, you can continue to increase the number of layers and number of neurons. Once you have obtained 100% accuracy (or around 100%) your experiments for this phase are complete. The results of this experiment also inform us that our final model (in subsequent phases) should be smaller than this model. Smaller here refers to the number of layers and the number of neurons in each layer.
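To illustrate the starting point of this phase, here is a minimal sketch of a single-neuron (logistic regression) model trained on an entire dataset with no validation split. It is written in plain NumPy as a stand-in for the Keras model used in the course, so that the arithmetic is visible; the toy data, learning rate, and epoch count are made-up assumptions. If the training accuracy plateaus well below 100%, that is the signal to grow the model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, epochs=2000, lr=0.5):
    """Full-batch gradient descent on binary cross-entropy, using every row."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        error = p - y                     # gradient of the loss w.r.t. z
        w -= lr * (X.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b

def accuracy(X, y, w, b):
    """Fraction of rows whose thresholded prediction matches the label."""
    predictions = (sigmoid(np.asarray(X) @ w + b) >= 0.5).astype(int)
    return float(np.mean(predictions == np.asarray(y)))

# Made-up, linearly separable toy data: the label is 1 when x0 + x1 > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

w, b = train_logistic_regression(X, y)
print(f"accuracy on the entire dataset: {accuracy(X, y, w, b):.3f}")
```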

Phase 3: Model selection & evaluation

Before working on this phase, please practice “Activities 5, 6, 7, and 8” in the ‘Neural networks using Tensorflow’ crash course (see nn-tf link). The main goal in this phase is to obtain the highest possible accuracy on the validation set after splitting your data into a training set and a validation set. Please shuffle your rows before splitting. As your baseline model, i.e., the model with minimum accuracy, you can test the accuracy on the validation set using a ‘logistic regression’ model. Then you can gradually grow your model into a multi-layered model and investigate if larger models deliver higher accuracy on the validation set. Please note that your model should be smaller than the model in the previous phase. As you explore various network architectures, please note the accuracies of these models to include in your report. You can summarize your findings in the form of a table and the table should contain the accuracy and loss on the training set and the validation set (see below). You can also include other parameters such as number of epochs, number of neurons, total number of parameters, etc. Also remember to select one model as your best performing model, i.e., the model that delivers the highest accuracy on the validation set. Your report should also include learning curves of your experiments. Additionally, you should also evaluate your models using other metrics besides accuracy; for example precision, recall, and F1 score. Please note that your submission for this phase is ineligible for points if you do not use “model checkpointing” in your code. You are discouraged from using external library methods such as “from sklearn.model_selection import train_test_split”.
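Shuffling, splitting, and computing precision, recall, and F1 can all be done without scikit-learn. The sketch below is one possible way to do it; the function names and the 20% validation fraction are assumptions for illustration, and the metrics expect predictions that have already been converted to 0/1 labels.

```python
import numpy as np

def shuffle_split(X, y, validation_fraction=0.2, seed=0):
    """Shuffle rows, then split into training and validation sets."""
    X, y = np.asarray(X), np.asarray(y)
    order = np.random.default_rng(seed).permutation(len(y))
    n_val = int(len(y) * validation_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels and 0/1 predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = float(np.mean(y_true == y_pred))
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up data: 10 rows, 2 features, alternating labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
X_train, y_train, X_val, y_val = shuffle_split(X, y)
print(X_train.shape, X_val.shape)  # (8, 2) (2, 2)

metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(metrics)
```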

Here is an example table that your report should include. The accuracy of the random baseline classifier is the percentage of the largest class. A neural network model with 16 neurons in the first layer, 8 neurons in the second layer and 1 neuron in the last layer can be written as 16-8-1.

Model	Acc. on Training Set	Acc. on Validation Set
Random baseline classifier	0%	0%
Logistic regression model	0%	0%
Neural network model (64-32-16-8-1)	0%	0%
Neural network model (32-16-8-1)	0%	0%
Neural network model (16-8-1)	0%	0%
Neural network model (8-1)	0%	0%
Neural network model (4-1)	0%	0%
Neural network model (2-1)	0%	0%
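The two conventions used in the table, the random baseline accuracy and the 16-8-1 style notation, can be sketched as follows. The helper names are hypothetical, and the parameter count assumes fully connected layers with one bias per neuron.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy (in %) of always predicting the most frequent class."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return float(100.0 * counts.max() / counts.sum())

def describe_architecture(spec, num_inputs):
    """Count layers, neurons and trainable parameters for a spec such as '16-8-1'.

    Assumes fully connected (dense) layers with one bias per neuron.
    """
    layer_sizes = [int(n) for n in spec.split("-")]
    total_params = 0
    fan_in = num_inputs
    for size in layer_sizes:
        total_params += fan_in * size + size  # weights + biases
        fan_in = size
    return {"layers": len(layer_sizes), "neurons": sum(layer_sizes), "parameters": total_params}

print(majority_baseline_accuracy([0, 0, 0, 1]))  # 75.0
print(describe_architecture("16-8-1", num_inputs=10))
# {'layers': 3, 'neurons': 25, 'parameters': 321}
```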

[FOR GRADUATE STUDENTS ONLY] In addition to the requirements above, graduate students are required to do the following two tasks to receive full points: 1) discuss what architecture (how big) you need to overfit when you have the output as an additional input feature, 2) code a function that represents your model. Once you have finished training your model, please build your own function/method that serves as a prediction model. Afterwards, please verify that the predictions you obtain are the same as the ones you obtained using your trained model. The lecture on Linear regression with two input variables will be helpful to complete this task. Here is a draft:

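# Draft for the output layer only; it is meant to serve as an example.
# `data` must hold the inputs to the last layer: the raw features for a
# single-layer model, or the previous layer's outputs for a deeper model.
import numpy as np

def my_prediction_function(model, data):
  # For a Dense layer, get_weights() returns [kernel, bias]
  kernel, bias = model.layers[-1].get_weights()
  num_features = kernel.shape[0]
  w = [None]*num_features
  for i in range(num_features):
    w[i] = kernel[i][0]  # weight from input i to the single output neuron
  z = 0
  for i in range(num_features):
    z = z + data[:,i]*w[i]
  z = z + bias[0]
  result = 1/(1+np.exp(-z))  # sigmoid activation of the output neuron
  return result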
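For a multi-layer model, the same idea is applied layer by layer. The sketch below is one possible shape for such a function, assuming ReLU activations in every hidden layer and a sigmoid in the output layer; your own model may use different activations. The name `manual_forward_pass` and the hand-picked weights are illustrative only.

```python
import numpy as np

def manual_forward_pass(layer_weights, data):
    """Forward pass through dense layers.

    layer_weights: list of (kernel, bias) pairs, one per layer, with kernel
    shape (inputs, units) and bias shape (units,). Assumes ReLU in every
    hidden layer and a sigmoid in the output layer.
    """
    activations = np.asarray(data, dtype=float)
    last = len(layer_weights) - 1
    for index, (kernel, bias) in enumerate(layer_weights):
        z = activations @ kernel + bias
        if index == last:
            activations = 1.0 / (1.0 + np.exp(-z))  # sigmoid output
        else:
            activations = np.maximum(z, 0.0)         # ReLU hidden layer
    return activations

# Hand-picked weights for a hypothetical 2-1 network with two input features
layer_weights = [
    (np.array([[1.0, -1.0], [1.0, 1.0]]), np.array([0.0, 0.0])),  # hidden layer, 2 units
    (np.array([[1.0], [1.0]]), np.array([-2.0])),                 # output layer, 1 unit
]
data = np.array([[1.0, 1.0], [0.0, 0.0]])
predictions = manual_forward_pass(layer_weights, data)
print(predictions)
```

With a trained Keras Sequential model made only of Dense layers, the list could be built as `[layer.get_weights() for layer in model.layers]`, and the result compared against `model.predict(data)` with `np.allclose`.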

Here is an example report.

Phase 4: Feature importance and reduction

Before working on this phase, please practice “Activity 9” in the ‘Neural networks using Tensorflow’ crash course (see nn-tf link). The key activity in this phase is to study the importance of the input features by iteratively removing them. You must continue to use model checkpointing in this phase. Here are the steps involved:

If you have 10 input features/columns, train 10 models where each model only receives one feature at a time. For example, if age, BMI, and blood pressure are your only three input features, you train three models: one that only takes age as input, another that only takes BMI as the input, and the last one that takes only blood pressure as the input. The validation accuracy of these three models will indicate the relative importance of the three features. You should plot these validation accuracies in the form of a bar diagram. If all your accuracies are more than 80%, your plot’s y-axis should be limited to 80-100.
From the previous step you have the significance/importance of each feature. The feature that yields the highest accuracy is the most important feature.
Starting with the least important feature, remove one feature at a time (without replacement) and train various models. You can iteratively repeat the process, removing more and more unimportant features. For example, if BMI is the most important feature and blood pressure is the least important one, you would train two models: one without blood pressure, and one without blood pressure and age. Plot the validation accuracy of all the models that you tested. The overall objective is to identify non-informative input features and remove them from the dataset. Finally, you can compare your feature-reduced model with the original model with all input features and discuss the difference in accuracy.
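The steps above can be sketched as two loops. In this sketch, `train_and_score` is a hypothetical callable that you would write yourself: it should train a model on the given columns (with checkpointing) and return the validation accuracy. The `toy_score` function below is only a stand-in so the example runs; it is not a neural network, and the data is made up.

```python
import numpy as np

def rank_single_features(X, y, feature_names, train_and_score):
    """Train one model per feature; return (name, score) pairs, best first."""
    scores = [(name, train_and_score(X[:, [i]], y)) for i, name in enumerate(feature_names)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

def backward_elimination(X, y, feature_names, ranking, train_and_score):
    """Drop features from least to most important, scoring each reduced set."""
    kept = [name for name, _ in ranking]  # most important first
    results = []
    while len(kept) > 1:
        kept = kept[:-1]                  # remove the least important remaining feature
        columns = [feature_names.index(name) for name in kept]
        results.append((list(kept), train_and_score(X[:, columns], y)))
    return results

def toy_score(X_subset, y):
    """Stand-in scorer, NOT a neural network: best accuracy obtained by
    thresholding any single column at its mean. Swap in real training."""
    best = 0.0
    for column in X_subset.T:
        acc = float(np.mean((column > column.mean()).astype(int) == y))
        best = max(best, acc, 1.0 - acc)
    return best

# Made-up data in which only the second column ("bmi") is informative
rng = np.random.default_rng(0)
feature_names = ["age", "bmi", "blood_pressure"]
X = rng.normal(size=(300, 3))
y = (X[:, 1] > 0).astype(int)

ranking = rank_single_features(X, y, feature_names, toy_score)
print(ranking[0][0])  # bmi
results = backward_elimination(X, y, feature_names, ranking, toy_score)
for kept, score in results:
    print(kept, round(score, 3))
```

With three features, the second loop trains two reduced models, matching the example in the text.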

Here is an example report.

For bonus points: Use model-agnostic methods such as LIME or Shapley values to derive feature importance.

Phase 5: Final report

Please submit a PDF of your final report. It should contain the important findings in each phase of your project, except for Phase 2. Once again, your report should NOT include the results of your Phase 2. This can be confusing to the readers of your report. If you do include them, please clearly mention that these results are ‘for the training set when trained using the same training set’. Your report should not be very long; 10 to 12 pages at most. Your tables and figures should be numbered and captioned (labelled) appropriately. Please resize the figures appropriately and ensure that none of your figures extend beyond the page margins. If you are copying images from a Notebook, please remember to turn off the ‘dark mode’ in Notebook before you copy images/plots. In a notebook in dark mode, the labels and ticks in images are difficult to notice. Your report should include an abstract and a conclusion (each 250 words minimum). Please also submit a link to your final Notebook. Optionally, you are welcome to host your project (and report) on GitHub (i.e., no extra points for hosting). Your final/best model should also be evaluated using ROC and AUC.
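For the ROC and AUC evaluation, the curve can be computed from the model's predicted probabilities without extra libraries. The following is a minimal NumPy sketch under the assumption that both classes are present in the labels; the function names are hypothetical and the scores are made up.

```python
import numpy as np

def roc_curve_points(y_true, y_score):
    """Return false-positive and true-positive rates over all distinct thresholds."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="stable")  # highest score first
    y_sorted = y_true[order]
    scores_sorted = y_score[order]
    # Keep one point per distinct score so tied scores share a threshold
    last_of_group = np.r_[np.nonzero(np.diff(scores_sorted))[0], len(scores_sorted) - 1]
    tps = np.cumsum(y_sorted == 1)[last_of_group]
    fps = np.cumsum(y_sorted == 0)[last_of_group]
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve_points(y_true, y_score)
print(f"AUC = {auc(fpr, tpr):.2f}")  # AUC = 0.75
```

Plotting `tpr` against `fpr` gives the ROC curve itself.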

Sample final reports

For your reference, I have listed below some final reports by students who took this course in earlier semesters. Please understand that these examples are only meant to be references and your focus should be on meeting the requirements mentioned above instead of preparing a report similar to these example reports.

EPL game result prediction by Bikash Shrestha - report
Prediction of housing prices by Syeda Afrah Shamail - report
Factors in tennis by John Soderstrom - report
Predicting pulsar stars by Duc Ngo - report
Predicting MLB runs scored by Miguel Corona - report