The course lectures and activities in the ‘Neural networks using Tensorflow’ tutorial module (see the nn-tf tab) will help you learn all the background needed to complete this course project. The overall goal of this course project is to learn the foundations of using TensorFlow/Keras to build, train, and evaluate feed-forward neural networks on a standard tabular dataset that can be framed as a classification or a regression problem. If you are learning machine learning for the first time, a binary classification problem will probably be easier than a regression problem. A problem is ‘binary classification’ if your output column is 0 or 1. If your output column has continuous values, even if they are between 0 and 1, it is a regression problem, not classification. While working on the project you will compare the performance (accuracy, speed, etc.) of various neural network models, including a basic single-layer model. For a regression problem, the single-layer model is a linear regression model, and for a binary classification problem, the single-layer model is a logistic regression model. Once again, please note that ‘logistic regression’ is a ‘classification’ technique. You will also learn to investigate “what” a model predicts, as well as “how” and “why” it makes its predictions.
You will work on your project individually (i.e., group submissions are not allowed). Also, your reports for all phases (including the final report) should be prepared using Overleaf. Non-Overleaf submissions will receive a 0 (zero). If you have accessibility needs, please email me and I will waive this Overleaf requirement. For the experiments, you are NOT allowed to use existing libraries that directly build models for you, including XGBoost(), RandomForestClassifier(), and LogisticRegression(). You are also discouraged from using external library methods such as “from sklearn.preprocessing import LabelEncoder”. You are welcome to use Google Docs and/or the free version of Grammarly to revise your writing.
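Since library helpers such as LabelEncoder are discouraged, a categorical column can be encoded by hand. The following is a minimal sketch, assuming a column of string categories; the function name `encode_labels` and the sample values are hypothetical and not part of the course material.

```python
def encode_labels(values):
    """Map each distinct category to an integer code (codes follow sorted order)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    encoded = [mapping[value] for value in values]
    return encoded, mapping

# Hypothetical column with two categories
column = ["yes", "no", "no", "yes", "no"]
encoded, mapping = encode_labels(column)
print(mapping)   # {'no': 0, 'yes': 1}
print(encoded)   # [1, 0, 0, 1, 0]
```

Returning the mapping alongside the codes makes it possible to report in your notebook which integer stands for which category.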
Those who help others on the discussion board of each project phase may receive extra points.
The project is divided into various phases (please see below). For each phase you are expected to submit the following four items:
All of these four items will be visible to the rest of the class.
If you are using Google Colab, please convert the notebook to an .html file and submit the .html file, for example using htmltopdf or using the command below within a cell of your Google Colab notebook. The PDF report should describe your findings, addressing all the requirements of the project phase. The final report has a page limit, but all other reports can be as long as you want.
%%shell
jupyter nbconvert --to html /content/Phase2.ipynb
Below is the list of the phases along with the expectations in each phase.
Before working on this phase, please practice “Activities 1 and 2” in the ‘Neural networks using Tensorflow’ crash course (see nn-tf link). You may also find this short lecture on ‘how to clean a tabular dataset for machine learning’ useful for learning how to clean your data. The most important task in this phase is to find a dataset for your project. You can pick a dataset from the UCI ML database or Kaggle. Once you have found a dataset you like, the next step is to load the data into a Python notebook and normalize it. In your report’s introduction section, please discuss why you chose to work on this project and explain the problem you plan to solve. You should also mention the source of your dataset. In your notebook, please visualize/plot the distribution of each input feature and discuss the range of the values (min, max, mean, median, etc.). For example, plot histograms showing the distribution of each input feature. Selected visualizations should be included in the report.
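The summary statistics and normalization described above can be sketched with NumPy as follows. This is only an illustration: the text does not prescribe a normalization method, so min-max scaling is an assumption here, and the function names and sample array are hypothetical.

```python
import numpy as np

def summarize(X):
    """Return per-column min, max, mean, and median of a 2-D array."""
    return {
        "min": X.min(axis=0),
        "max": X.max(axis=0),
        "mean": X.mean(axis=0),
        "median": np.median(X, axis=0),
    }

def min_max_normalize(X):
    """Rescale each column to the range [0, 1] (assumed normalization method)."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return (X - col_min) / col_range

# Hypothetical data: 4 rows, 2 input features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [5.0, 1000.0]])
stats = summarize(X)
X_norm = min_max_normalize(X)
print(stats["median"])  # per-feature medians
print(X_norm)           # every column now spans 0 to 1
```

Histograms of each column can then be drawn with a plotting library such as Matplotlib, before and after normalization.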
You should also discuss the distribution of the output labels. Please check if the data is imbalanced by calculating what percentage of the output labels are 0 and what percentage are 1. If your dataset is heavily imbalanced (for example, 1% vs 99%) it may be a better idea to choose a different dataset. In the case of regression, check if the values are uniformly distributed or not by plotting the distribution of the output variable. Also, in your notebook, you should normalize your data and discuss this in your report.
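The imbalance check described above amounts to counting labels. A minimal sketch, assuming the output column is available as a NumPy array of 0/1 labels (the function name and sample labels are hypothetical):

```python
import numpy as np

def class_percentages(y):
    """Return {label: percentage of rows} for an array of labels."""
    labels, counts = np.unique(y, return_counts=True)
    return {int(label): float(100.0 * count / len(y))
            for label, count in zip(labels, counts)}

# Hypothetical output column: six rows of class 0, two rows of class 1
y = np.array([0, 0, 0, 1, 0, 1, 0, 0])
percentages = class_percentages(y)
print(percentages)  # {0: 75.0, 1: 25.0}
```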
There are three restrictions when choosing a dataset. First, you are not allowed to pick a sequence/time-series dataset. Examples of sequence data include stock prices and weather data. You are also not allowed to select a dataset consisting of image inputs or text inputs (natural language processing datasets). Second, please do not choose the “Iris flower dataset”, “Pima diabetes dataset”, or the “Wine quality dataset”, because I already discuss these in my lectures. The last restriction is that your tabular data should have at least around 1,000 rows and at least 3 features (columns) in addition to the output column. If you use a dataset with more than 100,000 rows, please email me right before the last week of the semester to claim an extra bonus point.
Here is an example report.
Before working on this phase, please practice “Activities 3 and 4” in the ‘Neural networks using Tensorflow’ crash course (see nn-tf link). The main goal in this phase is to experiment and find what network size is needed to ‘overfit’ the entire dataset at hand. For this phase, please do not split your data into training and validation sets. In other words, we want to determine how big an architecture we need to overfit the data. The place to start is to use a ‘logistic regression’ model and train for as many epochs as needed to obtain as high an accuracy as possible. If, after training for hundreds of epochs, you observe that the accuracy is no longer increasing, it implies that the number of neurons in your model (only one) may not be sufficient for overfitting. The next step is to grow your model into a multi-layer model by adding a few neurons (say only 2) in a new first layer. This way your model will have ‘2 + 1 = 3’ neurons in total. If your accuracy still does not reach 100% or close to 100%, you can continue to increase the number of layers and the number of neurons. Once you have obtained 100% accuracy (or around 100%), your experiments for this phase are complete. The results of this experiment also tell us that our final model (in subsequent phases) should be smaller than this model. ‘Smaller’ here refers to the number of layers and the number of neurons in each layer.
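To see why a single neuron can plateau below 100% training accuracy, consider the small sketch below. It is written in plain NumPy rather than Keras, purely as an illustration, and the XOR-style data is a hypothetical example: no single linear boundary can classify all four rows, so a logistic regression model cannot overfit it no matter how many epochs it trains.

```python
import numpy as np

# XOR-style data: no single neuron (linear boundary) can classify all four rows
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=2)  # one weight per input feature
b = 0.0
learning_rate = 0.5

for epoch in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid output of the single neuron
    grad_w = X.T @ (p - y) / len(y)         # gradient of binary cross-entropy
    grad_b = np.mean(p - y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

predictions = (1.0 / (1.0 + np.exp(-(X @ w + b))) >= 0.5).astype(float)
accuracy = float(np.mean(predictions == y))
print(f"training accuracy after 2000 epochs: {accuracy:.2f}")
```

On data like this, the accuracy stays at 75% or below, which is the signal to add a layer with a few neurons and try again.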
Before working on this phase, please practice “Activities 5, 6, 7, and 8” in the ‘Neural networks using Tensorflow’ crash course (see nn-tf link). The main goal in this phase is to obtain the highest possible accuracy on the validation set after splitting your data into a training set and a validation set. Please shuffle your rows before splitting. As your baseline model, i.e., the model with minimum accuracy, you can test the accuracy on the validation set using a ‘logistic regression’ model. Then you can gradually grow your model into a multi-layered model and investigate whether larger models deliver higher accuracy on the validation set. Please note that your model should be smaller than the model in the previous phase. As you explore various network architectures, please note the accuracies of these models to include in your report. You can summarize your findings in the form of a table, and the table should contain the accuracy and loss on the training set and the validation set (see below). You can also include other parameters such as the number of epochs, number of neurons, total number of parameters, etc. Also remember to select one model as your best performing model, i.e., the model that delivers the highest accuracy on the validation set. Your report should also include learning curves of your experiments. Additionally, you should evaluate your models using other metrics besides accuracy, for example precision, recall, and F1 score. Please note that your submission for this phase is ineligible for points if you do not use “model checkpointing” in your code. You are discouraged from using external library methods such as “from sklearn.model_selection import train_test_split”.
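Shuffling, splitting, and the extra metrics mentioned above can all be done without scikit-learn. The following is a minimal sketch using NumPy; the 20% validation fraction, the function names, and the sample arrays are assumptions made for illustration only.

```python
import numpy as np

def shuffle_and_split(X, y, validation_fraction=0.2, seed=0):
    """Shuffle rows (keeping X and y aligned), then split off a validation set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    X, y = X[order], y[order]
    n_val = int(len(X) * validation_fraction)
    return X[n_val:], y[n_val:], X[:n_val], y[:n_val]

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return float(precision), float(recall), float(f1)

# Hypothetical dataset: 10 rows, 2 features
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
X_train, y_train, X_val, y_val = shuffle_and_split(X, y)

# Hypothetical validation labels and thresholded model predictions
y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Using the same permutation for `X` and `y` is what keeps each row paired with its label after shuffling.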
Here is an example table that your report should include. The accuracy of the random baseline classifier is the percentage of the largest class. A neural network model with 16 neurons in the first layer, 8 neurons in the second layer, and 1 neuron in the last layer can be written as 16-8-1.
Model | Acc. on Training Set | Acc. on Validation Set |
---|---|---|
Random baseline classifier | 0% | 0% |
Logistic regression model | 0% | 0% |
Neural network model (64-32-16-8-1) | 0% | 0% |
Neural network model (32-16-8-1) | 0% | 0% |
Neural network model (16-8-1) | 0% | 0% |
Neural network model (8-1) | 0% | 0% |
Neural network model (4-1) | 0% | 0% |
Neural network model (2-1) | 0% | 0% |
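Two of the quantities used with this table can be computed directly: the baseline accuracy (the percentage of the largest class) and the total number of parameters of an architecture written in the 16-8-1 notation. The sketch below assumes fully connected layers with one bias per neuron; the function names, the sample labels, and the choice of 5 input features are hypothetical.

```python
def majority_baseline_accuracy(labels):
    """Accuracy (in %) of always predicting the most frequent class."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 100.0 * max(counts.values()) / len(labels)

def count_dense_parameters(num_inputs, architecture):
    """Count weights and biases of a fully connected network, e.g. '16-8-1'."""
    layer_sizes = [int(size) for size in architecture.split("-")]
    total = 0
    previous = num_inputs
    for size in layer_sizes:
        total += previous * size + size  # weights plus one bias per neuron
        previous = size
    return total

labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
print(majority_baseline_accuracy(labels))   # 70.0
print(count_dense_parameters(5, "16-8-1"))  # 241
```

For the 16-8-1 model with 5 inputs, the count is (5×16 + 16) + (16×8 + 8) + (8×1 + 1) = 96 + 136 + 9 = 241.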
[FOR GRADUATE STUDENTS ONLY] In addition to the requirements above, graduate students are required to do the following two tasks to receive full points: 1) discuss how big an architecture you need to overfit when the output is included as an additional input feature, 2) code a function that represents your model. Once you have finished coding your model, please build your own function/method that serves as a prediction model. Afterwards, please verify that the predictions you obtain are the same as the ones you obtained using your trained model. The lecture on Linear regression with two input variables will be helpful to complete this task. Here is a draft:
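# Draft only: reproduces the final (sigmoid) layer of a trained Keras model by hand.
# `features` is the input to the last layer: the raw data for a single-layer
# model, or the previous layer's activations for a multi-layer model.
import numpy as np

def my_prediction_function(model, features):
    # Weights and bias of the last layer
    weights, bias = model.layers[-1].get_weights()
    num_features = weights.shape[0]
    # Weighted sum of the inputs, one feature at a time
    z = 0
    for i in range(num_features):
        z = z + features[:, i] * weights[i]
    z = z + bias
    # Sigmoid activation
    result = 1 / (1 + np.exp(-z))
    return result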
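The draft above covers only the final layer. One way to extend the idea to a multi-layer model is sketched below, under loudly stated assumptions: every layer is fully connected, hidden layers use ReLU, and the output layer uses a sigmoid. Your own model may use different activations. The names `forward_pass` and `layer_weights`, and the weight values, are hypothetical.

```python
import numpy as np

def forward_pass(layer_weights, data):
    """Manually apply dense layers: ReLU on hidden layers, sigmoid on the last."""
    activations = data
    for index, (weights, bias) in enumerate(layer_weights):
        z = activations @ weights + bias
        is_last = index == len(layer_weights) - 1
        if is_last:
            activations = 1.0 / (1.0 + np.exp(-z))  # sigmoid output
        else:
            activations = np.maximum(z, 0.0)        # ReLU (assumed activation)
    return activations

# Hypothetical weights for a 2-1 network with two input features
layer_weights = [
    (np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, 0.0])),  # first layer
    (np.array([[1.0], [-1.0]]), np.array([0.5])),                 # output layer
]
data = np.array([[1.0, 2.0]])
manual_predictions = forward_pass(layer_weights, data)
print(manual_predictions)
```

With a trained Keras model made only of Dense layers, the list could be built as `[layer.get_weights() for layer in model.layers]`, and the result compared against `model.predict(data)` using `np.allclose`.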
Here is an example report.
Before working on this phase, please practice “Activity 9” in the ‘Neural networks using Tensorflow’ crash course (see nn-tf link). The key activity in this phase is to study the importance of the input features by iteratively removing them. You must continue to use model checkpointing in this phase. Here are the steps involved:
Here is an example report.
For bonus points: Use model-agnostic methods such as LIME or Shapley values to derive feature importance.
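The feature-removal study in this phase can be sketched as a loop that drops one feature at a time and records how much the validation accuracy falls. This is one plausible reading of "iteratively removing" features, not the official procedure. To keep the example self-contained, a simple nearest-centroid classifier stands in for "train a neural network with checkpointing and return its validation accuracy"; all names and the synthetic data are hypothetical.

```python
import numpy as np

def centroid_accuracy(X_train, y_train, X_val, y_val):
    """Stand-in for 'train a model and return its validation accuracy'."""
    centroid_0 = X_train[y_train == 0].mean(axis=0)
    centroid_1 = X_train[y_train == 1].mean(axis=0)
    dist_0 = np.linalg.norm(X_val - centroid_0, axis=1)
    dist_1 = np.linalg.norm(X_val - centroid_1, axis=1)
    predictions = (dist_1 < dist_0).astype(int)
    return float(np.mean(predictions == y_val))

def feature_removal_study(X_train, y_train, X_val, y_val, feature_names, score_fn):
    """Drop one feature at a time and record the change in validation accuracy."""
    baseline = score_fn(X_train, y_train, X_val, y_val)
    drops = {}
    for index, name in enumerate(feature_names):
        keep = [i for i in range(X_train.shape[1]) if i != index]
        score = score_fn(X_train[:, keep], y_train, X_val[:, keep], y_val)
        drops[name] = baseline - score  # larger drop suggests a more important feature
    return baseline, drops

# Synthetic data: only the first feature carries information about the label
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
informative = y + rng.normal(scale=0.3, size=400)
noise = rng.normal(size=400)
X = np.column_stack([informative, noise])

baseline, drops = feature_removal_study(
    X[:300], y[:300], X[300:], y[300:], ["informative", "noise"], centroid_accuracy)
print(baseline, drops)
```

In the actual project, `score_fn` would retrain your best architecture on the reduced feature set and return the checkpointed validation accuracy.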
Please submit a PDF of your final report. It should contain the important findings in each phase of your project, except for Phase II. Once again, your report should NOT include the results of your Phase II, because these can be confusing to the readers of your report. If you do include them, please clearly mention that these results are ‘for the training set when trained using the same training set’. Your report should not be very long; 10 to 12 pages at most. Your tables and figures should be numbered and captioned (labelled) appropriately. Please resize the figures appropriately and ensure that none of your figures flow outside of the border. If you are copying images from a notebook, please remember to turn off ‘dark mode’ in the notebook before you copy images/plots. In a notebook in dark mode, the labels and ticks in images are difficult to notice. Your report should include an abstract and a conclusion (each 250 words minimum). Please also submit a link to your final notebook. Optionally, you are welcome to host your project (and report) on GitHub (i.e., no extra points for hosting). Your final/best model should also be evaluated using ROC and AUC.
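For the ROC and AUC evaluation, the curve can be computed by hand from the model's predicted probabilities. The sketch below is a simplified illustration that assumes all scores are distinct (ties are not handled) and that both classes are present; the function names and sample values are hypothetical.

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """Return false-positive and true-positive rates over all thresholds."""
    order = np.argsort(-scores)          # sort rows by descending score
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted == 1)       # true positives as the threshold lowers
    fps = np.cumsum(y_sorted == 0)       # false positives as the threshold lowers
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr = roc_curve_points(y_true, scores)
print(auc(fpr, tpr))  # 0.75
```

Plotting `tpr` against `fpr` gives the ROC curve; an AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.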
For your reference, I have listed below some final reports by students who took this course in earlier semesters. Please understand that these examples are only meant to be references and your focus should be on meeting the requirements mentioned above instead of preparing a report similar to these example reports.