AI

AI Course Project

Resources

Overview and requirements

The lectures and activities in the ‘Hands-on Tutorial on Neural Networks’ will give you all the background you need to complete this course project. The main goal of the project is to help you learn the basics of using TensorFlow/Keras to build, train, and evaluate feed-forward neural networks on standard tabular datasets—either for classification or regression tasks.

If you’re new to machine learning, starting with a binary classification problem might be easier than tackling a regression problem. A task is considered binary classification when the output column contains just two values, 0 and 1. If your output column has continuous values—even if they only fall between 0 and 1—it’s a regression problem, not classification.

As you work on the project, you’ll compare the performance of different neural network models, including a simple single-layer model. For regression tasks, this model functions as linear regression. For binary classification, it’s equivalent to logistic regression. Just to clarify—logistic regression is a classification method, not a regression method, despite the name!
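For illustration, here is a minimal Keras sketch of such a single-layer model (n_features is a placeholder for the number of input columns in your dataset); the only difference between the classification and regression variants is the output activation and the loss:

import tensorflow as tf

n_features = 8  # placeholder: replace with the number of input features in your data

# Binary classification: one neuron with a sigmoid activation (logistic regression)
clf = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Regression: one neuron with a linear activation (linear regression)
reg = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(1, activation="linear"),
])
reg.compile(optimizer="adam", loss="mse")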

You’ll also explore the “what,” “how,” and “why” behind your model’s predictions to better understand its behavior.

You’ll be working on this project individually—no group submissions are allowed. All your reports, including the final one, must be prepared using Overleaf. Reports not made in Overleaf will receive a zero. If you have any accessibility concerns or challenges with using Overleaf, please email me, and I’ll be happy to waive that requirement.

When it comes to running your experiments, you cannot use libraries or classes that directly build models for you, such as XGBoost, RandomForestClassifier, or LogisticRegression. You are also discouraged from using external preprocessing tools like LabelEncoder from sklearn.

To help you polish your writing, feel free to use Google Docs or the free version of Grammarly.

Using generative AI tools like ChatGPT or Gemini

You are not allowed to use generative AI tools to complete the entire project all at once. However, you are encouraged to use any online resources, including generative AI, to help you understand concepts and debug your code. If you use generative AI at any point, you must document it by including links to your conversations with the tool, and you must write one paragraph reflecting on the role these tools played and the lessons you learned. If you didn't use AI, write a short paragraph explaining why you chose not to use generative AI at all.

Project submission guidelines

The project is divided into multiple phases (details below). For each phase, you are expected to submit the following four items:

  1. An HTML version of your ‘annotated’ Python notebook (including all cell outputs),
  2. A PDF report generated from your Overleaf project,
  3. A link to your Overleaf project, and
  4. A short video recording (~1 minute) summarizing what you accomplished in that phase.

NOTE: Do not ZIP the files. Submit them separately.

All four items will be shared with the rest of the class.

If you’re using Google Colab, please convert your notebook to an .html file and submit that version. You can do this with jupyter nbconvert, for example by running the following command in a cell within your Colab notebook:

%%shell
jupyter nbconvert --to html /content/Phase2.ipynb

The project will conclude with an in-person oral exam (VIVA). You are expected to understand everything in your submission, including your code, and be able to explain it on the spot.

Below is the list of the phases along with the expectations in each phase.

Phase 1: Data analysis & preparation

Before working on this phase, practice “Activities 1 and 2” in the ‘Neural networks using Tensorflow’ crash course. You may also find this short lecture on ‘how to clean a tabular dataset for machine learning’ useful for learning how to clean your data. The most important task in this phase is to find a dataset for your project. You can pick a dataset from the UCI ML repository or Kaggle. Once you have found a dataset you like, the next step is to load the data into a Python notebook and normalize it. In your report’s introduction section, please discuss why you chose to work on this project, explain the problem you plan to solve, and mention the source of your dataset. Visualize/plot the distribution of each input feature and discuss the range of the values (min, max, mean, median, etc.); for example, plot a histogram of each input feature. Selected visualizations should be included in the report.
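As a minimal sketch (assuming your data is in a hypothetical CSV file named data.csv and that the output column is named label; adjust both names to your dataset), the summary statistics and histograms could be produced like this:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Summary statistics (min, max, mean, median, etc.) for every column
print(df.describe())

# Histogram of each input feature (every column except the output column 'label')
input_features = [c for c in df.columns if c != "label"]
df[input_features].hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()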

You should also discuss the distribution of the output labels. Please check if the data is imbalanced by calculating what percentage of the output labels are 0 and what percentage are 1. If your dataset is heavily imbalanced (for example, 1% vs 99%) it may be a better idea to choose a different dataset. In the case of regression, check if the values are uniformly distributed by plotting the distribution of the output variable. Also, in your notebook, you should normalize your data and discuss this in your report.
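Similarly, here is a small sketch of the class-balance check and a simple min-max normalization, using the same hypothetical file and column names as above:

import pandas as pd

df = pd.read_csv("data.csv")
input_features = [c for c in df.columns if c != "label"]

# Class balance: percentage of each output label (e.g., 0 vs 1)
print(df["label"].value_counts(normalize=True) * 100)

# Min-max normalization of each input feature to the range [0, 1]
for col in input_features:
    col_min, col_max = df[col].min(), df[col].max()
    df[col] = (df[col] - col_min) / (col_max - col_min)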

There are three restrictions when choosing a dataset. First, you are not allowed to pick a sequence/time-series dataset; examples of sequence data include stock prices and weather data. You are also not allowed to select a dataset consisting of image inputs or text inputs (natural language processing datasets). Second, please do not choose the “Iris flower dataset”, the “Pima diabetes dataset”, or the “Wine quality dataset”; I already discuss these in my lectures. Third, your tabular data should have at least around 1,000 rows and at least 3 features (columns) in addition to the output column. If you use a dataset with more than 100,000 rows, please email me right before the last week of the semester to claim extra bonus points.

After completing this phase, you are expected to submit all four items (see above). Here is an example report. This is only an example, NOT a template that you should follow.

Why are you doing this assignment?

This assignment provides hands-on experience in data analysis and preparation, crucial steps in machine learning.

Why does this matter for your career?

Skills to analyze and prepare data are highly sought after in various industries. Practicing these skills will allow you to add them to your CV. If you pick a dataset on a topic of your interest, it will help you create stories that you can share with interviewers.

Some ideas to use generative AI

Generative AI can help you check that the dataset you have picked meets the requirements outlined above. It may also suggest data cleaning techniques beyond what I cover and assist in debugging your code related to data analysis and preparation. As you use these AI tools, note how they help you find a dataset or clean it.

Phase 2: Predicting without using ANY machine learning

The main goal in this phase is to step back from machine learning entirely and attempt to predict your output using only traditional programming logic and observational analysis. I would like to see how far you can get without any ML concepts, tools, or libraries. This will serve as a baseline, demonstrating what can be achieved with simple heuristics before we dive deep into complex models.

You are expected to develop a Python function that takes your input features (let’s assume they are x, y, and z for now, but use your actual feature names) and returns a single numerical output between 0 and 1. This output should represent your “probability” or “likelihood” of the outcome you’re trying to predict.

Analyze the data

Go back to your dataset from Phase 1. Try to identify patterns or rules that seem to govern the relationship between your input features and the output. Think about which features appear most predictive and what thresholds, ranges, or simple combinations of features tend to separate the outcomes.

Develop your prediction function

Write a Python function, called predict_non_ml(x, y, z), that encapsulates these observations and rules. It must return a floating-point number between 0 and 1. Remember, no machine learning algorithms or model training allowed. You can use if/else statements, basic arithmetic, and mathematical functions (min, max, abs, etc.).
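As a purely hypothetical illustration (the feature names, thresholds, and weights below are invented and must be replaced by rules you derive from your own data), such a function might look like this:

# Hypothetical illustration only; derive your own rules from your own data.
def predict_non_ml(age, bmi, blood_pressure):
    score = 0.0
    if age > 50:
        score += 0.4
    if bmi > 30:
        score += 0.3
    if blood_pressure > 140:
        score += 0.3
    # Clamp the result so the function always returns a value between 0 and 1
    return min(max(score, 0.0), 1.0)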

Explain your logic

In your report, clearly articulate the reasoning behind your predict_non_ml function. What patterns did you observe? Why did you choose specific weights, thresholds, or rules? How did your “Excel analysis” or manual inspection lead to this specific function?

Evaluate

Apply your predict_non_ml function to the entire dataset. Discuss how well your simple function’s predictions align with the actual outputs. Are there cases where it works well? Where does it fail significantly?
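One way to do this is sketched below, using the same hypothetical file, column, and feature names as the earlier example and a 0.5 threshold to turn the likelihood into a 0/1 prediction:

import pandas as pd

df = pd.read_csv("data.csv")

# Apply the rule-based function to every row of the dataset
predictions = [
    predict_non_ml(row["age"], row["bmi"], row["blood_pressure"])
    for _, row in df.iterrows()
]

# Turn the likelihoods into 0/1 predictions and compare them with the actual labels
correct = sum((p >= 0.5) == bool(y) for p, y in zip(predictions, df["label"]))
print("Fraction of rows predicted correctly:", correct / len(df))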

Reflect

In your report, discuss the limitations of this non-ML approach. What challenges did you face? Why is it difficult to achieve high “accuracy” or robust predictions with these methods compared to what you anticipate ML might offer?

Why are you doing this assignment?

This assignment forces you to think fundamentally about data relationships and problem-solving without relying on pre-built machine learning abstractions. It highlights the ingenuity required for rule-based systems and sets a clear benchmark against which the performance of your machine learning models (in later phases) can be truly appreciated.

Why does this matter for your career?

Understanding the difference between heuristic-based solutions and data-driven machine learning solutions helps you identify when a simple, explainable rule might suffice versus when the complexity and power of ML are truly necessary. This foundational understanding is invaluable for designing efficient and appropriate solutions in real-world scenarios.

Some ideas to use generative AI

I do not encourage you to use generative AI in this phase except for the following:

  1. Debugging your code
  2. Understanding the concepts involved
  3. Ensuring that your approach does not involve machine learning

Phase 3: Build a model to overfit the entire dataset

The main goal in this phase is to experiment and find what network size is needed to ‘overfit’ the entire dataset at hand. For this phase, please do not split your data into training and validation sets. In other words, we want to determine how big an architecture we need to overfit the data. The place to start is a ‘logistic regression’ model trained for as many epochs as needed to obtain as high an accuracy as possible. If, after training for hundreds of epochs, you observe that the accuracy is not increasing, it implies that the number of neurons in your model (only one) may not be sufficient for overfitting. The next step is to grow your model into a multi-layer model by adding a few neurons (say, only 2) in the first hidden layer. This way your model will have ‘2 + 1 = 3’ neurons in total. If your accuracy still does not reach 100% or close to 100%, you can continue to increase the number of layers and the number of neurons. Once you have obtained 100% accuracy (or close to it), your experiments for this phase are complete. The results of this experiment also tell us that our final model (in subsequent phases) should be smaller than this model, where ‘smaller’ refers to the number of layers and the number of neurons in each layer.
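Here is a sketch of this growth process in Keras; it assumes NumPy arrays X (normalized inputs) and y (0/1 labels), a binary classification task, and, as required in this phase, no train/validation split:

import tensorflow as tf

def build_model(hidden_layers):
    # Build a feed-forward network; an empty list gives plain logistic regression
    model = tf.keras.Sequential([tf.keras.Input(shape=(X.shape[1],))])
    for units in hidden_layers:
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Logistic regression first, then progressively larger architectures
for hidden in [[], [2], [4, 2], [8, 4]]:
    model = build_model(hidden)
    history = model.fit(X, y, epochs=500, verbose=0)  # no split in this phase
    print(hidden, "final training accuracy:", history.history["accuracy"][-1])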

Phase 4: Model selection & evaluation

Before working on this phase, practice “Activities 5, 6, 7, and 8” in the ‘Neural networks using Tensorflow’ crash course (see nn-tf link). The main goal in this phase is to obtain the highest possible accuracy on the validation set after splitting your data into a training set and a validation set. Shuffle your rows before splitting. As your baseline model, i.e., the model with minimum accuracy, you can test the accuracy on the validation set using a ‘logistic regression’ model. Then you can gradually grow your model into a multi-layered model and investigate whether larger models deliver higher accuracy on the validation set. Please note that your model should be smaller than the model in the previous phase. As you explore various network architectures, note the accuracies of these models to include in your report. You can summarize your findings in the form of a table, and the table should contain the accuracy and loss on the training set and the validation set (see below). You can also include other parameters such as the number of epochs, the number of neurons, the total number of parameters, etc. Also remember to select one model as your best-performing model, i.e., the model that delivers the highest accuracy on the validation set. Your report should also include learning curves of your experiments. Additionally, you should evaluate your models using other metrics besides accuracy; for example, precision, recall, and F1 score. Please note that your submission for this phase is ineligible for points if you do not use “model checkpointing” in your code. You are discouraged from using external library methods such as “from sklearn.model_selection import train_test_split”.
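Below is a sketch of the shuffle/split and of model checkpointing; it assumes NumPy arrays X and y, an 80/20 split, a hypothetical checkpoint file name best_model.keras, and the build_model helper from the Phase 3 sketch:

import numpy as np
import tensorflow as tf

# Shuffle the rows, then split into training and validation sets (no sklearn needed)
indices = np.random.permutation(len(X))
split = int(0.8 * len(X))
X_train, y_train = X[indices[:split]], y[indices[:split]]
X_val, y_val = X[indices[split:]], y[indices[split:]]

# Keep the weights of the epoch with the best validation accuracy
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.keras", monitor="val_accuracy", save_best_only=True
)

model = build_model([16, 8])  # e.g., a 16-8-1 architecture, reusing the Phase 3 helper
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=200,
    callbacks=[checkpoint],
)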

Here is an example table that your report should include. The accuracy of the random baseline classifier is the percentage of the largest class. A neural network model with 16 neurons in the first layer, 8 neurons in the second layer, and 1 neuron in the last layer can be written as 16-8-1.

Model                               | Acc. on Training Set | Acc. on Validation Set
Random baseline classifier          | 0%                   | 0%
Logistic regression model           | 0%                   | 0%
Neural network model (64-32-16-8-1) | 0%                   | 0%
Neural network model (32-16-8-1)    | 0%                   | 0%
Neural network model (16-8-1)       | 0%                   | 0%
Neural network model (8-1)          | 0%                   | 0%
Neural network model (4-1)          | 0%                   | 0%
Neural network model (2-1)          | 0%                   | 0%

Additional task ONLY for graduate students

In addition to the requirements above, graduate students are required to do the following two tasks to receive full points: 1) discuss what architecture (how big) you need to overfit the data when you include the output as an additional input feature, and 2) code a function that represents your model. Once you have finished coding your model, please build your own function/method that serves as a prediction model. Afterwards, please verify that the predictions you obtain are the same as the ones you obtained using your trained model. The lecture on linear regression with two input variables will be helpful for completing this task. Here is a draft:

# A sketch for a single-layer (logistic regression) model; if your model has
# hidden layers, pass the activations of the penultimate layer instead of the raw inputs.
import numpy as np

def my_prediction_function(model, data):
    num_of_features = data.shape[1]
    # kernel (weights) and bias of the last (output) layer
    w, bias = model.layers[-1].get_weights()
    z = 0
    for i in range(num_of_features):
        z = z + data[:, i] * w[i, 0]
    z = z + bias[0]
    # sigmoid activation, matching the output layer of the model
    result = 1 / (1 + np.exp(-z))
    return result
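Once your function is written, you can check it against the trained model, for example as in this sketch (assuming a trained single-layer model and a NumPy array X of normalized inputs):

import numpy as np

manual = my_prediction_function(model, X)   # your hand-coded prediction
keras_pred = model.predict(X).flatten()     # the trained model's prediction
print("Maximum absolute difference:", np.max(np.abs(manual - keras_pred)))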

Here is an example report.

Phase 5: Feature importance and reduction

Before working on this phase, please practice “Activity 9” in the ‘Neural networks using Tensorflow’ crash course (see nn-tf link). The key activity in this phase is to study the importance of the input features by iteratively removing them. You must continue to use model checkpointing in this phase. Here are the steps involved:

  1. If you have 10 input features/columns, train 10 models where each model receives only one feature at a time. For example, if age, BMI, and blood pressure are your only three input features, you train three models: one that only takes age as input, another that only takes BMI as input, and a last one that takes only blood pressure as input. The validation accuracy of these three models will indicate the relative importance of the three features. You should plot these validation accuracies in the form of a bar diagram (a sketch of this step appears after this list). If all your accuracies are more than 80%, limit your plot’s y-axis to the range 80-100.
  2. From the previous step you have the significance/importance of each feature. The feature that yields the highest accuracy is the most important feature.
  3. Starting with the least important feature, remove one feature at a time (without replacement) and train various models. You can iteratively repeat the process, removing more and more unimportant features. For example, if BMI is the most important feature and blood pressure is the least important one, you would train two models: one without blood pressure, and one without blood pressure and age. Plot the validation accuracy of all the models that you tested. The overall objective is to identify non-informative input features and remove them from the dataset. Finally, you can compare your feature-reduced model with the original model with all input features and discuss the difference in accuracy.
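Here is a sketch of step 1; it assumes NumPy arrays X_train, y_train, X_val, y_val and a list feature_names from Phase 4, and you should add your ModelCheckpoint callback exactly as in that phase:

import tensorflow as tf
import matplotlib.pyplot as plt

single_feature_accuracies = {}
for i, name in enumerate(feature_names):
    # A small model that receives only the i-th feature as input
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train[:, i:i + 1], y_train, epochs=100, verbose=0)  # add checkpointing here
    _, acc = model.evaluate(X_val[:, i:i + 1], y_val, verbose=0)
    single_feature_accuracies[name] = acc

# Bar diagram of the validation accuracy obtained with each single feature
plt.bar(list(single_feature_accuracies.keys()), list(single_feature_accuracies.values()))
plt.ylabel("Validation accuracy")
plt.show()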

Here is an example report.

For bonus points: Use model-agnostic methods such as LIME or Shapley values to derive feature importance.
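For instance, here is a minimal sketch with the shap package (assuming it is installed, and reusing the trained model and the arrays from Phase 4; KernelExplainer treats the trained network as a black box, so it is model-agnostic):

import shap

# Use a small background sample from the training set to keep KernelExplainer fast
background = X_train[:100]
explainer = shap.KernelExplainer(lambda x: model.predict(x).flatten(), background)
shap_values = explainer.shap_values(X_val[:50])

# Summary plot of the Shapley values per feature
shap.summary_plot(shap_values, X_val[:50], feature_names=feature_names)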

Phase 6: Final report

Final report submission guidelines

Please submit a PDF of your final report. It should summarize the key findings from each phase of your project except Phase 2. Do not include any discussion or results from Phase 2 in your final report, as this may confuse your readers.

Formatting and content requirements:

Please also submit a link to your final annotated notebook.

Optionally, you’re welcome to host your project and report on GitHub, but note that this is not required and does not carry extra credit.

Finally, your best/final model should be evaluated using ROC and AUC.
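Here is a sketch of one way to produce them (assuming your best model and the validation arrays from Phase 4; sklearn is used here only for the metric computation, not for building or training the model):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probabilities of the positive class on the validation set
y_scores = model.predict(X_val).flatten()

fpr, tpr, _ = roc_curve(y_val, y_scores)
print("AUC:", roc_auc_score(y_val, y_scores))

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curve")
plt.show()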

Sample final reports

For your reference, here are some final reports submitted by students in previous semesters. These examples are meant to serve as references only—please don’t model your report directly after them.

Your focus should be on meeting the requirements outlined above, not on copying the structure or content of these examples.