Session Objectives. This session will focus on how Large Language Models (LLMs) like ChatGPT can assist in building and evaluating predictive models after the exploratory data analysis stage. Participants will learn how to use LLMs to generate model-ready features from both text and tabular data, select appropriate predictive modeling techniques such as classification or regression, and interpret model outputs to guide further improvements.
| ⏳ | Topic |
|---|---|
| 5 min | Introduction (sankalpa) |
| 15 min | Live demo and practice of text classification (sankalpa) |
| – | Demo of tabular data classification (sankalpa) |
1. Download the Corona NLP dataset file.
2. Open Google Colab.
3. Upload the file in Google Colab:
   - Click the folder icon on the left sidebar of your Colab notebook to open the file explorer.
   - Navigate to the desired directory. To upload to a specific subdirectory, navigate into it; otherwise, files are uploaded to the current working directory (e.g. `/content/`).
   - Upload the file (drag and drop, or use the upload button).
4. Upload the data in CSV format into the Gemini section in Colab (or any other AI tool) and ask it to read the data, using prompts such as the following:
a. I have a CSV file named "Corona_NLP_test.csv" that contains two important columns: 'OriginalTweet' (text) and 'Sentiment' (labels). Can you generate Python code using pandas to load the file and display the first few rows?
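A prompt like this typically yields loading code along the following lines. Since we cannot assume the file's contents here, the inline sample below is a stand-in for the real `Corona_NLP_test.csv`; in Colab you would pass the file path directly.

```python
from io import StringIO

import pandas as pd

# Stand-in for the real file; replace `sample_csv` with the path
# "/content/Corona_NLP_test.csv" once the file is uploaded.
sample_csv = StringIO(
    "OriginalTweet,Sentiment\n"
    '"Stocked up on groceries before the lockdown",Neutral\n'
    '"Panic buying has emptied the shelves",Negative\n'
    '"Grateful for the delivery workers keeping us supplied",Positive\n'
)

# Load the CSV and display the first few rows to confirm it was read correctly.
df = pd.read_csv(sample_csv)
print(df.head())
```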
b. Write Python code to preprocess the 'OriginalTweet' column for sentiment prediction.
Steps should include:
- Lowercasing text
- Removing punctuation and stopwords
- Tokenizing text
- Converting text into numerical features using TF-IDF
c. Write Python code using scikit-learn to train a text classification model that predicts 'Sentiment' from the 'OriginalTweet' text. Use TF-IDF features.
d. Add code to evaluate the model using accuracy, confusion matrix, and classification report.
e. Write code to use the trained model to predict sentiment for these new tweets:
["People are staying indoors because of the coronavirus outbreak and everyone is panicking."]
Description of the prompt: We use this prompt to read the raw data from the CSV file, printing the first few rows to confirm the file was read correctly. The preprocessing steps then clean and transform the raw text, making it more consistent, less noisy, and suitable for analysis and model training. After preprocessing, we ask the LLM to generate code that builds a model to predict the sentiment of a tweet. Once the model has been trained, we need to see how well it works, so the prompt also asks to check its performance using common evaluation metrics. Finally, we use the trained model to predict the sentiment of some new tweets.
Now we want to probe how the model responds to changes in its input, and which words drive its predictions.
Iteratively shorten the sentence by removing words from the end, one at a time. After each shortened version, predict the sentiment again. Show the process in a table with two columns:
- "Shortened Sentence"
- "Predicted Sentiment"
Randomly replace words one at a time to identify the key words that drive the predicted sentiment.
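The two probes above can be sketched as follows. For the sketch to run on its own we train a throwaway toy model; in the workshop you would reuse the model trained earlier. Replacing each word in turn with a neutral placeholder (rather than randomly) is a simplifying assumption that keeps the output deterministic.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Throwaway toy model so the sketch is self-contained.
train_text = ["great wonderful happy love", "awful terrible panic scared"]
train_label = ["Positive", "Negative"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_text, train_label)

sentence = "people love the wonderful support despite the panic"
words = sentence.split()

# Probe 1: iteratively shorten from the end and re-predict,
# collecting the results in a two-column table.
rows = []
for end in range(len(words), 0, -1):
    shortened = " ".join(words[:end])
    rows.append({"Shortened Sentence": shortened,
                 "Predicted Sentiment": model.predict([shortened])[0]})
print(pd.DataFrame(rows).to_string(index=False))

# Probe 2: replace one word at a time with a placeholder and report
# which substitutions flip the prediction.
baseline = model.predict([sentence])[0]
for i, word in enumerate(words):
    perturbed = " ".join(words[:i] + ["thing"] + words[i + 1:])
    pred = model.predict([perturbed])[0]
    if pred != baseline:
        print(f"Replacing '{word}' flips the sentiment to {pred}")
```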
The session will include a live demo showcasing LLM-assisted predictive workflows, followed by a hands-on activity where participants apply these ideas using a sample dataset.
Repeat the advanced predictive modeling techniques above for the tweet below:
"I love the new policy"