Training distilBERT to Predict Podcast Ratings
In this post we will use the Hugging Face API and PyTorch to fine-tune distilBERT on podcast reviews. We will train it to predict the rating from the title and body of the review. By converting ratings to sentiment, this also gives us a sentiment classifier. We will use Ray Tune for hyperparameter search and we will also evaluate the model in various ways.
- 1. Data Cleaning
- 2. Hyperparameter Search
- 3. Fine-Tuning distilBERT
- 4. Evaluating on the Test Set
- 5. Evaluating the Models on some Interesting Reviews
- 6. On Model Confidence
In a previous notebook we compared the performance of two methods for classifying podcast reviews by sentiment: the VADER polarity score and a distilBERT transformer fine-tuned on the SST2 dataset, which consists of sentences from movie reviews.
In this notebook we will use the Hugging Face API and PyTorch to fine-tune the base distilBERT on the podcast reviews. We will train it to predict the rating given the title and body of the review. By converting the rating to sentiment, this also gives us a sentiment classifier.
Once we have trained our model, we will compare its performance with the "ready to use" model trained on SST2. Specifically we will compute accuracy and recall, and also visualize the distributions of predicted probabilities.
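Since the SST2 model is a binary sentiment classifier, this comparison requires collapsing our 5 rating classes to sentiment. Here is a minimal sketch of one such mapping (the scheme, in particular how 3-star reviews are handled, is an assumption for illustration only):
def label_to_sentiment(label):
    'Collapse a 0-4 label (1-5 stars) to binary sentiment; 3-star reviews are dropped in this sketch.'
    if label <= 1:   # 1 or 2 stars
        return 'negative'
    if label >= 3:   # 4 or 5 stars
        return 'positive'
    return None      # 3 stars: ambiguous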
By base distilBERT we mean the model that has been pretrained only on general language modeling (as opposed to sentiment analysis), namely predicting masked words in a sentence. (The BERT model from which it is distilled was additionally trained to predict whether two sentences are adjacent, and its outputs are used as soft targets during distillation, but we won't go into the details of knowledge distillation.) Fine-tuning distilBERT for sentiment classification consists of adding a classification layer at the end of the transformer and then training this slightly modified transformer on the new task (with a small learning rate). This is called transfer learning.
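As a quick illustration of the masked-word pretraining task, one can query the base model with the fill-mask pipeline (a side illustration, not part of the training code below; the example sentence is made up):
from transformers import pipeline

unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
unmasker('This podcast is absolutely [MASK].')  # returns the top candidate tokens with their scores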
As part of the training process we will use Ray Tune to find good hyperparameters.
Finally, we will compare the predictions of our model with the model trained on SST2 on some reviews we will "hold out" of the training set. We picked those reviews because VADER was having a particularly hard time with them and they seemed interesting examples to test what the models have learned about podcast reviews.
Summary of results:
- Training distilBERT for about two epochs on 80,000 podcast reviews results in a sentiment prediction accuracy of $0.883$ on a test set of 5000 reviews. The accuracy of the distilBERT fine-tuned on SST2 on the same test set is $0.815$. The training, evaluation and test sets were constructed in such a way that all ratings are represented equally.
- Comparing the two models on some interesting reviews held out of the training set, it appears that our model learned to classify some difficult cases which are particular to the context of podcast reviews. For example, reviews of horror themed podcasts use language that would be indicative of negative sentiment in other contexts but are actually expressing approval of the show in this context.
- We measured model learning beyond accuracy and training/evaluation loss: one observation is that the recall for positive and negative reviews becomes more balanced over time, even as accuracy and loss plateau. Another is that the model gets more confident over time, i.e. the distribution of output probabilities becomes more and more concentrated, which is a symptom of overfitting (a sketch of how this can be measured follows below).
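For completeness, here is a minimal sketch of how this concentration can be measured, assuming an array of logits like the one returned by trainer.predict (the helper name is ours):
import torch
import torch.nn.functional as F

def max_probabilities(logits):
    'Winning-class probability per prediction; values piling up near 1.0 indicate high confidence.'
    probs = F.softmax(torch.as_tensor(logits), dim=-1)
    return probs.max(dim=-1).values.numpy()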
1. Data Cleaning
In a previous notebook we processed the reviews data but it is still a noisy dataset! We will do the following:
- Some reviews appear to be spam, which is why we will remove reviews by users with suspiciously high review counts.
- We will also exclude some podcasts for kids because a majority of the reviews for those podcasts aren't really reviews. Instead, children appear to be using the reviews as a forum in which to post jokes.
- Finally, we will remove repeat reviews (reviews from the same user for the same podcast) to make sure there is no data leakage from the test set to the training set. I'm not sure why there are repeat reviews but I suspect that they are edited reviews. The reason we need to exclude them is that the review content is often very similar and the rating is usually the same.
Special holdout dataset: As mentioned, we will exclude a handful of reviews (on which we want to evaluate the models at the end) from the training set to make sure they haven't been memorized by the model (their indices are in holdout_ids). This is separate from the evaluation and test sets and not intended to be statistically significant, just to illustrate what the model has learned.
reviews_raw = pd.read_pickle(os.path.join(PATH, 'data/reviews_raw_sentiment.pkl'))
def remove_spammers(reviews, max_reviews=135):
'Remove users with suspiciously high review count.'
mask = reviews.groupby('user_id')['podcast_id'].transform('count') <= max_reviews
return reviews[mask]
def keep_only_latest_rating(ratings):
'Remove repeat reviews, keeping the latest. Also sorts the ratings by date.'
return ratings.sort_values(by='created_at', ascending=False).drop_duplicates(subset=['podcast_id', 'user_id'])
holdout_ids = [956562, 49428, 15130, 212768, 123052, 283, 973, 1516, 2566, 14947, 922494, 9, 10, 76, 11204, 11211, 48339]
kids_podcasts = ['Wow in the World', 'Story Pirates', 'Pants on Fire', 'The Official Average Boy Podcast', 'Despicable Me', 'Rebel Girls', 'Fierce Girls', 'Like and Subscribe: A podcast about YouTube culture', 'The Casagrandes Familia Sounds', 'What If World - Stories for Kids', 'Good Night Stories for Rebel Girls', 'Gird Up! Podcast', 'Highlights Hangout', 'Be Calm on Ahway Island Bedtime Stories', 'Smash Boom Best', 'The Cramazingly Incredifun Sugarcrash Kids Podcast']
reviews = (
reviews_raw.query('name not in @kids_podcasts')
.query('index not in @holdout_ids')
.pipe(remove_spammers)
.pipe(keep_only_latest_rating)
)
The classifier will expect the labels (targets) to start at 0, which is why we create a labels column that shifts the ratings down by one (so 1-5 stars becomes labels 0-4).
reviews['labels'] = reviews['rating'] - 1
Now we create validation and test sets, in such a way that they both have around 1000 reviews for each star rating (uniform distribution of star ratings). We do this to ensure that the accuracy metric treats all star ratings equally.
reviews_val_test = (
reviews.groupby('labels')
.sample(n=2000)
)
reviews_train = reviews.query('index not in @reviews_val_test.index')
reviews_val, reviews_test = train_test_split(reviews_val_test, test_size=0.5)
reviews_val['labels'].value_counts()
reviews_test['labels'].value_counts()
reviews_train['labels'].value_counts()
The data is heavily skewed towards 5-star ratings. We will create a training set which contains the same number of reviews for each rating value, to make sure the model treats each rating class equally, so to speak. We did the same for the evaluation and test splits.
reviews_train_equal = (
reviews_train.groupby('labels')
.sample(n=16_000)
.sample(frac=1) #shuffle rows
)
Now we pickle the train, evaluation and test sets to ensure reproducibility. We took care to set seeds for NumPy and PyTorch at the beginning of the notebook, but it is best to be careful, particularly in a notebook where cells could be run multiple times or out of order (a sketch of that seed setting is shown after the reload below).
reviews_train_equal.to_pickle(os.path.join(PATH, 'data/reviews_train_equal.pkl'))
reviews_val.to_pickle(os.path.join(PATH, 'data/reviews_val.pkl'))
reviews_test.to_pickle(os.path.join(PATH, 'data/reviews_test.pkl'))
reviews_train_equal = pd.read_pickle(os.path.join(PATH, 'data/reviews_train_equal.pkl'))
reviews_val = pd.read_pickle(os.path.join(PATH, 'data/reviews_val.pkl'))
reviews_test = pd.read_pickle(os.path.join(PATH, 'data/reviews_test.pkl'))
reviews_train_equal['labels'].value_counts()
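For reference, the seed setting mentioned above looks roughly like this (a sketch; the exact seed value is an assumption):
import random
import numpy as np
import torch

SEED = 42  # assumed value, for illustration only
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
Hugging Face also provides transformers.set_seed, which sets all of these in one call.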
Now we tokenize the datasets, which needs to be done before we feed them to the model. We saw in the previous notebook that under $3\%$ of reviews result in sequences of more than 256 tokens, which is why we set that as the max_length.
train_dataset_equal = Dataset.from_dict(reviews_train_equal[['demojized review', 'labels']])
val_dataset = Dataset.from_dict(reviews_val[['demojized review', 'labels']])
# We omit 'labels' in the test_dataset because otherwise we would get an error
# when evaluating the model fine-tuned on SST2, which has only 2 labels instead of 5
test_dataset = Dataset.from_dict(reviews_test[['demojized review']])
dataset_dict = DatasetDict({'train_equal':train_dataset_equal, 'validation':val_dataset, 'test':test_dataset})
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)
def tokenize_function(data, tokenizer, truncation=True, max_length=256):
return tokenizer(data['demojized review'], truncation=truncation, max_length=max_length)
tokenized_datasets = (
dataset_dict.map(partial(tokenize_function, tokenizer=tokenizer), batched=True)
.remove_columns(['demojized review'])
)
tokenized_datasets.set_format('torch')
2. Hyperparameter Search
Now we are ready to do the hyperparameter search using Hugging Face and Ray Tune. We will perform a random search over the batch sizes 8, 16 and 32, as well as learning rates between $10^{-5}$ and $10^{-4}$. This roughly agrees with the recommended parameters in the original paper (Appendix A.3). They also recommend the epoch numbers 2, 3 and 4 but we will only use 2 epochs because that takes a long time already (and many Colab compute units 😬).
We also use an ASHA scheduler to terminate less promising trials early, although in retrospect I'm not sure that is a good idea (see below). Even with the scheduler the search took about 3 hours with a "premium GPU" on Colab, and from the results it looks like the hyperparameter choice does not make a big difference (within a reasonable range).
The following function will evaluate the model during training. It computes the accuracy and the per-class recall, i.e. one recall value for each of the 5 rating classes.
def compute_metrics(eval_preds):
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
accuracy = accuracy_score(labels, predictions)
recall = recall_score(
y_true=labels,
y_pred=predictions,
labels=[0, 1, 2, 3, 4],
average=None,
)
metric_names = [f'recall_{n}_stars' for n in range(1, 6)] + ['accuracy']
return dict(zip(metric_names, list(recall) + [accuracy]))
training_args = TrainingArguments(
output_dir='hugging-face-trainers',
num_train_epochs=2,
eval_steps=500,
evaluation_strategy='steps',
save_strategy='no',
disable_tqdm=True,
)
def get_model():
return AutoModelForSequenceClassification.from_pretrained(
'distilbert-base-uncased',
num_labels=5,
ignore_mismatched_sizes=True,
)
trainer = Trainer(
model=None,
model_init=get_model,
args=training_args,
train_dataset=tokenized_datasets['train_equal'],
eval_dataset=tokenized_datasets['validation'],
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
scheduler = ASHAScheduler(
metric="eval_accuracy",
mode="max",
grace_period=4,
reduction_factor=4,
)
def hp_space(trial):
return {
'learning_rate': tune.loguniform(1e-5, 1e-4),
'per_device_train_batch_size': tune.choice([8, 16, 32]),
}
reporter = JupyterNotebookReporter(
parameter_columns={
'learning_rate': 'lr',
'per_device_train_batch_size': 'train_bs/gpu',
},
metric_columns=[
'eval_accuracy', 'eval_loss', 'epoch',
'eval_recall_1_stars', 'eval_recall_2_stars', 'eval_recall_3_stars',
'eval_recall_4_stars', 'eval_recall_5_stars'
]
)
best_run = trainer.hyperparameter_search(
hp_space=hp_space,
backend='ray',
direction='maximize',
n_trials=15,
resources_per_trial={
'cpu': 1,
'gpu': 1/3,
},
scheduler=scheduler,
checkpoint_score_attr='training_iteration',
progress_reporter=reporter,
local_dir=os.path.join(PATH, 'models'),
name='hp_search_5class_uniform_ratings',
log_to_file=True,
)
best_run
It seems that with a higher batch size the accuracy is often better, but at the cost of higher recall for 1 and 5 stars and worse recall for the intermediate ratings. However, this might be an artifact of the ASHA early stopping, which appears to favor the larger batch sizes, probably because they converge more quickly at first given that the model gets to "see" more examples at each step. What makes me think this is indeed the case is that 1) all the completed trials except one have batch size 32, and 2) the only smaller-batch trial that wasn't stopped early ended up having the highest accuracy (it was also the first trial and thus not subject to early stopping).
In any case, the differences in accuracy are very small and it's not worth it to repeat the costly hyperparameter search.
3. Fine-Tuning distilBERT
We will use the best parameters we found above: batch size 8 and learning rate $2.44\cdot10^{-5}$. Note that because we will train for 4 epochs instead of the 2 epochs we used in the hyperparameter search, the learning rate will actually be a little higher for longer at the beginning, given that we are using a linear learning rate scheduler (the default).
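To make this concrete: with the default linear schedule (and no warmup, also the default), the learning rate at step $t$ out of $T$ total steps is $\text{lr}_0\cdot(1 - t/T)$. The balanced training set has $5 \times 16{,}000 = 80{,}000$ reviews, so at batch size 8 (assuming a single GPU) one epoch is $10{,}000$ steps; after one epoch the 2-epoch run has already decayed to $0.5\,\text{lr}_0$, while the 4-epoch run is still at $0.75\,\text{lr}_0$.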
base_model = AutoModelForSequenceClassification.from_pretrained(
PRETRAINED,
num_labels=5,
ignore_mismatched_sizes=True,
)
training_args = TrainingArguments(
output_dir=os.path.join(PATH, 'models/best-run'),
learning_rate=2.4e-5,
per_device_train_batch_size=8,
per_device_eval_batch_size=32,
num_train_epochs=4,
evaluation_strategy='steps',
)
trainer = Trainer(
model=base_model,
args=training_args,
train_dataset=tokenized_datasets['train_equal'],
eval_dataset=tokenized_datasets['validation'],
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
trainer.train()