In a previous notebook we compared the performance of two methods for classifying podcast reviews by sentiment: the VADER polarity score and a distilBERT transformer fine-tuned on the SST2 dataset, which consists of sentences from movie reviews.

In this notebook we will use the Hugging Face API and PyTorch to fine-tune the base distilBERT on the podcast reviews. We will train it to predict the rating given the title and body of the review. By converting the rating to sentiment, this also gives us a sentiment classifier.

Once we have trained our model, we will compare its performance with the "ready to use" model trained on SST2. Specifically we will compute accuracy and recall, and also visualize the distributions of predicted probabilities.

By base distilBERT we mean the model that has been pretrained only on a general language task (as opposed to sentiment analysis): predicting masked words in a sentence. (During pretraining it is also trained to match the outputs of the larger BERT model from which it is distilled, but we won't go into the details of knowledge distillation.) Fine-tuning distilBERT for sentiment classification consists of adding a classification layer on top of the transformer and then training this slightly modified model for sentiment classification (with a small learning rate). This is called transfer learning.
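Conceptually, the added classification layer is just a small head on top of the transformer's output for the first ([CLS]) token. The following is a minimal sketch of that idea in PyTorch; the head that Hugging Face's AutoModelForSequenceClassification actually adds is a bit more elaborate (an extra linear layer and dropout), so treat this as an illustration only.

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DistilBertWithHead(nn.Module):
    'Sketch of transfer learning: a pretrained encoder plus a fresh classification head.'
    def __init__(self, pretrained='distilbert-base-uncased', num_labels=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained)         # pretrained weights
        self.head = nn.Linear(self.encoder.config.dim, num_labels)   # randomly initialized

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        cls_embedding = hidden_states[:, 0]   # embedding of the [CLS] token
        return self.head(cls_embedding)       # one logit per class

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertWithHead()
logits = model(**tokenizer(['Great podcast!'], return_tensors='pt'))  # shape (1, 5)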

As part of the training process we will use Ray Tune to find good hyperparameters.

Finally, we will compare the predictions of our model and of the model trained on SST2 on some reviews we will "hold out" of the training set. We picked those reviews because VADER had a particularly hard time with them and they seemed like interesting examples for testing what the models have learned about podcast reviews.

Summary of results:

  • Training distilBERT for about two epochs on 80,000 podcast reviews results in a sentiment prediction accuracy of $0.883$ on a test set of 5000 reviews. The accuracy of the distilBERT fine-tuned on SST2 on the same test set is $0.815$. The training, evaluation and test sets were constructed in such a way that all ratings are represented equally.
  • Comparing the two models on some interesting reviews held out of the training set, it appears that our model learned to classify some difficult cases that are particular to the context of podcast reviews. For example, reviews of horror-themed podcasts use language that would indicate negative sentiment in other contexts but here actually expresses approval of the show.
  • We measured model learning beyond the accuracy and training/evaluation loss: one observation is that the recall for positive and negative reviews becomes more balanced over time, even as the accuracy and loss plateau. We also note that the model gets more confident over time, i.e. the distribution of output probabilities becomes more and more concentrated. This is a symptom of overfitting.

1. Data Cleaning

In a previous notebook we processed the reviews data but it is still a noisy dataset! We will do the following:

  • Some reviews appear to be spam, which is why we will remove reviews by users with suspiciously high review counts.
  • We will also exclude some podcasts for kids because a majority of the reviews for those podcasts aren't really reviews. Instead, children appear to be using the reviews as a forum in which to post jokes.
  • Finally, we will remove repeat reviews (reviews from the same user for the same podcast) to make sure there is no data leakage from the test set to the training set. I'm not sure why there are repeat reviews, but I suspect they are edited reviews. The reason we need to exclude them is that the review content is often very similar and the rating is usually the same.

Special holdout dataset: As mentioned, we will exclude a handful of reviews (on which we want to evaluate the models at the end) from the training set to make sure they haven't been memorized by the model (their indices are in holdout_ids). This is separate from the evaluation and test sets and not intended to be statistically significant; it just illustrates what the model has learned.

reviews_raw = pd.read_pickle(os.path.join(PATH, 'data/reviews_raw_sentiment.pkl'))
def remove_spammers(reviews, max_reviews=135):
    'Remove users with suspiciously high review count.'
    mask = reviews.groupby('user_id')['podcast_id'].transform('count') <= max_reviews
    return reviews[mask]

def keep_only_latest_rating(ratings):
    'Remove repeat reviews, keeping the latest. Also sorts the ratings by date.'
    return ratings.sort_values(by='created_at', ascending=False).drop_duplicates(subset=['podcast_id', 'user_id'])
holdout_ids = [956562, 49428, 15130, 212768, 123052, 283, 973, 1516, 2566, 14947, 922494, 9, 10, 76, 11204, 11211, 48339]
kids_podcasts = ['Wow in the World', 'Story Pirates', 'Pants on Fire', 'The Official Average Boy Podcast', 'Despicable Me', 'Rebel Girls', 'Fierce Girls', 'Like and Subscribe: A podcast about YouTube culture', 'The Casagrandes Familia Sounds', 'What If World - Stories for Kids', 'Good Night Stories for Rebel Girls', 'Gird Up! Podcast', 'Highlights Hangout', 'Be Calm on Ahway Island Bedtime Stories', 'Smash Boom Best', 'The Cramazingly Incredifun Sugarcrash Kids Podcast']
reviews = (
  reviews_raw.query('name not in @kids_podcasts')
             .query('index not in @holdout_ids')
             .pipe(remove_spammers)
             .pipe(keep_only_latest_rating)
)

The classifier will expect the labels (targets) to start at 0, so we create a labels column that shifts the ratings down by one.

reviews['labels'] = reviews['rating'] - 1

Now we create validation and test sets, in such a way that they both have around 1000 reviews for each star rating (uniform distribution of star ratings). We do this to ensure that the accuracy metric treats all star ratings equally.

reviews_val_test = (
    reviews.groupby('labels')
           .sample(n=2000)
)

reviews_train = reviews.query('index not in @reviews_val_test.index')
reviews_val, reviews_test = train_test_split(reviews_val_test, test_size=0.5)
reviews_val['labels'].value_counts()
2    1013
4    1011
1    1007
3     989
0     980
Name: labels, dtype: int64
reviews_test['labels'].value_counts()
0    1020
3    1011
1     993
4     989
2     987
Name: labels, dtype: int64
reviews_train['labels'].value_counts()
4    811106
0     43229
3     26008
2     19150
1     17149
Name: labels, dtype: int64

The data is heavily skewed towards 5 star ratings. We will create a training set that contains the same number of reviews for each rating value, to make sure the model treats each rating class equally, so to speak. We did the same for the evaluation and test splits.

reviews_train_equal = (
  reviews_train.groupby('labels')
               .sample(n=16_000)
               .sample(frac=1) #shuffle rows
)

Now we pickle the train, evaluation and test sets to ensure reproducibility. We took care to set seeds for NumPy and PyTorch at the beginning of the notebook, but it is best to be careful, particularly in a notebook where cells could be run multiple times or out of order.
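For reference, the seeding at the top of the notebook looks roughly like the following sketch (the seed value shown here is illustrative):

import random
import numpy as np
import torch

SEED = 42  # illustrative value
random.seed(SEED)
np.random.seed(SEED)              # pandas sampling also draws from NumPy's global RNG
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # no-op when CUDA is unavailable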

reviews_train_equal.to_pickle(os.path.join(PATH, 'data/reviews_train_equal.pkl'))
reviews_val.to_pickle(os.path.join(PATH, 'data/reviews_val.pkl'))
reviews_test.to_pickle(os.path.join(PATH, 'data/reviews_test.pkl'))
reviews_train_equal = pd.read_pickle(os.path.join(PATH, 'data/reviews_train_equal.pkl'))
reviews_val = pd.read_pickle(os.path.join(PATH, 'data/reviews_val.pkl'))
reviews_test = pd.read_pickle(os.path.join(PATH, 'data/reviews_test.pkl'))
reviews_train_equal['labels'].value_counts()
4    16000
0    16000
1    16000
3    16000
2    16000
Name: labels, dtype: int64

Now we tokenize the datasets, which needs to be done before we feed them to the model. We saw in the previous notebook that under $3\%$ of reviews result in sequences of more than 256 tokens, which is why we set that as the max_length.

train_dataset_equal = Dataset.from_dict(reviews_train_equal[['demojized review', 'labels']])
val_dataset = Dataset.from_dict(reviews_val[['demojized review', 'labels']])
# We omit 'labels' in the test_dataset because otherwise we would get an error
# when evaluating the model fine tuned on SST2 with only 2 labels, instead of 5
test_dataset = Dataset.from_dict(reviews_test[['demojized review']])
dataset_dict = DatasetDict({'train_equal':train_dataset_equal, 'validation':val_dataset, 'test':test_dataset})
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)
def tokenize_function(data, tokenizer, truncation=True, max_length=256):
    return tokenizer(data['demojized review'], truncation=truncation, max_length=max_length)
tokenized_datasets = (
    dataset_dict.map(partial(tokenize_function, tokenizer=tokenizer), batched=True)
                .remove_columns(['demojized review'])
)

tokenized_datasets.set_format('torch')

2. Hyperparameter Search

Now we are ready to do the hyperparameter search using Hugging Face and Ray Tune. We will perform a random search over the batch sizes 8, 16 and 32 and learning rates between $10^{-5}$ and $10^{-4}$. This roughly agrees with the recommended parameters in the original BERT paper (Appendix A.3). The paper also recommends 2, 3 or 4 epochs, but we will only use 2 epochs because that already takes a long time (and many Colab compute units 😬).

We also use an ASHA scheduler to terminate less promising trials, although in retrospect I'm not sure that was a good idea (see below). That said, even with the scheduler the search took me 3 hours on a "premium GPU" in Colab, and from the results it looks like the hyperparameter choice does not make a big difference (within a reasonable range).

The following function will be used to evaluate the model during training. It computes the accuracy and the recall; the recall is computed for every rating class and thus consists of 5 numbers.

def compute_metrics(eval_preds):
  logits, labels = eval_preds
  predictions = np.argmax(logits, axis=-1)
  accuracy = accuracy_score(labels, predictions)
  recall = recall_score(
      y_true=labels,
      y_pred=predictions,
      labels=[0, 1, 2, 3, 4], 
      average=None,
  )
  metric_names = [f'recall_{n}_stars' for n in range(1, 6)] + ['accuracy']
  return dict(zip(metric_names, list(recall) + [accuracy]))
training_args = TrainingArguments(
    output_dir='hugging-face-trainers',
    num_train_epochs=2,
    eval_steps=500,
    evaluation_strategy='steps',
    save_strategy='no',
    disable_tqdm=True,
)

def get_model():
  return AutoModelForSequenceClassification.from_pretrained(
      'distilbert-base-uncased',
      num_labels=5,
      ignore_mismatched_sizes=True,
)

trainer = Trainer(
    model=None,
    model_init=get_model,
    args=training_args,
    train_dataset=tokenized_datasets['train_equal'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

scheduler = ASHAScheduler(
        metric="eval_accuracy",
        mode="max",
        grace_period=4,
        reduction_factor=4,
)

def hp_space(trial):
  return {
      'learning_rate': tune.loguniform(1e-5, 1e-4),
      'per_device_train_batch_size': tune.choice([8, 16, 32]),
  }


reporter = JupyterNotebookReporter(
    parameter_columns={
        'learning_rate': 'lr',
        'per_device_train_batch_size': 'train_bs/gpu',
    },
    metric_columns=[
        'eval_accuracy', 'eval_loss', 'epoch',
        'eval_recall_1_stars', 'eval_recall_2_stars', 'eval_recall_3_stars',
        'eval_recall_4_stars', 'eval_recall_5_stars'
    ]
)
best_run = trainer.hyperparameter_search(
    hp_space=hp_space,
    backend='ray',
    direction='maximize',
    n_trials=15,
    resources_per_trial={
        'cpu': 1,
        'gpu': 1/3,
    },
    scheduler=scheduler,
    checkpoint_score_attr='training_iteration',
    progress_reporter=reporter,
    local_dir=os.path.join(PATH, 'models'),
    name='hp_search_5class_uniform_ratings',
    log_to_file=True,
)
== Status ==
Current time: 2022-10-18 15:20:26 (running for 02:04:20.59)
Memory usage on this node: 9.4/83.5 GiB
Using AsyncHyperBand: num_stopped=9 Bracket: Iter 64.000: None | Iter 16.000: 0.5892 | Iter 4.000: 0.5764
Resources requested: 0/12 CPUs, 0/1 GPUs, 0.0/49.81 GiB heap, 0.0/24.91 GiB objects
Result logdir: /content/drive/MyDrive/ml-projects/podcast-reviews/models/hp_search_5class_uniform_ratings
Number of trials: 15/15 (15 TERMINATED)
Trial name status loc lr train_bs/gpu lr_scheduler eval_accuracy eval_loss epoch eval_recall_1_stars eval_recall_2_stars eval_recall_3_stars eval_recall_4_stars eval_recall_5_stars
_objective_0585f_00000 TERMINATED 172.28.0.2:907 2.36886e-05 8 0.5982 0.972204 2 0.615306 0.480636 0.499506 0.579373 0.816024
_objective_0585f_00001 TERMINATED 172.28.0.2:947 6.02131e-05 8 0.5778 0.992277 0.8 0.646939 0.409136 0.456071 0.521739 0.855589
_objective_0585f_00002 TERMINATED 172.28.0.2:949 1.43217e-05 32 0.5962 0.967467 2 0.639796 0.46574 0.467917 0.55814 0.849654
_objective_0585f_00003 TERMINATED 172.28.0.2:1774 2.1563e-05 32 0.597 0.961714 2 0.633673 0.474677 0.471866 0.564206 0.840752
_objective_0585f_00004 TERMINATED 172.28.0.2:2334 9.33061e-05 16 0.539 1.06329 0.4 0.346939 0.481629 0.563672 0.408493 0.885262
_objective_0585f_00005 TERMINATED 172.28.0.2:2670 1.51993e-05 8 0.543 1.06707 0.2 0.517347 0.543198 0.479763 0.471183 0.701286
_objective_0585f_00006 TERMINATED 172.28.0.2:2895 4.08934e-05 8 0.5496 1.05635 0.2 0.577551 0.474677 0.454097 0.569262 0.673591
_objective_0585f_00007 TERMINATED 172.28.0.2:3032 1.95537e-05 32 0.597 0.962354 2 0.635714 0.473684 0.46693 0.564206 0.844708
_objective_0585f_00008 TERMINATED 172.28.0.2:3170 1.11344e-05 32 0.5756 1.00606 0.8 0.694898 0.457795 0.378085 0.465116 0.883284
_objective_0585f_00009 TERMINATED 172.28.0.2:3273 6.09784e-05 32 0.5948 0.974191 2 0.614286 0.46574 0.501481 0.57634 0.816024
_objective_0585f_00010 TERMINATED 172.28.0.2:3751 9.62124e-05 8 0.5222 1.10365 0.2 0.67449 0.288977 0.455084 0.488372 0.707221
_objective_0585f_00011 TERMINATED 172.28.0.2:3914 4.05084e-05 8 0.556 1.058 0.2 0.610204 0.447865 0.479763 0.538928 0.704253
_objective_0585f_00012 TERMINATED 172.28.0.2:4076 1.03101e-05 8 0.5434 1.0641 0.2 0.516327 0.529295 0.481737 0.462083 0.725025
_objective_0585f_00013 TERMINATED 172.28.0.2:4261 6.43276e-05 8 0.5432 1.07166 0.2 0.697959 0.351539 0.465943 0.497472 0.706231
_objective_0585f_00014 TERMINATED 172.28.0.2:4370 1.70174e-05 32 0.5968 0.964603 2 0.637755 0.471698 0.46693 0.560162 0.847676


2022-10-18 15:20:26,481	INFO tune.py:759 -- Total run time: 7461.08 seconds (7460.56 seconds for the tuning loop).
best_run
BestRun(run_id='0585f_00000', objective=3.5890449331933936, hyperparameters={'learning_rate': 2.368863950364079e-05, 'per_device_train_batch_size': 8})

It seems that with a higher batch size the accuracy is often better, but at the cost of higher recall for 1 and 5 stars and worse recall for the intermediate ratings. However, this might be an artifact of the ASHA early stopping, which appears to favor the larger batch sizes, probably because they converge more quickly at first given that the model gets to "see" more examples at each step. Two observations suggest this is indeed what happened: 1) all the completed trials except one have batch size 32, and 2) the only smaller-batch trial that wasn't stopped early ended up having the highest accuracy (it was also the first trial and thus not subject to early stopping).

In any case, the differences in accuracy are very small and it's not worth it to repeat the costly hyperparameter search.

3. Fine-Tuning distilBERT

We will use the best parameters we found above: batch size 8 and a learning rate of about $2.4\cdot10^{-5}$. Note that because we will train for 4 epochs instead of the 2 epochs we used in the hyperparameter search, the learning rate will actually stay higher for longer at the beginning, given that we are using a linear learning rate scheduler (the default).
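To make this concrete: with the default linear schedule (and zero warmup steps, the Trainer default), the learning rate at step $t$ out of $T$ total steps is $\text{lr}(t) = \text{lr}_0 \cdot (1 - t/T)$. Doubling the number of epochs doubles $T$, so at any fixed step the rate has decayed only half as much. A quick sketch comparing the two schedules over the first two epochs:

import numpy as np
import matplotlib.pyplot as plt

lr0 = 2.4e-5
total_steps = {'2 epochs': 20_000, '4 epochs': 40_000}  # 80,000 reviews / batch size 8

t = np.arange(20_000)  # compare over the first two epochs
for label, T in total_steps.items():
    plt.plot(t, lr0 * (1 - t / T), label=label)
plt.xlabel('training step')
plt.ylabel('learning rate')
plt.legend()
plt.show()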

base_model = AutoModelForSequenceClassification.from_pretrained(
      PRETRAINED,
      num_labels=5,
      ignore_mismatched_sizes=True,
)
training_args = TrainingArguments(
    output_dir=os.path.join(PATH, 'models/best-run'),
    learning_rate=2.4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    evaluation_strategy='steps',
)

trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=tokenized_datasets['train_equal'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
using `logging_steps` to initialize `eval_steps` to 500
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 80000
  Num Epochs = 4
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 40000
[40000/40000 2:22:31, Epoch 4/4]
Step Training Loss Validation Loss Recall 1 Stars Recall 2 Stars Recall 3 Stars Recall 4 Stars Recall 5 Stars Accuracy
500 1.245400 1.128538 0.621429 0.371400 0.318855 0.506572 0.841741 0.531600
1000 1.093900 1.104158 0.355102 0.348560 0.655479 0.425683 0.782394 0.515000
1500 1.078600 1.058309 0.416327 0.629593 0.319842 0.556117 0.807122 0.546400
2000 1.059700 1.082167 0.811224 0.345581 0.339585 0.427705 0.801187 0.544000
2500 1.051500 1.042632 0.665306 0.434955 0.320829 0.562184 0.812067 0.558400
3000 1.033800 1.027814 0.750000 0.400199 0.388944 0.485339 0.791296 0.562400
3500 1.048100 1.018655 0.706122 0.328699 0.543929 0.564206 0.717112 0.571400
4000 1.009200 1.007908 0.629592 0.470705 0.384995 0.577351 0.802176 0.572600
4500 1.027100 1.057035 0.359184 0.422046 0.571570 0.593529 0.796241 0.549600
5000 0.993200 1.007861 0.609184 0.384310 0.527147 0.549039 0.818002 0.577600
5500 1.004700 1.021505 0.647959 0.555114 0.429418 0.412538 0.809100 0.571000
6000 1.016400 0.996876 0.578571 0.509434 0.529121 0.430738 0.850643 0.580400
6500 0.990500 1.001491 0.620408 0.606753 0.366239 0.529828 0.786350 0.581800
7000 1.022000 0.974803 0.677551 0.448858 0.399803 0.566229 0.852621 0.588600
7500 0.989900 0.978257 0.700000 0.477656 0.432379 0.444894 0.873393 0.585600
8000 0.971000 0.997515 0.747959 0.423039 0.385982 0.419616 0.881306 0.571200
8500 1.010300 0.978720 0.729592 0.400199 0.478776 0.467139 0.871414 0.589200
9000 0.987800 0.982950 0.735714 0.433962 0.444225 0.492417 0.843719 0.589600
9500 1.009900 0.971602 0.758163 0.390268 0.434353 0.566229 0.837784 0.596600
10000 0.993400 0.987826 0.552041 0.414101 0.462981 0.669363 0.815035 0.582600
10500 0.861500 1.013859 0.531633 0.578947 0.483712 0.539939 0.806133 0.588600
11000 0.885600 1.003105 0.734694 0.397219 0.482725 0.486350 0.863501 0.592600
11500 0.878500 0.998439 0.530612 0.507448 0.475814 0.588473 0.833828 0.587600
12000 0.843300 0.981679 0.600000 0.504469 0.478776 0.479272 0.898121 0.592600
12500 0.872100 1.021082 0.726531 0.443893 0.365252 0.553084 0.855589 0.588200
13000 0.864900 0.984879 0.598980 0.450844 0.461994 0.597573 0.848665 0.591600
13500 0.867700 0.968538 0.574490 0.454816 0.502468 0.585440 0.827893 0.589200
14000 0.865500 0.996079 0.586735 0.510427 0.527147 0.442872 0.883284 0.590800
14500 0.860200 0.987961 0.653061 0.436941 0.505429 0.606673 0.751731 0.590400
15000 0.889400 0.988690 0.574490 0.409136 0.545903 0.575329 0.830861 0.587400
15500 0.851700 0.996287 0.610204 0.478649 0.463968 0.569262 0.826904 0.589800
16000 0.862600 0.978930 0.636735 0.508441 0.440276 0.550051 0.821958 0.591400
16500 0.881000 0.985724 0.555102 0.573982 0.440276 0.595551 0.800198 0.593200
17000 0.838600 0.989725 0.674490 0.440914 0.502468 0.550051 0.829871 0.599400
17500 0.886200 0.974624 0.635714 0.520357 0.464956 0.510617 0.838773 0.594200
18000 0.870900 0.985520 0.644898 0.527309 0.441264 0.505561 0.855589 0.595000
18500 0.873500 0.990966 0.595918 0.451837 0.438302 0.644085 0.818991 0.589600
19000 0.842400 1.015863 0.495918 0.555114 0.513327 0.589484 0.794263 0.590200
19500 0.861000 0.984092 0.674490 0.425025 0.513327 0.554095 0.803165 0.593800
20000 0.853900 0.994874 0.670408 0.511420 0.443238 0.526795 0.800198 0.590200
20500 0.738000 1.033550 0.562245 0.549156 0.472853 0.524772 0.832839 0.588800
21000 0.711800 1.087762 0.568367 0.425025 0.553801 0.570273 0.792285 0.582200
21500 0.723800 1.099150 0.552041 0.448858 0.510365 0.652174 0.767557 0.586200
22000 0.706500 1.131904 0.700000 0.482622 0.377098 0.500506 0.837784 0.579200
22500 0.716100 1.097652 0.605102 0.516385 0.462981 0.595551 0.724036 0.580600
23000 0.711500 1.042393 0.619388 0.471698 0.512340 0.521739 0.841741 0.593600
23500 0.665200 1.120767 0.505102 0.628600 0.455084 0.561173 0.727992 0.576000
24000 0.717600 1.072822 0.635714 0.463754 0.519250 0.496461 0.850643 0.593400
24500 0.701500 1.091549 0.598980 0.554121 0.446199 0.525784 0.790307 0.583200
25000 0.689600 1.085112 0.624490 0.444886 0.490622 0.537917 0.841741 0.588000
25500 0.725500 1.109724 0.603061 0.550149 0.455084 0.452983 0.858556 0.584400
26000 0.695800 1.102224 0.591837 0.508441 0.501481 0.507583 0.830861 0.588400
26500 0.697100 1.079746 0.673469 0.385303 0.510365 0.507583 0.842730 0.583800
27000 0.693500 1.112715 0.611224 0.499503 0.449161 0.608696 0.781405 0.589800
27500 0.700400 1.105307 0.632653 0.447865 0.479763 0.563195 0.826904 0.590000
28000 0.693200 1.099970 0.513265 0.598808 0.449161 0.559151 0.762611 0.577000
28500 0.733600 1.059791 0.565306 0.545184 0.494571 0.583418 0.748764 0.587600
29000 0.697600 1.076711 0.546939 0.530288 0.489635 0.589484 0.776459 0.586800
29500 0.687600 1.077929 0.607143 0.459782 0.492596 0.580384 0.826904 0.593400
30000 0.706100 1.084087 0.540816 0.493545 0.501481 0.582406 0.789318 0.581800
30500 0.542400 1.207177 0.639796 0.460775 0.473840 0.546006 0.796241 0.583200
31000 0.556300 1.225795 0.576531 0.466733 0.509378 0.590495 0.745796 0.577800
31500 0.529400 1.255818 0.602041 0.464747 0.511352 0.585440 0.739862 0.580600
32000 0.553200 1.246235 0.583673 0.493545 0.481737 0.544995 0.784372 0.577800
32500 0.570700 1.228383 0.570408 0.511420 0.469891 0.553084 0.797230 0.580600
33000 0.555200 1.267651 0.539796 0.507448 0.503455 0.603640 0.702275 0.571400
33500 0.552200 1.268231 0.623469 0.435948 0.476802 0.563195 0.789318 0.577600
34000 0.556900 1.260661 0.605102 0.478649 0.452122 0.568251 0.778437 0.576400
34500 0.548100 1.275731 0.588776 0.441907 0.505429 0.582406 0.742829 0.572200
35000 0.576100 1.266743 0.590816 0.474677 0.477789 0.577351 0.763600 0.576800
35500 0.550400 1.259426 0.634694 0.428004 0.490622 0.565217 0.773492 0.578200
36000 0.557200 1.279438 0.581633 0.498510 0.471866 0.567240 0.767557 0.577400
36500 0.546600 1.275166 0.579592 0.485601 0.480750 0.565217 0.787339 0.579800
37000 0.549500 1.283615 0.570408 0.492552 0.483712 0.583418 0.771513 0.580400
37500 0.546800 1.264037 0.568367 0.510427 0.463968 0.567240 0.781405 0.578400
38000 0.526100 1.276842 0.621429 0.452830 0.469891 0.563195 0.782394 0.577800
38500 0.547500 1.275661 0.576531 0.494538 0.479763 0.577351 0.771513 0.580000
39000 0.534700 1.273110 0.574490 0.494538 0.484699 0.575329 0.776459 0.581200
39500 0.561400 1.266741 0.581633 0.492552 0.478776 0.573306 0.778437 0.581000
40000 0.547000 1.265490 0.581633 0.493545 0.473840 0.563195 0.787339 0.580000

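A quicker way to see the trend than scanning the table is to plot the logged losses from the trainer we just ran (a minimal sketch; trainer.state.log_history is a list of dicts that the Hugging Face Trainer fills with entries such as loss, eval_loss and step):

import matplotlib.pyplot as plt

history = trainer.state.log_history
train_log = [(h['step'], h['loss']) for h in history if 'loss' in h]
eval_log = [(h['step'], h['eval_loss']) for h in history if 'eval_loss' in h]

plt.plot(*zip(*train_log), label='training loss')
plt.plot(*zip(*eval_log), label='evaluation loss')
plt.xlabel('step')
plt.ylabel('loss')
plt.legend()
plt.show()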

For all the models we trained so far the recall is higher for the extreme ratings (1 and 5 stars) than for the intermediate ratings. It is especially high for 5 star ratings. Could we make the model better by having a higher proportion of "harder" ratings in the training set? We will test that idea now by creating a new training set with unequal star rating proportions.

train_proportions = {
    0: 16000,
    1: 17000,
    2: 19000,
    3: 17000,
    4: 10000,
}

# In case we loaded val and test with Pickle
reviews_train = (
    reviews.query('index not in @reviews_val.index')
           .query('index not in @reviews_test.index')
)

reviews_train_unequal = (
  reviews_train[['demojized review', 'labels']]
               .groupby('labels', group_keys=False)
               .apply(lambda x: x.sample(n=train_proportions[x.name]))
               .sample(frac=1)
)

train_dataset_unequal = Dataset.from_dict(reviews_train_unequal)

tokenized_train_dataset_unequal = (
    train_dataset_unequal.map(partial(tokenize_function, tokenizer=tokenizer), batched=True)
                   .remove_columns(['demojized review'])
)
base_model = AutoModelForSequenceClassification.from_pretrained(
      PRETRAINED,
      num_labels=5,
      ignore_mismatched_sizes=True,
)
training_args = TrainingArguments(
    output_dir=os.path.join(PATH, 'models/best-run-unequal-ratings'),
    learning_rate=2.4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    evaluation_strategy='steps',
)

trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=tokenized_train_dataset_unequal,
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 79000
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 19750
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[19750/19750 1:34:17, Epoch 2/2]
Step Training Loss Validation Loss Recall 1 Stars Recall 2 Stars Recall 3 Stars Recall 4 Stars Recall 5 Stars Accuracy
500 1.253800 1.130906 0.546939 0.423039 0.576505 0.311426 0.693373 0.511000
1000 1.160400 1.073716 0.424490 0.494538 0.401777 0.491405 0.847676 0.532800
1500 1.120700 1.062113 0.583673 0.397219 0.514314 0.493428 0.757666 0.549400
2000 1.100400 1.043622 0.667347 0.312810 0.442251 0.562184 0.798220 0.556000
2500 1.082100 1.078694 0.767347 0.211519 0.526160 0.633974 0.559842 0.538200
3000 1.076700 1.028931 0.598980 0.474677 0.503455 0.470172 0.779426 0.565600
3500 1.058500 1.041818 0.668367 0.347567 0.590326 0.518706 0.655786 0.555800
4000 1.080900 1.041594 0.748980 0.280040 0.507404 0.584429 0.684471 0.560000
4500 1.060900 1.022272 0.539796 0.618669 0.403751 0.600607 0.681503 0.568800
5000 1.030700 1.036904 0.490816 0.391261 0.593287 0.611729 0.707221 0.559200
5500 1.047100 0.994701 0.460204 0.539225 0.556762 0.497472 0.821958 0.576200
6000 1.033200 0.990816 0.646939 0.415094 0.461007 0.666330 0.716123 0.580400
6500 1.021000 1.000840 0.623469 0.378352 0.524186 0.515672 0.870425 0.582600
7000 1.034700 1.009722 0.575510 0.606753 0.455084 0.488372 0.768546 0.579200
7500 1.023800 1.005465 0.724490 0.352532 0.434353 0.731041 0.632047 0.573400
8000 1.009200 0.996092 0.637755 0.479643 0.537019 0.590495 0.653808 0.579400
8500 1.015800 1.052649 0.564286 0.627607 0.480750 0.441860 0.688427 0.561000
9000 1.011900 1.008287 0.754082 0.417080 0.437315 0.632963 0.659743 0.579000
9500 1.033500 1.014436 0.491837 0.503476 0.560711 0.676441 0.603363 0.567200
10000 0.982700 1.003126 0.557143 0.455809 0.608095 0.547017 0.709199 0.575800
10500 0.922200 0.987351 0.596939 0.477656 0.518263 0.578362 0.798220 0.594000
11000 0.884400 1.028805 0.608163 0.577954 0.461994 0.518706 0.751731 0.583800
11500 0.907400 1.020639 0.503061 0.455809 0.595262 0.647118 0.650841 0.570600
12000 0.915900 0.999687 0.571429 0.520357 0.549852 0.522750 0.776459 0.588600
12500 0.889600 0.995653 0.597959 0.419067 0.519250 0.668352 0.719090 0.584400
13000 0.886300 1.010592 0.634694 0.425025 0.587364 0.565217 0.713155 0.585000
13500 0.862700 1.004916 0.527551 0.469712 0.608095 0.572295 0.732938 0.582600
14000 0.890900 1.003140 0.663265 0.465740 0.530109 0.559151 0.668645 0.577000
14500 0.897200 0.990828 0.648980 0.522344 0.453110 0.561173 0.807122 0.598400
15000 0.883700 1.015247 0.642857 0.424032 0.512340 0.680485 0.667656 0.584800
15500 0.876600 1.002395 0.532653 0.506455 0.535044 0.608696 0.722057 0.581200
16000 0.885100 0.986371 0.586735 0.494538 0.472853 0.651163 0.765579 0.594000
16500 0.877400 0.989928 0.612245 0.466733 0.528134 0.602629 0.751731 0.592200
17000 0.887300 0.999306 0.570408 0.520357 0.498519 0.602629 0.755687 0.589600
17500 0.874400 0.999116 0.605102 0.451837 0.551826 0.602629 0.731949 0.588600
18000 0.871900 0.997107 0.582653 0.463754 0.541955 0.617796 0.754698 0.592200
18500 0.870700 1.002845 0.551020 0.493545 0.536032 0.625885 0.732938 0.588000
19000 0.896300 0.995333 0.592857 0.486594 0.531096 0.617796 0.714144 0.588400
19500 0.865900 0.991305 0.603061 0.480636 0.518263 0.604651 0.752720 0.591800


TrainOutput(global_step=19750, training_loss=0.9750066875988924, metrics={'train_runtime': 5661.6116, 'train_samples_per_second': 27.907, 'train_steps_per_second': 3.488, 'total_flos': 7133320516097280.0, 'train_loss': 0.9750066875988924, 'epoch': 2.0})

Changing the rating proportions in the training set definitely improved the recall score for 3 and 4 star ratings but at the cost of the recall for 1 and 5 star ratings. The accuracy stayed the same. Depending on the use case and what our objective function or cost function is, one might want to use such a training set with hand tuned proportions of star ratings. We will not pursue this further.

4. Evaluating on the Test Set

Now we will compare various checkpoints of the models we fine-tuned with the distilBERT which was fine-tuned on the SST2 dataset.

We will write a function which computes the accuracy and recall, and also plots the distributions of the probabilities the model assigns to each class, conditional on the actual ground truth class. We will do this both for the classification into star ratings (5 classes) and for the classification into sentiment (2 classes). The latter will allow us to compare our models to the SST2 distilBERT, which is a binary classifier.

We will use the test set for the evaluation. For that purpose we will create test_dataloader and the get_probs function, which takes the model and the dataloader and returns the predicted probabilities.

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

test_dataloader = DataLoader(
    tokenized_datasets['test'], batch_size=32, collate_fn=data_collator
)
def get_probs(model, dataloader):
  probs = []
  model = model.to(device)
  model.eval()
  m = nn.Softmax(dim=1)
  for batch in tqdm(dataloader):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
      outputs = model(**batch)
      logits = outputs.logits
      probs += m(logits).tolist()
  return np.array(probs)

We define the sentiment as:

  • negative or 0 if the rating is 3 or lower
  • positive or 1 if the rating is 4 or 5.

The reason we make 3 negative is that the distilBERT fine-tuned on SST2 classifies most 3 star ratings as negative, and reading through them, those reviews do seem mostly negative.

reviews_test['sentiment'] = (reviews_test['rating'] > 3).astype(int)

The following function computes the accuracy and recall (for each class). If the sentiment argument is set to True, the classes are the binary sentiment. If a model predicts ratings, they will be converted to sentiments inside the function. If the model predictions are binary, the sentiment argument passed is ignored and set to True.

The function also plots histograms of probabilities assigned by the model in a grid. The rows in the grid correspond to the true classes (rating or sentiment for the review) and the columns correspond to the probability predictions. Each histogram consists of the probabilities that the model assigns to the class for that column when restricting to the reviews with true class given by the row. If the model is performing well, we expect the probabilities on the diagonal of the grid to concentrate at 1 and for the remaining histograms to concentrate at 0.

def plot_rating_hists(probs, targets_df):
  fig, axs = plt.subplots(5, 5, figsize=(16, 8), constrained_layout=True)
  fig.suptitle(
    'Distribution of the predicted probabilities conditional on the true rating',
    size='x-large',
  )
  for y_true, ax_row in enumerate(axs):
    for y_pred, ax in enumerate(ax_row):
      sns.histplot(
          probs[targets_df['labels'] == y_true, y_pred],
          ax=ax,
          kde=True,
          bins=20,
      )
      if y_true == 0:
        ax.set_title(
            f'Probability that rating is {y_pred + 1}'
        )
      if y_pred == 0:
        ax.set_ylabel(f'True rating is {y_true + 1}', size='large')


def plot_sentiment_hists(probs, targets_df):
  fig, axs = plt.subplots(1, 2, figsize=(12, 6), constrained_layout=True)
  fig.suptitle(
    'Distribution of the probabilities that reviews are positive',
    size='x-large',
  )
  for sentiment in [0, 1]:
    sns.histplot(
        probs[targets_df['sentiment'] == sentiment],
        ax=axs[sentiment],
        kde=True,
        bins=20,
    )
    axs[sentiment].set_title(
            f"True sentiment is {['negative', 'positive'][sentiment]}"
    )


def evaluate_and_plot(checkpoint, dataloader, targets_df, sentiment=False):
  """targets_df must contain columns 'sentiment' (binary) and 'labels'"""
  model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
  probs = get_probs(model, dataloader)

  n_classes = probs.shape[1]
  assert n_classes in {2, 5}, 'Model must predict either rating or binary sentiment'

  if not sentiment and n_classes == 2:
    sentiment = True
    print(
      'Setting sentiment to True because the model is a binary classifier'
    )

  preds = np.argmax(probs, axis=1)
  if sentiment and n_classes == 5:
    preds = preds > 2
    probs = probs[:, 3:].sum(axis=1)
  
  if n_classes == 2:
    probs = probs[:, 1]

  if sentiment:
    plot_sentiment_hists(probs, targets_df)
  else:
    plot_rating_hists(probs, targets_df)
  
  target_col = 'sentiment' if sentiment else 'labels'
  
  return {
        'accuracy': accuracy_score(targets_df[target_col], preds),
        'recall': recall_score(targets_df[target_col], preds, average=None),
    }

Now we will use this function to evaluate what looks like the best checkpoint from training, at 17,000 steps (a little under 2 epochs). After 17,000 steps the model appears to start overfitting: the evaluation loss slowly increases again while the training loss keeps going down. By the start of the fourth epoch we are clearly overfitting: the evaluation loss increases dramatically while the training loss just keeps dropping.

evaluate_and_plot(
    os.path.join(PATH, 'models/best-run/checkpoint-17000'),
    test_dataloader,
    reviews_test,
)
{'accuracy': 0.5886,
 'recall': array([0.66078431, 0.44008056, 0.46909828, 0.52818991, 0.84428716])}

We see both in the recall and in the histograms that the model has a particularly hard time with 2 and 3 star ratings.

For example, for 3 star reviews, the model is more likely than not ($53\%$ of the time) to predict a different rating (mostly 2 or 4, but sometimes 1) than to predict the correct 3 star rating. Beyond that, even if it identifies the 3 star reviews correctly almost half of the time, it is very rarely "confident" in its prediction. To be fair, when it gets 3 star reviews wrong it's almost always predicting them to be 2 star reviews and sometimes 4 star reviews.

For 5 star ratings the model does pretty well, only occasionally mistaking them for 4 star reviews. Similarly for 1 star reviews.

The histograms for 4 star reviews are interesting: The probabilities in histogram (4,4) have twin peaks at 0 and 1. When it misclassifies 4 star reviews, it is mostly as 3 and 5 stars. But the probabilities it assigns to those mistaken predictions are different. For 5 star misclassifications there is a small peak of very confident predictions, whereas the 3 star misclassifications are less confident (mostly under 0.7). See the histograms at (4,3) and at (4,5).

To compare our model to the one fine-tuned on SST2 we need to evaluate the prediction of sentiment (rather than ratings). We do this in the next cell and get an accuracy of $0.883$. We also get a recall of $0.895$ for negative ratings and $0.864$ for positive ratings.

This prediction accuracy is similar to those reported in the literature for BERT models on comparable datasets. For example, the authors of the distilBERT fine-tuned on SST2 that we are using report an accuracy of $0.913$. See also Table 6 in the original BERT paper.

Our model's accuracy is 2 or 3 percentage points lower than the figures referenced above, but our dataset is also significantly noisier. Taking that into account, $0.883$ is a good result.

evaluate_and_plot(
    os.path.join(PATH, 'models/best-run/checkpoint-17000'),
    test_dataloader,
    reviews_test,
    sentiment=True,
)
{'accuracy': 0.8828, 'recall': array([0.89533333, 0.864     ])}

Now we evaluate the model which was fine-tuned on SST2 and is available on Hugging Face.

evaluate_and_plot(
    FINETUNED_SST,
    test_dataloader,
    reviews_test,
    sentiment=True,
)
{'accuracy': 0.8148, 'recall': array([0.81466667, 0.815     ])}

The model fine-tuned on SST2 has a lower accuracy of $0.815$ and also lower recall scores. It is remarkable how similar its recall scores for negative and positive reviews are: after rounding they are both $0.815$.

The model we fine-tuned does have significantly higher accuracy and recall than the one fine-tuned on the SST2 dataset. However, this is to be expected for two main reasons. First, podcast and movie reviews have pretty different distributions, so a model trained on one will naturally do worse on the other. Second, the sentiment labeling for SST2 was done on individual sentences by human judges, whereas our labels apply to entire reviews consisting of multiple sentences (which might express different sentiments even within the same review), and those labels come from star ratings, which are a noisier signal of sentiment than labels assigned by multiple people for the specific purpose of training a classifier.

All things considered, it is actually impressive how well the distilBERT fine-tuned on SST2 does on this data!

5. Evaluating the Models on some Interesting Reviews

Let's see how our model does compared to the SST2 distilBERT on the special examples we held out of the training set. We found those examples in a previous notebook by looking at misclassifications coming from VADER, and saved them because they are interesting and some of them seem to encapsulate peculiarities of podcast reviews. The idea is that a model trained directly on podcast reviews might do better on those.

holdout_reviews = reviews_raw.query('index in @holdout_ids')
holdout_dataset = Dataset.from_dict(holdout_reviews[['demojized review']])
tokenized_holdout = (
    holdout_dataset.map(partial(tokenize_function, tokenizer=tokenizer), batched=True)
                   .remove_columns(['demojized review'])
)
holdout_dataloader = DataLoader(
    tokenized_holdout, batch_size=16, collate_fn=data_collator
)
mymodel = AutoModelForSequenceClassification.from_pretrained(
                      os.path.join(PATH, 'models/best-run/checkpoint-17000')
)
myprobs = get_probs(mymodel, holdout_dataloader)

holdout_reviews[[f'{k} star prob' for k in range(1, 6)]] = myprobs
holdout_reviews['star pred'] = myprobs.argmax(axis=1)

pos_prob = myprobs[:, 3:].sum(axis=1)
holdout_reviews['positive prob mymodel'] = pos_prob
holdout_reviews['sentiment pred mymodel'] = (pos_prob > 0.5).astype(int)

sstmodel = AutoModelForSequenceClassification.from_pretrained(FINETUNED_SST)
sstprobs = get_probs(sstmodel, holdout_dataloader)
holdout_reviews['positive prob sstmodel'] = sstprobs[:, 1]
holdout_reviews['sentiment pred sstmodel'] = (sstprobs[:, 1] > 0.5).astype(int)
holdout_reviews['sentiment'] = (holdout_reviews['rating'] > 3).astype(int)
pd.crosstab(holdout_reviews['sentiment pred mymodel'], holdout_reviews['sentiment'])
sentiment 0 1
sentiment pred mymodel
0 11 1
1 1 4
pd.crosstab(holdout_reviews['sentiment pred sstmodel'], holdout_reviews['sentiment'])
sentiment 0 1
sentiment pred sstmodel
0 7 4
1 5 1

Clearly the results are much better for our model than for the model fine-tuned on SST2. The latter does much worse on these held-out reviews than on the test set. I swear I didn't cherry-pick them to make our model look good! But I did pick some of them because they seemed like interesting examples that are particular to the context of podcast reviews.

Let's go over some of the reviews to see why they are interesting examples.

First there are two reviews for two different horror-themed podcasts. I wondered if the distilBERT model would learn to classify them as positive even though they use what would be considered negative language in other contexts, and it appears to have worked!

holdout_reviews.loc[[11204, 11211], ['review', 'rating', 'positive prob mymodel', 'positive prob sstmodel', 'polarity score']]
review rating positive prob mymodel positive prob sstmodel polarity score
11204 The real stuff.... Genuinely disturbing horror... 5 0.948132 0.424419 -0.9390
11211 Best scare ever!. It sounds strange but I alwa... 5 0.984157 0.065295 -0.9027
holdout_reviews.loc[11204, 'review']
'The real stuff.... Genuinely disturbing horror!\nThese are "take out one of the earbuds" scary, the narration kills any disconnect you may have started with, SN makes you THERE!\nMOAR!'
holdout_reviews.loc[11211, 'review']
"Best scare ever!. It sounds strange but I always listen to horror stories through my headphones to help me fall asleep at night. But Knifepoint stories have literally kept me up all night hiding under the covers. I can't get enough of these terrifying stories!!"

Both models agree on the next review, but it nicely illustrates one of the issues with VADER. While the two distilBERT models are confident the review is negative, VADER gives it a high positive score because it contains the word "like" many times:

holdout_reviews.loc[956562, 'review']
'Like like like like like like like like like. I love the concept of this podcast - but just listening to 15 minutes I really couldn’t stand how many times all of the participants said LIKE. Literally unlistenable unless you want to hear a podcast that is 50% LIKE and 50% actual content.'
holdout_reviews.loc[956562, ['rating', 'positive prob mymodel', 'positive prob sstmodel', 'polarity score']]
rating                           1
positive prob mymodel     0.001567
positive prob sstmodel    0.008399
polarity score              0.9702
Name: 956562, dtype: object

Next there are two reviews discussing sound issues. Because this is a common complaint with podcasts, one might hypothesize that our model has learned that discussing the sound is usually associated with a negative rating. The results were mixed. The second review is arguably the harder case and our model gets it right (and the other model is extremely confident in its incorrect prediction). However, for some reason our model predicts that the first review is positive, albeit not with high confidence. Maybe "sound" is usually used in these critical reviews rather than "volume". We can't draw conclusions from just two reviews, of course.

holdout_reviews.loc[[9, 123052], ['rating', 'positive prob mymodel', 'positive prob sstmodel', 'polarity score']]
rating positive prob mymodel positive prob sstmodel polarity score
9 1 0.655020 0.004309 0.4749
123052 1 0.209684 0.933054 0.9515
holdout_reviews.loc[9, 'review']
'Volume???. Great podcast, but the editors turn the volume down for the talks. The intros are loud, then you have to crank up the volume for the talk.'
holdout_reviews.loc[123052, 'review']
"Want to love it. I love Colbert. And I really wanted to love this podcast. But I don't always listen to podcasts in a library where it's super quiet. The sound needs to be fixed so I can hear it while I'm going about my daily life. If they fix the sound I will definitely download it again."

Then there is a review complaining that the podcast is too political, a very common theme in 1 star reviews. As we might have expected, our model is much more confident that the review is negative than the one fine-tuned on SST2. The latter model usually assigns probabilities very close to 0 or 1, so 0.2 is pretty low confidence for that model.

VADER gets it completely wrong, presumably because of the word "best" and despite the word "too".

holdout_reviews.loc[2566, ['rating', 'positive prob mymodel', 'positive prob sstmodel', 'polarity score']]
rating                           1
positive prob mymodel     0.004463
positive prob sstmodel     0.20247
polarity score              0.6369
Name: 2566, dtype: object
holdout_reviews.loc[2566, 'review']
'Too Political. Talk about food. That’s what you do best.'

The following review contains mostly positive language (talking about how great the podcast used to be) but opens with the word "Unsubscribe". Sure enough, our model was very confident that it is a negative review, whereas the SST2 distilBERT and VADER predicted positive.

holdout_reviews.loc[14947, ['review', 'rating', 'positive prob mymodel', 'positive prob sstmodel', 'polarity score']]
review                    Unsubscribe. Was a huge supporter of the pod a...
rating                                                                    1
positive prob mymodel                                              0.001306
positive prob sstmodel                                             0.506458
polarity score                                                       0.6937
Name: 14947, dtype: object

Finally, the following positive review seems like a really hard one to classify and it's impressive that our fine-tuned distilBERT got it right! By contrast, the SST2 distilBERT and VADER were very confident that it is negative. See for yourself:

holdout_reviews.loc[48339, ['rating', 'positive prob mymodel', 'positive prob sstmodel', 'polarity score']]
rating                           5
positive prob mymodel     0.932284
positive prob sstmodel    0.005084
polarity score              -0.944
Name: 48339, dtype: object
holdout_reviews.loc[48339, 'review']
"This episode...all about failure.... Wow! I’m in tears! My first podcast review but it had to be done because this podcast spoke to me! \n\n I’ve spent a lot of time lately feeling bad about things I’ve missed because I  didn't lose this weight sooner, didn’t devote more time to my business sooner, didn’t figure out a way to get my irritability in check so my kids don’t have a mom that yells - could’ve, would've should’ve...these thoughts rotate through my head daily and make me feel terrible about myself and my life. But you know what bows the time. I’m not going to sit with regrets any longer!"

Here is the whole holdout dataframe. I mostly went over the reviews in which the distilBERT models disagree but you can see that they also agree in many cases. They are both generally superior to VADER.

holdout_reviews[['review', 'rating', 'positive prob mymodel', 'positive prob sstmodel', 'polarity score']].head(17) # Making sure all 17 rows are shown
review rating positive prob mymodel positive prob sstmodel polarity score
9 Volume???. Great podcast, but the editors turn... 1 0.655020 0.004309 0.4749
10 America’s Forgotten Working Class. This episod... 5 0.939482 0.995457 -0.7269
76 One-dur-ful. **Warning**\nIf you listen to the... 5 0.108636 0.335130 -0.1779
283 Rebroadcasts after rebroadcasts. This used to ... 1 0.008035 0.625919 0.7447
973 Everything else is better. I listen to a few c... 1 0.007977 0.003618 0.7311
1516 How does it work?. Worked great on my old Ipod... 1 0.060916 0.001340 0.6249
2566 Too Political. Talk about food. That’s what yo... 1 0.004463 0.202470 0.6369
11204 The real stuff.... Genuinely disturbing horror... 5 0.948132 0.424419 -0.9390
11211 Best scare ever!. It sounds strange but I alwa... 5 0.984157 0.065295 -0.9027
14947 Unsubscribe. Was a huge supporter of the pod a... 1 0.001306 0.506458 0.6937
15130 What happened??. Used to love it, but now it’s... 1 0.009618 0.018876 0.9773
48339 This episode...all about failure.... Wow! I’m ... 5 0.932284 0.005084 -0.9440
49428 Entertainment. Not quality.. I think many woul... 1 0.000944 0.001380 0.9100
123052 Want to love it. I love Colbert. And I really ... 1 0.209684 0.933054 0.9515
212768 Good show that needs a professional narrator. ... 1 0.494779 0.998852 0.9200
922494 A different show now.. I loved to old podcast ... 1 0.005331 0.884350 0.9913
956562 Like like like like like like like like like. ... 1 0.001567 0.008399 0.9702

6. On Model Confidence

Something that jumps out when looking at the distributions of predicted probabilities is that the distilBERT fine-tuned on SST2 is more confident of its predictions than our model. The former mostly assigns probabilities close to 0 and 1 whereas the latter outputs more probabilities in between.

Over time our model also gets more confident but never reaches that level of sharpness, which might be due to our data being noisier. Below we see the results for a relatively early checkpoint, at 6000 steps (0.6 epochs), and for the last checkpoint at 40,000 steps (4 epochs).

The 6000 steps model is actually not that different from the 17,000 steps model we saw above in terms of accuracy. However, its recall for negative reviews is significantly higher than for positive reviews, mostly because it classifies many 4 star reviews as 3 star reviews. Another difference is that the histograms are much less concentrated (less "confident") at 6000 steps.

The 40,000 steps model is clearly overfitting: by that point the evaluation loss has been going up for a while and the training loss has dropped dramatically. One symptom of this overfitting is the high confidence. The accuracy is actually worse than it was at 17,000 steps, but the histograms are much more concentrated at 0 and 1.
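To put a number on this growing confidence, one can compare the average maximum predicted probability across checkpoints (a sketch reusing get_probs and test_dataloader from above; the checkpoint paths are the same ones evaluated below):

for ckpt in ['checkpoint-6000', 'checkpoint-17000', 'checkpoint-40000']:
    model = AutoModelForSequenceClassification.from_pretrained(
        os.path.join(PATH, 'models/best-run', ckpt)
    )
    probs = get_probs(model, test_dataloader)
    print(f'{ckpt}: mean max predicted probability = {probs.max(axis=1).mean():.3f}')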

evaluate_and_plot(
    os.path.join(PATH, 'models/best-run/checkpoint-6000'),
    test_dataloader,
    reviews_test,
)
{'accuracy': 0.5662,
 'recall': array([0.60588235, 0.50151057, 0.51874367, 0.36597428, 0.84226491])}
evaluate_and_plot(
    os.path.join(PATH, 'models/best-run/checkpoint-40000'),
    test_dataloader,
    reviews_test,
)
{'accuracy': 0.5708,
 'recall': array([0.59117647, 0.48539778, 0.44883485, 0.53610287, 0.79271992])}
evaluate_and_plot(
    os.path.join(PATH, 'models/best-run/checkpoint-6000'),
    test_dataloader,
    reviews_test,
    sentiment=True,
)
{'accuracy': 0.8744, 'recall': array([0.93233333, 0.7875    ])}
evaluate_and_plot(
    os.path.join(PATH, 'models/best-run/checkpoint-40000'),
    test_dataloader,
    reviews_test,
    sentiment=True,
)
{'accuracy': 0.8762, 'recall': array([0.895, 0.848])}