In a previous post we trained a recommender on a million ratings from Apple Podcasts. However, we didn't use the text of the reviews, which is an additional source of signal about user preference. Some of that signal can be extracted with sentiment analysis. In this post we will do so using two methods: VADER and BERT.
Oct 21, 2022
In this post we will use the Hugging Face API and PyTorch to fine-tune DistilBERT on podcast reviews, training it to predict the rating from the title and body of each review. By converting ratings to sentiment, this also gives us a sentiment classifier. We will use Ray Tune for hyperparameter search and evaluate the model in several ways.
Oct 21, 2022
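Two preprocessing steps implied above can be sketched in plain Python: joining each review's title and body into a single input, and mapping star ratings to sentiment labels. The 1–2 / 3 / 4–5 thresholds are an assumption for illustration, not necessarily what the post uses:

```python
def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to a coarse sentiment label.
    Assumed thresholds: 1-2 negative, 3 neutral, 4-5 positive."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"

def build_input(title: str, body: str) -> str:
    """Concatenate a review's title and body into one text input."""
    return f"{title}. {body}"

text = build_input("Great show", "I listen every week and always learn something.")
label = rating_to_sentiment(5)
```

With this mapping, the same model head can be trained either on the five rating classes or on the three sentiment classes.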
We will explore a source of data leakage in the popular Titanic competition on Kaggle: passengers traveling together have similar survival outcomes, and this correlation can be used to make predictions for the test set in a way that would be impossible in reality. To prevent the leakage we implement a leak-proof cross-validation scheme. We compare the accuracy of an XGBoost classifier against various baselines to investigate what role, if any, the leakage plays in our classifier's predictions.
Apr 11, 2022
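A leak-proof split of this kind can be sketched with scikit-learn's `GroupKFold`, which keeps every group, e.g. passengers sharing a ticket, entirely inside one fold. The toy data and the choice of ticket as the grouping key are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-in for Titanic passengers: the group id could be the ticket
# number, so passengers traveling together share a group (assumed setup).
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # passengers sharing a ticket

gkf = GroupKFold(n_splits=2)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    # No group straddles the split, so group-level leakage is impossible.
    assert train_groups.isdisjoint(test_groups)
```

An ordinary `KFold` would happily place two members of the same traveling party on opposite sides of the split, letting the model exploit their shared outcome.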
We train a podcast recommender using matrix-factorization-based collaborative filtering. Visualizing the resulting latent factors gives us some insight into what the model has learned.
Mar 19, 2022
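The core idea, learning low-rank user and item factors from the observed ratings, can be sketched in plain NumPy with gradient descent. The toy ratings, rank, learning rate, and regularization below are illustrative assumptions, not the post's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny user-item rating matrix (0 = missing); purely illustrative data.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0          # fit only the observed entries

k = 2                 # number of latent factors (assumed)
U = 0.1 * rng.standard_normal((R.shape[0], k))   # user factors
V = 0.1 * rng.standard_normal((R.shape[1], k))   # item factors

def loss(U, V):
    """Squared error on observed entries."""
    E = (R - U @ V.T) * mask
    return float((E ** 2).sum())

initial_loss = loss(U, V)
lr, reg = 0.01, 0.02
for _ in range(500):
    E = (R - U @ V.T) * mask
    U += lr * (E @ V - reg * U)      # gradient step on user factors
    V += lr * (E.T @ U - reg * V)    # gradient step on item factors
```

The rows of `U` and `V` are the latent factors; plotting them is what lets us inspect what the model has learned about users and podcasts.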
We explore the possible effects of student groups on final grades. To do so, we implement hypothesis tests from three different paradigms: a permutation test, a semi-parametric bootstrap test, and ANOVA. As part of the analysis we compare the different tests through simulation.
Mar 19, 2022
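The permutation test from that list can be sketched directly: shuffle grades across groups and count how often the shuffled between-group spread is at least as large as the observed one. The grades and group sizes below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical final grades for three student groups (made-up numbers).
grades = {"A": np.array([72.0, 85, 78, 90, 66]),
          "B": np.array([80.0, 82, 75, 88]),
          "C": np.array([60.0, 70, 65, 72, 68, 74])}

def between_group_stat(groups):
    """Spread of group means around the grand mean (larger = stronger group effect)."""
    pooled = np.concatenate(list(groups))
    grand = pooled.mean()
    return sum(len(g) * (g.mean() - grand) ** 2 for g in groups)

observed = between_group_stat(grades.values())

# Permutation test: shuffle grades across groups, keeping group sizes fixed.
pooled = np.concatenate(list(grades.values()))
sizes = [len(g) for g in grades.values()]
n_perm = 2000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    chunks = np.split(perm, np.cumsum(sizes)[:-1])
    if between_group_stat(chunks) >= observed:
        count += 1
p_value = (count + 1) / (n_perm + 1)   # add-one correction
```

Unlike ANOVA, this test makes no normality assumption: its null distribution comes entirely from relabeling the observed grades.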