Diving Deep: An Analysis of the r/RunningShoeGeeks Subreddit
September 5, 2024
As an avid runner training for my second marathon, I find myself literally running my shoes into the ground with all the training I do. I have covered about 350 km in my New Balance Supercomp Trainer V2, and I am now on the hunt for another workhorse shoe for when I reach the recommended 800 km mark.
I've found the r/RunningShoeGeeks subreddit to be an excellent resource for discovering the best shoes for various types of running and terrains. In this blog post, I'll be conducting a basic analysis of the subreddit to explore the shoes and topics that are actively discussed within the running community.
About the Analysis
I collected data from the subreddit using the Reddit API, focusing on posts and comments from August 2024. Using Python and libraries such as PRAW, pandas, and NLTK, I processed and analyzed the data to uncover trends in discussions, popular brands, and common concerns among running shoe enthusiasts.
The dataset collected included 231 submissions and 8,108 comments from the specified one-month period. For each post, I captured metadata such as title, body text, timestamp, score, number of comments, and author. Similarly, for comments, I collected details including comment ID, text, score, timestamp, and author. This comprehensive dataset formed the basis for my in-depth analysis of the running shoe community's interests and behaviors.
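A minimal sketch of how the collection step could look. The UTC window, the helper names (`in_window`, `submission_to_row`, `collect_posts`), and the "[deleted]" placeholder are assumptions for illustration; the post only says that PRAW was used and which fields were captured.

```python
from datetime import datetime, timezone

# Assumed window: all of August 2024 in UTC (the post does not state a timezone).
START = datetime(2024, 8, 1, tzinfo=timezone.utc).timestamp()
END = datetime(2024, 9, 1, tzinfo=timezone.utc).timestamp()


def in_window(created_utc, start=START, end=END):
    """Return True if a Unix timestamp falls inside [start, end)."""
    return start <= created_utc < end


def submission_to_row(submission):
    """Flatten a PRAW-style submission into a plain dict (one row per post)."""
    return {
        "id": submission.id,
        "title": submission.title,
        "body": submission.selftext,
        "created_utc": submission.created_utc,
        "score": submission.score,
        "num_comments": submission.num_comments,
        # PRAW sets author to None when the account has been deleted.
        "author": str(submission.author) if submission.author else "[deleted]",
    }


def collect_posts(submissions):
    """Keep only submissions from the window and turn them into rows."""
    return [submission_to_row(s) for s in submissions if in_window(s.created_utc)]


# Usage with PRAW (needs your own API credentials; the values are placeholders):
#   import praw
#   import pandas as pd
#   reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
#   listing = reddit.subreddit("RunningShoeGeeks").new(limit=None)
#   posts = pd.DataFrame(collect_posts(listing))
```

Comments can be flattened the same way, with one row per comment holding its ID, text, score, timestamp, and author.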
Overall Sentiment
Using VADER sentiment analysis, I found that the community's sentiment is generally positive, with most discussions classified as either positive or neutral:
- Positive: 59.95%
- Neutral: 26.06%
- Negative: 13.98%
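Percentages like these are typically derived by thresholding VADER's compound score. The sketch below assumes the commonly used cut-offs of ±0.05; the post does not say which thresholds were applied, and the function names are hypothetical.

```python
from collections import Counter


def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def sentiment_shares(compound_scores):
    """Return the percentage of texts that fall in each sentiment class."""
    labels = Counter(label_from_compound(c) for c in compound_scores)
    total = sum(labels.values())
    if total == 0:
        return {"positive": 0.0, "neutral": 0.0, "negative": 0.0}
    return {
        name: round(100 * labels[name] / total, 2)
        for name in ("positive", "neutral", "negative")
    }


# Getting compound scores with NLTK's VADER (requires a one-time lexicon download):
#   import nltk
#   from nltk.sentiment.vader import SentimentIntensityAnalyzer
#   nltk.download("vader_lexicon")
#   sia = SentimentIntensityAnalyzer()
#   scores = [sia.polarity_scores(text)["compound"] for text in texts]
#   print(sentiment_shares(scores))
```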
Top Redditors
The top contributors to the subreddit included the AutoModerator bot and several deleted accounts. However, the most active human contributor was "Trick_ad5549," who made 126 contributions—just 8 fewer than the AutoModerator bot.
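A ranking like this can be produced by counting author names across posts and comments. In this rough sketch, the exclusion set and the function name are assumptions rather than details from the analysis.

```python
from collections import Counter

# Assumption: these author values are treated as non-human and skipped.
EXCLUDED_AUTHORS = {"AutoModerator", "[deleted]", None}


def top_contributors(authors, n=5, excluded=EXCLUDED_AUTHORS):
    """Return the n most frequent authors, ignoring bots and deleted accounts."""
    counts = Counter(a for a in authors if a not in excluded)
    return counts.most_common(n)
```

Passing `excluded=set()` instead gives the unfiltered ranking, which is how the AutoModerator bot and deleted accounts end up at the top.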
Most Discussed Brands
To find out the most discussed brands, I used NLTK to tokenize the comments and count the frequency of each brand mention. I then used pandas to sort the brands by frequency and plot the results.
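The counting step can be sketched as follows. To stay dependency-free, a simple regular expression stands in for the NLTK tokenizer, and the brand set is a short illustrative list rather than the full one used in the analysis.

```python
import re
from collections import Counter

# Illustrative brand list; the real one is much longer.
BRANDS = {"asics", "nike", "adidas", "hoka", "saucony"}


def tokenize(text):
    """Lowercase a comment and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_brand_mentions(comments, brands=BRANDS):
    """Count how often each known brand appears across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(tok for tok in tokenize(comment) if tok in brands)
    return counts


# Sorting and plotting with pandas (optional):
#   import pandas as pd
#   pd.Series(count_brand_mentions(comments)).sort_values(ascending=False).plot.bar()
```

Single-token matching misses multi-word names such as "New Balance", so those have to be normalized to one token, or matched as phrases, before counting.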
It's surprising to see that Asics is the most discussed brand, given that Nike still has the highest market share. Hoka, a relatively new brand, has also found its way into the top 5 most discussed brands.
Most Discussed Shoe Models
This was the biggest surprise. The Asics Superblast was mentioned over 500 times, more than twice as often as the next shoe, the Novablast, which is also from Asics. They are followed by the Adidas Adios Pro and then the Nike Pegasus.
I am happy to see the Nike Pegasus still being highly discussed, as I have had multiple versions of these shoes in the past and have had great success with them.
Top Brands Based on Sentiment Score
Using VADER sentiment analysis on posts and comments where brands were mentioned, I calculated the sentiment score for each brand.
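One way to compute a per-brand score is to average the compound score of every text that mentions the brand. The post does not say how the scores were aggregated, so the mean is an assumption here, as are the hypothetical `min_mentions` parameter and the function name.

```python
import re
from collections import defaultdict
from statistics import mean


def brand_sentiment(scored_texts, brands, min_mentions=1):
    """Rank brands by the mean compound score of the texts that mention them.

    scored_texts: iterable of (text, compound_score) pairs.
    min_mentions: skip brands mentioned in fewer texts than this.
    """
    scores = defaultdict(list)
    for text, compound in scored_texts:
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        for brand in brands:
            if brand in tokens:
                scores[brand].append(compound)
    ranked = {b: mean(v) for b, v in scores.items() if len(v) >= min_mentions}
    return sorted(ranked.items(), key=lambda item: item[1], reverse=True)
```

With a mean, a brand that is mentioned only a handful of times can outrank a heavily discussed one, so raising `min_mentions` is a simple way to check how stable a ranking is.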
We see that the highest scoring brands are not the same as the most discussed brands. This suggests that when these brands are mentioned, it is with high praise or for a specific use case. For example, VivoBarefoot, which is known for its zero-drop, flexible, and foot-shaped shoes, leads the brands chart, followed by Diadora and Vibram (Didn't they make those ugly five finger shoes?). In contrast, more widely known brands like Asics, Nike, and New Balance do not appear on this list.
For the top shoe models, Nike's Vaporfly holds the first position, followed by Adidas' Adizero Boston and Altra's Lone Peak. Nike seems to have left a good impression on the community with the Vaporfly, as it is the most discussed Nike shoe model.
Topic Modeling
I used LDA (Latent Dirichlet Allocation) to find the most frequent topics in the comments. LDA is a topic modeling technique that uses the concept of "latent" topics to group similar words together. It is a probabilistic approach that attempts to find the underlying topics in a corpus of text. I arrived at 3 topics as seen in these word clouds:
Topic 1: Community Interaction and Product Experience Sharing
This topic appears to be centered around community interactions and discussions within the subreddit. Words like “please,” “thanks,” “comment,” “give,” and “share” suggest a focus on polite exchanges, requests for information, and sharing experiences. Other words like “review,” “track,” “experience,” and “information” indicate that users are discussing their experiences with various running shoes, likely providing feedback or reviews.
Topic 2: Shoe Performance and Fit
This topic seems to revolve around the performance characteristics and fit of running shoes. Words like “feel,” “trainer,” “speed,” “foam,” “fit,” “upper,” and “heel” indicate discussions about how the shoes perform in terms of comfort, speed, and suitability for different types of runs (e.g., races, daily runs). There is also mention of shoe components like “midsole” and “upper,” suggesting that users are discussing specific aspects of shoe design and how they affect performance.
Topic 3: Shoe Brands, Models, and Purchasing Decisions
This topic appears to focus on discussions about specific shoe brands and models, along with purchasing decisions. Words like “nike,” “adidas,” “asics,” “pair,” “get,” “size,” “new,” “price,” “buy,” “store,” “sale,” and “deal” indicate that users are talking about different shoe brands and models, their availability, pricing, sales, and perhaps how to get the best deals. The presence of words like “brand,” “model,” and “color” suggests that these discussions may also involve preferences for certain brands or models based on their design and features.
Notes
Tuning the parameters for the LDA model to achieve a clear separation of topics was challenging. I adjusted the no_above parameter (to exclude words that appear in more than a certain percentage of documents) to a very low value. This was necessary to filter out common words like “running” and “shoes,” as they don’t provide meaningful topic differentiation given the subreddit’s focus. I also used the no_below parameter (to exclude words that appear in fewer than a certain number of documents) to remove infrequent words that didn’t contribute to the overall topics.
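To make the effect of the two parameters concrete, here is a small pure-Python sketch of document-frequency filtering. The parameter names match gensim's `Dictionary.filter_extremes`, which is presumably what was used, though the post does not name the library; the thresholds shown are placeholders.

```python
from collections import Counter


def filter_extremes(docs, no_below=2, no_above=0.5):
    """Drop tokens that are too rare or too common across documents.

    docs: list of token lists.
    no_below: keep tokens that appear in at least this many documents.
    no_above: keep tokens that appear in at most this fraction of documents.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    n_docs = len(docs)
    keep = {
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }
    return [[tok for tok in doc if tok in keep] for doc in docs]


# Rough gensim equivalent, followed by a 3-topic LDA model (values are placeholders):
#   from gensim.corpora import Dictionary
#   from gensim.models import LdaModel
#   dictionary = Dictionary(docs)
#   dictionary.filter_extremes(no_below=5, no_above=0.1)
#   corpus = [dictionary.doc2bow(doc) for doc in docs]
#   lda = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10)
```

In a running-shoe corpus, words like "running" and "shoes" appear in nearly every document, so even a moderate `no_above` removes them, while `no_below` trims one-off tokens.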
Compiling a comprehensive list of brands and shoe models took some effort. I had to search online for detailed lists and normalize different variations of the same brand or model name. For instance, I standardized “New Balance” and “NB,” as well as “AP3” and “Adios Pro 3.”
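A normalization step along these lines can be sketched with an alias table and a whole-word regular expression. The two pairs below are the ones named above; the mapping direction (short form to full name) and the function name are assumptions.

```python
import re

# Alias -> canonical name. A real table would be much longer.
ALIASES = {
    "nb": "new balance",
    "ap3": "adios pro 3",
}

# Match whole words only, trying longer aliases first.
_ALIAS_PATTERN = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, ALIASES), key=len, reverse=True))
    + r")\b"
)


def normalize_names(text):
    """Lowercase the text and replace known aliases with canonical names."""
    return _ALIAS_PATTERN.sub(lambda m: ALIASES[m.group(1)], text.lower())
```

The word boundaries matter: without them, "nb" would also be replaced inside unrelated words.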
During data preprocessing, I made a key decision not to remove the word “on” if it was followed by “running.” This was important because “On Running” refers to the Swiss running shoe brand “On,” and removing it could have resulted in the loss of valuable information related to that brand in the analysis.
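The rule can be expressed as a look-ahead during stopword removal. This is a minimal sketch with a tiny illustrative stopword set; the function name is hypothetical.

```python
# Tiny illustrative stopword set; a real one (e.g. NLTK's) is much larger.
STOPWORDS = {"on", "the", "a", "in", "my", "is", "and"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Remove stopwords but keep "on" when it is followed by "running"."""
    kept = []
    for i, tok in enumerate(tokens):
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "on" and next_tok == "running":
            kept.append(tok)  # likely the brand "On Running", so keep it
        elif tok not in stopwords:
            kept.append(tok)
    return kept
```

This only protects the exact phrase "on running", so a mention like "my On Cloudmonster" still loses the brand name, and a phrase like "on running days" keeps a stray "on".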
I think this analysis was a good exercise in scraping social media data and gaining a better understanding of it. I learned a lot about how different brands and shoes are perceived by the community, and I can't wait to get my hands on a pair of Asics Superblasts when I retire my New Balances!