By now, being a software engineer and a rock climber is somewhat cliché. This archetype is usually explained by the cerebral similarities between tricky climbing sequences and programming problems. My more cynical theory is that many software engineers avoided traditional sports in their youth, and climbing offers a friendly path to adult fitness. Whatever the cause, climber-engineers are in luck because climbing produces a lot of interesting data, allowing these folks to combine hobbies.
The climbing community creates a lot of technology. Prominent sites like Mountain Project, SuperTopo, and Gunks Apps immediately come to mind, but app stores are also littered with less polished projects. Add in the online technical commentary about gear, and you have quite a collection. To add to the pile, I decided to see if I could understand the climbing community better with data.
Framing The Data Problem
As mentioned above, Mountain Project is probably the most recognizable climbing app. It allows users to find climbing routes, review them, and contribute content like photos or helpful descriptions. It’s common to see folks at the base of a cliff scratching their heads and looking between their phone screen and the wall. These are climbers attempting to match a particular skinny tree on their phone with the real thing.
When climbers complete a climb, they can create a Mountain Project “tick”. The tick serves as a record of the climb, and it includes metadata like the location, difficulty, how the climber performed, and freeform notes. Ticks are available publicly (see mine), and climbers often use tick histories to search for partners.
I decided to capture some tick data and see what kind of questions I could answer about the climbing community. The analysis is ongoing, but here are some goals I have in mind:
Predicting Climbing Popularity
Climbing is exploding as a sport, with new gyms cropping up all over the United States. Many of these newcomers eventually explore outdoor crags. Predicting this outdoor traffic could help parks prepare for future demand. If I could predict popularity based on location, then I could also possibly avoid the crowds! Since ticks are associated with dates, this can be framed as a time series forecasting problem.
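To make the time series framing concrete, here is a minimal sketch that buckets ticks into monthly counts per location and then applies a seasonal naive baseline (predict the same month from the previous year). The (location, date) layout and the sample rows are assumptions for illustration, not the Mountain Project export schema, and the baseline is only a starting point that a real forecasting model would need to beat.

```python
from collections import Counter
from datetime import date


def monthly_tick_counts(ticks):
    """Count ticks per (location, year, month).

    `ticks` is an iterable of (location, iso_date) pairs. This layout is
    hypothetical and chosen only to keep the example small.
    """
    counts = Counter()
    for location, iso_date in ticks:
        d = date.fromisoformat(iso_date)
        counts[(location, d.year, d.month)] += 1
    return counts


def seasonal_naive_forecast(counts, location, year, month):
    """Predict a month's traffic as the count seen in the same month last year."""
    return counts.get((location, year - 1, month), 0)


# Made-up sample rows.
sample = [
    ("Gunks", "2018-10-06"),
    ("Gunks", "2018-10-07"),
    ("Gunks", "2018-11-03"),
    ("Red River Gorge", "2018-10-20"),
]
counts = monthly_tick_counts(sample)
forecast = seasonal_naive_forecast(counts, "Gunks", 2019, 10)
```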
Recommending Climbing Partners
As I mentioned above, many Mountain Project users scour the app looking for partners based on ticks. This mundane task could potentially be automated with a recommender system. Recommender systems typically compare users by their choices on a platform (Spotify songs, Netflix shows, etc.). We can model ticks as these choices, which enables us to recommend either climbing routes or climbing partners. Note that this assumes that two very similar climbers would make good climbing partners, an assumption I am making based on experience.
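One simple way to compare climbers by their choices is the Jaccard similarity between the sets of routes they have ticked. The sketch below is one possible starting point under that assumption, not a full recommender system; the usernames and tick sets are made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend_partners(ticks_by_user, user, top_n=3):
    """Rank other users by how much their ticked routes overlap with `user`'s."""
    mine = ticks_by_user[user]
    scored = [
        (jaccard(mine, routes), other)
        for other, routes in ticks_by_user.items()
        if other != user
    ]
    # Highest similarity first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored[:top_n] if score > 0]


# Hypothetical users and tick sets.
ticks_by_user = {
    "ana": {"High Exposure", "Shockley's Ceiling", "Bonnie's Roof"},
    "ben": {"High Exposure", "Bonnie's Roof", "Modern Times"},
    "cal": {"Midnight Lightning"},
    "dee": {"High Exposure"},
}
partners = recommend_partners(ticks_by_user, "ana")
```

Set overlap ignores grade, style, and how recently someone climbed, so a real system would likely weight those as well.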
Understanding Climbing Archetypes
If we can compare climbers by their tick histories, then we can also try to segment them. Businesses often perform cluster analysis on their customers to try and understand the different personas they attract. In my case, I just think it would be fun to see if I can find hard evidence of climbing myths like the Gumby, the Trad Dad, the Rope Gun, or the Solemn Boulderer.
Getting the Tick Data
So of course, first I had to get some tick data. Note that to respect Mountain Project’s terms of service, I do not post the complete code I used to do this or the data itself.
Perusing the site, I noticed that each user’s ticks page has a handy “Export CSV” button, which downloads a CSV of their ticks! Using a powerful Python web scraping tool called Scrapy, I was able to cobble together a crawler that looks for ticks pages and downloads a CSV for each one. If you want to try this out, remember two things:
- Be nice to the site you are scraping and use Scrapy’s AutoThrottle
- Use Scrapy Jobs so that your crawler can start and stop
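Both tips map to Scrapy settings. The fragment below is a sketch of what a project's settings.py could contain; the delay values and the job directory name are illustrative assumptions, not the configuration used for this crawl.

```python
# settings.py (fragment) for a Scrapy project

# AutoThrottle adjusts the download delay based on how quickly the site responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial delay in seconds (assumed value)
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound on the delay (assumed value)
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for about one request in flight at a time

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Persist crawl state to disk so the crawl can be stopped and resumed.
# The directory name is hypothetical.
JOBDIR = "crawls/ticks-run-1"
```

The job directory can also be passed on the command line with `scrapy crawl <spider_name> -s JOBDIR=crawls/ticks-run-1`. Running the same command again with the same directory resumes the crawl.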
My crawler ran for about 4 days straight on my laptop, eventually completing with about 83K CSV files! Using my expert StackOverflow search skills, I found this answer to help me combine them into a single CSV file.
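For readers who would rather stay in Python, combining the files could look roughly like the sketch below. It is not the StackOverflow answer mentioned above. It assumes every export shares the same header row, and the column names in the demo are made up. Using the csv module rather than plain string concatenation keeps quoted, multi-line notes fields intact.

```python
import csv
import tempfile
from pathlib import Path


def combine_csvs(input_dir, output_path):
    """Concatenate every *.csv in input_dir, writing the header only once.

    Returns the number of data rows written.
    """
    output_path = Path(output_path).resolve()
    header = None
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        # Use "\n" explicitly: csv.writer's default terminator is "\r\n".
        writer = csv.writer(out, lineterminator="\n")
        for path in sorted(Path(input_dir).glob("*.csv")):
            if path.resolve() == output_path:
                continue  # do not read the file we are writing
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Small demo with two made-up export files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "user_a.csv").write_text("Date,Route\n2019-05-04,High Exposure\n", encoding="utf-8")
    (tmp / "user_b.csv").write_text("Date,Route\n2019-06-02,Bonnie's Roof\n", encoding="utf-8")
    total = combine_csvs(tmp, tmp / "all_ticks.csv")
    combined = (tmp / "all_ticks.csv").read_text(encoding="utf-8")
```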
Tasting the knowledge ahead, I rushed to get this CSV into a queryable format. My first idea was to stage the file on Google Cloud Storage and import it into BigQuery for exploration. BigQuery supports this feature, so I thought this would be trivial, but I was naive to a major peril: data cleaning.
Cleaning the Data For Import
When BigQuery tries to ingest a CSV, it fails upon encountering errors (you can configure how many errors to allow before failure). These errors often refer to a specific position in the file. When I encountered these errors, I jumped to the position in the file by opening it in vim (slow if the file is large) and jumping there with goto.
In my case, I learned that some of the ticks contained a carriage return character. This character is actually somewhat difficult to create on a Mac, but ultimately I was able to simply remove it from the file using vim regex commands. I got lucky: this was all that I needed for BigQuery to accept the data.
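The same inspect-and-remove steps can be scripted, which avoids opening a large file in an editor. This is a sketch rather than the vim commands I used. It assumes the reported position is a byte offset, and it drops every carriage return in the file, including any that belong to Windows-style line endings.

```python
import tempfile
from pathlib import Path


def context_at(path, offset, width=40):
    """Return the raw bytes around a byte offset, to inspect a reported error position."""
    with open(path, "rb") as f:
        f.seek(max(0, offset - width))
        return f.read(2 * width)


def strip_carriage_returns(src_path, dst_path, chunk_size=1 << 20):
    """Copy src to dst without carriage return bytes. Returns how many were removed."""
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            removed += chunk.count(b"\r")
            dst.write(chunk.replace(b"\r", b""))
    return removed


# Demo on a tiny made-up file with a stray carriage return inside a notes field.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "ticks.csv"
    dst = Path(tmp) / "ticks_clean.csv"
    src.write_bytes(b"Date,Notes\n2019-05-04,fun\rclimb\n")
    snippet = context_at(src, 25, width=5)
    removed = strip_carriage_returns(src, dst)
    cleaned = dst.read_bytes()
```

Reading in fixed-size chunks keeps memory use flat regardless of file size, and because the character being removed is a single byte, chunk boundaries cannot split it.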
Exploring the Data
At long last, tick data was at my fingertips! I started by querying some fast facts to understand what I was working with:
- 3,849,902 ticks
- 114,992 routes
- 85,143 users
- 27,433 crags
Next, I wanted to answer some questions along a variety of dimensions. Check out the captions for my assessment of each image.

Tick distribution across users. The vast majority have not made many ticks, while a few outliers have created a few thousand. It’s not clear whether this means many climbs go un-ticked, or the vast majority of climbs are completed by a small group.

Just for fun: a word cloud of text used in tick notes. Note that “OS” and “O” refer to onsight (when the climber sends the climb on the first try, with no prior information).
Attempting Climber Segmentation
I decided that first I would see if I could discover climber archetypes. I chose this one first because it was fun and because BigQuery supports k-means clustering out of the box. While k-means isn’t the only clustering method or necessarily the best one for this task, I figured it was low-hanging fruit.
The first problem I encountered was that I had a table of ticks but I wanted to cluster users. I needed a way to map ticks to users. Based on some research and advice from friends, I learned that there is no standard procedure for this.
In an example where companies are clustering based on stock data, a column is created for every day, where the values are the changes in stock price for each company. When I looked at the RFM technique, commonly used for user segmentation, I found that “categories may be derived from business rules or using data mining techniques to find meaningful breaks.” In this assessment of Instacart users, Dimitre Linde describes the features he builds from the purchase data. It seems like the real art of clustering comes from the feature extraction.
I decided that I wanted to understand climbers by both how often they do different activities and which activities they do. I also thought about the personalities I suspected and tried to tailor the columns to them. Ultimately I settled on the following categories:
- months climbing
- number of ticks
- number of ticks on a weekday
- number of ticks on a weekend
- number of trad climbs
- number of sport climbs
- number of boulder routes
- number of multipitch climbs
- number of bigwall climbs
- hardest grade climbed
- mean grade climbed
- number of locations they ticked
- number of states they climbed in
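Turning a table of ticks into one row per user is a group-and-aggregate step. The sketch below covers a subset of the categories above. The field names (`user`, `date`, `style`, `location`) are hypothetical, and "months climbing" is assumed here to mean the inclusive span between a user's first and last tick.

```python
from collections import defaultdict
from datetime import date


def user_features(ticks):
    """Aggregate tick rows (dicts) into one feature dict per user."""
    by_user = defaultdict(list)
    for t in ticks:
        by_user[t["user"]].append(t)

    features = {}
    for user, rows in by_user.items():
        dates = [date.fromisoformat(r["date"]) for r in rows]
        first, last = min(dates), max(dates)
        # weekday() returns 5 for Saturday and 6 for Sunday.
        weekend = sum(1 for d in dates if d.weekday() >= 5)
        features[user] = {
            "months_climbing": (last.year - first.year) * 12 + (last.month - first.month) + 1,
            "n_ticks": len(rows),
            "n_weekday": len(rows) - weekend,
            "n_weekend": weekend,
            "n_trad": sum(1 for r in rows if r["style"] == "trad"),
            "n_sport": sum(1 for r in rows if r["style"] == "sport"),
            "n_boulder": sum(1 for r in rows if r["style"] == "boulder"),
            "n_locations": len({r["location"] for r in rows}),
        }
    return features


# Made-up sample ticks.
ticks = [
    {"user": "ana", "date": "2019-05-04", "style": "trad", "location": "Gunks"},
    {"user": "ana", "date": "2019-07-10", "style": "sport", "location": "Rumney"},
    {"user": "ben", "date": "2019-06-02", "style": "boulder", "location": "Gunks"},
]
features = user_features(ticks)
```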
Note that to make climbing grades comparable, I converted from the US system (which has numbers and letters) to the Australian system (which has ordered numbers) using this chart. Bouldering grades can also be converted to this system.
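In code, the conversion is a lookup table followed by ordinary numeric aggregation. The sketch below uses a small, approximate subset of US-to-Australian equivalents purely for illustration. Published charts disagree by a grade or so in places, and these values are not taken from the chart referenced above, so swap in whichever chart you trust.

```python
# Approximate, partial mapping from US (YDS) grades to Australian (Ewbank) numbers.
# Illustrative values only; replace with your chosen conversion chart.
YDS_TO_EWBANK = {
    "5.9": 17,
    "5.10a": 18,
    "5.10b": 19,
    "5.10c": 20,
    "5.11a": 21,
    "5.12a": 24,
    "5.13a": 28,
}


def to_ewbank(yds_grade, table=YDS_TO_EWBANK):
    """Return the numeric grade, or None when the grade is not in the table."""
    return table.get(yds_grade.strip().lower())


def hardest_and_mean(grades):
    """Hardest and mean numeric grade, ignoring grades the table cannot convert."""
    values = [g for g in map(to_ewbank, grades) if g is not None]
    if not values:
        return None, None
    return max(values), sum(values) / len(values)


# "5.15z" is deliberately not a real grade, to show unknown values being skipped.
hardest, mean = hardest_and_mean(["5.9", "5.10a", "5.11a", "5.15z"])
```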
Unfortunately, my results were somewhat disappointing. I performed k-means clustering for 2, 3, and 4 clusters, but in all cases, the clusters clearly broke down by climbing time. BigQuery shows the centroid value for each feature, allowing me to get a feel for the meaning of the clusters.
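For readers unfamiliar with what k-means does under the hood, here is a minimal pure-Python sketch of the standard assign-then-update loop (Lloyd's algorithm). It is not BigQuery's implementation and it skips feature scaling. The two-feature points (months climbing, number of ticks) are made up.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iterations=50, seed=0):
    """Cluster `points` into k groups. Returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the centroids are stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Made-up (months_climbing, n_ticks) rows: three casual and three prolific climbers.
points = [(1, 10), (2, 12), (1, 11), (36, 400), (40, 380), (38, 420)]
centroids, assignments = kmeans(points, k=2)
```

Because distances are computed on raw values, a feature with a large range (such as total ticks) can dominate the result unless the features are scaled first.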

It’s still not clear whether this was due to poor feature development or whether this is genuinely the best way to segment climbers. After all, anecdotally, differences in volume do seem very meaningful among my climbing friends.
I tried a second experiment where I used percentages for the columns instead of absolute numbers to eliminate the differences from simply climbing more. This time, users seemed to segment mostly by time multipitching and hardest grade.
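Converting counts to percentages amounts to dividing each count column by the user's total ticks. A minimal sketch with hypothetical feature names shows how two climbers with very different volume end up with identical rows:

```python
def to_shares(features, count_keys, total_key="n_ticks"):
    """Express each count as a fraction of the user's total ticks."""
    total = features[total_key]
    if total == 0:
        return {"share_" + key: 0.0 for key in count_keys}
    return {"share_" + key: features[key] / total for key in count_keys}


# Two made-up climbers: same mix of styles, very different volume.
prolific = {"n_ticks": 200, "n_trad": 150, "n_sport": 50}
casual = {"n_ticks": 20, "n_trad": 15, "n_sport": 5}

prolific_shares = to_shares(prolific, ["n_trad", "n_sport"])
casual_shares = to_shares(casual, ["n_trad", "n_sport"])
```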

Overall, I wouldn’t say clustering has yielded anything very meaningful yet. From my reading it seems notoriously fickle, since it is totally unsupervised. Next steps would be to attempt PCA so that I can visualize these clusters and see how logical they look. I may also try to derive more complex features like “how far they travel” and “how often they go on climbing trips”.