TL;DR
In this post I describe fine-tuning GPT-2, a powerful natural language model, for the task of generating fake rock climbing route descriptions. A climbing route description uses technical climbing jargon to describe a specific pathway up a rock face, and each one is unique. As seems to be common with GPT-2, I found that the model accurately captures the tone and domain-specific language, but it sometimes generates routes that would be physically impossible. It is also prone to repeating itself. At the bottom, I provide a Colab where anyone can download the fine-tuned model and try it out.
The Project
Since I created a rock climbing dataset a few months ago (described in this post), I’ve been thinking about ways to use it for home projects. Lately, I’ve been exploring concepts in NLP (Natural Language Processing) and trying to understand the state-of-the-art capabilities. Since the dataset contains a lot of textual information about climbing routes and logged climbs from the community, I figured there was an NLP opportunity.
At first, I considered trying to find climbing routes similar to an input route using BERT, a powerful natural language model that can be applied to a variety of applications. In a nutshell, BERT would transform rock climbing route descriptions into embeddings that can be mathematically compared. Then, a vector similarity search tool like ScaNN could use these embeddings to find the closest matches for a given route.
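To make the idea concrete, here's a minimal sketch of that pipeline. The model choice, the mean-pooling, and the brute-force cosine search are my own assumptions for illustration; ScaNN would replace the brute-force search at scale.

```python
# Embed route descriptions with BERT, then find the closest match
# to a query description by cosine similarity.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # Mean-pool the last hidden states into one vector per description.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

routes = [
    "Jam the splitter crack to a bolted anchor.",
    "Delicate slab climbing past three bolts.",
    "Follow the hand crack to a two-bolt anchor.",
]
db = embed(routes)
query = embed(["A perfect hand crack ending at chains."])[0]

# Brute-force cosine similarity; a tool like ScaNN does this at scale.
scores = db @ query / (np.linalg.norm(db, axis=1) * np.linalg.norm(query))
print(routes[int(scores.argmax())])
```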
But, I realized that the route comparison problem was not a great fit for NLP for a few reasons:
- Humans are already good at finding climbing routes using existing tools.
- Comparing routes is based on many more factors than the description like difficulty, location, and discipline (trad, sport, boulder, etc). Accommodating these factors would turn the project from an NLP exploration to a feature-engineering grind on tabular data.
- The use case didn’t feel real. I have never actually had the urge to find routes similar to a specific climb.
Then, I remembered a joke I have with my climbing friends where we describe fake routes. This is funny because route descriptions are esoteric and full of quirks. For one thing, they generally have a matter-of-fact tone, even when they describe scary or dangerous sections. Sometimes the description notes something exciting or intimidating, but the written form simply cannot prepare a climber for the real thing. For example, consider the beginning of the description of a famous climb called Moonlight, which a friend of mine has recalled as “absolutely terrifying”:
“This is a delightful 5.6, almost as nice as Madame G’s or High E. The crux is as exciting as any 5.6 in the Gunks.”
Besides the understated mentions of adrenaline-pumping sequences, routes are also filled with climbing jargon. This technical language stands out starkly from everyday prose, as we can see by continuing with Moonlight’s description:
The first pitch is easy to locate: Use the same access trail as for CCK, just past the Andrew boulder. At the top of the access trail you’ll see an obvious crack/corner system leading straight up. This is about 35′ left of the huge Erect Direction corner.
P1: Climb up the face just right of an overhang and head for an obvious anchor on some small trees. Then follow the corner up to a bolted belay at the GT Ledge. 5.5, 130′. Alternatively, you can start by climbing the crack behind the flake to the left, at 5.7+R, then continuing up the left-facing corner.
The text continues onward in a similar fashion. It’s mostly gibberish to non-climbers, but it’s trivial for my friends and me to synthesize it. Over time, we have become attuned to the subtleties of the structure, tone, and rhetoric of the descriptions. When we make one up, we are pattern matching on-the-fly to produce something that seems like the real thing but with a comical twist. As a concrete example, I have often described Ledgetopia, a fantasy route with significant injury risk on numerous ledges:
A classic at the grade, this route starts with delicate, unprotected face climbing off the deck. Access a large ledge and take a rest before a tricky sequence with marginal protection. With gear well below the feet, grab the lip of a sizable shelf and mantel up. Save some gas for a short but burly roof, eventually reaching another series of ledges before the chains.
It dawned on me that this practice is very similar to natural language text generation. In text generation, a model generates realistic, long-form text from a prompt. The field has attracted buzz in recent years since large, generic models like GPT-2 have achieved state-of-the-art performance on many canonical generative tasks (GPT-2 has since been superseded by the even larger and more powerful GPT-3). Most importantly for me, GPT-2 can be repurposed for any specific task using a process called fine-tuning. This means it can be taught to generate text in the style of a route description, thus making my friend group obsolete.
The Process
I decided to see if I could get GPT-2 to talk like a climber and generate fake route descriptions. First, I had to verify my expectation that GPT-2 would not create climbing routes out of the box. GPT-2 can be downloaded easily from huggingface.co, which also provides simple sample code for trying it out. Here are some examples using the default model:
- “A classic at the grade” → “A classic at the grade, an A in history, a G in history, and an A in writing. And yet those two things just don’t”
- “Start on two crimps” → “Start on two crimps to tighten it from its very point on. A small screwdriver that will stick it down.\n\n3) Use one”
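For reference, trying the stock model takes only a few lines with the transformers library. A minimal sketch:

```python
# Generate text with the default (not yet fine-tuned) GPT-2.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("A classic at the grade", max_length=30)[0]["generated_text"])
```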
So the initial hypothesis is correct, and we can safely conclude that GPT-2 does not speak like a climber by default. This makes perfect sense, since the climbing community is a small sliver of the internet. Next, I had to fine-tune GPT-2 to learn the ways of the climbing community. To do this, I collected 150K rock climbing route descriptions from a popular site called Mountain Project. I got rid of short descriptions, which are often boring, and split the rest into train and test data (90/10). Both consisted simply of text files where each route description had its own line.
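The prep itself is simple enough to sketch. The file names and the length cutoff below are illustrative choices of mine, not necessarily what I used:

```python
# Filter out short descriptions and split 90/10 into train/test files,
# one route description per line.
import random

with open("descriptions.txt") as f:
    descriptions = [line.strip() for line in f if line.strip()]

# Drop short (usually boring) descriptions; 200 characters is arbitrary.
descriptions = [d for d in descriptions if len(d) >= 200]

random.seed(42)
random.shuffle(descriptions)
split = int(0.9 * len(descriptions))

with open("train.txt", "w") as f:
    f.write("\n".join(descriptions[:split]))
with open("test.txt", "w") as f:
    f.write("\n".join(descriptions[split:]))
```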
Now unfortunately, I do not have a GPU. Virtual machines with GPUs can be rented from many cloud providers, but they can cost a few hundred dollars a month! After digging around a bit, I found out that Google Colab actually offers free GPU access. So, as long as I could fit the training onto a single GPU and download my outputs, I could fine-tune for free.
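As a quick sanity check that a GPU is actually attached to the Colab runtime (a generic check, not from any particular tutorial):

```python
# Confirm the free Colab GPU is available before training.
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"; varies by session
!nvidia-smi  # shows the GPU's current memory utilization
```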
For the actual fine-tuning code, I found a small example from huggingface using an open-source script and a much more complete example from Joey S on Medium, which walked through the fine-tuning procedure on Google Colab with an example dataset. With relative ease, I was able to reuse that code on my own data (a sketch of the invocation follows the list below). Along the way I ran into a few interesting quirks/limitations:
- The Colab GPU is sometimes shared with other users, which can lead to memory issues during training. You can ensure that the GPU has no existing utilization by running the code described in this post by Ori Bar-El.
- The fine-tuning script lets the caller treat the training data either as one example per line or as a single blob. However, the line-by-line option does not work with GPT-2 (see this GitHub thread). So instead of seeing each description as a discrete example, GPT-2 trains on blocks of text that often span multiple descriptions. As we will see, this causes GPT-2 to often output a few route descriptions for a given prompt, where only the first one contains the prompt.
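Putting it together, the Colab cell that kicks off fine-tuning looked roughly like the sketch below. Exact flag names vary across transformers versions, so treat this as illustrative rather than exact; note the absence of the line-by-line option, per the quirk above.

```python
# Fine-tune GPT-2 on the route descriptions using the huggingface
# language modeling script (flags are illustrative).
!python run_language_modeling.py \
    --model_name_or_path gpt2 \
    --model_type gpt2 \
    --do_train --train_data_file train.txt \
    --do_eval --eval_data_file test.txt \
    --per_device_train_batch_size 2 \
    --output_dir finetuned-gpt2-routes
```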
After about 2.5 hours of training time, GPT-2 finished fine-tuning with a perplexity score of 19.02. I’m no expert in the perplexity metric, but the score seemed decent: the tutorial achieved a perplexity of ~20, and in OpenAI’s GPT-2 post they describe achieving state-of-the-art results on seemingly similar tasks with a perplexity of ~18. From my amateur perspective, the point was that the number fell within a reasonable range, suggesting some amount of fine-tuning actually took place. As a follow-up, I should go back and measure the perplexity of the unmodified GPT-2 model on my test set as a baseline.
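For context, perplexity is just the exponentiated average cross-entropy loss on the evaluation set, which is how the evaluation script derives it:

```python
# Perplexity = exp(average cross-entropy loss on the test set).
import math

eval_loss = 2.945  # illustrative value; exp(2.945) ≈ 19.0
print(math.exp(eval_loss))
```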
Results
Finally, I was able to get some generated text out of this thing, and the results were very surprising! Going back to the initial tests above:
- “A classic at the grade” → “A classic at the grade! The crux moves are the start of the climb. It starts with some thin crimps and then eases up to the top. The route name is a reference to the famous ‘Bolt Line’ on the wall above.”
- “Start on two crimps” → “Start on two crimps to the left of the first bolt. Move up to the sloping shelf, then up to another sloping shelf. Top out on good holds.”
These anecdotes suggest a pretty good grasp of rock climbing lingo! As I explored further, I found that the model would generally keep descriptions within certain archetypes. Some examples (prompts are underlined):
Boulder Problem
“Start on two heinous crimps. From here, move up to the jug on the right and then up to the topout.”
Multi-pitch Climb
“This route follows a wandering line so bring long runners. There are some loose rocks in the rock, so bring some small gear. There is a large ledge at the top to belay, so bring long runners. There are two bolts on top. Pitch 1: Follow the bolt line up and left on easy ground. The crux is at the second bolt. There are some loose flakes at the top. There is another bolt on top. Pitch 2: Climb straight up from the belay to the top of the pillar. The route starts on the left of the pillar, then moves to the right on easy ground to reach the top of the pillar. Belay from gear in a crack in a left-facing corner. Pitch 3: Follow the crack up to the top of the pillar. It’s easy to get lost here. There is a fixed line up and right on the pillar at the top. The route is about 20 feet to the left of the crack. The crux is right before you reach the top. The rock is good. The top is a bit dirty, but not a problem.”
Old School Reminiscing Climber
“Bob started climbing in the late ’80s with a friend who was a local. He had climbed the first half of the route, but it was only about 5 pitches, and it was the first time I ever climbed a 5.10a. I thought it was a great route. I think this was the first route I ever did at Red Rocks. It is a fun, well-protected line that is not too difficult, but not very sustained or long. I think it is a good warmup.”
Dangerous Route
“A terrifying route, and a good warmup if you’re looking for a route that will keep you on your toes. It’s a little runout on the upper half, but you’re not out of the woods either.”
These anecdotes demonstrate a pretty decent understanding of rock climbing, albeit with some mistakes.
Limitations
While the model seems to capture the tone of climbing descriptions and use the jargon roughly correctly, it still has some limitations that clearly mark its output as synthetic.
Repetition
Sometimes, the model starts repeating short phrases that are grammatically correct but odd. For example:
“A lousy route with a bad landing, so don’t fall! The route climbs a thin crack system that goes straight up to a ledge with a large pine. The route starts with a short crack that leads into an off-width that leads to a ledge. The crux is a thin hand crack that leads into the corner and a ledge above. The crack then widens and becomes a chimney. The chimney is the crux and the off-width is the crux and the off-width is the crux and the crack is offwidth.”
Despite the somewhat nonsensical repetition, this description remarkably associates the hallmarks of a “lousy” route: off-width climbing (traditionally feared by the climbing community), multiple cruxes, and a dangerous landing. A further exploration would be to deviate from the training script and write my own fine-tuning code to try to reduce the repetition, possibly using techniques described in this post from Patrick von Platen.
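That post covers decoding strategies like sampling and n-gram blocking. Here is a sketch of applying them at generation time; the model path below is my own placeholder:

```python
# Load the fine-tuned model and decode with repetition-reducing settings.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("./finetuned-gpt2-routes")
model = GPT2LMHeadModel.from_pretrained("./finetuned-gpt2-routes")

inputs = tokenizer("A lousy route with a bad landing", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=120,
    do_sample=True,          # sample instead of greedy decoding
    top_k=50,                # consider only the 50 most likely tokens
    top_p=0.95,              # nucleus sampling
    no_repeat_ngram_size=3,  # forbid repeating any 3-gram verbatim
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```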
Logical Fallacy
I noticed that in many cases, the model produces grammatically correct phrases that don’t make sense for climbing. These mistakes are often subtle enough that a non-climber might not notice, but a climber can tell immediately. Some quotes include:
- “…it was only about 5 pitches…but not very sustained or long. I think it is a good warmup.” 5 pitches would generally be considered much too long for a warmup.
- “Start up a short, steep corner with good handjams for pro.” A hand jam, though often described as secure, is not actually protection.
- “I first climbed this 20 years ago when a large block was chopped off by a tree.” It seems unlikely that a tree would chop off a large block and more likely that a large block would potentially break a tree.
Try It Out
Initially, I had grand plans of hosting my model in a live application so that folks could try it out. But I discovered that predictions were extremely slow on a CPU and still fairly slow on the Colab GPU, and hosting a server with one or more GPUs would be expensive. So instead, I created a Colab that you can try! The model is available via a publicly accessible Google Cloud Storage bucket, and the Colab uses it for inference.
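The gist of the inference Colab looks something like this; the bucket path below is a placeholder, not the real one:

```python
# Download the fine-tuned model from Cloud Storage, then generate.
!gsutil -m cp -r gs://<bucket>/finetuned-gpt2-routes .

from transformers import pipeline

generator = pipeline("text-generation", model="./finetuned-gpt2-routes")
print(generator("Start on two crimps", max_length=60)[0]["generated_text"])
```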
Training Colab
If you want to dig more into my actual fine-tuning code, please see this Colab. Note that the data itself is not included, since Mountain Project’s terms of service do not allow publishing their data off their site. I have also left out the exact scraping code because I’m not sure about the rules there, but I describe the process in a prior post.