Introduction
Every sport discipline has a method to compare skills between athletes. Runners measure time on a lap or a marathon, cyclists use network apps such as Strava. When it comes to climbers they introduce some unique grading systems.
0-6 grading system
Old scales were used, when people tend to climb in mountains only and it was used for exploration, not as athletics sport
Modern grading system
Despite the chart above, it is still unclear if the route rated 7b is hard as other 7b. There are not any objective values, as time or distance to present route’s grade, rather subjective climber’s experience in comparison with other routes. Experiences depend on many factors such as height or weight. That’s why soft grade, meaning a route that should have been graded lower than actually is, is a hot issue among cimber’s society.
Therefore we come up with the idea to explore dataset, thanks to Kaggle, to seek for the routes that are rated somehow like 7b but should be treated elsewhere like 7c+ or 7a.
The Dataset
Kaggle introduces data from web service which contains 4.1 million posts from climbers ascent. They provide lots of data but for analysis we need only a few of them.
name |
crag_id |
grade_id |
user_id |
fra_routes |
Maßarbeit |
16600 |
55 |
1476 |
7b+ |
Zugabe |
16600 |
55 |
1476 |
7b+ |
Fingerbeißer |
16600 |
53 |
1476 |
7b |
Ich Habs Wollen Wis |
16600 |
55 |
1476 |
7b+ |
Halucinacna |
0 |
55 |
1476 |
7b+ |
Nove koreni |
27784 |
55 |
1476 |
7b+ |
Spanelske lzi |
27784 |
53 |
1476 |
7b |
Deprese |
29360 |
55 |
1476 |
7b+ |
Gottes Vergessene Kinder |
21971 |
49 |
1476 |
7a |
data needed
name - name which route’s author have given after first ascent
crag_id - id of the place where there route is placed
user_id - id of the person who have climbed a route
A great problem of the dataset is the lack of route_id column. It leads us to the point where the pair (name, crag_id) is the best approximation of the candidate key of the route. But there could be some crags that contain two different routes with the same name. What’s more people tend to name the same routes in different convention. Some of them gives a route’s variation name or simply make a typo and the system will assign it as different route. We decided to use a pair (name, crag_id). It is appropriate for majority of examples given.
Seeking for a soft grade
We propose two different ways to find a route that has a high grade however, when in comes to ascending it is easy to be accomplished.
User’s local maximum
We explore user’s climbing progression function over the time and peaks in the chart. We focus on local maximum because it could be the route, for instance 8c, that a climber did while they were able to do significantly easier routes.
Chart contains of logs from one of the strongest climber
Routes mean and dominant grade comparison
Second way to look for soft grades is to analyse all grades people have proposed for a route. Climbers tend to propose the same difficulty as the first person to ascent has given, or just the same as in a guidebook. But when a route is much harder or easier than it is graded, they break up and propose more accurate rate. Dominant of route grade when compared to mean grade would show presented scenario.
Preprocessing
Dataset need to be cleaned up in order to expect useful results in analysis. We take a subset where name occur more than n times. It is crucial because central_tendency algorithm calculates mean and dominant which is useless on too small dataset. Then we drop all users that have posted less then m ascents for local_maximum function. It is essential to properly analyse climbers peaks in progression. Experimentally we’ve chosen minimum quantity of route’s logs: 500 and minimum routes logged by a user: 30. Lastly many unexisting routes are being deleted with name such as “don’t know the name” or “unnamed”.
Analysis
After filtering data we were given 10415 posts of 832 routes in a data frame. While evaluating local maximum finder we use a find_peaks function from scipy. It has attributes: height, threshold, distance, prominence, width, wlen, rel_height, plateau_size.They change the sensitivity of the algorithm in finding peaks. We don’t want the noise to be described as a peak, but on the other hand no significant peak should be omitted.
The most significant arguments are threshold, which is vertical distance to its neighbouring samples and prominence - the best described by its similarity to a peak prominence in mountains.
Central tendency has one argument - difference - between mean value and dominant. Climbers tend not to change original grade even if grade is slightly easier than it should be. That’s why difference argument is set very low.
Local maximum arguments value is being set by using a grid search to most accurately fit confusion matrix of predicted (local maximum) and actual (central tendency) value.
In our model results from central tendency algorithm are labeled as Actual Values and from local maximum as Predicted.
After experiments we have chosen a local_maximum case with greatest accuracy and precision. Why are these parameters most significant. Obviously accuracy describes how often are true values in analysis. We focused on precision because we want to minimize routes, which were predicted as soft_grade but in fact they are just casual ones. The arguments are being set to threshold = 6 and prominence = 6. Threshold = 9 is being omitted due to the fact system find too few true positives with threshold = 9.
Performance
With accuracy and precission set to 6 we are given confusion matrix:
|
Real Positive |
Real Negative |
Predicted Positive |
63 |
136 |
Predicted Negative |
77 |
556 |
accuracy: 74%
precision: 32%
recall: 45%
f1: 37%
Confusion matrix demonstrates us how well algorithms have assigned the easiness of the route. Due to great number of routes being properly described there is big ratio of true negative.
Conclusion,
While having certainly high accuracy, why is the precision indicator so low? First of all, what hasn’t been said, climbers tend to ascend some routes as their projects, and they do it really often. It means that they choose one really hard route and try it numerous times. When they finally accomplish the route it is a peak in local_maximum finder and cannot be distinguished from soft grade ones. It is especially hard to conclude, because fact about projects require domain knowledge of climbers’ habits and is nearly impossible to deduce from data.