Predicting gamer nationalities using machine learning


Data science and video game genres

Welcome to this post about data science and video games. In this post, we describe how we built a model that predicts a player's region (continent or European region) from game genre and the number of minutes played. Practical applications of such a model include customer segmentation and predicting customer behavior. We hope you enjoy this read about video games and player nationalities.

In practice, games are labeled with particular genres by the game developers. Think of World of Warcraft as a massively multiplayer online role-playing game (MMORPG) and FIFA as a sports game. In this study, we determine categories of genres based on data analyses and existing theoretical frameworks. We mainly focus on which genre classification can be used to predict the adoption of games across nationalities. All definitions from this framework that are used in this research are displayed in Table 1.

Table 1: Different categories of video games and their explanation

Another factor that we use for predicting player nationality is average playing time (in minutes per country). According to previous research, both game genre preference and playing intensity have an effect on playing time. Prior research defines playing intensity as the degree of harmonious and obsessive passion for a game, as well as an increasing number of hours played. This research therefore accounts for average playing time per genre.


Dataset collection and pre-processing

The dataset used for this research comes from the gaming platform Steam. It contains information on 109 million gamers, 716 million games and a total of 1.1 million years of playtime. The dataset contains 13 different features per player, with various sub-features. In this research, only the features “Player_country” and “Games_id” are used.

To reduce the amount of data to be analyzed and to focus on the most relevant data, several filters were applied to the dataset. Only players with a playing time of more than 5000 minutes were selected. All genres besides the 8 genres mentioned above were removed. Initially, 160 countries were present in the dataset. However, as the focus of this research is on game genres, only countries with data from 3 or more genres were selected, resulting in the removal of 102 countries from the dataset. The final dataset contained 58 countries, 2998 games and 31746 players.
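The three filters described above can be sketched in pandas. This is a minimal illustration on made-up data, not the code used for this research: the column names, the toy rows and the reduced genre set are assumptions, and we assume the 5000-minute threshold applies to a player's total playing time.

```python
import pandas as pd

# Hypothetical toy data: one row per (player, game) with a genre label.
# Column names are assumptions for illustration, not the Steam schema.
plays = pd.DataFrame({
    "player_id": [1, 1, 2, 3, 3, 3, 4, 5],
    "country":   ["NL", "NL", "NL", "DE", "DE", "DE", "XX", "NL"],
    "genre":     ["action", "racing", "sports", "action",
                  "strategy", "puzzle", "action", "action"],
    "minutes":   [4000, 2000, 6000, 3000, 3000, 9000, 7000, 1000],
})

KEPT_GENRES = {"action", "racing", "sports", "strategy"}  # stand-in for the 8 genres
MIN_MINUTES = 5000
MIN_GENRES_PER_COUNTRY = 3


def filter_plays(df):
    # 1. Keep players whose total playing time exceeds the threshold
    #    (assumption: the threshold applies to a player's total minutes).
    total = df.groupby("player_id")["minutes"].transform("sum")
    df = df[total > MIN_MINUTES]
    # 2. Keep only the genres under study.
    df = df[df["genre"].isin(KEPT_GENRES)]
    # 3. Keep countries that have data for at least three genres.
    n_genres = df.groupby("country")["genre"].transform("nunique")
    return df[n_genres >= MIN_GENRES_PER_COUNTRY].reset_index(drop=True)


filtered = filter_plays(plays)
print(filtered)
```

In this toy example only one country keeps data for three genres, so the other two countries are dropped, which mirrors how the country count shrinks in the real dataset.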

Since the research also looks at continents, the data was split into North America, South America, Africa, Europe, Asia and Oceania, using an existing list of countries per continent. Europe is analyzed in more depth and is therefore split into regions from a business perspective. These regions are displayed in Table 2.

Table 2: Categorization of countries for Europe

Finally, playing minutes for each country per genre were divided by the total number of players in that country.
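As a rough sketch of this normalization step, the snippet below sums the minutes per country and genre and divides each country's row by its number of distinct players. The column names and numbers are hypothetical and only serve to show the arithmetic.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
plays = pd.DataFrame({
    "player_id": [1, 1, 2, 3],
    "country":   ["NL", "NL", "NL", "DE"],
    "genre":     ["action", "racing", "action", "action"],
    "minutes":   [600, 300, 400, 900],
})


def average_minutes(df):
    # Total minutes per country and genre, one column per genre.
    totals = df.pivot_table(index="country", columns="genre",
                            values="minutes", aggfunc="sum", fill_value=0)
    # Number of distinct players per country.
    players = df.groupby("country")["player_id"].nunique()
    # Divide each country's row by its player count.
    return totals.div(players, axis=0)


avg = average_minutes(plays)
print(avg)
```

For the toy data, NL has two players and 1000 action minutes in total, so its average action time is 500 minutes per player.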


Dataset description

The dataset contains 8 different genres. Figure 1 displays the distribution of games over these genres; from left to right, the number of games per genre is 3120, 2654, 3953, 71, 2918, 313, 496 and 2043. Most games belong to the action, simulation, strategy, adventure and role-playing genres.

Figure 1. Number of games per genre

Finally, the average minutes of playtime by genre per country was calculated. The result is schematically displayed in Figure 2:


Different models for classification

Logistic Regression

We start off with a logistic regression. Table 3 shows that when the average playing time per country per genre is added, all genres have a positive effect on the prediction of Europe, except for education. The action and racing genres have the strongest positive impact. The accuracy of this model is 80.4%, which is well above the baseline of 51%. Because playing time is always positive, a genre with a positive coefficient generally raises the predicted value.

Some interesting observations are that the sports genre has a stronger positive effect in the UK than in Northern Europe, and that the simulation genre has a stronger positive effect in DACH than in other regions. All accuracy scores for European regions are above 90%, except for DACH, which is at 84%. All accuracies are displayed in Table 4.

Table 3. Results for genre only logistic regression on continent Europe

Table 4. Accuracies per continent/region.
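To make the comparison between a logistic regression and a baseline concrete, here is a minimal scikit-learn sketch on synthetic data. The data, the labeling rule and the feature scaling step are our assumptions for illustration; the genre names are only used as placeholder column labels, and the numbers it prints are unrelated to the results in Tables 3 and 4.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

GENRES = ["action", "simulation", "strategy", "adventure",
          "role_playing", "education", "racing", "sports"]

# Synthetic stand-in data: 400 rows of non-negative "average minutes".
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=500.0, size=(400, len(GENRES)))
# Made-up rule: label 1 ("Europe") when action + racing minutes exceed the median.
signal = X[:, GENRES.index("action")] + X[:, GENRES.index("racing")]
y = (signal > np.median(signal)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Baseline: always predict the most frequent class in the training set.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Scaling is added here for numerical stability; it is not taken from the post.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
# One coefficient per genre: the sign shows the direction of the effect.
coefs = dict(zip(GENRES, model.named_steps["logisticregression"].coef_[0]))

print(f"baseline={baseline_acc:.3f} model={model_acc:.3f}")
for genre, coef in coefs.items():
    print(f"{genre:>12}: {coef:+.2f}")
```

Reading the signs of the coefficients is what allows statements such as "racing has a positive effect on the prediction of Europe".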


Decision tree

The decision tree is used to learn how the data is split in order to predict player continent and European player region. The leaves indicate the decisions based on the training examples in the data. For visualization, a total of 4 leaves is used as a parameter in order to keep the diagram readable.

Continent and European region

Using a decision tree an accuracy of 0.91 is achieved for continents. In Figure 3, the training data is split into smaller and smaller subsets based on the features. It is notable to see that most decisions are based on average playing minutes per continent.

Starting at the top, the average playing minutes for education per country splits the players into those with an average time of less than 30992 minutes and those with more than 30992 minutes. If a player spends more than 30992 minutes on the education game type, the continent is North America. Next, average strategy, average racing and average action time split the data into smaller subsets. Most players are from Oceania if their average action time is not smaller than 3933 minutes. Another very clear subset is Europe: a large part of the data falls into it when the average racing time is not smaller than 2 minutes.

Figure 3 Decision Tree for continents with a depth of four.

Figure 4 Decision Tree for European regions with a maximum depth of four.

For European regions, an accuracy score of 0.937 is obtained on the test set. The parameters for this decision tree set a maximum depth of 4 layers. Compared to the logistic regression, we can see that the results for Northern Europe in particular are in line. The sports genre has a large negative effect in Northern Europe in the logistic regression, which is confirmed by the decision tree.
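A depth-limited tree like the ones in Figures 3 and 4 can be sketched with scikit-learn as follows. The data and the region labels are synthetic and the feature names are hypothetical; only the maximum depth of 4 is taken from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical feature names: average minutes per genre.
FEATURES = ["avg_education", "avg_racing", "avg_action"]

# Synthetic data with a made-up threshold rule that produces three classes.
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=500.0, size=(600, len(FEATURES)))
y = np.where(X[:, 0] > 1500, "region_a",
             np.where(X[:, 1] > 800, "region_b", "region_c"))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Limit the depth to 4 so the tree stays small enough to read.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
test_acc = tree.score(X_test, y_test)

# Text rendering of the splits, comparable to reading the figure top-down.
rules = export_text(tree, feature_names=FEATURES)
print(f"test accuracy: {test_acc:.3f}")
print(rules)
```

Each line of the printed rules is a threshold on one feature, which is how statements such as "average racing time is not smaller than 2 minutes" are read off a tree.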


Support Vector Machine

SVM is used for multi-label classification to determine the accuracies of the models. The optimization parameters for a regular SVM are C and gamma (g). For C, the values 1, 5, 10, 50 and 100 are used. For gamma, the values 10^-6, 10^-3, 1, 10 and 100 are used to test the accuracy score. A high accuracy (above 0.96) is obtained for the European continent with a support vector machine for gamma equal to 100, independently of C. The corresponding confusion matrix is shown in Table 5. It is interesting to see that Benelux and DACH are classified perfectly. Northern Europe and the Mediterranean (MED) are classified the worst. This may be because the MED region has a wide variety of players and game genre preferences.

Table 5 Result of parameter optimization for European continent.
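The parameter search over C and gamma can be sketched as a grid search. The grid values are the ones listed above; the RBF kernel, the 3-fold cross-validation and the synthetic data are assumptions on our side, since the post does not state them.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic three-class data as a stand-in for the regional data.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Grid from the text: 5 values for C times 5 values for gamma.
param_grid = {
    "C": [1, 5, 10, 50, 100],
    "gamma": [1e-6, 1e-3, 1, 10, 100],
}

# Assumption: RBF kernel (gamma has no effect on a linear kernel).
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

The `cv_results_` attribute of the fitted search holds the score of every C and gamma combination, which is the kind of overview a parameter optimization table summarizes.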

Continent and European region

For multilabel classification, different optimization parameters are used in order to arrive at a confusion matrix and an accuracy score. Around 15% of the total dataset is taken into account when displaying the confusion matrix (Table 6). We can also see, for example, that the 4th continent (South America) has no data and the 5th continent (Africa) has only 2 recorded instances. This is because both continents contained very few data instances compared to other continents. As mentioned before, the SVM could not properly label these continents. The accuracy score for the multilabel classifier is 0.88.

Table 6 Confusion matrix for continents
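The effect of a continent with no test instances is easy to reproduce. In the small sketch below, the labels and predictions are made up and the order of the continents is an assumption; passing a fixed label list keeps a row and a column for every continent, even when one of them never occurs.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Assumed label order, chosen so that South America is 4th and Africa is 5th.
CONTINENTS = ["North America", "Europe", "Asia",
              "South America", "Africa", "Oceania"]

# Made-up true labels and predictions for 13 test players.
y_true = (["Europe"] * 5 + ["North America"] * 3 + ["Asia"] * 2
          + ["Africa"] * 2 + ["Oceania"])
y_pred = (["Europe"] * 5
          + ["North America", "North America", "Europe"]
          + ["Asia", "Europe"]
          + ["Europe", "Europe"]
          + ["Oceania"])

# Rows are true continents, columns are predicted continents.
cm = confusion_matrix(y_true, y_pred, labels=CONTINENTS)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(f"accuracy: {acc:.3f}")
```

The South America row contains only zeros because no test instance has that label, and the Africa row has its 2 instances off the diagonal, so the overall accuracy says little about these two continents.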


Neural network

The final machine learning model is the neural network. The neural network is implemented in Python with the Keras library. The number of hidden layers, determined experimentally, is set to 11. The dense layers use a sigmoid activation with the bias enabled. The network is trained for 300 epochs to increase the accuracy. Within each layer, the default values are used for all other settings. At first sight, the classification accuracy for continents is much higher than for European regions. After 300 epochs, the accuracy for continents is almost 86%. For European regions, the accuracy is much lower, at most 61.8%. The accuracy in both cases is still high compared to both baselines.

Table 7 Classification accuracy on continents (1) and European regions (2).
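A network along these lines could be set up in Keras roughly as shown below. Only the 11 hidden layers, the sigmoid activation, the enabled bias and the 300 epochs come from the text. The number of units per layer, the softmax output layer, the optimizer and the loss are assumptions for this sketch, and the training call is left as a comment because no data is included here.

```python
from tensorflow import keras


def build_model(n_features, n_classes, n_hidden=11, units=16):
    """Stack n_hidden sigmoid Dense layers and a softmax output layer."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for _ in range(n_hidden):
        # Sigmoid activation with bias enabled, as described in the text.
        # The number of units per layer is an assumption.
        model.add(keras.layers.Dense(units, activation="sigmoid", use_bias=True))
    # Assumption: one softmax output unit per continent or region.
    model.add(keras.layers.Dense(n_classes, activation="softmax"))
    # Assumption: Adam optimizer and integer-encoded class labels.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# 8 genre features and 6 continents.
model = build_model(n_features=8, n_classes=6)
model.summary()

# Training for 300 epochs would look like this (X_train and y_train not shown):
# history = model.fit(X_train, y_train, epochs=300, validation_split=0.2)
```

Deep stacks of sigmoid layers are known to train slowly, so the gap between the continent and region accuracies may partly depend on such architecture choices.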



Summary of results

A total of 4 classifiers have been used to determine the accuracy for player region. Table 8 displays a summary of all the models and their accuracy scores.

Table 8 Summary of results applying different classifiers

In all cases, the accuracy scores are higher than their baselines. We can observe that the decision tree has the highest accuracy scores in general, whilst the SVM has the highest score for the European regions. Based on the decision tree, we can observe that all regions can be split in an efficient and clean manner. These results indicate that the average number of minutes played per country can distinguish and predict both continents and European regions very well.


The aim of this research is to predict player region and continent based on game genre and average playing time per genre. The following research questions have been proposed at the beginning of this study:

1. To what extent can game genre and average playing time predict player region?

2. To what extent can player region be differentiated based on game genre and average playing time?

The purpose of the first question is to predict player region and continent based on genre and average playing time for each genre per country. This study focuses specifically on continents and regions within Europe. The results demonstrate that both continents and regions within Europe can be predicted through various machine learning models. The racing, adventure, and action genres are the most reliable for predicting that players are located within the European continent. The data also indicates that European players tend to play racing games more than players on other continents. It is also evident that players’ location within the DACH region is correlated with the simulation genre. Benelux players are specifically associated with combinations of the sports and action genres. In this respect, both Benelux and DACH are more distinct from the rest of Europe than other European regions. The models used can predict European regions and continents with an accuracy of around 90%.

The intention of the second research question is to determine whether regions can be distinguished based on game genre and average playing time per genre across countries. This research has determined an accuracy score of 88% when predicting multiple continents and European regions. However, the confusion matrix also indicates that few data points are accounted for, which renders this conclusion somewhat dubious. Within Europe, a clear divide between different regions is evident. As mentioned previously, particularly DACH and Benelux display a stark distinction relative to other European regions.

Thanks for reading this. Would you like to predict something based on your data? Then do not hesitate to contact us.



DataCrunch V.O.F.