What follows is a description of how the stats backend functions for BFBC2, what happens during high load, and what we are doing to resolve it. Consider it a peek ‘under the hood’ of BFBC2.
System overview
When playing online, all game clients and game servers are permanently connected to the game’s backend servers.
There is a separate backend for each of the PC/PS3/360 versions of BFBC2.
A backend is split into two portions – one group of machines which run some custom software, and a database. The database is not directly accessible by game clients/servers; they can only reach it by sending requests to the custom software portion, which in turn talks to the database.
Each database is a cluster of machines which run Oracle 9i with RAC enabled.
There are a few modules in the backend, and a few tables in the database, which are shared between multiple platforms / titles. Those are generally rather low-intensity processes. However, they have to be taken into account whenever one wants to change the physical configuration of the machines that run the backend.
Stats
A stat is a short identifier with an accompanying value. Stats are tracked for each player, and they are saved between game sessions. For BC2 there are approximately 2000 unique stats values. Some of the stats have a direct meaning – your current score with a specific kit, number of kills with a specific weapon and so on – whereas other stats are meaningless on their own and track your progress toward various achievements/trophies, pins and insignias.
The stats are kept in a couple of big tables in the Oracle database.
Game client and stats
The game client only reads from the stats database; it never writes.
Stats reads happen on two occasions: when a player logs in, and when a player exits from a server back to the main menu. The client has a local cache of all stats. When either of these two events occurs, the game client requests a handful of stats (for instance, the player’s total score and accumulated online playtime). If any of those stats differ from the locally cached values, the game client goes out and grabs all stats (approximately 2000 values).
The game client uses these stats to display information in the main menu. They are not used in-game in multiplayer.
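To make the cache check above concrete, here is a minimal sketch in Python. It is not the actual client code: the names `PROBE_STATS`, `refresh_cache`, `fetch_some` and `fetch_all` are made up for illustration, and the two probe stats stand in for whatever "handful" the real client asks for.

```python
# Hypothetical sketch of the client-side cache check; all names are illustrative.
PROBE_STATS = ("score", "time")  # assumed names for the "handful" of stats


def refresh_cache(cache, fetch_some, fetch_all):
    """Refresh the local stats cache; return True if a full fetch happened.

    fetch_some(names) -> dict holding just the named stats
    fetch_all()       -> dict holding every stat (approximately 2000 values)
    """
    probe = fetch_some(PROBE_STATS)
    if all(cache.get(name) == value for name, value in probe.items()):
        return False  # cache still matches the backend; skip the big fetch
    cache.clear()
    cache.update(fetch_all())
    return True
```

The point of the small probe is that the expensive 2000-value read only happens when something has actually changed since the last visit to the main menu.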
Game server and stats
The game server reads and writes to the stats database.
When a player enters a server, the server requests approximately 1000 stats for that player from the database. Anything that has to do with stats and ranks is controlled by the server (for instance, which weapons are unlocked for a specific player).
The server writes back a player’s stats when the player leaves the server. Also, all players’ stats are written to the database at the end of each round. This is to minimize the risk that player progress is lost because of a server crash. When writing stats, the server will only write those stats that have changed. In addition, whenever possible the server will issue commands like “add 3 to stat named ABCD” rather than “write 27 to stat named ABCD”. This minimizes the risk that any bugs in the code or network communication problems will trample stats; the worst that can occur is that a stat is not increased. It will not get lowered or set to zero inadvertently.
Usually the game server will write far fewer than 1000 stats. I don’t have figures at hand, but perhaps 100 stats are usually updated after a player has played a full round.
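The "only write what changed, and write it as an increment" idea can be sketched as follows. This is an illustration under assumptions, not the real server code: the command tuples and the function names `build_write_commands` and `apply_commands` are invented here, and stats that cannot be expressed as increments are not modelled.

```python
def build_write_commands(at_join, now):
    """Return ("add", name, delta) commands for the stats that changed.

    at_join: stats as read when the player entered the server
    now:     stats as tracked by the server at write time
    """
    commands = []
    for name, value in now.items():
        delta = value - at_join.get(name, 0)
        if delta != 0:  # unchanged stats are not written at all
            commands.append(("add", name, delta))
    return commands


def apply_commands(table, commands):
    """Apply commands the way a backend might (assumed behaviour)."""
    for op, name, amount in commands:
        if op == "add":
            table[name] = table.get(name, 0) + amount
        elif op == "set":
            table[name] = amount
```

With `at_join = {"ABCD": 24, "EFGH": 5}` and `now = {"ABCD": 27, "EFGH": 5}`, the only command produced is `("add", "ABCD", 3)`. If that command is lost, ABCD simply stays at 24, whereas a buggy `("set", "ABCD", 0)` would wipe the stat.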
High load scenarios and the backend
Normally the database responds to the custom software’s read/write queries very quickly. The database can service requests from a couple of game clients/servers in parallel; if too many requests are made at once, new ones are put into a queue, and the database completes the queued-up entries as quickly as it can. Normal turnaround time for retrieving 2000 stats is approximately a second. Requesting 2000 stats takes longer than requesting 1000 stats, probably about twice as long.
However, the requests do not come in a steady flow. Sometimes many servers and clients will ask for stats data at nearly the same time. The database will then service some of those requests a bit slower than usual.
The database is the weaker portion of the BFBC2 backend; that is, the custom software can handle more simultaneously active players than the database can.
If the clients/servers are making a lot of requests to the database over a long period of time, then the backlog of queries in the database’s queue will get longer and longer. When the queue is so long that the database is unable to service queries within 10 seconds, the custom software will give up on those queries and respond with an error to those clients/servers.
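A toy model shows how a burst turns into timeouts. This is a sketch under loud assumptions: a single worker instead of "a couple" in parallel, first-come-first-served ordering, and timed-out work still being carried out by the database. Only the 10-second cutoff and the roughly one-second service time come from the description above; the function name `simulate` is made up.

```python
TIMEOUT_SECONDS = 10.0  # cutoff from the text


def simulate(requests):
    """requests: list of (arrival_time, service_time), sorted by arrival.

    Returns one boolean per request: True if it was answered in time.
    Assumes a single worker and that timed-out work is still performed.
    """
    free_at = 0.0  # time at which the worker finishes its current backlog
    results = []
    for arrival, service in requests:
        start = max(arrival, free_at)
        free_at = start + service
        results.append(free_at - arrival <= TIMEOUT_SECONDS)
    return results
```

Fifteen one-second requests arriving together, `simulate([(0.0, 1.0)] * 15)`, give ten successes followed by five failures. The same fifteen requests spaced two seconds apart all succeed, which is why it is the burstiness that hurts and not only the average load.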
High load scenarios and the game client/server
With the above in mind, let’s imagine what happens when the number of simultaneous players increases.
At first, there are not a lot of players. The database will handle any requests quickly and its queue is nearly empty all the time.
As the number of players goes up, the database will still be able to keep up with most requests. However, occasionally a lot of servers/clients will happen to perform stat requests at nearly the same time. This causes the queue to fill up a bit more than usual. Some of those queries will then time out when they hit the 10 second cutoff. Since clients normally request more data, it is usually the game client’s requests that fail first.
If the game client’s request fails, the game client will attempt to retrieve stats for 10 or 20 seconds – and then give up, and the game’s main menu will claim that the player is Rank 1 and has zero score etc.
As the load increases further, the game server read requests will also fail more often. When game server read requests fail, the affected players will play with rank 1 and no stats-related unlocks. When this happens, the game server will not record and write back progress for the affected players either.
Finally, with a really high load, all requests from game clients & game servers will fail.
High load versus too high load
One important thing to notice about some online systems and load, is that the load does not behave like you would intuitively expect it to. Usually it rises slowly… until it gets to a certain point, and then it all spirals out of control and horror ensues. There are several reasons for this.
One is the human factor: when the load is at such a level that stats requests are failing intermittently, it appears to the player as if he/she has lost all his/her progression, but either logging out and in again (in the case of no stats in the main menu) or disconnecting/reconnecting (in the case of no stats in the game) has some chance of getting the stats back. People will then naturally do this over and over until they either get stats, or are frustrated enough to give up. This behaviour will cause more load on the backend than normal gameplay behaviour, which worsens the problem overall.
Another can be in the code; sometimes game client/game server code is written to retry a couple of times when an operation fails. This is a good thing when the backend is not under high load – after all, the error might be due to a momentary hiccup. However, when the load is high this will make the problem worse (in just the same way as in the “human factor” example).
There are also some things happening in the background on databases – like backups, or regularly scheduled maintenance / data processing jobs.
This means that some online systems can seem to be running fine, with a steady load, and then something happens and within minutes they grind to a halt.
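The retry effect can be put into rough numbers. The sketch below assumes that every attempt fails independently with the same probability and that a failed operation is retried up to a fixed number of times; neither assumption is taken from the real client or server code, and the function name is made up. Under those assumptions the expected number of attempts per operation is 1 + p + p^2 + ... + p^r.

```python
def expected_attempts(failure_probability, max_retries):
    """Expected number of requests sent per operation.

    Assumes each attempt fails independently with the same probability
    and that a failed attempt is retried up to max_retries times.
    """
    return sum(failure_probability ** k for k in range(max_retries + 1))
```

With three retries, a healthy backend (`failure_probability = 0.0`) sees 1 request per operation, a backend failing half the time sees 1.875, and a backend failing every time sees 4. In other words, the worse the backend is doing, the more extra load the retries pile on top, which is one way a steady load can tip over within minutes.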
How well-behaved a system is depends on what functions it performs and on the behaviour of its users.
BFBC2’s custom backend software is well-behaved in most respects. The database suffers a bit from the problems described above – the step between “players are occasionally not getting stats” and “players are never getting stats” is smaller than theory would predict.
A closer look at the database itself
The stats database used to handle considerably more players back when it launched than it does now. In other words, reads/writes against the database now take more time to complete. There are two main reasons for this.
- There are stats for many more players in the database now than back when we started. Databases are good at servicing requests like “give me the contents for user with ID=1234, it is somewhere in that huge table”, but performance does go down as the tables grow in size.
- The tables themselves are becoming fragmented. Several years and several games ago, when the database administrators designed the database setup for the system, they asked what the priorities were for the database. The response was runtime performance: the database should be set up to service as many reads/writes per second as possible. One deliberate tradeoff of the highest-performance setup they could create was that the database would gradually acquire small gaps in it. These gaps would not get reclaimed automatically. The amount of “lost” space in the database would grow over time, and after a while the lost space would result in performance loss (due to disk caches not being as efficient anymore). This is sorted out by taking the database offline once every couple of months and rebuilding it – thereby squeezing out all the gaps. However, for some reason these regular rebuilds have not been happening for any BFBC2 title.
Defining the problem
The problem we will tackle is the following: the current player population is suffering from stats outages. That shouldn’t be happening. Stats should be reliable with roughly the player numbers that we have now, plus a bit of headroom. We will not attempt to make it handle 100,000 concurrent users on a single backend.
Tackling the problem
One can attempt to make individual database accesses faster.
Taking the database offline, and rebuilding the tables.
This is certain to help, and it is the first thing that we will do. We will also schedule new rebuilds whenever necessary in the future.
Making disk cache sizes larger.
Memory is faster than disk, so if more of the database is kept in memory then accesses will go faster.
The PC and 360 database clusters have as much memory as is possible. The PS3 cluster has room for more memory though.
We will add it.
Redesigning the tables.
The table layout is not designed specifically for BFBC2; the same design is used by many other EA titles. Changing the design would improve performance for most requests by a fair bit. However, the time required for getting such a modification implemented, tested, and live is far too long.
We will therefore not do it.
Adding more machines to the database clusters.
One might think that doubling the number of machines in a database cluster will also double the performance of a cluster. In reality, all those machines need to coordinate their work with each other. Therefore, adding more machines only helps sometimes. In some cases, performance actually gets worse.
We will therefore not do it.
Moving to a newer Oracle version or another database altogether.
Again, the turnaround time for doing this to a live system is far too long.
We will therefore not do it.
Or one can reduce the number of database accesses.
Making game clients request fewer stats.
The game client is already doing a small fetch before doing a full fetch (in case score/time or a couple of other stats have changed). If the client doesn’t update all the stats in its cache, the main menu will not be able to show the player’s in-game progression correctly. It is perhaps possible to split the stats fetching into two portions – one portion for showing the most important stuff in the main menu (in the case of BC2 PC, the stats-related items in the main screen), and another portion for showing all the achievements/trophies etc.
It is under consideration.
Making the game servers cache stats for players.
The servers could have a cache like the game clients, but cache stats for many different players. This would help with people who play near-exclusively on one server. It is doubtful whether it would make much of a difference (I don’t have statistics on this, just guessing).
We will not do it.
Making the game servers request fewer stats.
Fetching fewer stats will make the game server unable to evaluate the full player progression.
We will therefore not do it.
Making the game servers write fewer stats.
If the game servers wrote stats to the backend at every Nth round instead of at every round, then there would be fewer stats writes overall. There is a tradeoff here – it increases the risk that players lose their progression due to server crashes – but N=2 or N=3 keeps both risk and impact very small.
We have already implemented this change for both consoles, and will implement it for PC.
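As an illustration of writing at every Nth round, here is a minimal sketch. The class `DeferredStatsWriter` and the `send_batch` callable are hypothetical names and not the real server implementation; the write that happens when a player leaves the server is not modelled.

```python
class DeferredStatsWriter:
    """Accumulate per-player stat deltas and flush them every Nth round."""

    def __init__(self, every_n_rounds, send_batch):
        if every_n_rounds < 1:
            raise ValueError("every_n_rounds must be at least 1")
        self.every_n_rounds = every_n_rounds
        self.send_batch = send_batch  # hypothetical callable: dict -> None
        self.pending = {}             # player -> {stat name -> delta}
        self.rounds_since_flush = 0

    def record(self, player, stat, delta):
        """Add a delta for one stat of one player to the pending batch."""
        stats = self.pending.setdefault(player, {})
        stats[stat] = stats.get(stat, 0) + delta

    def on_round_end(self):
        """Count the finished round and flush once N rounds have passed."""
        self.rounds_since_flush += 1
        if self.rounds_since_flush >= self.every_n_rounds:
            self.flush()

    def flush(self):
        """Send everything accumulated so far and start a fresh batch."""
        if self.pending:
            self.send_batch(self.pending)
        self.pending = {}
        self.rounds_since_flush = 0
```

With `every_n_rounds=2`, three kills in round one and two kills in round two result in a single write of "add 5" instead of two separate writes. The price is that a crash during round two loses both rounds of progress, which is the tradeoff mentioned above.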
Once one set of changes is in place, we will reassess the situation, and repeat as needed.
http://forums.electronicarts.co.uk/battlefield-bad-company-2-pc/1387445-stats-system-performance-perspective.html