Designing the UFC Moneyball


Using big data analysis on sports? Gigi Sayfan takes us through doing just that with Cassandra/Scylla, MySQL, and Redis. In this Compose Write Stuff article, he shows us how he constructs his own Moneyball.

Sports and big data analysis are a great match. In any sporting event, so much is happening at every moment. There are many trends that evolve over various time scales - momentum within the same match, within a season, and over the career of an athlete. You can collect a lot of data, analyze it and use it for many purposes. The movie Moneyball (based on a book by Michael Lewis) made this notion popular.

In this article, we will take this concept to the world of MMA (mixed martial arts) and design a data collection and analytics platform for it. The main focus will be on data collection, storage, and access. The design of the actual analytics will be left as the dreaded exercise for the reader! The weapons of choice will be Cassandra/Scylla, MySQL, and Redis.

UFC Moneyball

Before we jump ahead and start talking databases, let's understand the domain a little and what we want to accomplish. I always find this to be a critical first step: it provides structure and a framework to operate within. The use cases and conceptual model usually stabilize pretty quickly and provide a lot of clarity. Additional use cases and concepts that come up later often turn out to be "yet another" instance that fits into the existing framework.

Quick intro to MMA

MMA is a sport where two competitors with backgrounds in multiple martial arts fight each other in a cage using versatile techniques that involve striking (punching and kicking) and grappling (throws, takedowns, joint locks, and chokes). A fighter wins if their opponent submits, is knocked out, or is unable to intelligently defend themselves as determined by a referee present in the cage with them. The UFC is the most popular and successful organization, and there is a lot of money involved.

Use cases

When you start looking at MMA from a data analytics point of view, many use cases come to mind. One that illustrates the point well is measuring how predictable a fighter's style is.

For example, the fighter Demian Maia is an elite Jiu-Jitsu practitioner. He is famous for his effective ground game, where he drags opponents to the floor, mounts them or takes their back, and chokes them. Maia is at one end of the scale: his game plan is very simple and everybody knows what he's going to do, but he is so good at it that he is very difficult to stop. The fighter Yair Rodriguez is at the other end of the scale, exhibiting an extravagant style full of jumping, spinning kicks, and somersaults mixed with surprise takedowns. It doesn't appear that even he knows what he's going to do from one second to the next.

Conceptual Model

A basic conceptual model for this domain may include the following entities: Fighter, Fight, FightEvent, and EventCategory.

The Fighter entity has a lot of data: physical attributes, age, weight class, fighting stance, stamina, ranks in various martial arts disciplines, match history, injury history, favorite techniques, etc.

The Fight entity has a lot of data, too: venue, opponents, referee, number of rounds, and a collection of events.

The FightEvent entity represents anything relevant that happens during a match: fighter A advances, takedown/throw attempt, jab thrown, jab lands, uppercut thrown, knock down, front kick to the face, eye poke (illegal), stance switch, guard pass to side control, arm-bar attempt, etc.

The EventCategory classifies events into various categories. For example, movement, punch, kick, judo throw, position change on the ground, submission.

The interesting aspect of fight events is that they form a time series: the order and timing of event sequences contain a lot of information that serves our use cases.

Note that the UFC organizes and promotes events that contain multiple matches in one night. The events we consider here are fight events happening during the match.
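To make the model concrete, here is a minimal sketch of these entities in Python; the field names are illustrative, not a definitive schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Fighter:
    fighter_id: int
    name: str
    age: int
    weight_class: str
    stance: str                      # e.g. "orthodox" or "southpaw"
    disciplines: List[str] = field(default_factory=list)

@dataclass
class FightEvent:
    round: int
    ts: int                          # seconds into the round (0-299)
    fighter_id: int                  # who performed the action
    event_id: int                    # references an EventCategory

@dataclass
class Fight:
    fight_id: int
    venue: str
    referee: str
    num_rounds: int
    fighter_ids: List[int] = field(default_factory=list)
    events: List[FightEvent] = field(default_factory=list)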

MMA Analytics

Machine learning, or more traditional statistical analysis and visualization, can take all the data about a fighter and their opponent, both historical and real-time during a match. It can provide a lot of insights that will help the well-informed fighter and their coaches prepare the perfect game plan for a particular opponent in a particular match, and adjust it intelligently based on how the match evolves.

Cassandra, MySQL and Redis

The UFC moneyball data is diverse and will be used in different ways. Storing it all in one database is not ideal. In this section, I'll briefly describe the databases we will use in our design.

Cassandra/Scylla

The open source Apache Cassandra is a great database for time-series data. It was designed for distributed, large-scale workloads. It is fast, stable, and battle-tested. It is decentralized, highly available, and has no single point of failure. Cassandra writes are idempotent, and it provides an interesting mix of consistency levels on a per-query basis. Cassandra succeeds in doing all that by a careful selection of its feature set, and an even more careful selection of the features it doesn't implement. For example, efficient ad-hoc queries are not supported. With Cassandra, you'd better know the shape of your queries when you model your data and design your schema.

Scylla is a high-performance drop-in replacement for Cassandra, which is already plenty fast itself. It claims 10X better throughput and super low latency. Conveniently, Compose provides Hosted Scylla (in beta right now).

This is cool because you get to benefit from the extensive Cassandra documentation, experience, tooling and community and yet run a streamlined and highly optimized Scylla engine.
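Because Scylla speaks the Cassandra protocol, the standard Python cassandra-driver works against it unchanged. Here is a minimal connection sketch (the host and credentials are placeholders for your own deployment) that also shows picking a consistency level on a per-query basis:

from cassandra import ConsistencyLevel
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Placeholder host and credentials -- substitute your deployment's.
auth = PlainTextAuthProvider(username='scylla', password='secret')
cluster = Cluster(['portal.0.my-scylla.compose.direct'], port=9042,
                  auth_provider=auth)
session = cluster.connect()

# Consistency is chosen per query: QUORUM here, ONE for cheap reads,
# ALL for the strictest ones.
query = SimpleStatement("SELECT release_version FROM system.local",
                        consistency_level=ConsistencyLevel.QUORUM)
print(session.execute(query).one())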

MySQL

MySQL needs no introduction. I'll just mention that Compose now has a hosted MySQL service in beta. If you prefer PostgreSQL, which Compose also hosts, or any other relational database, that's fine. I will not be using any MySQL-specific capabilities here and the concepts transfer.

Redis

Redis is top of the class when it comes to fast in-memory key-value stores. But, it is much more than that and defines itself as a data structure server. We'll see this capability in action later. Of course, Compose can host Redis for you.

A Hybrid Poly-Store Storage Scheme

In this section, we'll map our conceptual model onto concrete storage. The basic idea is to play to the strengths of each store and put each type of data or metadata into the most appropriate one. The application can then combine data from multiple stores.

Storing Fight Events in Cassandra

Cassandra is a columnar database: column data is stored sequentially in memory (and on disk). But, unlike relational databases, you can't query arbitrary data in a single query. Cassandra organizes the data in wide rows. Each wide row has a key and can contain a lot of data (e.g. 100MB), and you can query only a single wide row at a time. If you try to think of it in relational terms, a wide row is the analog of a SQL table in a DB that doesn't support joins. This can get really confusing because CQL (Cassandra Query Language) is syntactically very similar to SQL, but the same terms mean different things. For example, a Cassandra table is made of multiple wide rows. Each row in the table shares the same schema, but since you can query only a single row at a time, it is better to think of a Cassandra table as a collection of SQL tables with similar schemas in a sharded relational database. This is pretty accurate because different wide rows may be split across machines.

Another limitation of Cassandra's design is that you can efficiently query only consecutive data from a single wide row. That means it is very important to design your schema in a way that matches your queries. If you need to query data in different orders, Cassandra says disks are cheap and you just need to store the data multiple times in different orders (a.k.a. materialized views). Let's see how all this affects our modeling of fight events.

We're interested in querying fight events at the match level and then at the round level. This way we can analyze the meaningful time series. We may be interested also in doing longitudinal studies on a particular fighter, how they evolved over their career, what are their strengths and weaknesses etc.

Here is a Cassandra table schema that addresses these concerns:

CREATE KEYSPACE fightdb WITH REPLICATION = {  
    'class' : 'SimpleStrategy', 
    'replication_factor' : 3 };

use fightdb;

DROP TABLE IF EXISTS fight_events;

CREATE TABLE fight_events (  
   fight_id int,
   round int,
   ts int,
   fighter_id int,
   event_id int,
   PRIMARY KEY (fight_id, round, ts)
) WITH CLUSTERING ORDER BY (round ASC, ts ASC);

Let's break it down. The first line creates a keyspace called fightdb, which is like a separate DB with its own policies. Normally, the replication factor will be at least 3 to gain redundancy. Then we tell Cassandra to use it, so there is no need to qualify names with the keyspace name. Next, we drop the fight_events table in case we're re-creating the DB from scratch. Don't do this in production because you'll destroy all your data; use ALTER TABLE to modify the schema instead. Finally, we get to create the fight_events table.

It looks like regular SQL. The columns are defined using Cassandra data types. The primary key is where things get interesting. The primary key is composed of a partition key and a clustering key. The partition key is fight_id and it defines the wide row: every entry with the same fight id will go into the same wide row. The clustering key is round and ts. The ts column represents seconds into the current 5-minute round (values will be 0 through 299). Inside the wide row, each record is called a compound column, which is the analog of a SQL row. Then we have the clustering order, which says that the order will be by round first and ts second, both ascending.

It looks pretty harmless so far. Let's insert some data. Inserts look just like SQL inserts, but you have to provide the primary key. There is no auto-incrementing ID (which you want to avoid in a distributed system anyway).

INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id)  
VALUES (1, 1, 10, 2, 1);

INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id)  
VALUES (1, 1, 11, 2, 4);

INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id)  
VALUES (1, 1, 12, 1, 3);

INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id)  
VALUES (1, 2, 7, 1, 4);

INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id)  
VALUES (1, 2, 8, 2, 1);

INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id)  
VALUES (2, 1, 3, 2, 2);

INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id)  
VALUES (2, 1, 4, 2, 1);  

Cassandra will also overwrite records with the same primary key. There are no uniqueness constraints and no duplicates in Cassandra. The reason is that Cassandra writes are idempotent: you can perform the same operation multiple times without modifying the state. So, an insert is also an update in Cassandra.
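To see this upsert behavior from application code, here is a short sketch with the Python driver (reusing the session from the connection example above). It re-runs the very first insert twice, which leaves the data exactly as it was:

# Running the same INSERT twice leaves exactly one record, because
# the primary key (fight_id, round, ts) is the record's identity.
insert = session.prepare(
    "INSERT INTO fightdb.fight_events "
    "(fight_id, round, ts, fighter_id, event_id) VALUES (?, ?, ?, ?, ?)")

session.execute(insert, (1, 1, 10, 2, 1))
session.execute(insert, (1, 1, 10, 2, 1))  # safe to retry: same state

count = session.execute(
    "SELECT COUNT(*) FROM fightdb.fight_events "
    "WHERE fight_id = 1 AND round = 1 AND ts = 10").one()
print(count)  # 1, not 2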

OK, let's run some queries. This is where things get interesting. Starting with selecting all records:

select * from fight_events;

 fight_id | round | ts | event_id | fighter_id
----------+-------+----+----------+------------
        1 |     1 | 10 |        1 |          2
        1 |     1 | 11 |        4 |          2
        1 |     1 | 12 |        3 |          1
        1 |     2 |  7 |        4 |          1
        1 |     2 |  8 |        1 |          2
        2 |     1 |  3 |        2 |          2
        2 |     1 |  4 |        1 |          2

(7 rows)

So far, so good. Note that the non-key columns come back in a different order than we declared them (event_id before fighter_id): Cassandra orders non-key columns by name. The rows themselves are indeed ordered by round and ts. Let's verify the overwrite behavior by inserting a record with the same primary key, replacing the existing one.

INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id)  
VALUES (1, 1, 10, 2, 5);  

We replaced the first record. Selecting all the records shows the new event_id, and the order remains the same:

select * from fight_events;

 fight_id | round | ts | event_id | fighter_id
----------+-------+----+----------+------------
        1 |     1 | 10 |        5 |          2
        1 |     1 | 11 |        4 |          2
        1 |     1 | 12 |        3 |          1
        1 |     2 |  7 |        4 |          1
        1 |     2 |  8 |        1 |          2
        2 |     1 |  3 |        2 |          2
        2 |     1 |  4 |        1 |          2

(7 rows)

Let's try getting just the events with an id greater than 3:

select * from fight_events where event_id > 3;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"  

We can't do that. The ALLOW FILTERING option results in a full table scan, so it's not much help. Maybe we can at least select events with a particular event_id (e.g. 4):

select * from fight_events where event_id = 4;  
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"  

We get exactly the same error. You must specify at least the partition key. Maybe we're asking too much; event_id is not part of the primary key, so it's understandable why you can't efficiently query by it. Let's go for something simpler and just get all the records from round 1:

select * from fight_events where round = 1;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"  

You can't do that either. Cassandra has the IN operator that allows you to provide multiple keys in one query, but you will have to list each and every partition key:

select * from fight_events where fight_id IN (1, 2) AND round = 1;

 fight_id | round | ts | event_id | fighter_id
----------+-------+----+----------+------------
        1 |     1 | 10 |        5 |          2
        1 |     1 | 11 |        4 |          2
        1 |     1 | 12 |        3 |          1
        2 |     1 |  3 |        2 |          2
        2 |     1 |  4 |        1 |          2

(5 rows)

You must use equality tests on all the components of your WHERE clause (which should only use the partition key and then clustering columns, left to right), except the last component, where you can use inequalities or ranges.

For example, to get events that occurred after the first 5 seconds of round 1:

select * from fight_events where fight_id IN (1, 2) AND round = 1 AND ts > 5;

 fight_id | round | ts | event_id | fighter_id
----------+-------+----+----------+------------
        1 |     1 | 10 |        5 |          2
        1 |     1 | 11 |        4 |          2
        1 |     1 | 12 |        3 |          1

(3 rows)

Here are a couple of other queries that are invalid in CQL:

select * from fight_events where fight_id IN (1, 2) AND ts > 5;

InvalidRequest: Error from server: code=2200 [Invalid query] message="PRIMARY KEY column "ts" cannot be restricted as preceding column "round" is not restricted"

select * from fight_events where fight_id IN (1, 2) AND round < 3 AND ts > 5;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "ts" cannot be restricted (preceding column "round" is restricted by a non-EQ relation)"  

What about indices? Cassandra supports secondary indexes, but they come with so many restrictions and caveats that the consensus is that you should rarely use them. Check out this article for all the nitty-gritty details about secondary indices: Cassandra Native Secondary Index Deep Dive.

The official position of Cassandra is that disks are cheap, and if you want to access your data in different ways, you should simply duplicate the data. In general, for every query type you should have a dedicated table where the data is already organized sequentially such that you can pull out the answer as-is. For example, if we want to pull all the data by round, then our partition key should be round. Consider this primary key for the same fight_events table:

PRIMARY KEY (round, fight_id, ts)  
WITH CLUSTERING ORDER BY (fight_id ASC, ts ASC);  

We merely switched fight_id and round, but this changes everything (and not for the better). Remember that the partition key defines a wide row that must be fully present on the same node. Since there are only 3 rounds (ignoring championship fights that last 5 rounds) then we can't distribute our data on more than 3 nodes. This is ridiculous of course. Given enough time and data a single machine won't even be able to hold all the fight events that occurred in round 1. The solution is a compound partition key. For example, we can add the month to the partition key (the parentheses group tells Cassandra that round and month are a compound partition key):

PRIMARY KEY ((round, month), fight_id, ts)  
WITH CLUSTERING ORDER BY (fight_id ASC, ts ASC);

Now, each wide row will contain all the events that occurred in a given round in a particular month. This is better for data distribution, and you don't have to worry about the data in a single wide row growing over time beyond the capacity of a single machine. But now, if you want to query all the events in round 1 over the entire year of 2016, you'll need to run 12 queries, one for each combination of round 1 and a month:

select * from fight_events where round = 1 and month = 1;  
select * from fight_events where round = 1 and month = 2;  
select * from fight_events where round = 1 and month = 3;  
...
select * from fight_events where round = 1 and month = 12;  
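The fan-out is easy to automate in application code. Here is a hedged sketch; it assumes the duplicated table is named fight_events_by_round (an illustrative name) and reuses the session from earlier:

# Fan one logical query ("all round 1 events in 2016") out into 12
# physical queries, one per (round, month) partition, then merge.
select = session.prepare(
    "SELECT fight_id, ts, fighter_id, event_id "
    "FROM fightdb.fight_events_by_round WHERE round = ? AND month = ?")

round1_events = []
for month in range(1, 13):
    round1_events.extend(session.execute(select, (1, month)))

print(len(round1_events), "round 1 events")

The driver's execute_async would let the 12 queries run in parallel rather than one after the other.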

In general, we prefer to avoid data duplication. Cassandra's assertion that disks are cheap doesn't hold up for web-scale systems. In a previous company, I ran a Cassandra cluster that accumulated half a billion events per day. Over more than 3 years, that system collected many terabytes of data. Various analytics jobs required fast access to the entire dataset. The thing with Cassandra is that even if disk space is relatively cheap, network traffic isn't. Cassandra replicates data as part of its robust design, and it is also constantly compacting and re-shuffling data across the cluster. The more data you have, the more you pay for maintenance operations that might even be invisible to you.

That's the reason to store as little as possible in Cassandra. You'll also note that the schema contains just integer ids. Where is the actual data? Again, the idea is to save storage. Why repeatedly store the same values? Even with Cassandra's compression, there is a price to pay (mostly in big result sets). This is especially true if you need to update some value stored ubiquitously across the cluster.

Enter the relational DB.

Storing Metadata in MySQL

The idea is that all these ids like fight_id, event_id and fighter_id are identifiers of rows in a corresponding relational metadata DB. Let's look at a simple schema:

CREATE TABLE fight_event (  
    id INTEGER,
    name VARCHAR(255),
    PRIMARY KEY(id)
)   ENGINE=INNODB;

CREATE TABLE fighter (  
    id INTEGER,
    name VARCHAR(255),
    age  INTEGER,
    weight INTEGER,
    PRIMARY KEY(id)    
)   ENGINE=INNODB;

CREATE INDEX fighter_age ON fighter(age);  
CREATE INDEX fighter_weight ON fighter(weight);

CREATE TABLE fight (  
    id INTEGER,
    fighter1_id INTEGER,
    fighter2_id INTEGER,
    title VARCHAR(255),
    PRIMARY KEY(id),
    FOREIGN KEY (fighter1_id) REFERENCES fighter(id),
    FOREIGN KEY (fighter2_id) REFERENCES fighter(id)
)   ENGINE=INNODB;

CREATE INDEX fight_fighter1 ON fight(fighter1_id);  
CREATE INDEX fight_fighter2 ON fight(fighter2_id);  

MySQL manages the metadata, which is indexed extensively. Since the metadata is very read-heavy, having a lot of indices to update on insert doesn't present a problem, and you can slice and dice it very efficiently to arrive at the important ids that are stored in Cassandra. The hybrid query pattern is: query MySQL with convoluted ad-hoc queries to your heart's content; you end up with fight ids and fighter ids that you use to construct and filter Cassandra queries; and when Cassandra returns a result set with a bunch of event ids, you look them up in the fight_event table, or more likely in an in-memory dictionary you loaded at the beginning of your program.
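Here is a minimal end-to-end sketch of that pattern in Python. The connection details are placeholders, pymysql stands in for your preferred MySQL client, and the Cassandra session is the one from earlier: slice the metadata in MySQL, pull the time series from Cassandra, and resolve event names from an in-memory dictionary.

import pymysql

# Placeholder connection details for the metadata DB.
mysql = pymysql.connect(host='localhost', user='moneyball',
                        password='secret', database='fight_metadata')
with mysql.cursor() as cur:
    # 1) Ad-hoc slicing in MySQL: fights whose first fighter is heavy.
    cur.execute("""SELECT f.id FROM fight f
                   JOIN fighter f1 ON f.fighter1_id = f1.id
                   WHERE f1.weight > 200""")
    fight_ids = [row[0] for row in cur.fetchall()]

    # 3) Load the small event-name catalog once, up front.
    cur.execute("SELECT id, name FROM fight_event")
    event_names = dict(cur.fetchall())

# 2) Pull the event time series for those fights from Cassandra.
select = session.prepare(
    "SELECT round, ts, fighter_id, event_id "
    "FROM fightdb.fight_events WHERE fight_id = ?")
for fight_id in fight_ids:
    for ev in session.execute(select, (fight_id,)):
        print(fight_id, ev.round, ev.ts, event_names.get(ev.event_id))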

Storing Blazing Hot Data in Redis

It sounds like we're all set with the hybrid Cassandra + MySQL query system. But sometimes it's not enough. Consider a live UFC championship fight, with millions of viewers watching via our custom app that adds live stats and displays various visualizations of real-time fight events and slow-motion replays. The typical web solution for massive demand on popular content is a CDN (content delivery network). CDNs are great, but they are mostly optimized for static, large content. Here we're talking about live streams of relatively small data.

You may try to serve each request dynamically from the hybrid Cassandra + MySQL system, but the reality is that it is very difficult to fine-tune the caching behavior. Instead, we can use Redis. Redis is a super-fast, in-memory (yet optionally durable) data structure server. That means it's a fancy key-value store that excels at retrieving data for its users. It can be distributed via Redis Cluster, so you don't have to worry about being limited to a single machine. When there is massive demand for a lot of data, Redis can be a great way to improve the responsiveness of the system, as well as to provide additional capacity quickly (à la elastic horizontal scaling). In comparison, adding a new node to a Cassandra cluster is a long and tedious process: the replication will impact the entire cluster because Cassandra will try to evenly distribute the data between all nodes, even if you just want to add a node temporarily to handle a spike in requests.

Redis can also be great for distributed locks and counters. Overall, Redis gives you a lot of options for high-performance flexible operations on data that is not suitable for either Cassandra or MySQL. Cassandra has distributed counters, but they suffer from high latency due to some design limitations.
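As an example of the lock use case, here is a redis-py sketch of a distributed lock guarding a background job that only one worker should run at a time; the key name and timeouts are illustrative.

import redis

def recompute_fight_stats():
    ...  # stand-in for the real background job

r = redis.Redis(host='localhost', port=6379)

# Only one worker at a time can hold the lock. It auto-expires after
# 30 seconds if the holder crashes; we give up waiting after 5.
with r.lock('locks:recompute_fight_stats', timeout=30,
            blocking_timeout=5):
    recompute_fight_stats()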

For example, let's say we want to keep track of the significant strikes (a very important statistic) of every fighter in every fight this evening. A good way to model this in Redis is with its HASH data structure. A HASH is a dictionary or map. Let's create a HASH called significant_strikes that maps the pair fight_id:fighter_id to the number of significant strikes that fighter delivered to their opponent. Note that in some tournaments the same fighter may participate in multiple fights.

Here we initialize the significant_strikes HASH by setting two keys (fight_id:fighter_id) to 0. In this case, the fighters 44 and 55 fight each other in fight 123.

HSET significant_strikes 123:44 0  
HSET significant_strikes 123:55 0  

Let's say fighter 44 delivered a significant strike. We need to increment their counter:

HINCRBY significant_strikes 123:44 1  

Now, suppose 55 countered with a 3-strike combo (Wow!):

HINCRBY significant_strikes 123:55 3  

At any point, you can get the entire significant_strikes HASH:

HGETALL significant_strikes  
1) "123:44"  
2) "1"  
3) "123:55"  
4) "3"  

Or just a specific key, with HGET:

HGET significant_strikes 123:55  
"3"

Architecting the UFC Moneyball

Let's go big and think about the overall system architecture. The working assumption is that a large number of users will access the data concurrently. During a live event there will be peak demand for data related to the matches and the participating fighters. In addition, various jobs will run in the background, and some long-running machine learning processes will digest data and crunch numbers constantly. There will be publicly facing REST APIs. Stateless API servers (e.g. nginx) will delegate queries and requests to internal services via fast protocols (e.g. gRPC). The services will fetch data from all the stores, merge it, massage it, and return it to the users via the APIs. The users will consume the data via various clients: mobile, web, custom tools, etc. In addition to Cassandra, MySQL, and Redis, the system may also use cloud storage such as AWS S3 for archiving cold data and for backups. The system will run on one of the public cloud providers: AWS, GCE, or Azure. The stateless microservices will be deployed as Docker containers orchestrated by a Kubernetes cluster, while the data stores will be deployed directly.
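To make this a bit more tangible, here is a minimal Flask endpoint sketch of such a service (the route and names are illustrative). It serves the hot Redis counters for a fight; a fuller version would also merge fighter names from MySQL and event history from Cassandra.

from flask import Flask, jsonify
import redis

app = Flask(__name__)
r = redis.Redis(decode_responses=True)

@app.route('/fights/<int:fight_id>/significant_strikes')
def significant_strikes(fight_id):
    # Hot, live counters come straight from the Redis HASH.
    counts = {key: int(value)
              for key, value in r.hgetall('significant_strikes').items()
              if key.startswith(f'{fight_id}:')}
    return jsonify(counts)

if __name__ == '__main__':
    app.run()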

Conclusion

Large-scale systems require multiple types of data stores to manage their data properly. When you deal with time-series data, Cassandra is a solid option. ScyllaDB is a promising high-performance drop-in replacement for Cassandra. But, Cassandra data modeling is not trivial and querying it efficiently can be assisted by storing metadata in a relational DB like MySQL. Redis is a great option for caching frequently used data in memory to offload pressure from Cassandra and MySQL. One of the most challenging aspects when designing a large-scale system that has to handle a lot of data is figuring out what kinds of data you need to handle, their cardinality, and the operations that you need to perform on each. Of course, very often you will not have a full grasp at the outset of your problem domain and even if you do, things will change. That means that you also have to build a flexible enough system that will allow you to move data between stores (and possibly add more data stores) as you learn more.

Gigi Sayfan is the chief platform architect of VRVIU, a start-up developing cutting-edge hardware and software technology in the virtual reality space. Gigi has been developing software professionally for 21 years in domains as diverse as instant messaging, morphing, chip fabrication process control, embedded multi-media application for game consoles, brain-inspired machine learning, custom browser development, web services for 3D distributed game platform, IoT/sensors and most recently virtual reality.

Image by Skitterphoto.

This article is licensed with CC-BY-NC-SA 4.0 by Compose.