Benchmarketing – Neo4j and MySQL

CTEs are out in MySQL 8, which makes it easier to write graph queries.  I wanted to revisit some of the performance comparisons I’ve seen online touting Neo4j’s superiority.

For my tests, I’ve only done the MySQL side.  I’m comparing to numbers given for Neo4j in a couple of online posts.  They may be outdated.  TLDR; MySQL did great, generally beating the (perhaps outdated) Neo4j numbers in the posts.

I spun up MySQL 8 on my laptop, with a 3G innodb_buffer_pool_size and all else default.  No fine-tuning.  I loaded data from https://snap.stanford.edu/data/soc-pokec.html, with ~1.6M nodes and ~30M relationships.

To cut to the quick, as data was cached, queries performed on a par with the Neo4j numbers in the posts I had read.  If the dataset can fit into memory, as I expect it probably did for Neo4j, MySQL does very well.  If it isn’t cached, it does fairly well, too.  As an overview, finding friends to a depth of 4 recursions took about 1/2 a second.  To a depth of 5, it took 10-15 seconds.

Following are more details of my quick tests.

I installed MySQL 8 on my macbook, and gave it the following ~/.my.cnf file:

[mysqld]

secure_file_priv=”

innodb_buffer_pool_size=2G

local_infile=1

[client]

loose-local-infile = 1

I created two tables, using a subset of the columns in the sample data, and loaded the datasets:

create table soc_pokec_profiles (

    user_id int primary key,

    public boolean,

    completion_percentage int,

    gender boolean,

    region varchar(256),

    last_login datetime,

    registration datetime,

    AGE int,

    body varchar(256),

    working_in_field varchar(256),

    spoken_languages varchar(256),

    hobbies varchar(256),

    enjoy_good_food varchar(256),

    pets varchar(256),

    body_type varchar(256),

    my_eyesight varchar(256),

    eye_color varchar(64),

    hair_color varchar(64),

    hair_type varchar(64),

    completed_level_of_education varchar(64),

    favourite_color varchar(256),

    relation_to_smoking varchar(64),

    relation_to_alcohol varchar(128),

    sign_in_zodiac varchar(64)

);

load data local infile ‘/Users/benkrug/Downloads/soc-pokec/soc-pokec-profiles.txt’

into table soc_pokec_profiles

FIELDS TERMINATED BY ‘\t’;

create table soc_pokec_relationships (

user_id int, friend_id int, primary key (user_id,friend_id));

load data infile ‘/Users/benkrug/Downloads/soc-pokec/soc-pokec-relationships.txt’

replace

into table soc_pokec_relationships

FIELDS TERMINATED BY ‘\t’;

I warmed up the buffer pool using

SELECT COUNT(*) FROM soc_pokec_profiles WHERE sign_in_zodiac = ‘load_cache’;

SELECT COUNT(*) FROM soc_pokec_relationships WHERE friend_id = 33;

THen I ran queries like the following which searches to 5 levels of recursions (“depth” 1 is level 0):

WITH RECURSIVE links as

(

  SELECT 1 AS depth, 990000 AS user_id

  FROM soc_pokec_profiles 

  UNION DISTINCT

  SELECT depth+1, friend_id FROM soc_pokec_relationships JOIN links 

    WHERE links.user_id=soc_pokec_relationships.user_id

    AND depth < 6

)

SELECT * FROM links;

Varying the initial user_id, 4 levels of recursion generally took .6 seconds.  5 levels took about 10-15 seconds.

For 4 levels, that’s faster than the Neo4j times given at https://neo4j.com/news/how-much-faster-is-a-graph-database-really/ .  For 5 levels, it’s slower, but nothing like the “not finished in an hour” listed.  In fact, nothing in the quote “For the simple friends of friends query, Neo4j is 60% faster than MySQL. For friends of friends of friends, Neo is 180 times faster. And for the depth four query, Neo4j is 1,135 times faster. And MySQL just chokes on the depth 5 query” is remotely correct in my results.  My times were all better than the Neo4j times given, until 5 levels, then still were very good.

Of course, I don’t have the details of the tests they used.  I will note that they mention 1M users, and don’t say how many relationships.  I used 1.6M users and 30M relationships.

A somewhat snarky post at https://maxdemarzi.com/2017/02/06/neo4j-is-faster-than-mysql-in-performing-recursive-query/ is very proud of 2.7 seconds for 4 levels, whereas I got .6 seconds practically out of the box.  (Also, the post mentions 100k nodes, then 1M nodes, it’s a bit confusing, and 10M relationships, whereas I had 30M.)

In all, don’t believe the hype.  Use what’s best based on all the relevant considerations (manageability, ease-of-use, maintainability, performance, etc).  And do your own testing.

P.S. For a good introduction to CTEs, I thought this book was good (see my review).  I also consulted the MySQL reference here .

Advertisements