Very poor weakref performance in Python/SQL Alchemy

I’ve spent the day trying to debug a memory problem in my Python script. I’m using SQLAlchemy as my ORM. There are several confounding issues here, and I’m hoping that if I list them all out, somebody will be able to point me in the right direction.

In order to achieve the performance I’m looking for, I read in all the records in a table (~400k), then loop through a spreadsheet, match against the records I’ve previously read in, then insert new records (~800k) into another table. Here’s roughly what the code looks like:

dimensionMap = {}
for d in connection.session.query(Dimension):
    dimensionMap[d.businessKey] = d.primarySyntheticKey

# len(dimensionMap) == ~400k, sys.getsizeof(dimensionMap) == ~4MB

allfacts = []
sheet = open_spreadsheet(path)
for row in sheet.allrows():
    dimensionId = dimensionMap[row[0]]
    metric = row[1]

    fact = Fact(dimensionId, metric)
    connection.session.add(fact)
    allfacts.append(fact)

    if row.number % 20000 == 0:
        connection.session.flush()

# len(allfacts) == ~800k, sys.getsizeof(allfacts) == ~50MB

connection.session.commit()

sys.stdout.write('All Done')

400k and 800k don’t seem like especially big numbers to me, but I’m nonetheless running into memory problems on a machine with 4GB of memory. This is really strange to me, as I ran sys.getsizeof on my two biggest collections, and they were both well under any size that would cause problems.

While trying to figure this out, I noticed that the script was running really, really slowly. So I ran a profile on it, hoping the results would lead me in the direction of the memory problem, and came up with two confounding issues.

Profiler output

First, 87% of the program time is spent in the commit, specifically on this line of code:

self.transaction._new[state] = True

This can be found in session.py:1367. self.transaction._new is an instance of weakref.WeakKeyDictionary(). Why is weakref:261:__setitem__ taking up so much time?
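For context on what that line is doing: a weakref.WeakKeyDictionary holds its keys only weakly, so an entry disappears as soon as the last strong reference to the key goes away. A minimal sketch of that behavior (the State class here is just an illustrative stand-in, not SQLAlchemy's actual instance-state class):

```python
import gc
import weakref

class State:
    """Illustrative stand-in for an instance-state object."""
    pass

d = weakref.WeakKeyDictionary()
s = State()
d[s] = True          # same shape as self.transaction._new[state] = True
assert len(d) == 1

del s                # drop the last strong reference to the key
gc.collect()         # make collection deterministic across interpreters
assert len(d) == 0   # the entry vanished along with its key
```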

Second, even when the program is done (‘All Done’ has been printed to stdout), the script continues, seemingly forever, with 2.2GB of memory used.

I’ve done some searching on weakrefs, but haven’t seen anybody mention the performance issues I’m facing. Ultimately, there isn’t a whole lot I can do about this, given that it’s buried deep in SQLAlchemy, but I’d still appreciate any ideas.

Key Learnings

As mentioned by @zzzeek, there’s a lot of overhead required to maintain persistent objects. Here’s a little graph to show the growth.

Total memory used vs number of persistent instances

The trendline suggests that each persistent instance takes about 2KB of memory overhead, even though the instance itself is only 30 bytes. This actually brings me to another thing I learned, which is to take sys.getsizeof with a huge grain of salt.

This function only returns the shallow size of an object, and doesn’t take into account any other objects that need to be there for the first object to make sense (__dict__, for example). You really need to use something like Heapy to get a good understanding of the actual memory footprint of an instance.
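To illustrate the difference, here is a rough recursive deep-size estimator. It's only a sketch (tools like Heapy or pympler do this far more accurately), and the Record class is purely illustrative:

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Roughly estimate an object's total footprint by following
    references into containers and __dict__.  A simplified sketch."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(i, seen) for i in obj)
    elif hasattr(obj, '__dict__'):
        size += deep_getsizeof(obj.__dict__, seen)
    return size

class Record:
    """Illustrative object with a couple of attributes."""
    def __init__(self, key, value):
        self.key = key
        self.value = value

r = Record('business-key-000001', 12345)
shallow = sys.getsizeof(r)   # ignores __dict__ and its contents
deep = deep_getsizeof(r)     # counts the attribute storage too
assert deep > shallow
```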

The last thing I learned is that, when Python is on the verge of running out of memory, and is thrashing like crazy, weird stuff happens that shouldn’t be taken as part of the problem. In my case, the massive slow-down, the profile pointing to the weakref creation, and the hangup after the program completed, are all effects of the memory issue. Once I stopped creating and keeping around persistent instances, and instead just kept around the objects’ properties that I needed, all the other issues went away.

Best answer

800K ORM objects is very large. These are Python objects, each of which has a __dict__ as well as an _sa_instance_state attribute which is itself an object, which then has weakrefs and other things inside of it. On top of that, the Session holds more than one weakref to your object: an ORM object is identity-tracked, a feature which provides a high degree of automation in persistence, but at the cost of lots more memory and function call overhead.
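The per-object cost is easy to see even without the ORM: a plain Python object carries a separate __dict__ for its attributes, while a tuple of the same two values does not. (The Fact class below is an illustrative stand-in; a real mapped instance additionally carries _sa_instance_state, weakrefs, and identity-map bookkeeping.)

```python
import sys

class Fact:
    """Illustrative stand-in for a mapped class."""
    def __init__(self, dimensionId, metric):
        self.dimensionId = dimensionId
        self.metric = metric

obj = Fact(1, 2.0)
tup = (1, 2.0)

# The instance looks tiny on its own, but its attribute storage
# lives in a separate dict that sys.getsizeof does not include.
obj_total = sys.getsizeof(obj) + sys.getsizeof(obj.__dict__)
tup_total = sys.getsizeof(tup)
assert obj_total > tup_total
```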

As far as why your profiling is all focused on that one weakref line, that seems very strange; I’d be curious to see the actual profile result there (see How can I profile a SQLAlchemy powered application? for background).

Your code example can be modified to not use any ORM identity-mapped objects as follows.
For more detail on bulk inserts, see Why is SQLAlchemy insert with sqlite 25 times slower than using sqlite3 directly?.

# 1. only load individual columns - loading simple tuples instead 
# of full ORM objects with identity tracking.  these tuples can be
# used directly in a dict comprehension
dimensionMap = dict(
    connection.session.query(Dimension.businessKey, Dimension.primarySyntheticKey)
)

# 2. For bulk inserts, use Table.insert() call with
# multiparams in chunks
buf = []
for row in sheet.allrows():
    dimensionId = dimensionMap[row[0]]
    metric = row[1]

    buf.append({"dimensionId": dimensionId, "metric": metric})

    if len(buf) >= 20000:
        connection.session.execute(Fact.__table__.insert(), params=buf)
        buf[:] = []

if buf:
    connection.session.execute(Fact.__table__.insert(), params=buf)
sys.stdout.write('All Done')