How do I free up the memory used by an lxml.etree?

I’m loading data from a bunch of XML files with lxml.etree, but I’d like to close them once I’m done with this initial parsing. Currently the XML_FILES list in the code below takes up 350 MiB of the program’s 400 MiB of used memory. I’ve tried del XML_FILES, del XML_FILES[:], XML_FILES = None, for etree in XML_FILES: etree = None, and a few more, but none of them seems to work. I also can’t find anything in the lxml docs about closing an lxml file. Here’s the code that does the parsing:

from lxml import etree

def open_xml_files():
    # `paths` is a list of XML filenames defined elsewhere
    return [etree.parse(filename) for filename in paths]

def load_location_data(xml_files):
    location_data = {'city': {}}

    for xml_file in xml_files:
        for city in xml_file.findall('City'):
            code = city.findtext('CityCode')
            name = city.findtext('CityName')
            location_data['city'][code] = name

        # [A few more loops like the one above]

    return location_data

XML_FILES = utils.open_xml_files()
LOCATION_DATA = load_location_data(XML_FILES)
# XML_FILES never used again from this point on

Now, how do I get rid of XML_FILES here?

Best answer

You might consider etree.iterparse, which parses incrementally instead of loading the whole tree into memory at once. Combined with a generator expression, this might save your program some memory.

def open_xml_files():
    return (etree.iterparse(filename) for filename in paths)

iterparse returns an iterator over (event, element) pairs, while parse immediately parses the whole file and loads its contents into memory. The difference in memory usage comes from the fact that iterparse doesn’t actually do any work until you iterate over it (in this case, implicitly via a for loop).
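As a minimal illustration (the sample document and tag names here are invented to mirror the question’s structure):

```python
from io import BytesIO
from lxml import etree

# Toy in-memory document standing in for one of the question's XML files.
xml = (b"<Root>"
       b"<City><CityCode>AMS</CityCode><CityName>Amsterdam</CityName></City>"
       b"</Root>")

# Nothing is parsed yet -- iterparse just wraps the source.
events = etree.iterparse(BytesIO(xml), tag='City')

# Parsing happens as we iterate; each step yields an (event, element) pair,
# with 'end' events fired by default once an element is fully parsed.
cities = {elem.findtext('CityCode'): elem.findtext('CityName')
          for event, elem in events}
```

At the 'end' event an element’s subtree is complete, so findtext works just as it would on a fully parsed tree.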

EDIT: Apparently iterparse does work incrementally, but doesn’t free memory as it parses. You could use the solution from this answer to free memory as you traverse the XML document.
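That pattern looks roughly like the sketch below: clear each element after processing it, then delete the already-handled siblings that the root element still references (the sample XML is made up to match the question’s structure):

```python
from io import BytesIO
from lxml import etree

# Toy document standing in for one of the question's XML files.
xml = (b"<Root>"
       b"<City><CityCode>AMS</CityCode><CityName>Amsterdam</CityName></City>"
       b"<City><CityCode>BER</CityCode><CityName>Berlin</CityName></City>"
       b"</Root>")

codes = []
for event, elem in etree.iterparse(BytesIO(xml), tag='City'):
    codes.append(elem.findtext('CityCode'))
    # Clear this element's children once processed, then delete
    # already-handled preceding siblings that the root still holds,
    # so memory use stays bounded regardless of file size.
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
```

Without the clear/delete step, the partially built tree keeps growing even though iterparse itself reads the file incrementally.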