Quickest way of importing 500 GB of text files, taking only the sections wanted

I have about 500 GB of text files, separated by month. In each file the first 43 lines are just connection information (not needed). The next 75 lines are the descriptors for an observation. This is followed by 4 lines (not needed), then the next observation of 75 lines, and so on.

All I actually want are those 75 descriptor lines (the descriptors are in the same place for every observation), which look like this:

ID: 5523
Date: 20052012
Mixed: <Null>
.
.

I want to convert each observation to CSV format, e.g. 5523;20052012;;.., so that I end up with much smaller text files. Since the descriptors are always the same, I’ll know that the first position, for example, is the ID.
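
To make the conversion concrete, for each descriptor line I essentially want something like this (just a rough sketch; the field names are only the ones from the example above):

def descriptor_value(line):
    # "ID: 5523" -> "5523"; "Mixed: <Null>" -> "" (an empty CSV field)
    value = line.split(':', 1)[1].strip()
    return '' if value == '<Null>' else value

The 75 values for one observation would then be joined with ';' into a single row.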

Once I finish with one text file I’ll open the next one and append its output (or would creating a new file be quicker?).

What I’ve done so far is quite inefficient: I open the file, load it, and delete the unwanted lines one by one. If it already takes a fair while on a test sample, it clearly isn’t the best method.

Any suggestions would be great.

Best answer

You said that you have “about 500 GB of text files.” If I understand correctly, you don’t have a fixed length for each observation (note, I’m not talking about the number of lines, I mean the total length, in bytes, of all the lines for an observation). This means you will have to go through each file in full, because you can’t know in advance where the newlines are, so you can’t just seek to a fixed byte offset for each observation.

Now, depending on how large each individual text file is, you may need to look for a different answer. But if each file is sufficiently small (less than 1 GB?), you might be able to use the linecache module, which handles the seeking-by-line for you.

You’d use it perhaps like this:

import linecache

filename = 'observations1.txt'

# Start at 44th line
curline = 44
lines = []

# Keep looping until no return string is found
# getline() never throws errors, but returns an empty string ''
# if the line wasn't found (if the line was actually empty, it would have
# returned the newline character '\n')
while linecache.getline(filename, curline):
    for _ in range(75):  # one observation is 75 descriptor lines
        lines.append(linecache.getline(filename, curline).rstrip())
        curline += 1

    # Perform work with the set of observation lines
    add_to_observation_log(lines)

    # Skip the unnecessary section and reset the lines list
    curline += 4
    lines = []

I tried a test of this, and it chewed through a 23 MB file in five seconds.
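
If it helps, here is one way add_to_observation_log could be filled in, given the “Key: value” format you described: split each line on the first colon, map <Null> to an empty field, and append one semicolon-separated row per observation to an output file. This is only a sketch under those assumptions, and the output filename is a placeholder:

def add_to_observation_log(observation_lines, output_path='observations.csv'):
    # Build one semicolon-separated row from the 75 descriptor lines
    # and append it to the output file.
    values = []
    for line in observation_lines:
        # "ID: 5523" -> "5523"; a "<Null>" value becomes an empty field
        value = line.split(':', 1)[1].strip()
        values.append('' if value == '<Null>' else value)
    with open(output_path, 'a') as out:
        out.write(';'.join(values) + '\n')

Because it opens the output in append mode, you can keep writing every month’s rows into the same CSV rather than creating a new file each time; if reopening the file for every observation turns out to be slow, keep a single file handle open for the whole run and pass it in instead.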