This is a quick note on reducing memory overhead when reading large text files. I ran into a problem where I needed to process text files ranging from a few hundred MB to a few GB in size, and the traditional way of loading the whole file into memory was significantly slowing down the computer. Sometimes the computer became unresponsive while the file was being loaded.
After searching and experimenting a bit, a Python generator function seems to solve the problem. This post combines the two solutions offered in the following links:
https://www.journaldev.com/32059/read-large-text-files-in-python
https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python
Before getting to the solution itself, I need to mention that one of the mental blocks I had was that my original source file did not have clean line breaks. I was using a scroll function to retrieve 10,000 Elasticsearch records at a time and simply writing them to another file before uploading it to an S3 bucket.
The files will look something similar to this:
{{...}{...}{...}{...}{...}{...}{...}{...}{...}{...}}{{...}{...}{...}{...}{...}{...}{...}{...}{...}{...}}{{...}{...}{...}{...}{...}{...}{...}{...}{...}{...}}
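For context, the export loop that produced those files looked roughly like the sketch below. This is only an illustration: it assumes the elasticsearch Python client, and the index name, query, and output filename are placeholders.
#!/usr/bin/env python
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
# helpers.scan wraps the scroll API and yields one hit per iteration;
# size controls how many records each scroll request pulls back.
with open('largefile.txt', 'w') as out:
    for hit in helpers.scan(es, index='my-index',
                            query={'query': {'match_all': {}}}, size=10000):
        # The records were originally written back to back with no separator.
        out.write(json.dumps(hit['_source']))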
The first step was to make the records clearly delimited by line breaks in the original file. This added to the size but makes parsing a lot easier:
{...}
{...}
{...}
{...}
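In terms of the export sketch above, the only change was appending a newline when writing each record:
# One JSON object per line (newline-delimited JSON) instead of back-to-back objects.
out.write(json.dumps(hit['_source']) + '\n')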
The first script uses the usual with open statement to open the file and loop over the lines:
#!/usr/bin/env python
import os
import resource
filename = 'largefile.txt'
print(f'File Size is {os.stat(filename).st_size / (1024 * 1024)} MB')
line_count = 0
with open(filename, 'r') as f:
    for line in f:
        line_count += 1
print(f'line count: {line_count}')
# ru_maxrss reports the peak resident set size the process has used
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
The second script uses the yield keyword to return each line for processing before loading the next one:
#!/usr/bin/env python
import os
import resource
filename = 'largefile.txt'
print(f'File Size is {os.stat(filename).st_size / (1024 * 1024)} MB')
def read_large_file(filename):
    line_count = 0
    with open(filename, 'r') as f:
        for line in f:
            line_count += 1
            yield line

if __name__ == "__main__":
    # Iterate over the generator so the whole file is actually read,
    # one line at a time, before measuring peak memory usage.
    for line in read_large_file(filename):
        pass
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
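Once the generator is in place, each yielded line can be handed to whatever processing is needed as it arrives. For example, if each line is one JSON record as in the files above, something like this would work (process_record here is just a hypothetical placeholder):
import json

def process_record(record):
    # Placeholder for the real per-record work.
    pass

for line in read_large_file(filename):
    process_record(json.loads(line))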
Here are the performance results; the memory numbers are the ru_maxrss values in kilobytes (https://manpages.debian.org/buster/manpages-dev/getrusage.2.en.html):
$ python read_attemp_1.py
File Size is 213.77720069885254 MB
line count: 159020
7364608
$ python read_attemp_2.py
File Size is 213.77720069885254 MB
6709248
As the Stack Overflow link above indicates, we can also use f.read() with a size argument to load a chunk of the file contents into a buffer first if the file is not structured with line breaks.
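A minimal sketch of that chunked approach, adapted from the Stack Overflow answer (the 1 MB chunk size is an arbitrary choice):
def read_in_chunks(file_object, chunk_size=1024 * 1024):
    # Yield the file as fixed-size chunks until EOF instead of line by line.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open(filename, 'r') as f:
    for chunk in read_in_chunks(f):
        # Each chunk is a string of up to chunk_size characters; any parsing
        # has to handle records that span chunk boundaries.
        pass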
I hope this offers some value to people when they need to process large text files. I know I will come back to this post from time to time when the need arises.
Happy coding,
Eric