Saturday, 14 September 2013

How to read a big binary file and split its content by some marker

In Python, reading a big text file line-by-line is simple:
for line in open('somefile', 'r'): ...
But how can I read a binary file and split its content lazily (with a
generator) on some given marker instead of the newline '\n'?
I want something like this:
content = open('somefile', 'rb').read()
result = content.split(b'some_marker')
but memory-efficient, since the file is around 70GB and cannot be loaded
at once. Reading the file one byte at a time is not an option either: it
would be far too slow on a hard disk.
The length of the pieces (the data between the markers) may vary, in
theory from 1 byte to several megabytes.
To sum up with an example, the data looks like this (digits stand for
bytes here; the data is in a binary format):
12345223-MARKER-3492-MARKER-34834983428623762374632784-MARKER-888-MARKER-...
Is there a simple way to do that without implementing it all by hand
(reading in chunks, splitting the chunks, remembering tails, etc.)?
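
For reference, the manual approach described above can be kept fairly short. The following is only a minimal sketch, not a definitive answer: the function name split_by_marker and the chunk_size default are made up for illustration, and it assumes the marker is a non-empty bytes object and the stream was opened in binary mode.

```python
import io


def split_by_marker(stream, marker, chunk_size=1 << 20):
    """Lazily yield the byte strings found between occurrences of marker.

    Hypothetical helper: stream is any binary file-like object,
    marker is a non-empty bytes object.
    """
    if not marker:
        raise ValueError("marker must be non-empty")
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Every piece except the last one is complete. The last piece is
        # kept as the tail: it may still grow, or it may end with the
        # first bytes of a marker that straddles two chunks.
        *complete, buffer = buffer.split(marker)
        yield from complete
    # Whatever follows the final marker (possibly empty) is the last piece,
    # which matches what bytes.split() would return.
    yield buffer


if __name__ == "__main__":
    sample = io.BytesIO(b"12345223-MARKER-3492-MARKER-888")
    for piece in split_by_marker(sample, b"-MARKER-", chunk_size=4):
        print(piece)
```

With a real file this would be used as `for piece in split_by_marker(open('somefile', 'rb'), b'some_marker'): ...`. Memory use is bounded by roughly the largest piece plus one chunk. The tail is rescanned on every read, which should be acceptable for pieces of a few megabytes but would get slow if a single piece were much larger than chunk_size.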
