Reading WARC Files Efficiently

I am reading a WARC file with Python's 'warc' library. The file I am currently using is around 4.50 GB. The thing is:
file = warc.open("random.warc")
html_lists = [line for line in file]
Executing these two lines takes up to 40 seconds. Since there will be 64,000 more files like this one, 40 seconds per file is not acceptable. Do you have any tips to improve performance, or any different approaches?
Edit: I found out that the BeautifulSoup operations take some time, so I removed BeautifulSoup and wrote the necessary parts myself. It is 100x faster now: it takes about 60 seconds to read and process 4.50 GB of data. With this line of code I remove the scripts from the data:
clean = re.sub(r"<script.*?</script>", "", string=text)
And with this one I split the text and remove the stamp, which I don't need:
warc_stamp = str(soup).split(r"\r\n\r\n")
As I said, it is faster, but 60 seconds are not that good in this case. Any suggestions?

but 60 seconds are not that good in this case
Of course, it would mean that processing all 64,000 WARC files takes 45 days if not done in parallel. But as a comparison: the Hadoop jobs to crawl the content of the WARC files and also those to transform WARCs into WAT and WET files need around 600 CPU days each.
WARC files are gzip-compressed because disk space and download bandwidth are usually the limiting factors. Decompression defines the baseline for any optimization. E.g., decompressing a 946 MB WARC file takes 21 seconds:
% time zcat CC-MAIN-20170629154125-20170629174125-00719.warc.gz >/dev/null
real 0m21.546s
user 0m21.304s
sys 0m0.240s
Iterating over the WARC records needs only little extra time:
% cat benchmark_warc.py
import gzip
import sys
import warc
n_records = 0
for record in warc.WARCFile(fileobj=(gzip.open(sys.argv[1]))):
    if record['Content-Type'] == 'application/http; msgtype=response':
        n_records += 1
print("{} records".format(n_records))
% time python benchmark_warc.py CC-MAIN-20170629154125-20170629174125-00719.warc.gz
43799 records
real 0m23.048s
user 0m22.169s
sys 0m0.878s
If processing the payload only doubles or triples the time needed for decompression anyway (I cannot imagine that you can significantly outperform the GNU gzip implementation), you're close to the optimum. If 45 days is too long, development time is better invested in parallelizing the processing. There are already plenty of examples showing how to achieve this for Common Crawl data, e.g. cc-mrjob or cc-pyspark.
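For a single machine, the parallelization idea can be sketched with the standard library alone. This is only a rough illustration, not how cc-mrjob or cc-pyspark do it: the function names count_response_records and count_all are made up here, and the record count is approximated by looking for "WARC-Type: response" header lines instead of parsing records with the warc library.

```python
import gzip
from multiprocessing import Pool


def count_response_records(path):
    """Stream one gzip-compressed WARC file and count its response records.

    Assumption: a line starting with "WARC-Type: response" marks one record.
    This is an approximation; a real job would use a WARC parser.
    """
    n_records = 0
    with gzip.open(path, "rb") as stream:
        for line in stream:
            if line.startswith(b"WARC-Type: response"):
                n_records += 1
    return path, n_records


def count_all(paths, workers=4):
    """Spread the files over worker processes (hypothetical helper).

    Each worker decompresses and scans whole files, so the speed-up is
    bounded by the number of CPU cores and by disk throughput.
    """
    with Pool(processes=workers) as pool:
        return dict(pool.imap_unordered(count_response_records, paths))
```

Calling count_all(list_of_paths) from under an `if __name__ == "__main__":` guard returns a dict mapping each path to its count. Since decompression dominates, one file per worker is a natural unit of work.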

Get the source code of that module, and check for optimization potential.
Use a profiler to identify performance bottlenecks, then focus on these for optimization.
It can make a huge difference to rewrite Python code in Cython and compile it into native code. So that is likely worth a try.
But in any case, rather than speculating on an internet forum about how to accelerate a two-line script, you really need to work with the actual code underneath!
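As a starting point for the profiling step, here is a minimal sketch using the standard library's cProfile and pstats. The helper name profile_call is made up for this example; the function you pass in would be whatever does your per-file processing.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile; return (result, report text).

    profile_call is a hypothetical helper, not part of any library.
    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Printing the returned report shows which calls account for most of the run time, which is where optimization effort should go first.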

Related

Python long time script running huge RAM and swap memory consumption

I am trying to write a Python script that periodically (every 20 ms) reads data from a USB port and writes it to a .csv file. The program needs to run on a Raspberry Pi 3B for at least one week, but I am facing a problem with RAM and swap consumption: after 9 hours of running, Linux kills my process with just the word 'Killed' in the terminal. I checked memory usage with the psutil module, and the problem does seem to be RAM and swap usage (one minute before the crash, 100% of swap was in use across all processes and 57% of RAM was in use by this process). I tried to find where the memory leak happens using memory profiler; the problem seems to be in the csv_append function (after 10 minutes of running, it has grown by 7 MB), but when I take a closer look at this function with the profiler's decorator, there seems to be no leak.
Here is an example of this function:
def _csv_append(self,data):
    """
    Appends to .csv file
    """
    with open(self.last_file_name, 'a') as csvfile:
        csv_writter = csv.writer(csvfile)
        csv_writter.writerow(data)
Is there anything I can improve in my program so that it stops leaking memory and can run for a long time without being killed by the Linux OOM killer? The main loop does nothing more than read bytes, interpret them as an int using int.from_bytes(), call csv_append(), and wait out any remaining time to keep the 0.02 s period.
Thank you for your help :)
Analyzed memory consumption using memory profiler: no information that helps. The problem seems to be in csv_append(), but there is no leak there.
Deleted all variables each cycle and called the garbage collector with gc.collect().

I just ran the following little script:
import csv
Iterations = 1_000
def csv_append(data):
    with open("log.csv", "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(data)
for i in range(Iterations):
    data = [i, "foo", "bar"]
    csv_append(data)
and got some basic stats with /usr/bin/time -l ./main.py:
Iterations   Real_time (s)   Peak_mem_footprint (bytes)
1_000        0.06            5_718_848
1_000_000    22.08           5_833_664
I'm not even clearing data, and memory is virtually unchanged with 1,000 times more iterations. I don't think it's the CSV file opening/writing.
I think there's something else in your program/setup you need to consider.
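To look for that "something else", one option is the standard library's tracemalloc module, which can compare two snapshots and report which source lines allocated the most memory in between. The sketch below is only an assumption about how you might wire it in: top_growth and its work argument are made-up names, and work stands for one iteration of your main loop.

```python
import tracemalloc


def top_growth(work, rounds=1000, limit=5):
    """Call work(i) `rounds` times and report where memory grew.

    top_growth is a hypothetical helper. It returns the `limit` largest
    per-line differences between a snapshot taken before the loop and
    one taken after it.
    """
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for i in range(rounds):
        work(i)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Each entry has .traceback (file and line) and .size_diff (bytes).
    return after.compare_to(before, "lineno")[:limit]
```

Printing each returned entry shows the file, line number, and byte growth. A line whose size_diff keeps rising as rounds increases is a candidate for the leak.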

How can I force Python code to read input files again without rebooting my computer

I am scanning through a large number of files looking for some markers. I am starting to be really confident that once I have run through the code one time Python is not rereading the actual files from disk. I find this behavior strange because I was told that one reason I needed to structure my file access in the manner I have is so that the handle and file content is flushed. But that can't be.
There are 9,568 file paths in the list I am reading from. If I shut down Python and reboot my computer it takes roughly 6 minutes to read the files and determine if there is anything returned from the regular expression.
However, if I run the code a second time it takes about 36 seconds. Just for grins, the average document has 53,000 words.
Therefore I am concluding that Python still has access to the file it read in the first iteration.
I want to also observe that the first time I do this I can hear the disk spin (E:\ - Python is on C:). E is just a spinning disk with 126 MB cache - I don't think the cache is big enough to hold the contents of these files. When I do it later I do not hear the disk spin.
Here is the code
import re
test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n',re.IGNORECASE)
no7a = []
for path in path_list:
    path = path.strip()
    with open(path,'r') as fh:
        string = fh.read()
    items = [item for item in re.finditer(test_7A_re,string)]
    if len(items) == 0:
        no7a.append(path)
        continue
I care about this for a number of reasons, one is that I was thinking about using multi-processing. But if the bottleneck is reading in the files I don't see that I will gain much. I also think this is a problem because I would be worried about the file being modified and not having the most recent version of the file available.
I am tagging this 2.7 because I have no idea if this behavior is persistent across versions.
To confirm this behavior I modified my code to run as a .py file, and added some timing code. I then rebooted my computer - the first time it ran it took 5.6 minutes and the second time (without rebooting) the time was 36 seconds. Output is the same in both cases.
The really interesting thing is that even if I shut down IDLE (but do not reboot my computer), it still takes 36 seconds to run the code.
All of this suggests to me that the files are not read from disk after the first time - this is amazing behavior to me but it seems dangerous.
To be clear, the results are the same - I believe given the timing tests I have run and the fact that I do not hear the disk spinning that somehow the files are still accessible to Python.

This is caused by caching in Windows. It is not related to Python.
In order to stop Windows from caching your reads:
Disable paging file in Windows and fill the RAM up to 90%
Use some tool to disable file caching in Windows like this one.
Run your code on a Linux VM on your Windows machine that has limited RAM. In Linux you can control the caching much better
Make the files much bigger, so that they won't fit in cache
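The effect of the operating system's file cache can be seen from Python without any special tooling by timing the same reads twice. This is a rough sketch only; timed_read is a made-up helper, and the actual numbers depend entirely on the machine and on what else is in the cache.

```python
import time


def timed_read(paths):
    """Read every file once; return (total bytes read, elapsed seconds).

    timed_read is a hypothetical helper. Files are opened in binary mode
    so that only I/O is measured, not text decoding.
    """
    start = time.perf_counter()
    total = 0
    for path in paths:
        with open(path, "rb") as fh:
            total += len(fh.read())
    return total, time.perf_counter() - start
```

Running timed_read(path_list) twice in a row, in the same or in separate Python processes, would typically show the second pass to be much faster, because the operating system serves the data from RAM instead of the disk.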

I fail to see why this is a problem. I'm not 100% certain of how Windows handles file cache invalidation, but unless the "Last modified time" changes, you and I and Windows would assume that the file still holds the same content. If the file holds the same content, I don't see why reading from cache can be a problem.
I'm pretty sure that if you change the last modified date, say, by opening the file for write access then closing it right away, Windows will hold sufficient doubts over the file content and invalidate the cache.

Volatile Loading Times

I am loading large files into memory using Python.
df = pd.read_hdf( filename, 'data' )
Sometimes it takes about a minute. Sometimes it takes a few seconds. It seems to take a few seconds if I run it twice in a row, or if I am sequentially going through all the files in a directory.
I suspect this might have something to do with the way the hard drive works and caching. Is there a way to make this less erratic?

Why is this dictionary-updating code slow? How can I improve its efficiency?

I'm trying to reduce the processor time consumed by a python application, and after profiling it, I've found a small chunk of code consuming more processor time than it should:
class Stats(DumpableObject):
    members_offsets = [
        ('blkio_delay_total', 40),
        ('swapin_delay_total', 56),
        ('read_bytes', 248),
        ('write_bytes', 256),
        ('cancelled_write_bytes', 264)
    ]
    [...other code here...]
    def accumulate(self, other_stats, destination, coeff=1):
        """Update destination from operator(self, other_stats)"""
        dd = destination.__dict__
        sd = self.__dict__
        od = other_stats.__dict__
        for member, offset in Stats.members_offsets:
            dd[member] = sd[member] + coeff * od[member]
Why is this so expensive? How can I improve the efficiency of this code?
Context:
One of my favourite Linux tools, iotop, uses far more processor time than I think is appropriate for a monitoring tool - quickly consuming minutes of processor time; using the built in --profile option, total function calls approached 4 million running for only 20 seconds. I've observed similar behaviour on other systems, across reboots, & on multiple kernels. pycallgraph highlighted accumulate as one of a few time-consuming functions.
After studying the code for a full week, I think that dictionaries are the best choice for a data structure here, as a large number of threads to update will require many lookups, but don't understand why this code is expensive. Extensive search failed to enlighten. I don't understand the curses, socket, and struct libraries well enough to ask a self-contained question. I'm not asking for code as lightweight as pure C is in i7z.
I'd post images & other data, but I don't have the reputation.
The iotop git repository: http://repo.or.cz/w/iotop.git/tree (The code in question is in data.py, beginning line 73)
The system in question runs Ubuntu 13.04 on an Intel E6420 with 2GB of ram. Kernel 3.8.0-35-generic.
(I wish that Guillaume Chazarain had written more docstrings!)
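One rough way to put a number on the per-call cost of accumulate, separately from the rest of iotop, is a timeit micro-benchmark. The sketch below is an assumption-laden stand-in: FakeStats is a made-up class that copies only the loop from the snippet above, with plain integer attributes instead of values unpacked from kernel data.

```python
import timeit

MEMBERS = ("blkio_delay_total", "swapin_delay_total", "read_bytes",
           "write_bytes", "cancelled_write_bytes")


class FakeStats:
    """Hypothetical stand-in for iotop's Stats: plain attributes only."""

    def __init__(self, value=0):
        for member in MEMBERS:
            setattr(self, member, value)

    def accumulate(self, other_stats, destination, coeff=1):
        # Same dictionary-based update as in the question's snippet.
        dd = destination.__dict__
        sd = self.__dict__
        od = other_stats.__dict__
        for member in MEMBERS:
            dd[member] = sd[member] + coeff * od[member]


def seconds_per_call(calls=100_000):
    """Average wall-clock time of one accumulate() call."""
    a, b, dest = FakeStats(1), FakeStats(2), FakeStats()
    total = timeit.timeit(lambda: a.accumulate(b, dest), number=calls)
    return total / calls
```

Multiplying seconds_per_call() by the number of calls the profiler reports gives an estimate of how much of the total time the function body itself can explain, as opposed to how often it is called.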

Python's mmap() performance down with time

I am wondering why Python's mmap() performance goes down over time. I have a little app that makes changes to N files; if the set is big (not really big, say 1000), the first 200 are processed at demon speed, but after that it gets slower and slower. It looks like I should free memory once in a while, but I don't know how and, most importantly, why Python does not do this automagically.
Any help?
-- edit --
It's something like that:
import os
from mmap import mmap

def function(filename, N):
    fd = open(filename, 'rb+')
    size = os.path.getsize(filename)
    mapped = mmap(fd.fileno(), size)
    for i in range(N):
        some_operations_on_mmaped_block()
    mapped.close()

Your OS caches the mmap'd pages in RAM. Reads and writes go at RAM speed from the cache. Dirty pages are eventually flushed. On Linux, performance will be great until you have to start flushing pages; this is controlled by the vm.dirty_ratio sysctl variable. Once you start flushing dirty pages to disk, the reads will compete with the writes on your busy I/O bus/device. Another thing to consider is simply whether your OS has enough RAM to cache all the files (the buffers counter in top output). So I would watch the output of "vmstat 1" while your program runs and watch the cache/buff counters go up until suddenly you start doing I/O.
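If the slowdown does come from dirty pages piling up, one thing to try is flushing each mapping explicitly before moving on to the next file, and closing both the mapping and the file handle. This is a sketch under that assumption, not a guaranteed fix: patch_file, offset and payload are made-up names, and whether it helps depends on the workload and on the kernel's writeback settings.

```python
import mmap


def patch_file(path, offset, payload):
    """Overwrite len(payload) bytes at `offset` through a memory map.

    patch_file is a hypothetical helper. The file must be non-empty,
    because a length of 0 means "map the whole file".
    """
    with open(path, "r+b") as fh:
        with mmap.mmap(fh.fileno(), 0) as mapped:
            mapped[offset:offset + len(payload)] = payload
            # Ask the OS to write the dirty pages back now, so they do
            # not accumulate across many files.
            mapped.flush()
```

The with blocks close the mapping and the file even if an exception is raised, which the snippet in the question does not do for the file handle.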
