I have a large system log (up to 1 GB) and I need to parse it to extract certain things.
Initially I wrote Python code that read the file line by line (using with open), and it took a very long time.
Then I learned about mmap, and with mmap it takes around 5 minutes.
(I have precompiled the regular expression to save time.)
Is there a better approach so that it takes less time? (I am using Python 2.7.3, 32-bit.)
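A minimal sketch of the mmap-plus-precompiled-regex approach described above (the file name and the pattern are placeholders, not the real ones):

import mmap
import re

# Placeholder pattern and file name; the real ones depend on the log format.
pattern = re.compile(br'ERROR .*')

with open('system.log', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        for match in pattern.finditer(mm):
            print(match.group(0))
    finally:
        mm.close()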
You might want to use awk for that kind of thing; it's more or less what it was designed for.
I have a long list of strings to search for in a very large file. I know I can achieve this by using two nested for loops:
import sys

dns = sys.argv[2]
file = open(dns)
search_words = var  # [list of 100+ strings]
for line in file:
    for word in search_words:
        if word in line:
            print(line)
However, I'm looking for an efficient way to do this so that I don't have to wait half an hour for it to run. Can anyone help?
The problem here is that you read the file line by line instead of loading the entire text file into RAM at once, which would save you a lot of time in this case. That is what takes most of the time, but the text search itself can also be improved in many ways that aren't as straightforward.
That said, there are multiple packages genuinely designed to do text search efficiently in Python. I suggest that you have a look at ahocorapy, which is based on the Aho-Corasick algorithm, which, by the way, is the algorithm used by the well-known grep tool for matching many fixed strings at once. The GitHub page of the package explains how to achieve your task efficiently, so I will not go into further detail here.
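As a rough sketch (based on the usage shown in the ahocorapy README; double-check the exact API on the project page), reusing the names from the question:

from ahocorapy.keywordtree import KeywordTree

# Build the search automaton once from the 100+ search words.
kwtree = KeywordTree(case_insensitive=True)
for word in search_words:          # search_words as in the question
    kwtree.add(word)
kwtree.finalize()

# Read the file into memory once and scan it in a single pass.
with open(dns) as f:               # dns as in the question
    text = f.read()
for keyword, position in kwtree.search_all(text):
    print(keyword, position)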
In one of my recent projects I need to perform this simple task, but I'm not sure what the most efficient way to do it is.
I have several large text files (>5 GB) and I need to continuously extract random lines from those files. The requirements are: I can't load the files into memory, I need to do this very efficiently (>>1000 lines per second), and preferably with as little pre-processing as possible.
The files consist of many short lines (~20 million lines). The "raw" files have varying line lengths, but with a short pre-processing step I can make all lines the same length (though the perfect solution would not require pre-processing).
I already tried the default Python solutions mentioned here, but they were too slow (and the linecache solution loads the file into memory, so it is not usable here).
The next solution I thought about is to create some kind of index. I found this solution, but it's very outdated, so it needs some work to get running, and even then I'm not sure whether the overhead created while processing the index file won't slow the process down to the time scale of the solution above.
Another solution is converting the file into a binary file and then getting instant access to lines that way. For this solution I couldn't find any Python package that supports binary/text work of this kind, and I feel that creating a robust parser this way could take a very long time and could create many hard-to-diagnose errors down the line because of small miscalculations/mistakes.
The final solution I thought about is using some kind of database (SQLite in my case), which would require transferring the lines into a database and loading them that way.
Note: I will also be loading thousands of (random) lines each time, so solutions that work better for groups of lines will have an advantage.
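For reference, a sketch of the fixed-length-line idea mentioned above (the record length, file name, and batch size are assumptions):

import os
import random

def random_fixed_length_lines(path, line_len, count):
    """Fetch `count` random lines from a file whose lines were padded to
    `line_len` bytes each (including the newline), using seek() only."""
    n_lines = os.path.getsize(path) // line_len
    # Sorting the offsets makes disk access roughly sequential, which helps
    # when thousands of lines are requested at once.
    picks = sorted(random.randrange(n_lines) for _ in range(count))
    lines = []
    with open(path, 'rb') as f:
        for i in picks:
            f.seek(i * line_len)
            lines.append(f.read(line_len).rstrip())
    return lines

batch = random_fixed_length_lines('big_file.txt', 80, 1000)   # hypothetical usage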
Thanks in advance,
Art.
As said in the comments, I believe using HDF5 would be a good option.
This answer shows how to read that kind of file.
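A sketch of what the read side could look like with h5py, assuming the lines were converted once into an HDF5 dataset of fixed-length byte strings (the file and dataset names are made up):

import h5py
import numpy as np

with h5py.File('lines.h5', 'r') as f:
    dset = f['lines']
    # h5py fancy indexing wants sorted, unique indices.
    idx = np.sort(np.random.choice(dset.shape[0], size=1000, replace=False))
    batch = dset[idx]      # reads only the requested rows from disk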
I have two separate scripts, one written in Python and one in Ruby, which run on a schedule to achieve a single goal. Ruby isn't my language of choice, but it is all I can use for this task.
The Python script runs every 30 seconds, talks to a few scientific instruments, gathers certain data, and writes the data to a text file (one per instrument).
The Ruby script then reads these files every 20 seconds and displays the information on a Dashing dashboard.
The trouble I have is that sometimes the file is being written to by Python at the same time as Ruby is trying to read it. You can see the obvious problems here...
Despite putting several checks in my Ruby code, such as:
if myFile.exists? and myFile.readable? and not myFile.zero?
I still get these clashes every now and then.
Is there a better way in Ruby to avoid reading files that are open or still being written to?
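One common way to avoid the clash (just a sketch, with made-up names) is to make the Python writer publish each file atomically: write to a temporary file in the same directory and rename it over the final name, so the Ruby reader only ever sees a complete file.

import os
import tempfile

def publish(path, data):
    """Write `data` to a temp file next to `path`, then rename it into place.
    On POSIX filesystems the rename is atomic, so readers see either the old
    complete file or the new complete file, never a partial one."""
    dir_name = os.path.dirname(path) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, 'w') as tmp:
            tmp.write(data)
        os.rename(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)
        raise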
I have a Python script that starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files anew every time, because they will not change. I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but I only see the error message after minutes, because the reading takes so long.
Are there any tricks to do something like this?
(Where feasible, I create smaller test files.)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
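For example (the module and function names here are made up), from an interactive session you can keep the loaded data around and reload only the module you are editing:

import importlib          # Python 3; Python 2 has a builtin reload() instead
import myanalysis         # hypothetical module containing the code being debugged

data = myanalysis.load_big_files()   # slow step, done once per session

# ... edit myanalysis.py, then rerun just the fast part:
importlib.reload(myanalysis)
myanalysis.process(data)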
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to reproduce the bug you are observing? Most probably only a small part of your input file is relevant.
Secondly, are these particular files required, or would the problem show up on any big amount of data? If it shows up only on particular files, then once again it is most probably related to some feature of those files and will also show up on a smaller file with the same feature. If the main reason is just the sheer amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck of reading the file? Is it just hard drive performance, or do you do some heavy processing of the data in your script before actually getting to the part that causes problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then modify your script to load the processed data instead of redoing the processing each time.
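A sketch of that caching idea using pickle (the cache file name and the parsing function are placeholders):

import os
import pickle

CACHE = 'parsed_input.pickle'        # hypothetical cache file

def load_data(path):
    """Return the processed data, re-parsing only when the input is newer
    than the cache."""
    if (os.path.exists(CACHE)
            and os.path.getmtime(CACHE) > os.path.getmtime(path)):
        with open(CACHE, 'rb') as f:
            return pickle.load(f)
    data = expensive_parse(path)     # hypothetical slow parsing step
    with open(CACHE, 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    return data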
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
How long should it take to stream a 1 GB file in Python on, say, a 2 GHz Intel Core 2 Duo machine?
fp = open('publisher_feed_8663.xml')
for line in fp:
    a = line.split('<')
I suppose I wasn't specific enough. This process takes 20+ minutes which is abnormally long. Based on empirical data, what is a reasonable time?
Your answer:
import time

start = time.time()
fp = open('publisher_feed_8663.xml')
for line in fp:
    a = line.split('<')
print time.time() - start
You will need a 1 GB file named publisher_feed_8663.xml, Python, and a 2 GHz Intel Core 2 Duo machine.
For parsing XML, you probably want to use an event-based stream parser, such as SAX or lxml's iterparse. I recommend reading the lxml documentation about iterparse: http://lxml.de/parsing.html#iterparse-and-iterwalk
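A minimal iterparse sketch (the tag name and handler are placeholders for whatever the feed actually contains):

from lxml import etree

for event, elem in etree.iterparse('publisher_feed_8663.xml', tag='item'):
    handle(elem)                     # hypothetical per-record handler
    # Free the parsed element and its already-processed siblings so memory
    # stays flat even for a 1 GB file.
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]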
As for how long this should take, you can do trivial hard drive benchmarks on Linux using tools like hdparm -tT /dev/sda.
More RAM always helps with processing large files, as the OS can keep a bigger disk cache.
Other people have talked about the time, I'll talk about the processing (XML aside).
If you're doing something this massive, you should certainly look at generators. This PDF will teach you basically everything you will ever need to know about generators. Any time you are consuming or producing large amounts of data (especially serially), generators should be your very best friend.
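As a tiny illustration (the marker string and handler are placeholders), a generator lets you filter the file lazily without ever holding more than one line in memory:

def matching_lines(path, marker):
    """Yield only the lines that contain `marker`, one at a time."""
    with open(path) as fp:
        for line in fp:
            if marker in line:
                yield line

for line in matching_lines('publisher_feed_8663.xml', '<item'):
    handle(line)                     # hypothetical per-line handler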
That will depend entirely on what's in the file. You're reading it a line at a time, which means a lot of overhead from calling the iterator again and again for the common case of lots of short lines. Use fp.read(CHUNK) with some large number for CHUNK to improve performance.
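For example, counting the '<' characters in large chunks rather than line by line could look like this (the chunk size is just a starting point to tune):

CHUNK = 1 << 20                      # read 1 MiB at a time

count = 0
with open('publisher_feed_8663.xml') as fp:
    while True:
        block = fp.read(CHUNK)
        if not block:
            break
        count += block.count('<')    # same work, far fewer read calls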
However, I'm not sure what you're doing with split('<'). You can't usefully process XML with tools as basic as that, or with line-at-a-time parsing, since XML is not line-based. If you actually want to do something with the XML infoset in the file as you read it, you should consider a SAX parser. (Then again, 1 GB of XML? That's already not sensible, really.)