I have big log files (from 100MB to 2GB) that contain a single particular line I need to parse in a Python program. I have to parse around 20,000 files, and I know that the line I am looking for is within the last 200 lines of each file, or within the last 15,000 bytes.
As it is a recurring task, I need it to be as fast as possible. What is the fastest way to get that line?
I have thought about 4 strategies:
read the whole file in Python and search a regex (method_1)
read only the last 15,000 bytes of the file and search a regex (method_2)
make a system call to grep (method_3)
make a system call to grep after tailing the last 200 lines (method_4)
Here are the functions I created to test these strategies:
import os
import re
import subprocess
def method_1(filename):
    """Method 1: read whole file and regex"""
    regex = r'\(TEMPS CP :[ ]*.*S\)'
    with open(filename, 'r') as f:
        txt = f.read()
        match = re.search(regex, txt)
        if match:
            print match.group()

def method_2(filename):
    """Method 2: read part of the file and regex"""
    regex = r'\(TEMPS CP :[ ]*.*S\)'
    with open(filename, 'r') as f:
        size = min(15000, os.stat(filename).st_size)
        f.seek(-size, os.SEEK_END)
        txt = f.read(size)
        match = re.search(regex, txt)
        if match:
            print match.group()

def method_3(filename):
    """Method 3: grep the entire file"""
    cmd = 'grep "(TEMPS CP :" {} | head -n 1'.format(filename)
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    print process.communicate()[0][:-1]

def method_4(filename):
    """Method 4: tail of the file and grep"""
    cmd = 'tail -n 200 {} | grep "(TEMPS CP :"'.format(filename)
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    print process.communicate()[0][:-1]
I ran these methods on two files ("trace" is 207MB and "trace_big" is 1.9GB) and got the following computation times (in seconds):
+----------+-----------+-----------+
| | trace | trace_big |
+----------+-----------+-----------+
| method_1 | 2.89E-001 | 2.63 |
| method_2 | 5.71E-004 | 5.01E-004 |
| method_3 | 2.30E-001 | 1.97 |
| method_4 | 4.94E-003 | 5.06E-003 |
+----------+-----------+-----------+
So method_2 seems to be the fastest. But is there any other solution I did not think about?
Edit
In addition to the previous methods, Gosha F suggested a fifth method using mmap:
import contextlib
import math
import mmap
def method_5(filename):
    """Method 5: use memory mapping and regex"""
    regex = re.compile(r'\(TEMPS CP :[ ]*.*S\)')
    offset = max(0, os.stat(filename).st_size - 15000)
    ag = mmap.ALLOCATIONGRANULARITY
    offset = ag * (int(math.ceil(offset / ag)))
    with open(filename, 'r') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY, offset=offset)
        with contextlib.closing(mm) as txt:
            match = regex.search(txt)
            if match:
                print match.group()
I tested it and got the following results:
+----------+-----------+-----------+
| | trace | trace_big |
+----------+-----------+-----------+
| method_5 | 2.50E-004 | 2.71E-004 |
+----------+-----------+-----------+
You may also consider using memory mapping (mmap module) like this:
import contextlib
import mmap
import os
import re

def method_5(filename):
    """Method 5: use memory mapping and regex"""
    regex = re.compile(r'\(TEMPS CP :[ ]*.*S\)')
    offset = max(0, os.stat(filename).st_size - 15000)
    with open(filename, 'r') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY, offset=offset)) as txt:
            match = regex.search(txt)
            if match:
                print match.group()
Also some side notes:
in the case of using a shell command, ag may in some cases be orders of magnitude faster than grep (although with only 200 lines of greppable text the difference probably vanishes compared to the overhead of starting a shell)
just compiling your regex at the beginning of the function may make some difference
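For illustration, here is a sketch of method_2 with the pattern compiled once at import time (method_2_precompiled is just an illustrative name, not from the original code):
# compile once at import time instead of on every call; pattern taken from the question
TEMPS_RE = re.compile(r'\(TEMPS CP :[ ]*.*S\)')

def method_2_precompiled(filename):
    """Variant of method 2 using a precompiled, module-level regex."""
    with open(filename, 'r') as f:
        size = min(15000, os.stat(filename).st_size)
        f.seek(-size, os.SEEK_END)
        match = TEMPS_RE.search(f.read(size))
    if match:
        print match.group()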
Probably faster to do the processing in the shell so as to avoid the Python overhead. Then you can pipe the result into a Python script. Otherwise it looks like you did the fastest thing.
Seeking then matching a regex should be very fast. Methods 2 and 4 are the same, but with 4 you incur the extra overhead of Python making a system call.
Does it have to be in Python? Why not a shell script?
My guess is that method 4 will be the fastest/most efficient. That's certainly how I'd write it as a shell script, and it's got to be faster than 1 or 3. I'd still time it in comparison to method 2 to be 100% sure, though.
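For reference, one way to collect such timings with the timeit module (a sketch; it assumes the method_* functions above are defined in the running script and that a file named trace is present):
import timeit

# rough harness: average of 10 calls per method on the same file
for name in ['method_1', 'method_2', 'method_3', 'method_4']:
    elapsed = timeit.timeit('%s("trace")' % name,
                            setup='from __main__ import %s' % name,
                            number=10) / 10
    print name, elapsed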
Related
Sorry for asking such a low-level question, but I really tried to look for the answer before coming here...
Basically I have a script which searches inside .py files and reads their code line by line -> the goal of the script is to find whether a line ends with a space or a tab, as in the example below
i = 5
z = 25
Basically, after the i variable we should have a \s and after the z variable a \t. (I hope the code formatting will not erase them.)
def custom_checks(file, rule):
    """
    #param file: file: file in-which you search for a specific character
    #param rule: the specific character you search for
    #return: dict obj with the form { line number : character }
    """
    rule=re.escape(rule)
    logging.info(f" File {os.path.abspath(file)} checked for {repr(rule)} inside it ")
    result_dict = {}
    file = fileinput.input([file])
    for idx, line in enumerate(file):
        if re.search(rule, line):
            result_dict[idx + 1] = str(rule)
    file.close()
    if not len(result_dict):
        logging.info("Zero non-compliance found based on the rule:2 consecutive empty rows")
    else:
        logging.warning(f'Found the next errors:{result_dict}')
After that, if I check the logging output, I see this:
checked for '\+s\\s\$' inside it. I don't know why the backslashes are doubled.
Also, I get all the regexes from a config.json, which is this one:
{
    "ends with tab":"+\\t$",
    "ends with space":"+s\\s$"
}
Could someone please help me in this direction? I know that I could do it in other ways, such as reversing the line with [::-1], taking the first character and checking whether it's \s, etc., but I really want to do it with regex.
Thanks!
Try:
rules = {
    'ends with tab': re.compile(r'\t$'),
    'ends with space': re.compile(r' $'),
}
Note: while getting lines from iterating the file will leave newline ('\n') at the end of each string, $ in a regex matches the position before the first newline in the string. Thus, if using regex, you don't need to explicitly strip newlines.
if rule.search(line):
    ...
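A quick illustration of that behaviour (the sample strings here are made up, with the newline still attached as when iterating over a file):
import re

line = 'i = 5 \n'                          # trailing space, then the newline from the file
print(bool(re.search(r' $', line)))        # True: $ matches just before the trailing newline
print(bool(re.search(r' $', 'i = 5\n')))   # False: no trailing space before the newline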
Personally, however, I would use line.rstrip() != line.rstrip('\n') to flag trailing spaces of any kind in one shot.
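For example (made-up sample strings):
line = 'i = 5 \n'
print(line.rstrip() != line.rstrip('\n'))              # True: a trailing space was stripped as well
print('z = 25\n'.rstrip() != 'z = 25\n'.rstrip('\n'))  # False: only the newline was stripped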
If you want to directly check for specific characters at the end of the line, you then need to strip any newline, and you need to check if the line isn't empty. For example:
char = '\t'
s = line.strip('\n')
if s and s[-1] == char:
    ...
Addendum 1: read rules from JSON config
# here from a string, but could be in a file, of course
json_config = """
{
"ends with tab": "\\t$",
"ends with space": " $"
}
"""
rules = {k: re.compile(v) for k, v in json.loads(json_config).items()}
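Putting those rules to work over a file could then look like this (a sketch; some_file.py is a hypothetical input, and the result is keyed by line number in the spirit of the original custom_checks):
result_dict = {}
with open('some_file.py') as f:             # hypothetical input file
    for idx, line in enumerate(f, start=1):
        for name, rule in rules.items():
            if rule.search(line):
                result_dict[idx] = name
print(result_dict)                          # e.g. {1: 'ends with space', 2: 'ends with tab'}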
Addendum 2: comments
The following shows how to comment out a rule, as well as a rule to detect comments in the file to process. Since JSON doesn't support comments, we can consider yaml instead:
yaml_config = """
ends with space: ' $'
ends with tab: \\t$
is comment: ^\\s*#
# ignore: 'foo'
"""
import yaml
rules = {k: re.compile(v) for k, v in yaml.safe_load(yaml_config).items()}
Note: 'is comment' is easy. A hypothetical 'has comment' is much harder to define -- why? I'll leave that as an exercise for the reader ;-)
Note 2: in a file, the yaml config would be without double backslash, e.g.:
cat > config.yml << EOF
ends with space: ' $'
ends with tab: \t$
is comment: ^\s*#
# ignore: 'foo'
EOF
Additional thought
You may want to give autopep8 a try.
Example:
cat > foo.py << EOF
# this is a comment
text = """
# xyz
bar
"""
def foo():
    # to be continued
    pass
def bar():
    pass
EOF
Note: to reveal the extra spaces:
cat foo.py | perl -pe 's/$/|/'
# this is a comment |
|
text = """|
# xyz |
bar |
"""|
def foo(): |
    # to be continued |
    pass |
|
def bar():|
    pass |
|
|
|
There are several PEP8 issues with the above (extra spaces at end of lines, only 1 line between the functions, etc.). Autopep8 fixes them all (but correctly leaves the text variable unchanged):
autopep8 foo.py | perl -pe 's/$/|/'
# this is a comment|
|
text = """|
# xyz |
bar |
"""|
|
|
def foo():|
    # to be continued|
    pass|
|
|
def bar():|
    pass|
I am a hardware engineer working in a design department and we routinely generate directories with large amounts of data (both large files and directories that contain large numbers of small files). This data can hang around on the disk for quite a while and I am looking for a metric to identify directories with large amounts of old data in them as candidates for deletion.
The metric I have decided on is File Size (in M) * File Age (in days).
I have a working solution, but it is a combination of shell scripting and C, and it is neither maintainable, pretty, nor elegant.
I am looking for ideas to improve the script.
The basic idea is to generate raw data on all the files using find
find $Dir -type f -exec stat -c "%s,%Y,%n" {} \; > rpt3
and then process that file in C to get a file (rpt3b) in the format
Metric,Age,Size,FileName
Metric is Age*Size
Age is number of days since file was modified
Size is size of file in M
FileName is name of file.
I then process this file to sum the metrics for each directory
for Directory in $( find /projects/solaris/implementation -maxdepth 4 -type d ) ; do
    Total=`grep $Directory/ rpt3a | sed -e 's?,.*??' | paste -sd+ - | bc`
    echo $Total,$Directory >> rpt3c
done
So the output is similar to a du, but it is the metric that is reported rather than the size taken on disk.
I could pull the last step into the C program, but I am looking for a solution that ideally works in one environment (doesn't have to be C, I am open to learning new languages).
Thanks in advance
You could do the whole lot in Perl. Perl comes with two operators, -M and -s, which give respectively the age of a file in days (the script start time minus the file's modification time) and the size of a file in bytes. It also ships with the File::Find module, which mimics the find command.
#!perl
use strict;
use warnings;
use File::Find;
find(\&process, shift); # shift the start directory off @ARGV

sub process {
    # Lots of use of the magic _ file handle so we don't keep having to call stat()
    print( (-M _) * (-s _), ' ', -M _, ' ', -s _, " $File::Find::name\n")
        if -f $_;
}
Use cut, in place of sed, to extract the correct column from your extracted lines. cut -d, -f3 will extract the third comma-separated column.
With input:
10,2,5,a/b
20,4,5,a/c
30,2,15,b/d
40,4,10,a/d
command grep a/ a.txt | cut -f3 -d, | paste -sd+ - | bc will produce:
20
and command grep b/ a.txt | cut -f3 -d, | paste -sd+ - | bc:
15
Call it as 'python script.py startdir ~/somefile.txt'.
You can use this as a starting point:
import os
import sys
import time

def get_age_in_days(file_stats):
    """Calculate age in days from the file's stat."""
    return (time.time() - file_stats.st_mtime) // (60*60*24)

def get_size_in_MB(file_stats):
    """Calculate file size in megabytes from the file's stat."""
    return file_stats.st_size / (1024 * 1024)

def metric(root, f):
    """Uses root and f to create a metric for the file at 'os.path.join(root,f)'"""
    fn = os.path.join(root, f)
    fn_stat = os.stat(fn)
    age = get_age_in_days(fn_stat)
    size = get_size_in_MB(fn_stat)
    metric = age * size
    return [metric, age, size, fn]

path = None
fn = None
if len(sys.argv) == 3:
    path = sys.argv[1]
    fn = sys.argv[2]
else:
    sys.exit(2)

with open(fn, "w") as output:
    # walk the directory recursively and report anything with a metric > 1
    for root, dirs, files in os.walk(path):
        total_dict = 0
        for f in files:
            m = metric(root, f)
            # cutoff - only write to file if metric > 1
            if m[0] > 1:
                total_dict += m[0]
                output.write(','.join(map(str, m)) + "\n")
        output.write(','.join([str(total_dict), "total", "dictionary", root]) + "\n")

# testing purposes
# print(open(fn).read())
Example output file (without the cutoff, run on https://pyfiddle.io/):
0.0,0.0,0.0011606216430664062,./main.py
0.0,0.0,0.0,./myfiles.txt
0.0,total,dictionary,./
You can look up any line that contains ,total,dictionary, (e.g. 0.0,total,dictionary,./) to get the per-directory totals.
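If you then want the directories ranked by their total metric, the report written above can be post-processed with a few more lines (a sketch, assuming the ~/somefile.txt output path from the example call):
import os

totals = []
with open(os.path.expanduser('~/somefile.txt')) as report:
    for line in report:
        if ',total,dictionary,' in line:
            # line format: metric_total,total,dictionary,directory
            total, _, _, directory = line.rstrip('\n').split(',', 3)
            totals.append((float(total), directory))

# largest metric first
for total, directory in sorted(totals, reverse=True):
    print(total, directory)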
I use this bash command for catching a string in a text file
cat a.txt | grep 'a' | grep 'b' | grep 'c' | cut -d" " -f1
How can I implement this solution in Python? I don't want to call OS commands, because it should be a cross-platform script.
You may try this:
with open(file) as f:  # open the file
    for line in f:  # iterate over the lines
        if all(i in line for i in ('a', 'b', 'c')):  # check if the line contains all of (a, b, c)
            print line.split(" ")[0]  # if yes, split on space and print the first value
You can always use the os library to do a system call:
import os
bashcmd = " cat a.txt | grep 'a' | grep 'b' | grep 'c' | cut -d' ' -f1"
print os.system( bashcmd )
What's an easy way to convert the output of Python PrettyTable to a programmatically usable format such as CSV?
The output looks like this:
C:\test> nova list
spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
Perhaps this will get you close:
nova list | grep -v '\-\-\-\-' | sed 's/^[^|]\+|//g' | sed 's/|\(.\)/,\1/g' | tr '|' '\n'
This will strip the --- lines
Remove the leading |
Replace all but the last | with ,
Replace the last | with \n
Here's a real ugly one-liner
import csv
s = """\
spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+"""
result = [tuple(filter(None, map(str.strip, splitline))) for line in s.splitlines() for splitline in [line.split("|")] if len(splitline) > 1]
with open('output.csv', 'wb') as outcsv:
    writer = csv.writer(outcsv)
    writer.writerows(result)
I can unwrap it a bit to make it nicer:
splitlines = s.splitlines()
splitdata = [line.split("|") for line in splitlines]
# toss the lines that don't have any data in them -- pure separator lines
splitdata = [line for line in splitdata if len(line) > 1]
header, *data = [[field.strip() for field in line if field.strip()] for line in splitdata]
result = [header] + data
# I'm really just separating these, then re-joining them, but sometimes having
# the headers separately is an important thing!
Or possibly more helpful:
result = []
for line in s.splitlines():
    splitdata = line.split("|")
    if len(splitdata) == 1:
        continue  # skip lines with no separators
    linedata = []
    for field in splitdata:
        field = field.strip()
        if field:
            linedata.append(field)
    result.append(linedata)
@AdamSmith's answer has a nice method for parsing the raw table string. Here are a few additions to turn it into a generic function (I chose not to use the csv module, so there are no additional dependencies)
def ptable_to_csv(table, filename, headers=True):
    """Save PrettyTable results to a CSV file.

    Adapted from @AdamSmith https://stackoverflow.com/questions/32128226

    :param PrettyTable table: Table object to get data from.
    :param str filename: Filepath for the output CSV.
    :param bool headers: Whether to include the header row in the CSV.
    :return: None
    """
    raw = table.get_string()
    data = [tuple(filter(None, map(str.strip, splitline)))
            for line in raw.splitlines()
            for splitline in [line.split('|')] if len(splitline) > 1]

    if table.title is not None:
        data = data[1:]

    if not headers:
        data = data[1:]

    with open(filename, 'w') as f:
        for d in data:
            f.write('{}\n'.format(','.join(d)))
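A quick usage sketch (assuming the prettytable package is installed; the table contents here are made up):
from prettytable import PrettyTable

table = PrettyTable(['ID', 'Name', 'Status'])
table.add_row(['6bca09f8-a320-44d4-a11f-647dcec0aaa1', 'tester', 'ACTIVE'])
table.add_row(['00000000-0000-0000-0000-000000000000', 'spare', 'SHUTOFF'])
ptable_to_csv(table, 'nova_list.csv')   # nova_list.csv now holds the header row plus both data rows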
Here's a solution using a regular expression. It also works for an arbitrary number of columns (the number of columns is determined by counting the number of plus signs in the first input line).
input_string = """spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+"""
import re, csv, sys
def pretty_table_to_tuples(input_str):
    lines = input_str.split("\n")
    num_columns = len(re.findall("\+", lines[0])) - 1
    line_regex = r"\|" + (r" +(.*?) +\|" * num_columns)
    for line in lines:
        m = re.match(line_regex, line.strip())
        if m:
            yield m.groups()
w = csv.writer(sys.stdout)
w.writerows(pretty_table_to_tuples(input_string))
I need to extract the 10 most frequent words from a text using a pipe (and any additional python scripts as needed); output being a block of all-caps words separated by a space.
This pipe needs to extract text from any external file: I've managed to get it to work on .txt files, but I also need to be able to input a URL and have it do the same thing with that.
I have the following code:
alias words="tr a-zA-Z | tr -cs A-Z | tr ' ' '\012' | sort -n | uniq -c |
sort -r | head -n 10 | awk '{printf \"%s \", \$2}END{print \"\"}'" (on one line)
which, with cat hamlet.txt | words gives me:
TO THE AND A 'TIS THAT OR OF IS
To make it more complicated, I need to exclude any 'function' words: these are 'non-lexical' words like 'a', 'the', 'of', 'is', any pronouns (I, you, him), and any prepositions (there, at, from).
I need to be able to type htmlstrip http://www.google.com.au | words and have it print out like the above.
For the URL-opening:
The python script I'm trying to figure out (let's call it htmlstrip) strips any tags from the text, leaving only 'human readable' text. This should be able to open any given URL, but I can't figure out how to get this to work.
What I have so far:
import re
import urllib2

filename = raw_input('File name: ')
filehandle = open(filename)
html = filehandle.read()

f = urllib2.urlopen('http://') #???
print f.read()

text = [ ]
inTag = False

for ch in html:
    if ch == '<':
        inTag = True
    if not inTag:
        text.append(ch)
    if ch == '>':
        inTag = False

print ''.join(text)
I know this is both incomplete and probably incorrect - any guidance would really be appreciated.
You can use scrape.py and regular expressions like this:
#!/usr/bin/env python
from scrape import s
import sys, re
if len(sys.argv) < 2:
    print "Usage: words.py url"
    sys.exit(0)
s.go(sys.argv[1]) # fetch content
text = s.doc.text # extract readable text
text = re.sub("\W+", " ", text) # remove all non-word characters and repeating whitespace
print text
And then just:
./words.py http://whatever.com
Use re.sub for this:
import re
text = re.sub(r"<.+?>", " ", html)
For special cases such as scripts, you can include a regex such as:
<script.*>.*</script>
UPDATE: Sorry, just read the comment about the pure Python without any additional modules. Yes, in this situation re, I think, will be the best way.
Maybe it'll be easier and more correct to use pycURL rather than removing tags with re?
from StringIO import StringIO
import pycurl
url = 'http://www.google.com/'
storage = StringIO()
c = pycurl.Curl()
c.setopt(c.URL, url)
c.setopt(c.WRITEFUNCTION, storage.write)
c.perform()
c.close()
content = storage.getvalue()
print content