Two of these statements run while the other fails with a syntax error. What am I doing wrong?
>>> Timer('for i in xrange(10): oct(i)').repeat(3)
[2.7091379165649414, 2.6934919357299805, 2.689150094985962]
>>> Timer('n = [] ; n = [oct(i) for i in xrange(10)]').repeat(3)
[4.0500171184539795, 3.6979520320892334, 3.701982021331787]
>>> Timer('n = [] ; for i in xrange(10): n.append(oct(i))').repeat(3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/timeit.py", line 136, in __init__
code = compile(src, dummy_src_name, "exec")
File "<timeit-src>", line 6
n = [] ; for i in xrange(10): n.append(oct(i))
^
SyntaxError: invalid syntax
Your failing statement is syntactically incorrect. If you need to time multiple statements, define them in a function and pass that to Timer, importing the function from __main__:
>>> def foo():
...     n = []
...     for i in xrange(10): n.append(oct(i))
...
>>> Timer("foo()", "from __main__ import foo").repeat(3)
Now, to understand why the failing statement is incorrect, here is an excerpt from the docs on compound statements:
A suite can be one or more semicolon-separated simple statements on the same line as the header, following the header’s colon, or it can be one or more indented statements on subsequent lines.
stmt_list ::= simple_stmt (";" simple_stmt)* [";"]
and similarly, a simple statement is
simple_stmt ::= expression_stmt
| assert_stmt
| assignment_stmt
| augmented_assignment_stmt
| pass_stmt
| del_stmt
| print_stmt
| return_stmt
| yield_stmt
| raise_stmt
| break_stmt
| continue_stmt
| import_stmt
| global_stmt
| exec_stmt
It should now be clear when a semicolon can (not should) be used: it joins simple statements only, and can never be followed by a compound statement such as for.
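For instance, a quick illustration (these toy statements are not from the original post):

x = 1; y = 2            # fine: two simple statements joined by a semicolon
if x: x = 2; y = 3      # fine: the suite after the colon is simple statements
# x = 1; for i in xrange(3): pass   # SyntaxError: 'for' cannot follow ';'

The fix for the timing call is therefore to separate the statements with a newline instead: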
Timer('n = []\nfor i in xrange(10): n.append(oct(i))').repeat(3)
[2.026008492408778, 2.065228002189059, 2.048982731136192]
You can use triple quotes as well:
statement = '''n = []
for i in xrange(10):
    n.append(oct(i))'''
Timer(statement).repeat(3)
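As a side note, Timer also accepts a zero-argument callable directly (supported since Python 2.6), which sidesteps the quoting entirely:

def foo():
    n = []
    for i in xrange(10):
        n.append(oct(i))

Timer(foo).repeat(3)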
Sorry for asking such a low-level question, but I really tried to find the answer before coming here...
Basically I have a script which searches inside .py files and reads their code line by line. The goal of the script is to find whether a line ends with a space or a tab, as in the example below:
i = 5
z = 25
Basically, after the i assignment we should have a space (\s) and after the z assignment a tab (\t). (I hope the code formatting does not erase them.)
def custom_checks(file, rule):
    """
    :param file: file in which you search for a specific character
    :param rule: the specific character you search for
    :return: dict obj with the form {line number: character}
    """
    rule = re.escape(rule)
    logging.info(f"File {os.path.abspath(file)} checked for {repr(rule)} inside it")
    result_dict = {}
    file = fileinput.input([file])
    for idx, line in enumerate(file):
        if re.search(rule, line):
            result_dict[idx + 1] = str(rule)
    file.close()
    if not len(result_dict):
        logging.info("Zero non-compliance found based on the rule: 2 consecutive empty rows")
    else:
        logging.warning(f'Found the next errors: {result_dict}')
After that, if I check the logging output, I see this:
checked for '\+s\\s\$' inside it
I don't know why the backslashes are doubled. Also, I basically get all the regexes from a config.json, which is this one:
{
"ends with tab":"+\\t$",
"ends with space":"+s\\s$"
}
Could someone please help me in this direction? I basically know that I could do it in other ways, such as reversing the line with [::-1], taking the first character, and checking whether it's \s, etc., but I really want to do it with regex.
Thanks!
Try:
rules = {
    'ends with tab': re.compile(r'\t$'),
    'ends with space': re.compile(r' $'),
}
Note: while iterating over the file yields lines that keep their trailing newline ('\n'), $ in a regex matches the position before a final newline in the string. Thus, if using regex, you don't need to strip newlines explicitly.
if rule.search(line):
    ...
Personally, however, I would use line.rstrip() != line.rstrip('\n') to flag trailing spaces of any kind in one shot.
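For instance, a minimal check (the sample line is made up):

line = 'z = 25\t\n'
print(line.rstrip() != line.rstrip('\n'))  # True: the line ends with a tab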
If you want to directly check for specific characters at the end of the line, you then need to strip any newline, and you need to check if the line isn't empty. For example:
char = '\t'
s = line.strip('\n')
if s and s[-1] == char:
    ...
Addendum 1: read rules from JSON config
import json, re

# here from a string, but could be in a file, of course
json_config = """
{
    "ends with tab": "\\t$",
    "ends with space": " $"
}
"""
rules = {k: re.compile(v) for k, v in json.loads(json_config).items()}
Addendum 2: comments
The following shows how to comment out a rule, as well as a rule to detect comments in the file to process. Since JSON doesn't support comments, we can consider yaml instead:
yaml_config = """
ends with space: ' $'
ends with tab: \\t$
is comment: ^\\s*#
# ignore: 'foo'
"""
import yaml
rules = {k: re.compile(v) for k, v in yaml.safe_load(yaml_config).items()}
Note: 'is comment' is easy. A hypothetical 'has comment' is much harder to define -- why? I'll leave that as an exercise for the reader ;-)
Note 2: in a file, the yaml config would be without double backslash, e.g.:
cat > config.yml << EOF
ends with space: ' $'
ends with tab: \t$
is comment: ^\s*#
# ignore: 'foo'
EOF
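Putting it together, a minimal sketch of applying the compiled rules line by line (the check_file helper is hypothetical, not from the question):

def check_file(path, rules):
    findings = {}  # {line number: [names of violated rules]}
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            hits = [name for name, rx in rules.items() if rx.search(line)]
            if hits:
                findings[lineno] = hits
    return findings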
Additional thought
You may want to give autopep8 a try.
Example:
cat > foo.py << EOF
# this is a comment
text = """
# xyz
bar
"""
def foo():
    # to be continued
    pass
def bar():
    pass
EOF
Note: to reveal the extra spaces:
cat foo.py | perl -pe 's/$/|/'
# this is a comment |
|
text = """|
# xyz |
bar |
"""|
def foo(): |
    # to be continued |
    pass |
|
def bar():|
    pass |
|
|
|
There are several PEP8 issues with the above (extra spaces at end of lines, only 1 line between the functions, etc.). Autopep8 fixes them all (but correctly leaves the text variable unchanged):
autopep8 foo.py | perl -pe 's/$/|/'
# this is a comment|
|
text = """|
# xyz |
bar |
"""|
|
|
def foo():|
    # to be continued|
    pass|
|
|
def bar():|
    pass|
I have big log files (from 100MB to 2GB) that contain a (single) particular line I need to parse in a Python program. I have to parse around 20,000 files. And I know that the searched line is within the 200 last lines of the file, or within the last 15000 bytes.
As it is a recurring task, I need it to be as fast as possible. What is the fastest way to get it?
I have thought about 4 strategies:
read the whole file in Python and search a regex (method_1)
read only the last 15,000 bytes of the file and search a regex (method_2)
make a system call to grep (method_3)
make a system call to grep after tailing the last 200 lines (method_4)
Here are the functions I created to test these strategies:
import os
import re
import subprocess
def method_1(filename):
    """Method 1: read whole file and regex"""
    regex = r'\(TEMPS CP :[ ]*.*S\)'
    with open(filename, 'r') as f:
        txt = f.read()
    match = re.search(regex, txt)
    if match:
        print match.group()

def method_2(filename):
    """Method 2: read part of the file and regex"""
    regex = r'\(TEMPS CP :[ ]*.*S\)'
    with open(filename, 'r') as f:
        size = min(15000, os.stat(filename).st_size)
        f.seek(-size, os.SEEK_END)
        txt = f.read(size)
    match = re.search(regex, txt)
    if match:
        print match.group()

def method_3(filename):
    """Method 3: grep the entire file"""
    cmd = 'grep "(TEMPS CP :" {} | head -n 1'.format(filename)
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    print process.communicate()[0][:-1]

def method_4(filename):
    """Method 4: tail of the file and grep"""
    cmd = 'tail -n 200 {} | grep "(TEMPS CP :"'.format(filename)
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    print process.communicate()[0][:-1]
I ran these methods on two files ("trace" is 207MB and "trace_big" is 1.9GB) and got the following computation time (in seconds):
+----------+-----------+-----------+
| | trace | trace_big |
+----------+-----------+-----------+
| method_1 | 2.89E-001 | 2.63 |
| method_2 | 5.71E-004 | 5.01E-004 |
| method_3 | 2.30E-001 | 1.97 |
| method_4 | 4.94E-003 | 5.06E-003 |
+----------+-----------+-----------+
So method_2 seems to be the fastest. But is there any other solution I did not think about?
Edit
In addition to the previous methods, Gosha F suggested a fifth method using mmap:
import contextlib
import math
import mmap
def method_5(filename):
    """Method 5: use memory mapping and regex"""
    regex = re.compile(r'\(TEMPS CP :[ ]*.*S\)')
    offset = max(0, os.stat(filename).st_size - 15000)
    ag = mmap.ALLOCATIONGRANULARITY
    offset = ag * (int(math.ceil(offset / ag)))
    with open(filename, 'r') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY, offset=offset)
        with contextlib.closing(mm) as txt:
            match = regex.search(txt)
            if match:
                print match.group()
I tested it and got the following results:
+----------+-----------+-----------+
| | trace | trace_big |
+----------+-----------+-----------+
| method_5 | 2.50E-004 | 2.71E-004 |
+----------+-----------+-----------+
You may also consider using memory mapping (the mmap module) like this:
import contextlib, mmap, os, re

def method_5(filename):
    """Method 5: use memory mapping and regex"""
    regex = re.compile(r'\(TEMPS CP :[ ]*.*S\)')
    offset = max(0, os.stat(filename).st_size - 15000)
    with open(filename, 'r') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY, offset=offset)) as txt:
            match = regex.search(txt)
            if match:
                print match.group()
Also some side notes:
in the case of using a shell command, ag (the silver searcher) may in some cases be orders of magnitude faster than grep (although with only 200 lines of greppable text the difference probably vanishes compared to the overhead of starting a shell)
just compiling your regex at the beginning of the function may make some difference
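For what it's worth, a minimal timing harness along these lines could reproduce the comparison (a sketch; 'trace' is assumed to be one of the log files from the question):

import timeit

# average of 10 runs per method; assumes the method_* functions are defined
for method in (method_1, method_2, method_3, method_4, method_5):
    t = timeit.timeit(lambda: method('trace'), number=10) / 10
    print '%s: %.3e s' % (method.__name__, t)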
Probably faster to do the processing in the shell so as to avoid the Python overhead. Then you can pipe the result into a Python script. Otherwise it looks like you did the fastest thing.
Seeking and then regex-matching should be very fast. Methods 2 and 4 are essentially the same, but with method 4 you incur the extra overhead of Python spawning a subprocess.
Does it have to be in Python? Why not a shell script?
My guess is that method 4 will be the fastest/most efficient. That's certainly how I'd write it as a shell script. And it's got to be faster than 1 or 3. I'd still time it in comparison to method 2 to be 100% sure though.
I have one table (please refer to the image). In this table I want to remove the "A" character from each row. How can I do this in Python?
Below is my code using regexp_replace, but it is not optimised. I want optimised code:
def re(s):
    return regexp_replace(s, "A", "").cast("Integer")

finalDF = finalD.select(re(col("C0")).alias("C0"), col("C1"),
                        re(col("C2")).alias("C2"),
                        re(col("C3")).alias("C3"), col("C4"),
                        re(col("C5")).alias("C5"),
                        re(col("C6")).alias("C6"), col("C7"),
                        re(col("C8")).alias("C8"),
                        re(col("C9")).alias("C9"), col("C10"),
                        re(col("C11")).alias("C11"), col("C12"),
                        re(col("C13")).alias("C13"),
                        re(col("C14")).alias("C14"), col("C15"),
                        re(col("C16")).alias("C16"), col("C17"),
                        re(col("C18")).alias("C18"),
                        re(col("C19")).alias("C19"), col("Label"))
finalDF.show(2)
Thank you in advance.
Why regex? Regex would be overkill.
If you have data in the format you have given, then use the replace function as below:
Content of master.csv:
A11| 6|A34|A43|
A11| 6|A35|A44|
Code:
with open('master.csv', 'r') as fh:
    for line in fh.readlines():
        print "Before - ", line
        line = line.replace('A', '')
        print "After - ", line
        print "---------------------------"
Output:
C:\Users\dinesh_pundkar\Desktop>python c.py
Before - A11| 6|A34|A43|
After - 11| 6|34|43|
---------------------------
Before - A11| 6|A35|A44|
After - 11| 6|35|44|
---------------------------
Code replacing 'A' in the complete data in one shot (without going line by line):
with open("master.csv",'r') as fh:
data = fh.read()
data_after_remove = data.replace('A','')
print "Before remove ..."
print data
print "After remove ..."
print data_after_remove
Output:
C:\Users\dinesh_pundkar\Desktop>python c.py
Before remove...
A11| 6|A34|A43|
A11| 6|A35|A44|
After remove ...
11| 6|34|43|
11| 6|35|44|
C:\Users\dinesh_pundkar\Desktop>
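If you do want to stay in PySpark, a sketch that applies the same replacement over a list of columns avoids spelling out every select expression by hand (the column names come from the question; which columns need cleaning is an assumption):

from pyspark.sql.functions import regexp_replace, col

# columns assumed to need the 'A' prefix stripped, per the question's select
cols_to_clean = {"C0", "C2", "C3", "C5", "C6", "C8", "C9",
                 "C11", "C13", "C14", "C16", "C18", "C19"}

finalDF = finalD.select(*[
    regexp_replace(col(c), "A", "").cast("integer").alias(c)
    if c in cols_to_clean else col(c)
    for c in finalD.columns
])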
What's an easy way to convert the output of Python PrettyTable to a programmatically usable format such as CSV?
The output looks like this:
C:\test> nova list
spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
Perhaps this will get you close:
nova list | grep -v '\-\-\-\-' | sed 's/^[^|]\+|//g' | sed 's/|\(.\)/,\1/g' | tr '|' '\n'
This will strip the --- lines
Remove the leading |
Replace all but the last | with ,
Replace the last | with \n
Here's a real ugly one-liner:
import csv
s = """\
spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+"""
result = [tuple(filter(None, map(str.strip, splitline))) for line in s.splitlines() for splitline in [line.split("|")] if len(splitline) > 1]
with open('output.csv', 'wb') as outcsv:
writer = csv.writer(outcsv)
writer.writerows(result)
I can unwrap it a bit to make it nicer:
splitlines = s.splitlines()
splitdata = [line.split("|") for line in splitlines]
splitdata = [line for line in splitdata if len(line) > 1]
# toss the lines that don't have any data in them -- pure separator lines
header, *data = [[field.strip() for field in line if field.strip()] for line in splitdata]
result = [header] + data
# I'm really just separating these, then re-joining them, but sometimes having
# the headers separately is an important thing!
Or possibly more helpful:
result = []
for line in s.splitlines():
    splitdata = line.split("|")
    if len(splitdata) == 1:
        continue  # skip lines with no separators
    linedata = []
    for field in splitdata:
        field = field.strip()
        if field:
            linedata.append(field)
    result.append(linedata)
@AdamSmith's answer has a nice method for parsing the raw table string. Here are a few additions to turn it into a generic function (I chose not to use the csv module, so there are no additional dependencies):
def ptable_to_csv(table, filename, headers=True):
    """Save PrettyTable results to a CSV file.

    Adapted from @AdamSmith https://stackoverflow.com/questions/32128226

    :param PrettyTable table: Table object to get data from.
    :param str filename: Filepath for the output CSV.
    :param bool headers: Whether to include the header row in the CSV.
    :return: None
    """
    raw = table.get_string()
    data = [tuple(filter(None, map(str.strip, splitline)))
            for line in raw.splitlines()
            for splitline in [line.split('|')] if len(splitline) > 1]
    if table.title is not None:
        data = data[1:]
    if not headers:
        data = data[1:]
    with open(filename, 'w') as f:
        for d in data:
            f.write('{}\n'.format(','.join(d)))
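A quick usage sketch (assumes the prettytable package is installed; the table contents are made up):

from prettytable import PrettyTable

table = PrettyTable(['ID', 'Name', 'Status'])
table.add_row(['6bca09f8', 'tester', 'ACTIVE'])
ptable_to_csv(table, 'output.csv')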
Here's a solution using a regular expression. It also works for an arbitrary number of columns (the number of columns is determined by counting the number of plus signs in the first input line).
input_string = """spu+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+
| 6bca09f8-a320-44d4-a11f-647dcec0aaa1 | tester | ACTIVE | - | Running | OpenStack-net=10.0.0.1, 10.0.0.3 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------------------+"""
import re, csv, sys
def pretty_table_to_tuples(input_str):
    lines = input_str.split("\n")
    num_columns = len(re.findall(r"\+", lines[0])) - 1
    line_regex = r"\|" + (r" +(.*?) +\|" * num_columns)
    for line in lines:
        m = re.match(line_regex, line.strip())
        if m:
            yield m.groups()
w = csv.writer(sys.stdout)
w.writerows(pretty_table_to_tuples(input_string))
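For the sample input above, this should print something like the following (the csv writer quotes the Networks field because it contains a comma):

ID,Name,Status,Task State,Power State,Networks
6bca09f8-a320-44d4-a11f-647dcec0aaa1,tester,ACTIVE,-,Running,"OpenStack-net=10.0.0.1, 10.0.0.3"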
I am trying to parse each line of a database file to get it ready for import. It has fixed-width lines, but in characters, not in bytes. I have coded something based on martineau's answer, but I am having trouble with the special characters.
Sometimes they will break the expected width; other times they will just throw a UnicodeDecodeError. I believe the decode error could be fixed, but can I continue doing this struct.unpack and correctly decode the special characters? I think the problem is that they are encoded in multiple bytes, messing up the expected field widths, which I understand to be in bytes and not in characters.
import os, csv
def ParseLine(arquivo):
    import struct, string
    format = "1x 12s 1x 18s 1x 16s"
    expand = struct.Struct(format).unpack_from
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
    for line in arquivo:
        fields = unpack(line)
        yield [x.strip() for x in fields]

Caminho = r"C:\Sample"
os.chdir(Caminho)

with open("Sample data.txt", 'r') as arq:
    with open("Out" + ".csv", "w", newline='') as sai:
        Write = csv.writer(sai, delimiter=";", quoting=csv.QUOTE_MINIMAL).writerows
        for line in ParseLine(arq):
            Write([line])
Sample data:
| field 1| field 2 | field 3 |
| sreaodrsa | raesodaso t.thl o| .tdosadot. osa |
| resaodra | rôn. 2x 17/220V | sreao.tttra v |
| esarod sê | raesodaso t.thl o| .tdosadot. osa |
| esarod sa í| raesodaso t.thl o| .tdosadot. osa |
Actual output:
field 1;field 2;field 3
sreaodrsa;raesodaso t.thl o;.tdosadot. osa
resaodra;rôn. 2x 17/22;V | sreao.tttra
In the output we see lines 1 and 2 are as expected. Line 3 got wrong widths, probably due to the multibyte ô. Line 4 throws the following exception:
Traceback (most recent call last):
File "C:\Sample\FindSample.py", line 18, in <module>
for line in ParseLine(arq):
File "C:\Sample\FindSample.py", line 9, in ParseLine
fields = unpack(line)
File "C:\Sample\FindSample.py", line 7, in <lambda>
unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
File "C:\Sample\FindSample.py", line 7, in <genexpr>
unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 11: unexpected end of data
I will need to perform specific operations on each field, so I can't use a re.sub on the whole file as I was doing before. I would like to keep this code, as it seems efficient and is on the brink of working. If there is some much more efficient way to parse, I could give it a try, though. I need to keep the special characters.
Indeed, the struct approach falls down here because it expects fields to be a fixed number of bytes wide, while your format uses a fixed number of codepoints.
I'd not use struct here at all. Your lines are already decoded to Unicode values, just use slicing to extract your data:
def ParseLine(arquivo):
    slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
    for line in arquivo:
        yield [line[s].strip() for s in slices]
This deals entirely in characters in an already decoded line, rather than bytes. If you have field widths instead of indices, the slice() objects could also be generated:
def widths_to_slices(widths):
    pos = 0
    for width in widths:
        pos += 1  # delimiter
        yield slice(pos, pos + width)
        pos += width

def ParseLine(arquivo):
    widths = (12, 18, 16)
    for line in arquivo:
        yield [line[s].strip() for s in widths_to_slices(widths)]
Demo:
>>> sample = '''\
... | field 1| field 2 | field 3 |
... | sreaodrsa | raesodaso t.thl o| .tdosadot. osa |
... | resaodra | rôn. 2x 17/220V | sreao.tttra v |
... | esarod sê | raesodaso t.thl o| .tdosadot. osa |
... | esarod sa í| raesodaso t.thl o| .tdosadot. osa |
... '''.splitlines()
>>> def ParseLine(arquivo):
...     slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
...     for line in arquivo:
...         yield [line[s].strip() for s in slices]
...
>>> for line in ParseLine(sample):
... print(line)
...
['field 1', 'field 2', 'field 3']
['sreaodrsa', 'raesodaso t.thl o', '.tdosadot. osa']
['resaodra', 'rôn. 2x 17/220V', 'sreao.tttra v']
['esarod sê', 'raesodaso t.thl o', '.tdosadot. osa']
['esarod sa í', 'raesodaso t.thl o', '.tdosadot. osa']
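To wire this back into the original CSV-writing loop, here is a sketch reusing the question's names (the utf-8 encoding is an assumption about the input file):

import csv

# encoding='utf-8' is an assumption; adjust to the file's real encoding
with open("Sample data.txt", 'r', encoding='utf-8') as arq, \
        open("Out.csv", 'w', newline='', encoding='utf-8') as sai:
    write = csv.writer(sai, delimiter=';', quoting=csv.QUOTE_MINIMAL).writerows
    for fields in ParseLine(arq):
        write([fields])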