PyParsing "parseString" class unavailable - python

I am trying to build a quick script that extracts only certain information from invoice PDFs without using regex.
When I try to define the grammar for, say, electric usage, I get the error "cannot import name 'parseString' from 'pyparsing'".
I have tried reinstalling, switching the call from camelCase to snake_case, and so on, but I am at a loss at this point.
Here is the (I think) relevant documentation:
https://pyparsing-docs.readthedocs.io/en/latest/pyparsing.html
The code:
import pyparsing as pp
from pyparsing import nums  # imports needed for the snippet below

electric_usage = pp.Word(nums) + ',' + pp.Word(nums) + 'kwh'
dates_1 = pp.Word(nums) + '-' + pp.Word(nums) + '-' + pp.Word(nums)
dates_2 = pp.Word(nums) + '/' + pp.Word(nums) + '/' + pp.Word(nums)

for text in pdf_text:
    usage_pulled = electric_usage.parseString(text)
    print(usage_pulled)
here is an example of one of the regex patterns that actually seems to work to pull usage values:
'[0-9]+[0-9]+[0-9]+[,]+[0-9]+[0-9]+[0-9]'
and cost:
'[$]+[0-9]+[0-9]+[,]+[0-9]+[0-9]+[0-9]+[.]+[0-9]+[0-9]+$'
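For reference, parseString is a method on a parser expression rather than a name you can import from the pyparsing module, and it only matches from the very start of the input; to pull matches out of the middle of PDF text, searchString (or scanString) is the usual tool. A minimal sketch of that approach, assuming the text really contains values like "1,234 kwh" (the sample string below is made up):

import pyparsing as pp

electric_usage = pp.Word(pp.nums) + ',' + pp.Word(pp.nums) + pp.CaselessLiteral('kwh')

sample = "Total usage this period: 1,234 kwh as of 01/15/2020"
for match in electric_usage.searchString(sample):
    print(match)  # -> ['1', ',', '234', 'kwh']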

Separating a string into numbers and letters in python

I started learning python two days ago. Today I built a web scraping script which pulls data from yahoo finance and puts it in a csv file. The problem I have is that some values are string because yahoo finance displays them as such.
For example: Revenue: 806.43M
When I copy them into the CSV I can't use them for calculations, so I was wondering if it is possible to separate the "806.43" and the "M", keeping both so the unit is preserved, and put them in two different columns.
For writing to the CSV file I use this command:
f.write(revenue + "," + revenue_value + "\n")
where:
print(revenue)
Revenue (ttm)
print(revenue_value)
806.43M
so in the end I should be able to use a command which looks something like this
f.write(revenue + "," + revenue_value + "," + revenue_unit + "\n")
where revenue_value is 806.43 and revenue_unit is M
Hope someone could help with the problem.
I believe the easiest way is to parse the number as string and convert it to a float based on the unit in the end of the string.
The following should do the trick:
def parse_number(number_str) -> float:
    mapping = {
        "K": 1000,
        "M": 1000000,
        "B": 1000000000
    }
    unit = number_str[-1]
    number_float = float(number_str[:-1])
    return number_float * mapping[unit]
And here's an example:
my_number = "806.43M"
print(parse_number(my_number))
806430000.0
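If some of the scraped values come through without a suffix, the same idea with a small guard avoids a KeyError; this variant is my own sketch, not part of the answer above:

def parse_number_safe(number_str) -> float:
    mapping = {"K": 1000, "M": 1000000, "B": 1000000000}
    unit = number_str[-1].upper()
    if unit in mapping:
        return float(number_str[:-1]) * mapping[unit]
    return float(number_str)  # plain number with no unit, e.g. "512.5"

print(parse_number_safe("806.43M"))  # 806430000.0
print(parse_number_safe("512.5"))    # 512.5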
You can always try regular expressions.
Here's a pretty good online tool to let you practice using Python-specific standards.
import re
sample = "Revenue (ttm): 806.43M"
# Note: the `(?P<name here>)` section is a named group. That way we can identify what we want to capture.
financials_pattern = r'''
    (?P<category>.+?):?\s+?    # Capture everything up until the colon
    (?P<value>[\d\.]+)         # Capture only numeric values and decimal points
    (?P<unit>[\w]*)?           # Capture a trailing unit type (M, MM, etc.)
'''
# Flags:
# re.I -> Ignore character case (upper vs lower)
# re.X -> Allows for 'verbose' pattern construction, as seen above
res = re.search(financials_pattern, sample, flags = re.I | re.X)
Print our dictionary of values:
res.groupdict()
Output:
{'category': 'Revenue (ttm)',
'value': '806.43',
'unit': 'M'}
We can also use .groups() to list results in a tuple.
res.groups()
Output:
('Revenue (ttm)', '806.43', 'M')
In this case, we'll immediately unpack those results into your variable names.
revenue = None # If this is None after trying to set it, don't print anything.
revenue, revenue_value, revenue_unit = res.groups()
We'll use fancy f-strings to print out both your f.write() call along with the results we've captured.
if revenue:
    print(f'f.write(revenue + "," + revenue_value + "," + revenue_unit + "\\n")\n')
    print(f'f.write("{revenue}" + "," + "{revenue_value}" + "," + "{revenue_unit}" + "\\n")')
Output:
f.write(revenue + "," + revenue_value + "," + revenue_unit + "\n")
f.write("Revenue (ttm)" + "," + "806.43" + "," + "M" + "\n")

How to concatenate multiple variables?

I have three UV sensors - integers output; one BME280 - float output (temperature and pressure); and one GPS Module - float output.
I need to build a string in this form - #teamname;temperature;pressure;uv_1;uv_2;uv_3;gpscoordinates#
and send it via ser.write at least once per second (I'm using an APC220 module).
Is this the right (and fastest) way to do it?
textstr = str("#" + "teamname" + ";" + str(temperature) + ";" + str(pressure) + ";" + str(uv_1) + ";" + str(uv_2) + ";" + str(uv_3) + "#")
(...)
ser.write(('%s \n'%(textstr)).encode('utf-8'))
You may try something like this:
values = [teamname, temperature, pressure, uv_1, uv_2, uv_3, gpscoordinates]
joined = ';'.join(map(str, values))
ser.write(('#%s# \n' % joined).encode('utf-8'))
If using python 3.6+ then you can do this instead
textstr = f"#teamname;{temperature};{pressure};{uv_1};{uv_2};{uv_3}# \n"
(...)
ser.write((textstr).encode('utf-8'))
If teamname and gpscoordinates are also variables, then add them the same way:
textstr = f"#{teamname};{temperature};{pressure};{uv_1};{uv_2};{uv_3};{gpscoordinates}# \n"
(...)
ser.write((textstr).encode('utf-8'))
For more info about string formatting, see https://realpython.com/python-f-strings/
It might improve readability to use python's format:
textstr = "#teamname;{};{};{};{};gpscoordinates#".format(temperature, pressure, uv_1, uv_2, uv_3)
ser.write(('%s \n'%(textstr)).encode('utf-8'))
assuming gpscoordinates is literal text (it is missing from your attempted code). If it's a variable, replace the text with {} and pass it as another argument to format.
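Since the question also mentions sending at least once per second, it may help to wrap the framing in a small function so the send loop stays readable. A minimal sketch with made-up example readings (only the f-string layout comes from the answers above):

def frame(teamname, temperature, pressure, uv_1, uv_2, uv_3, gpscoordinates):
    # Build the '#field;field;...#' frame expected by the receiver
    return f"#{teamname};{temperature};{pressure};{uv_1};{uv_2};{uv_3};{gpscoordinates}# \n"

# ser.write(frame(...).encode('utf-8')) would send it once per loop iteration
print(frame("teamname", 21.5, 1013.2, 1, 2, 3, "47.49,19.04"))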

How to format a long string while following pylint rules?

I have a very simple problem that I have been unable to find the solution to, so I thought I'd try my "luck" here.
I have a string that is created using variables and static text altogether. It is as follows:
filename_gps = 'id' + str(trip_id) + '_gps_did' + did + '_start' + str(trip_start) + '_end' + str(trip_end) + '.json'
However, my problem is that pylint complains that this string representation is too long. And here is the problem: how would I format this string representation over multiple lines without it looking weird, while still staying within the "rules" of pylint?
At one point I ended up having it looking like this, however that is incredible "ugly" to look at:
filename_gps = 'id' + str(
trip_id) + '_gps_did' + did + '_start' + str(
trip_start) + '_end' + str(
trip_end) + '.json'
I found that it would follow the "rules" of pylint if I formatted it like this:
filename_gps = 'id' + str(
    trip_id) + '_gps_did' + did + '_start' + str(
    trip_start) + '_end' + str(
    trip_end) + '.json'
Which is much "prettier" to look at, but in case I didn't have the "str()" casts, how would I go about creating such a string?
I doubt that there is a difference between pylint for Python 2.x and 3.x, but if there is I am using Python 3.x.
Don't use so many str() calls. Use string formatting:
filename_gps = 'id{}_gps_did{}_start{}_end{}.json'.format(
    trip_id, did, trip_start, trip_end)
If you do have a long expression with a lot of parts, you can create a longer logical line by using (...) parentheses:
filename_gps = (
    'id' + str(trip_id) + '_gps_did' + did + '_start' +
    str(trip_start) + '_end' + str(trip_end) + '.json')
This would work for breaking up a string you are using as a template in a formatting operation, too:
foo_bar = (
    'This is a very long string with some {} formatting placeholders '
    'that is broken across multiple logical lines. Note that there are '
    'no "+" operators used, because Python auto-joins consecutive string '
    'literals.'.format(spam))
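On Python 3.6+ the same layout also works with f-strings plus implicit concatenation, which keeps each line short without any str() calls. This variant is not from the original answer, and the sample values below are made up:

trip_id, did, trip_start, trip_end = 7, 'abc123', 1404205700, 1404205800  # example values

filename_gps = (
    f'id{trip_id}_gps_did{did}'
    f'_start{trip_start}_end{trip_end}.json')
print(filename_gps)  # id7_gps_didabc123_start1404205700_end1404205800.json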

How to use pyparsing LineStart?

I'm trying to use pyparsing to parse key:value pairs from the comments in a document. A key starts at the beginning of a line, and a value follows. Values may be continued on multiple lines that begin with whitespace.
import pyparsing as pp
instring = """
-- This is (a) #%^& comment
/*
name1: val
name2: val2 with $*&##) junk
name3: val3: with #)(*% multi-
 line: content
*/
"""
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setDebug()
meta1 = pp.LineStart() + identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd())
meta2 = pp.LineStart() + pp.White() + pp.SkipTo(pp.LineEnd())
metaval = meta1 + pp.ZeroOrMore(meta2)
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.OneOrMore(metaval) + pp.Literal("*/")
if __name__ == "__main__":
    p = metalist.parseString(instring)
    print(p)
Fails with:
Matched {Empty SkipTo:(LineEnd) Empty} -> ['This is (a) #%^& comment']
File "C:\Users\user\py3\lib\site-packages\pyparsing.py", line 2305, in parseImpl
raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected start of line (at char 32), (line:4, col:1)
The answer to pyparsing whitespace match issues says
LineStart has always been difficult to work with, but ...
If the parser is at line 4 column 1 (the first key:value pair), then why is it not finding a start of line? What is the correct pyparsing syntax to recognize lines beginning with no whitespace and lines beginning with whitespace?
I think the confusion I have with LineStart is that, for LineEnd, I can look for a '\n' character, but there is no separate character for LineStart. So in LineStart I look to see if the current parser location is positioned just after a '\n'; or if it is currently on a '\n', move past it and still continue. Unfortunately, I implemented this in a place that messes up the reporting location, so you get those weird errors that read like "failed to find a start of line on line X col 1," which really does sound like it should be a successfully matched start of a line. Also, I think I need to revisit this implicit newline-skipping, or for that matter, all whitespace-skipping in general for LineStart.
For now, I've gotten your code to work by expanding your line-starting expression slightly, as:
LS = pp.Optional(pp.LineEnd()) + pp.LineStart()
and replaced the LineStart references in meta1 and meta2 with LS:
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setDebug()
meta1 = LS + identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd())
meta2 = LS + pp.White() + pp.SkipTo(pp.LineEnd())
metaval = meta1 + pp.ZeroOrMore(meta2)
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.OneOrMore(metaval) + pp.Literal("*/")
If this situation with LineStart leaves you uncomfortable, here is another tactic you can try: using a parse-time condition to only accept identifiers that start in column 1:
comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setName("identifier")
identifier.addCondition(lambda instring,loc,toks: pp.col(loc,instring) == 1)
meta1 = identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd()).setDebug()
meta2 = pp.White().setDebug() + pp.SkipTo(pp.LineEnd()).setDebug()
metaval = meta1 + pp.ZeroOrMore(meta2, stopOn=pp.Literal('*/'))
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.LineEnd() + pp.OneOrMore(metaval) + pp.Literal("*/")
This code does away with LineStart completely, while I figure out just what I want this particular token to do. I also had to modify the ZeroOrMore repetition in metaval so that */ would not be accidentally processed as continued comment content.
Thanks for your patience with this - I am not keen to quickly put out a patched LineStart change and then find that I have overlooked compatibility issues or other edge cases that just put me back in the current less-than-great state on this class. But I'll put some effort into clarifying this behavior before putting out 2.1.10.
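As a side note (my own refactoring, not part of the answer above), the column-1 condition can be pulled out into a named function, which reads a little more clearly if several expressions need the same check:

import pyparsing as pp

def starts_in_column_1(s, loc, toks):
    # pp.col is 1-based, so column 1 means the match begins a new line
    return pp.col(loc, s) == 1

identifier = pp.Word(pp.alphanums + "_").setName("identifier")
identifier.addCondition(starts_in_column_1)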

Python - Transpose columns to rows within data operation and before writing to file

I have developed a public and open source App for Splunk (Nmon performance monitor for Unix and Linux Systems, see https://apps.splunk.com/app/1753/)
A central piece of the App is an old Perl script (recycled, modified and updated), automatically launched by the App to convert the Nmon data (which is a kind of custom CSV), reading it from stdin and writing out formatted CSV files, one per section (a section is a performance monitor).
I now want to fully rewrite this script in Python, which is almost done for a first beta version... BUT I am facing difficulties transposing the data, and I'm afraid I cannot solve it by myself.
This is why I am kindly asking for help today.
Here is the difficulty in detail:
Nmon generates performance monitors for various sections (cpu, memory, disks...); for many of them there is no big difficulty beyond extracting the right timestamp and so on.
But all sections that have a "device" notion (such as DISKBUSY in the provided example, which represents the percentage of time disks were busy) have to be transformed and transposed to be exploitable later.
Currently, I am able to generate the data as follows:
Example:
time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0
The goal is to transpose the data so that the header becomes "time,device,value", for example:
time,device,value
26-JUL-2014 11:10:44,sda,4.4
26-JUL-2014 11:10:44,sda1,0.0
26-JUL-2014 11:10:44,sda2,0.0
And so on.
One month ago, I opened a question for almost the same need (for another app and not exactly the same data, but the same need to transpose columns to rows):
Python - CSV time oriented Transposing large number of columns to rows
I got a very good answer which perfectly did the trick, but I am unable to recycle that piece of code into this new context.
One of the differences is that I want to include the data transposition within the code itself, so that the script works only in memory and avoids dealing with multiple temporary files.
Here is the piece of code:
Note: this needs to run on Python 2.x
###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################

dynamic_section = ["DISKBUSY"]

for section in dynamic_section:

    # Set output file
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open output for writing
    with open(currsection_output, "w") as currsection:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
            find_section = re.match( myregex, line)
            if find_section:

                # csv header

                # Replace some symbols
                line=re.sub("%",'_PCT',line)
                line=re.sub(" ",'_',line)

                # Extract header excluding data that always has Txxxx for timestamp reference
                myregex = '(' + section + ')\,([^T].+)'
                fullheader_match = re.search( myregex, line)
                if fullheader_match:
                    fullheader = fullheader_match.group(2)
                    header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)
                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        currsection.write('time' + ',' + header + '\n'),

                # Extract timestamp

                # Nmon V9 and prior do not have date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'

                # For Nmon V10 and more
                timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and less
                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract Data
                myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
                perfdata_match = re.match( myregex, line)
                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n'),

        # End for

    # Open output for reading and show number of line we extracted
    with open(currsection_output, "r") as currsection:
        num_lines = sum(1 for line in currsection)
        print (section + " section: Wrote", num_lines, "lines")

# End for
The line:
currsection.write('time' + ',' + header + '\n'),
will contain the header
And the line:
currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n'),
contains the data line by line
Note: the final data (header and body) should also contain other information in the target output; to simplify things I removed it from the code above.
For static sections, which do not require the data transposition, the same lines are:
currsection.write('type' + ',' + 'serialnum' + ',' + 'hostname' + ',' + 'time' + ',' + header + '\n'),
And:
currsection.write(section + ',' + SN + ',' + HOSTNAME + ',' + ZZZZ_timestamp + ',' + perfdata + '\n'),
The goal would be to transpose the data just after it has been extracted and before writing it out.
Also, performance and minimal use of system resources (such as working with temporary files rather than holding everything in memory) are a requirement, to prevent the script from generating too high a CPU load on the systems it runs on periodically.
Could anyone help me achieve this? I have looked again and again, and I'm pretty sure there are multiple ways to achieve it (zip, map, dictionary, list, split...), but I failed to get it working...
Please be indulgent, this is my first real Python script :-)
Thank you very much for any help!
More details:
testing nmon file
A small testing nmon file can be retrieved here: http://pastebin.com/xHLRbBU0
Current complete script
The current complete script can be retrieved here: http://pastebin.com/QEnXj6Yh
To test the script, it is required to:
export the SPLUNK_HOME variable to anything relevant for you, e.g.:
mkdir /tmp/nmon2csv
--> place the script and the nmon file there, and make the script executable
export SPLUNK_HOME=/tmp/nmon2csv
mkdir -p etc/apps/nmon
And finally:
cat test.nmon | ./nmon2csv.py
Data will be generated in /tmp/nmon2csv/etc/apps/nmon/var/*
Update: Working code using csv module:
###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################

dynamic_section = ["DISKBUSY","DISKBSIZE","DISKREAD","DISKWRITE","DISKXFER","DISKRIO","DISKWRIO","IOADAPT","NETERROR","NET","NETPACKET","JFSFILE","JFSINODE"]

for section in dynamic_section:

    # Set output file (will be opened after the transpose)
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open Temp
    with TemporaryFile() as tempf:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
            find_section = re.match( myregex, line)
            if find_section:

                # csv header

                # Replace some symbols
                line=re.sub("%",'_PCT',line)
                line=re.sub(" ",'_',line)

                # Extract header excluding data that always has Txxxx for timestamp reference
                myregex = '(' + section + ')\,([^T].+)'
                fullheader_match = re.search( myregex, line)
                if fullheader_match:
                    fullheader = fullheader_match.group(2)
                    header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)
                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        tempf.write('time' + ',' + header + '\n'),

                # Extract timestamp

                # Nmon V9 and prior do not have date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'

                # For Nmon V10 and more
                timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and less
                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract Data
                myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
                perfdata_match = re.match( myregex, line)
                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    tempf.write(ZZZZ_timestamp + ',' + perfdata + '\n'),

        # Open final for writing
        with open(currsection_output, "w") as currsection:

            # Rewind temp
            tempf.seek(0)

            writer = csv.writer(currsection)
            writer.writerow(['type', 'serialnum', 'hostname', 'time', 'device', 'value'])

            for d in csv.DictReader(tempf):
                time = d.pop('time')
                for device, value in sorted(d.items()):
                    row = [section, SN, HOSTNAME, time, device, value]
                    writer.writerow(row)

    # End for

    # Open output for reading and show number of line we extracted
    with open(currsection_output, "r") as currsection:
        num_lines = sum(1 for line in currsection)
        print (section + " section: Wrote", num_lines, "lines")

# End for
The goal is to transpose the data such as we will have in the header "time,device,value"
This rough transposition logic looks like this:
text = '''time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0
'''
import csv

for d in csv.DictReader(text.splitlines()):
    time = d.pop('time')
    for device, value in sorted(d.items()):
        print time, device, value
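For the sample above, the devices come out in alphabetical order because of sorted(), so the first few printed lines would be:
26-JUL-2014 11:10:44 sda 4.4
26-JUL-2014 11:10:44 sda1 0.0
26-JUL-2014 11:10:44 sda2 0.0
and so on for the remaining devices and timestamps.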
Putting it all together into a complete script looks something like this:
import csv

with open('transposed.csv', 'wb') as destfile:
    writer = csv.writer(destfile)
    writer.writerow(['time', 'device', 'value'])
    with open('data.csv', 'rb') as sourcefile:
        for d in csv.DictReader(sourcefile):
            time = d.pop('time')
            for device, value in sorted(d.items()):
                row = [time, device, value]
                writer.writerow(row)
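If avoiding even a temporary file matters (as mentioned in the question), the same melt works over an in-memory list of CSV lines, because csv.DictReader accepts any iterable of strings. A minimal sketch under that assumption, with csv_lines standing in for lines collected earlier in the script:

import csv

# Hypothetical in-memory collection of one section's CSV lines (header first)
csv_lines = [
    'time,sda,sda1',
    '26-JUL-2014 11:10:44,4.4,0.0',
    '26-JUL-2014 11:10:54,4.8,0.0',
]

with open('transposed.csv', 'wb') as destfile:  # 'wb' for Python 2; use 'w' with newline='' on Python 3
    writer = csv.writer(destfile)
    writer.writerow(['time', 'device', 'value'])
    for d in csv.DictReader(csv_lines):
        time = d.pop('time')
        for device, value in sorted(d.items()):
            writer.writerow([time, device, value])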
