Creating a dictionary object from a string that looks like a dictionary - Python

I have a string that looks something similar to the following:
myString = "major: 11, minor: 31, name: A=1,B=1,C=1,P=1, severity: 0, comment: this is down"
I have tried this so far:
dict(elem.split(':') for elem in myString.split(','))
It works fine until it reaches the name element above, which cannot be split on ':' alone.
Elements in that format I would like to turn into a nested dictionary, e.g.
myDic = {'major': '11', 'minor': '31', 'name': {'A': '1', 'B': '1', 'C': '1', 'P': '1'}, 'severity': '0', 'comment': 'this is down'}
If possible I would like to avoid complicated parsing as these turn out to be hard to maintain.
Also I do not know the name/amount of the keys or values in the string above. I just know the format. This is not a JSON-response, this is part of a text in a file and I have no control over the current format.

FYI, this is NOT the complete solution ..
If this is the concrete structure of your input, and it will be the constant pattern within your source, you can distinguish the comma-separated tokens.
The difference between major: 11, and name: A=1,B=1,C=1,P=1, is that there is a SPACE after the first token's comma, while the commas inside the second token have none. So simply by adding a space to the second split's argument, you can parse your string properly.
So, the code should be something like this:
dict(elem.split(':') for elem in myString.split(', '))
Pay attention to the second split method: there is a comma followed by a SPACE ...
Regarding the JSON-like nested format, it needs more work, I guess. I have no idea now ..
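Building on that observation, here is a minimal runnable sketch that also handles the '='-style values (assuming keys never contain ':' and that only the inner '='-pairs use commas without a following space):

```python
def parse(s):
    result = {}
    # split on ", " (comma + space) -- the commas inside "A=1,B=1" have no space after them
    for chunk in s.split(', '):
        key, _, value = chunk.partition(':')
        value = value.strip()
        if '=' in value:
            # "A=1,B=1,..." becomes a nested dictionary
            value = dict(pair.split('=') for pair in value.split(','))
        result[key.strip()] = value
    return result

myString = "major: 11, minor: 31, name: A=1,B=1,C=1,P=1, severity: 0, comment: this is down"
print(parse(myString))
```

For the example string this produces exactly the desired myDic, including the nested dictionary for name.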

Here's another suggestion.
Why don't you transform it into Python dictionary notation?
E.g. in a first step, you replace everything between a ':' and (a comma or the end of input) that contains an '=' (and maybe no whitespace, I don't know) by wrapping it in braces and replacing '=' with ':'.
In a second step, you wrap everything between a ':' and (a comma or the end of input) in quotes, removing trailing and leading whitespace.
Finally, you wrap it all in braces.
I still don't trust that syntax, though... maybe after a few thousand lines have been processed successfully...

At least, this parses the given example correctly...
import re

def parse(s):
    rx = r"""(?x)
        (\w+) \s* : \s*
        (
            (?: \w+ = \w+ ,)*
            (?: \w+ = \w+ )
            |
            (?: [^,]+ )
        )
    """
    r = {}
    for key, val in re.findall(rx, s):
        if '=' in val:
            # turn "A=1,B=1,..." into a nested dict
            val = dict(x.split('=') for x in val.split(','))
        r[key] = val
    return r

myString = "major: 11, minor: 31, name: A=1,B=1,C=1,P=1, severity: 0, comment: this is down"
print(parse(myString))
# {'major': '11', 'minor': '31', 'name': {'A': '1', 'B': '1', 'C': '1', 'P': '1'}, 'severity': '0', 'comment': 'this is down'}

Related

Parsing Regular expression from YAML file adds extra \

I have a bunch of regular expression I am using to scrape lot of specific fields from a text document. Those all work fine when used directly inside the python script.
But I thought of putting them in a YAML file and reading from there. Here's how it looks:
# Document file for Regular expression patterns for a company invoice
---
issuer: ABCCorp
fields:
  invoice_number: INVOICE\s*(\S+)
  invoice_date: INVOICE DATE\s*(\S+)
  cusotmer_id: CUSTOMER ID\s*(\S+)
  origin: ORIGIN\s*(.*)ETD
  destination: DESTINATION\s*(.*)ETA
  sub_total: SUBTOTAL\s*(\S+)
  add_gst: SUBTOTAL\s*(\S+)
  total_cost: TOTAL USD\s*(\S+)
  description_breakdown: (?s)(DESCRIPTION\s*GST IN USD\s*.+?TOTAL CHARGES)
  package_details_fields: (?s)(WEIGHT\s*VOLUME\s*.+?FLIGHT|ROAD REFERENCE)
  mawb_hawb: (?s)((FLIGHT|ROAD REFERENCE).*(MAWB|MASTER BILL)\s*.+?GOODS COLLECTED FROM)
When I retrieve it using PyYAML in Python, it adds string quotes around each pattern (which is OK, as I can add r'' later), but I see it is also adding an extra \ within each regex. That would make the regex go wrong when used in code now.
import yaml
with open(os.path.join(TEMPLATES_DIR,"regex_template.yml")) as f:
my_dict = yaml.safe_load(f)
print(my_dict)
{'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)', 'cusotmer_id': 'CUSTOMER ID\\s*(\\S+)', 'origin': 'ORIGIN\\s*(.*)ETD', 'destination': 'DESTINATION\\s*(.*)ETA', 'sub_total': 'SUBTOTAL\\s*(\\S+)', 'add_gst': 'SUBTOTAL\\s*(\\S+)', 'total_cost': 'TOTAL USD\\s*(\\S+)', 'description_breakdown': '(?s)(DESCRIPTION\\s*GST IN USD\\s*.+?TOTAL CHARGES)', 'package_details_fields': '(?s)(WEIGHT\\s*VOLUME\\s*.+?FLIGHT|ROAD REFERENCE)', 'mawb_hawb'
How to read the right regex as I have it in yaml file? Does any string written in yaml file gets a quotation mark around that when read in python because that is a string?
EDIT:
The main regex in yaml file is:
INVOICE\s*(\S+)
Output in dict is:
'INVOICE\\s*(\\S+)'
This is too long to do as a comment.
The backslash character is used to escape special characters. For example:
'\n': newline
'\a': alarm
When you use it before a letter that has no special meaning it is just taken to be a backslash character:
'\s': backslash followed by 's'
But to be sure, whenever you want to enter a backslash character in a string and not have it interpreted as the start of an escape sequence, you double it up:
'\\s': also a backslash followed by a 's'
'\\a': a backslash followed by a 'a'
If you use a r'' type literal, then a backslash is never interpreted as the start of an escape sequence:
r'\a': a backslash followed by 'a' (not an alarm character)
r'\n': a backslash followed by 'n' (not a newline -- however, when used in a regex, it will match a newline)
Now here is the punchline:
When you print out these Python objects, such as:
d = {'x': 'ab\sd'}
print(d)
Python will print the representation of the dictionary, and the string inside it will show as 'ab\\sd'. If you just did:
print('ab\sd')
You would see ab\sd. Quite a difference.
Why the difference? See if this makes sense:
d = {'x': 'ab\ncd'}
print(d)
print('ab\ncd')
Results:
{'x': 'ab\ncd'}
ab
cd
The bottom line is that when you print a Python object other than a string, it prints a representation of the object showing how you would have created it. And if the object contains a string and that string contains a backslash, you would have doubled up on that backslash when entering it.
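A short, self-contained check (plain strings, no YAML involved) makes the str-versus-repr difference concrete:

```python
pattern = 'INVOICE\\s*(\\S+)'          # what the dict repr shows
print(pattern)                          # INVOICE\s*(\S+) -- single backslashes when printed
print(repr(pattern))                    # 'INVOICE\\s*(\\S+)' -- doubled in the repr
print(pattern == r'INVOICE\s*(\S+)')    # True: it is exactly the same string
print(len('\\s'))                       # 2: one backslash, one 's'
```

So the YAML loader has not added anything; the doubled backslashes exist only in the printed representation.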
Update
To process your my_dict: Since you did not provide the complete value of my_dict, I can only use a truncated version for demo purposes. But this will demonstrate that my_dict has perfectly good regular expressions:
import re
my_dict = {'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)'}}
fields = my_dict['fields']
invoice_number_re = fields['invoice_number']
m = re.search(invoice_number_re, 'blah-blah INVOICE 12345 blah-blah')
print(m[1])
Prints:
12345
If you are going to be using the same regular expressions over and over again, then it is best to compile them:
import re

my_dict = {'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)'}}

# compile the strings to regular expressions
fields = my_dict['fields']
for k, v in fields.items():
    fields[k] = re.compile(v)

invoice_number_re = fields['invoice_number']
m = invoice_number_re.search('blah-blah INVOICE 12345 blah-blah')
print(m[1])

How to split and ignore separators in file path string using Python

I have a string like this
LASTSCAN:C:\Users\Bob\Scripts\VisualizeData\doc\placeholder.PNG:1557883221.11
The format of the string is [Command][File path][Timestamp]. Currently it is separated by colons, but the file path also contains a colon. Other times the format of the string may change, but it is always separated by colons. For instance:
SCAN:2000:25:-12.5:12.5:C:\Users\Potato\potato.PNG:1557884143.93
This string has a signature of [Command][Frames][Speed][Start][Stop][File path][Timestamp]
How do I split the input string to obtain an output like this?
['LASTSCAN', 'C:\Users\Bob\Scripts\VisualizeData\doc\placeholder.PNG', '1557883221.11']
Expected output for 2nd example
['SCAN', '2000', '25', '-12.5', '12.5', 'C:\Users\Potato\potato.PNG', '1557884143.93']
Try splitting on the regex pattern :(?!\\):
input = "LASTSCAN:C:\Users\Bob\Scripts\VisualizeData\doc\placeholder.PNG:1557883221.11"
output = re.split(r':(?!\\)', input)
print(output)
['LASTSCAN', 'C:\\Users\\Bob\\Scripts\\VisualizeData\\doc\\placeholder.PNG', '1557883221.11']
The logic is to split on any colon that is not immediately followed by a backslash (the path separator). This spares the ':' in the file path from being targeted as a split point.
If you can be certain the ":" you wish to keep is immediately followed by a "\" and that there will be no other "\" around, you could try something like this (corrected so it runs; note that the ':' which split() removed has to be put back):
parts = string.split(':')
merged = []
for part in parts:
    if part.startswith('\\'):
        # this piece belongs to the previous one: re-join with the removed ':'
        merged[-1] += ':' + part
    else:
        merged.append(part)
Why not using regex:
import re
s = 'SCAN:2000:25:-12.5:12.5:C:/Users/Potato/potato.PNG:1557884143.93'
print(re.split(':(?!/)',s))
Output:
['SCAN', '2000', '25', '-12.5', '12.5', 'C:/Users/Potato/potato.PNG', '1557884143.93']
Also, at least for me, you have to change \ to / in the input string, and in the regex expression as well.
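If the character after a kept colon could be either a backslash or a forward slash, a combined negative lookahead covers both (my own generalization, assuming no other '\' or '/' ever follows a splitting colon):

```python
import re

s = r"LASTSCAN:C:\Users\Bob\Scripts\VisualizeData\doc\placeholder.PNG:1557883221.11"
# split on any ':' that is NOT followed by '\' or '/'
print(re.split(r':(?![\\/])', s))
```

The same pattern then works unchanged for forward-slash paths like C:/Users/Potato/potato.PNG.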

Replace escape sequence characters in a string in Python 3.x

I have used the following code to replace the escape sequence characters in a string. I first split on \n and then used re.sub(), but I still don't know what I am missing; the code is not working as expected. I am a newbie at Python, so please don't judge if there are optimisation problems. Here is my code:
#import sys
import re
String = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
splitString = String.split('\n')
replacedStrings = []
i=0
for oneString in splitString:
    #oneString = oneString.replace(r'^(.?)*(\\[^n])+(.?)*$', "")
    oneString = re.sub(r'^(.?)*(\\[^n])+(.?)*$', "", oneString)
    print(oneString)
    replacedStrings.insert(i, oneString)
    i += 1
print(replacedStrings)
My aim here is: I need the values only (without the escaped sequences) as the split strings.
My approach here is:
I have split the string by \n that gives me array list of separate strings.
Then, I have checked each string using the regex, if the regex matches, then the matched substring is replaced to "".
Then I have pushed those strings to a collection, thinking that it will store the replaced strings in the new array list.
So basically, I am through with 1 and 2, but currently I am stuck at 3. Following is my Output:
1
2
3
4
['1\r\r\t\r', '2\r\r', '3\r\r\r\r', '\r', '\r4', '\r']
You might find it easier to use re.findall here with the simple pattern \S+:
input = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
output = re.findall(r'\S+', input)
print(output)
['1', '2', '3', '4']
This approach will isolate and match any islands of one or more non-whitespace characters.
Edit:
Based on your new input data, we can try matching on the pattern [^\r\n\t]+:
input = "jkahdjkah \r\r\t\r\nA: B\r\r\nA : B\r\r\r\r\n\r\n\r4\n\r"
output = re.findall(r'[^\r\n\t]+', input)
print(output)
['jkahdjkah ', 'A: B', 'A : B', '4']
re.sub isn't really the right tool for the job here. What would be on the table is split or re.findall, because you want to repeatedly match/isolate a certain part of your text. re.sub is useful for taking a string and transforming it to something else. It can be used to extract text, but does not work so well for multiple matches.
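An equivalent formulation with re.split is also possible: split on runs of whitespace and filter out the empty pieces that leading or trailing separators leave behind (same result for this input):

```python
import re

s = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
parts = [p for p in re.split(r'\s+', s) if p]
print(parts)  # ['1', '2', '3', '4']
```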
You were almost there, I would just use string.strip() to replace multiple \r and \n at the start and the end of the strings
String = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
splitString = String.split('\n')
replacedStrings = []
i=0
for oneString in splitString:
    s = oneString.strip()
    if s != '':
        print(s)
        replacedStrings.append(s)
print(replacedStrings)
The output will look like
1
2
3
4
['1', '2', '3', '4']
For "jkahdjkah \r\r\t\r\nA: B\r\r\nA : B\r\r\r\r\n\r\n\r4\n\r", the output will be ['jkahdjkah', 'A: B', 'A : B', '4']
I have found one more method that seems to work fine; it might not be as optimised as the other answers, but it's just another way:
import re
splitString = []
String = "jhgdf\r\r\t\r\nA : B\r\r\nA : B\r\r\r\r\n\r\n\rA: B\n\r"
splitString = re.compile('[\r\t\n]+').split(String)
if "" in splitString:
splitString.remove("")
print(splitString)
I added it here so that people going through the same trouble as me might want to consider this approach too.
Following is the Output that I have got after using the above code:
['jhgdf', 'A : B', 'A : B', 'A: B']

How to parse a CSV with commas between parentheses and missing values

I tried using pyparsing to parse a CSV with:
Commas between parentheses (or brackets, etc.): "a(1,2),b" should return the list ["a(1,2)","b"]
Missing values: "a,b,,c," should return the list ['a','b','','c','']
I worked out a solution, but it seems "dirty". Mainly, the Optional sits inside only one of the possible atomics. I think the Optional should be independent of the atomics; that is, I feel it should go somewhere else, for example in the delimitedList optional arguments, but in my trial and error this was the only place that worked and made sense. It could be in any of the possible atomics, so I chose the first.
Also, I don't fully understand what originalTextFor is doing, but if I remove it, it stops working.
Working example:
import pyparsing as pp
# Function that parses a line of columns separated by commas and returns a list of the columns
def fromLineToRow(line):
    sqbrackets_col = pp.Word(pp.printables, excludeChars="[],") | pp.nestedExpr(opener="[", closer="]")  # matches "a[1,2]"
    parens_col = pp.Word(pp.printables, excludeChars="(),") | pp.nestedExpr(opener="(", closer=")")  # matches "a(1,2)"
    # In the following line:
    # * The "^" means "choose the longest option"
    # * The "pp.Optional" can be in any of the expressions separated by "^". I put it only on the first. It's used for when there are missing values
    atomic = pp.originalTextFor(pp.Optional(pp.OneOrMore(parens_col))) ^ pp.originalTextFor(pp.OneOrMore(sqbrackets_col))
    grammar = pp.delimitedList(atomic)
    row = grammar.parseString(line).asList()
    return row
file_str = \
"""YEAR,a(2,3),b[3,4]
1960,2.8,3
1961,4,
1962,,1
1963,1.27,3"""
for line in file_str.splitlines():
    row = fromLineToRow(line)
    print(row)
Prints:
['YEAR', 'a(2,3)', 'b[3,4]']
['1960', '2.8', '3']
['1961', '4', '']
['1962', '', '1']
['1963', '1.27', '3']
Is this the right way to do this? Is there a "cleaner" way to use the Optional inside the first atomic?
Working inside-out, I get this:
# chars not in ()'s or []'s - also disallow ','
non_grouped = pp.Word(pp.printables, excludeChars="[](),")
# grouped expressions in ()'s or []'s
grouped = pp.nestedExpr(opener="[",closer="]") | pp.nestedExpr(opener="(",closer=")")
# use OneOrMore to allow non_grouped and grouped together
atomic = pp.originalTextFor(pp.OneOrMore(non_grouped | grouped))
# or based on your examples, you *could* tighten this up to:
# atomic = pp.originalTextFor(non_grouped + pp.Optional(grouped))
originalTextFor recombines the original input text within the leading and trailing boundaries of the matched expressions, and returns a single string. If you leave this out, then you will get all the sub-expressions in a nested list of strings, like ['a',['2,3']]. You could rejoin them with repeated calls to ''.join, but that would collapse out whitespace (or use ' '.join, but that has the opposite problem of potentially introducing whitespace).
To optionalize the elements of the list, just say so in the definition of the delimited list:
grammar = pp.delimitedList(pp.Optional(atomic, default=''))
Be sure to add the default value, else the empty slots will just get dropped.
With these changes I get:
['YEAR', 'a(2,3)', 'b[3,4]']
['1960', '2.8', '3']
['1961', '4', '']
['1962', '', '1']
['1963', '1.27', '3']
What you can do is using regex re, for instance:
>>> import re
>>> line1 = "a(1,2),b"
>>> line2 = "a,b,,c,"
>>> re.split(r',\s*(?![^()]*\))', line1)
['a(1,2)', 'b']
>>> re.split(r',\s*(?![^()]*\))', line2)
['a', 'b', '', 'c', '']
import re

with open('44289614.csv') as f:
    for line in map(str.strip, f):
        l = re.split(r',\s*(?![^()[]]*[\)\]])', line)
        print(len(l), l)
Output:
3 ['YEAR', 'a(2,3)', 'b[3,4]']
3 ['1960', '2.8', '3']
3 ['1961', '4', '']
3 ['1962', '', '1']
3 ['1963', '1.27', '3']
Modified from this answer.
I also like this answer, which suggests modifying the input slightly and using quotechar of the csv module.
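A sketch of that csv-module idea: pre-quote any grouped field, then let csv do the splitting. The quoting regex is my own assumption and only handles flat, non-nested (...) or [...] groups:

```python
import csv
import re

def split_row(line):
    # quote any field containing a (...) or [...] group so the csv module
    # treats the commas inside it as data rather than separators
    quoted = re.sub(r'[^,]*[(\[][^)\]]*[)\]]',
                    lambda m: '"' + m.group(0) + '"', line)
    return next(csv.reader([quoted]))

print(split_row("YEAR,a(2,3),b[3,4]"))  # ['YEAR', 'a(2,3)', 'b[3,4]']
print(split_row("1962,,1"))             # ['1962', '', '1']
```

Missing values come back as empty strings automatically, since the csv module preserves empty fields.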

Extract fields from the string in python

I have text, line by line, which contains many field names and their values separated by ':'. If a line does not have a value for some field, then that field does not appear in that line.
for example
First line:
A:30 B: 40 TS:1/1/1990 22:22:22
Second line
A:30 TS:1/1/1990 22:22:22
third line
A:30 B: 40
But it is confirmed that at most 3 fields are possible in a single line, and their names will be A, B, TS.
While writing a Python script for this, I am facing the issues below:
1) I have to extract from each line which fields exist and what their values are.
2) The value of the TS field also contains the separator ' ' (SPACE), so I am unable to retrieve the full value of TS (1/1/1990 22:22:22).
The output should be extracted like this:
First LIne:
A=30
B=40
TS=1/1/1990 22:22:22
Second Line:
A=30
TS=1/1/1990 22:22:22
Third Line
A=30
B=40
Please help me in solving this issue.
import re
a = ["A:30 B: 40 TS:1/1/1990 22:22:22", "A:30 TS:1/1/1990 22:22:22", "A:30 B: 40"]
regex = re.compile(r"^\s*(?:(A)\s*:\s*(\d+))?\s*(?:(B)\s*:\s*(\d+))?\s*(?:(TS)\s*:\s*(.*))?$")
for item in a:
    matches = regex.search(item).groups()
    print({k: v for k, v in zip(matches[::2], matches[1::2]) if k})
will output
{'A': '30', 'B': '40', 'TS': '1/1/1990 22:22:22'}
{'A': '30', 'TS': '1/1/1990 22:22:22'}
{'A': '30', 'B': '40'}
Explanation of the regex:
^\s* # match start of string, optional whitespace
(?: # match the following (optionally, see below)
(A) # identifier A --> backreference 1
\s*:\s* # optional whitespace, :, optional whitespace
(\d+) # any number --> backreference 2
)? # end of optional group
\s* # optional whitespace
(?:(B)\s*:\s*(\d+))?\s* # same with identifier B and number --> backrefs 3 and 4
(?:(TS)\s*:\s*(.*))? # same with id. TS and anything that follows --> 5 and 6
$ # end of string
You could use regular expressions. Something like this would work if the order is assumed to be the same every time; otherwise you would have to match each part individually when you're unsure of the order.
import re
def parseInput(s):
    m = re.match(r"A:\s*(\d+)\s*B:\s*(\d+)\s*TS:(.+)", s)
    return {"A": m.group(1), "B": m.group(2), "TS": m.group(3)}

print(parseInput("A:30 B: 40 TS:1/1/1990 22:22:22"))
This prints out {'A': '30', 'B': '40', 'TS': '1/1/1990 22:22:22'}, which is just a dictionary containing the values.
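If the field order were not guaranteed, a hedged alternative (assuming the field names are exactly A, B and TS, and that " X:" never occurs inside a value) is to match each name individually and take everything up to the next field name:

```python
import re

def parse_line(line):
    # each value runs until the next known field name or the end of the line
    pattern = r'(A|B|TS)\s*:\s*(.*?)(?=\s+(?:A|B|TS)\s*:|$)'
    return dict(re.findall(pattern, line))

print(parse_line("A:30 TS:1/1/1990 22:22:22"))  # {'A': '30', 'TS': '1/1/1990 22:22:22'}
```

This returns only the fields actually present in the line, which matches the requirement that missing fields simply don't appear.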
P.S. You should accept some answers and familiarize yourself with the etiquette of the site, and people will be more willing to help you out.
