How to parse a CSV with commas between parentheses and missing values - python

I tried using pyparsing to parse a CSV with:
Commas between parentheses (or brackets, etc.): "a(1,2),b" should return the list ["a(1,2)","b"]
Missing values: "a,b,,c," should return the list ['a','b','','c','']
I worked out a solution, but it seems "dirty". Mainly, the Optional sits inside only one of the possible atomics. I think the Optional should be independent of the atomics; that is, I feel it belongs somewhere else, for example in the delimitedList optional arguments, but in my trial and error that was the only place that worked and made sense. It could go in any of the possible atomics, so I chose the first.
Also, I don't fully understand what originalTextFor is doing, but if I remove it, the code stops working.
Working example:
import pyparsing as pp
# Function that parses a line of columns separated by commas and returns a list of the columns
def fromLineToRow(line):
    sqbrackets_col = pp.Word(pp.printables, excludeChars="[],") | pp.nestedExpr(opener="[",closer="]") # matches "a[1,2]"
    parens_col = pp.Word(pp.printables, excludeChars="(),") | pp.nestedExpr(opener="(",closer=")") # matches "a(1,2)"
    # In the following line:
    # * The "^" means "choose the longest option"
    # * The "pp.Optional" can be in any of the expressions separated by "^". I put it only on the first. It's used for when there are missing values
    atomic = pp.originalTextFor(pp.Optional(pp.OneOrMore(parens_col))) ^ pp.originalTextFor(pp.OneOrMore(sqbrackets_col))
    grammar = pp.delimitedList(atomic)
    row = grammar.parseString(line).asList()
    return row
file_str = \
"""YEAR,a(2,3),b[3,4]
1960,2.8,3
1961,4,
1962,,1
1963,1.27,3"""
for line in file_str.splitlines():
    row = fromLineToRow(line)
    print(row)
Prints:
['YEAR', 'a(2,3)', 'b[3,4]']
['1960', '2.8', '3']
['1961', '4', '']
['1962', '', '1']
['1963', '1.27', '3']
Is this the right way to do this? Is there a "cleaner" way to use the Optional inside the first atomic?

Working inside-out, I get this:
# chars not in ()'s or []'s - also disallow ','
non_grouped = pp.Word(pp.printables, excludeChars="[](),")
# grouped expressions in ()'s or []'s
grouped = pp.nestedExpr(opener="[",closer="]") | pp.nestedExpr(opener="(",closer=")")
# use OneOrMore to allow non_grouped and grouped together
atomic = pp.originalTextFor(pp.OneOrMore(non_grouped | grouped))
# or based on your examples, you *could* tighten this up to:
# atomic = pp.originalTextFor(non_grouped + pp.Optional(grouped))
originalTextFor recombines the original input text within the leading and trailing boundaries of the matched expressions, and returns a single string. If you leave this out, then you will get all the sub-expressions in a nested list of strings, like ['a',['2,3']]. You could rejoin them with repeated calls to ''.join, but that would collapse out whitespace (or use ' '.join, but that has the opposite problem of potentially introducing whitespace).
To optionalize the elements of the list, just say so in the definition of the delimited list:
grammar = pp.delimitedList(pp.Optional(atomic, default=''))
Be sure to add the default value, else the empty slots will just get dropped.
With these changes I get:
['YEAR', 'a(2,3)', 'b[3,4]']
['1960', '2.8', '3']
['1961', '4', '']
['1962', '', '1']
['1963', '1.27', '3']

You can use a regex with the re module, for instance:
>>> import re
>>> line1 = "a(1,2),b"
>>> line2 = "a,b,,c,"
>>> re.split(r',\s*(?![^()]*\))', line1)
['a(1,2)', 'b']
>>> re.split(r',\s*(?![^()]*\))', line2)
['a', 'b', '', 'c', '']
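
The lookahead above only checks whether a closing parenthesis comes before any other parenthesis, so a nested field such as f(1,g(2)) would still be split at its inner comma. As a hedged alternative (not part of the answer above), here is a minimal sketch that tracks bracket depth instead of using a regex. The helper name split_top_level is hypothetical, and the sketch assumes brackets are balanced and does not check that opener and closer types match.

```python
def split_top_level(line, sep=",", pairs=("()", "[]")):
    """Split line on sep, ignoring separators nested inside brackets."""
    openers = {p[0] for p in pairs}
    closers = {p[1] for p in pairs}
    fields, current, depth = [], [], 0
    for ch in line:
        if ch in openers:
            depth += 1
        elif ch in closers and depth > 0:
            depth -= 1
        if ch == sep and depth == 0:
            # top-level separator: close the current field (possibly empty)
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))  # last field, empty if line ends with sep
    return fields


print(split_top_level("a(1,2),b"))
print(split_top_level("a,b,,c,"))
print(split_top_level("f(1,g(2)),x"))
```

Missing values fall out naturally here because an empty field is appended whenever two separators are adjacent or the line ends with one.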

import re
with open('44289614.csv') as f:
    for line in map(str.strip, f):
        # split on commas that are not followed by a closing ) or ] before any other bracket
        l = re.split(r',\s*(?![^()\[\]]*[)\]])', line)
        print(len(l), l)
Output:
3 ['YEAR', 'a(2,3)', 'b[3,4]']
3 ['1960', '2.8', '3']
3 ['1961', '4', '']
3 ['1962', '', '1']
3 ['1963', '1.27', '3']
Modified from this answer.
I also like this answer, which suggests modifying the input slightly and using quotechar of the csv module.

Related

Replace escape sequence characters in a string in Python 3.x

I have used the following code to replace the escaped characters in a string. I first split by \n and then used re.sub(), but I still don't know what I am missing; the code is not working as expected. I am a newbie at Python, so please don't judge if there are optimisation problems. Here is my code:
#import sys
import re
String = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
splitString = String.split('\n')
replacedStrings = []
i=0
for oneString in splitString:
    #oneString = oneString.replace(r'^(.?)*(\\[^n])+(.?)*$', "")
    oneString = re.sub(r'^(.?)*(\\[^n])+(.?)*$', "", oneString)
    print(oneString)
    replacedStrings.insert(i, oneString)
    i += 1
print(replacedStrings)
My aim here is: I need the values only (without the escaped sequences) as the split strings.
My approach here is:
1. I split the string by \n, which gives me a list of separate strings.
2. Then I check each string against the regex; if the regex matches, the matched substring is replaced with "".
3. Then I push those strings to a collection, expecting it to store the replaced strings in the new list.
So basically, I am through with 1 and 2, but currently I am stuck at 3. Following is my output:
1
2
3
4
['1\r\r\t\r', '2\r\r', '3\r\r\r\r', '\r', '\r4', '\r']
You might find it easier to use re.findall here with the simple pattern \S+:
import re
text = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
output = re.findall(r'\S+', text)
print(output)
['1', '2', '3', '4']
This approach will isolate and match any islands of one or more non whitespace characters.
Edit:
Based on your new input data, we can try matching on the pattern [^\r\n\t]+:
text = "jkahdjkah \r\r\t\r\nA: B\r\r\nA : B\r\r\r\r\n\r\n\r4\n\r"
output = re.findall(r'[^\r\n\t]+', text)
print(output)
['jkahdjkah ', 'A: B', 'A : B', '4']
re.sub isn't really the right tool for the job here. What would be on the table is split or re.findall, because you want to repeatedly match/isolate a certain part of your text. re.sub is useful for taking a string and transforming it to something else. It can be used to extract text, but does not work so well for multiple matches.
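
One likely reason the original re.sub call changed nothing, shown as a small sketch: in a normal string literal, "\r" is a single carriage-return character, so the text contains no literal backslash for a pattern like \\[^n] to match. The variable names below are illustrative only.

```python
import re

text = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"

# "\r" is one character (carriage return), not a backslash followed by "r",
# so a pattern that looks for a literal backslash finds nothing in text.
has_backslash = re.search(r"\\[^n]", text) is not None
print(len("\r"), has_backslash)

# Matching runs of anything except the control characters does work.
values = re.findall(r"[^\r\n\t]+", text)
print(values)
```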
You were almost there. I would just use str.strip() to remove the \r, \t and \n characters at the start and end of each string:
String = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
splitString = String.split('\n')
replacedStrings = []
for oneString in splitString:
    s = oneString.strip()
    if s != '':
        print(s)
        replacedStrings.append(s)
print(replacedStrings)
The output will look like
1
2
3
4
['1', '2', '3', '4']
For "jkahdjkah \r\r\t\r\nA: B\r\r\nA : B\r\r\r\r\n\r\n\r4\n\r", the output will be ['jkahdjkah', 'A: B', 'A : B', '4']
I have found one more method that seems to work fine. It might not be as optimised as the other answers, but it's just another way:
import re
splitString = []
String = "jhgdf\r\r\t\r\nA : B\r\r\nA : B\r\r\r\r\n\r\n\rA: B\n\r"
splitString = re.compile('[\r\t\n]+').split(String)
if "" in splitString:
    splitString.remove("")
print(splitString)
I added it here so that people going through the same trouble as me can consider this approach too.
Following is the Output that I have got after using the above code:
['jhgdf', 'A : B', 'A : B', 'A: B']

Accessing delimiter in Python regular expressions split()

In the python re module, I'm making use of re.split()
import re
string = '$aText$bFor$cStack$dOverflow'
parts = re.split(r'\$\w', string)
assert parts == ['Text', 'For', 'Stack', 'Overflow']
My question: is it possible to return the instance of the delimiter at the same time as the list of parts? I'd like to know if the delimiter was $c, $d, etc. preceding the various parts.
I suppose I could do a findall() call first, but that would mean manually calling positions in a list, which would introduce bugs. That also doesn't seem very pythonic.
If you put the pattern in a capture group, the delimiters appear in the results:
>>> string = '$aText$bFor$cStack$dOverflow'
>>> re.split(r'(\$\w)', string)
['', '$a', 'Text', '$b', 'For', '$c', 'Stack', '$d', 'Overflow']
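
To pair each delimiter with the part that follows it, the alternating list can be sliced and zipped. This is a minimal sketch that assumes the string starts with a delimiter, as in the example, so index 0 holds the empty leading part; the names tokens and pairs are illustrative.

```python
import re

string = "$aText$bFor$cStack$dOverflow"
tokens = re.split(r"(\$\w)", string)

# tokens alternates: leading text, delimiter, part, delimiter, part, ...
# Odd indexes hold delimiters, the following even indexes hold their parts.
pairs = list(zip(tokens[1::2], tokens[2::2]))
print(pairs)
```

Keeping delimiter and part together in one tuple avoids indexing two separate lists by position.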

How to convert a string containing a list of values that are not comma-separated to a list?

I'm new to Python and am wondering what is the most elegant way to convert a string of the form "[1 2 3]" to a list? If the string contains a comma-separated list of values, then the solution is simple:
str = "['x', 'y', 'z']"
arr = eval(str)
print(isinstance(arr, list)) # True
However, this solution doesn't work if the list in the string is not comma separated, e.g. "['x' 'y' 'z']".
Is there a common way to solve this without having to manually parse the string? The solution should not be type dependent, e.g. both "[1 2 3]" and "['multiple words 1' 'multiple words 2']" should be converted normally.
In this case shlex might be a solution.
import shlex
s = "['x' 'y' 'z']"
# First get rid of the opening and closing brackets
s = s.strip('[]')
# Split the string using shell-like syntax
lst = shlex.split(s)
print(type(lst), lst)
# Prints: <class 'list'> ['x', 'y', 'z']
But you'll have to check if it fulfills your requirements.
import re
str = "[1 2 a 'multiple words 1' 'multiple words 2' 'x' 'y' 'z']"
print([''.join(x) for x in re.findall(r"'(.*?)'|(\S+)", re.sub(r'^\[(.*)\]', r'\1', str))])
# Prints: ['1', '2', 'a', 'multiple words 1', 'multiple words 2', 'x', 'y', 'z']
The first obvious step is to get rid of the [...] because they don't add anything useful to the results ...
Then it works because of the regex in findall: this will only match either anything between quotes or any sequence of non-spaces.
We don't want the quotes themselves (or do we? – it is not specified) so the regex grouping allows it to return just the inner parts.
Then we always get pairs of one element empty and one filled (('', '1'), ('', '2') and so on) so we need an additional cleaning loop.
This code cannot see the difference between [1 2 3] and ['1' '2' '3'], but that's no problem as such a variant is not specified in the question.
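
The steps described above can be seen one at a time in the following sketch, which prints the intermediate tuples that findall returns for a pattern with two groups. The shorter input string and the variable names are illustrative only.

```python
import re

s = "[1 2 'multiple words 1' 'x']"

# Step 1: strip the surrounding brackets.
inner = re.sub(r"^\[(.*)\]$", r"\1", s)

# Step 2: with two groups, findall returns one tuple per match,
# and only the group that matched is non-empty.
pairs = re.findall(r"'(.*?)'|(\S+)", inner)
print(pairs)

# Step 3: keep whichever element of each pair is filled.
values = [quoted or bare for quoted, bare in pairs]
print(values)
```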

How to put a condition in regex in python?

I have a regex like --
query = "(A((hh)|(hn)|(n))?)"
and an input inp = "Ahhwps edAn". I want to extract all the matched patterns along with the unmatched (remaining) text, preserving the order of the input.
The output should look like -- ['Ahh', 'wps ed', 'An'] or ['Ahh', 'w', 'p', 's', ' ', 'e', 'd', 'An'].
I searched online but found nothing.
How can I do this?
The re.split method may output captured submatches in the resulting array.
Capturing groups are those constructs that are formed with a pair of unescaped parentheses. Your pattern abounds in redundant capturing groups, and re.split will return all of them. You need to remove those unnecessary ones, and convert all capturing groups to non-capturing ones, and just keep the outer pair of parentheses to make the whole pattern a single capturing group.
Use
re.split(r'(A(?:hh|hn|n)?)', s)
Note that there may be empty elements in the output list. Just use list(filter(None, result)) to get rid of the empty values.
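
A short sketch of the difference, using the input from the question: the original pattern makes re.split emit every inner group (with None for groups that did not participate), while the single outer group returns only the full match between the unmatched pieces. The variable names are illustrative.

```python
import re

inp = "Ahhwps edAn"

# Original pattern: every inner capturing group is emitted too,
# with None for groups that did not take part in a match.
noisy = re.split(r"(A((hh)|(hn)|(n))?)", inp)

# Single outer capturing group: only the full delimiter is kept.
parts = re.split(r"(A(?:hh|hn|n)?)", inp)
cleaned = [p for p in parts if p]  # drop empty strings at the edges
print(cleaned)
```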
The match objects' span() method is really useful for what you're after.
import re
pat = re.compile("(A((hh)|(hn)|(n))?)")
inp = "Ahhwps edAn"
result=[]
i=k=0
for m in re.finditer(pat,inp):
    j,k=m.span()
    # text between the previous match and this one
    if i<j:
        result.append(inp[i:j])
    result.append(inp[j:k])
    i=k
# trailing text after the last match
if i<len(inp):
    result.append(inp[k:])
print(result)
Here's what the output looks like.
['Ahh', 'wps ed', 'An']
This technique handles any non-matching prefix and suffix text as well. If you use an inp value of "prefixAhhwps edAnsuffix", you'll get the output I think you'd want:
['prefix', 'Ahh', 'wps ed', 'An', 'suffix']
You can try this:
import re
import itertools
inp = "Ahhwps edAn"
# integer division (//) keeps the repeat count an int on Python 3
new_data = list(itertools.chain.from_iterable([re.findall(".{"+str(len(i)//2)+"}", i) for i in inp.split()]))
Output:
['Ahh', 'wps', 'ed', 'An']

creating Dictionary-object from string that looks like dictionaries

I have a string that looks similar to the following:
myString = "major: 11, minor: 31, name: A=1,B=1,C=1,P=1, severity: 0, comment: this is down"
I have tried this so far:
dict(elem.split(':') for elem in myString.split(','))
It works fine until it reaches the name element above, whose comma-separated parts (B=1, C=1, ...) contain no ':' to split() on.
I would like elements in that format to become a new dictionary, e.g.
myDic = {'major':'11', 'minor': '31', 'name':{'A':'1', 'B':'1', 'C':'1', 'P': '1'}, 'severity': '0', 'comment': 'this is down'}
If possible I would like to avoid complicated parsing as these turn out to be hard to maintain.
Also I do not know the name/amount of the keys or values in the string above. I just know the format. This is not a JSON-response, this is part of a text in a file and I have no control over the current format.
FYI, this is NOT the complete solution.
If this is the fixed structure of your input and the pattern stays constant in your source, you can distinguish the comma-separated tokens.
The difference between major: 11, and name: A=1,B=1,C=1,P=1, is that the comma separating top-level tokens is followed by a SPACE, while the commas inside the name value are not. So simply by adding a space to the separator in the second split call, you can split your string properly.
So, the code should be something like this:
dict(elem.split(':') for elem in myString.split(', '))
Pay attention to the second split call: the separator is a comma followed by a SPACE.
Regarding the JSON format, it needs more work, I guess. I have no idea right now.
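
One way this idea could be extended to build the nested dictionary, as a hedged sketch rather than a complete parser: it assumes that ", " only ever separates top-level fields, that every field contains ": ", and that a value containing '=' is always a list of key=value pairs. The names my_string and result are illustrative.

```python
my_string = "major: 11, minor: 31, name: A=1,B=1,C=1,P=1, severity: 0, comment: this is down"

result = {}
for elem in my_string.split(", "):
    # split only on the first ": " so values may contain colons
    key, value = elem.split(": ", 1)
    if "=" in value:
        # assumed format: comma-separated key=value pairs without spaces
        value = dict(pair.split("=", 1) for pair in value.split(","))
    result[key] = value

print(result)
```

A free-text value that itself contains ", " (for example a longer comment) would break the first split, so this only suits input that follows the format in the question.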
Here's another suggestion.
Why don't you transform it into dictionary notation?
E.g. in a first step, you replace everything between a ':' and (comma or end of input) that contains a '=' (and maybe no whitespace, I don't know) by wrapping it in braces and replacing '=' by ':'.
In a second step, wrap everything between a ':' and (comma or end of input) in ', removing leading and trailing whitespace.
Finally, you wrap it all in braces.
I still don't trust that syntax, though... maybe after a few thousand lines have been processed successfully...
At least, this parses the given example correctly...
import re
def parse(s):
    # verbose regex: key, colon, then either key=value pairs or any text up to the next comma
    rx = r"""(?x)
        (\w+) \s* : \s*
        (
            (?: \w+ = \w+,)*
            (?: \w+ = \w+)
            |
            (?: [^,]+)
        )
    """
    r = {}
    for key, val in re.findall(rx, s):
        if '=' in val:
            val = dict(x.split('=') for x in val.split(','))
        r[key] = val
    return r
myString = "major: 11, minor: 31, name: A=1,B=1,C=1,P=1, severity: 0, comment: this is down"
print(parse(myString))
# {'major': '11', 'minor': '31', 'name': {'A': '1', 'B': '1', 'C': '1', 'P': '1'}, 'severity': '0', 'comment': 'this is down'}
