How do I reverse regex substitution? - python

Going back to this example:
Having trouble dealing with similar characters to print different things using regex in Python
How would I reverse the regex substitution I did and print out the original text?
That is, if I have
text = "This is my first regex python example yahooa yahoouuee bbbiirdd"
as my original text, then its output would be:
re.sub text = "tookhookisook isook mookyook fookirooksooktook pookyooktookhookonook..."
And then I want that output to be converted back to the original text.
How do I do that?

Python strings are immutable. You haven't changed the original, only created a new string. Just keep a reference to the original.
Edit
By immutable, I mean that their actual value is frozen once created.
>>> s = "abc"
>>> s[0]
'a'
>>> s[1] = 'd'
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    s[1] = 'd'
TypeError: 'str' object does not support item assignment
>>>
In the example above I can have the variable s reference another object, but the string I assigned to it is constant. So when you do s.replace(), the result is a new string, and the original is unchanged.
>>> s.replace ('a', 'd')
'dbc'
>>> s
'abc'
>>>
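The same holds for re.sub. A minimal sketch, assuming a simplified stand-in pattern (append 'ook' to every consonant) rather than the asker's exact substitution; the names original and encoded are illustrative:

```python
import re

original = "This is my first regex python example"

# re.sub returns a new string; `original` itself is never modified.
# The pattern is a simplified stand-in: append 'ook' to each consonant.
encoded = re.sub(r"([b-df-hj-np-tv-z])", r"\1ook", original, flags=re.IGNORECASE)

print(encoded)
print(original)  # still the text we started with
```

As long as the name original stays bound to the first string, nothing needs to be reversed.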

It seems that this works:
import re
tu = ('This is my first regex python example '
'yahooa yahoouuee bbbiirdd',
'bbbiirdd',
'fookirooksooktook',
'crrsciencezxxxxxscienceokjjsciencq')
reg = re.compile(r'([bcdfghj-np-tv-z])(\1?)')
dereg = re.compile('science([^aeiou])|([^aeiou])ook')
def Frepl(ma):
    g1,g2 = ma.groups()
    if g2: return 'science' + g2
    else: return g1 + 'ook'
def Fderepl(ma):
    g = ma.group(2)
    if g: return g
    else: return 2*ma.group(1)
for strt in tu:
    resu = reg.sub(Frepl , strt)
    bakk = dereg.sub(Fderepl, resu)
    print ('----------------------------------\n'
           'strt = %s\n' 'resu == %s\n'
           'bakk == %s\n' 'bakk == start : %s'
           % (strt, resu, bakk, bakk==strt))
Edit
First, I updated the above code: I removed the re.I flag. It was capturing pairs like 'dD' as a repeated letter, so 'dD' was transformed to 'scienceD' and then back to 'DD'.
Second, I extended the code with a dictionary.
Instead of replacing a letter with letter+'ook', it replaces each letter with the value the dictionary gives for it.
For example, I chose to replace 'b' with 'BAR', 'c' with 'CORE', and so on. I uppercased the dictionary values to make the result easier to read; they could be anything else.
The program takes letter case into account. I put only 'T', 'Y', 'X' in the dictionary as uppercase keys, just as a test.
import re
d = {'b':'BAR','c':'CORE','d':'DEAD','f':'FAN',
'g':'GO','h':'HHH','j':'JIU','k':'KOAN',
'l':'LOW','m':'MY','n':'NERD','p':'PI',
'q':'QIM','r':'ROAR','s':'SING','t':'TIP',
'v':'VIEW','w':'WAVE','x':'XOR',
'y':'YEAR','z':'ZOO',
'T':'tears','Y':'yearling','X':'xylophone'}
ded = dict((v,k) for k,v in d.iteritems())
print ded
tu = ('This is my first regex python example '
'Yahooa yahoouuee bbbiirdd',
'bbbiirdd',
'fookirooksooktook',
'crrsciencezxxxxxXscienceokjjsciencq')
reg = re.compile(r'([bcdfghj-np-tv-zBCDFGHJ-NP-TV-Z])(\1?)')
othergr = '|'.join(ded.keys())
dereg = re.compile('science([^aeiouAEIOU])|(%s)' % othergr)
def Frepl(ma, d=d):
    g1,g2 = ma.groups()
    if g2: return 'science' + g2
    else: return d[g1]
def Fderepl(ma,ded=ded):
    g = ma.group(2)
    if g: return ded[g]
    else: return 2*ma.group(1)
for strt in tu:
    resu = reg.sub(Frepl , strt)
    bakk = dereg.sub(Fderepl, resu)
    print ('----------------------------------\n'
           'strt = %s\n' 'resu == %s\n'
           'bakk == %s\n' 'bakk == start : %s'
           % (strt, resu, bakk, bakk==strt))
result
----------------------------------
strt = This is my first regex python example Yahooa yahoouuee bbbiirdd
resu == tearsHHHiSING iSING MYYEAR FANiROARSINGTIP ROAReGOeXOR PIYEARTIPHHHoNERD eXORaMYPILOWe yearlingaHHHooa YEARaHHHoouuee sciencebBARiiROARscienced
bakk == This is my first regex python example Yahooa yahoouuee bbbiirdd
bakk == start : True
----------------------------------
strt = bbbiirdd
resu == sciencebBARiiROARscienced
bakk == bbbiirdd
bakk == start : True
----------------------------------
strt = fookirooksooktook
resu == FANooKOANiROARooKOANSINGooKOANTIPooKOAN
bakk == fookirooksooktook
bakk == start : True
----------------------------------
strt = crrsciencezxxxxxXscienceokjjsciencq
resu == COREsciencerSINGCOREieNERDCOREeZOOsciencexsciencexXORxylophoneSINGCOREieNERDCOREeoKOANsciencejSINGCOREieNERDCOREQIM
bakk == crrsciencezxxxxxXscienceokjjsciencq
bakk == start : True

You can't "convert" a regex substitution backwards in Python or any other regex implementation.
That is simply because the substitution is a one-way street that returns a new string, and there is no magical reverse function.
Here is an illustration using str.replace():
original_string = 'abc'
newstring = original_string.replace('a','b')
# newstring is now 'bbc'
Converting newstring back into 'abc' isn't just a matter of substituting 'a' for 'b'. You can't derive a "reverse" regex from an arbitrary regex. If we replaced 'b' with 'a' in this example, the string would be 'aac', not 'abc'.
The regex functions work the same way as str.replace: they return a new string. They don't return an object that knows the exact state of every regex replacement.
You have two options to do whatever it is you want to do:
1- Create a custom class that represents a string and tracks an (infinite?) number of regular expression operations, allowing you to create a diff between each state.
2- Do what everyone else does, and what many people here suggest: simply stash the original string (or a copy of it) off to the side.
(This is an effort to simplify the answer from @StoryTeller.)
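A rough sketch of the first option, assuming a hypothetical TrackedString class (not part of any library) that simply keeps every earlier state in a list instead of computing diffs:

```python
import re

class TrackedString:
    """Hypothetical helper: remembers every earlier version of a string."""

    def __init__(self, text):
        self.history = [text]

    @property
    def current(self):
        return self.history[-1]

    def sub(self, pattern, repl):
        # Apply a regex substitution and store the result as a new state.
        self.history.append(re.sub(pattern, repl, self.current))
        return self.current

    def undo(self):
        # Drop the latest state, but never the original.
        if len(self.history) > 1:
            self.history.pop()
        return self.current

s = TrackedString("abc")
s.sub("a", "b")   # current is now 'bbc'
s.undo()          # current is 'abc' again
```

This is still just "stash the original" with extra bookkeeping, which is why the second option is usually enough.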

Related

Is there a regex script that can be used to extract text by defining a start and an end in a text file [duplicate]

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.
I only know the few characters directly before (AAA) and directly after (ZZZ) the part I am interested in (1234).
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
How to do the same thing in Python?
Using regular expressions - documentation for further reference
import re
text = 'gfgfdAAA1234ZZZuijjk'
m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)
# found: 1234
or:
import re
text = 'gfgfdAAA1234ZZZuijjk'
try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling
# found: 1234
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'
Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.
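Note that str.find returns -1 when a marker is missing, so the slice above would silently return the wrong text in that case. A small sketch that guards against it; the function name between and the choice to return None are assumptions made for illustration:

```python
def between(text, left, right):
    """Return the text between the first `left` and the next `right`,
    or None if either marker is missing."""
    i = text.find(left)
    if i == -1:
        return None
    i += len(left)
    # Only look for the closing marker after the opening one.
    j = text.find(right, i)
    if j == -1:
        return None
    return text[i:j]

print(between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))
print(between('no markers here', 'AAA', 'ZZZ'))
```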
regular expression
import re
re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)
The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text
string methods
your_text.partition("AAA")[2].partition("ZZZ")[0]
The above will return an empty string if either "AAA" or "ZZZ" don't exist in your_text.
PS Python Challenge?
Surprised that nobody has mentioned this which is my quick version for one-off scripts:
>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'
You can do this in just one line of code:
>>> import re
>>> re.findall(r'\d{1,5}','gfgfdAAA1234ZZZuijjk')
['1234']
The result is a list. Note that this pattern matches every run of up to five digits and ignores the AAA and ZZZ markers.
import re
print re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1)
You can use re module for that:
>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)
In Python, extracting a substring from a string can be done using the findall method in the regular expression (re) module.
>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print ss
['1234']
text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'
print(text[text.index(left)+len(left):text.index(right)])
Gives
string
>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
You could do the same with re.sub function using the same regex.
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'
In basic sed, capturing groups are written \(..\), but in Python they are written (..).
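One caveat with the re.sub approach, shown as a short sketch: when the pattern does not match, re.sub returns the input unchanged rather than an empty string, so a missing marker is easy to overlook.

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

with_markers = re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk')
without_markers = re.sub(pattern, r'\1', 'no markers here')

print(with_markers)      # the captured group only
print(without_markers)   # the whole input, because nothing was substituted
```

If you need to tell "no match" apart from a real result, re.search with an explicit check on the match object is the safer choice.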
You can find the first occurrence of a substring with this function (by character index). You can also find what comes after the substring.
def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset == None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1
# Example:
Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"
print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")
print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")
print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))
# Your answer:
Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"
AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforText2 = FindSubString(Text, subText2, 0)
print("\nYour answer:\n%s" %(Text[AfterText1:BeforText2]))
Using PyParsing
import pyparsing as pp
word = pp.Word(pp.alphanums)
s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)
which yields:
[['1234']]
One liner with Python 3.8 if text is guaranteed to contain the substring:
text[text.find(start:='AAA')+len(start):text.find('ZZZ')]
In case somebody else has to do what I did: I had to extract everything inside parentheses in a line. For example, given a line like 'US president (Barack Obama) met with ...', to get only 'Barack Obama' the solution is:
import re
regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'
That is, you need to escape the parentheses with a backslash (\). This is more a question about regular expressions than about Python, though.
Also, in some cases you may see an 'r' prefix before the regex definition. Without the r prefix, you need to escape the backslashes themselves, as in C. Here is more discussion on that.
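To illustrate the r prefix, here is a short sketch showing that a raw string and a plain string with doubled backslashes produce the same pattern (the variable names raw and escaped are illustrative):

```python
import re

line = 'US president (Barack Obama) met with ...'

raw = r'\((.*?)\)'        # raw string: backslashes reach the regex engine as written
escaped = '\\((.*?)\\)'   # plain string: each backslash has to be doubled

# Both spellings produce exactly the same pattern text.
assert raw == escaped

name = re.search(raw, line).group(1)
print(name)
```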
Also, you can find all combinations with the function below:
s = 'Part 1. Part 2. Part 3 then more text'
def find_all_places(text,word):
    word_places = []
    i=0
    while True:
        word_place = text.find(word,i)
        if word_place<0:
            break
        word_places.append(word_place)
        # continue searching just past this occurrence
        i=word_place+len(word)
    return word_places
def find_all_combination(text,start,end):
    start_places = find_all_places(text,start)
    end_places = find_all_places(text,end)
    combination_list = []
    for start_place in start_places:
        for end_place in end_places:
            if start_place>=end_place:
                continue
            combination_list.append(text[start_place:end_place])
    return combination_list
find_all_combination(s,"Part","Part")
result:
['Part 1. ', 'Part 1. Part 2. ', 'Part 2. ']
In case you want to look for multiple occurrences:
content ="Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
strings = []
for c in content.split('Prefix_'):
    spos = c.find('_Suffix')
    if spos!=-1:
        strings.append( c[:spos])
print( strings )
Or, more concisely:
strings = [ c[:c.find('_Suffix')] for c in content.split('Prefix_') if c.find('_Suffix')!=-1 ]
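For comparison, a sketch of the same multiple-occurrence extraction with re.findall, which returns the captured group for every non-overlapping match; the non-greedy (.*?) stops at the nearest _Suffix:

```python
import re

content = "Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"

# With a single capturing group, findall returns a list of the group's text.
strings = re.findall(r"Prefix_(.*?)_Suffix", content)
print(strings)
```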
Here's a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.
def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]
Another way of doing it is using lists (supposing the substring you are looking for is made of numbers, only) :
string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []
for char in string:
    if char in numbersList: output.append(char)
print(f"output: {''.join(output)}")
### output: 1234
Typescript. Gets string in between two other strings.
Searches shortest string between prefixes and postfixes
prefixes - string / array of strings / null (means search from the start).
postfixes - string / array of strings / null (means search until the end).
public getStringInBetween(str: string, prefixes: string | string[] | null,
                          postfixes: string | string[] | null): string {
    if (typeof prefixes === 'string') {
        prefixes = [prefixes];
    }
    if (typeof postfixes === 'string') {
        postfixes = [postfixes];
    }
    if (!str || str.length < 1) {
        throw new Error(str + ' should contain ' + prefixes);
    }
    let start = prefixes === null ? { pos: 0, sub: '' } : this.indexOf(str, prefixes);
    const end = postfixes === null ? { pos: str.length, sub: '' } : this.indexOf(str, postfixes, start.pos + start.sub.length);
    let value = str.substring(start.pos + start.sub.length, end.pos);
    if (!value || value.length < 1) {
        throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
    }
    while (true) {
        try {
            start = this.indexOf(value, prefixes);
        } catch (e) {
            break;
        }
        value = value.substring(start.pos + start.sub.length);
        if (!value || value.length < 1) {
            throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
        }
    }
    return value;
}
a simple approach could be the following:
string_to_search_in = 'could be anything'
start = string_to_search_in.find(str("sub string u want to identify"))
length = len("sub string u want to identify")
First_part_removed = string_to_search_in[start:]
end_coord = length
Extracted_substring=First_part_removed[:end_coord]
One-liners that return another string if there was no match.
Edit: the improved version uses the next function; replace "not-found" with something else if needed:
import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )
My other method is less optimal: it runs a second regex search as a fallback. I still haven't found a shorter way:
import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )

Pattern search by NOT using Regex algorithm and code in python

Today I had an interview at AMD and was asked a question that I didn't know how to solve without regex. Here is the question:
Find all occurrences of the word "Hello" in a text, where a single extra character may appear between the letters of hello, e.g. search for all instances of "h.ello", "hell o", "he,llo", or "hel!lo".
Since you also tagged this question algorithm, I'm just going to show the general approach that I would take when looking at this question, without including any language tricks from python.
1) I would want to split the string into a list of words
2) Loop through each string in the resulting list, checking if the string matches 'hello' without the character at the current index (or if it simply matches 'hello')
3) If a match is found, return it.
Here is a simple approach in python:
s = "h.ello hello h!ello hell.o none of these"
words = s.split()
def drop_one(s, match):
    if s == match:
        return True # WARNING: Early Return
    for i in range(len(s) - 1):
        if s[:i] + s[i+1:] == match:
            return True
    return False
matches = [x for x in words if drop_one(x, "hello")]
print(matches)
The output of this snippet:
['h.ello', 'hello', 'h!ello', 'hell.o']
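Splitting on whitespace cannot find "hell o" from the question, because the space itself is the extra character. A sketch of a scanning variant that avoids the split, assuming the task allows exactly one extra character in total, placed between two letters; the function name find_with_one_gap is hypothetical:

```python
def find_with_one_gap(text, word="hello"):
    """Return (index, substring) pairs where `word` occurs, optionally with
    exactly one extra character between two of its letters."""
    hits = []
    n = len(word)
    for i in range(len(text)):
        # Exact occurrence starting at i.
        if text[i:i + n] == word:
            hits.append((i, text[i:i + n]))
            continue
        window = text[i:i + n + 1]
        if len(window) < n + 1:
            continue
        # Try dropping each interior character (never the first or the last).
        for k in range(1, n):
            if window[:k] + window[k + 1:] == word:
                hits.append((i, window))
                break
    return hits

print(find_with_one_gap("h.ello hell o he,llo hello"))
```

This runs in O(len(text) * len(word)^2) time in the worst case, which is fine for an interview-sized input.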
This should work. I've tried to make it generic. You might have to make some minor adjustments. Let me know if you don't understand any part.
def checkValidity(tlist):
    tmpVar = ''
    for i in range(len(tlist)):
        if tlist[i] in set("hello"):
            tmpVar += tlist[i]
    return(tmpVar == 'hello')
mStr = "he.llo hehellbo hellox hell.o hello helloxy abhell.oyz"
mWord = "hello"
mlen = len(mStr)
wordLen = len(mWord)+1
i=0
print ("given str = ", mStr)
while i<mlen:
    tmpList = []
    if mStr[i] == 'h':
        for j in range(wordLen):
            tmpList.append(mStr[i+j])
        validFlag = checkValidity(tmpList)
        if validFlag:
            print("Match starting at index: ",i, ':', mStr[i:i+wordLen])
            i += wordLen
        else:
            i += 1
    else:
        i += 1

Generating regex string to be used in re.match()

I am trying to build a string to be used as a regex.
In the following code:
_pattern is a pattern like abba, and I am trying to check whether _string follows the _pattern (e.g. catdogdogcat).
rxp is the regular expression that I am trying to create to match against _string (e.g. for the example above it will be (.+)(.+)\\2\\1 ). It is generated successfully, but re.match() returns None.
I want to understand why it is not working and how to correct it.
import re
_pattern = "abba" #raw_input().strip()
_string = "catdogdogcat" #raw_input().strip()
hm = {}
rxp = ""
c = 1
for x in _pattern:
    if hm.has_key(x):
        rxp += hm[x]
        continue
    else:
        rxp += "(.+)"
        hm[x]="\\\\"+str(c)
        c+=1
print rxp
#print re.match(rxp,_string) -> (Tried) Not working
#print re.match(r'rxp', _string) -> (Tried) Not working
print re.match(r'%s' %rxp, _string) # (Tried) Not working
Output
(.+)(.+)\\2\\1
None
Expected Output
(.+)(.+)\\2\\1
<_sre.SRE_Match object at 0x000000000278FE88>
The thing is that your regex string variable has double \\ instead of a single one.
You can use
rxp.replace("\\\\", "\\")
in .match like this:
>>> print re.match(rxp.replace("\\\\", "\\"), _string)
<_sre.SRE_Match object at 0x10bf87c68>
>>> print re.match(rxp.replace("\\\\", "\\"), _string).groups()
('cat', 'dog')
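To see why the doubled backslash fails, here is a small sketch that builds both variants by hand (the names doubled and single are illustrative). In the doubled version the regex engine sees an escaped literal backslash followed by a digit, not a backreference:

```python
import re

doubled = "(.+)(.+)" + "\\\\" + "2" + "\\\\" + "1"   # what the original loop builds
single = "(.+)(.+)" + "\\" + "2" + "\\" + "1"        # one backslash per backreference

print(doubled)   # two backslash characters before each digit
print(single)    # one backslash character before each digit

no_match = re.match(doubled, "catdogdogcat")
groups = re.match(single, "catdogdogcat").groups()
print(no_match)
print(groups)
```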
EDIT:
You can also avoid getting double \\ like this:
import re
_pattern = "abba" #raw_input().strip()
_string = "catdogdogcat" #raw_input().strip()
hm = {}
rxp = ""
c = 1
for x in _pattern:
    if x in hm:
        rxp += hm[x]
        continue
    else:
        rxp += "(.+)"
        hm[x]="\\" + str(c)
        c+=1
print rxp
print re.match(rxp,_string)
You should use string formatting, and not hard-code rxp into the string:
print re.match(r'%s'%rxp, _string)

Python regular expression to extract optional number at the end of string

I'm trying to write a Python regular expression that can parse strings of the type "<name>(<number>)", where <number> is optional.
For example, if I pass 'sclkout', then there is no number at the end, so it should just match 'sclkout'. If the input is 'line7', then it should match 'line' and '7'. The name can also contain numbers inside it, so if I give it 'dx3f', then the output should be 'dx3f', but for 'dx3b0' it should match 'dx3b' and '0'.
This is what I first tried:
import re
def do_match(signal):
    match = re.match('(\w+)(\d+)?', signal)
    assert match
    print "Input = " + signal
    print "group1 = " + match.group(1)
    if match.lastindex == 2:
        print "group2 = " + match.group(2)
    print ""
# should match 'sclkout'
do_match("sclkout")
# should match 'line' and '7'
do_match("line7")
# should match 'dx4f'
do_match("dx4f")
# should match 'dx3b' and '0'
do_match("dx3b0")
This is of course wrong because of greedy matching in the (\w+) group, so I tried setting that to non-greedy:
match = re.match('(\w+?)(\d+)?', signal)
This however only matches the first letter of the string.
You don't need regex for this:
from itertools import takewhile
def do_match(s):
    num = ''.join(takewhile(str.isdigit, reversed(s)))[::-1]
    return s[:s.rindex(num)], num
...
>>> do_match('sclkout')
('sclkout', '')
>>> do_match('line7')
('line', '7')
>>> do_match('dx4f')
('dx4f', '')
>>> do_match('dx3b0')
('dx3b', '0')
You can use a lazy quantifier anchored at both ends, like this:
^(?<name>\w+?)(?<number>\d+)?$
Or ^(\w+?)(\d+)?$, if you don't want the named capture groups. (In Python, named groups are written (?P<name>...) instead.)
See live demo here: http://rubular.com/r/44Ntc4mLDY
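A Python rendering of the anchored pattern, as a sketch using Python's (?P<name>...) group syntax; the constant name SIGNAL is an assumption for illustration. The trailing $ is what forces the lazy \w+? to keep expanding until only an optional run of digits remains:

```python
import re

# Lazy name, optional trailing digits, anchored at both ends.
SIGNAL = re.compile(r'^(?P<name>\w+?)(?P<number>\d+)?$')

results = {}
for signal in ("sclkout", "line7", "dx4f", "dx3b0"):
    m = SIGNAL.match(signal)
    # group("number") is None when there is no trailing number.
    results[signal] = (m.group("name"), m.group("number"))
    print(signal, "->", results[signal])
```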
([a-zA-Z0-9]*[a-zA-Z]+)([0-9]*) is what you want.
import re
test = ["sclkout", "line7", "dx4f", "dx3b0"]
ans = [("sclkout", ""), ("line", "7"), ("dx4f", ""), ("dx3b", "0")]
for t, a in zip(test, ans):
    m = re.match(r'([a-zA-Z0-9]*[a-zA-Z]+)([0-9]*)', t)
    if m.groups() == a:
        print "OK"
    else:
        print "NG"
output:
OK
OK
OK
OK

How do I split a comma delimited string in Python except for the commas that are within quotes

I am trying to split a comma delimited string in python. The tricky part for me here is that some of the fields in the data themselves have a comma in them and they are enclosed within quotes (" or '). The resulting split string should also have the quotes around the fields removed. Also, some fields can be empty.
Example:
hey,hello,,"hello,world",'hey,world'
needs to be split into 5 parts like below
['hey', 'hello', '', 'hello,world', 'hey,world']
Any ideas/thoughts/suggestions/help with how to go about solving the above problem in Python would be much appreciated.
Thank You,
Vish
Sounds like you want the CSV module.
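A quick sketch of what the csv module does with the example line under its default settings. It handles the double-quoted field, but with the default quotechar of '"' the single-quoted field is split at its comma:

```python
import csv

line = "hey,hello,,\"hello,world\",'hey,world'"

# csv.reader accepts any iterable of lines; the default quotechar is '"'.
row = next(csv.reader([line]))
print(row)
```

So the csv module alone is enough only if the data uses a single kind of quote.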
(Edit: The original answer had trouble with empty fields on the edges due to the way re.findall works, so I refactored it a bit and added tests.)
import re
def parse_fields(text):
    r"""
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\''))
    ['hey', 'hello', '', 'hello,world', 'hey,world']
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\','))
    ['hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\','))
    ['', 'hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(''))
    ['']
    >>> list(parse_fields(','))
    ['', '']
    >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string'))
    ['testing', 'quotes not at "the" beginning \'of\' the', 'string']
    >>> list(parse_fields('testing,"unterminated quotes'))
    ['testing', '"unterminated quotes']
    """
    pos = 0
    exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
    while True:
        m = exp.search(text, pos)
        result = m.group(2)
        separator = m.group(3)
        yield result
        if not separator:
            break
        pos = m.end(0)
if __name__ == "__main__":
    import doctest
    doctest.testmod()
(['"]?) matches an optional single- or double-quote.
(.*?) matches the string itself. This is a non-greedy match, to match as much as necessary without eating the whole string. This is assigned to result, and it's what we actually yield as a result.
\1 is a backreference, to match the same single- or double-quote we matched earlier (if any).
(,|$) matches the comma separating each entry, or the end of the line. This is assigned to separator.
If separator is false (eg. empty), that means there's no separator, so we're at the end of the string--we're done. Otherwise, we update the new start position based on where the regex finished (m.end(0)), and continue the loop.
The csv module won't handle the scenario of " and ' being quotes at the same time. Absent a module that provides that kind of dialect, one has to get into the parsing business. To avoid reliance on a third party module, we can use the re module to do the lexical analysis, using the re.MatchObject.lastindex gimmick to associate a token type with the matched pattern.
The following code when run as a script passes all the tests shown, with Python 2.7 and 2.2.
import re
# lexical token symbols
DQUOTED, SQUOTED, UNQUOTED, COMMA, NEWLINE = xrange(5)
_pattern_tuples = (
(r'"[^"]*"', DQUOTED),
(r"'[^']*'", SQUOTED),
(r",", COMMA),
(r"$", NEWLINE), # matches end of string OR \n just before end of string
(r"[^,\n]+", UNQUOTED), # order in the above list is important
)
_matcher = re.compile(
'(' + ')|('.join([i[0] for i in _pattern_tuples]) + ')',
).match
_toktype = [None] + [i[1] for i in _pattern_tuples]
# need dummy at start because re.MatchObject.lastindex counts from 1
def csv_split(text):
    """Split a csv string into a list of fields.
    Fields may be quoted with " or ' or be unquoted.
    An unquoted string can contain both a " and a ', provided neither is at
    the start of the string.
    A trailing \n will be ignored if present.
    """
    fields = []
    pos = 0
    want_field = True
    while 1:
        m = _matcher(text, pos)
        if not m:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        ttype = _toktype[m.lastindex]
        if want_field:
            if ttype in (DQUOTED, SQUOTED):
                fields.append(m.group(0)[1:-1])
                want_field = False
            elif ttype == UNQUOTED:
                fields.append(m.group(0))
                want_field = False
            elif ttype == COMMA:
                fields.append("")
            else:
                assert ttype == NEWLINE
                fields.append("")
                break
        else:
            if ttype == COMMA:
                want_field = True
            elif ttype == NEWLINE:
                break
            else:
                print "*** Error dump ***", ttype, repr(m.group(0)), fields
                raise ValueError("Missing comma at offset %d in %r" % (pos, text))
        pos = m.end(0)
    return fields
if __name__ == "__main__":
    tests = (
        ("""hey,hello,,"hello,world",'hey,world'\n""", ['hey', 'hello', '', 'hello,world', 'hey,world']),
        ("""\n""", ['']),
        ("""""", ['']),
        ("""a,b\n""", ['a', 'b']),
        ("""a,b""", ['a', 'b']),
        (""",,,\n""", ['', '', '', '']),
        ("""a,contains both " and ',c""", ['a', 'contains both " and \'', 'c']),
        ("""a,'"starts with "...',c""", ['a', '"starts with "...', 'c']),
        )
    for text, expected in tests:
        result = csv_split(text)
        print
        print repr(text)
        print repr(result)
        print repr(expected)
        print result == expected
I put together something like this. It is probably quite redundant, but it does the job for me. You will have to adapt it a bit to your specifications:
import re
def csv_splitter(line):
    splitthese = [0]
    splitted = []
    splitpos = True
    for nr, i in enumerate(line):
        if i == "\"" and splitpos == True:
            splitpos = False
        elif i == "\"" and splitpos == False:
            splitpos = True
        if i == "," and splitpos == True:
            splitthese.append(nr)
    splitthese.append(len(line)+1)
    for i in range(len(splitthese)-1):
        splitted.append(re.sub("^,|\"","",line[splitthese[i]:splitthese[i+1]]))
    return splitted
