regex split: ignore delimiter if followed by short substring - python

I have a csv file in which pipes serve as delimiters.
But sometimes a short substring follows the 3rd pipe: up to 2 alphanumeric characters before the next pipe. In that case the 3rd pipe should not be interpreted as a delimiter.
example: split on each pipe:
x1 = "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY"
=> split after XXL because it is followed by more than 2 characters
examples: split on all pipes except the 3rd if there are less than 3 characters between pipes 3 and 4:
x2 = "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf"
x3 = "as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"
=> keep "1g|z4" and "a1|2" together.
My regex attempts only suffice for a substring replacement like this one: it replaces the pipe with a hyphen if it finds it between 2 digits, 3|4 => 3-4:
x = re.sub(r'(?<=\d)\|(?=\d)', repl='-', string=x1, count=1)
My question is:
If the third pipe is followed by a short alphanumeric substring of at most 2 characters (like Bx, 2, 42, z or 3b), then re.split should ignore the 3rd pipe and continue with the 4th pipe. All pipes other than #3 are unconditional delimiters.

You can use re.sub to add a quote character around the short columns, then use Python's built-in csv module to parse the text (see regex101 for the expression used):
import re
import csv
from io import StringIO
txt = """\
as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY
as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf
as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"""
pat = re.compile(r"^((?:[^|]+\|){2})([^|]+\|[^|]{,2}(?=\|))", flags=re.M)
txt = pat.sub(r'\1"\2"', txt)
reader = csv.reader(StringIO(txt), delimiter="|", quotechar='"')
for line in reader:
    print(line)
Prints:
['as234-HJ123-HG', 'dfdf KHT werg', 'XXL', 's45dtgIKU', '2017-SS0', '123.45', 'asUJY']
['as234-H344423-dfX', 'dfer XXYUyu werg', '1g|z4', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']
['as234-H3wer23-dZ', 'df3r Xa12yu wg', 'a1|2', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']

I adapted Andrej's solution as follows:
Assume that the dataframe has already been imported from csv without parsing.
To split the dataframe's single column 0, apply a function that checks if the 3rd pipe is a qualified delimiter.
pat1 is Andrej's solution for identifying whether substring 4 after the 3rd pipe is longer than 2 characters. If it is short, then substring 3, pipe 3 and substring 4 are enclosed in double quotes in text x (in a dataframe, this result type differs from the list shown by the print loop).
This part could be replaced by a different regex if your own criterion for "delimiter to ignore" differs from the example.
Next I replace the disqualified pipe(s), those between double quotes, with a hyphen: pat2 in re.sub.
The function returns the resulting text y to the new dataframe column "out".
We can get rid of the double quotes in the entire column. They were only needed for the replacements.
Finally, we split column "out" into multiple columns by using all remaining pipe delimiters in str.split.
I suppose my 3 steps could be combined into fewer steps (first enclose the 3rd pipe in double quotes if it matches a pattern that disqualifies it as a delimiter, then replace the disqualified pipe with a hyphen, then split the text/column). But I'm happy enough that this 3-step solution works.
# identify if the 3rd pipe is a valid delimiter:
def cond_replace_3rd_pipe(row):
    # put substring3, the 3rd pipe and short substring4 between double quotes
    pat1 = re.compile(r'^((?:[^|]+\|){2})([^|]+\|[^|]{,2}(?=\|))', flags=re.M)
    x = pat1.sub(r'\1"\2"', row[0])
    # replace pipes between double quotes with a hyphen
    pat2 = r'"(.+)\|(.+)"'
    y = re.sub(pat2, r'"\1-\2"', x)
    return y

df["out"] = df.apply(cond_replace_3rd_pipe, axis=1, result_type="expand")
df["out"] = df["out"].str.replace('"', "")  # double quotes no longer needed
df["out"].str.split('|', expand=True)  # split "out" into separate columns at all remaining pipes
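The three steps can likely be collapsed into a single pass: one regex that matches the 3rd pipe only when at most 2 non-pipe characters separate it from the 4th pipe can replace it with a hyphen directly, after which an ordinary split works. A minimal sketch (the pattern is my adaptation of pat1 above, not the author's code; verify it against your own disqualification criterion):

```python
import re

# replace the 3rd pipe with a hyphen when at most 2 non-pipe
# characters separate it from the 4th pipe, then split normally
pat = re.compile(r'^((?:[^|]+\|){2}[^|]+)\|(?=[^|]{,2}\|)', flags=re.M)

x2 = "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf"
print(pat.sub(r'\1-', x2).split('|'))
# ['as234-H344423-dfX', 'dfer XXYUyu werg', '1g-z4', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']
```

In a dataframe the same substitution should work via something like `df[0].str.replace(pat, r'\1-', regex=True).str.split('|', expand=True)`, though the exact pandas call depends on your version.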

Related

How to split a string in python based on separator with separator as a part of one of the chunks?

Looking for an elegant way to:
Split a string based on a separator
Instead of discarding separator, making it a part of the splitted chunks.
For instance I do have date and time data like:
D2018-4-21T3:55+6
2018-4-4T3:15+6
D2018-11-21T12:45+6:30
Sometimes there's a D, sometimes not (however I always want it to be part of the first chunk), no trailing or leading zeros for the time, and the timezone only sometimes has ':'. The point is, it is necessary to split on these 'D, T, +' characters because the segments might not have the same length. If they did, it would be easier to just split by index. I want to split over multiple characters like T and + and keep them as part of the data as well, like:
['D2018-4-21', 'T3:55', 'TZ+6']
['D2018-4-4', 'T3:15', 'TZ+6']
['D2018-11-21', 'T12:45', 'TZ+6:30']
I know a nicer way would be to clean the data first and normalize all rows to follow the same pattern, but I'm just curious how to do it as-is.
For now my ugly solution looks like:
[i+j for _, i in enumerate(['D','T','TZ']) for __, j in enumerate('D2018-4-21T3:55+6'.replace('T',' ').replace('D', ' ').replace('+', ' +').split()) if _ == __]
Use a regular expression
Reference:
https://docs.python.org/3/library/re.html
(...)
Matches whatever regular expression is inside the parentheses, and
indicates the start and end of a group; the contents of a group can be
retrieved after a match has been performed, and can be matched later
in the string with the \number special sequence, described below. To
match the literals '(' or ')', use \( or \), or enclose them inside a
character class: [(], [)].
import re
a = '''D2018-4-21T3:55+6
2018-4-4T3:15+6
D2018-11-21T12:45+6:30'''
b = a.splitlines()
for i in b:
    m = re.search(r'^D?(.*)([T].*?)([-+].*)$', i)
    if m:
        print(["D%s" % m.group(1), m.group(2), "TZ%s" % m.group(3)])
Result:
['D2018-4-21', 'T3:55', 'TZ+6']
['D2018-4-4', 'T3:15', 'TZ+6']
['D2018-11-21', 'T12:45', 'TZ+6:30']
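Another option, on Python 3.7+ where re.split accepts zero-width patterns, is to split on a lookahead so the T and + separators stay attached to the following chunk, then patch up the missing D and the TZ prefix. A sketch assuming, as in the samples, that the separators are exactly one T and one '+':

```python
import re

def parse(s):
    # zero-width split: cut before 'T' and before '+' without consuming them
    date, time, tz = re.split(r'(?=[T+])', s)
    if not date.startswith('D'):
        date = 'D' + date
    return [date, time, 'TZ' + tz]

for s in ['D2018-4-21T3:55+6', '2018-4-4T3:15+6', 'D2018-11-21T12:45+6:30']:
    print(parse(s))
# ['D2018-4-21', 'T3:55', 'TZ+6']
# ['D2018-4-4', 'T3:15', 'TZ+6']
# ['D2018-11-21', 'T12:45', 'TZ+6:30']
```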

Replacing semicolon for comma in csv using regex in python

I'm working with a .csv file and, as always, it has format problems. In this case it's a ; separated table, but there's a row that sometimes has semicolons, like this:
code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction; animals;2;2
So there are three cases:
no semicolon -> no problem
word character(non-numeric), semicolon, whitespace, word character(non-numeric)
word character(non-numeric), semicolon, 2xwhitespace, word character(non-numeric)
I turned the .csv into a .txt and then imported it as a string and then I compiled this regex:
re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)
Which should do. I almost managed to replace those semicolons for commas, doing the following:
def replace_comma(match):
    text = match.group()
    return text.replace(';', ',')

regex = re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)
string2 = string.split('\n')
for n, i in enumerate(string2):
    if len(re.findall(r'([^\d\W]);(\s+)([^\d\W])', i)) >= 1:
        string2[n] = regex.sub(replace_comma, i)
This mostly works, but when there's two whitespaces after the semicolon, it leaves an \xa0 after the comma. I have two problems with this approach:
It's not very straightforward
Why is it leaving this \xa0 character ?
Do you know any better way to approach this?
Thanks
Edit: My desired output would be:
code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction, animals;2;2
Edit: Added explanation about turning the file into a string for better manipulation.
For this case I wouldn't use a regex; split() and rsplit() with the maxsplit= parameter are enough:
data = '''1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction; animals;2;2'''
for line in data.splitlines():
    row = line.split(';', maxsplit=1)
    row = row[:1] + row[-1].rsplit(';', maxsplit=2)
    row[1] = row[1].replace(';', ',')
    print(';'.join(row))
Prints:
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction, animals;2;2
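As for the stray \xa0: it is most likely a non-breaking space in the source file. \s matches it, so match.group() contains it, and replacing only the ';' leaves it in place. If you prefer the regex route, substituting the whole whitespace run along with the semicolon normalizes it away. A sketch reusing the question's own character class (the \xa0 in the sample data is my assumption about the input):

```python
import re

data = '1;fishes;2;2\n2;agriculture;\xa0also fishes;1;2\n3;fishing. Extraction; animals;2;2'

# replace a semicolon plus any whitespace run (incl. \xa0) between
# two non-digit word characters with ", "
fixed = re.sub(r'(?<=[^\W\d]);\s+(?=[^\W\d])', ', ', data)
print(fixed)
# 1;fishes;2;2
# 2;agriculture, also fishes;1;2
# 3;fishing. Extraction, animals;2;2
```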

Parsing String by regular expression in python

How can I parse this string in python?
Input String:
someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data
to this
Output array:
['someplace','2018:6:18:0','25.0114','95.2818','2.71164','66.8962','Entire grid contents are set to missing data']
I have already tried split(' '), but since it is not clear how many spaces separate the sub-strings, and the last sub-string may itself contain spaces, this doesn't work.
I need the regular expression.
If you do not provide a sep character, Python's split(sep=None, maxsplit=-1) (docs) treats consecutive whitespace as a single separator and splits on it. You can limit the number of splits by providing a maxsplit value:
data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
spl = data.split(None, 6)  # don't give a split char; use 6 splits at most
print(spl)
Output:
['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
'66.8962', 'Entire grid contents are set to missing data']
This will work as long as the first text does not contain any whitespace.
If the first text may contain whitespace, you can use/refine this regex solution:
import re
reg = re.findall(r"([^\d]+?) +?([\d:]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +(.*)$",data)[0]
print(reg)
Output:
('someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164', '66.8962', 'Entire grid contents are set to missing data')
Use e.g. https://regex101.com to check/prove the regex against your other data (follow the link; it uses the above regex on sample data).
[A-Z]{1}[a-zA-Z ]{15,45}|[\w|:|.]+
You can test it here https://pythex.org/
Modify 15,45 according to your needs.
Maxsplit works with re.split(), too:
import re
re.split(r"\s+",text,maxsplit=6)
Out:
['someplace',
'2018:6:18:0',
'25.0114',
'95.2818',
'2.71164',
'66.8962',
'Entire grid contents are set to missing data']
EDIT:
If the first and last text parts don't contain digits, we don't need maxsplit and don't have to rely on the number of parts or on consecutive spaces:
re.split(r"\s+(?=\d)|(?<=\d)\s+", s)
We cut the string where a space is followed by a digit or vice versa using lookahead and lookbehind.
It is hard to answer your question as the requirements are not very precise. I think I would split the line with the split() function and then join the items whose content has no numbers. Here is a snippet that works with your single sample:
def containsNumbers(s):
    return any(c.isdigit() for c in s)

data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
lst = data.split()
lst2 = []
i = 0
agg = ''
while i < len(lst):
    if containsNumbers(lst[i]):
        if agg != '':
            lst2.append(agg)
            agg = ''
        lst2.append(lst[i])
    else:
        agg += ' ' + lst[i]
        agg = agg.strip()
        if i == len(lst) - 1:
            lst2.append(agg)
    i += 1
print(lst2)

Regular Expression in Python 3

I am new here and have just started using regular expressions in my Python code. I have a string which has 6 commas inside. One of the commas falls between two quotation marks. I want to get rid of the quotation marks and the last comma.
The input:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
I want this output:
string = 'Fruits,Pear,Cherry,Apple,Orange,Cherry'
The output of my code:
string = 'Fruits,Pear,**CherryApple**,Orange,Cherry'
here is my code in python:
if re.search('"', string):
    matches = re.findall(r'\"(.+?)\"', string)
    matches1 = re.sub(",", "", matches[0])
    string = re.sub(matches[0], matches1, string)
    string = re.sub('"', '', string)
My problem is, I want a condition so that the code only affects the last bit ("Cherry,"), but unfortunately it also affects words in the middle (Cherry,Apple) that have the same text as the one between the quotation marks! That reduces the number of commas (from 6 to 4) as it merges two fields (Cherry,Apple), and I want to be left with 5 commas.
fullString = '2000-04-24 12:32:00.000,22186CBD0FDEAB049C60513341BA721B,0DDEB5,COMP,Ch‌​erry Corp.,DE,100,0.57,100,31213C678CC483768E1282A9D8CB524C,365.0‌​0000,business,acquis‌​itions-mergers,acqui‌​sition-bid,interest,‌​acquiree,fact,,,,,,,‌​,,,,,,acquisition-in‌​terest-acquiree,Cher‌​ry Corp. Gets Buyout Offer From Chairman President,FULL-ARTICLE,B5569E,Dow Jones Newswires,0.04,-0.18,0,0,1,0,0,0,0,1,1,5,RPA,DJ,DN2000042400‌​0597,"Cherry Corp. Gets Buyout Offer From Chairman President,"\n'
Many Thanks in advance
For your task you don't need regular expressions; just use replace:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
new_string = string.replace('"', '').strip(',')
The best way would be to use the newer regex module where (*SKIP)(*FAIL) is supported:
import regex as re
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
# split on commas outside quotes
rx = re.compile(r'"[^"]+"(*SKIP)(*FAIL)|,')

def cleanse(match):
    rxi = re.compile(r'[",]+')
    return rxi.sub('', match)

parts = [cleanse(match) for match in rx.split(string)]
print(parts)
# ['Fruits', 'Pear', 'Cherry', 'Apple', 'Orange', 'Cherry']
Here you match anything between double quotes and throw it away afterwards, thus only commas outside quotes are used for the split operation. The rest is a list comprehension with a cleaning function.
See a demo on regex101.com.
Why not simply use this:
>>> ans_string = string.replace('"', '')[0:-1]
Output
>>>ans_string
'Fruits,Pear,Cherry,Apple,Orange,Cherry'
For the sake of simplicity and algorithmic complexity.
You might consider using the csv module to do this.
Example:
import csv
s='Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
>>> ','.join([e.replace(',','') for row in csv.reader([s]) for e in row])
Fruits,Pear,Cherry,Apple,Orange,Cherry
The csv module will strip the quotes but keep the commas on each quoted field. Then you can just remove that comma that was kept.
This will take care of any modifications desired (remove , for example) on a field by field basis. The fields with quotes and commas could be any field in the string.
If your content is in a csv file, you would do something like this (in pseudo code)
with open(file, newline='') as csv_fo:
    # modify(field) stands for what you want to do to each field...
    for row in csv.reader(csv_fo):
        new_row = [modify(field) for field in row]
        # now do what you need with that row

Define string between comma x and comma y, then split all bytes using a comma

I have some data that I am parsing which takes the following format:
8344,5354,Binh Duong,1,0103313333333033133331,1,13333333331,1,00313330133
,8344,7633,TT Ha Noi,2,3330333113333303111303,3,33133331133,2,30333133010
....more data.....
The first record does not start with a comma, but all subsequent rows of data do. I want to take all the numbers between the 4th and 5th comma on the first line (5th and 6th on all other lines) and split that string using commas.
So in the above example '0103313333333033133331' should print as '0,1,0,3,3,1,3,3,3,3,3,3,3,0,3,3,1,3,3,3,3,1'. The difficulty is that the length of the string between comma x and y varies depending on what data I am parsing. I've used regexes to isolate the string in question provided it has 16 digits, but this is not the case for all items I might be parsing.
As a result, using a .format() method with 16 instances of '{},' threw a tuple index error on items where the string was not 16 characters long.
Can anyone suggest a method of achieving what I want?
Thanks
I would use str.split() to get the correct field, and str.join() to split it into single characters:
with open('xx.in') as input_file:
    for line in input_file:
        line = line.strip().strip(',')
        line = line.split(',')
        field = line[4]
        print(','.join(field))
A slightly different approach with a regex that grabs the 5th element of the comma separated line from the end:
>>> import re
>>> lines = ['8344,5354,Binh Duong,1,0103313333333033133331,1,13333333331,1,00313330133',',8344,7633,TT Ha Noi,2,3330333113333303111303,3,33133331133,2,30333133010']
>>> for line in lines:
...     num = re.search(r'\d+(?=(?:,[^,]+){4}$)', line).group()
...     seq = ','.join(list(num))
...     print(seq)
...
0,1,0,3,3,1,3,3,3,3,3,3,3,0,3,3,1,3,3,3,3,1
3,3,3,0,3,3,3,1,1,3,3,3,3,3,0,3,1,1,1,3,0,3
You can use this regex:
^,?\d+,\d+,[\w\s]+,\d+,(\d+)
Working demo
MATCH 1
1. [23-45] `0103313333333033133331`
MATCH 2
1. [97-119] `3330333113333303111303`
Then you can insert a comma after each digit of the captured group with re.sub:
p = re.compile(r'(\d)')
test_str = "0103313333333033133331"
subst = r"\1,"
result = re.sub(p, subst, test_str)
>> 0,1,0,3,3,1,3,3,3,3,3,3,3,0,3,3,1,3,3,3,3,1,
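Putting the two pieces together — and noting that a Python string is already an iterable of characters, so str.join can insert the commas without a second regex — a sketch combining the capture-group pattern above with the join step:

```python
import re

lines = [
    '8344,5354,Binh Duong,1,0103313333333033133331,1,13333333331,1,00313330133',
    ',8344,7633,TT Ha Noi,2,3330333113333303111303,3,33133331133,2,30333133010',
]
for line in lines:
    # optional leading comma, then 3 fields, then grab the 5th field's digits
    m = re.match(r',?\d+,\d+,[\w\s]+,\d+,(\d+)', line)
    if m:
        print(','.join(m.group(1)))
# 0,1,0,3,3,1,3,3,3,3,3,3,3,0,3,3,1,3,3,3,3,1
# 3,3,3,0,3,3,3,1,1,3,3,3,3,3,0,3,1,1,1,3,0,3
```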
