Finding pattern to extract text from a string using regular expression - python

I have data which I'm reading in a string format as
>>> 26 24 16  Panelboards  10/05/18 26 26 00i  Power Distribution Units – Install  10/05/18
I want to separate '26 24 16', 'Panelboards', '10/05/18' and '26 26 00i', 'Power Distribution Units – Install', '10/05/18' as subsection, name, and date.
Also, a new item can begin after every date. In this case, a new subsection begins after 10/05/18.
I used a regular expression to split on the subsection, but it breaks the structure of my data:
re.split(r'\d\d \d\d \d\d',sentence)
Does anyone have a solution that efficiently retrieves these three features for both items?
Also, I can't rely on two spaces as the delimiter, because the structure of the file can change.

Try:
import re
s = """26 24 16  Panelboards  10/05/18 26 26 00i  Power Distribution Units – Install  10/05/18"""
out = re.split(r"\s{2,}", s)
print(out)
Prints:
['26 24 16', 'Panelboards', '10/05/18 26 26 00i', 'Power Distribution Units – Install', '10/05/18']
EDIT: If you want to split the 2nd item, use str.split() with maxsplit=1:
import re
from itertools import chain
s = """26 24 16  Panelboards  10/05/18 26 26 00i  Power Distribution Units – Install  10/05/18"""
out = re.split(r"\s{2,}", s)
# out[2] is '10/05/18 26 26 00i': split it once into the date and the next subsection
out = list(chain(out[:2], out[2].split(maxsplit=1), out[3:]))
print(out)
Prints:
['26 24 16', 'Panelboards', '10/05/18', '26 26 00i', 'Power Distribution Units – Install', '10/05/18']
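Once the flat list is available, the three fields of each item can be regrouped. This is only a minimal sketch, assuming the list length is always a multiple of three; the helper name group_in_threes is made up for illustration:

```python
def group_in_threes(flat):
    """Regroup a flat [sub, name, date, sub, name, date, ...] list into tuples."""
    # Assumption: every item contributed exactly three fields.
    if len(flat) % 3:
        raise ValueError("expected a multiple of three fields")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

out = ['26 24 16', 'Panelboards', '10/05/18',
       '26 26 00i', 'Power Distribution Units – Install', '10/05/18']
for subsection, name, date in group_in_threes(out):
    print(subsection, name, date, sep=" | ")
```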

You can use
\b(?P<subsection>\d+(?:\s+\d\w*)+)\s+(?P<name>.*?)\s+(?P<date>\d{1,2}/\d{1,2}/\d{2})\b
See the regex demo. Details:
\b - word boundary
(?P<subsection>\d+(?:\s+\d\w*)+) - Group "subsection": one or more digits and then one or more occurrences of one or more whitespaces followed with a digit and then zero or more word chars
\s+ - one or more whitespaces
(?P<name>.*?) - Group "name": zero or more chars other than line break chars, as few as possible
\s+ - one or more whitespaces
(?P<date>\d{1,2}/\d{1,2}/\d{2}) - Group "date": one or two digits, /, one or two digits, /, two digits
\b - word boundary
See a Python demo:
import re
pattern = r"\b(?P<subsection>\d+(?:\s+\d\w*)+)\s+(?P<name>.*?)\s+(?P<date>\d{1,2}/\d{1,2}/\d{2})\b"
text = "26 24 16 Panelboards 10/05/18 26 26 00i Power Distribution Units – Install 10/05/18"
print([x.groupdict() for x in re.finditer(pattern, text)])
Output:
[
{'subsection': '26 24 16', 'name': 'Panelboards', 'date': '10/05/18'},
{'subsection': '26 26 00i', 'name': 'Power Distribution Units – Install', 'date': '10/05/18'}
]

Be as specific as you can:
/^(\d\d \d\d \d\d) +(.+?) +(\d\d\/\d\d\/\d\d)$/
Match group 1 for the subsection, 2 for the name and 3 for the date.
If you need to split the string first into each line, you could hook that into the end of the date:
\/\d\d\s
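The pattern above is written as a /.../ regex literal; in Python the surrounding slashes are dropped and / needs no escaping. A rough sketch of the split-then-match idea follows. The names RECORD and parse_records are hypothetical, and the \w* after the last digit pair is an added assumption so that a suffixed code such as "00i" still matches:

```python
import re

# Anchored per-record pattern. The \w* is an assumption added here so that
# a subsection with a trailing letter ("26 26 00i") is accepted.
RECORD = re.compile(r"^(\d\d \d\d \d\d\w*) +(.+?) +(\d\d/\d\d/\d\d)$")

def parse_records(text):
    """Split text after each date, then match every piece against RECORD."""
    # Fixed-width lookbehind: split on whitespace that directly follows "/dd".
    pieces = re.split(r"(?<=/\d\d)\s+", text.strip())
    rows = []
    for piece in pieces:
        m = RECORD.match(piece)
        if m:  # pieces that do not fit the layout are skipped
            rows.append(m.groups())
    return rows

text = "26 24 16 Panelboards 10/05/18 26 26 00i Power Distribution Units – Install 10/05/18"
print(parse_records(text))
```

Group 1 is the subsection, 2 the name and 3 the date, as in the answer above.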

Related

Removing signs and repeating numbers

I want to remove all signs from my dataframe so that each value is in one of two formats: 100-200 or 200.
So each salary should have a single hyphen between the two numbers if a range of salaries is given, and otherwise be a clean single number.
I have the following data:
import pandas as pd
import re
df = {'salary':['£26,768 - £30,136/annum Attractive benefits package',
'£26,000 - £28,000/annum plus bonus',
'£21,000/annum',
'£26,768 - £30,136/annum Attractive benefits package',
'£33/hour',
'£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
'£27,500 - £30,000/annum £27,500 to £30,000 + Study',
'£35,000 - £40,000/annum',
'£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
'£19,000 - £24,000/annum Study Support',
'£30,000 - £35,000/annum',
'£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
'£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df)
Here's what I have tried to remove some of the signs:
salary = []
for i in data.salary:
    space = re.sub(" ", '', i)
    lower = re.sub("[a-z]", '', space)
    upper = re.sub("[A-Z]", '', lower)
    bracket = re.sub("/", '', upper)
    comma = re.sub(",", '', bracket)
    plus = re.sub(r"\+", '', comma)
    percentage = re.sub("%", '', plus)
    dot = re.sub(r"\.", '', percentage)
    bracket1 = re.sub(r"\(", '', dot)
    bracket2 = re.sub(r"\)", '', bracket1)
    salary.append(bracket2)
Which gives me:
'£26768-£30136',
'£26000-£28000',
'£21000',
'£26768-£30136',
'£33',
'£18500-£20500-',
'£27500-£30000£27500£30000',
'£35000-£40000',
'£24000-£27000',
'£19000-£24000',
'£30000-£35000',
'£44000-£6600015',
'£75-£90£75-£90'
However, some numbers are repeated. Essentially, I want everything after the first range of values removed, along with every sign except the hyphen between the two numbers.
Expected output:
'26768-30136',
'26000-28000',
'21000',
'26768-30136',
'33',
'18500-20500',
'27500-30000',
'35000-40000',
'24000-27000',
'19000-24000',
'30000-35000',
'44000-66000',
'75-90'
Another way using pandas.Series.str.partition with replace:
data["salary"].str.partition("/")[0].str.replace(r"[^\d-]+", "", regex=True)
Output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
Name: 0, dtype: object
Explain:
This assumes that you are only interested in the part up to the first /: it extracts everything before the /, then removes anything but digits and hyphens.
You can use
data['salary'].str.split('/', n=1).str[0].replace(r'[^\d-]+', '', regex=True)
# 0 26768-30136
# 1 26000-28000
# 2 21000
# 3 26768-30136
# 4 33
# 5 18500-20500
# 6 27500-30000
# 7 35000-40000
# 8 24000-27000
# 9 19000-24000
# 10 30000-35000
# 11 44000-66000
# 12 75-90
Here,
.str.split('/', n=1) - splits into two parts at the first / char
.str[0] - gets the first item
.replace('[^\d-]+','', regex=True) - removes all chars other than digits and hyphens.
A more precise solution is to extract the £num(-£num)? pattern and remove all non-digits/hyphens:
data['salary'].str.extract(r'£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)')[0].str.replace(r'[^\d-]+', '', regex=True)
Details:
£ - a literal char
\d+(?:,\d+)*(?:\.\d+)? - one or more digits, followed with zero or more occurrences of a comma and one or more digits and then an optional sequence of a dot and one or more digits
(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)? - an optional occurrence of a hyphen enclosed with zero or more whitespaces (\s*-\s*), then a £ char, and a number pattern described above.
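The difference between the two approaches shows up on input that has no / at all. The following plain re sketch (no pandas) compares them; the helper names and the sample string are hypothetical and not part of the original data:

```python
import re

# Same pattern as above, compiled once.
FIRST_RANGE = re.compile(
    r"£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)"
)

def up_to_slash(text):
    """Keep digits and hyphens from the part before the first '/'."""
    return re.sub(r"[^\d-]+", "", text.partition("/")[0])

def first_range(text):
    """Keep digits and hyphens from the first £num or £num - £num match."""
    m = FIRST_RANGE.search(text)
    return re.sub(r"[^\d-]+", "", m.group(1)) if m else None

sample = "Salary: £44,000 - £66,000 + 15% Bonus"  # hypothetical, no '/' present
print(up_to_slash(sample))   # the bonus percentage leaks into the result
print(first_range(sample))   # only the first range is kept
```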
You can do it in only two regex passes.
First, extract the monetary amounts with a regex, then remove the thousands separators. Finally, group the results by original row and join only the first two occurrences with a hyphen.
The advantage of this solution is that it really only extracts monetary digits, not other numbers that might be present if the input is not clean.
(data['salary'].str.extractall(r'£([,\d]+)')[0] # extract £123,456 digits
.str.replace(r'\D', '', regex=True) # remove separator
.groupby(level=0).apply(lambda x: '-'.join(x[:2])) # join first two occurrences
)
output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
You can use replace with a pattern and optional capture groups to match the data format, and use those groups in the replacement.
import pandas as pd
df = {'salary':['£26,768 - £30,136/annum Attractive benefits package',
'£26,000 - £28,000/annum plus bonus',
'£21,000/annum',
'£26,768 - £30,136/annum Attractive benefits package',
'£33/hour',
'£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
'£27,500 - £30,000/annum £27,500 to £30,000 + Study',
'£35,000 - £40,000/annum',
'£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
'£19,000 - £24,000/annum Study Support',
'£30,000 - £35,000/annum',
'£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
'£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df).salary.replace(
r"^£(\d+)(?:,(\d+))?(?:\s*(-)\s*£(\d+)(?:,(\d+))?)?/.*",
r"\1\2\3\4\5", regex=True
)
print(data)
The pattern matches
^ Start of string
£ Match literally
(\d+) Capture 1+ digits in group 1
(?:,(\d+))? Optionally capture 1+ digits in group 2 that are preceded by a comma, to match the data format
(?: Non capture group to match as a whole
\s*(-)\s*£ Capture - between optional whitespace chars in group 3 and match £
(\d+)(?:,(\d+))? The same as previous, now in group 4 and group 5
)? Close non capture group and make it optional
/.* Match / and the rest of the line
See a regex demo.
Output
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
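The same replacement can be tried without pandas using re.sub. This is a small sketch with a made-up helper name (normalise); it relies on the fact that, since Python 3.5, groups that did not take part in the match are substituted as empty strings:

```python
import re

PATTERN = r"^£(\d+)(?:,(\d+))?(?:\s*(-)\s*£(\d+)(?:,(\d+))?)?/.*"

def normalise(salary):
    # Unmatched optional groups (\2..\5) contribute nothing to the result.
    return re.sub(PATTERN, r"\1\2\3\4\5", salary)

for raw in ["£26,768 - £30,136/annum Attractive benefits package",
            "£21,000/annum",
            "£75 - £90/day £75-£90 Per Day"]:
    print(raw, "->", normalise(raw))
```

Because the pattern requires a /, a value without one (for example "£21,000") is returned unchanged.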

How to find the words that correspond to months and replace them with numbers?

How can I find the words that are month names (January, February, March, etc.) and replace them with their numeric form (01, 02, 03, ...)?
I tried the code below
import re
def transformMonths(string):
    rep = [("May", "05"), ("June", "06")]
    for pat, repl in rep:
        s = re.sub(pat, repl, string)
    return s
print(transformMonths('I was born on June 24 and my sister was born on May 17'))
My code provides this result ('I was born on 06 24 and my sister was born on May 17')
However, I want the output to be like this ('I was born on 06 24 and my sister was born on 05 17')
You are performing the replacement on the initial (unmodified) string at each iteration so you end up with only one month name being replaced. You can fix that by assigning string instead of s in the loop (and return string at the end).
Note that your approach does not require a regular expression and could use a simple string replace: string = string.replace(pat,repl).
In both cases, because the replacement does not take into account word boundaries, the function would replace partial words such as:
"Mayor Smith was elected on May 25" --> "05or Smith was elected on 05 25".
You can fix that in your regular expression by adding \b before and after each month name. This will ensure that the month names are only found if they are between word boundaries.
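Putting both fixes together (reassigning string inside the loop and wrapping each name in \b), a corrected version of the loop might look like the sketch below. The function name transform_months is just a renamed variant of the one in the question:

```python
import re

def transform_months(string):
    rep = [("May", "05"), ("June", "06")]
    for pat, repl in rep:
        # Reassign string so each pass works on the already-updated text;
        # \b keeps "May" from matching inside "Mayor".
        string = re.sub(r"\b" + pat + r"\b", repl, string)
    return string

print(transform_months('I was born on June 24 and my sister was born on May 17'))
print(transform_months('Mayor Smith was elected on May 25'))
```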
re.sub can perform multiple replacements with varying values if you give it a function instead of a fixed string. So you can build a combined regular expression that finds all the months, and replace each word found using a dictionary:
import re
def numericMonths(string):
    months = {"January":"01", "February":"02","March":"03", "April":"04",
              "May":"05", "June":"06", "July":"07", "August":"08",
              "September":"09","October":"10", "November":"11","December":"12"}
    pattern = r"\b("+"|".join(months)+r")\b" # all months as distinct words
    return re.sub(pattern,lambda m:months[m.group()],string)
output:
numericMonths('I was born on June 24 and my sister was born on May 17')
'I was born on 06 24 and my sister was born on 05 17'

How to find a regex pattern word inside another word in Python?

I have a string like this:
str = '4 167213860 Mar 7 2017 10:37:42 +00:00 c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5'
I want to recover only this part of it (c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5).
I'm looking for a regex pattern that does this, but I can't find one. This is what I tried in Python:
import re
str = '4 167213860 Mar 7 2017 10:37:42 +00:00 c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5'
output = re.findall(r'[a-z0-9]rsp[a-zA-Z0-9_-]+$',string)
This returns [].
I would be very grateful for any help.
Use a regex that gets all adjacent non whitespace at the end of the string: \S+$
string = '4 167213860 Mar 7 2017 10:37:42 +00:00 c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5'
output = re.findall(r'\S+$',string)
Working example: https://regex101.com/r/lXFRNT/1
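If only a single value is needed rather than a list, the same \S+$ pattern can be used with re.search. The sketch below also shows a non-regex alternative, str.rsplit, which is a different technique from the one in this answer; the variable names are illustrative:

```python
import re

line = '4 167213860 Mar 7 2017 10:37:42 +00:00 c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5'

# re.search returns a match object (or None) instead of a list
match = re.search(r'\S+$', line)
image_name = match.group() if match else None

# Non-regex alternative: split once from the right on whitespace
last_field = line.rsplit(maxsplit=1)[-1]

print(image_name)
print(last_field == image_name)
```

One difference worth knowing: rsplit ignores trailing spaces, whereas \S+$ finds nothing if the line ends in spaces.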
@Ruzhim's answer is good, but if you want to keep doing it the way you had in mind, you could just replace the "rsp" bit with \w+:
output = re.findall(r'[a-z0-9]\w+[a-zA-Z0-9_-]+$', str)
>>>['c7600rsp72043-advipservicesk9-mz-obs_v151_3_s1_RLS10_ES5']

Replacing numbers in various formats with a word

I have a long sentence and I want to replace all numbers with a particular word. The numbers come in different formats, e.g.,
36
010616
010516 - 300417
01-04
2011 12
Is there a function in Python for replacing these types of occurrences with a word (say, "integer"), or what would the regex for these look like?
Example:
"This is a 10 sentence with date 010616 and intervals 06-08 200-209 01 - 09 in years 2012 26"
should become
"This is a NUMBER sentence with date NUMBER and intervals NUMBER NUMBER NUMBER in years NUMBER NUMBER"
Using Regex.
import re
s = "This is a 10 sentence with date 010616 and intervals 06-08 200-209 01 - 09 in years 2012 26"
print(re.sub(r"\d+", "NUMBER", s))
Output:
This is a NUMBER sentence with date NUMBER and intervals NUMBER-NUMBER NUMBER-NUMBER NUMBER - NUMBER in years NUMBER NUMBER
re.sub('((?<=^)|(?<= ))[0-9- ]+(?=$| )', 'NUMBER', s)
'This is a NUMBER sentence with date NUMBER and intervals NUMBER in years NUMBER'
what it does is:
looking for numbers with minus signs and spaces [0-9- ]+
with space or beginning of string before match ((?<=^)|(?<= ))
and space or end of string after match (?=$| )
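Neither output above reproduces the expected sentence from the question exactly: the first keeps the hyphens, the second merges neighbouring numbers into one. A possible middle ground, sketched here under the assumption that a range is always "digits, optional spaces, hyphen, optional spaces, digits", treats each number or range as a single token (NUMBER_OR_RANGE is a made-up name):

```python
import re

# One or more digits, optionally followed by a hyphen and more digits
# (with optional spaces around the hyphen), counted as a single token.
NUMBER_OR_RANGE = re.compile(r"\b\d+(?:\s*-\s*\d+)?\b")

s = "This is a 10 sentence with date 010616 and intervals 06-08 200-209 01 - 09 in years 2012 26"
result = NUMBER_OR_RANGE.sub("NUMBER", s)
print(result)
```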

Python string split without common delimiter

I am fairly new to Python. An external simulation software I use gives me reports which include data in the following format:
1    29 Jan 2013 07:33:19.273    29 Jan 2013 09:58:10.460          8691.186
I am looking to split the above data into four strings namely;
'1', '29 Jan 2013 07:33:19.273', '29 Jan 2013 09:58:10.460', '8691.186'
I cannot use str.split since it splits out the date into multiple strings. There appears to be four white spaces between 1 and the first date and between the first and second dates. I don't know if this is four white spaces or tabs.
Using '\t' as a delimiter on split doesn't do much. If I specify '    ' (4 spaces) as a delimiter, I get the first three strings. I also then get an empty string and leading spaces in the final string. There are 10 spaces between the second date and the number.
Any suggestions on how to deal with this would be very helpful!
Thanks!
You can split on more than one space with a simple regular expression:
import re
multispace = re.compile(r'\s{2,}') # 2 or more whitespace characters
fields = multispace.split(inputline)
Demonstration:
>>> import re
>>> multispace = re.compile(r'\s{2,}') # 2 or more whitespace characters
>>> multispace.split('1    29 Jan 2013 07:33:19.273    29 Jan 2013 09:58:10.460          8691.186')
['1', '29 Jan 2013 07:33:19.273', '29 Jan 2013 09:58:10.460', '8691.186']
If the data is fixed width you can use character addressing in the string
n=str[0]
d1=str[2:26]
d2=str[27:51]
l=str[52:]
However, if Jan 02 is shown as Jan 2, this may not work, because the width of the string would then vary.
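If neither the gap widths nor the field widths can be trusted, one option is to describe the fields themselves. The sketch below assumes each row is an integer index, two timestamps of the form "D Mon YYYY HH:MM:SS.sss" with single spaces inside, and a decimal number; ROW and parse_row are hypothetical names:

```python
import re

# Assumed layout: index, two "D Mon YYYY HH:MM:SS.sss" timestamps, duration.
ROW = re.compile(
    r"(\d+)\s+"
    r"(\d{1,2} \w{3} \d{4} [\d:.]+)\s+"
    r"(\d{1,2} \w{3} \d{4} [\d:.]+)\s+"
    r"([\d.]+)"
)

def parse_row(line):
    """Return the four fields as strings, or None if the line does not fit."""
    m = ROW.fullmatch(line.strip())
    return list(m.groups()) if m else None

# Works with wide gaps between fields ...
print(parse_row("1    29 Jan 2013 07:33:19.273    29 Jan 2013 09:58:10.460          8691.186"))
# ... and with single spaces and one-digit days.
print(parse_row("1 2 Feb 2013 07:33:19.273 2 Feb 2013 09:58:10.460 8691.186"))
```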
