Apply Regex to df add values in new column

This is my dataset:
BlaBla 128 MB EE
ADTD 6 gb DTS
EEEDC 2GB RS
STA 12MB DFA
BBNB 32 mb YED
From this data set I would like to extract the number of MB/GB and the unit MB/GB. Therefore I have created the following Regex:
(\d*)\s?(MB|GB)
The code that I have created so that the regex will be applied to my df is:
pattern = re.compile(r'(\d*)\s?(MB|GB)')
invoice_df['mbs'] = invoice_df['Rate Plan'].apply(lambda x: pattern.search(x).group(1))
invoice_df['unit'] = invoice_df['Rate Plan'].apply(lambda x: pattern.search(x).group(2))
However, when I apply the regex to my df, it gives the following error message:
AttributeError: 'NoneType' object has no attribute 'group'
What can I do to solve this?

Some entries have no match, so pattern.search returns None for them, and calling .group on None raises the AttributeError. You need to account for those cases inside the lambda:
apply(lambda x: pattern.search(x).group(1) if pattern.search(x) else "")
I also advise using
(?i)(\d+)\s*([MGK]B)
It finds 1+ digits (\d+, Group 1) followed by 0+ whitespace characters (\s*) and matches KB, GB, or MB into Group 2 (([MGK]B)) in a case-insensitive way.
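A minimal sketch of the guarded lookup, using only the standard library. The helper name extract_size, the SIZE_RE constant, and the empty-string fallback are assumptions made for illustration; the same function could be passed to invoice_df['Rate Plan'].apply.

```python
import re

# Case-insensitive: 1+ digits, optional whitespace, then KB/MB/GB.
SIZE_RE = re.compile(r"(?i)(\d+)\s*([MGK]B)")

def extract_size(text):
    """Return (number, unit) as strings, or ("", "") when nothing matches."""
    match = SIZE_RE.search(text)  # search once and reuse the result
    if match is None:
        return "", ""
    return match.group(1), match.group(2).upper()

rows = ["BlaBla 128 MB EE", "ADTD 6 gb DTS", "EEEDC 2GB RS", "no size here"]
sizes = [extract_size(row) for row in rows]
print(sizes)
```

Searching once and testing the result avoids running the regex twice per row, which the one-line lambda does.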

You just need to check that something has been found before requesting the groups:
import re
inputs = ["BlaBla 128 MB EE",
          "ADTD 6 gb DTS",
          "EEEDC 2GB RS",
          "STA 12MB DFA",
          "BBNB 32 mb YED",
          "Nothing to find here"]
pattern = re.compile(r"(\d+)\s*([MG]B)", re.IGNORECASE)
for text in inputs:
    match = pattern.search(text)
    if match:
        mbs = match.group(1)
        unit = match.group(2)
        print(mbs, unit.upper())
    else:
        print("Nothing found for : %r" % text)
# 128 MB
# 6 GB
# 2 GB
# 12 MB
# 32 MB
# Nothing found for : 'Nothing to find here'
With your code, wrap the check in a function and apply it to the column (re.search expects a single string, not a whole column):
pattern = re.compile(r"(\d+)\s*([MG]B)", re.IGNORECASE)
def extract(text):
    match = pattern.search(text)
    # return both groups, or empty strings when there is no match
    return match.groups() if match else ("", "")
invoice_df['mbs'], invoice_df['unit'] = zip(*invoice_df['Rate Plan'].apply(extract))
It's more readable than a lambda IMHO, and it doesn't execute the search twice.

Related

How to match whole word in a string

I would like to pick up whole words in a string that are separated by a space, comma, or period.
text = 'OTC GLUCOSAM-CHOND-MSM1-C-MANG-BOR test, dosage uncertain'
p = r"(?i)\b([A-Za-z]+[\s*|\,|\.]+)\b"
for m in regex.finditer(p, str(text)):
    print(m.group())
I expect to get:
OTC
GLUCOSAM-CHOND-MSM1-C-MANG-BOR
test
dosage
uncertain
but what I got:
OTC
BOR
test,
dosage

To get a list of the words you want, you can use the findall() function of the re module. Also, try changing the regular expression to the one shown below:
import re
text = 'OTC GLUCOSAM-CHOND-MSM1-C-MANG-BOR test, dosage uncertain'
result = re.findall(r'\w+(?:-\w+)*', text)
print(result)
# outputs: ['OTC', 'GLUCOSAM-CHOND-MSM1-C-MANG-BOR', 'test', 'dosage', 'uncertain']

import re
text = 'OTC GLUCOSAM-CHOND-MSM1-C-MANG-BOR test, dosage uncertain'
p = r"[a-zA-Z\d-]+"  # letters, digits, and hyphens; + avoids empty matches
for m in re.finditer(p, text):
    print(m.group())
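Since the words in the question are delimited by spaces, commas, or periods, another option is to split on the delimiters instead of describing the words. The following is only a sketch under that assumption; the variable name words is illustrative.

```python
import re

text = 'OTC GLUCOSAM-CHOND-MSM1-C-MANG-BOR test, dosage uncertain'

# Split on runs of whitespace, commas, or periods, and drop the empty
# pieces that appear when the text starts or ends with a delimiter.
words = [w for w in re.split(r"[\s,.]+", text) if w]
print(words)
```

Hyphens are not in the delimiter set, so the hyphenated token stays in one piece.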

ValueError: not enough values to unpack (expected 2, got 1), Splitting string into two parts with split() didn't work

I have a string: 5kg.
I need to separate the numeric and textual parts. In this case, that should produce two parts: 5 and kg.
For that I wrote this code:
grocery_uom = '5kg'
unit_weight, uom = grocery_uom.split('[a-zA-Z]+', 1)
print(unit_weight)
Getting this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-66-23a4dd3345a6> in <module>()
1 grocery_uom = '5kg'
----> 2 unit_weight, uom = grocery_uom.split('[a-zA-Z]+', 1)
3 #print(unit_weight)
4
5
ValueError: not enough values to unpack (expected 2, got 1)
print(uom)
Edit:
I wrote this:
unit_weight, uom = re.split('[a-zA-Z]+', grocery_uom, 1)
print(unit_weight)
print('-----')
print(uom)
Now I am getting this output:
5
-----
How can I store the second part of the string in a variable?
Edit1:
I wrote this which solved my purpose (Thanks to Peter Wood):
unit_weight = re.split('([a-zA-Z]+)', grocery_uom, 1)[0]
uom = re.split('([a-zA-Z]+)', grocery_uom, 1)[1]

You don't want to split on the "kg", because that treats it as a separator rather than as part of the data. Looking at the docs (https://docs.python.org/3/howto/regex.html), I see you can keep the separators by using a capturing group, but the split pattern is still meant to describe a separator.
Here's an example of just making a pattern for exactly what you want:
import re
pattern = re.compile(r'(?P<weight>[0-9]+)\W*(?P<measure>[a-zA-Z]+)')
text = '5kg'
match = pattern.search(text)
print(match.groups())
weight, measure = match.groups()
print(weight, measure)
print('the weight is', match.group('weight'))
print('the unit is', match.group('measure'))
print(match.groupdict())
Output:
('5', 'kg')
5 kg
the weight is 5
the unit is kg
{'weight': '5', 'measure': 'kg'}
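To see why the attempts in the question behaved the way they did, the three split variants can be compared side by side. This is a small sketch; the variable names are illustrative only.

```python
import re

grocery_uom = '5kg'

# str.split treats its argument as literal text, so nothing is split
# and unpacking into two names fails.
literal = grocery_uom.split('[a-zA-Z]+', 1)

# re.split drops the separator unless the pattern has a capturing group.
without_group = re.split(r'[a-zA-Z]+', grocery_uom, maxsplit=1)
with_group = re.split(r'([a-zA-Z]+)', grocery_uom, maxsplit=1)

print(literal)        # ['5kg']
print(without_group)  # ['5', '']
print(with_group)     # ['5', 'kg', '']
```

The trailing empty string is the text that follows the separator, which is why indexing [0] and [1] worked in the question's final edit.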

You need a regex split rather than a plain string split, and the precise pattern to split on is this:
(?<=\d)(?=[a-zA-Z]+)
It matches the empty position that is preceded by a digit, hence (?<=\d), and followed by letters, hence (?=[a-zA-Z]+). Splitting on an empty match requires Python 3.7 or later.
Here is your modified Python code:
import re
grocery_uom = '5kg'
unit_weight, uom = re.split(r'(?<=\d)(?=[a-zA-Z]+)', grocery_uom, 1)
print('unit_weight:', unit_weight, 'uom:', uom)
Prints:
unit_weight: 5 uom: kg
Also, if there can be optional whitespace between the number and the unit, you can use this regex instead, which also consumes the whitespace during the split:
(?<=\d)\s*(?=[a-zA-Z]+)
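A short sketch of the lookaround split with optional whitespace, assuming Python 3.7 or later (earlier versions do not split on empty matches). The names BOUNDARY and split_quantity are made up for this example.

```python
import re

# Position after a digit and before a letter, swallowing any whitespace between.
BOUNDARY = re.compile(r'(?<=\d)\s*(?=[a-zA-Z])')

def split_quantity(text):
    """Split '5kg' or '5 kg' into ('5', 'kg'); raises ValueError if no boundary exists."""
    number, unit = BOUNDARY.split(text, maxsplit=1)
    return number, unit

print(split_quantity('5kg'))
print(split_quantity('5 kg'))
print(split_quantity('1,000 g'))
```

Because the lookbehind only requires one digit before the boundary, separators inside the number such as the comma in "1,000" are left alone.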

*updated to allow for bigger numbers, such as "1,000"
Try this.
import re
grocery_uom = '5kg'
split_str = re.split(r'([0-9,]+)([a-zA-Z]+)', grocery_uom, 1)
unit_weight, uom = split_str[1:3]
print(unit_weight, uom)
## Output: 5 kg

How to extract set of substrings from a paragraph of string

Say I have a string:
output='[{ "id":"b678792277461" ,"Responses":{"SUCCESS":{"sh xyz":"sh xyz\\n Name Age Height Weight\\n Ana \\u003c15 \\u003e 163 47\\n 43\\n DEB \\u003c23 \\u003e 155 \\n Grey \\u003c53 \\u003e 143 54\\n 63\\n Sch#"},"FAILURE":{},"BLACKLISTED":{}}}]'
This is just an example; the real output is much longer and comes from an API call.
I want to extract all the names (Ana, DEB, Grey) and put them in a separate list.
How can I do it?
json_data = json.loads(output)
json_data = [{'id': 'b678792277461', 'Responses': {'SUCCESS': {'sh xyz': 'sh xyz\n Name Age Height Weight\n Ana <15 > 163 47\n 43\n DEB <23 > 155 \n Grey <53 > 143 54\n 63\n Sch#'}, 'FAILURE': {}, 'BLACKLISTED': {}}}]
1) I have tried re.findall('\\n(.+)\\u',output)
but this didn't work because it says "incomplete sequence u"
2)
start = output.find('\\n')
end = output.find('\\u', start)
x=output[start:end]
But I couldn't figure out how to run this piece of code in loop to extract names
Thanks

\u is not a character you can match on its own; it is the start of a Unicode escape sequence (\u003c is <), which is why re complains about an incomplete escape. After json.loads the escapes are already decoded, so you can work on the plain text. The following regex works, but it is kind of quirky. It captures the first word of each line after the first, up to the first whitespace.
output = json_data[0]['Responses']['SUCCESS']['sh xyz']
pattern = r"\n\s*([a-z]+)\s+"
result = re.findall(pattern, output, re.I)
#['Name', 'Ana', 'DEB', 'Grey']
Explanation of the pattern:
start at a new line (\n)
skip all spaces, if any (\s*)
collect one or more letters ([a-z]+)
skip at least one space (\s+)
Unfortunately, "Name" is also recognized as a name. If you know that it is always present in the first line, slice the list of the results:
result[1:]
#['Ana', 'DEB', 'Grey']
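An alternative sketch that walks the decoded table line by line instead of using one regex over the whole text. The raw string below is a stand-in for the API response in the question, and skipping the first two lines assumes the command echo and header row always come first.

```python
import json
import re

# Stand-in for the API response from the question (not the real payload).
raw = (
    r'[{"id": "b678792277461", "Responses": {"SUCCESS": {"sh xyz": '
    r'"sh xyz\n Name Age Height Weight\n Ana \u003c15 \u003e 163 47\n 43\n'
    r' DEB \u003c23 \u003e 155 \n Grey \u003c53 \u003e 143 54\n 63\n Sch#"},'
    r' "FAILURE": {}, "BLACKLISTED": {}}}]'
)

# json.loads decodes \n and \u003c, so the table is plain text afterwards.
table = json.loads(raw)[0]["Responses"]["SUCCESS"]["sh xyz"]

names = []
for line in table.splitlines()[2:]:  # skip the command echo and the header row
    match = re.match(r"\s*([A-Za-z]+)\s", line)  # a word followed by whitespace
    if match:
        names.append(match.group(1))

print(names)
```

Continuation lines such as " 43" and the trailing prompt "Sch#" do not start with a word followed by whitespace, so they are skipped.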

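I use regexr.com to experiment with a regular expression until I get it right, and then convert it to Python.
https://regexr.com/
I'm assuming the \n is the newline character here. Your \u error comes from the pattern rather than the data: in a Python regex, \u must be followed by four hex digits, so a bare \u is an incomplete escape. To match at line boundaries in Python, anchor with ^ and $ and pass the re.MULTILINE flag when you compile.
\n(.*)\n - since . does not match a newline by default, this captures a single line sitting between two newlines. It also consumes the trailing \n, so findall would skip every other line; anchoring with ^ and $ under re.MULTILINE avoids that.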
[{ "id":"678792277461" ,"Responses": {Name Age Height Weight\n Ana \u00315 \u003163 47\n 43\n Deb \u00323 \u003155 60 \n Grey \u00353 \u003144 54\n }]
import re
# source is the decoded table text, e.g. json_data[0]['Responses']['SUCCESS']['sh xyz']
a = re.compile(r"^(.*)$", re.MULTILINE)
lines = a.findall(source)
# each element of lines is one line of the table, e.g. " Ana <15 > 163 47"

regex matching multiple repeating groups

I have the following string:
s = " 3434 garbage workorders: 138 waiting, 2 running, 3 failed, 134 completed"
I would like to parse the statuses and counts after "workorders". I've tried the following regex:
r = r"workorders:( (\d+) (\w+),?)*"
but this only returns the last group. How can I return all groups?
p.s. I know I could do this in python, but was wondering if there's a pure regex solution
>>> s = " 3434 garbage workorders: 138 waiting, 2 running, 3 failed, 134 completed"
>>> r = r"workorders:( (\d+) (\w+),?)*"
>>> re.findall(r, s)
[(' 134 completed', '134', 'completed')]
>>>
output should be close to
[('138', 'waiting'), ('2', 'running'), ('3', 'failed'), ('134', 'completed')]

For the text in the example, you could try it like this:
(?:(\d+) (\w+)(?=,|$))+
Explanation
A non capturing group (?:
A capturing group for one or more digits (\d+)
A white space
A capturing group for one or more word characters (\w+)
A positive lookahead which asserts that what follows is either a comma or the end of the string (?=,|$)
Close the non capturing group and repeat that one or more times )+
That would give you:
[('138', 'waiting'), ('2', 'running'), ('3', 'failed'), ('134', 'completed')]
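The lookahead approach can be checked with re.findall as in the sketch below. The outer (?:...)+ wrapper is left out here as a simplification, because findall already repeats the search across the string; the counts dict is an illustrative extra.

```python
import re

s = " 3434 garbage workorders: 138 waiting, 2 running, 3 failed, 134 completed"

# Each pair must be followed by a comma or the end of the string,
# which rules out "3434 garbage".
pairs = re.findall(r"(\d+) (\w+)(?=,|$)", s)
counts = {status: int(number) for number, status in pairs}

print(pairs)
print(counts)
```

A repeated capturing group only keeps its last repetition, which is why the original pattern returned just the final pair; letting findall do the repetition sidesteps that.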

This should work for your particular case, since every pair follows either the colon or a comma:
re.findall(r'[:,] (\d+) (\w+)', s)

In my experience, it is better to apply a regex only after you have processed the string as much as possible; a regex on an arbitrary string tends to cause headaches.
In your case, try splitting on ':' (or even 'workorders:') and keeping the part after it, which holds only the counts and statuses. After that, it's easy to get the count for each status.
s = " 3434 garbage workorders: 138 waiting, 2 running, 3 failed, 134 completed"
statuses = s.split(':') # [' 3434 garbage workorders', ' 138 waiting, 2 running, 3 failed, 134 completed']
statusesStr = statuses[1] # ' 138 waiting, 2 running, 3 failed, 134 completed'
statusRe = re.compile(r"(\d+)\s*(\w+)")
statusRe.findall(statusesStr) # [('138', 'waiting'), ('2', 'running'), ('3', 'failed'), ('134', 'completed')]
Edit: changed the expression to meet the desired outcome and to be more robust.

An answer that only matches pairs that come after the colon or a comma:
re.findall(r'(?<=[:,] )\d+ \w+', s)

This will give you your output exactly.
pairs = re.findall(r'(\d+) ([A-Za-z]+)', s.split("workorders:")[1])
You can then turn this into a dict.
x = {v: int(k) for k, v in pairs}

Parsing string outside parenthetical expression

I have the following text:
s1 = 'Promo Tier 77 (4.89 USD)'
s2 = 'Promo (11.50 USD) Tier 1 Titles Only'
From this I want to pull out the number that is not included in the parenthetical. It would be:
s1 --> '77'
s2 --> '1'
I am currently using the weak regex re.findall('\s\d+\s',s1). What would be the correct regex? Something like re.findall('\d+',s1) but excluding anything within the parenthetical.
>>> re.findall('\d+',s1)
['77', '4', '89'] # two of these numbers are within the parenthetical.
# I only want '77'

One approach I find useful is to use the alternation operator: put what you want to exclude on the left side (telling the engine to throw it away) and put what you want to match in a capturing group on the right side.
Then combine this with filter or a list comprehension to remove the empty strings that findall returns whenever the left side of the alternation matches.
>>> import re
>>> s = """Promo (11.50 USD) Tier 1 Titles Only
Promo (11.50 USD) (10.50 USD, 11.50 USD) Tier 5
Promo Tier 77 (4.89 USD)"""
>>> list(filter(None, re.findall(r'\([^)]*\)|(\d+)', s)))
['1', '5', '77']
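The same alternation trick, wrapped in a function and run on the two strings from the question. This is a sketch; numbers_outside_parens is a hypothetical helper name.

```python
import re

def numbers_outside_parens(text):
    # The left branch consumes a whole "(...)" group and captures nothing,
    # so findall yields "" for it; the right branch captures a number.
    found = re.findall(r"\([^)]*\)|(\d+)", text)
    return [number for number in found if number]

print(numbers_outside_parens('Promo Tier 77 (4.89 USD)'))
print(numbers_outside_parens('Promo (11.50 USD) Tier 1 Titles Only'))
```

The order of the branches matters: the parenthetical has to be tried first so its digits are consumed before the (\d+) branch can see them.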

You could make a temporary string that has the parenthesis section removed, then run your code. I used a space so that numbers before and after the missing string section can't be joined.
>>> import re
>>> s = 'Promo Tier 77 (11.50 USD) Tier 1 Titles Only'
>>> temp = re.sub(r'\(.*?\)', ' ', s)
>>> print(temp)
Promo Tier 77   Tier 1 Titles Only
>>> re.findall(r'\d+', temp)
['77', '1']
And you could of course shorten this to a single line.
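One possible single-line form, shown as a sketch. The function name digits_outside is an assumption, and [^)]* is used in place of .*? so that a parenthetical never runs past its own closing parenthesis.

```python
import re

def digits_outside(text):
    # Replace each "(...)" group with a space, then collect the digit runs.
    return re.findall(r"\d+", re.sub(r"\([^)]*\)", " ", text))

result = digits_outside('Promo Tier 77 (11.50 USD) Tier 1 Titles Only')
print(result)
```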

Do some splitting on your strings. For example:
import re
s1 = "Promo Tier 77 (4.89 USD)"
for ss in s1.split(")"):
    # the text before the open parenthesis lies outside the parentheses
    outside = ss.split("(")[0]
    for number in re.findall(r"\d+", outside):
        print(number)

(\b\d+\b)(?=(?:[^()]*\([^)]*\))*[^()]*$)
Try this. Grab the capture. See the demo.
http://regex101.com/r/gT6kI4/7
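A quick check of this lookahead pattern on the two strings from the question. In this sketch the capturing group is dropped so that findall returns the whole match; the name OUTSIDE is illustrative.

```python
import re

# A number counts only if the rest of the string consists of text and
# complete "(...)" groups, i.e. the number is not inside an open parenthesis.
OUTSIDE = re.compile(r"\b\d+\b(?=(?:[^()]*\([^)]*\))*[^()]*$)")

s1 = 'Promo Tier 77 (4.89 USD)'
s2 = 'Promo (11.50 USD) Tier 1 Titles Only'

print(OUTSIDE.findall(s1))
print(OUTSIDE.findall(s2))
```

For a number inside a parenthetical, the remainder of the string hits an unmatched ")" before the end, so the lookahead fails.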
