Understanding a regular expression in a series extract function in pandas - python

I have the following code:
import pandas as pd
s = pd.Series(['toy story (1995)', 'the pirates (2014)'])
print(s.str.extract(r'.*\((.*)\).*', expand=True))
with output:
0
0 1995
1 2014
I understand that the extract function is pulling the values between the parentheses for both elements of the series. However, I do not understand how. What exactly does '.*\((.*)\).*' mean? I think that the asterisks represent wildcard characters, but beyond that I am quite confused as to what is actually going on with this expression.

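.*\( matches everything up to and including an opening ( (since .* is greedy, this is the last ( that still lets the rest of the pattern match)
\).* matches everything from the closing ) to the end
(.*) captures everything between those two matches, and that captured group is what extract returns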

.* Match any number of characters
\( Match one opening parenthesis
(.*) Match any number of characters into the first capturing group
\) Match a closing parenthesis
.* Match any number of characters
This notation is called a regular expression. pandas' str.extract takes a regex and returns whatever its capturing groups match, so here only the text inside the parentheses comes back.
You can learn more about regexes at the Wikipedia page.
Here's a test example using your regex.
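A minimal sketch of such a test, using Python's built-in re module rather than pandas (str.extract applies the same regex syntax to each element). The third title is a made-up input, added only to show what the greedy leading .* does when there are two pairs of parentheses.

```python
import re

pattern = re.compile(r'.*\((.*)\).*')

titles = ['toy story (1995)', 'the pirates (2014)']
years = [pattern.match(title).group(1) for title in titles]
print(years)  # ['1995', '2014']

# Hypothetical title with two parenthesised parts: the leading .* is greedy,
# so the capture comes from the LAST pair of parentheses.
last = pattern.match('heat (remastered) (1995)').group(1)
print(last)  # 1995
```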

Related

Matching two identical groups of characters in pandas with some number of characters in between

I'm trying to extract two numbers of interest from a string of docket text in a pandas dataframe. Here's an example with a couple of the idiosyncrasies that exist in the data
import pandas as pd
df = pd.DataFrame(["Fee: $ 15,732, and Expenses: $1,520.62."])
I used regexr to test some ideas and the closest I've been able to come up with is something along the lines of
df[0].str.extract("(\${0,2}\s*(\d+[,\.]*){1,5})")
Which returns:
0 1
0 $15,732,, 732,,
The problems I'm running into are making characters optional while capturing the groups (i.e. I don't know how to get rid of the inner parentheses, because if I change them to brackets I get an error). And then ideally I'd be able to match the other set of numbers too.
I used regexr and while I can make regular expressions that match what I want, I'm struggling with the grouping part so that I can capture both while not needing to use a cumbersome function like apply with re.
There are sometimes numbers that show up again later in the report that include dates, other numbers, etc... So I'm trying to find a pretty controlled sequence (Can't get too liberal with the .*'s haha)
The string I ended up writing after the hint provided in the comments is:
\$((?:\d+(?:[,\.])*)+).*?\$((?:\d+(?:[,\.])*)+). Non-capturing groups are what I hadn't understood before. I thought a non-capturing group would somehow remove the parts it matched from the enclosing group, but really it just means a group of characters that doesn't count as a capture group (its text is not removed from the enclosing group).
I appreciate the feedback I got on this post!
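The difference between the two kinds of group can be sketched with Python's re module. The patterns here are simplified illustrations, not the exact pattern from the post, and the sample amount is taken from the docket string.

```python
import re

text = "$1,520.62"  # sample amount from the docket string

# Inner group is capturing: it is reported as a second group, holding only
# the text from its last repetition.
with_inner = re.search(r"\$((\d+[,.]?)+)", text).groups()
print(with_inner)  # ('1,520.62', '62')

# Inner group is non-capturing (?:...): it still groups for the + quantifier
# but is not reported, and the outer capture is unchanged.
without_inner = re.search(r"\$((?:\d+[,.]?)+)", text).groups()
print(without_inner)  # ('1,520.62',)
```

In str.extract each capturing group becomes a column, so switching the inner group to (?:...) is what removes the unwanted second column.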
I am not sure if the text stays the same across all of the values but you can use the following regex:
r'Fee: \$\s?([\d,.]+), and Expenses:\s*\$\s?([\d,.]+)\.'
returning two capturing groups:
15,732
1,520.62
You can also abstract the text (the middle group is non-capturing so that only the two amounts are returned):
r'\w+:\s*\$\s?([\d,.]+),(?:\s*\w+)+:\s*\$\s?([\d,.]+)\.'
with the same result.
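As a rough check, the first pattern can be run against the sample string with the re module. The conversion to floats at the end is an assumed follow-up step and is not part of the original answer.

```python
import re

text = "Fee: $ 15,732, and Expenses: $1,520.62."
pattern = re.compile(r'Fee: \$\s?([\d,.]+), and Expenses:\s*\$\s?([\d,.]+)\.')

fee, expenses = pattern.search(text).groups()
print(fee, expenses)  # 15,732 1,520.62

# Assumed follow-up: drop the thousands separators to get numbers.
amounts = [float(value.replace(",", "")) for value in (fee, expenses)]
print(amounts)  # [15732.0, 1520.62]
```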
You can use
df[0].str.extract(r"(\$\s*\d+(?:[,.]\d+)*)") # To get the first value
df[0].str.extractall(r"(\$\s*\d+(?:[,.]\d+)*)") # To get all values
df[0].str.findall(r"\$\s*\d+(?:[,.]\d+)*") # To get all values
The str.extract pattern is wrapped in a capturing group because the method requires at least one capturing group in the regex pattern; the groups are what it returns.
The regex matches
\$ - a $ char
\s* - zero or more whitespaces
\d+ - one or more digits
(?:[,.]\d+)* - a non-capturing group matching zero or more repetitions of a comma/dot and then one or more digits.
See the regex demo.
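The same pattern can be tried outside pandas with the re module. This is only a sketch on the single sample string from the question.

```python
import re

text = "Fee: $ 15,732, and Expenses: $1,520.62."
amount = re.compile(r"\$\s*\d+(?:[,.]\d+)*")

first = amount.search(text).group()
print(first)  # $ 15,732

# No capturing group in the pattern, so findall returns the whole matches.
every = amount.findall(text)
print(every)  # ['$ 15,732', '$1,520.62']
```

The trailing comma and the final full stop are left out because [,.] must be followed by at least one digit.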

Python regex conditional, don't match if

Sorry for the somewhat unhelpful title, I'm having a really hard time explaining this issue.
I have a list of unique identifiers that can appear in a number of different ways and I'm trying to use regex to normalize them so I can compare across several databases. Here are some examples of them:
AB1201
AB-1201
AB1201-T
AB-12-01L1
AB1201-TER
AB1201 Transit
I've written a line of code that pulls out all hyphens and spaces, and then used this regex:
([a-zA-Z]{2}[\d]{4})(L\d|Transit|T$)?
This works exactly as expected, returning a list looking like this:
AB1201
AB1201
AB1201T
AB1201L1
AB1201
AB1201T
The issue is, I have one identifier that looks like this: AB1201-02. I need this to be raised as an exception, and not included as a match.
Any ideas? I'm happy to provide more clarification if necessary. Thanks!
From Regex101 online tester
You can exclude matching the following hyphen and a digit (?!-\d) using a negative lookahead.
If it should start at the beginning of the string, you could use an anchor ^
Note that you could write [\d] as \d
^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?
The pattern will look like
^ Start of string
( Capture group 1
[a-zA-Z]{2}\d{4} Match 2 times a-zA-Z and 4 digits
) Close group
(?!-\d) Negative lookahead, assert what is directly to the right is not - and a digit
(L\d|Transit|T$)? Optional capture group 2
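A small sketch of how this pattern behaves. The sample list is an assumption: the already-normalised forms from the question plus the problematic AB1201-02 with its hyphen still in place.

```python
import re

pattern = re.compile(r'^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?')

samples = ['AB1201', 'AB1201T', 'AB1201L1', 'AB1201Transit', 'AB1201-02']
results = {}
for sample in samples:
    match = pattern.match(sample)
    # groups() gives (base identifier, optional suffix); None means rejected.
    results[sample] = match.groups() if match else None
    print(sample, '->', results[sample])
```

Only AB1201-02 comes back as None, so it can be raised as an exception by the calling code. The lookahead only helps if the hyphen has not been stripped from the identifier before the regex runs.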
Try this regular expression
^([a-zA-Z]{2}[\d]{4})(?!-\d)(L\d|Transit|T|-[A-Z]{3})?$
I have added the (?!...) Negative Lookahead to avoid matching with the -02.
(?!...) Negative Lookahead: Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.
You can view a demo on this link.
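A quick sketch of this stricter variant, which anchors both ends of the string. The helper name is_valid is hypothetical, and [\d] is written as \d.

```python
import re

strict = re.compile(r'^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T|-[A-Z]{3})?$')

def is_valid(identifier):
    """Return True if the whole identifier fits the expected shape."""
    return strict.match(identifier) is not None

checks = {s: is_valid(s)
          for s in ['AB1201', 'AB1201T', 'AB1201-TER', 'AB1201-T', 'AB1201-02']}
print(checks)
```

Because of the trailing $, anything left over after the optional suffix makes the match fail, so AB1201-T is rejected here as well as AB1201-02.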

Extracting the inside of an expression using REGEX

I currently have this regular expression that I use to match the result of an SQL query: [^\\n]+(?=\\r\\n\\r\\n\(1 rows affected\)). However, it is not working as intended....
'\r\n----------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------\r\nCS: GPS
on Date.
\r\n\r\n(1 rows affected)\r\n'
What I get from the expression above is Date whereas I would want to match CS: GPS on Date. It's fine if there are leading and trailing spaces... Nothing Python's strip() can't handle. How do I change my regular expression so that the match is done properly?
Thanks in advance.
Edit: The Python version I am using is Python 3.6
You get your current match because the character class [^\\n]+ matches 1+ times any char except \ or n.
Then the positive lookahead asserts what is on the right is \r\n\r\n(1 rows affected) which results in matching Date.
See https://regex101.com/r/wDzq8l/1
You could use a non greedy .+? in a capturing group and match what follows instead of using a positive lookahead.
In the code use re.DOTALL to let the dot match a newline.
-\\r\\n(.+?) ?\\r\\n\\r\\n\(\d+ rows affected\)
Regex demo
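A minimal sketch of the non-greedy capture with re.DOTALL. It assumes the query result holds real carriage-return and line-feed characters, not the escaped text shown in the question, so the pattern uses \r and \n directly. The sample string is a shortened stand-in for the real output.

```python
import re

# Assumed sample: real CR/LF characters and a shortened row of dashes.
raw = ('\r\n' + '-' * 40 + '\r\nCS: GPS\r\non Date. '
       '\r\n\r\n(1 rows affected)\r\n')

# DOTALL lets . cross the line break inside the value being captured.
pattern = re.compile(r'-\r\n(.+?) ?\r\n\r\n\(\d+ rows affected\)', re.DOTALL)

value = pattern.search(raw).group(1)
print(repr(value))  # 'CS: GPS\r\non Date.'
```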
Maybe, some expression similar to:
-{5,}\s*([A-Za-z][^.]+\.)
would extract that or somewhat similar to that.
Test
import re
regex = r'-{5,}\s*([A-Za-z][^.]+\.)'
string = '''
----------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------
CS: GPS
on Date.
\r\n\r\n(1 rows affected)\r\n
'''
print(re.findall(regex, string, re.DOTALL))
Output
['CS: GPS\non Date.']
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. That page also shows how the expression matches against some sample inputs.

Python: Extracting URLs using regex or other means

I’m stumped on a problem. I have a large data frame where two of the columns are like this:
pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'], ['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
What I’m trying to do is keep only the URL that includes the word “twitter” in each cell and remove the rest. The URLs I want always include the word “twitter” and end with “/” followed by a one-digit number. Where there are two identical URLs in the same cell, only one should remain. Like this:
Test2 = pd.DataFrame([['a', 'https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
Test2
I’m new to Python and after a lot of googling I’ve started to understand that something called regex is the answer but that is as far as I come. One of the postings here at Stackoverflow led me to regex101.com and after playing around this is as far as I’ve come and it doesn't work:
r'^[https]+(:)(//)(.*?)(/)(\d)'
Can anyone tell me how to solve this problem?
Thanks in advance.
Regular expressions are certainly handy for such tasks. Refer to this question and online tools such as regex101 to learn more.
Your current pattern is incorrect because:
^ Matches the following pattern at the start of string.
[https]+ This is a character set: it matches one or more of the letters h, t, p, s in any order (h, s, ps, ...), not just the strings http and https, which is what you are after.
(:) You don't need to put this : in a capturing group here.
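(//) In Python, / has no special meaning in a regex, so escaping it as \/ is optional (it is only required in languages that use / as the regex delimiter). No need for a capturing group here either.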
(.*?) The .*? combo is often misused when a negated character set [^] could be used instead.
(/) As discussed above.
(\d) Matches and captures a digit. The capturing group here is also redundant for your task.
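The point about the character set can be checked quickly with the re module. The test string sshhtp is a made-up input.

```python
import re

# [https]+ is a character set: any run of the letters h, t, p, s matches.
charset_hit = bool(re.fullmatch(r'[https]+', 'sshhtp'))
print(charset_hit)  # True

# https? is the literal text "http" followed by an optional "s".
scheme = re.compile(r'https?')
literal_hits = [bool(scheme.fullmatch(s)) for s in ('http', 'https', 'sshhtp')]
print(literal_hits)  # [True, True, False]
```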
You may use the following expression:
https?:\/\/twitter\.com[^,]+(?<=\/\d$)
https? Matches literal substrings http or https.
:\/\/twitter\.com Matches literal substring ://twitter.com.
[^,]+ Anything that is not a comma, one or more.
(?<=\/\d$) Positive lookbehind. Assert that a / followed by a digit \d is present at the end of the string $.
Regex demo here.
Python demo:
import pandas as pd
df = pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
df['URLs'] = df['URLs'].str.findall(r"https?:\/\/twitter\.com[^,]+(?<=\/\d$)").str[0]
print(df)
Prints:
ID URLs
0 a https://twitter.com/dog_rates/status/890971913173991426/photo/1
1 b https://twitter.com/dog_rates/status/890971913173991426/photo/1
2 c https://twitter.com/dog_rates/status/890971913173991430/video/1
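For comparison, here is an alternative sketch that does not rely on a lookbehind: split each cell on commas, keep the pieces that fully match a twitter URL ending in a slash and one digit, and drop duplicates. The function name twitter_urls and the simplified pattern are assumptions made for illustration. It uses plain re on the strings from the question rather than the DataFrame.

```python
import re

TWITTER_URL = re.compile(r'https?://twitter\.com/[^,]+/\d')

def twitter_urls(cell):
    """Return the distinct twitter URLs in a comma-separated cell, in order."""
    kept = []
    for url in cell.split(','):
        if TWITTER_URL.fullmatch(url) and url not in kept:
            kept.append(url)
    return kept

cells = [
    'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1',
    'https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1',
    'https://twitter.com/dog_rates/status/890971913173991430/video/1',
]
for cell in cells:
    print(twitter_urls(cell))
```

On a DataFrame, a function like this could be passed to df['URLs'].apply, which yields a list per cell.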

Extract string between two strings in pandas

I have a text column that looks like:
http://start.blabla.com/landing/fb603?&mkw...
I want to extract "start.blabla.com"
which is always between:
http://
and:
/landing/
namely:
start.blabla.com
I do:
df.col.str.extract('http://*?\/landing')
But it doesn't work.
What am I doing wrong?
Your regex matches http:/, then 0+ / symbols as few as possible and then /landing.
You need to match and capture the characters after http:// that are not /, one or more times (the extract method requires a regular expression with at least one capture group). It can be done with
http://([^/]+)/landing
       ^^^^^^^
where [^/]+ is a negated character class that matches 1+ occurrences of characters other than /.
See the regex demo
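Both patterns can be compared on the sample URL with the re module. This is a sketch on one string, exactly as it appears (truncated) in the question.

```python
import re

url = 'http://start.blabla.com/landing/fb603?&mkw...'  # sample as shown, truncated

# The original pattern only allows extra "/" characters between "http:/"
# and "/landing", so it finds nothing here.
original = re.search(r'http://*?\/landing', url)
print(original)  # None

# Capturing everything up to the next "/" yields the host.
host = re.search(r'http://([^/]+)/landing', url).group(1)
print(host)  # start.blabla.com
```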
Just to answer a question you didn't ask, if you wanted to extract several portions of the string into separate columns, you'd do it this way:
df.col.str.extract('http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)')
You'd get something along the lines of:
Site RestUrl
0 start.blabla.com fb603?&mkw...
To understand how this regex (and any other regex for that matter) is constructed I suggest you take a look at the excellent site regex101. I constructed a snippet where you can see the above regex in action here.
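The named groups can also be inspected with plain re, as a rough equivalent of the column names that str.extract produces. The URL is the truncated sample from the question.

```python
import re

url = 'http://start.blabla.com/landing/fb603?&mkw...'  # sample as shown, truncated
match = re.search(r'http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)', url)

parts = match.groupdict()
print(parts)  # {'Site': 'start.blabla.com', 'RestUrl': 'fb603?&mkw...'}
```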
