How to grab a string inside a pandas dataframe using a regex - python

I am trying to regex out a certain string inside my pandas df.
Say I have a df like so:
                   a  b
0  foo foo AA123 bar  4
1  foo foo BB245 bar  5
2  foo CA234 bar bar  5
How would I get this df:
       a  b
0  AA123  4
1  BB245  5
2  CA234  5
One method I tried was df.replace({'(\w{3}\d{3})': ?}), but I wasn't sure what to put for the second parameter.

You could use the regex-based Series.str.extract function to keep just the matching group. Your regex also needs a fix: the codes are two letters followed by three digits, so the cardinality for the \w element should be 2, not 3. In the end the code would be:
df["a"] = df["a"].str.extract(r'(\w{2}\d{3})', expand=False)
The expand=False indicates you don't want str.extract to return a DataFrame, which it does by default in order to accommodate multiple regex groups (it returns one column per group). Since you know there is just one group here, expand=False gives you back a Series you can assign directly to df["a"]. If there were more than one group, the function would return a DataFrame no matter what you specified for expand, and you would index into it to get the column/group you wanted.
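For completeness, here is a minimal runnable sketch of the whole round trip; the DataFrame construction is an assumption based on the sample above:
import pandas as pd

df = pd.DataFrame({
    "a": ["foo foo AA123 bar", "foo foo BB245 bar", "foo CA234 bar bar"],
    "b": [4, 5, 5],
})

# keep only the two-letters-three-digits code from each cell
df["a"] = df["a"].str.extract(r'(\w{2}\d{3})', expand=False)
print(df)
#        a  b
# 0  AA123  4
# 1  BB245  5
# 2  CA234  5
If you'd rather stay close to your replace idea, str.replace with a back-reference also works, e.g. df["a"].str.replace(r'.*?(\w{2}\d{3}).*', r'\1', regex=True), though extract is more direct.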

Related

extract values from column in dataframe

I have the following dataframe:
A
url/3gth33/item/PO151302
url/3jfj6/item/S474-3
url/dfhk34j/item/4964114989191
url/sdfkj3k4/place/9b81f6fd
url/as3f343d/thing/ecc539ec
I'm looking to extract anything with /item/ and its subsequent value.
The end result should be:
item
/item/PO151302
/item/S474-3
/item/4964114989191
Here is what I've tried:
df['A'] = df['A'].str.extract(r'(/item/\w+\D+\d+$)')
This returns what I need except for the integer-only values.
Based on the regex docs I'm reading, this should grab all instances.
What am I missing here?
Use /item/.+ to match /item/ and everything after it. Also, if you put ?P<foo> at the beginning of a group, e.g. (?P<foo>...), the column for that group in the returned dataframe of captures will be named after what's inside the <...>. Rows without a match come back as NaN, which is why dropna() is chained on to discard the /place/ and /thing/ rows:
item = df['A'].str.extract('(?P<item>/item/.+)').dropna()
Output:
>>> item
                  item
0       /item/PO151302
1         /item/S474-3
2  /item/4964114989191
This is not a regex solution, but it can come in handy in some situations. It relies on the fact that multiplying a string by True leaves it unchanged, while multiplying by False turns it into an empty string:
keyword = "/item/"
df["item"] = ((keyword + df["A"].str.split(keyword).str[-1])
              * df["A"].str.contains(keyword))
which returns
                                A                 item
0        url/3gth33/item/PO151302       /item/PO151302
1           url/3jfj6/item/S474-3         /item/S474-3
2  url/dfhk34j/item/4964114989191  /item/4964114989191
3     url/sdfkj3k4/place/9b81f6fd
4     url/as3f343d/thing/ecc539ec
5
And in case you want only the rows where item is not empty you could use
df[df["item"].ne("")][["item"]]

using rsplit on pandas dataframe column to separate based on second instance of a delimiter

I have a column of a pandas dataframe that I would like to split and expand into a new dataframe based on the second instance of a delimiter. I was splitting based on the last instance of the delimiter, but unfortunately there are a handful of instances in ~80k rows that have 4 '_' instead of 3.
For example, I have a dataframe with multiple columns where the one I would like to split into a new dataframe looks like the following:
df.head()
                   gene
0  NM_000000_foo_blabla
1         NM_000001_bar
and I want to split & expand it such that it separates to this:
(Desired)
df2.head()
        col1        col2
0  NM_000000  foo_blabla
1  NM_000001         bar
Using my current code:
df2 = df['gene'].str.rsplit('_', 1, expand=True).rename(lambda x: f'col{x + 1}', axis=1)
I get this:
(Actual)
df2.head()
            col1    col2
0  NM_000000_foo  blabla
1      NM_000001     bar
Is there a simple way to achieve this by modifying the line of code I'm already using? I tried playing with the number of splits in rsplit but couldn't achieve the result I was looking for. Thanks!
Since your data seems to be fairly well defined, you can extract on the second instance of the delimiter using a regular expression.
df['gene'].str.extract(r'(?:[^_]+_){2}(.*)')
            0
0  foo_blabla
1         bar
You can generalize this to be any delimiter, and match it any number of times using a simple function:
def build_regex(delimiter, num_matches=1):
    return rf'(?:[^{delimiter}]+{delimiter}){{{num_matches}}}(.*)'
>>> build_regex('_', 2)
'(?:[^_]+_){2}(.*)'
>>> df['gene'].str.extract(build_regex('_', 2))
            0
0  foo_blabla
1         bar
>>> df['gene'].str.extract(build_regex('_', 3))
        0
0  blabla
1     NaN
Regex Explanation
(?:          # non-capturing group
  [^_]+      #   match anything but _, one or more times
  _          #   match _
){2}         # match this group 2 times
(            # start of capture group 1
  .*         #   match anything, greedily
)            # end of capture group 1
If there isn't guaranteed to be text before each of the first two delimiters, you can also make the negated character class match zero or more times:
(?:[^_]*_){2}(.*)
Alternatively, replace the 2nd '_' with a custom delimiter and split on it (newer pandas versions require regex=True for str.replace to treat the pattern as a regular expression):
df.gene.str.replace(r'([^_]+_[^_]+)_', r'\1|', regex=True).str.split('|', expand=True)
Out[488]:
           0           1
0  NM_000000  foo_blabla
1  NM_000001         bar
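For reference, a non-regex sketch of the same idea using str.split with a split limit, assuming the prefix always consists of exactly two '_'-separated fields:
# split at most twice from the left, then rejoin the first two pieces
parts = df['gene'].str.split('_', n=2, expand=True)
df2 = (parts[0] + '_' + parts[1]).to_frame('col1').assign(col2=parts[2])
print(df2)
#         col1        col2
# 0  NM_000000  foo_blabla
# 1  NM_000001         bar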

Pattern Match in List of Strings, Create New Column in pandas

I have a pandas dataframe with the following general format:
id,product_name_extract
1,00012CDN
2,14311121NDC
3,NDC37ba
4,47CD27
I also have a list of product codes I would like to match (unfortunately, I have to do NLP extraction, so it will not be a clean match) and then create a new column with the matching list value:
product_name = ['12CDN','21NDC','37ba','7CD2']
id,product_name_extract,product_name_mapped
1,00012CDN,12CDN
2,14311121NDC,21NDC
3,NDC37ba,37ba
4,47CD27,7CD2
I am not too worried about there being collisions.
This would be easy enough if I just needed a True/False indicator using contains and the list values concatenated together with "|" for alternation, but I am a bit stumped now on how I would create a column holding the exact match. Any tips or tricks appreciated!
Since you're not worried about collisions, you can join your product_name list with the | operator, and use that as a regex:
df['product_name_mapped'] = (df.product_name_extract.str
                             .findall('|'.join(product_name))
                             .str[0])
Result:
>>> df
   id product_name_extract product_name_mapped
0   1             00012CDN               12CDN
1   2          14311121NDC               21NDC
2   3              NDC37ba                37ba
3   4               47CD27                7CD2
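If any of the product codes could contain regex metacharacters, a safer variant (a sketch, not part of the original answer) escapes them and uses str.extract instead of findall:
import re

pattern = '(' + '|'.join(map(re.escape, product_name)) + ')'
df['product_name_mapped'] = df.product_name_extract.str.extract(pattern, expand=False)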

How to drop empty rows from a DataFrame when 'pd.notnull' does not work? Python

I have a DataFrame with two columns 'A' and 'B'. My goal is to delete rows where 'B' is empty. Others have recommended to use df[pd.notnull(df['B'])]. For example here: Python: How to drop a row whose particular column is empty/NaN?
However, somehow this does not work in this case. Why not and how to solve this?
         A         B
0   Lorema    Ipsuma
1  Corpusa  Dominusa
2   Loremb
3  Corpusc  Dominusc
4   Loremd
5  Corpuse  Dominuse
This is the desired result:
         A         B
0   Lorema    Ipsuma
1  Corpusa  Dominusa
2  Corpusc  Dominusc
3  Corpuse  Dominuse
Most likely those blank cells contain empty strings, whitespace, tabs or even a \n rather than real NaN values, which is why pd.notnull does not filter them out.
For all those cases, you can strip the values first and then remove the rows, i.e.
df[df.B.str.strip().ne("") & df.B.notnull()]
I believe this should cover all cases.
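An alternative sketch, assuming the blank cells really are empty or whitespace-only strings: normalize them to NaN first, and then the usual dropna works again:
import numpy as np

df = df.replace(r'^\s*$', np.nan, regex=True).dropna(subset=['B']).reset_index(drop=True)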

Pandas add column to df based on list of regex patterns

I have a dataframe that looks like this:
Sentence bin_class
"i wanna go to sleep. too late to take seroquel." 1
"Adam and Juliana are leaving me for 43 days take me with youuuu!" 0
And I also have a list of regex patterns I want to run over these sentences. What I want to do is re.search every pattern in my list against every sentence in the dataframe and create a new column in the dataframe that holds a 1 if there is a matching regex and a 0 otherwise. I have been able to run the regex patterns against the sentences to build a list of matches, but am not sure how to create the new column in the dataframe.
matches = []
for x in df['sentence']:
    for i in regex:
        match = re.search(i, x)
        if match:
            matches.append((x, i))
You can probably use the str.count string method. A small example:
In [25]: df
Out[25]:
                                            Sentence  bin_class
0   i wanna go to sleep. too late to take seroquel.          1
1  Adam and Juliana are leaving me for 43 days ta...          0

In [26]: df['Sentence'].str.count(pat='to')
Out[26]:
0    3
1    0
Name: Sentence, dtype: int64
This method also accepts a regex pattern. If you just want the occurrence and not the count, contains is probably enough:
In [27]: df['Sentence'].str.contains(pat='to')
Out[27]:
0     True
1    False
Name: Sentence, dtype: bool
So with this you can loop through your regex patterns and then each time add a column with the above.
See the documentation on this for more examples: http://pandas.pydata.org/pandas-docs/stable/text.html#testing-for-strings-that-match-or-contain-a-pattern
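A minimal sketch of that loop; the pattern list and column names are assumptions for illustration:
regex = [r'seroquel', r'\d+ days']  # hypothetical pattern list
for i, pat in enumerate(regex):
    # astype(int) turns the boolean mask into the 1/0 indicator asked for
    df[f'match_{i}'] = df['Sentence'].str.contains(pat).astype(int)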
