Pandas DataFrame - Extract string between two strings and include the first delimiter - python

I have the following strings in a column of a dataframe:
"LOCATION: FILE-ABC.txt"
"DRAFT-1-FILENAME-ADBCD.txt"
I want to extract everything between the word FILE and the ".", but I also want to include the first delimiter. Basically, I am trying to return the following result:
"FILE-ABC"
"FILENAME-ADBCD"
For that I am using the script below:
df['field'] = df.string_value.str.extract('FILE/(.w+)')
But I am not able to return the desired information (always getting NA).
How can I do this?

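You can accomplish this entirely within the regex, without any string slicing:
df['field'] = df.string_value.str.extract(r'(FILE.*(?=\.txt))')
FILE is what the match begins on.
.* grabs any number of characters.
(?=\.txt) is a lookahead assertion: it requires ".txt" to follow but matches without consuming it, so the extension stays out of the result. The dot is escaped so that it matches a literal period rather than any character.
A handy regex tool: https://pythex.org/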
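To see the lookahead at work outside pandas, here is a minimal sketch with the standard re module on the two sample strings from the question. The dot before txt is escaped so it matches a literal period; the variable names are only for illustration.

```python
import re

samples = ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt"]

# (?=\.txt) asserts that ".txt" follows, but does not add it to the match.
pattern = re.compile(r"FILE.*(?=\.txt)")

results = [pattern.search(s).group(0) for s in samples]
print(results)  # ['FILE-ABC', 'FILENAME-ADBCD']
```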

If the strings will always end in .txt then you can try with the following:
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Example:
import pandas as pd
text = ["LOCATION: FILE-ABC.txt","DRAFT-1-FILENAME-ADBCD.txt"]
data = {'index':[0,1],'string_value':text}
df = pd.DataFrame(data)
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Output:
   index                string_value           field
0      0      LOCATION: FILE-ABC.txt        FILE-ABC
1      1  DRAFT-1-FILENAME-ADBCD.txt  FILENAME-ADBCD
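The fixed [:-4] slice depends on every value ending in .txt. A hedged variation for single strings, which slices only when the suffix is actually present, could look like the sketch below. The helper name file_part and the .csv sample are made up for illustration.

```python
import re

def file_part(value, suffix=".txt"):
    """Return the text from 'FILE' onward, minus suffix when it is present."""
    match = re.search(r"FILE.*", value)
    if match is None:
        return None  # no 'FILE' in the string
    text = match.group(0)
    # Slice only when the suffix is really there, unlike a fixed [:-4].
    return text[:-len(suffix)] if text.endswith(suffix) else text

print(file_part("LOCATION: FILE-ABC.txt"))  # FILE-ABC
print(file_part("REPORT FILE-XYZ.csv"))     # FILE-XYZ.csv
```

Such a function could be applied to the column with df['string_value'].apply(file_part) if a per-row Python call is acceptable.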

You can make a capturing group that captures from 'FILE' (inclusive) greedily up to the last period. Or you can make it non-greedy so that it stops at the first period after FILE.
import pandas as pd
df = pd.DataFrame({'string_value': ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt",
                                    "BADFILENAME.foo.txt"]})
# raw strings keep \. as a regex escape for a literal period
df['field_greedy'] = df['string_value'].str.extract(r'(FILE.*)\.')
df['field_not_greedy'] = df['string_value'].str.extract(r'(FILE.*?)\.')
print(df)
                 string_value    field_greedy  field_not_greedy
0      LOCATION: FILE-ABC.txt        FILE-ABC          FILE-ABC
1  DRAFT-1-FILENAME-ADBCD.txt  FILENAME-ADBCD    FILENAME-ADBCD
2         BADFILENAME.foo.txt    FILENAME.foo          FILENAME

Related

Extracting a string from between two strings in a dataframe

I'm trying to extract a value from my data frame.
I have a column ['DESC'] that contains strings in the following format:
_000it_ZZZ$$$-
_0780it_ZBZT$$$-
_011it_BB$$$-
_000it_CCCC$$$-
I want to extract the string between 'it_' and '$$$'
I have tried this code, but it does not seem to work:
# initializing substrings
sub1 = "it_"
sub2 = "$$$"
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]
I would appreciate your help
You can use str.extract to get the desired output in your new column.
import pandas as pd
import re
df = pd.DataFrame({
'DESC' : ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-", "_000it_123$$$-"]
})
pat = r"(?<=it_)(.+)(?=[\$]{3}-)"
df['results'] = df['DESC'].str.extract(pat)
print(df)
DESC results
0 _000it_ZZZ$$$- ZZZ
1 _0780it_ZBZT$$$- ZBZT
2 _011it_BB$$$- BB
3 _000it_CCCC$$$- CCCC
4 _000it_123$$$- 123
You can see the regex pattern on Regex101 for more details.
You could try using a regex pattern. It matches the cases you listed here, but I can't guarantee that it will generalize to all possible patterns.
import re
string = "_000it_ZZZ$$$-"
p = re.compile(r"(?<=it_)(.*)(?<!\W)")
m = p.findall(string)
print(m) # ['ZZZ']
The pattern starts right after it_, grabs the rest of the string, and then backtracks until the match ends on a word character, which drops the trailing $$$-.
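If the delimiters may change, one option is to build the pattern from them and let re.escape handle special characters such as $. The helper below is a minimal sketch; the name between is hypothetical, and it uses a non-greedy group so that it stops at the first closing delimiter.

```python
import re

def between(text, start, end):
    """Return the text between start and end, or None if either is missing."""
    # re.escape makes characters such as "$" match literally.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text)
    return match.group(1) if match else None

rows = ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-"]
extracted = [between(r, "it_", "$$$") for r in rows]
print(extracted)  # ['ZZZ', 'ZBZT', 'BB', 'CCCC']
```

The same pattern string, with its single capture group, should also work with df['DESC'].str.extract.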

Python dataframe : strip part of string, on each column row, if it is in specific format [duplicate]

I have read some pricing data into a pandas dataframe the values appear as:
$40,000*
$40000 conditions attached
I want to strip it down to just the numeric values.
I know I can loop through and apply regex
[0-9]+
to each field then join the resulting list back together but is there a not loopy way?
Thanks
You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit.
You could use pandas' replace method; you may also want to keep the thousands separator ',' and the decimal separator '.':
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
# Assign the result back instead of using inplace=True on a selected column,
# which may not update df in recent pandas versions.
df['pricing'] = df['pricing'].replace(to_replace=r"\$([0-9,.]+).*", value=r"\1", regex=True)
print(df)
     pricing
0  40,000.32
1      40000
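If the cleaned value is needed as a number rather than a string, the thousands separator has to go before conversion. A small sketch for a single value, assuming prices look like the samples above; the helper name to_number is made up for illustration.

```python
import re

def to_number(raw):
    """Pull the first price-like token out of raw and return it as a float."""
    # Digits with optional commas, then an optional decimal part.
    match = re.search(r"[0-9][0-9,]*(?:\.[0-9]+)?", raw)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))

print(to_number("$40,000.32*"))                 # 40000.32
print(to_number("$40000 conditions attached"))  # 40000.0
```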
You could remove all the non-digits using re.sub():
value = re.sub(r"[^0-9]+", "", value)
regex101 demo
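If the column already holds clean numeric strings, you don't need regex for this. convert_objects(convert_numeric=True) was removed in pandas 1.0; pd.to_numeric is its replacement:
df['col'] = pd.to_numeric(df['col'].astype(str), errors='coerce')
Note that values such as '$40,000*' are not valid numbers and become NaN with errors='coerce', so strip the extra characters first.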
In case anyone is still reading this: I'm working on a similar problem and needed to replace an entire column of pandas data using a regex I had figured out with re.sub.
To apply this to my entire column, here's the code.
import re
# add_map holds the replacement rules for the strings in the DataFrame.
add_map = dict([
    ("AV", "Avenue"),
    ("BV", "Boulevard"),
    ("BP", "Bypass"),
    ("BY", "Bypass"),
    ("CL", "Circle"),
    ("DR", "Drive"),
    ("LA", "Lane"),
    ("PY", "Parkway"),
    ("RD", "Road"),
    ("ST", "Street"),
    ("WY", "Way"),
    ("TR", "Trail"),
])
obj = data_909['Address'].copy()  # data_909['Address'] contains the original addresses
for k in add_map:  # apply one rule per key in the dict
    # \b is a word boundary, so k is replaced only when it stands alone as a word
    rule1 = r"\b%s\b" % re.escape(k)
    # the callable receives each match object; upper() lets a lowercase match such as "av" find its key
    rule2 = lambda m: add_map.get(m.group().upper(), m.group())
    # IGNORECASE matches the abbreviations in any case
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE)
data_909['Address_n'] = obj  # store it!
Hope this helps anyone searching for the problem I had. Cheers
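For what it's worth, the callable passed as the replacement receives each match object and returns the text to substitute. Here is a minimal sketch with plain re.sub, using a shortened mapping and made-up addresses. It swaps the loop over keys for a single alternation pattern, which is a variation rather than the code above.

```python
import re

add_map = {"AV": "Avenue", "RD": "Road", "ST": "Street"}  # shortened for the example

# One pattern that matches any key as a whole word: \b(?:AV|RD|ST)\b
pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, add_map)), re.IGNORECASE)

def expand(match):
    # match.group() is the text that matched; upper() lets "rd" find the "RD" key.
    return add_map[match.group().upper()]

print(pattern.sub(expand, "12 Main st"))  # 12 Main Street
print(pattern.sub(expand, "9 Stone Rd"))  # 9 Stone Road
```

Because of the word boundaries, the "St" inside "Stone" is left alone.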

Extract particular string which appears in multiple lines in cell Pandas

I have to extract every string that starts with "Year" and ends with "\n", for each line that appears in a cell of a pandas data frame.
Additionally, I want to remove the \n at the end of the cell.
This is data frame:
df
Column1
not_important1\nnot_important2\nE012-855 Year-1972\nE012-856 Year-1983\nnot_important3\nE012-857 Year-1977\nnot_important4\nnot_important5\nE012-858 Year-2012\n
not_important6\nnot_important7\nE013-200 Year-1982\nE013-201 Year-1984\nnot_important8\nE013-202 Year-1987\n
not_important9\nnot_important10\nE014-652 Year-1988\nE014-653 Year-1980\nnot_important11\nE014-654 Year-1989\n
This is what I want to get:
df
Column1
Year-1972\nYear-1983\nYear-1977\nYear-2012
Year-1982\nYear-1984\nYear-1987
Year-1988\nYear-1980\nYear-1989
How to do this?
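You can use findall with the regex r'Year.*?\\n' to catch the substrings, then build a single string from the list of matches with ''.join, and finally remove the last \n with [:-2]. This assumes the cells contain a literal backslash followed by n (two characters), as displayed above; if they contain real newline characters, use r'Year.*?\n' and [:-1] instead:
import re
df['Column1'] = df['Column1'].apply(lambda x: ''.join(re.findall(r'Year.*?\\n', x))[:-2])
Or, if the four digits of the year are always followed by \n, you can do it this way (join with r'\n' instead of '\n' to keep a literal two-character separator):
df['Column1'] = df['Column1'].apply(lambda x: '\n'.join(re.findall(r'Year-\d{4}', x)))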
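The findall-then-join idea can be checked on a single cell. The sketch below assumes the cell contains real newline characters and uses a shortened copy of the first sample row.

```python
import re

cell = ("not_important1\nnot_important2\nE012-855 Year-1972\n"
        "E012-856 Year-1983\nnot_important3\nE012-857 Year-1977\n")

# findall returns every "Year-" followed by four digits, in order of appearance.
years = re.findall(r"Year-\d{4}", cell)

# Joining with "\n" puts a newline between items only, so none trails at the end.
result = "\n".join(years)
print(repr(result))  # 'Year-1972\nYear-1983\nYear-1977'
```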

How to replace character in row of dataframe?

I open the raw data using pandas:
df=pd.read_csv(file)
Here's what part of my dataframe looks like:
37280 7092|156|Laboratory Data|A648C751-A4DD-4CZ2-85
47981 7092|156|Laboratory Data|Z22CD01C-8Z4B-4ZCB-8B
57982 7092|156|Laboratory Data|C12CE01C-8F4B-4CZB-8B
I'd like to replace every pipe ('|') with a tab ('\t').
So I tried :
df.replace('|','\t')
But it never works. How could I do this?
Many thanks!
By default, the replace method on a data frame only replaces values that exactly match the string provided. You need to specify regex=True to replace patterns, and since | is a special character in regex, it has to be escaped:
df1 = df.replace(r"\|", "\t", regex=True)
df1
# 0 1
#0 37280 7092\t156\tLaboratory Data\tA648C751-A4DD-4CZ2-85
#1 47981 7092\t156\tLaboratory Data\tZ22CD01C-8Z4B-4ZCB-8B
#2 57982 7092\t156\tLaboratory Data\tC12CE01C-8F4B-4CZB-8B
If we print the cell, the tabs are printed as expected:
print(df1[1].iat[0])
# 7092 156 Laboratory Data A648C751-A4DD-4CZ2-85
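You also need to assign the result back, because replace returns a new data frame. Note that regex=True is still required to replace a substring rather than a whole cell value:
df = df.replace(r'\|', '\t', regex=True)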
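Why the pipe needs escaping can be seen with the standard re module on one row of the sample data. This is only a sketch of the regex behaviour, not of pandas itself.

```python
import re

row = "7092|156|Laboratory Data"

# Unescaped, "|" means "empty pattern OR empty pattern", so it matches the
# empty string between characters and the pipes themselves are left in place.
naive = re.sub("|", "\t", row)

# Escaped (by hand or with re.escape), it matches the literal pipe character.
escaped = re.sub(r"\|", "\t", row)
same = re.sub(re.escape("|"), "\t", row)

# A plain string replace has no special characters at all.
plain = row.replace("|", "\t")

print(escaped == same == plain)  # True
print("|" in naive)              # True
```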
