I'm trying to extract a value from my DataFrame.
I have a column 'DESC' that contains strings in the following format:
_000it_ZZZ$$$-
_0780it_ZBZT$$$-
_011it_BB$$$-
_000it_CCCC$$$-
I want to extract the string between 'it_' and '$$$'
I have tried this code, but it does not seem to work:
# initializing substrings
sub1 = "it_"
sub2 = "$$$"
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]
I would appreciate your help
You can use str.extract to get the desired output in your new column.
import pandas as pd
df = pd.DataFrame({
'DESC' : ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-", "_000it_123$$$-"]
})
pat = r"(?<=it_)(.+)(?=[\$]{3}-)"
df['results'] = df['DESC'].str.extract(pat)
print(df)
DESC results
0 _000it_ZZZ$$$- ZZZ
1 _0780it_ZBZT$$$- ZBZT
2 _011it_BB$$$- BB
3 _000it_CCCC$$$- CCCC
4 _000it_123$$$- 123
You can see the regex pattern on Regex101 for more details.
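If you prefer to avoid lookarounds, a plain capture group works just as well here; a minimal sketch on the same frame:
# Capture everything between 'it_' and the literal '$$$' marker.
df['results'] = df['DESC'].str.extract(r"it_(.+?)\$\$\$", expand=False)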
You could try using a regex pattern. It matches the cases you listed here, but I can't guarantee that it will generalize to all possible inputs.
import re
string = "_000it_ZZZ$$$-"
p = re.compile(r"(?<=it_)(.*)(?<!\W)")
m = p.findall(string)
print(m)  # ['ZZZ']
The lookbehind anchors the match after it_, and the trailing (?<!\W) trims the match back until it ends on a word character.
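Applied column-wise with pandas, the same pattern can feed str.extract, since it already contains a capture group (a sketch, assuming the df from the answer above):
df['results'] = df['DESC'].str.extract(r"(?<=it_)(.*)(?<!\W)", expand=False)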
I have a dataframe such as
COL1
A_element_1_+_none
C_BLOCA_element
D_element_3
element_'
BasaA_bloc
B_basA_bloc
BbasA_bloc
and I would like to remove the first 2 letters within each row of COL1, but only if they are in this list:
the_list =['A_','B_','C_','D_']
Then I should get the following output:
COL1
element_1_+_none
BLOCA_element
element_3
element_'
BasaA_bloc
basA_bloc
BbasA_bloc
So far I have tried the following:
df['COL1']=df['COL1'].str.replace("A_","")
df['COL1']=df['COL1'].str.replace("B_","")
df['COL1']=df['COL1'].str.replace("C_","")
df['COL1']=df['COL1'].str.replace("D_","")
But it also removes the pattern elsewhere in the string (such as the A_ inside row 2) and does not remove only the first 2 letters...
If the values to replace in the_list always have that format, you could also consider using str.replace with a simple pattern matching an uppercase char A-D followed by an underscore at the start of the string: ^[A-D]_
import pandas as pd
strings = [
"A_element_1_+_none ",
"C_BLOCA_element ",
"D_element_3",
"element_'",
"BasaA_bloc",
"B_basA_bloc",
"BbasA_bloc"
]
df = pd.DataFrame(strings, columns=["COL1"])
df['COL1'] = df['COL1'].str.replace(r"^[A-D]_", "", regex=True)
print(df)
Output
COL1
0 element_1_+_none
1 BLOCA_element
2 element_3
3 element_'
4 BasaA_bloc
5 basA_bloc
6 BbasA_bloc
You can also use pandas' apply() function: if the string starts with one of the prefixes of concern, we omit the first two characters; otherwise we return the whole string.
df["COL1"] = df["COL1"].apply(lambda x: x[2:] if x.startswith(("A_","B_","C_","D_")) else x)
While working through one of the questions on SO where regex is used to extract values, I am wondering how we can implement a regex to remove all the characters that are the same in every row at the same index position.
Below is the DataFrame:
print(df)
column1
0 [b,e,c]
1 [e,a,c]
2 [a,b,c]
regex:
df.column1.str.extract(r'(\w,\w)')
print(df)
column1
0 b,e
1 e,a
2 a,b
The above regex extracts the characters I need, but I want to preserve the [] as well.
You can use either of the following:
df['column2'] = df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
df['column2'] = '[' + df['column1'].str.extract(r'(\w,\w)') + ']'
In the .str.replace approach, (?s).*?\[(\w,\w).* matches any zero or more chars (as few as possible), then a [, then captures a word char + comma + a word char into Group 1 (\1), then the rest of the string, and the whole match is replaced with [ + the Group 1 value + ].
In the second approach, [ and ] are added to the result of the extraction, this solution is best for your toy examples here.
Here is a Pandas test:
>>> import pandas as pd
>>> df = pd.DataFrame({'column1':['[b,e,c]']})
>>> df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
0 [b,e]
Name: column1, dtype: object
>>> '[' + df['column1'].str.extract(r'(\w,\w)') + ']'
0
0 [b,e]
I have read some pricing data into a pandas DataFrame; the values appear as:
$40,000*
$40000 conditions attached
I want to strip it down to just the numeric values.
I know I can loop through and apply the regex [0-9]+ to each field, then join the resulting list back together, but is there a non-loopy way?
Thanks
You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit.
You could use pandas' replace method; you may also want to keep the thousands separator ',' and the decimal separator '.':
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'].replace(to_replace=r"\$([0-9,\.]+).*", value=r"\1", regex=True, inplace=True)
print(df)
pricing
0 40,000.32
1 40000
You could remove all the non-digits using re.sub():
import re

value = re.sub(r"[^0-9]+", "", value)  # keep only the digits
regex101 demo
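To apply the same substitution to a whole pandas column without an explicit loop, a minimal sketch (assuming the 'P' column from the earlier example):
import re

# Column-wise application of the same re.sub call.
df['P'] = df['P'].apply(lambda value: re.sub(r"[^0-9]+", "", value))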
You don't need regex for this if the values are already plain numbers. convert_objects has since been removed from pandas; the modern equivalent is pd.to_numeric:
df['col'] = pd.to_numeric(df['col'].astype(str), errors='coerce')
Note that strings such as '$40,000*' are coerced to NaN rather than parsed, so this only helps for clean values.
In case anyone is still reading this: I'm working on a similar problem and needed to replace an entire column of pandas data using a regex pattern I figured out with re.sub.
To apply this to my entire column, here's the code.
import re

# add_map holds the replacement rules for the strings in the pd df.
add_map = dict([
("AV", "Avenue"),
("BV", "Boulevard"),
("BP", "Bypass"),
("BY", "Bypass"),
("CL", "Circle"),
("DR", "Drive"),
("LA", "Lane"),
("PY", "Parkway"),
("RD", "Road"),
("ST", "Street"),
("WY", "Way"),
("TR", "Trail"),
])
obj = data_909['Address'].copy()  # data_909['Address'] contains the original addresses
for k, v in add_map.items():  # based on the rules in the dict
    rule1 = r"\b%s\b" % k  # replace k only when it stands alone (word boundaries)
    rule2 = lambda m: add_map.get(m.group().upper(), m.group())  # look up the matched abbreviation, upper-cased so IGNORECASE still hits the dict keys; fall back to the match itself
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE)  # flags lets lower/mixed-case abbreviations match too
data_909['Address_n'] = obj #store it!
Hope this helps anyone searching for the problem I had. Cheers
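For what it's worth, the per-key loop can be collapsed into a single pass by joining the keys into one alternation. A sketch under the same add_map (my own variation, not the poster's code):
import re

# One combined pattern, \b(AV|BV|...)\b, replaced via a dict lookup on each match.
pattern = r"\b(%s)\b" % "|".join(map(re.escape, add_map))
data_909['Address_n'] = data_909['Address'].str.replace(
    pattern,
    lambda m: add_map[m.group().upper()],  # uppercase the match so IGNORECASE hits the keys
    regex=True,
    flags=re.IGNORECASE,
)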
I have the following strings in a column of a dataframe:
"LOCATION: FILE-ABC.txt"
"DRAFT-1-FILENAME-ADBCD.txt"
And I want to extract everything that is between the word FILE and the '.', but I want to include the first delimiter. Basically, I am trying to return the following result:
"FILE-ABC"
"FILENAME-ABCD"
For that I am using the script below:
df['field'] = df.string_value.str.extract('FILE/(.w+)')
But I am not able to return the desired information (always getting NA).
How can I do this?
You can accomplish this all within the regex without having to use string slicing.
df['field'] = df.string_value.str.extract(r'(FILE.*(?=\.txt))')
FILE is what we begin the match on
.* grabs any number of characters
(?=...) is a lookahead assertion that matches without consuming
Handy regex tool https://pythex.org/
If the strings will always end in .txt then you can try with the following:
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Example:
import pandas as pd
text = ["LOCATION: FILE-ABC.txt","DRAFT-1-FILENAME-ADBCD.txt"]
data = {'index':[0,1],'string_value':text}
df = pd.DataFrame(data)
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Output:
index string_value field
0 0 LOCATION: FILE-ABC.txt FILE-ABC
1 1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD
You can make a capturing group that captures from (and including) 'FILE' greedily to the last period. Or you can make it non-greedy so it stops at the first . after FILE.
import pandas as pd
df = pd.DataFrame({'string_value': ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt",
"BADFILENAME.foo.txt"]})
df['field_greedy'] = df['string_value'].str.extract(r'(FILE.*)\.')
df['field_not_greedy'] = df['string_value'].str.extract(r'(FILE.*?)\.')
print(df)
string_value field_greedy field_not_greedy
0 LOCATION: FILE-ABC.txt FILE-ABC FILE-ABC
1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD FILENAME-ADBCD
2 BADFILENAME.foo.txt FILENAME.foo FILENAME
I have a pandas dataframe containing very long strings in the 'page' column that I am trying to extract a substring from:
Example string: /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
Using regex, I am having a hard time determining how to extract the string between the two ampersands while removing all the other characters that are part of the larger string.
So far, my code looks like this:
import pandas as pd
import re
dataset = pd.read_excel(r'C:\Users\example.xlsx')
dataframe = pd.DataFrame(dataset)
dataframe['Page'] = format = re.search(r'&(.*)&',str(dataframe['Page']))
dataframe.to_excel(r'C:\Users\output.xlsx')
The code above runs but does not output anything to my new spreadsheet.
Thank you in advance.
You can extract the query string from the URL with urllib.parse.urlparse, then parse it with urllib.parse.parse_qs:
>>> from urllib.parse import urlparse, parse_qs
>>> path = '/ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0'
>>> query_string = urlparse(path).query
>>> parse_qs(query_string)
{'search_query': ['example one'], 'y': ['0'], 'x': ['0']}
EDIT: To extract the query_string from all pages in the Page column:
dataframe['Page'] = dataframe['Page'].apply(lambda page: parse_qs(urlparse(page).query)['search_query'][0])
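If some pages might lack a search_query parameter, a hedged variant using dict.get avoids a KeyError (a sketch on the same column):
dataframe['Page'] = dataframe['Page'].apply(
    lambda page: parse_qs(urlparse(page).query).get('search_query', [''])[0]
)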
You can try this
(?<=&).*?(?=&)
Explanation
(?<=&) - Positive lookbehind. Matches &.
.*? - Matches anything except a newline, lazily.
(?=&) - Positive lookahead matches &.
See a demo on regex101.
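Applied to the DataFrame column, the same pattern works with str.extract once the lazy middle part sits in a capture group (a sketch, assuming the column is named 'Page' as in the question):
dataframe['query'] = dataframe['Page'].str.extract(r'(?<=&)(.*?)(?=&)', expand=False)
# -> 'search_query=example one' for the sample URL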
Fast and efficient pandas method.
Example data:
temp,page
1, /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
2, /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
3, /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
Code:
df = example_df  # the example data above, read in with e.g. pd.read_csv
df["query"] = df['page'].str.split("&", expand=True)[1].str.split("=", expand=True)[1]
print(df)
Example output:
temp \
0 1
1 2
2 3
page \
0 /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
1 /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
2 /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
query
0 example one
1 example one
2 example one
If you would like to label your columns based on the key=value pairs, that would be a different extract afterwards.
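As a sketch of that follow-up extract (the column names here are illustrative, not from the question): a named capture group labels the new column directly.
# Pull the search_query value into its own named column; the names are assumptions.
params = df['page'].str.extract(r'search_query=(?P<search_query>[^&]*)')
df = df.join(params)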