Extract substring from string using Python and regex - python

I have a pandas dataframe containing very long strings in the 'page' column that I am trying to extract a substring from:
Example string: /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
Using regex, I am having a hard time determining how to extract the string between the two ampersands and removing all other characters part of the greater string.
So far, my code looks like this:
import pandas as pd
import re
dataset = pd.read_excel(r'C:\Users\example.xlsx')
dataframe = pd.DataFrame(dataset)
dataframe['Page'] = format = re.search(r'&(.*)&',str(dataframe['Page']))
dataframe.to_excel(r'C:\Users\output.xlsx')
The code above runs but does not output anything to my new spreadsheet.
Thank you in advance.

You can extract the query string from the URL with urllib.parse.urlparse, then parse it with urllib.parse.parse_qs:
>>> from urllib.parse import urlparse, parse_qs
>>> path = '/ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0'
>>> query_string = urlparse(path).query
>>> parse_qs(query_string)
{'search_query': ['example one'], 'y': ['0'], 'x': ['0']}
EDIT: To extract the search_query value from every page in the Page column:
dataframe['Page'] = dataframe['Page'].apply(lambda page: parse_qs(urlparse(page).query)['search_query'][0])
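Putting this together with the spreadsheet workflow from the question might look roughly like this (just a sketch; the .get fallback for rows without a search_query parameter is an addition, not part of the original answer):
import pandas as pd
from urllib.parse import urlparse, parse_qs

def extract_search_query(page):
    # Parse the URL's query string and return the first search_query
    # value, or None when the parameter is missing.
    params = parse_qs(urlparse(page).query)
    return params.get('search_query', [None])[0]

dataframe = pd.read_excel(r'C:\Users\example.xlsx')
dataframe['Page'] = dataframe['Page'].apply(extract_search_query)
dataframe.to_excel(r'C:\Users\output.xlsx', index=False)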

You can try this
(?<=&).*?(?=&)
Explanation
(?<=&) - Positive lookbehind: asserts that the match is preceded by &.
.*? - Lazily matches any characters except newline.
(?=&) - Positive lookahead: asserts that the match is followed by &.
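Here is a rough sketch of how the pattern could be applied to the Page column with pandas (str.extract needs a capturing group, so the lazy part is wrapped in parentheses; the shortened URL is only for illustration):
import pandas as pd

df = pd.DataFrame({'Page': ['/ex/search/?s&search_query=example one&y=0&x=0']})

# Capture the text between the first pair of ampersands.
df['between'] = df['Page'].str.extract(r'(?<=&)(.*?)(?=&)', expand=False)
print(df['between'])
# 0    search_query=example one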

Fast and efficient pandas method.
Example data:
temp,page
1, /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
2, /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
3, /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
Code:
import pandas as pd

df = pd.read_csv("example.csv")  # the example data above, saved as example.csv
df["query"] = df['page'].str.split("&", expand=True)[1].str.split("=", expand=True)[1]
print(df)
Example output:
   temp                                                                                                         page        query
0     1  /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0  example one
1     2  /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0  example one
2     3  /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0  example one
If you would like to label your columns based on the key=value pairs, that would be a separate extraction afterwards; a rough sketch follows.
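For example, one way to get a column per key, reusing urlparse/parse_qs from the first answer (the shortened URLs are made up for illustration):
import pandas as pd
from urllib.parse import urlparse, parse_qs

df = pd.DataFrame({'page': ['/ex/search/?s&search_query=example one&y=0&x=0',
                            '/ex/search/?s&search_query=another query&y=1&x=2']})

# Parse each query string into a dict of scalar values, then expand
# the dicts into one column per key.
params = df['page'].apply(lambda p: {k: v[0] for k, v in parse_qs(urlparse(p).query).items()})
df = df.join(pd.DataFrame(params.tolist(), index=df.index))
print(df[['search_query', 'y', 'x']])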

Related

extracting a string from between two strings in dataframe

I'm trying to extract a value from my data frame.
I have a column ['Desc'] that contains sentences in the following format:
_000it_ZZZ$$$-
_0780it_ZBZT$$$-
_011it_BB$$$-
_000it_CCCC$$$-
I want to extract the string between 'it_' and '$$$'
I have tried this code but it does not seem to work:
# initializing substrings
sub1 = "it_"
sub2 = "$$$"
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]
I would appreciate your help
You can use str.extract to get the desired output in your new column.
import pandas as pd
import re
df = pd.DataFrame({
'DESC' : ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-", "_000it_123$$$-"]
})
pat = r"(?<=it_)(.+)(?=[\$]{3}-)"
df['results'] = df['DESC'].str.extract(pat)
print(df)
DESC results
0 _000it_ZZZ$$$- ZZZ
1 _0780it_ZBZT$$$- ZBZT
2 _011it_BB$$$- BB
3 _000it_CCCC$$$- CCCC
4 _000it_123$$$- 123
You can see the regex pattern on Regex101 for more details.
You could try using a regex pattern. It matches the cases you listed here, but I can't guarantee that it will generalize to all possible patterns.
import re
string = "_000it_ZZZ$$$-"
p = re.compile(r"(?<=it_)(.*)(?<!\W)")
m = p.findall(string)
print(m)  # ['ZZZ']
The pattern starts matching after it_ in the string and stops before the trailing non-word characters.
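Applied to the whole DESC column, a rough sketch of the same idea with str.extract could look like this:
import pandas as pd

df = pd.DataFrame({'DESC': ['_000it_ZZZ$$$-', '_0780it_ZBZT$$$-', '_011it_BB$$$-', '_000it_CCCC$$$-']})

# Same pattern as above: start after 'it_' and stop before the
# trailing non-word characters.
df['results'] = df['DESC'].str.extract(r'(?<=it_)(.*)(?<!\W)', expand=False)
print(df)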

How can I optimally replace substrings in the dataframe

I have a list of words which I would like to replace with an empty string in a dataframe.
I have a column named source which I have to clean properly.
e.g. replace 'siliconvalley.co' with 'siliconvalley'
I created a list which is
list = ['.com','.co','.de','.co.jp','.co.uk','.lk','.it','.es','.ua','.bg','.at','.kr']
and replace each of them with an empty string:
for l in list:
df['source'] = df['source'].str.replace(l,'')
In the output, I am getting 'silinvalley' which means it has also replaced 'co' instead of '.co'
I want the code to replace the data which is exactly matching the pattern. Please help!
This would be one way. You have to be careful with the order of replacement: if '.co' comes before '.co.uk', you don't get the desired result.
df["source"].replace('|'.join([re.escape(i) for i in list_]), '', regex=True)
Minimal example:
import pandas as pd
import re
list_ = ['.com','.co.uk','.co','.de','.co.jp','.lk','.it','.es','.ua','.bg','.at','.kr']
df = pd.DataFrame({
'source': ['google.com', 'google.no', 'google.co.uk']
})
pattern = '|'.join([re.escape(i) for i in list_])
df["new_source"] = df["source"].replace(pattern, '', regex=True)
print(df)
# source new_source
#0 google.com google
#1 google.no google.no
#2 google.co.uk google
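If the endings should only be stripped when they appear at the very end of the value, one variation (a sketch, not part of the original answer) is to anchor the alternation with $:
import pandas as pd
import re

list_ = ['.com','.co.uk','.co','.de','.co.jp','.lk','.it','.es','.ua','.bg','.at','.kr']
df = pd.DataFrame({'source': ['siliconvalley.co', 'google.co.uk', 'bit.ly']})

# Anchoring with $ removes only a trailing suffix, never an occurrence
# in the middle of a longer domain.
pattern = '(?:' + '|'.join(re.escape(i) for i in list_) + ')$'
df['new_source'] = df['source'].str.replace(pattern, '', regex=True)
print(df)
# source new_source
#0 siliconvalley.co siliconvalley
#1 google.co.uk google
#2 bit.ly bit.ly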

Pandas DataFrame - Extract string between two strings and include the first delimiter

I have the following strings in a column of a dataframe:
"LOCATION: FILE-ABC.txt"
"DRAFT-1-FILENAME-ADBCD.txt"
And I want to extract everything that is between the word FILE and the ".". But I want to include the first delimiter. Basically I am trying to return the following result:
"FILE-ABC"
"FILENAME-ABCD"
For that I am using the script below:
df['field'] = df.string_value.str.extract('FILE/(.w+)')
But I am not able to return the desired information (always getting NA).
How can I do this?
You can accomplish this all within the regex without having to use string slicing.
df['field'] = df.string_value.str.extract(r'(FILE.*(?=\.txt))')
FILE is what we begin the match on
.* grabs any number of characters
(?=\.txt) is a lookahead assertion that matches without consuming
Handy regex tool https://pythex.org/
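A minimal runnable sketch of that approach, using the strings from the question:
import pandas as pd

df = pd.DataFrame({'string_value': ['LOCATION: FILE-ABC.txt', 'DRAFT-1-FILENAME-ADBCD.txt']})

# Match from FILE up to, but not including, the .txt extension.
df['field'] = df['string_value'].str.extract(r'(FILE.*(?=\.txt))', expand=False)
print(df['field'].tolist())
# ['FILE-ABC', 'FILENAME-ADBCD']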
If the strings will always end in .txt then you can try with the following:
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Example:
import pandas as pd
text = ["LOCATION: FILE-ABC.txt","DRAFT-1-FILENAME-ADBCD.txt"]
data = {'index':[0,1],'string_value':text}
df = pd.DataFrame(data)
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Output:
index string_value field
0 0 LOCATION: FILE-ABC.txt FILE-ABC
1 1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD
You can make a capturing group that captures from (including) 'FILE' greedily to the last period. Or you can make it not greedy so it stops at the first . after FILE.
import pandas as pd
df = pd.DataFrame({'string_value': ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt",
"BADFILENAME.foo.txt"]})
df['field_greedy'] = df['string_value'].str.extract(r'(FILE.*)\.')
df['field_not_greedy'] = df['string_value'].str.extract(r'(FILE.*?)\.')
print(df)
string_value field_greedy field_not_greedy
0 LOCATION: FILE-ABC.txt FILE-ABC FILE-ABC
1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD FILENAME-ADBCD
2 BADFILENAME.foo.txt FILENAME.foo FILENAME

Pandas how to create unique 4 character string from longer string

I have a pandas dataframe with strings. I would like to shorten them, so I have decided to remove the vowels. My next step is to take the first four characters of the string, but I am running into collisions. Is there a smarter way to do this so that I can avoid repeated strings while still keeping to 4-character strings?
import pandas as pd
import re
d = {'test': ['gregorypolanco','franciscoliriano','chrisarcher', 'franciscolindor']}
df = pd.DataFrame(data=d)
def remove_vowels(r):
result = re.sub(r'[AEIOU]', '', r, flags=re.IGNORECASE)
return result
no_vowel = pd.DataFrame(df['test'].apply(remove_vowels))
no_vowel['test'].str[0:4]
Output:
0 grgr
1 frnc
2 chrs
3 frnc
Name: test, dtype: object
From the above you can see that 'franciscoliriano' and 'franciscolindor' are the same when shortened.
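One possible direction, offered only as a sketch (the counter suffix is not from the original post): keep the vowel-stripped 4-character prefix and replace its last character with a counter whenever that prefix has already been used.
import pandas as pd
import re

d = {'test': ['gregorypolanco','franciscoliriano','chrisarcher', 'franciscolindor']}
df = pd.DataFrame(data=d)

def remove_vowels(s):
    return re.sub(r'[AEIOU]', '', s, flags=re.IGNORECASE)

def make_codes(names, length=4):
    # Track how often each prefix has been seen; on a repeat, swap the
    # last character for a counter so the code stays unique and short.
    seen = {}
    codes = []
    for name in names:
        base = remove_vowels(name)[:length]
        n = seen.get(base, 0)
        seen[base] = n + 1
        codes.append(base if n == 0 else base[:length - 1] + str(n))
    return codes

df['code'] = make_codes(df['test'])
print(df)
# 'franciscoliriano' becomes 'frnc' and 'franciscolindor' becomes 'frn1'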

Return multiple matches of regular expression within a string in python pandas

I am trying to extract all matches contained between ">" and "<" in a string.
The code below only returns the first match in the string.
In:
import pandas as pd
import re
df = pd.Series(['<option value="85">APOE</option><option value="636">PICALM1<'])
reg = '(>([A-Z])\w+<)'
df2 = df.str.extract(reg)
print df2
Out:
        0  1
0  >APOE<  A
I would like to return "APOE" and "PICALM1" and not just "APOE"
Thanks for your help!
import re
import pandas as pd
df['new_col'] = df['old_col'].str.findall(r'>([A-Z][^<]+)<')
This will store all matches as a list in the new_col column of the dataframe.
No need for pandas.
df = '<option value="85">APOE</option><option value="636">PICALM1<'
reg = '>([A-Z][^<]+)<'
print(re.findall(reg, df))
['APOE', 'PICALM1']
Parsing HTML with regular expressions may not be the best idea; have you considered using BeautifulSoup?
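For reference, a rough sketch with BeautifulSoup, using a well-formed version of the snippet:
from bs4 import BeautifulSoup

html = '<option value="85">APOE</option><option value="636">PICALM1</option>'
soup = BeautifulSoup(html, 'html.parser')
print([opt.get_text() for opt in soup.find_all('option')])
# ['APOE', 'PICALM1']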
