How can I optimally replace the dataframe

How can I optimally replace the dataframe - python

I have a list of words in dataframe which I would like to replace with empty string.
I have a column named source which I have to clean properly.
e.g replace 'siliconvalley.co' to 'siliconvalley'
I created a list which is
list = ['.com','.co','.de','.co.jp','.co.uk','.lk','.it','.es','.ua','.bg','.at','.kr']
and replace them with empty string
for l in list:
df['source'] = df['source'].str.replace(l,'')
In the output, I am getting 'silinvalley' which means it has also replaced 'co' instead of '.co'
I want the code to replace the data which is exactly matching the pattern. Please help!

This would be one way. Would have to be careful with the order of replacement. If '.co' comes before '.co.uk' you don't get the desired result.
df["source"].replace('|'.join([re.escape(i) for i in list_]), '', regex=True)
Minimal example:
import pandas as pd
import re
list_ = ['.com','.co.uk','.co','.de','.co.jp','.lk','.it','.es','.ua','.bg','.at','.kr']
df = pd.DataFrame({
'source': ['google.com', 'google.no', 'google.co.uk']
})
pattern = '|'.join([re.escape(i) for i in list_])
df["new_source"] = df["source"].replace(pattern, '', regex=True)
print(df)
# source new_source
#0 google.com google
#1 google.no google.no
#2 google.co.uk google

Related

Python dataframe : strip part of string, on each column row, if it is in specific format [duplicate]

I have read some pricing data into a pandas dataframe the values appear as:
$40,000*
$40000 conditions attached
I want to strip it down to just the numeric values.
I know I can loop through and apply regex
[0-9]+
to each field then join the resulting list back together but is there a not loopy way?
Thanks

You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit.

You could use pandas' replace method; also you may want to keep the thousands separator ',' and the decimal place separator '.'
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'].replace(to_replace="\$([0-9,\.]+).*", value=r"\1", regex=True, inplace=True)
print(df)
pricing
0 40,000.32
1 40000

You could remove all the non-digits using re.sub():
value = re.sub(r"[^0-9]+", "", value)
regex101 demo

You don't need regex for this. This should work:
df['col'] = df['col'].astype(str).convert_objects(convert_numeric=True)

In case anyone is still reading this. I'm working on a similar problem and need to replace an entire column of pandas data using a regex equation I've figured out with re.sub
To apply this on my entire column, here's the code.
#add_map is rules of replacement for the strings in pd df.
add_map = dict([
("AV", "Avenue"),
("BV", "Boulevard"),
("BP", "Bypass"),
("BY", "Bypass"),
("CL", "Circle"),
("DR", "Drive"),
("LA", "Lane"),
("PY", "Parkway"),
("RD", "Road"),
("ST", "Street"),
("WY", "Way"),
("TR", "Trail"),
])
obj = data_909['Address'].copy() #data_909['Address'] contains the original address'
for k,v in add_map.items(): #based on the rules in the dict
rule1 = (r"(\b)(%s)(\b)" % k) #replace the k only if they're alone (lookup \
b)
rule2 = (lambda m: add_map.get(m.group(), m.group())) #found this online, no idea wtf this does but it works
obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE) #use flags here to avoid the dictionary iteration problem
data_909['Address_n'] = obj #store it!
Hope this helps anyone searching for the problem I had. Cheers

regexp match in pandas

In want to execute a regexp match on a dataframe column in order to modify the content of the column.
For example, given this dataframe:
import pandas as pd
df = pd.DataFrame([['abra'], ['charmender'], ['goku']],
columns=['Name'])
print(df.head())
I want to execute the following regex match:
CASE
WHEN REGEXP_MATCH(Landing Page,'abra') THEN "kadabra"
WHEN REGEXP_MATCH(Landing Page,'charmender') THEN "charmaleon"
ELSE "Unknown" END
My solution is the following:
df.loc[df['Name'].str.contains("abra", na=False), 'Name'] = "kadabra"
df.loc[df['Name'].str.contains("charmender", na=False), 'Name'] = "charmeleon"
df.head()
It works but I do not know if there is a better way of doing it.
Moreover, I have to rewrite all the regex cases line by line in Python. Is there a way to execute the regex directly in Pandas?

Are you looking for map:
df['Name'] = df['Name'].map({'abra':'kadabra','charmender':'charmeleon'})
Output:
Name
0 kadabra
1 charmeleon
2 NaN
Update: For partial matches:
df = pd.DataFrame([['this abra'], ['charmender'], ['goku']],
columns=['Name'])
replaces = {'abra':'kadabra','charmender':'charmeleon'}
df['Name'] = df['Name'].str.extract(fr"\b({'|'.join(replaces.keys())})\b")[0].map(replaces)
And you get the same output (with different dataframe)

Pandas DataFrame - Extract string between two strings and include the first delimiter

I've the following strings in column on a dataframe:
"LOCATION: FILE-ABC.txt"
"DRAFT-1-FILENAME-ADBCD.txt"
And I want to extract everything that is between the word FILE and the ".". But I want to include the first delimiter. Basically I am trying to return the following result:
"FILE-ABC"
"FILENAME-ABCD"
For that I am using the script below:
df['field'] = df.string_value.str.extract('FILE/(.w+)')
But I am not able to return the desired information (always getting NA).
How can I do this?

you can accomplish this all within the regex without having to use string slicing.
df['field'] = df.string_value.str.extract('(FILE.*(?=.txt))')
FILE is the what we begin the match on
.* grabs any number of characters
(?=) is a lookahead assertion that matches without
consuming.
Handy regex tool https://pythex.org/

If the strings will always end in .txt then you can try with the following:
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Example:
import pandas as pd
text = ["LOCATION: FILE-ABC.txt","DRAFT-1-FILENAME-ADBCD.txt"]
data = {'index':[0,1],'string_value':text}
df = pd.DataFrame(data)
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Output:
index string_value field
0 0 LOCATION: FILE-ABC.txt FILE-ABC
1 1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD

You can make a capturing group that captures from (including) 'FILE' greedily to the last period. Or you can make it not greedy so it stops at the first . after FILE.
import pandas as pd
df = pd.DataFrame({'string_value': ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt",
"BADFILENAME.foo.txt"]})
df['field_greedy'] = df['string_value'].str.extract('(FILE.*)\.')
df['field_not_greedy'] = df['string_value'].str.extract('(FILE.*?)\.')
print(df)
string_value field_greedy field_not_greedy
0 LOCATION: FILE-ABC.txt FILE-ABC FILE-ABC
1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD FILENAME-ADBCD
2 BADFILENAME.foo.txt FILENAME.foo FILENAME

Removing URL from a column in Pandas Dataframe

I have a small dataframe and am trying to remove the url from the
end of the string in the Links column. I have tried the following code and it works on columns where the url is on its own. The problem is that as soon as there are sentences before the url the code won't remove those urls
Here is the data: https://docs.google.com/spreadsheets/d/10LV8BHgofXKTwG-MqRraj0YWez-1vcwzzTJpRhdWgew/edit?usp=sharing (link to spreadsheet)
import pandas as pd
df = pd.read_csv('TestData.csv')
df['Links'] = df['Links'].replace(to_replace=r'^https?:\/\/.*[\r\n]*',value='',regex=True)
df.head()
Thanks!

Try this:
import re
df['cleanLinks'] = df['Links'].apply(lambda x: re.split('https:\/\/.*', str(x))[0])
Output:
df['cleanLinks']
cleanLinks
0 random words to see if it works now
1 more stuff that doesn't mean anything
2 one last try please work

Try a cleaner regex:
df['example'] = df['example'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)
Before implementing regex in pandas .replace() or anywhere else for that matter you should test the pattern using re.sub() on a single basic string example. When faced with a big problem, break it down into a smaller one.
Additionally we could go with the str.replace method:
df['status_message'] = df['status_message'].str.replace('http\S+|www.\S+', '', case=False)

For Dataframe df, URLs can be removed by using cleaner regex as follows:
df = pd.read_csv('./data-set.csv')
print(df['text'])
def clean_data(dataframe):
#replace URL of a text
dataframe['text'] = dataframe['text'].str.replace('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ')
clean_data(df)
print(df['text']);

pandas applying regex to replace values

I have read some pricing data into a pandas dataframe the values appear as:
$40,000*
$40000 conditions attached
I want to strip it down to just the numeric values.
I know I can loop through and apply regex
[0-9]+
to each field then join the resulting list back together but is there a not loopy way?
Thanks

You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit.

You could use pandas' replace method; also you may want to keep the thousands separator ',' and the decimal place separator '.'
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'].replace(to_replace="\$([0-9,\.]+).*", value=r"\1", regex=True, inplace=True)
print(df)
pricing
0 40,000.32
1 40000

You could remove all the non-digits using re.sub():
value = re.sub(r"[^0-9]+", "", value)
regex101 demo

You don't need regex for this. This should work:
df['col'] = df['col'].astype(str).convert_objects(convert_numeric=True)

In case anyone is still reading this. I'm working on a similar problem and need to replace an entire column of pandas data using a regex equation I've figured out with re.sub
To apply this on my entire column, here's the code.
#add_map is rules of replacement for the strings in pd df.
add_map = dict([
("AV", "Avenue"),
("BV", "Boulevard"),
("BP", "Bypass"),
("BY", "Bypass"),
("CL", "Circle"),
("DR", "Drive"),
("LA", "Lane"),
("PY", "Parkway"),
("RD", "Road"),
("ST", "Street"),
("WY", "Way"),
("TR", "Trail"),
])
obj = data_909['Address'].copy() #data_909['Address'] contains the original address'
for k,v in add_map.items(): #based on the rules in the dict
rule1 = (r"(\b)(%s)(\b)" % k) #replace the k only if they're alone (lookup \
b)
rule2 = (lambda m: add_map.get(m.group(), m.group())) #found this online, no idea wtf this does but it works
obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE) #use flags here to avoid the dictionary iteration problem
data_909['Address_n'] = obj #store it!
Hope this helps anyone searching for the problem I had. Cheers

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I optimally replace the dataframe - python

Related

Python dataframe : strip part of string, on each column row, if it is in specific format [duplicate]

regexp match in pandas

Pandas DataFrame - Extract string between two strings and include the first delimiter

Removing URL from a column in Pandas Dataframe

pandas applying regex to replace values

Categories

Resources