Replace part of string in column DataFrame - python

Problem: I have a DataFrame column with values like 1.Goalkeeper, 4.Midfield... and I can't replace part of each string.
Objective: My goal is to turn them into 1.GK, 4.MD... but no replacement happens. It is as if these lines were never run. Any ideas?
The code works when the cell value is exactly the key being replaced, for example Goalkeeper, Midfield... but it doesn't work when the value is prefixed with a number and a dot.
CODE
df2['Posicion'].replace({'Goalkeeper':'GK','Left-Back':'LI','Defensive Midfield':'MCD'
,'Right Midfield':'MD','Attacking Midfield':'MP','Right Winger':'ED','Centre-Forward':'DC',
'Centre-Back':'DFC','Right-Back':'LD','Central Midfield':'MC','Second Striker':'SD',
'Left Midfield':'MI','Left Winger':'EI','N':'','None':'','Sweeper':'DFC'}, inplace=True)

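By default, replace only matches a key when it equals the entire cell value, so '1.Goalkeeper' is never matched by 'Goalkeeper'. Passing regex=True treats each key as a pattern that can match part of the string. Assigning the result back also avoids relying on inplace=True through a column selection:
import pandas as pd

df2 = pd.DataFrame({
    'Posicion': ['1.Goalkeeper', '2.Midfield', '3.Left Winger']
})
df2['Posicion'] = df2['Posicion'].replace({'Goalkeeper': 'GK',
                                           'Left Winger': 'EI',
                                           'N': '',
                                           'None': '',
                                           'Sweeper': 'DFC'},
                                          regex=True)
print(df2)
Output: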
Posicion
0 1.GK
1 2.Midfield
2 3.EI
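To make the difference concrete, here is a minimal sketch comparing the default whole-value matching with regex=True. The Series and the shortened mapping are illustrative, not the asker's full data.

```python
import pandas as pd

# Illustrative data modelled on the question's column.
s = pd.Series(["1.Goalkeeper", "2.Midfield", "3.Left Winger"], name="Posicion")
mapping = {"Goalkeeper": "GK", "Left Winger": "EI"}

# Default behaviour: a key must equal the entire cell value, so nothing changes.
whole_value = s.replace(mapping)

# regex=True: each key is a pattern that may match inside the string.
substring = s.replace(mapping, regex=True)

print(whole_value.tolist())  # ['1.Goalkeeper', '2.Midfield', '3.Left Winger']
print(substring.tolist())    # ['1.GK', '2.Midfield', '3.EI']
```

One thing to watch: with regex=True a short key such as 'N' matches every capital N inside a value. A pattern like '^N$' restricts it to cells that are exactly 'N'.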

Related

Pandas: replace string with special characters

I have a dataframe (see below) with two columns that each contain either a list of patients or an empty list (like ['']). I want to remove the empty lists.
What I have:
Homozygous_list        heterozygous_list
[Patient1,Patient2]    ['']
['']                   [Patient1]
What I want:
Homozygous_list        heterozygous_list
[Patient1,Patient2]
                       [Patient1]
I tried several things, like:
variants["Homozygous_list"].replace("['']","", regex=True, inplace=True)
or
variants["Homozygous_list"].replace("\[''\]","", regex=True, inplace=True)
or
variants["Homozygous_list"] = variants["Homozygous_list"].replace("['']","", regex=True)
etc but nothing seems to work.
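If the cells really are lists of strings (not strings that look like lists), string replacement will not apply to them. Use applymap to test each cell instead (pandas 2.1 and later deprecate applymap in favor of DataFrame.map, which takes the same function):
df = df.applymap(lambda x: '' if x == [''] else x)  # or pd.NA in place of ''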
output:
        Homozygous_list heterozygous_list
0  [Patient1, Patient2]
1                              [Patient1]
used input:
df = pd.DataFrame({'Homozygous_list': [['Patient1', 'Patient2'], ['']],
                   'heterozygous_list': [[''], ['Patient1']]})
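For pandas versions where applymap raises a deprecation warning, the same cell-by-cell test can be written with Series.map applied column by column. This is only a sketch; the helper name blank_if_empty is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Homozygous_list": [["Patient1", "Patient2"], [""]],
    "heterozygous_list": [[""], ["Patient1"]],
})

def blank_if_empty(cell):
    # Hypothetical helper: a cell holding [''] counts as empty.
    return "" if cell == [""] else cell

# Apply the helper to every cell, one column at a time.
cleaned = df.apply(lambda col: col.map(blank_if_empty))
print(cleaned)
```

The comparison cell == [''] is an ordinary Python list comparison, which is why this works on real lists but would not work if the cells were strings such as "['']".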

How can I optimally replace the dataframe

I have a list of words in dataframe which I would like to replace with empty string.
I have a column named source which I have to clean properly.
e.g replace 'siliconvalley.co' to 'siliconvalley'
I created a list which is
list = ['.com','.co','.de','.co.jp','.co.uk','.lk','.it','.es','.ua','.bg','.at','.kr']
and replace them with empty string
for l in list:
df['source'] = df['source'].str.replace(l,'')
In the output, I am getting 'silinvalley' which means it has also replaced 'co' instead of '.co'
I want the code to replace the data which is exactly matching the pattern. Please help!
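The unwanted matches happen when the pattern is treated as a regular expression (the default for str.replace in older pandas versions), because an unescaped '.' matches any character. One way around it is to escape each suffix and join them into a single alternation. Be careful with the order of the alternatives: they are tried left to right, so if '.co' comes before '.co.uk', 'google.co.uk' becomes 'google.uk'.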
df["source"].replace('|'.join([re.escape(i) for i in list_]), '', regex=True)
Minimal example:
import pandas as pd
import re
list_ = ['.com','.co.uk','.co','.de','.co.jp','.lk','.it','.es','.ua','.bg','.at','.kr']
df = pd.DataFrame({
'source': ['google.com', 'google.no', 'google.co.uk']
})
pattern = '|'.join([re.escape(i) for i in list_])
df["new_source"] = df["source"].replace(pattern, '', regex=True)
print(df)
# source new_source
#0 google.com google
#1 google.no google.no
#2 google.co.uk google
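To avoid depending on how the list happens to be ordered, the suffixes can be sorted longest first before joining. The sketch below also anchors the pattern at the end of the string, which is an extra assumption (that only a trailing suffix should be removed) and not something the original answer does. The name strip_suffix is illustrative.

```python
import re

# Suffixes in the question's original order, where '.co' precedes '.co.uk'.
suffixes = ['.com', '.co', '.de', '.co.jp', '.co.uk', '.lk',
            '.it', '.es', '.ua', '.bg', '.at', '.kr']

# Longest first, so the most specific suffix is tried before its prefixes.
ordered = sorted(suffixes, key=len, reverse=True)

# Assumption: only a suffix at the very end of the value is removed.
pattern = re.compile("(?:" + "|".join(re.escape(s) for s in ordered) + ")$")

def strip_suffix(domain):
    """Remove one known suffix from the end of domain, if present."""
    return pattern.sub("", domain)

for name in ["siliconvalley.co", "google.co.uk", "google.no"]:
    print(name, "->", strip_suffix(name))
```

The resulting pattern string (pattern.pattern) can be passed to df["source"].replace(..., regex=True) in the same way as in the minimal example.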

Python dataframe : strip part of string, on each column row, if it is in specific format [duplicate]

I have read some pricing data into a pandas dataframe; the values appear as:
$40,000*
$40000 conditions attached
I want to strip it down to just the numeric values.
I know I can loop through and apply regex
[0-9]+
to each field then join the resulting list back together but is there a not loopy way?
Thanks
You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit.
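One caveat with \D+ is that it also removes the decimal point, so a price with cents changes magnitude. A small sketch with plain re (the function names are illustrative) shows the effect and one possible alternative that keeps the dot:

```python
import re

def digits_only(text):
    # Drops every non-digit, including ',' and '.'.
    return re.sub(r"\D+", "", text)

def number_text(text):
    # Keeps digits and '.', drops everything else (thousands separators too).
    return re.sub(r"[^0-9.]+", "", text)

print(digits_only("$40,000*"))     # 40000
print(digits_only("$40,000.32*"))  # 4000032
print(number_text("$40,000.32*"))  # 40000.32
```

The second variant assumes that '.' only ever appears as the decimal separator in the data; trailing text that contains a period would need extra handling.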
You could use pandas' replace method; also you may want to keep the thousands separator ',' and the decimal place separator '.'
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'] = df['pricing'].replace(to_replace=r"\$([0-9,\.]+).*", value=r"\1", regex=True)
print(df)
pricing
0 40,000.32
1 40000
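The result above is still text. If actual numbers are needed, one possible follow-up is to extract the amount, drop the thousands separators, and convert with pd.to_numeric. This is a sketch; the amount column name is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame(['$40,000.32*', '$40000 conditions attached'], columns=['pricing'])

# Capture the run of digits, commas and dots that follows the dollar sign.
amount = df['pricing'].str.extract(r"\$([0-9,.]+)", expand=False)

# Remove thousands separators literally, then convert the text to numbers.
df['amount'] = pd.to_numeric(amount.str.replace(",", "", regex=False))
print(df)
```

Rows without a dollar amount would come out of str.extract as missing values, which pd.to_numeric leaves as NaN.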
You could remove all the non-digits using re.sub():
value = re.sub(r"[^0-9]+", "", value)
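You don't need regex if the strings are already plain numbers. Older pandas versions offered convert_objects for this, but it was deprecated and later removed; pd.to_numeric is its replacement. Note that values such as '$40,000*' still need the non-numeric characters stripped first, otherwise they cannot be parsed:
df['col'] = pd.to_numeric(df['col'], errors='coerce')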
In case anyone is still reading this: I'm working on a similar problem and need to replace values across an entire pandas column using a regex I worked out with re.sub.
To apply it to the whole column, here's the code.
import re
# add_map holds the replacement rules for the strings in the DataFrame.
add_map = dict([
("AV", "Avenue"),
("BV", "Boulevard"),
("BP", "Bypass"),
("BY", "Bypass"),
("CL", "Circle"),
("DR", "Drive"),
("LA", "Lane"),
("PY", "Parkway"),
("RD", "Road"),
("ST", "Street"),
("WY", "Way"),
("TR", "Trail"),
])
obj = data_909['Address'].copy()  # data_909['Address'] contains the original addresses
for k, v in add_map.items():  # based on the rules in the dict
    rule1 = r"\b%s\b" % k  # replace k only when it stands alone (look up \b, the word boundary)
    # Look up the matched text in add_map; upper-case it first because the match
    # ignores case, and fall back to the matched text if there is no entry.
    rule2 = lambda m: add_map.get(m.group().upper(), m.group())
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE)  # match regardless of case
data_909['Address_n'] = obj  # store it!
Hope this helps anyone searching for the problem I had. Cheers
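Since the replacement function already consults the dictionary, the loop over keys is not strictly needed. Below is a sketch of the same idea with a single pattern built from all keys. It uses a shortened add_map and a made-up function name expand, and works on one string at a time.

```python
import re

# Shortened, illustrative version of the mapping above.
add_map = {"AV": "Avenue", "RD": "Road", "ST": "Street"}

# One alternation of all keys, each allowed only as a whole word.
pattern = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(k) for k in add_map),
    flags=re.IGNORECASE,
)

def expand(address):
    # Upper-case the match before the lookup, since matching ignores case.
    return pattern.sub(lambda m: add_map[m.group(0).upper()], address)

print(expand("12 Main st"))         # 12 Main Street
print(expand("5 Elm Rd, Stanton"))  # 5 Elm Road, Stanton
```

The word boundaries keep "Stanton" intact even though it starts with "St". For a whole column, the function can be applied with something like data_909['Address'].map(expand).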

How to replace character in row of dataframe?

I load raw data using pandas:
df = pd.read_csv(file)
Here's what part of my dataframe looks like:
37280 7092|156|Laboratory Data|A648C751-A4DD-4CZ2-85
47981 7092|156|Laboratory Data|Z22CD01C-8Z4B-4ZCB-8B
57982 7092|156|Laboratory Data|C12CE01C-8F4B-4CZB-8B
I'd like to replace every pipe ('|') with a tab ('\t').
So I tried:
df.replace('|', '\t')
But it doesn't work. How can I do this?
Many thanks!
By default, the DataFrame replace method only replaces cells whose entire value equals the string provided. You need to specify regex=True to replace patterns inside the values, and since | is a special character in regex, it must be escaped:
df1 = df.replace(r"\|", "\t", regex=True)
df1
# 0 1
#0 37280 7092\t156\tLaboratory Data\tA648C751-A4DD-4CZ2-85
#1 47981 7092\t156\tLaboratory Data\tZ22CD01C-8Z4B-4ZCB-8B
#2 57982 7092\t156\tLaboratory Data\tC12CE01C-8F4B-4CZB-8B
If we print the cell, the tab are printed as expected:
print(df1[1].iat[0])
# 7092 156 Laboratory Data A648C751-A4DD-4CZ2-85
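The three behaviours can be compared side by side. This is a sketch on a two-row frame modelled on the question's data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    0: [37280, 47981],
    1: ["7092|156|Laboratory Data|A648C751-A4DD-4CZ2-85",
        "7092|156|Laboratory Data|Z22CD01C-8Z4B-4ZCB-8B"],
})

# Whole-value matching: no cell is exactly '|', so nothing changes.
exact = df.replace("|", "\t")

# Regex matching with the pipe escaped: replaces inside the strings.
escaped = df.replace(r"\|", "\t", regex=True)

# Literal substring replacement on one column, no escaping required.
literal = df[1].str.replace("|", "\t", regex=False)

print(exact.equals(df))
print(repr(escaped[1].iat[0]))
```

If the real goal is to split the fields into columns, reading the file with pd.read_csv(file, sep="|") may be the more direct route.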
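Also note that replace returns a new DataFrame rather than modifying df, so assign the result back. The escaped pattern and regex=True are still required for a replacement inside the values:
df = df.replace(r'\|', '\t', regex=True)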
