Create new column of cleaned string data with Python/pandas

I have a DataFrame with some user input (it's supposed to just be a plain email address), along with some other values, like this:
import pandas as pd
from pandas import Series, DataFrame
df = pd.DataFrame({
    'input': ['Captain Jean-Luc Picard <picard#starfleet.com>',
              'deanna.troi#starfleet.com',
              'data#starfleet.com',
              'William Riker <riker#starfleet.com>'],
    'val_1': [1.5, 3.6, 2.4, 2.9],
    'val_2': [7.3, -2.5, 3.4, 1.5]
})
Due to a bug, the input sometimes has the user's name as well as brackets around the email address; this needs to be fixed before continuing with the analysis.
To move forward, I want to create a new column that has cleaned versions of the emails: if the email contains names/brackets then remove those, else just give the already correct email.
There are numerous examples of cleaning string data with Python/pandas, but I've yet to successfully implement any of those suggestions. Here are a few examples of what I've tried:
# as noted in pandas docs, turns all non-matching strings into NaN
df['cleaned'] = df['input'].str.extract('<(.*)>')
# AttributeError: type object 'str' has no attribute 'contains'
df['cleaned'] = df['input'].apply(lambda x: str.extract('<(.*)>') if str.contains('<(.*)>') else x)
# AttributeError: 'DataFrame' object has no attribute 'str'
df['cleaned'] = df[df['input'].str.contains('<(.*)>')].str.extract('<(.*)>')
Thanks!

Use np.where to apply str.extract to the rows that contain the embedded email; for the else condition, just return the 'input' value:
In [63]:
df['cleaned'] = np.where(df['input'].str.contains('<'), df['input'].str.extract('<(.*)>'), df['input'])
df
Out[63]:
                                            input  val_1  val_2                    cleaned
0  Captain Jean-Luc Picard <picard#starfleet.com>    1.5    7.3       picard#starfleet.com
1                       deanna.troi#starfleet.com    3.6   -2.5  deanna.troi#starfleet.com
2                              data#starfleet.com    2.4    3.4         data#starfleet.com
3             William Riker <riker#starfleet.com>    2.9    1.5        riker#starfleet.com
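Note that in newer pandas versions str.extract returns a DataFrame by default (expand=True), which breaks the np.where call above. Passing expand=False keeps the result a Series, and fillna can then replace np.where entirely; a minimal sketch, assuming the same df:
import pandas as pd

# expand=False keeps the extract result a 1-D Series instead of a DataFrame;
# rows without <...> come back as NaN, which fillna backfills from 'input'
df['cleaned'] = df['input'].str.extract('<(.*)>', expand=False).fillna(df['input'])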

If you want to use regular expressions:
import re

rex = re.compile(r'<(.*)>')

def fix(s):
    m = rex.search(s)
    if m is None:
        return s
    else:
        return m.groups()[0]

fixed = df['input'].apply(fix)
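fixed is a plain Series; assign it back onto the frame to get the same result as above (the 'cleaned' column name is assumed):
# attach the cleaned values as a new column
df['cleaned'] = fixed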

Related

pandas split string with $ special text style

I have an Excel file where the data has two $ signs per cell; when I read it using pandas, it converts them to a very special text style.
import pandas as pd
df = pd.DataFrame({ 'Bid-Ask':['$185.25 - $186.10','$10.85 - $11.10','$14.70 - $15.10']})
After pd.read_excel:
df['Bid'] = df['Bid-Ask'].str.split('−').str[0]
The code above doesn't work; the $ turns my string into special-style text, and the split function does nothing.
My expected result:
Do not split. Using str.extract is likely the most robust:
df[['Bid', 'Ask']] = df['Bid-Ask'].str.extract(r'(\d+(?:\.\d+)?)\D*(\d+(?:\.\d+)?)')
Output:
Bid-Ask Bid Ask
0 $185.25 - $186.10 185.25 186.10
1 $10.85 - $11.10 10.85 11.10
2 $14.70 - $15.10 14.70 15.10
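The extracted columns are still strings; if you need numbers for further work (my addition, assuming the frame above), cast them:
# str.extract returns object (string) columns; convert for arithmetic
df[['Bid', 'Ask']] = df[['Bid', 'Ask']].astype(float)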
There is a non-breaking space (\xa0) in your string. That's why the split doesn't work.
I copied the strings (of your df) one by one into an Excel file and then imported it with pd.read_excel.
The column looks like this then:
repr(df['Bid-Ask'])
'0 $185.25\xa0- $186.10\n1 $10.85\xa0- $11.10\n2 $14.70\xa0- $15.10\nName: Bid-Ask, dtype: object'
Before splitting you can replace that and it'll work.
df['Bid-Ask'] = df['Bid-Ask'].astype('str').str.replace('\xa0', ' ', regex=False)
df[['Bid', 'Ask']] = df['Bid-Ask'].str.replace('$', '', regex=False).str.split('-', expand=True)
print(df)
Bid-Ask Bid Ask
0 $185.25 - $186.10 185.25 186.10
1 $10.85 - $11.10 10.85 11.10
2 $14.70 - $15.10 14.70 15.10
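As an aside (my sketch, not part of the original answer): since \s in a Unicode regex also matches the non-breaking space \xa0, a single regex split handles both separators, assuming pandas >= 1.4 for the regex flag:
# \s*-\s* matches the hyphen plus surrounding whitespace, including \xa0
df[['Bid', 'Ask']] = (
    df['Bid-Ask'].str.replace('$', '', regex=False)
                 .str.split(r'\s*-\s*', regex=True, expand=True)
)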
You can use apply with a lambda to split the column values in two and slice off the leading $:
df['Bid'] = df['Bid-Ask'].apply(lambda x: x.split('-')[0].strip()[1:])
df['Ask'] = df['Bid-Ask'].apply(lambda x: x.split('-')[1].strip()[1:])
output:
             Bid-Ask     Bid    Ask
0  $185.25 - $186.10  185.25  186.1
1    $10.85 - $11.10   10.85   11.1
2    $14.70 - $15.10   14.70   15.1

How do I create a new column in pandas from the difference of two string columns?

How can I create a new column in pandas that is the result of the difference of two other columns consisting of strings?
I have one column titled "Good_Address" which has entries like "123 Fake Street Apt 101" and another column titled "Bad_Address" which has entries like "123 Fake Street". I want the output in column "Address_Difference" to be " Apt 101".
I've tried doing:
import pandas as pd
data = pd.read_csv("AddressFile.csv")
data['Address Difference'] = data['GOOD_ADR1'].replace(data['BAD_ADR1'],'')
data['Address Difference']
but this does not work. The result is just equal to "123 Fake Street Apt 101" (the good address in the example above).
I've also tried:
data['Address Difference'] = data['GOOD_ADR1'].str.replace(data['BAD_ADR1'],'')
but this yields an error saying 'Series' objects are mutable, thus they cannot be hashed.
Any help would be appreciated.
Thanks
Using replace with regex:
data['Address Difference'] = data['GOOD_ADR1'].replace(regex=r'(?i)' + data['BAD_ADR1'], value='')
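One caveat (my note, not the original answer): the BAD_ADR1 values are used as regex patterns here, so any regex metacharacters in the addresses should be escaped first; a sketch:
import re

# escape regex metacharacters in each address before using it as a pattern
patterns = data['BAD_ADR1'].map(re.escape)
data['Address Difference'] = data['GOOD_ADR1'].replace(regex=r'(?i)' + patterns, value='')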
I'd use a function that we can map across the inputs. This should be fast.
The function uses str.find to see whether the other string is a substring. If str.find returns -1 the substring was not found; otherwise, cut it out using the position where it was found and its length.
def rm(x, y):
    i = x.find(y)
    if i > -1:
        j = len(y)
        return x[:i] + x[i+j:]
    else:
        return x

df['Address Difference'] = [*map(rm, df.GOOD_ADR1, df.BAD_ADR1)]
df
BAD_ADR1 GOOD_ADR1 Address Difference
0 123 Fake Street 123 Fake Street Apt 101 Apt 101
You can remove the bad-address part from the good address:
df['Address_Difference'] = df['Good_Address'].replace(df['Bad_Address'], '', regex = True).str.strip()
Bad_Address Good_Address Address_Difference
0 123 Fake Street 123 Fake Street Apt 101 Apt 101

How can I create a ruleset to assign values to specific columns, based on searching substrings, in Pandas?

I am a complete newbie to Python (and the Pandas library) and need to recreate some SQL code in it.
My task is quite simple on the face of it: I have a few columns, and I need to search them for specific strings; if they exist, then a value is placed in category columns.
e.g.
import pandas as pd
phone_ds = [('IPHONE_3UK_CONTRACT', 968), ('IPHONE_O2_SIMONLY', 155),
            ('ANDROID_3UK_PAYG', 77), ('ANDROID_VODAF_CONTRACT', 973)]
a = pd.DataFrame(data=phone_ds, columns=['Names', 'qty'])
def f(a):
    if a['Names'].str.contains('3UK'):
        company = 'Three'
    if a['Names'].str.contains('iPhone'):
        OS = 'iOS'
    # ... etc
Is there a better (more efficient) way than listing if statements?
How would I go about adding the results into new columns?
Thanks
I'd do it this way:
In [32]: d = {'3UK':'Three', '(?:IPHONE|IPAD).*':'iOS',
'VODAF.*':'Vodafone', 'PAY.*':'PayG'}
In [33]: a[['OS','Company','Payment']] = \
a.Names.str.upper().str.split('_', expand=True).replace(d, regex=True)
In [34]: a
Out[34]:
Names qty OS Company Payment
0 IPHONE_3UK_CONTRACT 968 iOS Three CONTRACT
1 IPHONE_O2_SIMONLY 155 iOS O2 SIMONLY
2 ANDROID_3UK_PAYG 77 ANDROID Three PayG
3 ANDROID_VODAF_CONTRACT 973 ANDROID Vodafone CONTRACT
Found a way to do this, but I'm not sure it's the most efficient. It follows the same logic as I posted above, in that it creates a function with rules: the rules look through a list of pre-defined search words and then populate a new column.
Each column requires its own function, so to add three columns for Phone, Carrier, and Contract Type, I created three functions,
shown below:
android_phones = ['samsung', 'xperia', 'google']
iphone = ['iphone', 'apple']

def OS_rules(raw_Df):
    val = ''
    if any(word in raw_Df['Names'].lower() for word in android_phones):
        val = 'android'
    elif any(word in raw_Df['Names'].lower() for word in iphone):
        val = 'iPhone'
    else:
        val = 'Handset'
    return val

df.loc[:, 'OS_Type'] = df.apply(OS_rules, axis=1)
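For what it's worth, a vectorized alternative (my sketch, not from the thread) avoids the row-wise apply by combining str.contains with np.select; it should give the same result given the same word lists:
import numpy as np

names = df['Names'].str.lower()
conditions = [
    names.str.contains('|'.join(android_phones)),  # any android keyword
    names.str.contains('|'.join(iphone)),          # any iphone keyword
]
df['OS_Type'] = np.select(conditions, ['android', 'iPhone'], default='Handset')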

Pandas - Retrieve Value from df.loc

Using pandas, I have a result (here aresult) from a df.loc lookup that Python tells me is a 'Timeseries'.
sample of predictions.csv:
prediction id
1 593960337793155072
0 991960332793155071
....
code to retrieve one prediction
predictionsfile = pandas.read_csv('predictions.csv')
idtest = 593960337793155072
result = (predictionsfile.loc[predictionsfile['id'] == idtest])
aresult = result['prediction']
aresult retrieves a data format that cannot be keyed:
In: print aresult
11 1
Name: prediction, dtype: int64
I just need the prediction, which in this case is 1. I've tried aresult['result'], aresult[0] and aresult[1], all to no avail. Before I do something awful like converting it to a string and stripping it out, I thought I'd ask here.
A series requires .item() to retrieve its value.
print aresult.item()
1
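Note that .item() raises a ValueError unless the Series holds exactly one element; .iloc[0] (my addition, not in the original answer) takes the first match regardless:
# works even if several rows matched the id; grabs the first prediction
value = aresult.iloc[0]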

Replace WhiteSpace with a 0 in Pandas (Python 3)

simple question here -- how do I replace all of the whitespaces in a column with a zero?
For example:
Name Age
John 12
Mary
Tim 15
into
Name Age
John 12
Mary 0
Tim 15
I've been trying something like this, but I am unsure how Pandas actually reads whitespace:
merged['Age'].replace(" ", 0).bfill()
Any ideas?
merged['Age'] = merged['Age'].apply(lambda x: 0 if x == ' ' else x)
Use the built-in method convert_objects and set the param convert_numeric=True:
In [12]:
# convert objects will handle multiple whitespace, this will convert them to NaN
# we then call fillna to convert those to 0
df.Age = df[['Age']].convert_objects(convert_numeric=True).fillna(0)
df
Out[12]:
Name Age
0 John 12
1 Mary 0
2 Tim 15
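Note that convert_objects was deprecated and later removed from pandas; in current versions the same idea is pd.to_numeric with errors='coerce' (a sketch, assuming the same df):
import pandas as pd

# non-numeric entries (including whitespace) become NaN, then fillna makes them 0
df['Age'] = pd.to_numeric(df['Age'], errors='coerce').fillna(0)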
Here's an answer modified from this more thorough question. I'll make it a little more Pythonic and resolve your basestring issue.
def ws_to_zero(maybe_ws):
    try:
        if maybe_ws.isspace():
            return 0
        else:
            return maybe_ws
    except AttributeError:
        # non-string values have no .isspace(); pass them through unchanged
        return maybe_ws

d.applymap(ws_to_zero)
where d is your dataframe.
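In pandas 2.1+, applymap is deprecated in favor of DataFrame.map, which takes the same function:
# same element-wise application, current API
d = d.map(ws_to_zero)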
If you want to use NumPy, you can use the snippet below:
import numpy as np
df['column_of_interest'] = np.where(df['column_of_interest'] == ' ', 0, df['column_of_interest']).astype(float)
While Paulo's response is excellent, my snippet above may be useful when multiple criteria are required during advanced data manipulation.
