Hello, I have this file:
date;category_name;item_number;item_description;bottlevolume_ml;state_bottle_retail;bottles_sold;volume_sold_gallons
11/04/2015;APRICOT$ BRANDIES;54436;$Mr. Boston Apricot Brandy;750;6.75;12;2.38
03/02/2016;BLENDED WHISKIES;27605;Tin Cup;750;$20.63;2;0.40
02/11/2016;STRAIGHT BOURBON WHISKIES;19067;Jim Beam;1000;$18.89;24;6.34
02/03/2016;AMERICAN COCKTAILS;59154;1800 Ultimate Margarita;1750;$14.25;6;2.77
08/18/2015;VODKA 80 PROOF;35918;Five O'clock Vodka;1750;$10.80;12;5.55
I would like to remove the $ using pandas.
I tried this:
import pandas as pd
import numpy as np
df = pd.read_csv('data2.csv', delimiter=';')
df.date = [x.strip('$') for x in df.date]
df.category_name = [x.strip('$') for x in df.category_name]
df.item_number = [x.strip('$') for x in df.item_number]
But I would like to use pandas to remove the $ from all my columns.
Any ideas?
Thank you!
for c in df.select_dtypes('object').columns:
    df[c] = df[c].str.replace('$', '', regex=False)
Explanation:
If a column contains a '$', it will be an object-type column. It's useful to select only those columns, because you can then use .str.replace (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html) to find all '$' signs in that column and replace each one with an empty string. Passing regex=False treats the '$' as a literal character rather than as a regex end-of-string anchor.
Note that this solution also removes '$' in the middle of the string (in contrast to the .strip method you've used so far).
This should work.
df = df.apply(lambda x: x.str.strip('$') if x.dtype == "object" else x)
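For reference, here is a minimal end-to-end sketch on the sample data above (assuming the rows are saved as 'data2.csv' with ';' as the delimiter, as in your snippet):
import pandas as pd
df = pd.read_csv('data2.csv', delimiter=';')
# remove the literal '$' from every object (string) column
for c in df.select_dtypes('object').columns:
    df[c] = df[c].str.replace('$', '', regex=False)
print(df.head())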
Related
For example, how can I replace <Isis/> with "twins" in the first row, and do the same across the whole table?
I tried to use the following code, but Python raises: "TypeError: replace() argument 1 must be str, not None"
import pandas as pd
import re
df = pd.read_csv('train.csv')
p = re.compile(r'<\w+/>')
df['original'] = df.apply(lambda x: x['original'].replace(
    p.match(x['original']), str(x['edit'])), axis=1)
print(df.head())
Any help would be much appreciated, thank you!
I expect the code can return the DataFrame format, and "France is ‘ hunting down its citizens who joined ’ without trial in Iraq" can be changed to "France is ‘ hunting down its citizens who joined twins ’ without trial in Iraq".
Can you try the following? (Your error comes from passing the result of p.match - a Match object, or None when there is no match - to str.replace, which expects a string.)
import re
df['original'] = df.apply(lambda x: re.sub(r"<.*?>", x['edit'], x['original']), axis=1)
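A quick sanity check on a toy DataFrame (the column names 'original' and 'edit' and the sample sentence are taken from the question above):
import re
import pandas as pd
df = pd.DataFrame({
    'original': ["France is ‘ hunting down its citizens who joined <Isis/> ’ without trial in Iraq"],
    'edit': ["twins"],
})
df['original'] = df.apply(lambda x: re.sub(r"<.*?>", x['edit'], x['original']), axis=1)
print(df['original'][0])
# France is ‘ hunting down its citizens who joined twins ’ without trial in Iraq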
I have read some pricing data into a pandas dataframe; the values appear as:
$40,000*
$40000 conditions attached
I want to strip it down to just the numeric values.
I know I can loop through and apply regex
[0-9]+
to each field and then join the resulting list back together, but is there a non-loopy way?
Thanks
You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit. (Note that this also strips a decimal point, so a value like $40,000.32 would become 4000032; see the next answer if you need to preserve decimals.)
You could use pandas' replace method; also you may want to keep the thousands separator ',' and the decimal place separator '.'
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'] = df['pricing'].replace(to_replace=r"\$([0-9,\.]+).*", value=r"\1", regex=True)
print(df)
pricing
0 40,000.32
1 40000
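The result is still a string column; if you need actual numbers afterwards, a small follow-up sketch (assuming the 'pricing' column from above) is to drop the thousands separator and convert:
# remove the thousands separator, then coerce to a numeric dtype
df['pricing'] = pd.to_numeric(df['pricing'].str.replace(',', '', regex=False))
print(df['pricing'])
# 0    40000.32
# 1    40000.00
# Name: pricing, dtype: float64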
You could remove all the non-digits using re.sub():
value = re.sub(r"[^0-9]+", "", value)
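To apply that per-value helper across a whole column without writing an explicit loop, a sketch (assuming a string column 'P' like the one in the first answer's example):
import re
df['P'] = df['P'].map(lambda value: re.sub(r"[^0-9]+", "", value))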
You don't need regex for this. This should work (convert_objects has since been removed from pandas; pd.to_numeric is the modern equivalent):
df['col'] = pd.to_numeric(df['col'].astype(str), errors='coerce')
In case anyone is still reading this: I was working on a similar problem and needed to replace an entire column of pandas data using a regex I had figured out with re.sub.
To apply this to my entire column, here's the code.
#add_map is rules of replacement for the strings in pd df.
add_map = dict([
("AV", "Avenue"),
("BV", "Boulevard"),
("BP", "Bypass"),
("BY", "Bypass"),
("CL", "Circle"),
("DR", "Drive"),
("LA", "Lane"),
("PY", "Parkway"),
("RD", "Road"),
("ST", "Street"),
("WY", "Way"),
("TR", "Trail"),
])
import re

obj = data_909['Address'].copy()  # data_909['Address'] contains the original addresses
for k, v in add_map.items():  # based on the rules in the dict
    rule1 = r"(\b)(%s)(\b)" % k  # replace k only when it stands alone (\b marks word boundaries)
    rule2 = lambda m: add_map.get(m.group(), m.group())  # look up the matched abbreviation in add_map, falling back to the match itself
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE)  # flags makes the match case-insensitive
data_909['Address_n'] = obj  # store it!
Hope this helps anyone searching for the problem I had. Cheers
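As a design note, the per-key loop can be collapsed into a single pass: build one alternation pattern from the dictionary keys and let a callable do the lookup. A hedged sketch under the same add_map and data_909 assumptions:
import re
# one pattern matching any abbreviation as a whole word, e.g. \b(AV|BV|...)\b
pattern = r"\b(%s)\b" % "|".join(map(re.escape, add_map))
data_909['Address_n'] = data_909['Address'].str.replace(
    pattern,
    lambda m: add_map.get(m.group().upper(), m.group()),  # .upper() so case-insensitive matches still hit the dict
    regex=True,
    flags=re.IGNORECASE,
)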
I have a column "Country" in a data frame, and I would like to group the "Country" column into only two options: "Mainland China" and "Others". I have tried different options, e.g. filter, etc. None of them work. How should I do it?
Here is the dataset https://drive.google.com/file/d/17DY8f-Jxba0Ky5iOUQqEZehhoWNO3vzR/view?usp=sharing
FYI, I have already grouped different provinces in China as one country "Mainland China"
Thanks for your help!
I think the quickest way to change the values would be using .loc instead of apply, since .loc is vectorized in pandas.
df.loc[df.Country != 'Mainland China', 'Country'] = 'Others'
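A quick demonstration on a toy frame (column name 'Country' as in the question):
import pandas as pd
df = pd.DataFrame({'Country': ['Mainland China', 'France', 'Mainland China', 'Iraq']})
df.loc[df.Country != 'Mainland China', 'Country'] = 'Others'
print(df)
#           Country
# 0  Mainland China
# 1          Others
# 2  Mainland China
# 3          Others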
Try (and group by Country):
import numpy as np
df["Country"]=np.where(df["Country"].eq("Mainland China"), "Mainland China", "Other")
Edit
timeit (please note I didn't time .loc[], as a lambda doesn't support assignment - see the sketch after the results below for one way of adding it):
import pandas as pd
import numpy as np
import timeit
from timeit import Timer
#proportion-wise that's the dataframe, as per OP's question
df=pd.DataFrame({"Country": ["Mainland China"]*398+["a", "b","c"]*124})
df["otherCol"]=2
df["otherCol2"]=3
#shuffle
df2=df.copy().sample(frac=1)
df3=df2.copy()
df4=df3.copy()
op2=Timer(lambda: np.where(df2["Country"].eq("Mainland China"), "Mainland China", "Other"))
op3=Timer(lambda: df3.Country.map(lambda x: x if x == 'Mainland China' else 'Others'))
op4=Timer(lambda: df4["Country"].apply(lambda x: x if x == "Mainland China" else "Others"))
print(op2.timeit(number=1000))
print(op3.timeit(number=1000))
print(op4.timeit(number=1000))
Returns:
2.1856687490362674 #numpy
2.2388894270407036 #map
2.4437739049317315 #apply
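As invited above, here is one hedged way to fold the .loc approach into the same timing harness: wrap the assignment in a small function (the per-run copy is included in the measured time, so treat the number as an upper bound):
df5 = df2.copy()
def loc_assign():
    d = df5.copy()  # copy so every timed run starts from the same data
    d.loc[d.Country != 'Mainland China', 'Country'] = 'Others'
    return d
op5 = Timer(loc_assign)
print(op5.timeit(number=1000))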
Try using apply:
dataframe["Country"] = dataframe["Country"].apply(lambda x: x if x == "Mainland China" else "Others")
Assuming df is your pandas dataframe, you could do:
df['Country'] = df.Country.map(lambda x: x if x == 'Mainland China' else 'Others')
I'm reading a CSV file into a DataFrame. I need to strip whitespace from all the string-like cells, leaving the other cells unchanged, in Python 2.7.
Here is what I'm doing:
def remove_whitespace(x):
    if isinstance(x, basestring):
        return x.strip()
    else:
        return x

my_data = my_data.applymap(remove_whitespace)
Is there a better or more idiomatic-to-Pandas way to do this?
Is there a more efficient way (perhaps by doing things column wise)?
I've tried searching for a definitive answer, but most questions on this topic seem to be how to strip whitespace from the column names themselves, or presume the cells are all strings.
Stumbled onto this question while looking for a quick and minimalistic snippet I could use. I had to assemble one myself from the posts above. Maybe someone will find it useful:
data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
You could use pandas' Series.str.strip() method to do this quickly for each string-like column:
>>> data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
>>> data
values
0 ABC
1 DEF
2 GHI
>>> data['values'].str.strip()
0 ABC
1 DEF
2 GHI
Name: values, dtype: object
We want to:
Apply our function to each element in our dataframe - use applymap.
Use type(x)==str (versus x.dtype == 'object') because Pandas will label columns as object for columns of mixed datatypes (an object column may contain int and/or str).
Maintain the datatype of each element (we don't want to convert everything to a str and then strip whitespace).
Therefore, I've found the following to be the easiest:
df.applymap(lambda x: x.strip() if type(x)==str else x)
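If you are on pandas 2.1 or newer, note that applymap has been deprecated in favour of the equivalent DataFrame.map, so the forward-compatible spelling of the same idea is:
df.map(lambda x: x.strip() if type(x) == str else x)  # pandas >= 2.1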
When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.
For example, here's "data.csv":
In [19]: !cat data.csv
1.5, aaa, bbb , ddd , 10 , XXX
2.5, eee, fff , ggg, 20 , YYY
(The first line ends with three spaces after XXX, while the second line ends at the last Y.)
The following uses pandas.read_csv() to read the files, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)
In [20]: import pandas as pd
In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')
In [22]: df
Out[22]:
0 1 2 3 4 5
0 1.5 aaa bbb ddd 10 XXX
1 2.5 eee fff ggg 20 YYY
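For a self-contained variant of the same idea (Python 3 here, reading from a string instead of a file; sep=r'\s*,\s*' also swallows tabs around the commas, and the "python" engine is still required for a regex separator):
import pandas as pd
from io import StringIO
text = "1.5, aaa, bbb , ddd , 10 , XXX\n2.5, eee, fff , ggg, 20 , YYY\n"
df = pd.read_csv(StringIO(text), header=None, sep=r'\s*,\s*', engine='python')
print(df)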
The "data['values'].str.strip()" answer above did not work for me, but I found a simple work around. I am sure there is a better way to do this. The str.strip() function works on Series. Thus, I converted the dataframe column into a Series, stripped the whitespace, replaced the converted column back into the dataframe. Below is the example code.
import pandas as pd
data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
print ('-----')
print (data)
data['values'].str.strip()
print ('-----')
print (data)
new = data['values'].str.strip()
data['values'] = new
print ('-----')
print (new)
Here is a column-wise solution with pandas apply:
import numpy as np

def strip_obj(col):
    if col.dtypes == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col

df = df.apply(strip_obj, axis=0)
This will convert values in object-type columns to strings, so take caution with mixed-type columns: for example, a zip-code column containing the integer 20001 and the string ' 21110 ' will end up holding the strings '20001' and '21110'.
This worked for me - applies it to the whole dataframe:
def panda_strip(x):
    r = []
    for y in x:
        if isinstance(y, str):
            y = y.strip()
        r.append(y)
    return pd.Series(r, index=x.index)  # keep the original index so values stay aligned

df = df.apply(lambda x: panda_strip(x))
I found the following code useful and something that would likely help others. This snippet will allow you to delete spaces in a column as well as in the entire DataFrame, depending on your use case.
import pandas as pd

def remove_whitespace(x):
    try:
        # remove spaces inside and outside of the string
        x = "".join(x.split())
    except AttributeError:
        pass  # non-string values pass through unchanged
    return x

# Apply remove_whitespace to a single column
df.orderId = df.orderId.apply(remove_whitespace)
print(df)

# Apply remove_whitespace to the entire DataFrame
df = df.applymap(remove_whitespace)
print(df)
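For string columns, the same inside-and-outside whitespace removal can be done without a Python-level function via the vectorized string methods (a one-line sketch, assuming orderId holds strings):
df.orderId = df.orderId.str.replace(r'\s+', '', regex=True)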