Convert DataFrame with 'N/As' to float to compute percent change - python

I am trying to convert the following DataFrame (which contains several 'N/A' values) to float so that I can perform a percent change operation:
d = pd.DataFrame({"A":['N/A','$10.00', '$5.00'],
                  "B":['N/A', '$10.00', '-$5.00']})
Ultimately, I would like the result to be:
(UPDATE: I do not want to remove the original N/A values. I'd like to keep them there as placeholders.)
Because there aren't any flags for dealing with negative numbers, I cannot use:
pct_change(-1)
So, I need to use:
d['A'].diff(-1)/d['A'].shift(-1).abs()
But, I get the error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
For a first step, I am trying to convert the data from object/string to float, but the output is unexpected (to me). I am getting float 'NaNs' instead of the actual number.
>d['A_float'] = pd.to_numeric(d['A'], errors='coerce')
>d
A B A_float
0 N/A N/A NaN
1 $10.00 $10.00 NaN
2 $5.00 -$5.00 NaN
>d.dtypes
A object
B object
A_float float64
dtype: object
As a simple test, I tried subtracting '1' from the value, but still got float 'NaN'.
>d['A_float_minus1_test'] = pd.to_numeric(d['A'], errors='coerce')-1
>d
A B A_float A_float_minus1_test
0 N/A N/A NaN NaN
1 $10.00 $10.00 NaN NaN
2 $5.00 -$5.00 NaN NaN
>d.dtypes
A object
B object
A_float float64
A_float_minus1_test float64
dtype: object
Is there a simple way to get the following result? The way I am thinking is to individually change each DataFrame column to float, then perform the operation. There must be an easier way.
Desired output:
Thanks!

To convert your columns from string to float, you can use apply, like such:
d['A_float'] = d['A'].apply(lambda x: float(x.replace('$', '')) if x != 'N/A' else float('nan'))
The x.replace('$', '') removes the $ character while keeping any leading minus sign, and the 'N/A' entries become NaN so they stay in place as placeholders.
Then I am not sure what you are trying to do, but if you are trying to compute the percentage of A from B, you can use np.vectorize like this (after converting both columns the same way):
def percent(p1, p2):
    return (100 * p2) / p1

d['Percent'] = np.vectorize(percent)(d['A'], d['B'])
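A minimal runnable sketch of that vectorize call, with invented float values standing in for the '$' columns after conversion:

```python
import numpy as np
import pandas as pd

def percent(p1, p2):
    return (100 * p2) / p1

# Invented float data, standing in for the converted columns.
d = pd.DataFrame({'A': [10.0, 5.0], 'B': [10.0, -5.0]})

# np.vectorize broadcasts the scalar function over both Series.
d['Percent'] = np.vectorize(percent)(d['A'], d['B'])
print(d)
```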

import pandas as pd
d = pd.DataFrame({"A":['N/A','$10.00', '$5.00'],
"B":['N/A', '$10.00', '-$5.00']})
# Convert to numbers: remove '$' (regex=False, since '$' is a regex anchor), assign to new columns
d[['dA','dB']] = d[['A','B']].apply(lambda s: s.str.replace('$', '', regex=False)).apply(pd.to_numeric, errors='coerce')
# Perform the percent-change calculation across the desired columns
d[['dA','dB']] = d[['dA','dB']].diff(-1)/d[['dA','dB']].shift(-1).abs()
print(d)
A B dA dB
0 N/A N/A NaN NaN
1 $10.00 $10.00 1.0 3.0
2 $5.00 -$5.00 NaN NaN

Related

Pandas groupby - dataframe's column disappearing

I have the following data frame called "new_df":
dato uttak annlegg Merd ID Leng BW CF F B H K
0 2020-12-15 12_20 LL 3 1 48.0 1200 1.085069 0.0 2.0 0.0 NaN
1 2020-12-15 12_20 LL 3 2 43.0 830 1.043933 0.0 1.0 0.0 NaN
columns are:
'dato', 'uttak', 'annlegg', 'Merd', 'ID', 'Leng', 'BW', 'CF', 'F', 'B', 'H', 'K'
when I do:
new_df.groupby(['annlegg','Merd'],as_index=False).mean()
I got all means except the column "BW" like this:
annlegg Merd ID Leng CF F B H K
0 KH 1 42.557143 56.398649 1.265812 0.071770 1.010638 0.600000 0.127907
1 KH 2 42.683794 56.492228 1.270522 0.021978 0.739130 0.230769 0.075862
2 KH 3 42.177866 35.490119 1.125416 0.000000 0.384146 0.333333 0.034483
Column "BW" just disappeared when I groupby, no matter "as_index" True or False, why is that?
It appears the content of the BW column does not have a numerical type but an object type instead, which is used for storing strings, for instance. Thus, when applying the groupby and mean aggregation, your column disappears, as computing the mean value of an object (think of a string) does not make sense in general.
You should start by converting your BW column:
First method: pd.to_numeric
This method will safely convert your whole column to float objects.
new_df['BW'] = pd.to_numeric(new_df['BW'])
Second method: df.astype
If you do not want to convert your data to float (for instance, if you know the column only contains ints, or if floating-point precision does not matter to you), you can use the astype method, which allows you to convert to almost any type you want:
new_df['BW'] = new_df['BW'].astype(float)  # Converts to float
new_df['BW'] = new_df['BW'].astype(int)    # Converts to integer
You can then apply your groupby and aggregation as you did!
That's probably due to the wrong data type. You can try this.
new_df = new_df.convert_dtypes()
new_df.groupby(['annlegg','Merd'],as_index=False).mean()
You can check the dtypes via:
new_df.dtypes
You can try .agg() function to target specific columns.
new_df.groupby(['annlegg','Merd']).agg({'BW':'mean'})
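As a minimal, hypothetical reproduction of the behaviour described above (column names borrowed from the question, data invented): an object-dtype column is excluded from, or in recent pandas rejected by, a numeric aggregation until it is converted.

```python
import pandas as pd

# Hypothetical data shaped like the question's frame: 'BW' is stored as
# strings, so its dtype is object rather than a numeric type.
new_df = pd.DataFrame({
    'annlegg': ['KH', 'KH', 'LL'],
    'Merd': [1, 1, 2],
    'BW': ['1200', '830', '900'],
})

# Convert 'BW' to a numeric dtype first...
new_df['BW'] = pd.to_numeric(new_df['BW'], errors='coerce')

# ...then the groupby mean includes it.
out = new_df.groupby(['annlegg', 'Merd'], as_index=False).mean()
print(out)
```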

Pandas evaluate a string ratio into a float

I have the following dataframe:
Date Ratios
2009-08-23 2:1
2018-08-22 2:1
2019-10-24 2:1
2020-10-28 3:2
I want to convert the ratios into floats: 2:1 should become 1/(2/1) = 0.5, and 3:2 should become 0.66667.
I used the following formula
df['Ratios'] = 1/pd.eval(df['Ratios'].str.replace(':','/'))
But I keep getting this error TypeError: unsupported operand type(s) for /: 'int' and 'list'
What's wrong with my code and how do I fix it?
Don't use pd.eval on a whole Series: with more than about 100 rows it returns an ugly error, so convert each value separately instead:
df['Ratios'] = 1/df['Ratios'].str.replace(':','/').apply(pd.eval)
Your error also suggests there are some non-numeric values mixed in with the ':'.
The error pd.eval raises for 100+ rows:
AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'
If it still does not work and raises an error, you can test whether the data is well formed with a custom function:
print (df)
Date Ratios
0 2009-08-23 2:1r
1 2018-08-22 2:1
2 2019-10-24 2:1
3 2020-10-28 3:2
def f(x):
    try:
        pd.eval(x)
        return False
    except:
        return True
df = df[df['Ratios'].str.replace(':','/').apply(f)]
print (df)
Date Ratios
0 2009-08-23 2:1r
An alternate solution uses Series.str.split, if your data is in the correct format:
s = df['Ratios'].str.split(':')
df['Ratios'] = s.str[1].astype(float) / s.str[0].astype(float)
# print(df)
Date Ratios
0 2009-08-23 0.500000
1 2018-08-22 0.500000
2 2019-10-24 0.500000
3 2020-10-28 0.666667
I have written a function which will help you generate the decimal fractions from the ratios (note arr[1]/arr[0], to match your desired output of 0.5 for 2:1).
def convertFraction(x):
    arr = x.split(":")
    fraction = float(arr[1]) / float(arr[0])
    return fraction

df["Fractions"] = df["Ratios"].apply(convertFraction)
print(df)

Pandas join/fillna of two data frames replaces all values, not only NaN

The following code updates the number of items in stock based on the index. The table dr with the old stock holds >1000 values. The updated data frame grp1 contains the number of sold items. I would like to subtract data frame grp1 from data frame dr and update dr. Everything is fine until I join grp1 to dr with pandas' join and fillna: the datatypes are changed from int to float, and not only the NaN values but also the non-null values are replaced by 0. Is this a problem with non-matching indices?
I tried to make the dtypes uniform, but that did not change anything. Removing fillna while joining the two dataframes returns NaN for all columns.
dr has the following format (example):
druck_pseudonym lager_nr menge_im_lager
80009359 62808 1
80009360 62809 10
80009095 62810 0
80009364 62811 11
80009365 62812 10
80008572 62814 10
80009072 62816 18
80009064 62817 13
80009061 62818 2
80008725 62819 3
80008940 62820 12
dr.dtypes
lager_nr int64
menge_im_lager int64
dtype: object
and grp1 (example):
LagerArtikelNummer1 ArtMengen1
880211066 1
80211070 1
80211072 2
80211073 2
80211082 2
80211087 4
80211091 1
80211107 2
88889272 1
88889396 1
ArtMengen1 int64
dtype: object
#update list with "nicht_erledigt"
dr_update = dr.join(grp1).fillna(0)
dr_update["menge_im_lager"] = dr_update["menge_im_lager"] - dr_update["ArtMengen1"]
This returns:
lager_nr menge_im_lager ArtMengen1
druck_pseudonym
80009185 44402 26.0 0.0
80009184 44403 2.0 0.0
80009182 44405 16.0 0.0
80008894 44406 32.0 0.0
80008115 44407 3.0 0.0
80008974 44409 16.0 0.0
80008380 44411 4.0 0.0
dr_update.dtypes
lager_nr int64
menge_im_lager float64
ArtMengen1 float64
dtype: object
Edit after comment: the indices are objects.
Your indices are string objects. You need to convert these to numeric. Use
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)
dr = dr.sort_index()
grp1 = grp1.sort_index()
Then try the rest...
You can filter the old stock 'dr' dataframe to match the sold stock, then subtract, and assign back to the original filtered dataframe.
# Filter the old stock dataframe so that you have matching index to the sold dataframe.
# Restrict just for menge_im_lager. Then subtract the sold stock
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] = (
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] - grp1["ArtMengen1"]
)
If I understand correctly, you want the non-matching indices to be in your final dataset, and you want the final dataset to contain integers. You can use an 'outer' join and astype(int) for that.
So, at the join you can do it this way:
dr.join(grp1,how='outer').fillna(0).astype(int)
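A tiny sketch of that outer-join approach on invented data (the index values and row counts are assumptions, mirroring the question's column names):

```python
import pandas as pd

# Invented miniature stock table (dr) and sold-items table (grp1),
# sharing the same kind of index.
dr = pd.DataFrame({'menge_im_lager': [10, 5, 8]},
                  index=['80009359', '80009360', '80009364'])
grp1 = pd.DataFrame({'ArtMengen1': [3, 2]},
                    index=['80009360', '80009364'])

# Outer join keeps rows with no match; fillna(0) makes the subtraction
# safe; astype(int) restores integer dtype after the NaN-induced floats.
dr_update = dr.join(grp1, how='outer').fillna(0).astype(int)
dr_update['menge_im_lager'] -= dr_update['ArtMengen1']
print(dr_update)
```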

Pandas convert data type from object to float

I read some weather data from a .csv file as a dataframe named "weather". The problem is that the data type of one of the columns is object. This is weird, as it indicates temperature. How do I change it to having a float data type? I tried to_numeric, but it can't parse it.
weather.info()
weather.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 304 entries, 2017-01-01 to 2017-10-31
Data columns (total 2 columns):
Temp 304 non-null object
Rain 304 non-null float64
dtypes: float64(1), object(1)
memory usage: 17.1+ KB
Temp Rain
Date
2017-01-01 12.4 0.0
2017-02-01 11 0.6
2017-03-01 10.4 0.6
2017-04-01 10.9 0.2
2017-05-01 13.2 0.0
You can use pandas.Series.astype
You can do something like this :
weather["Temp"] = weather.Temp.astype(float)
You can also use pd.to_numeric that will convert the column from object to float
For details on how to use it checkout this link :http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_numeric.html
Example :
s = pd.Series(['apple', '1.0', '2', -3])
print(pd.to_numeric(s, errors='ignore'))
print("=========================")
print(pd.to_numeric(s, errors='coerce'))
Output:
0 apple
1 1.0
2 2
3 -3
dtype: object
=========================
0 NaN
1 1.0
2 2.0
3 -3.0
dtype: float64
In your case you can do something like this:
weather["Temp"] = pd.to_numeric(weather.Temp, errors='coerce')
Another option is to use convert_objects.
An example is as follows:
>> pd.Series([1,2,3,4,'.']).convert_objects(convert_numeric=True)
0 1
1 2
2 3
3 4
4 NaN
dtype: float64
You can use this as follows:
weather["Temp"] = weather.Temp.convert_objects(convert_numeric=True)
I have shown you examples because, if any value in your column is not a number, it will be converted to NaN, so be careful while using it.
I tried all methods suggested here but sadly none worked. Instead, found this to be working:
df['column'] = pd.to_numeric(df['column'],errors = 'coerce')
And then check it using:
print(df.info())
I eventually used:
weather["Temp"] = weather["Temp"].convert_objects(convert_numeric=True)
It worked just fine, except that I got the following message.
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: FutureWarning:
convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
You can try the following:
df['column'] = df['column'].map(lambda x: float(x))
First check your data, because you may get an error if you have ',' instead of '.';
if so, you need to transform every ',' into '.' with a function :
def replacee(s):
    i = str(s).find(',')
    if i > 0:
        return s[:i] + '.' + s[i+1:]
    else:
        return s
then you need to apply this function on every row in your column :
dfOPA['Montant']=dfOPA['Montant'].apply(replacee)
then the convert function will work fine :
dfOPA['Montant'] = pd.to_numeric(dfOPA['Montant'],errors = 'coerce')
E.g., for converting a $40,000.00 object to 40000 as int or float32, follow these steps:
$40,000.00 --(1. remove $)--> 40,000.00 --(2. remove the comma)--> 40000.00 --(3. remove the dot)--> 4000000 --(4. remove empty space)--> 4000000 --(5. remove NA values)--> 4000000 --(6. still object type, so convert to int using .astype(int))--> 4000000 --(7. divide by 100)--> 40000
Implementing the code in pandas (regex=False keeps '$' and '.' from being treated as regex patterns):
table1["Price"] = table1["Price"].str.replace('$', '', regex=False)
table1["Price"] = table1["Price"].str.replace(',', '', regex=False)
table1["Price"] = table1["Price"].str.replace('.', '', regex=False)
table1["Price"] = table1["Price"].str.replace(' ', '', regex=False)
table1 = table1.dropna()
table1["Price"] = table1["Price"].astype(int)
table1["Price"] = table1["Price"] / 100
Finally it's done.
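The same cleaning can be collapsed into one chain; dropping '$' and ',' and converting straight to float avoids the remove-dot/divide-by-100 detour (the data below is invented):

```python
import pandas as pd

# Invented sample column of currency strings.
table1 = pd.DataFrame({'Price': ['$40,000.00', '$1,250.50']})

table1['Price'] = (
    table1['Price']
    .str.replace('$', '', regex=False)  # 1. drop the dollar sign
    .str.replace(',', '', regex=False)  # 2. drop thousands separators
    .astype(float)                      # 3. parse the remainder directly
)
print(table1)
```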

Pandas converting column of strings and NaN (floats) to integers, keeping the NaN [duplicate]

This question already has answers here:
Convert Pandas column containing NaNs to dtype `int`
(27 answers)
Closed 3 years ago.
I have problems converting a column which contains both two-digit numbers in string format (type: str) and NaN (type: float64). I want to obtain a new column made this way: NaN where there was NaN, and an integer number where there was a two-digit number in string format.
As an example: I want to obtain column YearBirth2 from column YearBirth1 like this:
YearBirth1 #numbers here are formatted as strings: type(YearBirth1[0])=str
34 # and NaN are floats: type(YearBirth1[2])=float64.
76
Nan
09
Nan
91
YearBirth2 #numbers here are formatted as integers: type(YearBirth2[0])=int
34 #NaN can remain floats as they were.
76
Nan
9
Nan
91
I have tried this:
csv['YearBirth2'] = (csv['YearBirth1']).astype(int)
And as I expected i got this error:
ValueError: cannot convert float NaN to integer
So I tried this:
csv['YearBirth2'] = (csv['YearBirth1']!=NaN).astype(int)
And got this error:
NameError: name 'NaN' is not defined
Finally I have tried this:
csv['YearBirth2'] = (csv['YearBirth1']!='NaN').astype(int)
NO error, but when I checked the column YearBirth2, this was the result:
YearBirth2:
1
1
1
1
1
1
Very bad... I think the idea is right, but there is a problem in making Python understand what I mean by NaN. Or maybe the method I tried is wrong.
I also used the pd.to_numeric() method, but that way I obtain floats, not integers.
Any help?!
Thanks to everyone!
P.S: csv is the name of my DataFrame;
Sorry if I am not so clear; I am still improving my English!
You can use to_numeric, but it is impossible to get int with NaN values: they are always converted to float; see na type promotions.
df['YearBirth2'] = pd.to_numeric(df.YearBirth1, errors='coerce')
print (df)
YearBirth1 YearBirth2
0 34 34.0
1 76 76.0
2 Nan NaN
3 09 9.0
4 Nan NaN
5 91 91.0
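For completeness: since pandas 0.24 there is a nullable integer dtype, 'Int64' (capital I), which can hold missing values, so the column can be integer and still keep its NaNs (the data below mirrors the question's example):

```python
import pandas as pd

# Strings plus missing values, as in the question.
s = pd.Series(['34', '76', None, '09', None, '91'])

# to_numeric yields float64 (NaN forces float); the nullable 'Int64'
# dtype then stores proper integers alongside pd.NA placeholders.
out = pd.to_numeric(s, errors='coerce').astype('Int64')
print(out)
```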
