Pandas evaluate a string ratio into a float - python

I have the following dataframe:
Date Ratios
2009-08-23 2:1
2018-08-22 2:1
2019-10-24 2:1
2020-10-28 3:2
I want to convert the ratios into floats, so 2:1 becomes 2/1 and then its inverse 0.5, and 3:2 becomes 0.66667.
I used the following formula
df['Ratios'] = 1/pd.eval(df['Ratios'].str.replace(':','/'))
But I keep getting this error TypeError: unsupported operand type(s) for /: 'int' and 'list'
What's wrong with my code and how do I fix it?

Don't use pd.eval on the whole Series, because with more than about 100 rows it raises an ugly error, so convert each value separately:
df['Ratios'] = 1/df['Ratios'].str.replace(':','/').apply(pd.eval)
But your error also suggests there are some non-numeric values mixed in with the :.
Error for 100+ rows:
AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'
If that still fails, you can test whether the data are well formed with a custom function:
print (df)
Date Ratios
0 2009-08-23 2:1r
1 2018-08-22 2:1
2 2019-10-24 2:1
3 2020-10-28 3:2
def f(x):
    try:
        pd.eval(x)
        return False
    except:
        return True
df = df[df['Ratios'].str.replace(':','/').apply(f)]
print (df)
Date Ratios
0 2009-08-23 2:1r
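As a possible alternative to the try/except check (this is an extra sketch, not part of the original answer): if you expect every value to look like number:number, a regex test can flag malformed rows directly. The sample frame below is a reconstruction of the data shown above.
import pandas as pd

# Reconstructed sample data, including the malformed '2:1r' value
df = pd.DataFrame({'Date': ['2009-08-23', '2018-08-22', '2019-10-24', '2020-10-28'],
                   'Ratios': ['2:1r', '2:1', '2:1', '3:2']})

# True where the value is two integers separated by ':'
valid = df['Ratios'].str.match(r'^\d+:\d+$')
print(df[~valid])   # malformed rows, e.g. '2:1r'
print(df[valid])    # rows that are safe to convert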

An alternate solution using Series.str.split, if your data is in the correct format:
s = df['Ratios'].str.split(':')
df['Ratios'] = s.str[1].astype(float) / s.str[0].astype(float)
# print(df)
Date Ratios
0 2009-08-23 0.500000
1 2018-08-22 0.500000
2 2019-10-24 0.500000
3 2020-10-28 0.666667

I have written a function which will help you generate the decimal fractions from the ratios.
def convertFraction(x):
    arr = x.split(":")
    fraction = float(arr[1]) / float(arr[0])   # e.g. 2:1 -> 0.5, matching the desired output
    return fraction

df["Fractions"] = df["Ratios"].apply(convertFraction)
print(df)
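For reference, a minimal self-contained sketch (the frame below is a reconstruction of the sample data in the question) showing what the function produces:
import pandas as pd

df = pd.DataFrame({'Date': ['2009-08-23', '2018-08-22', '2019-10-24', '2020-10-28'],
                   'Ratios': ['2:1', '2:1', '2:1', '3:2']})

def convertFraction(x):
    arr = x.split(":")
    return float(arr[1]) / float(arr[0])   # 2:1 -> 0.5, 3:2 -> 0.666...

df["Fractions"] = df["Ratios"].apply(convertFraction)
print(df["Fractions"].tolist())   # [0.5, 0.5, 0.5, 0.6666666666666666]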

Related

Pandas groupby - dataframe's column disappearing

I have the following data frame called "new_df":
dato uttak annlegg Merd ID Leng BW CF F B H K
0 2020-12-15 12_20 LL 3 1 48.0 1200 1.085069 0.0 2.0 0.0 NaN
1 2020-12-15 12_20 LL 3 2 43.0 830 1.043933 0.0 1.0 0.0 NaN
columns are:
'dato', 'uttak', 'annlegg', 'Merd', 'ID', 'Leng', 'BW', 'CF', 'F', 'B', 'H', 'K'
when I do:
new_df.groupby(['annlegg','Merd'],as_index=False).mean()
I get all the means except for the column "BW", like this:
annlegg Merd ID Leng CF F B H K
0 KH 1 42.557143 56.398649 1.265812 0.071770 1.010638 0.600000 0.127907
1 KH 2 42.683794 56.492228 1.270522 0.021978 0.739130 0.230769 0.075862
2 KH 3 42.177866 35.490119 1.125416 0.000000 0.384146 0.333333 0.034483
Column "BW" just disappeared when I groupby, no matter "as_index" True or False, why is that?
It appears the content of the BW column does not have a numerical type but an object type instead, which is used for storing strings, for instance. Thus, when you apply groupby and the mean aggregation function, your column disappears because computing the mean of an object (think of a string) does not make sense in general.
You should start by converting your BW column.
First method: pd.to_numeric
This method will safely convert all the values in your column to floats.
new_df['BW'] = pd.to_numeric(new_df['BW'])
Second method: df.astype
If you do not want to convert your data to float (for instance, you know that the column only contains ints, or floating-point precision does not matter to you), you can use the astype method, which lets you convert to almost any type you want:
new_df['BW'] = new_df['BW'].astype(float) # Converts to float
new_df['BW'] = new_df['BW'].astype(int)   # Converts to integer
You can then apply your groupby and aggregation as you did!
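Putting the pieces together, here is a minimal sketch of the whole flow; the toy frame is hypothetical, only the column names annlegg, Merd and BW are taken from the question.
import pandas as pd

# Hypothetical toy frame where BW arrives as strings (the situation described above)
new_df = pd.DataFrame({'annlegg': ['KH', 'KH', 'LL'],
                       'Merd': [1, 1, 3],
                       'BW': ['1200', '830', 'bad'],
                       'Leng': [48.0, 43.0, 50.0]})

# Convert BW to numeric; unparseable values become NaN and are skipped by mean()
new_df['BW'] = pd.to_numeric(new_df['BW'], errors='coerce')

# BW now appears in the grouped means next to the other numeric columns
print(new_df.groupby(['annlegg', 'Merd'], as_index=False).mean())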
That's probably due to the wrong data type. You can try this.
new_df = new_df.convert_dtypes()
new_df.groupby(['annlegg','Merd'],as_index=False).mean()
You can check the dtypes via:
new_df.dtypes
You can try the .agg() function to target specific columns.
new_df.groupby(['annlegg','Merd']).agg({'BW':'mean'})

How to use: split 'hh:mm:ss' string to milliseconds float in every row of a dataframe in Python?

I'm fairly new to Python and I'm struggling with an ML problem where I want to convert a column of running paces in 'hh:mm:ss' format to milliseconds. The paces are of type string and the milliseconds should be of type float afterwards.
I have figured out how to convert single values with the following function:
import datetime
def convertMinToMs(s):
    hr, min, sec = map(float, s.split(':'))
    milliseconds = ((min * 60) * 1000) + (sec * 1000)
    return milliseconds
millisec = convertMinToMs(dataset['Avg Pace'].iloc[0])
I have no idea how to do that for a whole Series of data. I tried to pass in the Series by removing the .iloc[0], but this results in the following error:
AttributeError: 'Series' object has no attribute 'split'
Shortest possible answer:
dataset['Avg Pace'].apply(convertMinToMs)
Try using the pandas function:
dataset['Avg Pace'] = pd.to_datetime(dataset['Avg Pace'], format="%H:%M:%S")
Then you can get whatever you want from those datetime objects. Hope it works.
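If you go this route, one way to get milliseconds back out of the parsed values is the .dt accessor; a small sketch, assuming (as in the question) that paces never exceed 24 hours and using a hypothetical two-row sample:
import pandas as pd

dataset = pd.DataFrame({'Avg Pace': ['00:15:12', '10:01:30']})  # hypothetical sample

parsed = pd.to_datetime(dataset['Avg Pace'], format='%H:%M:%S')
# Rebuild the duration in milliseconds from the parsed time components
dataset['Avg Pace ms'] = (parsed.dt.hour * 3600
                          + parsed.dt.minute * 60
                          + parsed.dt.second) * 1000.0
print(dataset)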
Convert the column to a DataFrame with 3 columns, cast to floats, and then multiply the second and third columns:
df = dataset['Avg Pace'].str.split(':', expand=True).astype(float)
print (df)
0 1 2
0 0.0 15.0 12.0
1 10.0 1.0 30.0
millisec = ((df[1] * 60)*1000) + (df[2]*1000)
print (millisec)
0 912000.0
1 90000.0
dtype: float64
But if you also need the hours counted in the milliseconds, convert the values to timedeltas with to_timedelta, then to the native nanosecond representation, and divide to get milliseconds:
import numpy as np

millisec = pd.to_timedelta(dataset['Avg Pace']).values.astype(np.int64) / 10**6
print (millisec)
[ 912000. 36090000.]
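A possibly simpler variant of the same idea (a sketch, not part of the original answer) stays in pandas and uses Series.dt.total_seconds instead of dropping to NumPy:
import pandas as pd

dataset = pd.DataFrame({'Avg Pace': ['00:15:12', '10:01:30']})  # hypothetical sample

millisec = pd.to_timedelta(dataset['Avg Pace']).dt.total_seconds() * 1000
print(millisec.tolist())   # [912000.0, 36090000.0]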

Convert DataFrame with 'N/As' to float to compute percent change

I am trying to convert the following DataFrame (which contains several 'N/As') to float so that I can perform a percent change operation:
d = pd.DataFrame({"A":['N/A','$10.00', '$5.00'],
"B":['N/A', '$10.00', '-$5.00']})
Ultimately, I would like the result to be:
(UPDATE: I do not want to remove the original N/A values. I'd like to keep them there as placeholders.)
Because there aren't any flags for dealing with negative numbers, I cannot use:
pct_change(-1)
So, I need to use:
d['A'].diff(-1)/d['A'].shift(-1).abs()
But, I get the error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
For a first step, I am trying to convert the data from object/string to float, but the output is unexpected (to me). I am getting float 'NaNs' instead of the actual number.
>d['A_float'] = pd.to_numeric(d['A'], errors='coerce')
>d
A B A_float
0 N/A N/A NaN
1 $10.00 -$100.00 NaN
2 $5.00 -$5.00 NaN
>d.dtypes
A object
B object
A_float float64
dtype: object
As a simple test, I tried subtracting '1' from the value, but still got float 'NaN'.
>d['A_float_minus1_test'] = pd.to_numeric(d['A'], errors='coerce')-1
>d
A B A_float A_float_minus1_test
0 N/A N/A NaN NaN
1 $10.00 -$100.00 NaN NaN
2 $5.00 -$5.00 NaN NaN
>d.dtypes
A object
B object
A_float float64
A_float_minus1_test float64
dtype: object
Is there a simple way to get the following result? The way I am thinking is to individually change each DataFrame column to float, then perform the operation. There must be an easier way.
Desired output:
(UPDATE: I do not want to remove the original N/A values. I'd like to keep them there as placeholders.)
Thanks!
To convert your columns from string to float, you can use apply, like so:
d['A_float'] = d['A'].apply(lambda x: float(x.replace('$', '')) if x != 'N/A' else float('nan'))
The replace('$', '') strips the $ character while keeping a possible leading minus sign, and the 'N/A' placeholders become NaN.
Then I am not sure what you are trying to do, but if you are trying to compute the percentage of A from B, you can use np.vectorize like this (after converting B the same way, here into a hypothetical B_float column):
import numpy as np

def percent(p1, p2):
    return (100 * p2) / p1

d['Percent'] = np.vectorize(percent)(d['A_float'], d['B_float'])
import pandas as pd
d = pd.DataFrame({"A":['N/A','$10.00', '$5.00'],
"B":['N/A', '$10.00', '-$5.00']})
# Convert to numbers: remove '$', coerce to numeric, assign to new columns
d[['dA','dB']] = d[['A','B']].apply(lambda s: s.str.replace('$', '', regex=False)).apply(pd.to_numeric, errors='coerce')
# Perform calculations across desired column
d[['dA','dB']] = d[['dA','dB']].diff(-1)/d[['dA','dB']].shift(-1).abs()
print(d)
A B dA dB
0 N/A N/A NaN NaN
1 $10.00 $10.00 1.0 3.0
2 $5.00 -$5.00 NaN NaN

Pandas convert data type from object to float

I read some weather data from a .csv file into a dataframe named "weather". The problem is that the data type of one of the columns is object. This is weird, as it holds temperatures. How do I change it to a float data type? I tried to_numeric, but it can't parse it.
weather.info()
weather.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 304 entries, 2017-01-01 to 2017-10-31
Data columns (total 2 columns):
Temp 304 non-null object
Rain 304 non-null float64
dtypes: float64(1), object(1)
memory usage: 17.1+ KB
Temp Rain
Date
2017-01-01 12.4 0.0
2017-02-01 11 0.6
2017-03-01 10.4 0.6
2017-04-01 10.9 0.2
2017-05-01 13.2 0.0
You can use pandas.Series.astype.
You can do something like this:
weather["Temp"] = weather.Temp.astype(float)
You can also use pd.to_numeric, which will convert the column from object to float.
For details on how to use it, check out this link: http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_numeric.html
Example:
s = pd.Series(['apple', '1.0', '2', -3])
print(pd.to_numeric(s, errors='ignore'))
print("=========================")
print(pd.to_numeric(s, errors='coerce'))
Output:
0 apple
1 1.0
2 2
3 -3
dtype: object
=========================
0 NaN
1 1.0
2 2.0
3 -3.0
dtype: float64
In your case you can do something like this:
weather["Temp"] = pd.to_numeric(weather.Temp, errors='coerce')
Another option is to use convert_objects (note that it is deprecated in newer pandas versions).
An example is as follows:
>> pd.Series([1,2,3,4,'.']).convert_objects(convert_numeric=True)
0 1
1 2
2 3
3 4
4 NaN
dtype: float64
You can use this as follows:
weather["Temp"] = weather.Temp.convert_objects(convert_numeric=True)
I have shown you these examples because if any value in your column is not a number it will be converted to NaN, so be careful while using it.
I tried all the methods suggested here but sadly none worked. Instead, I found this to work:
df['column'] = pd.to_numeric(df['column'],errors = 'coerce')
And then check it using:
print(df.info())
I eventually used:
weather["Temp"] = weather["Temp"].convert_objects(convert_numeric=True)
It worked just fine, except that I got the following message.
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: FutureWarning:
convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
You can try the following:
df['column'] = df['column'].map(lambda x: float(x))
First check your data, because you may get an error if you have ',' instead of '.' as the decimal separator.
If so, you need to transform every ',' into '.' with a function:
def replacee(s):
    i = str(s).find(',')
    if i > 0:
        return s[:i] + '.' + s[i+1:]
    else:
        return s
Then you need to apply this function to every row in your column:
dfOPA['Montant']=dfOPA['Montant'].apply(replacee)
Then the conversion will work fine:
dfOPA['Montant'] = pd.to_numeric(dfOPA['Montant'],errors = 'coerce')
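If you prefer to stay vectorized, the same comma-to-dot cleanup can be done with Series.str.replace before converting; a sketch reusing the Montant column name from above, with hypothetical values:
import pandas as pd

dfOPA = pd.DataFrame({'Montant': ['1,50', '2', 'bad']})  # hypothetical sample

dfOPA['Montant'] = pd.to_numeric(
    dfOPA['Montant'].str.replace(',', '.', regex=False),
    errors='coerce')
print(dfOPA)   # 1.5, 2.0, NaN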
E.g., for converting a '$40,000.00' object to 40000 as int or float32, follow these steps:
1. Remove '$': $40,000.00 -> 40,000.00
2. Remove ',': 40,000.00 -> 40000.00
3. Remove '.': 40000.00 -> 4000000
4. Remove empty spaces: 4000000 -> 4000000
5. Drop NA values: 4000000
6. The column is still of object type, so convert it with .astype(int): 4000000
7. Divide by 100: 40000
Implementing the code in pandas:
table1["Price"] = table1["Price"].str.replace('$', '', regex=False)
table1["Price"] = table1["Price"].str.replace(',', '', regex=False)
table1["Price"] = table1["Price"].str.replace('.', '', regex=False)
table1["Price"] = table1["Price"].str.replace(' ', '', regex=False)
table1 = table1.dropna()
table1["Price"] = table1["Price"].astype(int)
table1["Price"] = table1["Price"] / 100
Finally it's done
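The same cleanup can also be written as one chained expression; a sketch over a hypothetical table1, which keeps the decimal point and casts straight to float instead of stripping the dot and dividing by 100:
import pandas as pd

table1 = pd.DataFrame({'Price': ['$40,000.00', '$1,250.50', None]})  # hypothetical data

table1['Price'] = (table1['Price']
                   .str.replace('$', '', regex=False)
                   .str.replace(',', '', regex=False)
                   .astype(float))
table1 = table1.dropna()
print(table1)   # Price as plain floats: 40000.0 and 1250.5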

time delta in pandas dataframe

I have a question regarding how to create a day-count column in pandas. Given a list of dates, I want to calculate the difference from one date to the previous date in days. I can do this with simple subtraction, and it returns a timedelta object, I think. What if I just want an integer number of days? Using .days seems to work with two dates, but I can't get it to work with a column.
Let's say I do,
df['day_count'] = (df['INDEX_DATE'] - df['INDEX_DATE'].shift(1))
INDEX_DATE day_count
0 2009-10-06 NaT
1 2009-10-07 1 days
2 2009-10-08 1 days
3 2009-10-09 1 days
4 2009-10-12 3 days
5 2009-10-13 1 days
I get '1 days'....I only want 1.
I can use .days like this, which does return a number, but it won't work on an entire column.
(df['INDEX_DATE'][1] - df['INDEX_DATE'][0]).days
If I try something like this:
df['day_count'] = (df['INDEX_DATE'] - df['INDEX_DATE'].shift(1)).days
I get an error of
AttributeError: 'Series' object has no attribute 'days'
I can work around '1 days' but I'm thinking there must be a better way to do this.
Try this:
In [197]: df['day_count'] = df.INDEX_DATE.diff().dt.days
In [198]: df
Out[198]:
INDEX_DATE day_count
0 2009-10-06 NaN
1 2009-10-07 1.0
2 2009-10-08 1.0
3 2009-10-09 1.0
4 2009-10-12 3.0
5 2009-10-13 1.0
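If you want plain integers rather than 1.0-style floats (the leading NaT forces a float dtype), one option is pandas' nullable Int64 dtype; a small sketch on a reconstruction of the sample dates:
import pandas as pd

# Reconstructed sample dates from the question
df = pd.DataFrame({'INDEX_DATE': pd.to_datetime(
    ['2009-10-06', '2009-10-07', '2009-10-08', '2009-10-09', '2009-10-12', '2009-10-13'])})

# Nullable Int64 keeps the first missing value as <NA> instead of turning 1 into 1.0
df['day_count'] = df.INDEX_DATE.diff().dt.days.astype('Int64')
print(df)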
