Calculations with groupby objects in pandas (Python)

I would like to take two grouped Series objects and do calculations between them.
Series objects below:
Cost
ID yy
312 13 102429.610000
361 15 170526.000000
373 14 400000.000000
403 13 165000.000000
14 165000.000000
15 183558.720000
16 133763.760980
17 121301.930160
Percentage
ID yy
312 13 21.687500
361 15 33.181818
373 14 12.439024
403 13 22.966667
14 22.966667
15 24.142857
16 23.333333
17 36.666667
cost=df.groupby(['ID', 'yy'])['cost']
percentage=df.groupby(['ID', 'yy'])['percentage']
I essentially want to calculate cost * percentage.
How is this done correctly? The error is: unsupported operand type(s) for *: 'SeriesGroupBy' and 'SeriesGroupBy'.

You are using groupby without any aggregate function, which returns a GroupBy object, NOT a Series.
You need:
cost = df.set_index(['ID', 'yy'])['cost']
pct = df.set_index(['ID', 'yy'])['percentage']
cost.mul(pct / 100)
ID yy
312 13 22214.421669
361 15 56583.626963
373 14 49756.096000
403 13 37895.000550
14 37895.000550
15 44316.319281
16 31211.543783
17 44477.374796

Is this what you need?
pct.mul(cost)/100
Out[332]:
ID yy
312 13 22214.421669
361 15 56583.626963
373 14 49756.096000
403 13 37895.000550
14 37895.000550
15 44316.319281
16 31211.543783
17 44477.374796
Name: V, dtype: float64

You can directly multiply cost and percentage here because the indices, i.e. ID and yy, are the same for both.
So, once cost and percentage are plain Series rather than groupby objects,
percentage.mul(cost) should work.
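For instance, a minimal sketch assuming both columns live in one DataFrame df, with the lowercase column names from your code:
cost = df.set_index(['ID', 'yy'])['cost']
percentage = df.set_index(['ID', 'yy'])['percentage']
result = percentage.mul(cost) / 100  # the (ID, yy) indices align automatically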

You made a mistake in treating two series from the same (grouped) df as two different objects. So just reuse a single groupby object (note that a groupby is not a context manager, and you need an aggregation such as first() before multiplying):
dfg = df.groupby(['ID', 'yy'])
dfg['cost'].first() * dfg['percentage'].first()  # you have to assign or write the output
You can probably even reduce this to a one-liner; if you post reproducible data I'll post it. In fact, as @Neo showed, something like:
df.groupby(['ID', 'yy']).percentage.mul(cost)
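And if every (ID, yy) pair occurs only once, as in your sample, no groupby is needed at all (a sketch, again assuming the lowercase column names):
result = df['cost'] * df['percentage'] / 100  # plain row-wise product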

Related

Pandas apply polyfit to a series against a value of the series

I'm new to the Pandas world and it has been hard to stop thinking sequentially.
I have a Series like:
df['sensor'].head(30)
0 6.8855
1 6.8855
2 6.8875
3 6.8885
4 6.8885
5 6.8895
6 6.8895
7 6.8895
8 6.8905
9 6.8905
10 6.8915
11 6.8925
12 6.8925
13 6.8925
14 6.8925
15 6.8925
16 6.8925
17 6.8925
Name: int_price, dtype: float64
I want to calculate the polynomial fit of the first value against each of the others and then find the average. I defined a function to do the calculation, and I want to apply it to the Series.
The function:
def linear_trend(a, b):
    return np.polyfit([1, 2], [a, b], 1)
The application:
a = pd.Series(df_plot['sensor'].iloc[0] for x in range(len(df_plot.index)))
df['ref'] = df_plot['sensor'].apply(lambda df_plot: linear_trend(a,df['sensor']))
This returns TypeError: No loop matching the specified signature and casting was found for ufunc lstsq_m.
or this:
a = df_plot['sensor'].iloc[0]
df['ref'] = df_plot['sensor'].apply(lambda df_plot: linear_trend(a,df['sensor']))
That returns ValueError: setting an array element with a sequence.
How can I solve this?
I was able to work around my issue by doing the following:
a = pd.Series(data=(df_plot['sensor'].iloc[0] for x in range(len(df_plot.index))), name='sensor_ref')
df_poly = pd.concat([a,df_plot['sensor']],axis=1)
df_plot['slope'] = df_poly[['sensor_ref','sensor']].apply(lambda df_poly: linear_trend(df_poly['sensor_ref'],df_poly['sensor']), axis=1)
If you have a better method, it's welcome.
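One possible improvement, as a sketch assuming the same df_plot as above: for a degree-1 fit through only two points x = [1, 2], np.polyfit reduces algebraically to slope = b - a and intercept = 2*a - b, so the whole thing can be vectorised without apply:
import numpy as np

# reference value: the first sensor reading (as in your workaround)
a = df_plot['sensor'].iloc[0]
# np.polyfit([1, 2], [a, b], 1) returns [b - a, 2*a - b] for each b,
# so both coefficients can be computed directly on the Series
df_plot['slope'] = df_plot['sensor'] - a
df_plot['intercept'] = 2 * a - df_plot['sensor']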

How to round a DataFrame with no index in Dask?

I was trying to merge two DataFrames with float-type series in Dask (due to memory issues I can't use pure Pandas). From the post, I found that there can be issues when merging float-type columns, so I followed the answer in that post: multiply the x, y, z values by 100 and convert them to int.
x y z R G B
39020.470001199995750 33884.200004600003012 36.445701600000000 25 39 26
39132.740005500003463 33896.049995399996988 30.405698800000000 19 24 18
39221.059997600001225 33787.050003099997411 26.605699500000000 115 145 145
39237.370010400001775 33773.019996599992737 30.205699900000003 28 33 37
39211.370010400001775 33848.270004300000437 32.535697900000002 19 28 25
What I did
N = 100
df2.x = np.round(df2.x*N).astype(int)
df2.head()
But since this DataFrame has no index, it results in an error message:
local variable 'index' referenced before assignment
Expected answer
x y z R G B
3902047 3388420 3644 25 39 26
I was having the same problem and got it to work this way:
df2.x = (df2.x*N).round().astype(int)
If you need to round to a specific decimal:
(df2.x*N).round(2)
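Put together, a minimal sketch (assuming df2 is a Dask DataFrame, e.g. from dd.read_csv, and N = 100 as in the question):
N = 100
df2['x'] = (df2['x'] * N).round().astype(int)
df2['y'] = (df2['y'] * N).round().astype(int)
df2['z'] = (df2['z'] * N).round().astype(int)
df2.head()  # triggers computation on the first partition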

How to plot a Matplotlib chart which takes values from different columns

This is my dataframe
Order Time Profit
0 1 106 NaN
1 1 111 -296.0
2 2 14 NaN
3 2 16 -296.0
4 3 62 NaN
.. ... ... ...
335 106 32 -297.6
336 107 44 NaN
337 107 44 138.0
338 108 58 NaN
339 108 63 -303.4
I want to plot a chart where X is the time and Y is the profit value (positive or negative), so we need to have two bars. Now, the time should not come from the same row, but from the first row with the same order number.
For example, the -296.0 would be under time 106, not 111, because 106 was the first time under Order no. 1. How would we do something like that?
This is my code so far:
data = pd.read_csv(filename)
df = pd.DataFrame(data, columns = ['Order','Time','Profit']).astype(str)
#turns time column into hours of week
df['Time'] = df['Time'].apply(lambda x: findHourOfWeek(x))
df['Profit'] = df['Profit'].astype(float)
Assuming the structure we see in the sample of your data holds over the entire data set, i.e. there is only one Profit value per Order, you can do it like this: group the DataFrame by Order and aggregate by taking the minimum:
df_grouped = df.groupby(by='Order').min()
resulting in this DataFrame:
Time Profit
Order
1 106 -296.0
2 14 -296.0
3 62 NaN
...
106 32 -297.6
107 44 138.0
108 58 -303.4
Then you can sort by Time and do the plot:
import matplotlib.pyplot as plt
df_grouped.sort_values(by='Time', inplace=True)
plt.plot(df_grouped['Time'], df_grouped['Profit'])
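Since you mentioned bars, the same grouped data can go straight into a bar chart (a sketch, reusing df_grouped from above):
plt.bar(df_grouped['Time'], df_grouped['Profit'])
plt.xlabel('Time')
plt.ylabel('Profit')
plt.show()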
If you rather want to rely on position in the data table you can also do this:
plot_df = pd.DataFrame()
plot_df["Order"] = df.Order.unique()
plot_df["Profit"] = list(df.groupby("Order").nth(-1)["Profit"])
plot_df["Time"] = list(df.groupby("Order").nth(0)["Time"])
However, if you want the minimum value for time, you'd better use the solution provided by Arne, since it is safer and more correct (provided that you only have one profit value for each order number).

Result of math.log in Python pandas DataFrame is integer

I have a DataFrame in which all values are integers:
Millage UsedMonth PowerPS
1 261500 269 101
3 320000 211 125
8 230000 253 101
9 227990 255 125
13 256000 240 125
14 153000 242 150
17 142500 215 101
19 220000 268 125
21 202704 260 101
22 350000 246 101
25 331000 230 125
26 250000 226 125
I would like to calculate log(Millage), so I used this code:
x_trans=copy.deepcopy(x)
x_trans=x_trans.reset_index(drop=True)
x_trans.astype(float)
import math
for n in range(0, len(x_trans.Millage)):
    x_trans.Millage[n] = math.log(x_trans.Millage[n])
    x_trans.UsedMonth[n] = math.log(x_trans.UsedMonth[n])
I got all integer values:
Millage UsedMonth PowerPS
0 12 5 101
1 12 5 125
2 12 5 101
3 12 5 125
4 12 5 125
5 11 5 150
It's Python 3, in a Jupyter notebook.
I tried math.log(100) and got 4.605170185988092, so I think the reason could be the DataFrame's data type.
How could I get the log() result as a float?
Thanks
One solution would be to simply do
x_trans['Millage'] = np.log(x_trans['Millage'])
Conversion to astype(float) is not an in-place operation. Assign back to your dataframe and you will find your log series will be of type float:
x_trans = x_trans.astype(float)
But, in this case, math.log is inefficient. Instead, you can use vectorised functionality via NumPy:
x_trans['Millage'] = np.log(x_trans['Millage'])
x_trans['UsedMonth'] = np.log(x_trans['UsedMonth'])
With this solution, you do not need to explicitly convert your dataframe to float.
In addition, note that deep copying is native in Pandas, e.g. x_trans = x.copy(deep=True).
First off, I strongly recommend using the NumPy library for these kinds of mathematical operations; it is faster and its output is easy to use alongside pandas, since pandas is built on top of NumPy.
Now, given how you created your DataFrame, it automatically assumed your data type is integer. Try defining it as float when you create the DataFrame by adding the parameter dtype=float, or better, if you are using NumPy (import numpy as np), dtype=np.float64.
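For example, a sketch of setting the dtype at construction time (column names taken from the question; data stands for whatever source you load from):
import numpy as np
import pandas as pd

# 'data' is a placeholder for your actual source
x = pd.DataFrame(data, columns=['Millage', 'UsedMonth', 'PowerPS'], dtype=np.float64)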

Pandas: Iterate over rows and find frequency of occurrences

I have a dataframe with 2 columns and 3000 rows.
The first column represents time in time-steps: the first row is 0, the second is 1, ..., the last one is 2999.
The second column represents pressure. The pressure changes as we iterate over the rows, but shows repetitive behaviour: every few steps it drops to its minimum value (which is 375), then goes up again, then hits 375 again, etc.
What I want to do in Python is iterate over the rows and:
1) find at which time-steps the pressure is at its minimum;
2) find the frequency between the minimum values.
import numpy as np
import pandas as pd

df = pd.read_csv('test.csv', delimiter=';')
df.columns = ['Timestamp', 'Pressure']
print(df[['Timestamp', 'Pressure']])
You don't need to iterate row-wise; you can compare the entire column against the min value to mask it, and then use the mask to find the time-step diff.
Data setup:
In [44]:
df = pd.DataFrame({'timestep':np.arange(20), 'value':np.random.randint(375, 400, 20)})
df
Out[44]:
timestep value
0 0 395
1 1 377
2 2 392
3 3 396
4 4 377
5 5 379
6 6 384
7 7 396
8 8 380
9 9 392
10 10 395
11 11 393
12 12 390
13 13 393
14 14 397
15 15 396
16 16 393
17 17 379
18 18 396
19 19 390
Mask the df by comparing the column against the min value:
In [45]:
df[df['value']==df['value'].min()]
Out[45]:
timestep value
1 1 377
4 4 377
We can use the mask with loc to find the corresponding 'timestep' value and use diff to find the interval differences:
In [48]:
df.loc[df['value']==df['value'].min(),'timestep'].diff()
Out[48]:
1 NaN
4 3.0
Name: timestep, dtype: float64
You can then convert these intervals into whatever frequency unit you desire, e.g. divide by 60 to express an interval measured in seconds in minutes.
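Pulling it together on the question's own columns (a sketch, assuming the 'Timestamp' and 'Pressure' names from your code):
min_mask = df['Pressure'] == df['Pressure'].min()  # rows at the minimum (375)
min_times = df.loc[min_mask, 'Timestamp']          # time-steps of the minima
intervals = min_times.diff().dropna()              # gaps between consecutive minima
mean_period = intervals.mean()                     # average time-steps per cycle
frequency = 1 / mean_period                        # minima per time-step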
