pandas groupby single column but add contents of multiple columns - python

I have a dataframe like this
         Time   Buy  Sell     Bin
0    09:15:01  3200     0  3573.0
1    09:15:01     0  4550  3562.0
2    09:15:01  4250     0  3565.0
3    09:15:01     0  5150  3562.0
4    09:15:01  1200     0  3563.0
..        ...   ...   ...     ...
292  09:15:01   375     0  3564.0
293  09:15:01   175     0  3564.0
294  09:15:01     0    25  3564.0
295  09:15:01   400     0  3564.0
(Disregard 'Time'; it currently just holds a static value.)
What would be the most efficient way to sum up all the Buys and Sells within each Bin and remove the duplicates?
Currently I'm using:
Step 1.
final_df1['Buy'] = final_df1.groupby(final_df1['Bin'])['Buy'].transform('sum')
Step 2.
final_df1['Sell'] = final_df1.groupby(final_df1['Bin'])['Sell'].transform('sum')
Step 3.
## remove duplicates
final_df1 = final_df1.groupby('Bin', as_index=False).max()
Using agg, sum or cumsum just removed all the other columns from the resulting df.
Ideally there should be distinct Bins with the sum of Buy and/or Sell.
The output must be:
         Time   Buy  Sell     Bin
0    09:15:01  3200     0  3573.0
1    09:15:01   450  4550  3562.0
2    09:15:01  4250  3625  3565.0
292  09:15:01   950    25  3564.0

This can also be achieved by using pivot_table from pandas.
Here is a simple recreated example for your case:
import numpy as np
import pandas as pd

df = pd.DataFrame({'buy':  [1, 2, 3, 0, 3, 2, 1],
                   'sell': [2, 3, 4, 0, 5, 4, 3],
                   'bin':  [1, 2, 1, 2, 1, 2, 1],
                   'time': [1, 1, 1, 1, 1, 1, 1]})
df_output = df.pivot_table(columns='bin', values=['buy', 'sell'], aggfunc=np.sum)
Output will look like this:
bin    1  2
buy    8  4
sell  14  7
In case you want the output you mentioned, we can take the transpose of the above dataframe:
df_output.T
Or use a groupby on the input dataframe as below:
df.groupby(['bin'])[['sell', 'buy']].sum()
The output is as below:
     sell  buy
bin
1      14    8
2       7    4
If you also need time in the dataframe, we can do it by specifying a separate aggregate function for each column:
df.groupby("bin").agg({ "sell":"sum", "buy":"sum", "time":"first"})
The output is as below:
     sell  buy  time
bin
1      14    8     1
2       7    4     1
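Applied back to the column names in the question (Time, Buy, Sell, Bin), a minimal one-step sketch, assuming final_df1 is the original dataframe, could be:
# assuming final_df1 has the columns Time, Buy, Sell, Bin shown in the question
result = final_df1.groupby('Bin', as_index=False).agg({'Buy': 'sum', 'Sell': 'sum', 'Time': 'first'})
This keeps one row per Bin with the summed Buy and Sell and the first Time, replacing the three-step transform/max approach.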

Related

Pandas returning 0 string as 0%

I'm doing an evaluation of how many stores report back and how long they take (same day (0), 1 day (1), etc.), but when I calculate the percentage of the total, all same-day stores return 0% of the total.
I tried turning the column into object, float and int, but with the same result.
DF['T_days'] = (DF['day included in the server'] - DF['day of sale']).dt.days
creates my T_days column and fills it with the difference in days between the two datetime columns. This works fine. And by:
DF['Percentage'] = (DF['T_days'] / DF['T_days'].sum()) * 100
I get the table below. I know what I should do, but not how to do it.
COD_store  date in server  Date bought  T_days  Percentage
        1      2021-12-03   2021-12-02       1    0.013746
        1      2021-12-03   2021-12-02       1    0.013746
      922      2022-01-27   2022-01-10      17    0.233677
      922      2022-01-27   2022-01-10      17    0.233677
      ...             ...          ...     ...         ...
       65      2022-01-12   2022-01-12       0         0.0
new DF after groupby:
T_DIAS
0 0.000000
1 1.374570
2 0.192440
3 15.793814
7 0.384880
17 82.254296
Name: Percentage, dtype: float64
I know I should divide the count for each number of days by the total number of rows in DF and then group them by days, but my search on how to do this turned up nothing. BTW: I already have a separate DF for those days and percentages.
Expected table:
T_days  Percentage
     0          50
     2          30
     3          10
     4           3
     5           7
DF['T_days'].value_counts(normalize=True) * 100
worked. Afterwards I turned it from a Series into a DataFrame to make it easier to use.
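A minimal sketch of that pattern, assuming DF already has the integer T_days column from above (the variable names are just illustrative):
# percentage of rows per T_days value
pct = DF['T_days'].value_counts(normalize=True) * 100
# turn the Series into a DataFrame shaped like the expected table
pct_df = pct.rename_axis('T_days').reset_index(name='Percentage')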

Python: adding columns to dataframe using calculated values from specific rows

Hi, I am pretty new to Python and would like to start working in it and move away from Excel. My problem is twofold.
The first part is that I have a csv file which looks like this:
row 1:   52.78 52.52 53.2 51.98 53.22 50.85 51.44 52.38 52.21 52.09 51.5 51.92
row 2:   6.89 5.47 5.8 5.89 6.56 5.69 5.48 4.9 6.39 5.12 3.61 4.48
row 3:   156 126 185 363 197 261 417 298 292 150 102 303
row 4:   0 0 0 0 0 0 0 0 0 0 0 0
row 5:   0 3 5 8 0 0 10 0 12 0 13 0
...
row 195: 0 5 5 7 1 2 11 0 12 0 13 0
and it goes on like this until row 195.
I want to create new columns which start on row 4 and use the following formula:
the first column should be
[(row3,column1)*(row4,column1)]+[(row3,column1)*(row101,column1)]
the second column should be
[(row3,column2)*(row4,column2)]+[(row3,column2)*(row101,column2)]
and it goes on like this until row 100 for all 12 columns;
the formula in row 100 for the first column should be
[(row3,column1)*(row98,column1)]+[(row3,column1)*(row195,column1)]
How do I go about doing this in pandas?
The second part of my problem is that I have 365 different files with similar data (the values change per file but the format is the same) and I would like to apply this same formula in all the files.
Appreciate any help I can get
Thanks
You are trying to use pandas like Excel, if I understand correctly.
If your dataset/dataframe is called df and you would like to append a new column, you would write something like:
df['first_col'] = float(df.iloc[2, 0]) * float(df.iloc[3, 0]) + float(df.iloc[2, 0]) * float(df.iloc[100, 0])
... and the same for the other columns. Be aware that Python starts counting from 0. Hence, your row 1 is actually row 0, column 1 is column 0, etc. Hope this helps.
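To extend that idea to every column and to all 365 files, a rough sketch could look like the following. The file pattern, the output filenames, and the assumption that each 1-based row r from 4 to 98 pairs with row r + 97 (as in the two examples above: row 4 with row 101, row 98 with row 195) are mine, not from the question:
import glob
import pandas as pd

# hypothetical location of the 365 csv files; adjust the pattern to your data
for path in glob.glob('data/*.csv'):
    df = pd.read_csv(path, header=None)    # no header row, so rows/columns keep their positions
    row3 = df.iloc[2]                       # 1-based "row 3" (the multiplier row), all 12 columns at once
    # assumed pairing: 1-based rows 4..98 combined with rows 101..195 (offset of 97)
    result = pd.DataFrame([row3 * df.iloc[r] + row3 * df.iloc[r + 97]
                           for r in range(3, 98)])
    result.to_csv(path.replace('.csv', '_result.csv'), index=False, header=False)
Each result row is row3 times one row plus row3 times the paired row, computed across all 12 columns at once instead of cell by cell.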

Count instances of transactions per day pandas data frame

I would like to retrieve a column from a csv file and make it an index in a dataframe. However, I realize that I might need to do another step beforehand.
The csv looks like this;
Date,Step,Order,Price
2011-01-10,Step,BUY,150
2011-01-10,Step,SELL,150
2011-01-13,Step,SELL,150
2011-01-13,Step1,BUY,400
2011-01-26,Step2,BUY,100
If I print the dataframe this is the output:
Date Step Order Price
0 0 Step BUY 150
1 1 Step SELL 150
2 2 Step SELL 150
3 3 Step1 BUY 400
4 4 Step2 BUY 100
However, the output that I would like should tell me how many buys/sells per type of Step I have on each day.
For example, the expected dataframe and output are:
Date Num-Buy-Sell
2011-01-10 2
2011-01-13 2
2011-01-26 1
This is the code I'm using to retrieve the data frame:
num_trasanctions_day = pd.read_csv(orders_file, parse_dates=True, sep=',', dayfirst=True)
num_trasanctions_day['Transactions'] = orders.groupby(['Date', 'Order'])
num_trasanctions_day['Date'] = num_trasanctions_day.index
My first thought was to make Date the index, but I guess I need to calculate how many sells/buys there are per date first.
Error
KeyError: 'Order'
Thanks
Just using value_counts
df.Date.value_counts()
Out[27]:
2011-01-13 2
2011-01-10 2
2011-01-26 1
Name: Date, dtype: int64
Edit: If you want to assign it back, you are looking for transform; also, please adjust your expected output accordingly.
df['Transactions']=df.groupby('Date')['Order'].transform('count')
df
Out[122]:
Date Step Order Price Transactions
0 2011-01-10 Step BUY 150 2
1 2011-01-10 Step SELL 150 2
2 2011-01-13 Step SELL 150 2
3 2011-01-13 Step1 BUY 400 2
4 2011-01-26 Step2 BUY 100 1
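If you also want the counts broken down per Step type on each day, as the question text describes, a small sketch along the same lines (assuming the Date column is parsed properly, e.g. with parse_dates=['Date']) could be:
# count buy/sell rows per day and per Step
per_step = df.groupby(['Date', 'Step'])['Order'].count().reset_index(name='Num-Buy-Sell')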

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer learning Python (+pandas) and I hope I can explain this well enough. I have a large time-series pd dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting the average per hour over several years. However, I run into trouble including the 'Id' variable.
I'm looking to get the mean number of people taking a ticket for each hour, for each day of the week (Mon-Fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']]).mean().reset_index()
However, this does not give the desired result, as the mean values are incorrect.
I hope I have explained this issue clearly. I am looking for the mean per hour per day per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either code-wise or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per Hour, to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
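Applied to the toy dataset from the edited question (columns Date, Id, Dow, Hour, Count), essentially the same idea, written as a quick sketch, would be:
import pandas as pd
# df is the toy dataframe from the question; dayfirst because the dates look like 12/12/2014
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# total tickets per Id per calendar day (keeping that day's Dow and Hour)
daily = df.groupby(['Id', 'Date', 'Dow', 'Hour'], as_index=False)['Count'].sum()
# mean over the days, per Id, Dow and Hour
mean_count = daily.groupby(['Id', 'Dow', 'Hour'], as_index=False)['Count'].mean()
This reproduces the expected means of 4, 1 and 2.5 from the example above.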
You can also use the groupby function on the 'Id' column and then use the resample function with a sum aggregation, i.e. .resample(...).sum().

Column operations on pandas groupby object

I have a dataframe df that looks like this:
id Category Time
1 176 12 00:00:00
2 4956 2 00:00:00
3 583 4 00:00:04
4 9395 2 00:00:24
5 176 12 00:03:23
which is basically a set of ids and the category of item they used at a particular Time. I use df.groupby('id') and then I want to see whether they used the same category or a different one, and assign True or False respectively (or NaN if that was the first item for that particular id). I also filtered the data to remove all the ids with only one Time.
For example one of the groups may look like
id Category Time
1 176 12 00:00:00
2 176 12 00:03:23
3 176 2 00:04:34
4 176 2 00:04:54
5 176 2 00:05:23
and I want to perform an operation to get
id Category Time Transition
1 176 12 00:00:00 NaN
2 176 12 00:03:23 False
3 176 2 00:04:34 True
4 176 2 00:04:54 False
5 176 2 00:05:23 False
I thought about doing an apply of some sort to the Category column after the groupby, but I am having trouble figuring out the right function.
You don't need a groupby here; you just need sort and shift.
df.sort_values(['id', 'Time'], inplace=True)
df['Transition'] = df.Category != df.Category.shift(1)
df.loc[df.id != df.id.shift(1), 'Transition'] = np.nan
I haven't tested this, but it should do the trick.
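A groupby-based alternative, sketched under the assumption that df looks like the sample above, compares each Category with the previous one within the same id:
# sort so that rows within each id are in time order
df = df.sort_values(['id', 'Time'])
# previous Category within the same id (NaN for the first row of each id)
prev = df.groupby('id')['Category'].shift()
# True/False where a previous category exists, NaN for the first row of each id
df['Transition'] = (df['Category'] != prev).where(prev.notna())
On the example group this gives NaN, False, True, False, False, matching the desired output.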
