I would like to retrieve a column from a csv file and make it an index in a dataframe. However, I realize that I might need to do another step beforehand.
The CSV looks like this:
Date,Step,Order,Price
2011-01-10,Step,BUY,150
2011-01-10,Step,SELL,150
2011-01-13,Step,SELL,150
2011-01-13,Step1,BUY,400
2011-01-26,Step2,BUY,100
If I print the dataframe this is the output:
Date Step Order Price
0 0 Step BUY 150
1 1 Step SELL 150
2 2 Step SELL 150
3 3 Step1 BUY 400
4 4 Step2 BUY 100
However, what I want is the number of buys/sells per type of Step that I have on each day.
For example, the expected dataframe and output are:
Date Num-Buy-Sell
2011-01-10 2
2011-01-13 2
2011-01-26 1
This is the code I use to load the dataframe:
num_trasanctions_day = pd.read_csv(orders_file, parse_dates=True, sep=',', dayfirst=True)
num_trasanctions_day['Transactions'] = orders.groupby(['Date', 'Order'])
num_trasanctions_day['Date'] = num_trasanctions_day.index
My first thought was to make date the index, but I guess I need to calculate how many sell/buys are there per date.
Error
KeyError: 'Order'
Thanks
Just use value_counts:
df.Date.value_counts()
Out[27]:
2011-01-13 2
2011-01-10 2
2011-01-26 1
Name: Date, dtype: int64
Edit: If you want to assign the count back to the dataframe, you are looking for transform. Please also update your expected output.
df['Transactions']=df.groupby('Date')['Order'].transform('count')
df
Out[122]:
Date Step Order Price Transactions
0 2011-01-10 Step BUY 150 2
1 2011-01-10 Step SELL 150 2
2 2011-01-13 Step SELL 150 2
3 2011-01-13 Step1 BUY 400 2
4 2011-01-26 Step2 BUY 100 1
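For completeness, here is a minimal, self-contained sketch of the counting step, assuming the CSV is exactly the sample shown in the question (it is read from a string here so the snippet runs on its own; the names csv_text, per_day and per_day_side are only illustrative). Note that the printed frame in the question shows 0..4 in the Date column, which is what assigning the index to 'Date' would produce, so that line is left out.

```python
import io

import pandas as pd

# Sample data copied from the question; in practice this would be orders_file.
csv_text = """Date,Step,Order,Price
2011-01-10,Step,BUY,150
2011-01-10,Step,SELL,150
2011-01-13,Step,SELL,150
2011-01-13,Step1,BUY,400
2011-01-26,Step2,BUY,100
"""

# Parse the Date column explicitly instead of relying on parse_dates=True,
# which only tries to parse the index.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Number of transactions (buys + sells) per day, indexed by Date.
per_day = df.groupby("Date").size().rename("Num-Buy-Sell")

# Same count, split into one column per order type (BUY / SELL).
per_day_side = df.groupby(["Date", "Order"]).size().unstack(fill_value=0)

print(per_day)
print(per_day_side)
```

The first result has Date as its index, which is what the question was originally after; the second one is only needed if buys and sells should be counted separately.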
I have a dataset like this:
Customer ID  Date        Profit
1            4/13/2018   10.00
1            4/26/2018   13.27
1            10/23/2018  15.00
2            1/1/2017    7.39
2            7/5/2017    9.99
2            7/7/2017    10.01
3            5/4/2019    30.30
I'd like to group by customer and sum the profit for every 6 months, starting at each user's first transaction.
The output ideally should look like this:
Customer ID  Date        Profit
1            4/13/2018   23.27
1            10/13/2018  15.00
2            1/1/2017    7.39
2            7/1/2017    20.00
3            5/4/2019    30.30
The closest I've gotten on this problem is by using:
df.groupby(['Customer ID',pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't seem to start summing on a user's first transaction day.
If changing the dates is not possible (e.g. customer 2's date being 7/1/2017 rather than 7/5/2017), then at least summing the profit so that it's based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you the first of the month until you find a more perfect solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
    df
    .set_index("Date")
    .groupby(["Customer ID"])
    .Profit
    .resample("6MS")
    .sum()
    .reset_index(name="Profit")
)
print(df)
Customer ID Date Profit
0 1 2018-04-01 23.27
1 1 2018-10-01 15.00
2 2 2017-01-01 7.39
3 2 2017-07-01 20.00
4 3 2019-05-01 30.30
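If the windows really need to start on each customer's first transaction day rather than on the first of the month, one possible approach is to count whole months since that first date and bucket them in groups of six. This is only a sketch under stated assumptions: the sample frame is rebuilt from the question, the column name "Window" and the helper variables are made up for illustration, and first dates on the 29th to 31st would need extra care because shorter months clip the day.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2, 2, 3],
    "Date": ["4/13/2018", "4/26/2018", "10/23/2018",
             "1/1/2017", "7/5/2017", "7/7/2017", "5/4/2019"],
    "Profit": [10.00, 13.27, 15.00, 7.39, 9.99, 10.01, 30.30],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# First transaction date of each customer, repeated on every row.
first = df.groupby("Customer ID")["Date"].transform("min")

# Whole months elapsed since the first transaction.
months = (df["Date"].dt.year - first.dt.year) * 12 + (df["Date"].dt.month - first.dt.month)
months = months - (df["Date"].dt.day < first.dt.day).astype(int)

# Index of the 6-month window each row falls into (0, 1, 2, ...).
bucket = months // 6

# Start date of that window: first transaction + 6 * bucket months.
df["Window"] = [f + pd.DateOffset(months=6 * int(b)) for f, b in zip(first, bucket)]

out = df.groupby(["Customer ID", "Window"], as_index=False)["Profit"].sum()
print(out)
```

On the sample data this gives window starts of 2018-04-13 and 2018-10-13 for customer 1 and 2017-01-01 and 2017-07-01 for customer 2, which lines up with the output asked for in the question.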
I am trying to aggregate a dataset of purchases; I have shortened the example in this post to keep it simple. The purchases are distinguished by two columns that identify the customer and the transaction. Rows with the same reference belong to the same transaction, while the ID refers to the type of transaction.
I want to sum these records by ID while taking the reference into account, so that the size is not double-counted. The example below should make this clear.
What I tried so far is:
df_new = df.groupby(by = ['id'], as_index=False).agg(aggregate)
df_new = df.groupby(by = ['id','ref'], as_index=False).agg(aggregate)
Let me know if you have any idea what I can do in pandas, or otherwise in Python.
This is basically what I have,
Name  Reference  Side  Size  ID
Alex  0          BUY   2400  0
Alex  0          BUY   2400  0
Alex  0          BUY   2400  0
Alex  1          BUY   3000  0
Alex  1          BUY   3000  0
Alex  1          BUY   3000  0
Alex  2          SELL  4500  1
Alex  2          SELL  4500  1
Sam   3          BUY   1500  2
Sam   3          BUY   1500  2
Sam   3          BUY   1500  2
What I am trying to achieve is the following,
Name  Side  Size  ID
Alex  BUY   5400  0
Alex  SELL  4500  1
Sam   BUY   1500  2
P.S. the records are not duplicates of each other, what I provide is a simplified version, but in reality 'Name' is 20 more columns identifying each row.
P.S. P.S. My solution was to first aggregate by Reference then by ID.
Use drop_duplicates, groupby, and agg:
new_df = df.drop_duplicates().groupby(['Name', 'Side']).agg({'Size': 'sum', 'ID': 'first'}).reset_index()
Output:
>>> new_df
Name Side Size ID
0 Alex BUY 5400 0
1 Alex SELL 4500 1
2 Sam BUY 1500 2
Edit: richardec's solution is better as this will also sum the ID column.
This double groupby should achieve the output you want, as long as names are unique.
df.groupby(['Name', 'Reference']).max().groupby(['Name', 'Side']).sum()
Explanation: First we group by Name and Reference to get the following dataframe. The ".max()" could just as well be ".min()" or ".mean()" as it seems your data will have the same size per unique transaction:
Name  Reference  Side  Size  ID
Alex  0          BUY   2400  0
      1          BUY   3000  0
      2          SELL  4500  1
Sam   3          BUY   1500  2
Then we group this data by Name and Side with a ".sum()" operation to get the final result.
Name  Side  Size  ID
Alex  BUY   5400  0
      SELL  4500  1
Sam   BUY   1500  2
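The two-step idea above (collapse each Reference to one row, then sum per Name and Side) can also be written so that ID is used as a grouping key and never summed, and so that it does not depend on the rows being exact duplicates. A minimal sketch, assuming Size is constant within a Reference as in the sample data; the names per_ref and result are only illustrative:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Name": ["Alex"] * 8 + ["Sam"] * 3,
    "Reference": [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3],
    "Side": ["BUY"] * 6 + ["SELL"] * 2 + ["BUY"] * 3,
    "Size": [2400] * 3 + [3000] * 3 + [4500] * 2 + [1500] * 3,
    "ID": [0] * 6 + [1] * 2 + [2] * 3,
})

# Step 1: one row per transaction (Reference), keeping a single Size value.
per_ref = df.groupby(["Name", "Reference", "Side", "ID"], as_index=False)["Size"].first()

# Step 2: sum the per-transaction sizes; ID is a key here, so it is not summed.
result = per_ref.groupby(["Name", "Side", "ID"], as_index=False)["Size"].sum()
result = result[["Name", "Side", "Size", "ID"]]
print(result)
```

With the real data, the 20 or so identifying columns mentioned in the question would presumably go into both lists of grouping keys in place of Name.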
Just drop the duplicates first and then group by a list of columns.
Something like this should do it (not tested).
I always like to reset the index afterwards, i.e.
df.drop_duplicates().groupby(["Name","Side","ID"])["Size"].sum().reset_index()
or, step by step:
# stops the double counts
df_dropped = df.drop_duplicates()
# groups by all the fields in your example and sums only Size
df_grouped = df_dropped.groupby(["Name","Side","ID"])["Size"].sum()
# turns the three index levels created above back into columns
df_reset = df_grouped.reset_index()
I have a dataframe like this
     Time      Buy   Sell  Bin
0    09:15:01  3200  0     3573.0
1    09:15:01  0     4550  3562.0
2    09:15:01  4250  0     3565.0
3    09:15:01  0     5150  3562.0
4    09:15:01  1200  0     3563.0
..   ...       ...   ...   ...
292  09:15:01  375   0     3564.0
293  09:15:01  175   0     3564.0
294  09:15:01  0     25    3564.0
295  09:15:01  400   0     3564.0
(Disregard 'Time'; it is currently just a static value.)
What would be the most efficient way to sum up all the Buys and Sells within each bin and remove the duplicates?
Currently I'm using:
Step1.
final_df1['Buy'] = final_df1.groupby(final_df1['Bin'])['Buy'].transform('sum')
Step2.
final_df1['Sell'] = final_df1.groupby(final_df1['Bin'])['Sell'].transform('sum')
Step3.
##remove duplicates
final_df1 = final_df1.groupby('Bin', as_index=False).max()
Using agg, sum, or cumsum just removed all the other columns from the resulting df.
Ideally there should be distinct bins, each with the sum of Buy and/or Sell.
The output should be:
     Time      Buy   Sell  Bin
0    09:15:01  3200  0     3573.0
1    09:15:01  450   4550  3562.0
2    09:15:01  4250  3625  3565.0
292  09:15:01  950   25    3564.0
This can also be achieved with pivot_table from pandas.
Here is a simple recreation of your example:
import pandas as pd
df = pd.DataFrame({'buy': [1,2,3,0,3,2,1],
                   'sell': [2,3,4,0,5,4,3],
                   'bin': [1,2,1,2,1,2,1],
                   'time': [1,1,1,1,1,1,1]
                   })
# the string 'sum' is equivalent to np.sum here and does not need numpy
df_output = df.pivot_table(columns='bin', values=['buy','sell'], aggfunc='sum')
Output will look like this:
bin 1 2
buy 8 4
sell 14 7
If you want the layout you described, take the transpose of the output above:
df_output.T
Or apply a groupby to the input dataframe:
df.groupby(['bin'])[['sell', 'buy']].sum()
The output is as below:
sell buy
bin
1 14 8
2 7 4
If you also need time in the dataframe, specify a separate aggregation function for each column:
df.groupby("bin").agg({ "sell":"sum", "buy":"sum", "time":"first"})
The output is as below:
sell buy time
bin
1 14 8 1
2 7 4 1
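Applied to the column names from the question, the same per-column aggregation could look like the sketch below. The rows are made-up sample values, not the asker's data, and the options as_index=False and sort=False are a suggestion for keeping Bin as a regular column and keeping the bins in order of first appearance, as in the desired output.

```python
import pandas as pd

# Hypothetical sample rows using the column names from the question.
final_df1 = pd.DataFrame({
    "Time": ["09:15:01"] * 6,
    "Buy": [3200, 0, 4250, 0, 450, 0],
    "Sell": [0, 4550, 0, 5150, 0, 25],
    "Bin": [3573.0, 3562.0, 3565.0, 3562.0, 3562.0, 3565.0],
})

# One pass: sum Buy and Sell per bin and keep the first Time of each bin.
out = final_df1.groupby("Bin", as_index=False, sort=False).agg(
    Time=("Time", "first"),
    Buy=("Buy", "sum"),
    Sell=("Sell", "sum"),
)

# Restore the original column order.
out = out[["Time", "Buy", "Sell", "Bin"]]
print(out)
```

This replaces the two transform calls and the final groupby(...).max() with a single aggregation, so no intermediate duplicated rows are created.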
I am manipulating some data in Python and was wondering if anyone can help.
I have data that looks like this:
count source timestamp tokens
0 1 alt-right-census 2006-03-21 setting
1 1 alt-right-census 2006-03-21 twttr
2 1 stormfront 2006-06-24 head
3 1 stormfront 2006-10-07 five
and I need data that looks like this:
count_stormfront count_alt-right-census month token
2 1 2006-01 setting
or like this:
date token alt_count storm_count
4069995 2016-09 zealand 0 0
4069996 2016-09 zero 11 8
4069997 2016-09 zika 295 160
How can I aggregate days by year-month and pivot so that count becomes count_source summed over the month?
Any help would be appreciated. Thanks!
df.groupby(['source', df['timestamp'].str[:7]]).size().unstack()
Result:
timestamp 2006-03 2006-06 2006-10
source
alt-right-census 2.0 NaN NaN
stormfront NaN 1.0 1.0
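The one-liner above counts rows per source and month. If the goal is instead to sum the count column per month and token, with one column per source, a pivot_table is one way to sketch it. This assumes timestamp holds ISO date strings as in the sample; the column name "month" and the "count_" prefix are chosen here to mimic the layout asked for in the question.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "count": [1, 1, 1, 1],
    "source": ["alt-right-census", "alt-right-census", "stormfront", "stormfront"],
    "timestamp": ["2006-03-21", "2006-03-21", "2006-06-24", "2006-10-07"],
    "tokens": ["setting", "twttr", "head", "five"],
})

# Year-month label for each row.
df["month"] = pd.to_datetime(df["timestamp"]).dt.strftime("%Y-%m")

# One row per (month, token), one column per source, counts summed.
wide = (
    df.pivot_table(index=["month", "tokens"], columns="source",
                   values="count", aggfunc="sum", fill_value=0)
      .add_prefix("count_")
      .reset_index()
)
wide.columns.name = None
print(wide)
```

Combinations of month and token that a source never produced show up as 0 because of fill_value=0.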
I'm still learning python and would like to ask your help with the following problem:
I have a csv file with daily data and I'm looking for a solution to sum it per calendar weeks. So for the mockup data below I have rows stretched over 2 weeks (week 14 (current week) and week 13 (past week)). Now I need to find a way to group rows per calendar week, recognize what year they belong to and calculate week sum and week average. In the file input example there are only two different IDs. However, in the actual data file I expect many more.
input.csv
id date activeMembers
1 2020-03-30 10
2 2020-03-30 1
1 2020-03-29 5
2 2020-03-29 6
1 2020-03-28 0
2 2020-03-28 15
1 2020-03-27 32
2 2020-03-27 10
1 2020-03-26 9
2 2020-03-26 3
1 2020-03-25 0
2 2020-03-25 0
1 2020-03-24 0
2 2020-03-24 65
1 2020-03-23 22
2 2020-03-23 12
...
desired output.csv
id week WeeklyActiveMembersSum WeeklyAverageActiveMembers
1 202014 10 1.4
2 202014 1 0.1
1 202013 68 9.7
2 202013 111 15.9
my goal is to:
import pandas as pd
df = pd.read_csv('path/to/my/input.csv')
Here I'd need to group by 'id' + 'date' column (per calendar week - not sure if this is possible) and create a 'week' column with the week number, then sum 'activeMembers' values for the particular week, save as 'WeeklyActiveMembersSum' column in my output file and finally calculate 'weeklyAverageActiveMembers' for the particular week. I was experimenting with groupby and isin parameters but no luck so far... would I have to go with something similar to this:
df.groupby('id', as_index=False).agg({'date':'max',
                                      'activeMembers':'sum'})
and finally save all as output.csv:
df.to_csv('path/to/my/output.csv', index=False)
Thanks in advance!
It seems I'm getting a different week numbering than you do (%W counts weeks starting from the first Monday of the year, so 2020-03-30 falls in week 13 here rather than ISO week 14):
# should convert datetime column to datetime type
df['date'] = pd.to_datetime(df['date'])
(df.groupby(['id', df.date.dt.strftime('%Y%W')], sort=False)
   .activeMembers.agg([('Sum','sum'),('Average','mean')])
   .add_prefix('activeMembers')
   .reset_index()
)
Output:
id date activeMembersSum activeMembersAverage
0 1 202013 10 10.000000
1 2 202013 1 1.000000
2 1 202012 68 9.714286
3 2 202012 111 15.857143
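To get the week numbers from the question (202014 and 202013), ISO calendar weeks can be used instead of %W. The sketch below assumes pandas 1.1 or later (for Series.dt.isocalendar) and assumes the desired average is the weekly sum divided by 7 days, which is what the 1.4 and 0.1 in the desired output suggest; the mean over the rows that are present would give 10.0 and 1.0 for the partial week instead.

```python
import pandas as pd

# Sample data rebuilt from the question's input.csv.
dates = pd.date_range("2020-03-30", "2020-03-23", freq="-1D")
df = pd.DataFrame({
    "id": [1, 2] * len(dates),
    "date": dates.repeat(2),
    "activeMembers": [10, 1, 5, 6, 0, 15, 32, 10, 9, 3, 0, 0, 0, 65, 22, 12],
})

# ISO year and week combined into a single number such as 202014.
iso = df["date"].dt.isocalendar()
df["week"] = (iso["year"] * 100 + iso["week"]).astype(int)

# Weekly sum per id; sort=False keeps the order of first appearance.
out = (
    df.groupby(["id", "week"], sort=False)["activeMembers"]
      .sum()
      .reset_index(name="WeeklyActiveMembersSum")
)

# Assumption: average over a full 7-day week, not over the rows present.
out["WeeklyAverageActiveMembers"] = (out["WeeklyActiveMembersSum"] / 7).round(1)
print(out)
```

The result can then be written out with out.to_csv(..., index=False) as in the question.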