So I have a data frame similar to the one below, where Stamp is a datetime index.
For context, it represents orders received, and my goal is to match orders that may be the same but have come in as two separate orders.
|Stamp|Price|indicator|EX|qty|
|-----|-----|---------|--|---|
|1234|10|1|d|12|
|2345|30|-1|d|13|
I want to group entries that have the same datetime stamp, given that those entries also have the same EX and indicator.
I think I know how to do this with just the stamp; however, I'm unsure how to add the conditions on EX and indicator to the groupby.
Beginner here so any help is greatly appreciated!
Try this:
df.groupby(["Stamp", "EX", "indicator"])
And if you then want to get the sum of quantities and prices, you can do this:
df.groupby(["Stamp", "EX", "indicator"]).sum()
You can group by more than one column: df.groupby(['Stamp', 'EX'])
Then you can check the length of each group to see if there are multiple rows that share both columns:
df.groupby(['Stamp', 'EX']).apply(len)
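As a rough sketch (assuming the column names from the question), you could then keep only the rows whose (Stamp, EX) combination appears more than once, i.e. the candidate duplicate orders:

# Size of each (Stamp, EX) group, broadcast back onto the original rows.
group_sizes = df.groupby(['Stamp', 'EX'])['qty'].transform('size')

# Rows that share their Stamp and EX with at least one other row.
possible_duplicates = df[group_sizes > 1]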
I have a dataframe new_df which has a list of customer IDs, dates, and a customer segment for each day. Customer segment can take multiple values. I am looking to identify the customers whose segment has changed more than twice in the past 15 days.
Currently, I am using the following to check how many times each segment appears for each customer id.
segment_count = new_df.groupby(new_df['customer_id'].ne(new_df['customer_id'].shift()).cumsum())['segment'].value_counts()
My thinking is that if a customer has more than 2 segments with a count of >1, then they must have migrated from one segment to another at least twice. Two sample customers may look like this:
|customer_id|day|segment|
|-----------|---|-------|
|12345|'2021-01-01'|G|
|12345|'2021-01-02'|G|
|12345|'2021-01-03'|M|
|12345|'2021-01-04'|G|
|12345|'2021-01-05'|M|
|12345|'2021-01-06'|M|
|6789|'2021-01-01'|G|
|6789|'2021-01-02'|G|
|6789|'2021-01-03'|G|
|6789|'2021-01-04'|G|
|6789|'2021-01-05'|G|
|6789|'2021-01-06'|M|
As an output, I would want to return the following:
|customer_id|migration_count|
|-----------|---------------|
|12345|3|
|6789|1|
Anyone have any advice on the best way to tackle this, or are there any built-in functions I can use to simplify it? Thanks!
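One hedged sketch of such a count (column names follow the table above; the 15-day window is ignored here and would still need a date filter): within each customer, compare the segment with the previous day's segment and count the changes.

import pandas as pd

# Toy data matching the two sample customers above.
new_df = pd.DataFrame({
    'customer_id': [12345] * 6 + [6789] * 6,
    'day': pd.to_datetime(['2021-01-0%d' % d for d in range(1, 7)] * 2),
    'segment': ['G', 'G', 'M', 'G', 'M', 'M', 'G', 'G', 'G', 'G', 'G', 'M'],
})

# For each customer, count the rows where the segment differs from the
# previous day's segment (the first row of each customer is skipped).
migrations = (
    new_df.sort_values(['customer_id', 'day'])
          .groupby('customer_id')['segment']
          .apply(lambda s: int((s != s.shift()).iloc[1:].sum()))
          .reset_index(name='migration_count')
)

# Customers with more than two migrations in the window.
frequent_movers = migrations[migrations['migration_count'] > 2]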
From the dataframe below:
I would like to group the column 'datum' by date (01-01-2019 and so on) and get the average of the column 'PM10_gemiddelde' at the same time.
Right now every date such as 01-01-2019 appears 24 times (on an hourly basis), and I need those rows combined into one with the average of 'PM10_gemiddelde'. See the picture for the data.
Besides that, 'PM10_gemiddelde' also contains negative values. How can I easily remove that data in Python?
Thank you!
PS: I'm new to Python.
What you are trying to do can be achieved by:
data[['datum','PM10_gemiddelde']].loc[data['PM10_gemiddelde'] > 0 ].groupby(['datum']).mean()
You can create a new column with the average of PM10_gemiddelde using groupby along with transform. Try the following:
Assuming your dataframe is called df, start first by removing the negative data:
new_df = df[df['PM10_gemiddelde'] > 0].copy()
Then, you can create a new column that contains the average value for every date:
new_df['avg_col'] = new_df.groupby('datum')['PM10_gemiddelde'].transform('mean')
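For reference, a self-contained sketch that combines both steps, the negative-value filter and the daily mean (the 'datum' values here are invented):

import pandas as pd

# Toy hourly readings; real data would have 24 rows per date.
df = pd.DataFrame({
    'datum': ['01-01-2019', '01-01-2019', '01-01-2019', '02-01-2019', '02-01-2019'],
    'PM10_gemiddelde': [12.0, -1.0, 18.0, 20.0, -3.0],
})

# Drop the negative readings, then average per day: one row per 'datum'.
daily_mean = (
    df[df['PM10_gemiddelde'] > 0]
      .groupby('datum', as_index=False)['PM10_gemiddelde']
      .mean()
)
print(daily_mean)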
Dataset
I have a movie dataset with over half a million rows; it looks like the following (with made-up numbers):
|MovieName|Date|Rating|Revenue|
|---------|----|------|-------|
|A|2019-01-15|3|3.4 million|
|B|2019-02-03|3|1.2 million|
|...|...|...|...|
Objective
Select movies that are released "close enough" in terms of date (for example, the release-date difference between movie A and movie B is less than a month) and see how, when the rating is the same, the revenue can differ.
Question
I know I could write a double loop to achieve this goal. However, I doubt this is the right/efficient way to do it, because:
Some posts (see the comment by cs95 on the question) suggest that iterating over a dataframe is an "anti-pattern" and therefore not advisable.
The dataset has over half a million rows, so I am not sure a double loop would be efficient.
Could someone provide some pointers on this? Thank you in advance.
In general, it is true that you should try to avoid loops when working with pandas. My idea is not ideal, but it might point you in the right direction:
Retrieve the month and year from the date column in every row to create new columns "month" and "year". You can see how to do it here.
Afterwards, group your dataframe by month and year (grouped_df = df.groupby(by=["month", "year"])); the resulting groups are dataframes with movies from the same month and year. Now it's up to you what further analysis you want to perform: for example the mean (grouped_df = df.groupby(by=["month", "year"]).mean()), the standard deviation, or something more fancy with the apply() function.
You can also extract weeks if you want a period shorter than a month.
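A rough sketch of that approach (column names follow the question's table; grouping by calendar month and additionally by Rating, to compare same-rated movies, is my own assumption):

import pandas as pd

# Toy data shaped like the question's table (revenue in millions, values made up).
movies = pd.DataFrame({
    'MovieName': ['A', 'B', 'C'],
    'Date': pd.to_datetime(['2019-01-15', '2019-02-03', '2019-02-20']),
    'Rating': [3, 3, 4],
    'Revenue': [3.4, 1.2, 5.0],
})

# New "month" and "year" columns derived from the release date.
movies['month'] = movies['Date'].dt.month
movies['year'] = movies['Date'].dt.year

# Movies released in the same month/year with the same rating land in one
# group; the spread of Revenue within a group is what you would inspect.
revenue_spread = movies.groupby(['year', 'month', 'Rating'])['Revenue'].agg(['mean', 'std', 'count'])
print(revenue_spread)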
I have 2 DataFrames:
df_Billed = pd.DataFrame({'Bill_Number': [220119, 220120, 220219, 220219, 220419, 220519, 220619, 221219], 'Date': ['1/31/2019', '2/20/2020', '2/28/2019', '6/30/2019', '6/30/2019', '6/30/2019', '6/30/2019', '12/31/2019'], 'Amount': [3312.5, 832.0, 10000.0, -3312.5, 8725.0, 1862.5, 3637.5, 1587.5]})
df_Received = pd.DataFrame({'Bill_Number': [220119, 220219, 220419, 220519, 220619], 'Date': ['4/16/2019', '5/21/2019', '8/2/2019', '8/2/2019', '8/2/2019'], 'Amount': [3312.5, 6687.5, 8725, 1862.5, 3637.5]})
I am trying to search for each "Bill_Number" in df_Billed to see if it is present in df_Received. Ideally, if it is present, I would like to take the difference between the dates in df_Billed and df_Received for that particular bill number (to see how many days it took to get paid). If the bill number is not present in df_Received, I would like to simply return all rows for that bill number from df_Billed.
EX: Since df_Billed Bill_Number 220119 is in df_Received, it would return 75 (the number of days it took for the bill to be paid: 4/16/2019 - 1/31/2019).
EX: Since df_Billed Bill_Number 221219 is not in df_Received, it would return 12/31/2019 (the date it was billed).
You could probably use merge on Bill_Number initially
df_Billed=df_Billed.merge(df_Received,on='Bill_Number',how='left')
Then use apply and pandas.to_datetime to compute the difference between the dates:
df_Billed['result'] = df_Billed.apply(
    lambda x: x.Date_x if pd.isnull(x.Date_y)
    else abs(pd.to_datetime(x.Date_x) - pd.to_datetime(x.Date_y)).days,
    axis=1)
And finally, I think you want to create a new column for the final result, so I'm renaming the merged columns Date_x and Amount_x back to Date and Amount below:
df_Billed.drop(['Date_y','Amount_y'],axis=1,inplace=True)
df_Billed.rename(columns={"Date_x": "Date","Amount_x":"Amount"},inplace=True)
Final Dataframe:
I want to modify this SO topic here to a three-hour interval.
I have a database of events at minute resolution. I need to group them into three-hour bins and extract the count of each bin.
The output would ideally look something like the following table:
|3-hourly|count|
|--------|-----|
|0|10|
|3|3|
|6|5|
|9|2|
|...|...|
You haven't provided much detail, but you can use pd.Grouper (the older pd.TimeGrouper was removed in newer pandas versions):
df.groupby(pd.Grouper(key='your_time_column', freq='3H')).count()
The key parameter is optional if your timestamps are the index.
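For illustration, a minimal sketch of the same idea (the column name and the event timestamps are invented):

import pandas as pd

# Toy minute-resolution events.
events = pd.DataFrame({
    'your_time_column': pd.to_datetime([
        '2021-01-01 00:05', '2021-01-01 01:30', '2021-01-01 02:59',
        '2021-01-01 03:10', '2021-01-01 07:45', '2021-01-01 08:00',
    ]),
    'event': list('abcdef'),
})

# Number of events in each three-hour bin.
counts = events.groupby(pd.Grouper(key='your_time_column', freq='3H'))['event'].count()
print(counts)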