How to group data by 6-month periods in Python

I have the following dataframe and I want to get the sum of the Revenue per 6-month period. I can extract the quarter, month, and year from the date, but I am unable to do it for 6-month periods.
| date | Revenue |
|-----------|---------|
| 1/2/2017 | 200 |
| 2/2/2017 | 300 |
| 3/2/2017 | 100 |
| 4/2/2017 | 100 |
| 5/23/2017 | 200 |
| 6/20/2017 | 300 |
| 7/22/2017 | 400 |
| 8/21/2017 | 800 |
| 9/21/2017 | 500 |
| 10/21/2017| 500 |
| 11/21/2017| 500 |
| 12/21/2017| 500 |

You can use resample.
df['date'] = pd.to_datetime(df['date'])
df.resample('6M', on='date').sum().reset_index()
#output
date Revenue
0 2017-01-31 200
1 2017-07-31 1400
2 2018-01-31 2800

Use pandas.Grouper:
df['date'] = pd.to_datetime(df['date'])
dfg = df.groupby(pd.Grouper(key='date', freq='6M')).sum().reset_index()
date Revenue
0 2017-01-31 200
1 2017-07-31 1400
2 2018-01-31 2800

You could do
df['date'] = pd.to_datetime(df['date'])
df['year_half'] = df.date.dt.month <= 6
df.groupby([df.year_half, df.date.dt.year])['Revenue'].sum()
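For reference, here is a self-contained run of the half-year approach on the sample data above (a sketch of my own, not from the thread; the boolean is True for January-June and False for July-December):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['1/2/2017', '2/2/2017', '3/2/2017', '4/2/2017', '5/23/2017',
             '6/20/2017', '7/22/2017', '8/21/2017', '9/21/2017',
             '10/21/2017', '11/21/2017', '12/21/2017'],
    'Revenue': [200, 300, 100, 100, 200, 300, 400, 800, 500, 500, 500, 500],
})
df['date'] = pd.to_datetime(df['date'])  # month/day/year, pandas default

# True for January-June, False for July-December
first_half = (df['date'].dt.month <= 6).rename('first_half')
year = df['date'].dt.year.rename('year')
out = df.groupby([year, first_half])['Revenue'].sum()
print(out)
# first half of 2017: 1200, second half: 3200
```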

Related

Extract utc format for datetime object in a new Python column

Consider the following pandas DataFrame:
| ID | date |
|--------------|---------------------------------------|
| 0 | 2022-03-02 18:00:20+01:00 |
| 0 | 2022-03-12 17:08:30+01:00 |
| 1 | 2022-04-23 12:11:50+01:00 |
| 1 | 2022-04-04 10:15:11+01:00 |
| 2 | 2022-04-07 08:24:19+02:00 |
| 3 | 2022-04-11 02:33:22+02:00 |
I want to separate the date column into two columns, one for the date in the format "yyyy-mm-dd" and one for the time in the format "hh:mm:ss+tmz".
That is, I want to get the following resulting DataFrame:
| ID | date_only | time_only |
|--------------|-------------------------|----------------|
| 0 | 2022-03-02 | 18:00:20+01:00 |
| 0 | 2022-03-12 | 17:08:30+01:00 |
| 1 | 2022-04-23 | 12:11:50+01:00 |
| 1 | 2022-04-04 | 10:15:11+01:00 |
| 2 | 2022-04-07 | 08:24:19+02:00 |
| 3 | 2022-04-11 | 02:33:22+02:00 |
Right now I am using the following code, but it does not return the time with the UTC offset (+hh:mm).
df['date_only'] = df['date'].apply(lambda a: a.date())
df['time_only'] = df['date'].apply(lambda a: a.time())
| ID | date_only |time_only |
|--------------|-------------------------|----------|
| 0 | 2022-03-02 | 18:00:20 |
| 0 | 2022-03-12 | 17:08:30 |
| ... | ... | ... |
| 3 | 2022-04-11 | 02:33:22 |
I hope you can help me, thank you in advance.
Convert the column to datetimes, then extract the dates with Series.dt.date and the times with their timezone offsets with Series.dt.strftime:
df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
df['time_only'] = df['date'].dt.strftime('%H:%M:%S%z')
Or convert the values to strings, split on the space, and select the second part:
df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
df['time_only'] = df['date'].astype(str).str.split().str[1]
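As a quick check of both approaches (my own sketch; note that strftime's %z prints the offset without a colon, e.g. +0100, while the string-split variant keeps the original +01:00 form):

```python
import pandas as pd

# two timestamps with the same offset, taken from the question
df = pd.DataFrame({'date': ['2022-03-02 18:00:20+01:00',
                            '2022-03-12 17:08:30+01:00']})
df['date'] = pd.to_datetime(df['date'])

# strftime: %z emits the offset without a colon
strftime_time = df['date'].dt.strftime('%H:%M:%S%z')
# string split: preserves the +01:00 form
split_time = df['date'].astype(str).str.split().str[1]

print(strftime_time.iloc[0])  # 18:00:20+0100
print(split_time.iloc[0])     # 18:00:20+01:00
```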

Groupby 2 columns and find .min of multiple other columns (python pandas)

My data frame looks like this:
| Months | Places | Sales_X | Sales_Y | Sales_Z |
|--------|--------|---------|---------|---------|
| month1 | Place1 | 10000   | 12000   | 13000   |
| month1 | Place2 | 300     | 200     | 1000    |
| month1 | Place3 | 350     | 1000    | 1200    |
| month2 | Place2 | 1400    | 12300   | 14000   |
| month2 | Place3 | 9000    | 8500    | 150     |
| month2 | Place1 | 90      | 4000    | 3000    |
| month3 | Place2 | 12350   | 8590    | 4000    |
| month3 | Place1 | 4500    | 7020    | 8800    |
| month3 | Place3 | 351     | 6500    | 4567    |
I need to find the highest number from the three sales columns by month and show the name of the place with the highest number.
I have been trying to solve it by using pandas.DataFrame.idxmax and groupby but it does not seem to work.
I created a new df with the highest number/row which may help
| Months | Places | Highest_sales |
|--------|--------|---------------|
| month1 | Place1 | 10000         |
| month1 | Place2 | 200           |
| month1 | Place3 | 350           |
| month2 | Place2 | 1400          |
| month2 | Place3 | 150           |
| month2 | Place1 | 90            |
| month3 | Place2 | 4000          |
| month3 | Place1 | 4500          |
| month3 | Place3 | 351           |
Now I just need the highest number per month and the name of the place. When I group by the two columns and take the max of Highest_sales,
df.groupby(['Months', 'Places'])['Highest_sales'].max()
I get this:
Months Places Highest Sales
1 Place1 1549.0
Place2 2214.0
Place3 2074.0
...
12 Place1 1500.0
Place2 8090.0
Place3 2074.0
the format I am looking for would be
| Months  | Places                    | Highest Sales |
|---------|---------------------------|---------------|
| Month1  | Place(*of highest sales*) | 100000        |
| Month2  | Place(*of highest sales*) | 900000        |
| Month3  | Place(*of highest sales*) | 3232000       |
| Month4  | Place(*of highest sales*) | 1300833       |
| ...     |                           |               |
| Month12 | Place(*of highest sales*) |               |
12 rows and 3 columns
Use DataFrame.filter to select the Sales columns, create the Highest column, and then aggregate with DataFrameGroupBy.idxmax per Months; finally select rows and columns by a list in DataFrame.loc:
#columns with substring Sales
df1 = df.filter(like='Sales')
#or all columns from the third position
#df1 = df.iloc[:, 2:]
#row-wise value, matching the question's helper table
df['Highest'] = df1.min(axis=1)
df = df.loc[df.groupby('Months')['Highest'].idxmax(), ['Months','Places','Highest']]
print (df)
Months Places Highest
0 month1 Place1 10000
3 month2 Place2 1400
7 month3 Place1 4500
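For completeness, a self-contained version of this answer (my own assembly of the question's data; the row-wise figure follows the question's helper table, i.e. the minimum of the three Sales columns):

```python
import pandas as pd

df = pd.DataFrame({
    'Months': ['month1', 'month1', 'month1', 'month2', 'month2', 'month2',
               'month3', 'month3', 'month3'],
    'Places': ['Place1', 'Place2', 'Place3', 'Place2', 'Place3', 'Place1',
               'Place2', 'Place1', 'Place3'],
    'Sales_X': [10000, 300, 350, 1400, 9000, 90, 12350, 4500, 351],
    'Sales_Y': [12000, 200, 1000, 12300, 8500, 4000, 8590, 7020, 6500],
    'Sales_Z': [13000, 1000, 1200, 14000, 150, 3000, 4000, 8800, 4567],
})

# per-row minimum across the Sales columns, as in the helper table
df['Highest'] = df.filter(like='Sales').min(axis=1)
# keep, per month, the row where that value is largest
out = df.loc[df.groupby('Months')['Highest'].idxmax(),
             ['Months', 'Places', 'Highest']]
print(out)
```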

Calculating the range of column value datewise through Python

I want to calculate the maximum difference of product_mrp according to the dates.
For that I was trying to group by date, but I was not able to get further than that.
INPUT:
+-------------+--------------------+
| product_mrp | order_date |
+-------------+--------------------+
| 142 | 01-12-2019 |
| 20 | 01-12-2019 |
| 20 | 01-12-2019 |
| 120 | 01-12-2019 |
| 30 | 03-12-2019 |
| 20 | 03-12-2019 |
| 45 | 03-12-2019 |
| 215 | 03-12-2019 |
| 15 | 03-12-2019 |
| 25 | 07-12-2019 |
| 5 | 07-12-2019 |
+-------------+--------------------+
EXPECTED OUTPUT:
+-------------+--------------------+
| product_mrp | order_date |
+-------------+--------------------+
| 122 | 01-12-2019 |
| 200 | 03-12-2019 |
| 20 | 07-12-2019 |
+-------------+--------------------+
You can use groupby as you said, together with max, min, and reset_index:
gr = df.groupby('order_date')['product_mrp']
df_ = (gr.max()-gr.min()).reset_index()
print (df_)
order_date product_mrp
0 01-12-2019 122
1 03-12-2019 200
2 07-12-2019 20
Use pandas to load the data, then use groupby to group by the shared index:
import pandas as pd
dates = ['01-12-2019']*4 + ['03-12-2019']*5 + ['07-12-2019']*2
data = [142,20,20,120,30,20,45,215,15,25,5]
df = pd.DataFrame(data, columns=['product_mrp'])
df.index = pd.to_datetime(dates, dayfirst=True)  # day-first dates
grouped = df.groupby(df.index).apply(lambda x: x.max() - x.min())
Output:
            product_mrp
2019-12-01          122
2019-12-03          200
2019-12-07           20
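The same per-date range can also be computed in a single agg call; a small sketch of my own, using the question's input:

```python
import pandas as pd

df = pd.DataFrame({
    'product_mrp': [142, 20, 20, 120, 30, 20, 45, 215, 15, 25, 5],
    'order_date': ['01-12-2019'] * 4 + ['03-12-2019'] * 5 + ['07-12-2019'] * 2,
})

# peak-to-peak range (max - min) per date in one aggregation
out = (df.groupby('order_date')['product_mrp']
         .agg(lambda s: s.max() - s.min())
         .reset_index())
print(out)
```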

Difference of sum of consecutive years pandas

Suppose I have this pandas DataFrame df
Date | Year | Value
2017-01-01 | 2017 | 20
2017-01-12 | 2017 | 40
2018-01-12 | 2018 | 150
2019-10-10 | 2019 | 300
I want to calculate the difference between the total sum of Value per year between consecutive years. To get the total sum of Value per year I can do
df['YearlyValue'] = df.groupby('Year')['Value'].transform('sum')
which gives me
Date | Year | Value | YearlyValue
2017-01-01 | 2017 | 20 | 60
2017-01-12 | 2017 | 40 | 60
2018-01-12 | 2018 | 150 | 150
2019-10-10 | 2019 | 300 | 300
but how can I get a new column 'Increment' has difference between YearlyValue of consecutive years?
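This question is left open in the thread; one possible approach (my own sketch, not from the thread, assuming all years in the data are consecutive) is to diff the per-year sums and map the result back by Year:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2017-01-01', '2017-01-12', '2018-01-12', '2019-10-10'],
    'Year': [2017, 2017, 2018, 2019],
    'Value': [20, 40, 150, 300],
})

yearly = df.groupby('Year')['Value'].sum()   # 2017: 60, 2018: 150, 2019: 300
increment = yearly.diff()                    # NaN, 90, 150
df['YearlyValue'] = df['Year'].map(yearly)
df['Increment'] = df['Year'].map(increment)
print(df)
```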

Pandas, create new column based on values from previuos rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like: [ id | year | month | product_id | sales ]. I would like to add prev_month_mean_sale and prev_month_id_sale columns:
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
----------------------------------------------------------------------
1 | 2018 | 1 | 123 | 5 | NaN | NaN
2 | 2018 | 1 | 234 | 4 | NaN | NaN
3 | 2018 | 1 | 345 | 2 | NaN | NaN
4 | 2018 | 2 | 123 | 3 | 3.6 | 5
5 | 2018 | 2 | 345 | 2 | 3.6 | 2
6 | 2018 | 3 | 123 | 4 | 2.5 | 3
7 | 2018 | 3 | 234 | 6 | 2.5 | 0
8 | 2018 | 3 | 567 | 7 | 2.5 | 0
9 | 2019 | 1 | 234 | 4 | 5.6 | 6
10 | 2019 | 1 | 567 | 3 | 5.6 | 7
also I would like to add prev_year_mean_sale and prev_year_id_sale
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3
My actual code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use DataFrame.merge():
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales": "prev_month_id_sale"}),
                  how="left",
                  left_on=["year", "prev_month", "product_id"],
                  right_on=["year", "month", "product_id"])
The result will have more columns than you need; you should drop() some of them and/or rename() others.
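Putting the merge idea together, here is a runnable sketch (my own assembly, using the question's column names and sample values; the December-to-January year boundary is not handled):

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2018] * 5,
    'month': [1, 1, 1, 2, 2],
    'product_id': [123, 234, 345, 123, 345],
    'sales': [5, 4, 2, 3, 2],
})

df['prev_month'] = df['month'] - 1  # year boundary not handled

# mean sales of the previous month over all products
month_mean = (df.groupby(['year', 'month'])['sales'].mean()
                .rename('prev_month_mean_sale').reset_index()
                .rename(columns={'month': 'prev_month'}))
df = df.merge(month_mean, how='left', on=['year', 'prev_month'])

# the same product's sales in the previous month
prev = (df[['year', 'month', 'product_id', 'sales']]
          .rename(columns={'month': 'prev_month',
                           'sales': 'prev_month_id_sale'}))
df = (df.merge(prev, how='left', on=['year', 'prev_month', 'product_id'])
        .drop(columns='prev_month'))
print(df)
```

Month-1 rows get NaN in both new columns, since there is no earlier month to look up.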
