I have one dataframe that contains daily sales data (DF).
I have another dataframe that contains quarterly data (DF1).
This is what the quarterly dataframe (DF1) looks like:
Date        Computer Sale  In Person Sales  Net Sales
1/29/2021               1                2          3
4/30/2021               2                4          6
7/29/2021               3                6          9
1/29/2022               4                8         12
5/1/2022                5               10         15
7/30/2022               6               12         18
This is what the daily dataframe (DF) looks like:
Date       Num of people
1/30/2021             45
1/31/2021             35
2/1/2021              25
5/1/2021              20
5/2/2021              15
I have the columns Computer Sale, In Person Sales and Net Sales in the quarterly dataframe.
How do I merge these columns into the daily dataframe so that each daily row shows the quarterly data? I want the final result to look like this:
Date       Num of people  Computer Sale  In Person Sales  Net Sales
1/30/2021             45              1                2          3
1/31/2021             35              1                2          3
2/1/2021              25              1                2          3
5/1/2021              20              2                4          6
5/2/2021              15              2                4          6
For example, I want 1/30/2021 to carry the figures from 1/29/2021, and once the daily data goes past 4/30/2021, the rows should pick up the new quarterly data.
Please let me know if I need to be more specific.
A possible solution:
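import pandas as pd
# here df1 is taken to be the quarterly frame (DF1) and df2 the daily frame (DF)
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
# for each daily row, take the latest quarterly row dated on or before it
pd.merge_asof(df2, df1, on='Date', direction='backward')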
Output:
        Date  Num of people  Computer Sale  In Person Sales  Net Sales
0 2021-01-30             45              1                2          3
1 2021-01-31             35              1                2          3
2 2021-02-01             25              1                2          3
3 2021-05-01             20              2                4          6
4 2021-05-02             15              2                4          6
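For reference, here is a self-contained sketch of the same approach that rebuilds a slice of the sample data from the question. The names quarterly, daily and merged are only illustrative, and the month/day/year date format is an assumption based on the sample values. Note that merge_asof expects both frames to be sorted by the key, so the sketch sorts them explicitly.

```python
import pandas as pd

# Sample data copied from the question (first three quarters only)
quarterly = pd.DataFrame({
    "Date": ["1/29/2021", "4/30/2021", "7/29/2021"],
    "Computer Sale": [1, 2, 3],
    "In Person Sales": [2, 4, 6],
    "Net Sales": [3, 6, 9],
})
daily = pd.DataFrame({
    "Date": ["1/30/2021", "1/31/2021", "2/1/2021", "5/1/2021", "5/2/2021"],
    "Num of people": [45, 35, 25, 20, 15],
})

# Assumption: dates are month/day/year
quarterly["Date"] = pd.to_datetime(quarterly["Date"], format="%m/%d/%Y")
daily["Date"] = pd.to_datetime(daily["Date"], format="%m/%d/%Y")

# merge_asof requires both inputs to be sorted on the key column
merged = pd.merge_asof(
    daily.sort_values("Date"),
    quarterly.sort_values("Date"),
    on="Date",
    direction="backward",  # latest quarterly row on or before each daily date
)
print(merged)
```

A daily row dated before the first quarterly date would get NaN in the quarterly columns, since there is no earlier row to match.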
I'm working on forecasting the demand for a product using many scenarios per year. I have a MultiIndexed dataframe (Simulation, Year, Month) and need to filter by one of the levels (let's say Simulation).
import pandas as pd
idx = pd.MultiIndex.from_tuples([(1,2020,1), (1,2020,2), (2,2020,1), (2,2020,2)],
                                names=['Simulation', 'Year', 'Month'])
d = {'Apples': [1,2,3,4], 'Oranges': [4,5,6,8], 'Lemons': [9,10,11,12]}
df = pd.DataFrame(d, index=idx)
print(df)
                       Apples  Oranges  Lemons
Simulation Year Month
1          2020 1           1        4       9
                2           2        5      10
2          2020 1           3        6      11
                2           4        8      12
How can I filter by Simulation?
Expected output for filtering by simulation number 1 only
                       Apples  Oranges  Lemons
Simulation Year Month
1          2020 1           1        4       9
                2           2        5      10
To select the rows where Simulation is 1, you can use index.get_level_values:
df[df.index.get_level_values(0) == 1]
                       Apples  Oranges  Lemons
Simulation Year Month
1          2020 1          10       30      10
                2          25       50       5
           2030 12         30       70       5
For multiple values, chain isin with a list of the values to keep:
df.loc[df.index.get_level_values(0).isin([1, 2])]
                       Apples  Oranges  Lemons
Simulation Year Month
1          2020 1          10       30      10
                2          25       50       5
           2030 12         30       70       5
2          2020 1          15       25      10
                2          20       50      15
get_level_values returns an Index (an Int64Index in pandas versions before 2.0) holding each row's label at the requested level, here level 0 (Simulation):
df.index.get_level_values(0)
# Int64Index([1, 1, 1, 2, 2, 50], dtype='int64', name='Simulation')
We can then use the result to perform boolean indexing on the dataframe along the axis of interest.
Or you can also use pd.IndexSlice:
df.loc[pd.IndexSlice[[1,2], :, :]]
                       Apples  Oranges  Lemons
Simulation Year Month
1          2020 1          10       30      10
                2          25       50       5
           2030 12         30       70       5
2          2020 1          15       25      10
                2          20       50      15
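As a quick check, the options above can be run against the dataframe from the question. This is a minimal sketch: the variable names by_mask, by_slice and by_xs are illustrative, and DataFrame.xs is a further alternative not mentioned above. The IndexSlice call also passes an explicit `:` for the columns, which avoids any ambiguity about which axis the tuple refers to.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 2020, 1), (1, 2020, 2), (2, 2020, 1), (2, 2020, 2)],
    names=["Simulation", "Year", "Month"],
)
df = pd.DataFrame(
    {"Apples": [1, 2, 3, 4], "Oranges": [4, 5, 6, 8], "Lemons": [9, 10, 11, 12]},
    index=idx,
)

# 1. Boolean mask built from the level values (the level can be given by name)
by_mask = df[df.index.get_level_values("Simulation") == 1]

# 2. IndexSlice: one selector per index level, then ':' for all columns
by_slice = df.loc[pd.IndexSlice[[1], :, :], :]

# 3. Cross-section; drop_level=False keeps the Simulation level in the result
by_xs = df.xs(1, level="Simulation", drop_level=False)

print(by_mask)
```

All three return the two rows for Simulation 1 with the full three-level index intact.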
I have a dataframe (df1) with the values below. I want to sum rows 4 to 9 and put the result in row 3. How can I achieve this? In Excel it would be a simple SUM formula such as =SUM(B9:B14), but what is the equivalent in pandas?
          Detail  Value
0            Day     23
1          Month    Aug
2           Year   2020
3  Total Tickets    NaN
4           Pune      2
5         Mumbai      3
6          Thane     33
7       Kolkatta    NaN
8      Hyderabad    NaN
9         Kerala    283
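One possible approach, sketched under the assumption that Value is a mixed (object) column as the sample suggests: convert the slice to numbers with pd.to_numeric, sum it, and assign the result back with .loc. The variable name total is illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; Value mixes numbers, text and NaN
df1 = pd.DataFrame({
    "Detail": ["Day", "Month", "Year", "Total Tickets", "Pune", "Mumbai",
               "Thane", "Kolkatta", "Hyderabad", "Kerala"],
    "Value": [23, "Aug", 2020, np.nan, 2, 3, 33, np.nan, np.nan, 283],
})

# .loc slices by label and includes both endpoints, so 4:9 covers rows 4..9.
# errors="coerce" turns anything non-numeric into NaN, and sum() skips NaN.
total = pd.to_numeric(df1.loc[4:9, "Value"], errors="coerce").sum()

# Write the result into row 3 (the "Total Tickets" row)
df1.loc[3, "Value"] = total
print(df1)
```

Unlike Excel's B9:B14, the row labels here are the dataframe's own index values, so the range is 4:9 rather than a sheet position.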
df
   price  vol        date
0      2    4  03-04-2020
1      4   24  03-04-2020
2      5   10  03-04-2020
How could I add a column with the average price for rows where vol is above the 75th or below the 25th percentile of volumes for that month?
I tried:
df.loc[df['avg_price_by_day_and_quartile'] = df[(df['avg_price_by_day_and_quartile'] > vol.quantile(.75) & <vol.quantile(.25) ).groupby(date')['quartile'].transform('mean')
Expected Output:
   price  vol        date quartile  avg_price_by_day_and_quartile
0      2    4  03-04-2020    below                              2
1      4   24  03-04-2020    above                              4
2      5   10  03-04-2020
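A possible way to get the expected output, offered as a sketch rather than a definitive answer. It assumes the dates are day-month-year, computes the 25th and 75th percentiles of vol per calendar month, labels each row, and then averages price per date and label. The names month, q25, q75 and mask are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [2, 4, 5],
    "vol": [4, 24, 10],
    "date": ["03-04-2020", "03-04-2020", "03-04-2020"],
})

# Assumption: dates are day-month-year
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")
month = df["date"].dt.to_period("M")

# Per-month percentile thresholds, broadcast back to every row
q25 = df.groupby(month)["vol"].transform(lambda s: s.quantile(0.25))
q75 = df.groupby(month)["vol"].transform(lambda s: s.quantile(0.75))

# Label rows below the 25th or above the 75th percentile; others stay blank
df["quartile"] = np.select(
    [df["vol"] < q25, df["vol"] > q75], ["below", "above"], default=""
)

# Average price per (date, label), only for labelled rows; the rest become NaN
mask = df["quartile"] != ""
df["avg_price_by_day_and_quartile"] = (
    df[mask].groupby(["date", "quartile"])["price"].transform("mean")
)
print(df)
```

With the three sample rows the thresholds are 7 and 17, so vol 4 is labelled below, vol 24 above, and vol 10 is left unlabelled.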
My dataframe is given below:
input_df =
index  Year  Month  Day  Hour  Minute  GHI
0      2017      1    1     7      30  100
1      2017      1    1     8      30  200
2      2017      1    2     9      30  300
3      2017      1    2    10      30  400
4      2017      2    1    11      30  500
5      2017      2    1    12      30  600
6      2017      2    2    13      30  700
I want to sum the GHI data for each day. From the data above I expect an output like this:
result_df =
index  Year  Month  Day   GHI
0      2017      1    1   300
1      2017      1    2   700
2      2017      2    1  1100
3      2017      2    2   700
My code and my present output are:
result_df = input_df.groupby(['Year','Month','Day'])['GHI'].sum()
print(result_df)
result_df =
index  Year  Month  Day   GHI
0      2017      1    1  1400
1      2017      2    2  1400
My code above seems to combine the first day of each month and sum those rows, which is wrong. How can I fix this?
You are very close in your attempt. The thing to bear in mind is that DataFrame.groupby() has a parameter as_index whose default value is True, so the group keys become the index of the result: here, a Series with a MultiIndex of (Year, Month, Day). To get the desired output you can either chain the reset_index() method after the aggregation or pass as_index=False to groupby().
result_df = input_df.groupby(['Year','Month','Day'])['GHI'].sum()
result_df
Out[12]:
Year  Month  Day
2017  1      1       300
             2       700
      2      1      1100
             2       700
Name: GHI, dtype: int64
# Getting the desired output
input_df.groupby(['Year','Month','Day'])['GHI'].sum().reset_index()
Out[16]:
   Year  Month  Day   GHI
0  2017      1    1   300
1  2017      1    2   700
2  2017      2    1  1100
3  2017      2    2   700
input_df.groupby(['Year','Month','Day'], as_index=False)['GHI'].sum()
Out[17]:
   Year  Month  Day   GHI
0  2017      1    1   300
1  2017      1    2   700
2  2017      2    1  1100
3  2017      2    2   700
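To try this without the original data, here is a runnable sketch that rebuilds input_df from the question and applies both variants. The names keys, via_reset and via_flag are illustrative.

```python
import pandas as pd

# Sample data copied from the question
input_df = pd.DataFrame({
    "Year": [2017] * 7,
    "Month": [1, 1, 1, 1, 2, 2, 2],
    "Day": [1, 1, 2, 2, 1, 1, 2],
    "Hour": [7, 8, 9, 10, 11, 12, 13],
    "Minute": [30] * 7,
    "GHI": [100, 200, 300, 400, 500, 600, 700],
})

keys = ["Year", "Month", "Day"]

# Variant 1: aggregate, then move the group keys from the index into columns
via_reset = input_df.groupby(keys)["GHI"].sum().reset_index()

# Variant 2: ask groupby to keep the keys as ordinary columns from the start
via_flag = input_df.groupby(keys, as_index=False)["GHI"].sum()

print(via_reset)
```

Both variants produce the same four-row frame; which one to use is mostly a matter of taste.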
I have a pandas dataframe where the index is the date, from year 2007 to 2017.
I'd like to calculate the mean of each weekday for each year. I am able to group by year:
groups = df.groupby(TimeGrouper('A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
This creates a new dataframe (years) in which each column holds one year of the time series.
If I want to see the statistics of each year (for example, the mean):
print(years.mean())
But now I would like to separate out each day of the week for each year, in order to obtain the mean of each weekday for all of them.
The only approach I know is:
year=df[(df.index.year==2007)]
day_week=df[(df.index.weekday==2)]
The problem with this is that I have to change the day of the week 7 times and then repeat this for 11 years (my time series begins in 2007 and ends in 2017), so I must do it 77 times!
Is there a way to group by year and weekday to make this faster?
It seems you need to group by DatetimeIndex.year together with DatetimeIndex.weekday:
rng = pd.date_range('2017-04-03', periods=10, freq='10M')
df = pd.DataFrame({'a': range(10)}, index=rng)
print (df)
            a
2017-04-30  0
2018-02-28  1
2018-12-31  2
2019-10-31  3
2020-08-31  4
2021-06-30  5
2022-04-30  6
2023-02-28  7
2023-12-31  8
2024-10-31  9
df1 = df.groupby([df.index.year, df.index.weekday]).mean()
print (df1)
        a
2017 6  0
2018 0  2
     2  1
2019 3  3
2020 0  4
2021 2  5
2022 5  6
2023 1  7
     6  8
2024 3  9
df1 = df.groupby([df.index.year, df.index.weekday]).mean().reset_index()
df1 = df1.rename(columns={'level_0':'years','level_1':'weekdays'})
print (df1)
   years  weekdays  a
0   2017         6  0
1   2018         0  2
2   2018         2  1
3   2019         3  3
4   2020         0  4
5   2021         2  5
6   2022         5  6
7   2023         1  7
8   2023         6  8
9   2024         3  9
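Applied to a daily series spanning 2007 to 2017, as described in the question, the same grouping can be sketched as follows. The data here is synthetic and the names means and table are illustrative. Naming the index levels with rename_axis avoids the level_0/level_1 rename, and unstack gives one row per year and one column per weekday.

```python
import numpy as np
import pandas as pd

# Synthetic daily data for 2007-2017 (placeholder values, not real data)
idx = pd.date_range("2007-01-01", "2017-12-31", freq="D")
df = pd.DataFrame({"a": np.arange(len(idx), dtype=float)}, index=idx)

# Mean of column 'a' for every (year, weekday) pair; weekday 0 is Monday
means = (
    df.groupby([df.index.year, df.index.weekday])["a"]
      .mean()
      .rename_axis(["year", "weekday"])
)

# Pivot weekdays into columns: 11 rows (years) by 7 columns (weekdays)
table = means.unstack("weekday")
print(table)
```

This replaces the 77 manual selections with a single groupby call; table.loc[2007, 2], for example, is the mean over all Wednesdays in 2007.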