duplicate specific rows of a dataframe based on column values

Hi, I have the following DataFrame:
weather day month activity
sunny Monday April go for cycling
raining Friday December stay home
I want to repeat each row 5 times, keeping the activity value only in the first copy and leaving it empty in the repeats.
So the output should be:
weather day month activity
sunny Monday April go for cycling
sunny Monday April
sunny Monday April
sunny Monday April
sunny Monday April
raining Friday December stay home
raining Friday December
raining Friday December
raining Friday December
raining Friday December

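Use Index.repeat with DataFrame.loc to repeat the rows, then blank out the repeated activity values with Series.mask and Index.duplicated:
# Repeat every row 5 times; the copies keep their original index label
df = df.loc[df.index.repeat(5)]
# Index.duplicated() is True for every copy after the first, so only those are blanked
df['activity'] = df['activity'].mask(df.index.duplicated(), '')
df = df.reset_index(drop=True)
print(df)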
weather day month activity
0 sunny Monday April go for cycling
1 sunny Monday April
2 sunny Monday April
3 sunny Monday April
4 sunny Monday April
5 raining Friday December stay home
6 raining Friday December
7 raining Friday December
8 raining Friday December
9 raining Friday December
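
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample in the question, and the helper name repeat_rows is hypothetical, not part of pandas. It assumes the original index has no duplicate labels, because Index.duplicated is what separates the first copy of a row from its repeats.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "weather": ["sunny", "raining"],
    "day": ["Monday", "Friday"],
    "month": ["April", "December"],
    "activity": ["go for cycling", "stay home"],
})


def repeat_rows(frame, n, blank_col):
    """Repeat every row n times and blank blank_col in all but the first copy.

    Hypothetical helper; assumes frame.index contains no duplicate labels.
    """
    out = frame.loc[frame.index.repeat(n)].copy()
    # After repeating, every copy beyond the first has a duplicated index label
    out[blank_col] = out[blank_col].mask(out.index.duplicated(), "")
    return out.reset_index(drop=True)


result = repeat_rows(df, 5, "activity")
print(result)
```

If the index might contain duplicates, calling reset_index(drop=True) on the frame before repeating would restore the assumption.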

Related

ValueError: Cannot create a NumPy datetime other than NaT with generic units

I have generated this df
PredictionTargetDateEOM PredictionTargetDateBOM DayAfterTargetDateEOM business_days
0 2018-12-31 2018-12-01 2019-01-01 20
1 2019-01-31 2019-01-01 2019-02-01 21
2 2019-02-28 2019-02-01 2019-03-01 20
3 2018-11-30 2018-11-01 2018-12-01 21
4 2018-10-31 2018-10-01 2018-11-01 23
... ... ... ... ...
172422 2020-10-31 2020-10-01 2020-11-01 22
172423 2020-11-30 2020-11-01 2020-12-01 20
172424 2020-12-31 2020-12-01 2021-01-01 22
172425 2020-09-30 2020-09-01 2020-10-01 21
172426 2020-08-31 2020-08-01 2020-09-01 21
with this code:
predicted_df['PredictionTargetDateBOM'] = predicted_df.apply(lambda x: pd.to_datetime(x['PredictionTargetDateEOM']).replace(day=1), axis = 1) #Get first day of the target month
predicted_df['PredictionTargetDateEOM'] = pd.to_datetime(predicted_df['PredictionTargetDateEOM'])
predicted_df['DayAfterTargetDateEOM'] = predicted_df['PredictionTargetDateEOM'] + timedelta(days=1) #Get the first day of the month after target month. i.e. M+2
predicted_df['business_days_bankers'] = predicted_df.apply(lambda x: np.busday_count(x['PredictionTargetDateBOM'].date(), x['DayAfterTargetDateEOM'].date(), holidays=[list(holidays.US(years=x['PredictionTargetDateBOM'].year).keys())[index] for index in [list(holidays.US(years=x['PredictionTargetDateBOM'].year).values()).index(item) for item in rocket_holiday_including_observed if item in list(holidays.US(years=x['PredictionTargetDateBOM'].year).values())]] ), axis = 1) #Count number of business days of the target month
This counts the number of business days in the month of the PredictionTargetDateEOM column using Python's holidays package, whose US calendar behaves like a dictionary of dates and names and includes the following holidays:
2022-01-01 New Year's Day
2022-01-17 Martin Luther King Jr. Day
2022-02-21 Washington's Birthday
2022-05-30 Memorial Day
2022-06-19 Juneteenth National Independence Day
2022-06-20 Juneteenth National Independence Day (Observed)
2022-07-04 Independence Day
2022-09-05 Labor Day
2022-10-10 Columbus Day
2022-11-11 Veterans Day
2022-11-24 Thanksgiving
2022-12-25 Christmas Day
2022-12-26 Christmas Day (Observed)
However, I would like to replicate the business day count but instead use this list called rocket_holiday as the reference for np.busday_count():
["New Year's Day",
'Martin Luther King Jr. Day',
'Memorial Day',
'Independence Day',
'Labor Day',
'Thanksgiving',
'Christmas Day',
"New Year's Day (Observed)",
'Martin Luther King Jr. Day (Observed)',
'Memorial Day (Observed)',
'Independence Day (Observed)',
'Labor Day (Observed)',
'Thanksgiving (Observed)',
'Christmas Day (Observed)']
So I've added this line
predicted_df['business_days_rocket'] = predicted_df.apply(lambda x: np.busday_count(x['PredictionTargetDateBOM'].date(), x['DayAfterTargetDateEOM'].date(), holidays=[rocket_holiday]), axis = 1)
But I get the ValueError from the title of this question. I think the problem is that the first call receives the dates of those holidays (the keys of the dictionary), whereas rocket_holiday contains only their names. So I need a function that generates the dates for the holidays in the second list dynamically, based on the year, and turns them into a dictionary. Is there a way to do that with Python's holidays package, so that I don't have to hard-code the dates?
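
The error most likely comes from holidays=[rocket_holiday]: np.busday_count expects dates there, not holiday names. One way to stay dynamic, sketched below under some assumptions, is to filter a date-to-name mapping by the wanted names and pass only the surviving dates. The us_2022 dict is copied from the 2022 listing above so that the sketch runs without the holidays package; in real code, holidays.US(years=year), which the question already uses, would play that role. The helper names holiday_dates and business_days are hypothetical.

```python
import datetime as dt

import numpy as np

# Date -> name mapping copied from the 2022 listing in the question.
# Assumption: holidays.US(years=2022) can be used in its place.
us_2022 = {
    dt.date(2022, 1, 1): "New Year's Day",
    dt.date(2022, 1, 17): "Martin Luther King Jr. Day",
    dt.date(2022, 2, 21): "Washington's Birthday",
    dt.date(2022, 5, 30): "Memorial Day",
    dt.date(2022, 6, 19): "Juneteenth National Independence Day",
    dt.date(2022, 6, 20): "Juneteenth National Independence Day (Observed)",
    dt.date(2022, 7, 4): "Independence Day",
    dt.date(2022, 9, 5): "Labor Day",
    dt.date(2022, 10, 10): "Columbus Day",
    dt.date(2022, 11, 11): "Veterans Day",
    dt.date(2022, 11, 24): "Thanksgiving",
    dt.date(2022, 12, 25): "Christmas Day",
    dt.date(2022, 12, 26): "Christmas Day (Observed)",
}

# A subset of the names from the question's rocket_holiday list
rocket_holiday = {
    "New Year's Day",
    "Martin Luther King Jr. Day",
    "Memorial Day",
    "Independence Day",
    "Labor Day",
    "Thanksgiving",
    "Christmas Day",
    "Christmas Day (Observed)",
}


def holiday_dates(calendar, wanted_names):
    """Return the dates in calendar whose holiday name is in wanted_names."""
    return sorted(day for day, name in calendar.items() if name in wanted_names)


def business_days(start, end, calendar, wanted_names):
    """Count business days in [start, end), skipping only the wanted holidays."""
    dates = holiday_dates(calendar, wanted_names)
    return int(np.busday_count(start, end, holidays=dates))


january = business_days(dt.date(2022, 1, 1), dt.date(2022, 2, 1), us_2022, rocket_holiday)
october = business_days(dt.date(2022, 10, 1), dt.date(2022, 11, 1), us_2022, rocket_holiday)
print(january, october)
```

In this sketch Columbus Day is not in rocket_holiday, so October keeps all of its weekdays, while Martin Luther King Jr. Day is removed from January.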

Get values of latest year and all its months in pandas

Below is the Raw Data.
Event Month Year
Event1 January 2012
Event1 February 2013
Event1 March 2014
Event1 April 2017
Event1 May 2017
Event1 June 2017
Event2 May 2018
Event2 May 2019
Event3 February 2012
Event3 March 2012
Event3 April 2012
The latest year for Event1 is 2017, so its months should be April, May, June.
The latest year for Event2 is 2019, so its month should be May.
The latest year for Event3 is 2012, so its months should be February, March, April.
The output should be:
Event Month Year
Event1 April 2017
Event1 May 2017
Event1 June 2017
Event2 May 2019
Event3 February 2012
Event3 March 2012
Event3 April 2012

You can broadcast each group's latest year to its rows with transform and use it to filter:
out = df[df['Year'].eq(df.groupby('Event')['Year'].transform('max'))]
Output:
Event Month Year
3 Event1 April 2017
4 Event1 May 2017
5 Event1 June 2017
7 Event2 May 2019
8 Event3 February 2012
9 Event3 March 2012
10 Event3 April 2012
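
A self-contained sketch of the same approach, with the data rebuilt from the question. The intermediate name latest is only there for illustration.

```python
import pandas as pd

# Raw data rebuilt from the question
df = pd.DataFrame({
    "Event": ["Event1"] * 6 + ["Event2"] * 2 + ["Event3"] * 3,
    "Month": ["January", "February", "March", "April", "May", "June",
              "May", "May", "February", "March", "April"],
    "Year": [2012, 2013, 2014, 2017, 2017, 2017, 2018, 2019, 2012, 2012, 2012],
})

# transform('max') returns one value per original row: the latest year of
# that row's Event, aligned with df's index
latest = df.groupby("Event")["Year"].transform("max")

# Keep only the rows whose Year equals the latest year of their Event
out = df[df["Year"].eq(latest)]
print(out)
```

Because transform keeps the original row alignment, the boolean mask can be applied directly without a merge.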

Date Offset in pandas data range

I have the following expression, which gives me the end-of-month (EOM) date every 3 months, starting in February 1990:
dates = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M")
I am looking for a concise way to get the same dates, offset by x business days.
That is, if x = 2, I want the date 2 business days before each EOM date.
Thanks for the help.

import pandas as pd
from pandas.tseries.offsets import BDay
x = 2
# Subtracting BDay(x) moves every date back by x business days
dates = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M") - BDay(x)
>>> dates
DatetimeIndex(['1990-02-26', '1990-05-29', '1990-08-29', '1990-11-28',
'1991-02-26', '1991-05-29', '1991-08-29', '1991-11-28',
'1992-02-27', '1992-05-28',
...
'2027-05-27', '2027-08-27', '2027-11-26', '2028-02-25',
'2028-05-29', '2028-08-29', '2028-11-28', '2029-02-26',
'2029-05-29', '2029-08-29'],
dtype='datetime64[ns]', length=159, freq=None)
Example
x = 2
dti1 = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M")
dti2 = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M") - BDay(x)
df = pd.DataFrame({"dti1": dti1.day_name(), "dti2": dti2.day_name()})
>>> df.head(20)
dti1 dti2
0 Wednesday Monday
1 Thursday Tuesday
2 Friday Wednesday
3 Friday Wednesday
4 Thursday Tuesday
5 Friday Wednesday
6 Saturday Thursday
7 Saturday Thursday
8 Saturday Thursday
9 Sunday Thursday
10 Monday Thursday
11 Monday Thursday
12 Sunday Thursday
13 Monday Thursday
14 Tuesday Friday
15 Tuesday Friday
16 Monday Thursday
17 Tuesday Friday
18 Wednesday Monday
19 Wednesday Monday
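
A self-contained sketch of the same idea follows. It passes the frequency as a MonthEnd(3) offset object rather than the "3M" string, on the assumption that this sidesteps the renaming of the month-end alias to "ME" in newer pandas releases. The function name eom_minus_bdays is hypothetical.

```python
import pandas as pd
from pandas.tseries.offsets import BDay, MonthEnd


def eom_minus_bdays(start, end, x, months=3):
    """Month-end dates every `months` months, each moved back x business days.

    Hypothetical helper wrapping the one-liner from the answer.
    """
    eom = pd.date_range(start=start, end=end, freq=MonthEnd(months))
    return eom - BDay(x)


dates = eom_minus_bdays("1990-02-01", "2029-09-30", x=2)
print(dates[:4])
```

BDay only skips weekends. If public holidays should be skipped as well, a CustomBusinessDay offset with a holiday list would be the natural replacement.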

How to set the columns in pandas

Here is my dataframe:
Dec-18 Jan-19 Feb-19 Mar-19 Apr-19 May-19
Saturday 2540.0 2441.0 3832.0 4093.0 1455.0 2552.0
Sunday 1313.0 1891.0 2968.0 2260.0 1454.0 1798.0
Monday 1360.0 1558.0 2967.0 2156.0 1564.0 1752.0
Tuesday 1089.0 2105.0 2476.0 1577.0 1744.0 1457.0
Wednesday 1329.0 1658.0 2073.0 2403.0 1231.0 874.0
Thursday 798.0 1195.0 2183.0 1287.0 1460.0 1269.0
I have tried some pandas operations, but I was not able to get the result I want.
This is what I want:
items
Saturday 2540.0
Sunday 1313.0
Monday 1360.0
Tuesday 1089.0
Wednesday 1329.0
Thursday 798.0
Saturday 2441.0
Sunday 1891.0
Monday 1558.0
Tuesday 2105.0
Wednesday 1658.0
Thursday 1195.0
... and so on
I want to stack the columns underneath one another as rows. How can I do that?

df.reset_index().melt(id_vars='index').drop(columns='variable')
Output:
index value
0 Saturday 2540.0
1 Sunday 1313.0
2 Monday 1360.0
3 Tuesday 1089.0
4 Wednesday 1329.0
5 Thursday 798.0
6 Saturday 2441.0
7 Sunday 1891.0
8 Monday 1558.0
9 Tuesday 2105.0
10 Wednesday 1658.0
11 Thursday 1195.0
12 Saturday 3832.0
13 Sunday 2968.0
14 Monday 2967.0
15 Tuesday 2476.0
16 Wednesday 2073.0
17 Thursday 2183.0
18 Saturday 4093.0
19 Sunday 2260.0
20 Monday 2156.0
21 Tuesday 1577.0
22 Wednesday 2403.0
23 Thursday 1287.0
24 Saturday 1455.0
25 Sunday 1454.0
26 Monday 1564.0
27 Tuesday 1744.0
28 Wednesday 1231.0
29 Thursday 1460.0
30 Saturday 2552.0
31 Sunday 1798.0
32 Monday 1752.0
33 Tuesday 1457.0
34 Wednesday 874.0
35 Thursday 1269.0
Note: I just noticed a comment suggesting the same thing. I will delete my post if requested :)
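
A minimal runnable sketch of the melt approach, extended so that the result has the day names as its index and a single items column, as requested in the question. The frame is a three-row, two-column excerpt of the question's data, and the variable name long_df is only for illustration. It assumes the index is unnamed, so that reset_index() creates a column called index.

```python
import pandas as pd

# Excerpt of the question's data: three days, two months
df = pd.DataFrame(
    {"Dec-18": [2540.0, 1313.0, 1360.0], "Jan-19": [2441.0, 1891.0, 1558.0]},
    index=["Saturday", "Sunday", "Monday"],
)

long_df = (
    df.reset_index()                              # day names become the 'index' column
      .melt(id_vars="index", value_name="items")  # one row per (day, month) pair
      .drop(columns="variable")                   # the month labels are not needed
      .set_index("index")
      .rename_axis(None)                          # drop the leftover index name
)
print(long_df)
```

melt walks through the value columns in their original order, so all the Dec-18 rows come before the Jan-19 rows.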

Create it with numpy by reshaping the data.
import pandas as pd
import numpy as np
pd.DataFrame(df.to_numpy().flatten('F'),
             index=np.tile(df.index, df.shape[1]),
             columns=['items'])
Output:
items
Saturday 2540.0
Sunday 1313.0
Monday 1360.0
Tuesday 1089.0
Wednesday 1329.0
Thursday 798.0
Saturday 2441.0
...
Sunday 1798.0
Monday 1752.0
Tuesday 1457.0
Wednesday 874.0
Thursday 1269.0

You can do:
df = df.stack().sort_index(level=1).reset_index(level = 1, drop=True).to_frame('items')
It is interesting that this method got overlooked, even though it was the fastest in my timing:
import time
start = time.time()
df.stack().sort_index(level=1).reset_index(level = 1, drop=True).to_frame('items')
end = time.time()
print("time taken {}".format(end-start))
yields: time taken 0.006181955337524414
while this:
start = time.time()
df.reset_index().melt(id_vars='days').drop(columns='variable')
end = time.time()
print("time taken {}".format(end-start))
yields: time taken 0.010072708129882812
And my output format matches what the OP requested exactly.
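
One caveat with sort_index(level=1): it sorts the column labels lexicographically, so string labels such as 'Dec-18' and 'Jan-19' would not necessarily stay in their original column order. A hedged sketch of an order-preserving variant is to transpose first, so that stack() already walks the data month by month. The frame below is a small excerpt of the question's data, and stacked is just an illustrative name.

```python
import pandas as pd

# Excerpt of the question's data: three days, two months
df = pd.DataFrame(
    {"Dec-18": [2540.0, 1313.0, 1360.0], "Jan-19": [2441.0, 1891.0, 1558.0]},
    index=["Saturday", "Sunday", "Monday"],
)

# After transposing, each row is a month; stack() then emits every day of
# the first month, then every day of the second month, in original order.
stacked = df.T.stack().droplevel(0).to_frame("items")
print(stacked)
```

No sorting is involved here, so both the month order and the day order of the original frame are kept.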

Need to Calculate Number of Orders Based on Weekday and Time Slot

I have formatted my data with pandas so that I get the number of orders placed in every 2-hour period over the past 3 months. I need the total number of orders placed in each time slot for each day of the week.
Converted OrderCount day_of_week
2/1/2019 0:00 2 Friday
2/1/2019 2:00 0 Friday
2/1/2019 4:00 0 Friday
2/1/2019 6:00 0 Friday
2/1/2019 8:00 0 Friday
2/1/2019 10:00 1 Friday
2/1/2019 12:00 2 Friday
2/1/2019 14:00 3 Friday
2/1/2019 16:00 5 Friday
2/2/2019 0:00 2 Saturday
2/2/2019 2:00 1 Saturday
2/2/2019 4:00 0 Saturday
2/2/2019 6:00 0 Saturday
2/2/2019 8:00 0 Saturday
Here Converted is my index, and the OrderCount column contains the number of orders per 2-hour time slot.
I have tried the following code:
df.groupby([df.index.hour, df.index.weekday]).count()
But this gives a totally different result.
What I want is the total number of orders placed in each time slot for each day of the week.
Example:
Converted OrderCount day_of_week
2/1/2019 0:00 2 Friday
2/8/2019 0:00 5 Friday
2/2/2019 4:00 1 Saturday
2/9/2019 4:00 10 Saturday
The output should be:
TimeSlot OrderCount day_of_week
0:00 7 Friday
4:00 11 Saturday
Here the total 7 is (2+5) and 11 is (1+10).
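
One likely reason the attempt gives a different result is that count() returns the number of rows in each group rather than the sum of OrderCount. A minimal sketch of the summing version follows, using the four example rows from the question. The TimeSlot column name comes from the desired output; formatting the slot as HH:MM with strftime is an assumption about the wanted format.

```python
import pandas as pd

# The four example rows from the question, with Converted as the index
df = pd.DataFrame(
    {
        "OrderCount": [2, 5, 1, 10],
        "day_of_week": ["Friday", "Friday", "Saturday", "Saturday"],
    },
    index=pd.to_datetime(
        ["2/1/2019 0:00", "2/8/2019 0:00", "2/2/2019 4:00", "2/9/2019 4:00"]
    ),
)
df.index.name = "Converted"

totals = (
    df.assign(TimeSlot=df.index.strftime("%H:%M"))  # time of day as text, e.g. '04:00'
      .groupby(["TimeSlot", "day_of_week"], as_index=False)["OrderCount"]
      .sum()                                        # add up the orders, not the rows
)
print(totals)
```

Grouping on the day_of_week column instead of df.index.weekday keeps the day names in the output.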
