Pulling start date, end date, and mean quantity for unbalanced dataset - python

I have a dataset (seen in the image) that consists of cities (column "IBGE"), dates, and quantities (column "QTD"). I am trying to extract three things into a new dataframe: the start date per "IBGE", the end date per "IBGE", and the mean "QTD" per "IBGE" code.
Also, before doing so, should I change the index of my dataset?
The panel data is unbalanced, so different "IBGE" values have different start dates, end dates, and means. How could I go about creating a new dataframe with this information separated into columns? I want the dataframe to look like this:
CODE     Start        End          Mean QTD
10001    2020-01-01   2022-01-01   604
10002    2019-09-01   2021-10-01   1008
10003    2019-02-01   2020-12-01   568
10004    2020-03-01   2021-05-01   223
...      ...          ...          ...
99999    2020-02-01   2022-04-01   9394
I am thinking that maybe a for or while loop could pull that information, but I am not sure how to write the code.

Try with groupby and named aggregations:
# convert the DATE column to datetime if needed
df["DATE"] = pd.to_datetime(df["DATE"])

output = df.groupby("IBGE").agg(Start=("DATE", "min"),
                                End=("DATE", "max"),
                                Mean_QTD=("QTD", "mean"))

Related

How to Calculate Year over Year Percentage Change in Dataframe with Datetime Index based on Date and not number of Periods

I have multiple Dataframes for macroeconomic timeseries. In each of these Dataframes I want to add a column showing the Year over Year percentage change. Ideally I would do this with a for loop so I don't have to repeat the process multiple times. However, the series do not have the same frequency. For example, GDP is quarterly, PCE is monthly and S&P returns are daily, so I cannot specify a single number of periods. Since my dataframes already have a datetime index, I would like to specify that the percentage change should be calculated based on the dates. Is that possible?
Please see examples of my Dataframes below:
print(gdp):
Date GDP
1947-01-01 2.034450e+12
1947-04-01 2.029024e+12
1947-07-01 2.024834e+12
1947-10-01 2.056508e+12
1948-01-01 2.087442e+12
...
2021-04-01 1.936831e+13
2021-07-01 1.947889e+13
2021-10-01 1.980629e+13
2022-01-01 1.972792e+13
2022-04-01 1.969946e+13
[302 rows x 1 columns]
print(pce):
Date PCE
1960-01-01 1.695549
1960-02-01 1.706421
1960-03-01 1.692806
1960-04-01 1.863354
1960-05-01 1.911975
...
2022-02-01 6.274030
2022-03-01 6.638595
2022-04-01 6.269216
2022-05-01 6.324989
2022-06-01 6.758935
[750 rows x 1 columns]
print(spx):
Date SPX
1928-01-03 17.76
1928-01-04 17.72
1928-01-05 17.55
1928-01-06 17.66
1928-01-09 17.59
...
2022-08-19 4228.48
2022-08-22 4137.99
2022-08-23 4128.73
2022-08-24 4140.77
2022-08-25 4199.12
[24240 rows x 1 columns]
Instead of doing this:
gdp['GDP'] = gdp['GDP'].pct_change(4)
pce['PCE'] = pce['PCE'].pct_change(12)
spx['SPX'] = spx['SPX'].pct_change(252)
I would like a for loop to do it for all Dataframes without specifying the periods but specifying that I want the percentage change from Year to Year.
Given:
d = {'Date': ['2021-02-01', '2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01',
              '2022-02-01', '2022-03-01', '2022-04-01', '2022-05-01', '2022-06-01'],
     'PCE': [1.695549, 1.706421, 1.692806, 1.863354, 1.911975,
             6.274030, 6.638595, 6.269216, 6.324989, 6.758935]}
pce = pd.DataFrame(d)
pce = pce.set_index('Date')
pce.index = pd.to_datetime(pce.index)
You could create a new dataframe with a copy of the datetime index as a new column, resample the new dataframe with annual frequency ('A') and count all unique values in the Date column.
pce_annual_rows = pce.index.to_frame()
resampled_annual = pce_annual_rows.resample('A').count()
Next you can get the second-to-last Date count and use it as the periods value in the pct_change method.
The second-to-last, because if there is an incomplete year at the end you would otherwise end up with a wrong periods value. This assumes that you have more than one year of data in every dataframe, otherwise you'll get an IndexError.
periods_per_year = resampled_annual['Date'].iloc[-2]
pce['ROC'] = pce['PCE'].pct_change(periods_per_year)
This produces the following output:
PCE ROC
Date
2021-02-01 1.695549 NaN
2021-03-01 1.706421 NaN
2021-04-01 1.692806 NaN
2021-05-01 1.863354 NaN
2021-06-01 1.911975 NaN
2022-02-01 6.274030 2.700294
2022-03-01 6.638595 2.890362
2022-04-01 6.269216 2.703446
2022-05-01 6.324989 2.394411
2022-06-01 6.758935 2.535054
This solution isn't very nice; maybe someone will come up with another, less complicated idea.
To build your for loop over every dataframe, you could either use the same column name for the column you want to apply the pct_change method to, or pair each dataframe with its column name, as in the sketch below.
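A sketch of what that loop could look like, reusing the resample-and-count idea above; the frame and column names (gdp/GDP, pce/PCE, spx/SPX) are taken from the question, and each dataframe is assumed to have a DatetimeIndex as in the prints above:
dataframes = [gdp, pce, spx]
value_cols = ['GDP', 'PCE', 'SPX']

for frame, col in zip(dataframes, value_cols):
    # observations in the second-to-last (i.e. complete) calendar year
    periods_per_year = frame.index.to_frame().resample('A').count().iloc[-2, 0]
    frame['ROC'] = frame[col].pct_change(periods_per_year)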

Counting backwards from end date in pd.Grouper

I want to aggregate daily data to weekly (7-day sums) but with the last date as the 'origin'. Is it possible to do a group-by from the end date using pd.Grouper? This is what the data looks like:
This code:
df.groupby(pd.Grouper(key='date', freq='7d'))['value'].sum()
results in
2020-01-01 5
2020-01-08 12
2020-01-15 4
but I was hoping for this:
2020-01-01 0
2020-01-03 7
2020-01-10 14
The method you have used can be shortened with pandas' resample on df, but I think your problem is where the bins start: the result you expect is anchored on the last date rather than the first. What I would recommend is splitting the df and then merging the pieces again:
df.set_index(['date'], inplace=True)
df_below = df[3:].resample('W').sum()
df_up = df.iloc[0:3, :].sum()
# or you can give dates instead of 0:3 in iloc
Take the sum of rows [0, 1, 2], then use concat (or merge) to put the two parts back into one DataFrame, as in the sketch below.
Feel free to ask further questions.
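A minimal sketch of that final merge step, continuing from the snippet above; the split position 3 and the 'W' frequency come from that snippet (you may want freq='7d' instead, to match the question) and would have to line up with where the leftover partial window sits in your data:
# turn the single summed row into a one-row frame labelled with the first date
df_up = df_up.to_frame().T
df_up.index = [df.index[0]]

# stack the partial first window on top of the weekly sums
weekly = pd.concat([df_up, df_below])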

How to obtain difference of a date column in groupby

Currently my data looks like:
user_ID order_number order_start_date order_value week_day
237 135950 1594878.0 2018-01-01 534.0 Monday
235 32911 1594942.0 2018-01-01 89.0 Monday
232 208474 1594891.0 2018-01-01 85.0 Monday
231 9048 1594700.0 2018-01-01 224.0 Monday
228 134896 1594633.0 2018-01-01 449.0 Monday
What I want to achieve is to group the records by user_ID, take the min and max date for each user, and find the difference between them in days. Where I am struggling:
Groupby does not inherently support a min-max difference.
It is not possible to perform numerical operations such as mean() on a datetime series that exists as a column in a dataframe, though it is possible for an individual series.
Any help?
I feel like your description was practically the pseudocode!
output = df.groupby('user_ID')['order_start_date'].apply(lambda g: g.max()-g.min())
You can then get the difference in days as numbers (rather than timedeltas):
output = [i / pd.Timedelta(days=1) for i in output]
The output on your example data is all 0 because there is only one entry per user; is this what you expect?
As for taking the mean, you just need to represent the dates as seconds since some reference time and then take the average. I had tried to convert everything to timedeltas since an old time and then average, but this post does it better and works well with groupby. Here's a test scenario where it's all data for one user_ID and the dates go from Jan 1st to Jan 5th, 2020:
import numpy as np

df.loc[:, 'user_ID'] = 1111
df['order_start_date'] = pd.date_range('01-01-2020', '01-05-2020', periods=5)

# represent the dates as integer seconds, average them, then convert back to datetime
df['order_start_date'] = np.array(df['order_start_date'], dtype='datetime64[s]').view('i8')
output = df.groupby('user_ID')['order_start_date'].mean().astype('datetime64[s]')
Results:
user_ID
1111 2020-01-03
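A more compact variant that returns the difference in whole days directly (a sketch; it assumes order_start_date has already been converted to datetime):
# latest minus earliest order date per user, expressed in days
days_active = (df.groupby('user_ID')['order_start_date']
                 .agg(lambda g: (g.max() - g.min()).days))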

How to filter dataframe by end of business month ('BM') using datetime?

I'm trying to look at the adjusted close stock values of a particular stock at the end of each month. I was able to get a dataframe of dates and adjclose values, but I can't seem to filter that dataframe to include only the dates that are month ends and their corresponding adjclose values.
apple_adjclose = apple_stock[['date','adjclose']]
This is the dataframe; it includes dates for 2 years in the format YYYY-MM-DD, and adjclose holds float values. Help is really appreciated!
(Sample pictures of the input, the output I'm getting, and further attempts were attached as images; still haven't figured out how to put tables in my questions :)
Let's say you have a dataframe like this with two columns:
import numpy as np
import pandas as pd

dates = pd.date_range('01/01/2016', '12/31/2017')
df = pd.DataFrame({'date': dates, 'adjclose': np.random.randint(100, 200, len(dates))})
You can create an instance of the BMonthEnd offset to get the business-month-end dates and slice the dataframe:
df.loc[df.date.isin(df.date + pd.offsets.BMonthEnd(1))]
adjclose date
28 128 2016-01-29
59 193 2016-02-29
90 167 2016-03-31
119 185 2016-04-29
151 133 2016-05-31
181 184 2016-06-30
Converted date from object to datetime, then used .asfreq() to get what I needed.
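For reference, that .asfreq() approach might look roughly like this (a sketch; business month ends that are missing from the daily data, e.g. holidays, would come back as NaN rows):
apple_adjclose['date'] = pd.to_datetime(apple_adjclose['date'])
# keep only the business-month-end timestamps from the daily series
month_end = apple_adjclose.set_index('date').asfreq('BM')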

Resampling dataframe in pandas as a checking operation

I have a DataFrame like this:
A B value
2014-11-14 12:00:00 30.5 356.3 344
2014-11-15 00:00:00 30.5 356.3 347
2014-11-15 12:00:00 30.5 356.3 356
2014-11-16 00:00:00 30.5 356.3 349
...
2017-01-06 00:00:00 30.5 356.3 347
I want to check if the index is running every 12 hours, perhaps there is some data missing, so there can be a jump of 24 or more hours. In that case I want to introduce nan in the value column and copy the values from columns A and B.
I thought of using resample:
df = df.resample('12H')
but I don't know how to handle the different columns or whether this is the right approach.
EDIT: If there is a value missing, for instance at 2015-12-12 12:00:00, I would like to add a row like this:
...
2015-12-12 00:00:00 30.5 356.3 323
2015-12-12 12:00:00 30.5 356.3 NaN *<- add this*
2015-12-13 00:00:00 30.5 356.3 347
...
You can use the asfreq method to produce evenly spaced indexes every 12 hours which will automatically put np.nan values for every jump. Then you can just forward fill columns A and B.
df1= df.asfreq('12H')
df1[['A','B']] = df1[['A','B']].fillna(method='ffill')
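To see which timestamps were filled in, you could then look at the rows where value is NaN (a small follow-up sketch):
added_rows = df1[df1['value'].isna()]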
I would go for simply sorting your dataframe on the index and creating a new column that takes the timestamp from the next row. The current row's time would be called "from" and the next row's time would be called "to".
The next step would be to use the two columns ("from" and "to") to build, for every row, the list of 12-hour timestamps between this row and the next one (a range, basically).
The final step would be to "explode" every line for each value in the range; look at How to explode a list inside a Dataframe cell into separate rows. A rough sketch is below.
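A rough sketch of that idea (the column names and the unnamed datetime index are assumptions based on the sample data, and pd.date_range's inclusive argument needs pandas 1.4+):
tmp = df.sort_index().reset_index().rename(columns={'index': 'from'})
tmp['to'] = tmp['from'].shift(-1)

# every 12-hour timestamp covered between this row and the next one
tmp['stamp'] = tmp.apply(
    lambda r: list(pd.date_range(r['from'], r['to'], freq='12H', inclusive='left'))
    if pd.notna(r['to']) else [r['from']],
    axis=1)
tmp = tmp.explode('stamp')

# timestamps that were not in the original index get NaN in 'value'
tmp.loc[tmp['stamp'] != tmp['from'], 'value'] = float('nan')
result = tmp.set_index('stamp')[['A', 'B', 'value']]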
Hope this helps :)
