randomly subsample once every month pandas - python

I have the following dataframe.
data = {'bid':['23243', '23243', '23243', '12145', '12145', '12145', '12145'],
'lid':['54346786', '23435687', '23218987', '43454432', '21113567', '54789876', '22898721'],
'date':['2021-08-11','2021-08-12','2021-09-17','2021-05-02','2021-05-11','2021-05-20','2021-08-13'],
'val1':[44,34,54,76,89,33,27],
'val2':[11,55,43,89,76,44,88]}
df = pd.DataFrame(data)
What I am looking for is to randomly pick one lid per month for each bid, and maintain a count of past instances up to the point of the random sample.
I can think of separating the year and months into different columns and then apply pd.groupby on the bid, year and month with the pd.Series.sample function, but there must be a better way of doing it.
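For reference, the year/month-column approach described in words above could be sketched like this (using the example data; `random_state` is added here only to make the sample reproducible):

```python
import pandas as pd

data = {'bid': ['23243', '23243', '23243', '12145', '12145', '12145', '12145'],
        'lid': ['54346786', '23435687', '23218987', '43454432', '21113567', '54789876', '22898721'],
        'date': ['2021-08-11', '2021-08-12', '2021-09-17', '2021-05-02', '2021-05-11', '2021-05-20', '2021-08-13'],
        'val1': [44, 34, 54, 76, 89, 33, 27],
        'val2': [11, 55, 43, 89, 76, 44, 88]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

# split year and month into their own columns, then sample one row per (bid, year, month)
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
sampled = df.groupby(['bid', 'year', 'month'], sort=False).sample(n=1, random_state=0)
print(sampled)
```

This works, but it leaves two helper columns behind; the answers below avoid that.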

Use GroupBy.cumcount per bid, and then per month and bid use DataFrameGroupBy.sample:
df['date'] = pd.to_datetime(df['date'])
# sort first if necessary
# df = df.sort_values(['bid', 'date'])
df['prev'] = df.groupby('bid').cumcount()
df1 = df.groupby(['bid', pd.Grouper(freq='M', key='date')], sort=False).sample(n=1) # in pandas >= 2.2 use freq='ME'
print (df1)
bid lid date val1 val2 prev
1 23243 23435687 2021-08-12 34 55 1
2 23243 23218987 2021-09-17 54 43 2
5 12145 54789876 2021-05-20 33 44 2
6 12145 22898721 2021-08-13 27 88 3

IIUC, use groupby.sample, assuming the date column has datetime64 dtype:
out = df.groupby([df['date'].dt.month, 'bid']).sample(n=1).reset_index(drop=True)
print(out)
# Output
bid lid date val1 val2
0 12145 21113567 2021-05-11 89 76
1 12145 22898721 2021-08-13 27 88
2 23243 23435687 2021-08-12 34 55
3 23243 23218987 2021-09-17 54 43

Related

extract number of days in month from date column, return days to date in current month

I have a pandas dataframe with a date column
I'm trying to create a function and apply it to the dataframe to create a column that returns the number of days in the month/year specified
So far I have:
from calendar import monthrange
def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y, m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
Thanks in advance.
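The "check against today and use a delta" idea from the question is workable; a minimal sketch (the column name load_date is from the question's code, the function name and capping logic are assumptions):

```python
import pandas as pd
from calendar import monthrange

def days_so_far(ts, today=None):
    """Full days in ts's month, capped at today's day when ts falls in the current month."""
    if today is None:
        today = pd.Timestamp('now')
    if (ts.year, ts.month) == (today.year, today.month):
        return today.day
    return monthrange(ts.year, ts.month)[1]

dfs = pd.DataFrame({'load_date': pd.to_datetime(['2020-01-15', '2020-02-10'])})
dfs['days'] = dfs['load_date'].apply(days_so_far)
print(dfs)  # 2020-01 -> 31, 2020-02 -> 29 (leap year)
```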
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonth column for the section where the mask returns True. Invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0

Calculation in grouped dataframe with date type index

I have a dataset like:
date_time value
30.04.20 9:31 1
30.04.20 10:12 5
30.04.20 15:16 2
01.05.20 12:01 63
01.05.20 13:00 78
02.05.20 7:23 4
02.05.20 17:34 2
02.05.20 18:34 4
02.05.20 21:39 3458
03.05.20 9:34 77
03.05.20 14:54 4
03.05.20 16:54 7
04.05.20 15:24 35
I need to group records within a day and calculate a 3-day (day_before + today + next_day) rolling total, as follows (desired result):
date value
01.05.2020 3617
02.05.2020 3697
03.05.2020 3591
I wrote the beginning of the code
import pandas as pd
df = pd.read_excel(...)
df['date'] = df['date_time'].dt.normalize()
df.groupby('date').sum()
The grouped dataframe here looks like:
date value
30.04.2020 8
01.05.2020 141
02.05.2020 3468
03.05.2020 88
04.05.2020 35
But I can't go further because I don't understand how to get the desired result in a concise "pandas" way. Please give me some pointers.
You have almost done the work; just add these lines of code to your current solution:
df_group = df.groupby('date').sum()
results = df_group.rolling(window=3, min_periods=3, center=True).sum()
print(results)
2020-04-30 NaN
2020-05-01 3617.0
2020-05-02 3697.0
2020-05-03 3591.0
2020-05-04 NaN
# retain only rows with values
print(results.dropna())
date
2020-05-01 3617.0
2020-05-02 3697.0
2020-05-03 3591.0
Hope this helps!

Pandas filter data for line graph

I'm trying to use Pandas to filter the dataframe. So in the dataset I have 1982-01 to 2019-11. I want to filter data based on year 2010 onwards ie. 2010-01 to 2019-11.
mydf = pd.read_csv('temperature_mean.csv')
df1 = mydf.set_index('month')
df= df1.loc['2010-01':'2019-11']
I had set the index = month, and I'm able to get the mean temperature for the filtered data. However, I need to get the indexes as my x labels for the line graph, and I'm not able to do it. I tried to use a regex to get data from 201x onwards, but there's still an error.
How do I get the labels for the months, i.e. 2010-01, 2010-02, ..., 2019-10, 2019-11, for the line graph?
Thanks!
mydf = pd.read_csv('temperature_mean.csv')
month mean_temp
______________________
0 1982-01-01 39
1 1985-04-01 29
2 1999-03-01 19
3 2010-01-01 59
4 2013-05-01 32
5 2015-04-01 34
6 2016-11-01 59
7 2017-08-01 14
8 2017-09-01 7
df1 = mydf.set_index('month')
df= df1.loc['2010-01':'2019-11']
mean_temp
month
______________________
2010-01-01 59
2013-05-01 32
2015-04-01 34
2016-11-01 59
2017-08-01 14
2017-09-01 7
Drawing the line plot (default x & y arguments):
df.plot.line()
If for some reason you want to manually specify the column names:
df.reset_index().plot.line(x='month', y='mean_temp')
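One caveat worth noting: if the month column was read from the csv as plain strings, both the .loc['2010-01':'2019-11'] slice and the date tick labels depend on the index being datetime. A sketch with a toy frame mirroring the question's layout:

```python
import pandas as pd

# toy frame mirroring the question's example data
mydf = pd.DataFrame({'month': ['1999-03-01', '2010-01-01', '2013-05-01', '2016-11-01'],
                     'mean_temp': [19, 59, 32, 59]})
mydf['month'] = pd.to_datetime(mydf['month'])

# with a sorted DatetimeIndex, partial-string slicing works, and the
# index labels become the x-axis ticks of df.plot.line()
df = mydf.set_index('month').sort_index().loc['2010-01':'2019-11']
print(df.index)
```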

pandas find the last row with the same value as the previous row in a df

I have a df,
acct_no code date id
100 10 01/04/2019 22
100 10 01/03/2019 22
100 10 01/05/2019 22
200 20 01/06/2019 33
200 20 01/05/2019 33
200 20 01/07/2019 33
I want to first sort the df in ascending order of date where acct_no and code are the same:
df.sort_values(['acct_no', 'code', 'date'], inplace=True)
Then I am wondering what the way is to find the last row whose acct_no and code are the same as the previous row's. The result needs to look like:
acct_no code date id
100 10 01/05/2019 22
200 20 01/07/2019 33
You can also try with groupby.last():
df.groupby(['acct_no', 'code'],as_index=False).last()
acct_no code date id
0 100 10 01/05/2019 22
1 200 20 01/07/2019 33
Use DataFrame.drop_duplicates, but first convert column to datetimes:
# if the day comes first, use dayfirst=True
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# if the month comes first:
# df['date'] = pd.to_datetime(df['date'])
df1 = (df.sort_values(['acct_no', 'code', 'date'])
.drop_duplicates(['acct_no', 'code'], keep='last'))
print (df1)
acct_no code date id
2 100 10 2019-05-01 22
5 200 20 2019-07-01 33

Using pandas to csv, how to organize time and numerical data in a multi-level index

Using pandas to write to a csv, I want Monthly Income sums for each unique Source. Month is in datetime format.
I have tried resampling and groupby methods, but groupby neglects month and resampling neglects source. I currently have a multi-level index with Month and Source as indexes.
Month Source Income
2019-03-01 A 100
2019-03-05 B 50
2019-03-06 A 4
2019-03-22 C 60
2019-04-23 A 40
2019-04-24 A 100
2019-04-24 C 30
2019-06-1 C 100
2019-06-1 B 90
2019-06-8 B 20
2019-06-12 A 50
2019-06-27 C 50
I can groupby Source which neglects date, or I can resample for date which neglects source. I want monthly sums for each unique source.
What you have in the Month column is a Timestamp. So you can separate the month attribute of this Timestamp and afterward apply the groupby method, like this:
df.columns = ['Timestamp', 'Source', 'Income']
month_list = []
for i in range(len(df)):
    month_list.append(df.loc[i, 'Timestamp'].month)
df['Month'] = month_list
df1 = df.groupby(['Month', 'Source']).sum()
The output should be like this:
Income
Month Source
3 A 104
B 50
C 60
4 A 140
C 30
6 A 50
B 110
C 150
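A vectorized alternative to the loop above could look like this (a sketch on a subset of the example data; pd.Grouper with a month-start frequency bins the timestamps by calendar month while keeping the year in the group key):

```python
import pandas as pd

df = pd.DataFrame({'Month': pd.to_datetime(['2019-03-01', '2019-03-05', '2019-03-06',
                                            '2019-04-23', '2019-04-24']),
                   'Source': ['A', 'B', 'A', 'A', 'C'],
                   'Income': [100, 50, 4, 40, 30]})

# each group is labeled by its month start, e.g. 2019-04-23 falls under 2019-04-01
out = df.groupby([pd.Grouper(key='Month', freq='MS'), 'Source'])['Income'].sum()
print(out)
```

Keeping the year in the key avoids merging, say, March 2019 with March 2020 if the data ever spans multiple years.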
