Randomly sample rows based on year-month - python

data = {'date': ['2019-01-01', '2019-01-02', '2020-01-01', '2020-02-02'],
        'tweets': ["aaa", "bbb", "ccc", "ddd"]}
df = pandas.DataFrame(data)
df['daate'] = pandas.to_datetime(df['date'], infer_datetime_format=True)
So I have an object-type date and a datetime64[ns]-type date. Imagine that I have 100 rows in each year-month. How can I randomly sample 10 rows from each year-month and put them into a data frame? Thanks!

Use DataFrame.groupby by year and month (or by month periods) together with a custom lambda function calling DataFrame.sample:
df1 = (df.groupby([df['daate'].dt.year, df['daate'].dt.month], group_keys=False)
         .apply(lambda x: x.sample(n=10)))
Or:
df1 = (df.groupby(df['daate'].dt.to_period('m'), group_keys=False)
         .apply(lambda x: x.sample(n=10)))
Sample:
import numpy as np
import pandas as pd

data = {'daate': pd.date_range('2019-01-01', '2020-01-22'),
        'tweets': np.random.choice(["aaa", "bbb", "ccc", "ddd"], 387)}
df = pd.DataFrame(data)
df1 = (df.groupby([df['daate'].dt.year, df['daate'].dt.month], group_keys=False)
         .apply(lambda x: x.sample(n=10)))
print (df1)
    tweets      daate
9      bbb 2019-01-10
29     ddd 2019-01-30
17     ccc 2019-01-18
12     ccc 2019-01-13
20     ddd 2019-01-21
..     ...        ...
381    bbb 2020-01-17
375    aaa 2020-01-11
373    bbb 2020-01-09
368    aaa 2020-01-04
382    bbb 2020-01-18
[130 rows x 2 columns]
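On pandas 1.1+, GroupBy.sample does the per-group sampling without the lambda; a minimal sketch of the month-period variant (a seeded random generator is used here only so the run is reproducible):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "daate": pd.date_range("2019-01-01", "2020-01-22"),
    "tweets": rng.choice(["aaa", "bbb", "ccc", "ddd"], 387),
})

# pandas >= 1.1: sample directly inside each year-month group
df1 = df.groupby(df["daate"].dt.to_period("M")).sample(n=10)

# 13 year-month periods * 10 rows each
print(len(df1))  # 130
```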

import pandas as pd

data = {"date": ["2019-01-01", "2019-01-02", "2020-01-01", "2020-02-02"], "tweets": ["aaa", "bbb", "ccc", "ddd"]}
df = pd.DataFrame(data)
df["daate"] = pd.to_datetime(df["date"], infer_datetime_format=True)
# Just duplicating rows so every date has 100 of them
df = df.loc[df.index.repeat(100)]
# The actual code (DataFrame.append was removed in pandas 2.0, so collect the pieces and concat)
available_dates = df["daate"].unique()
sampled_parts = []
for each_date in available_dates:
    rows_with_that_date = df.loc[df["daate"] == each_date]
    sampled_rows_with_that_date = rows_with_that_date.sample(5)  # 5 samples per date
    sampled_parts.append(sampled_rows_with_that_date)
sampled_df = pd.concat(sampled_parts)
print(len(sampled_df))
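The whole loop collapses to a single GroupBy.sample call on pandas 1.1+; a sketch with the same duplicated frame:

```python
import pandas as pd

data = {"date": ["2019-01-01", "2019-01-02", "2020-01-01", "2020-02-02"],
        "tweets": ["aaa", "bbb", "ccc", "ddd"]}
df = pd.DataFrame(data)
df["daate"] = pd.to_datetime(df["date"])
df = df.loc[df.index.repeat(100)]  # 100 copies of each row

# One call: 5 random rows per distinct date (4 dates -> 20 rows)
sampled_df = df.groupby("daate", group_keys=False).sample(5)
print(len(sampled_df))  # 20
```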

Related

Convert pandas dataframe hourly values in column names (H1, H2,... ) to a series in a separate column

I am trying to convert a dataframe in which hourly data appears in distinct columns, like here:
... to a dataframe that only contains two columns ['datetime', 'value'].
For example:
Datetime               value
2020-01-01 01:00:00    0
2020-01-01 02:00:00    0
...                    ...
2020-01-01 09:00:00    106
2020-01-01 10:00:00    2852
Any solution without using a for-loop?
Use DataFrame.melt, convert the values to datetimes, and add the hours via to_timedelta after stripping the H prefix:
df = df.melt('Date')
td = pd.to_timedelta(df.pop('variable').str.strip('H').astype(int), unit='H')
df['Date'] = pd.to_datetime(df['Date']) + td
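Put together on a tiny frame (the Date/H1/H2 column names are assumed from the question's layout), the melt-plus-timedelta steps look like:

```python
import pandas as pd

# Hypothetical wide frame: one row per day, one column per hour
df = pd.DataFrame({"Date": ["2020-01-01", "2020-01-02"],
                   "H1": [0, 10],
                   "H2": [106, 2852]})

df = df.melt("Date")
# "H1" -> 1 hour, "H2" -> 2 hours, ...
td = pd.to_timedelta(df.pop("variable").str.strip("H").astype(int), unit="h")
df["Date"] = pd.to_datetime(df["Date"]) + td
print(df.sort_values("Date"))
```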
You can do it by applying several functions to the DataFrame:
from datetime import datetime

import pandas as pd

# Example DataFrame
df = pd.DataFrame({'date': ['1/1/2020', '1/2/2020', '1/3/2020'],
                   'h1': [0, 222, 333],
                   'h2': [44, 0, 0],
                   'h3': [1, 2, 3]})
# To simplify, only hours in range 1...3 are used, so you must change it to 25 for a full day
HOURS_COUNT = 4
df["hours"] = df.apply(lambda row: [h for h in range(1, HOURS_COUNT)], axis=1)
df["hour_values"] = df.apply(lambda row: {h: row[f"h{h}"] for h in range(1, HOURS_COUNT)}, axis=1)
df = df.explode("hours")
df["value"] = df.apply(lambda row: row["hour_values"][row["hours"]], axis=1)
df["date_full"] = df.apply(lambda row: datetime.strptime(f"{row['date']} {row['hours']}", "%m/%d/%Y %H"), axis=1)
df = df[["date_full", "value"]]
df = df.loc[df["value"] > 0]
So the initial DataFrame is:
       date   h1  h2  h3
0  1/1/2020    0  44   1
1  1/2/2020  222   0   2
2  1/3/2020  333   0   3
And the result DataFrame is:
            date_full  value
0 2020-01-01 02:00:00     44
0 2020-01-01 03:00:00      1
1 2020-01-02 01:00:00    222
1 2020-01-02 03:00:00      2
2 2020-01-03 01:00:00    333
2 2020-01-03 03:00:00      3

Date-time column names in pandas

I have a usage data per customer, collected per months during several years, shaped as ~(6000, 60).
Sample dataframe:
df = pd.DataFrame({'id': ['user_1', 'user_2'], 'access_type': ['mobile', 'desktop'], '2018-09-01 00:00:00': [7,5], '2018-10-01 00:00:00':[1,3], '2018-11-01 00:00:00':[0,10]})
id access_type 2018-09-01 00:00:00 2018-10-01 00:00:00 2018-11-01 00:00:00
0 user_1 mobile 7 1 0
1 user_2 desktop 5 3 10
How do I change 40 selected date-columns to a datetime index (?) format, or other format that will allow selecting/slicing required periods of time as date?
Use DataFrame.melt with DataFrame.set_index:
df2 = (df.melt(['id','access_type'], var_name='date')
         .assign(date = lambda x: pd.to_datetime(x['date']))
         .set_index('date'))
print (df2)
id access_type value
date
2018-09-01 user_1 mobile 7
2018-09-01 user_2 desktop 5
2018-10-01 user_1 mobile 1
2018-10-01 user_2 desktop 3
2018-11-01 user_1 mobile 0
2018-11-01 user_2 desktop 10
If you need a MultiIndex, use set_index with DataFrame.stack:
s = (df.set_index(['id','access_type'])
       .stack()
       .rename(index = lambda x: pd.to_datetime(x), level=2))
print (s)
Or:
s = (df.melt(['id','access_type'], var_name='date')
       .assign(date = lambda x: pd.to_datetime(x['date']))
       .set_index(['id','access_type','date'])['value'])
print (s)
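With the dates on the index, partial-string slicing works the way the question asks; a self-contained check (rebuilding the sample frame from above):

```python
import pandas as pd

df = pd.DataFrame({"id": ["user_1", "user_2"],
                   "access_type": ["mobile", "desktop"],
                   "2018-09-01 00:00:00": [7, 5],
                   "2018-10-01 00:00:00": [1, 3],
                   "2018-11-01 00:00:00": [0, 10]})

df2 = (df.melt(["id", "access_type"], var_name="date")
         .assign(date=lambda x: pd.to_datetime(x["date"]))
         .set_index("date"))

# Select a period of time directly by partial date string
oct_usage = df2.loc["2018-10"]
print(oct_usage)
```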

How to replace timestamp across the columns using pandas

df = pd.DataFrame({
    'subject_id': [1, 1, 2, 2],
    'time_1': ['2173/04/11 12:35:00', '2173/04/12 12:50:00', '2173/04/11 12:59:00', '2173/04/12 13:14:00'],
    'time_2': ['2173/04/12 16:35:00', '2173/04/13 18:50:00', '2173/04/13 22:59:00', '2173/04/21 17:14:00'],
    'val': [5, 5, 40, 40],
    'iid': [12, 12, 12, 12]
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['time_2'] = pd.to_datetime(df['time_2'])
df['day'] = df['time_1'].dt.day
Currently my dataframe looks as shown below.
I would like to replace the timestamp in the time_1 column with 00:00:00 and in the time_2 column with 23:59:00.
This is what I tried, but it doesn't work:
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.datetime.strftime(x, "%H:%M:%S") == "00:00:00") #approach 1
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.pd.Timestamp(hour = '00', second = '00')) #approach 2
I expect my output dataframe to be like as shown below
In pandas, if all datetimes in a column have 00:00:00 times, the time component is not displayed.
Use Series.dt.floor or Series.dt.normalize to remove the times, and for the second column additionally add a DateOffset:
df['time_1'] = pd.to_datetime(df['time_1']).dt.floor('d')
#alternative
#df['time_1'] = pd.to_datetime(df['time_1']).dt.normalize()
df['time_2'] = pd.to_datetime(df['time_2']).dt.floor('d') + pd.DateOffset(hours=23, minutes=59)
df['day'] = df['time_1'].dt.day
print (df)
subject_id time_1 time_2 val iid day
0 1 2173-04-11 2173-04-12 23:59:00 5 12 11
1 1 2173-04-12 2173-04-13 23:59:00 5 12 12
2 2 2173-04-11 2173-04-13 23:59:00 40 12 11
3 2 2173-04-12 2173-04-21 23:59:00 40 12 12
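The same end-of-day stamp can also be written with a single Timedelta; a small check that both columns land where expected:

```python
import pandas as pd

df = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "time_1": ["2173/04/11 12:35:00", "2173/04/12 12:50:00",
               "2173/04/11 12:59:00", "2173/04/12 13:14:00"],
    "time_2": ["2173/04/12 16:35:00", "2173/04/13 18:50:00",
               "2173/04/13 22:59:00", "2173/04/21 17:14:00"],
})

df["time_1"] = pd.to_datetime(df["time_1"]).dt.normalize()  # -> 00:00:00
df["time_2"] = (pd.to_datetime(df["time_2"]).dt.normalize()
                + pd.Timedelta(hours=23, minutes=59))       # -> 23:59:00
print(df)
```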

pandas merge on date range

I have two dataframes,
df = pd.DataFrame({'Date': ['2011-01-02', '2011-04-10', '2015-02-02', '2016-03-03'], 'Price': [100, 200, 300, 400]})
df2 = pd.DataFrame({'Date': ['2011-01-01', '2014-01-01'], 'Revenue': [14, 128]})
I want to add df2.Revenue to df to produce the table below, using both date columns for reference.
Date Price Revenue
2011-01-02 100 14
2011-04-10 200 14
2015-02-02 300 128
2016-03-03 400 128
As above, the revenue is added according to df2.Date and df.Date.
Use merge_asof:
df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
df3 = pd.merge_asof(df, df2, on='Date')
print (df3)
Date Price Revenue
0 2011-01-02 100 14
1 2011-04-10 200 14
2 2015-02-02 300 128
3 2016-03-03 400 128
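merge_asof matches each df row with the last df2 row whose Date is at or before it (the default direction='backward'); both frames must be sorted on the key. A self-contained run:

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2011-01-02", "2011-04-10",
                                           "2015-02-02", "2016-03-03"]),
                   "Price": [100, 200, 300, 400]})
df2 = pd.DataFrame({"Date": pd.to_datetime(["2011-01-01", "2014-01-01"]),
                    "Revenue": [14, 128]})

# Each left row gets the most recent Revenue on or before its Date
df3 = pd.merge_asof(df, df2, on="Date")
print(df3["Revenue"].tolist())  # [14, 14, 128, 128]
```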

Applying Date Operation to Entire Data Frame

import pandas as pd
import numpy as np
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
In this data frame, I am interested in creating a field called 'year_month' such that each value looks like so:
datetime.date(df['year'][0], df['month'][0], 1).strftime("%Y%m")
I'm stuck on how to apply this operation to the entire data frame and would appreciate any help.
Join both columns converted to strings, zero-padding the months with zfill:
df['new'] = df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
Or add a new column day via assign, convert the columns with to_datetime, and finally apply strftime:
df['new'] = pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
If there are multiple other columns in the DataFrame, select only the relevant ones first:
df['new'] = pd.to_datetime(df.assign(day=1)[['day','month','year']]).dt.strftime("%Y%m")
print (df)
month year new
0 1 2018 201801
1 2 2018 201802
2 3 2018 201803
3 4 2018 201804
4 5 2018 201805
5 6 2018 201806
6 7 2018 201807
7 8 2018 201808
8 9 2018 201809
9 10 2018 201810
10 11 2018 201811
11 12 2018 201812
Timings:
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
df = pd.concat([df] * 1000, ignore_index=True)
In [212]: %timeit pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
10 loops, best of 3: 74.1 ms per loop
In [213]: %timeit df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
10 loops, best of 3: 41.3 ms per loop
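Both formulations produce identical strings; a quick equivalence check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": np.repeat(2018, 12), "month": range(1, 13)})

# String concatenation vs. round-trip through datetime
via_str = df["year"].astype(str) + df["month"].astype(str).str.zfill(2)
via_dt = pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")

print(via_str.equals(via_dt))  # True
```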
One way would be to create the datetime objects directly from the source data:
import pandas as pd
import numpy as np
from datetime import date
df = pd.DataFrame({'date': [date(i, j, 1) for i, j
                            in zip(np.repeat(2018, 12), range(1, 13))]})
# date
# 0 2018-01-01
# 1 2018-02-01
# 2 2018-03-01
# 3 2018-04-01
# 4 2018-05-01
# 5 2018-06-01
# 6 2018-07-01
# 7 2018-08-01
# 8 2018-09-01
# 9 2018-10-01
# 10 2018-11-01
# 11 2018-12-01
You could use an apply function such as the following (accessing the columns by name so the column order doesn't matter; needs from datetime import date):
df['year_month'] = df.apply(lambda row: date(row['year'], row['month'], 1).strftime("%Y%m"), axis=1)
