I have two dataframes: one is Price and the other is Volume. They are both hourly and cover the same timeframe (one year).
dfP = pd.DataFrame(np.random.randint(5, 10, (8760,4)), index=pd.date_range('2008-01-01', periods=8760, freq='H'), columns='Col1 Col2 Col3 Col4'.split())
dfV = pd.DataFrame(np.random.randint(50, 100, (8760,4)), index=pd.date_range('2008-01-01', periods=8760, freq='H'), columns='Col1 Col2 Col3 Col4'.split())
Each day is a set in the sense that its values have to stay together: when a sample is generated, it needs to be a full day (for example, the 24 hours of Feb 2, 2008). I would like to generate a 185-day (50%) sample set for dfP and take the volumes from the same days so I can generate a sum product.
dfProduct = dfP_Sample * dfV_Sample
I am lost on how to achieve this. Any help is appreciated.
It sounds like you're expecting to get the sum of the volumes and prices for each day and then multiply them together?
If that's the case, try the following. If not, please clarify your question.
priceGroup = dfP.groupby(by=dfP.index.date).sum()
volumeGroup = dfV.groupby(by=dfV.index.date).sum()
dfProduct = priceGroup*volumeGroup
If you want to just look at a specific date range, try
import datetime
dfProduct[np.logical_and(dfProduct.index > datetime.date(2006, 8, 9), dfProduct.index < datetime.date(2007, 1, 2))]
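Alternatively, here is a small sketch using label-based slicing, assuming you first convert the grouped date index to a proper DatetimeIndex (the dates below are arbitrary examples, not from the question):

# Sketch: groupby(...index.date) leaves plain date objects, so convert first
dfProduct.index = pd.to_datetime(dfProduct.index)
subset = dfProduct.loc['2008-02-01':'2008-03-01']   # arbitrary example range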
First of all, we'll generate a column that holds the day index within the year; for example, 2008-01-01 will be assigned 1 because it is the first day of the year, and so on.
day_order = [date.timetuple().tm_yday for date in dfP.index]
dfP['day_order'] = day_order
Then generate 185 random days from 1 to 365; these represent day positions within the year, so for example the random number 1 indicates 2008-01-01.
random_days = np.random.choice(np.arange(1, 366), size=185, replace=False)
Then slice your original data frame to keep only the rows whose day_order value is in the random sample:
dfP_sample = dfP[dfP.day_order.isin(random_days)]
Then you can merge both frames on the index and do whatever you want with the result:
final = pd.merge(dfP_sample, dfV, left_index=True, right_index=True)
final.head()
Out[47]:
Col1_x Col2_x Col3_x Col4_x day_order Col1_y Col2_y Col3_y Col4_y
2008-01-03 00:00:00 9 6 9 9 3 66 85 62 82
2008-01-03 01:00:00 5 8 9 8 3 54 89 65 98
2008-01-03 02:00:00 7 5 5 9 3 83 58 60 96
2008-01-03 03:00:00 9 5 7 6 3 59 54 67 78
2008-01-03 04:00:00 9 5 8 9 3 92 66 66 55
If you don't want to merge both frames, you can apply the same logic to dfV, and then you will get samples from both data frames on the same days.
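For example, a minimal sketch of that second route, reusing the day_order idea and the random_days array from above (variable names are just for illustration):

# Sketch: sample dfV on the same randomly chosen days, then multiply
dfV['day_order'] = [date.timetuple().tm_yday for date in dfV.index]
dfV_sample = dfV[dfV.day_order.isin(random_days)]
dfProduct = dfP_sample.drop('day_order', axis=1) * dfV_sample.drop('day_order', axis=1)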
I have a table that looks like this:
temp = [['K98R', 'AB',34,'2010-07-27', '2013-08-17', '2008-03-01', '2011-05-02', 44],['S33T','ES',55, '2009-07-23', '2012-03-12', '2010-09-17', '', 76]]
Data = pd.DataFrame(temp,columns=['ID','Initials','Age', 'Entry','Exit','Event1','Event2','Weight'])
The table above has entry and exit dates, along with dates for events 1 and 2. The event 2 date is missing for the second patient because that event didn't happen. Also note that event 1 for the first patient happened before the entry date.
What I am trying to achieve is threefold:
1. Split the time between the entry and exit into years
2. Convert the wide format to long one with one row per year
3. Check if event 1 and 2 have occurred during the time period included in each row
To explain further, here is the output I am trying to get:
ID   Initials Age Entry      Exit       Event1 Event2 Weight
K98R AB       34  27/07/2010 31/12/2010 1      0      44
K98R AB       35  1/01/2011  31/12/2011 1      1      44
K98R AB       36  1/01/2012  31/12/2012 1      1      44
K98R AB       37  1/01/2013  17/08/2013 1      1      44
S33T ES       55  23/07/2009 31/12/2009 0      0      76
S33T ES       56  1/01/2010  31/12/2010 1      0      76
S33T ES       57  1/01/2011  31/12/2011 1      0      76
S33T ES       58  1/01/2012  12/03/2012 1      0      76
What you notice here is that the entry-to-exit period is split into individual rows per patient, each representing a year. The event columns are now coded as 0 (the event has not yet happened) or 1 (the event has happened), and the 1 is carried over to later years because the event has already happened.
The age increases in every row per patient as time progresses
The patient ID and initial remain the same as well as the weight.
Could anyone please help with this, thank you
Begin by getting the number of years between Entry and Exit:
# Convert to datetime
df.Entry = pd.to_datetime(df.Entry)
df.Exit = pd.to_datetime(df.Exit)
df.Event1 = pd.to_datetime(df.Event1)
df.Event2 = pd.to_datetime(df.Event2)
# Round up, to include the upper years
import math
df['Years_Between'] = (df.Exit - df.Entry).apply(lambda x: math.ceil(x.days/365))
# printing the df will provide the following:
ID Initials Age Entry Exit Event1 Event2 Weight Years_Between
0 K98R AB 34 2010-07-27 2013-08-17 2008-03-01 2011-05-02 44 4
1 S33T ES 55 2009-07-23 2012-03-12 2010-09-17 NaT 76 3
Loop through your data and create a new row for each year:
new_data = []
for idx, row in df.iterrows():
    year = row['Entry'].year
    new_entry = pd.to_datetime(year, format='%Y')
    for y in range(row['Years_Between']):
        new_entry = new_entry + pd.DateOffset(years=1)
        new_exit = new_entry + pd.DateOffset(years=1) - pd.DateOffset(days=1)
        record = {'Entry': new_entry, 'Exit': new_exit}
        if row['Entry'] > new_entry:
            record['Entry'] = row['Entry']
        if row['Exit'] < new_exit:
            record['Exit'] = row['Exit']
        for col in ['ID', 'Initials', 'Age', 'Event1', 'Event2', 'Weight']:
            record[col] = row[col]
        new_data.append(record)
Create a new DataFrame, then compare the dates:
df_new = pd.DataFrame(new_data, columns = ['ID','Initials','Age', 'Entry','Exit','Event1','Event2','Weight'])
df_new['Event1'] = (df_new.Event1 <= df_new.Exit).astype(int)
df_new['Event2'] = (df_new.Event2 <= df_new.Exit).astype(int)
# printing df_new will provide:
ID Initials Age Entry Exit Event1 Event2 Weight
0 K98R AB 34 2011-01-01 2011-12-31 1 1 44
1 K98R AB 34 2012-01-01 2012-12-31 1 1 44
2 K98R AB 34 2013-01-01 2013-08-17 1 1 44
3 K98R AB 34 2014-01-01 2013-08-17 1 1 44
4 S33T ES 55 2010-01-01 2010-12-31 1 0 76
5 S33T ES 55 2011-01-01 2011-12-31 1 0 76
6 S33T ES 55 2012-01-01 2012-03-12 1 0 76
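The desired output also has the Age increasing with each year-row; one way to add that afterwards (a sketch, not part of the code above) is to bump each patient's Age by its row position within that patient:

# Sketch: increase Age by one for every additional year-row of the same patient
df_new['Age'] = df_new['Age'] + df_new.groupby('ID').cumcount()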
I have a dataframe that I need to spin around (I am not sure whether this involves stacking or pivoting).
So, where I have duplicate values in the columns "Year", "Month" and "Group", I want to shift the following columns so that they are repeated for each Variable.
So if this is the original DF:
Year Month Group Variable feature1 feature2 feature3
2010 6 1 1 12 23 56
2010 6 1 2 34 56 25
The result will be:
Year Month Group Variable1 feature1_1 feature2_1 feature3_1 Variable2 feature1_2 feature2_2 feature3_2
2010 6 1 1 12 23 56 2 34 56 25
I am looking for something along these lines; any tips/help is much appreciated.
Thank you
Izzy
IIUC, if you want to convert it back from long to wide format, you can use cumcount to get the additional key, then reshape. (Note that this is the reverse of wide_to_long.)
df['New'] = (df.groupby(['Year','Month','Group']).cumcount() + 1).astype(str)
w = df.set_index(['Year','Month','Group','New']).unstack().sort_index(level=1, axis=1)
w.columns = w.columns.map('_'.join)
w
Out[217]:
                  Variable_1  feature1_1  feature2_1  feature3_1  Variable_2  feature1_2  feature2_2  feature3_2
Year Month Group
2010 6     1               1          12          23          56           2          34          56          25
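To sanity-check that this really is the reverse of wide_to_long, you can round-trip the result (a sketch, using the column names from above):

# Sketch: flatten the index and round-trip back to the long layout
flat = w.reset_index()
long_again = pd.wide_to_long(flat,
                             stubnames=['Variable', 'feature1', 'feature2', 'feature3'],
                             i=['Year', 'Month', 'Group'],
                             j='New',
                             sep='_')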
I have the following sample data frame:
id category time
43 S 8
22 I 10
15 T 350
18 L 46
I want to apply the following logic:
1) If the category value equals "T", create a new column called "time_2" where the "time" value is divided by 24.
2) If the category value equals "L", create a new column called "time_2" where the "time" value is divided by 3.5.
3) Otherwise, keep the existing "time" value (categories S and I).
Below is my desired output table:
id category time time_2
43 S 8 8
22 I 10 10
15 T 350 14.58333333
18 L 46 13.14285714
I've tried using pd.np.where to get the above to work but am confused about the syntax.
You can use map for the rules:
In [1066]: df['time_2'] = df.time / df.category.map({'T': 24, 'L': 3.5}).fillna(1)
In [1067]: df
Out[1067]:
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857
You can use np.select. This is a good alternative to nested np.where logic.
conditions = [df['category'] == 'T', df['category'] == 'L']
values = [df['time'] / 24, df['time'] / 3.5]
df['time_2'] = np.select(conditions, values, df['time'])
print(df)
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857
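Since the question mentions np.where: the same result with nested np.where calls would look roughly like this (a sketch, equivalent to the np.select version above):

# Sketch: nested np.where, outer condition checked first
df['time_2'] = np.where(df['category'] == 'T', df['time'] / 24,
                        np.where(df['category'] == 'L', df['time'] / 3.5, df['time']))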
user_id char_id rating
100 33 3
100 44 2
100 33 1
100 44 4
111 55 5
111 44 4
111 55 5
I have a data frame formatted similarly to this one and am trying to perform calculations on the ratings after they have been grouped by user_id and char_id.
It doesn't work, but I need to do something like data.groupby('user_id', 'char_id') and then calculate the moving average for each char_id within each user_id. Any help? I have several thousand user_ids, so I can't go through and select one at a time for the calculations.
I need to somehow iterate over the user_id column and group all the same user_ids together, and save that format so that user_ids are separate. Then I need to do the same thing, iterating over char_id for each user_id subset and saving that format so that I can finally perform calculations on the subsets of subsets of ratings. So far all my attempts have been unsuccessful. The closest I came was:
def divide_by_user(data):
    for user in data['user_id']:
        user_data = data.where(data['user_id'] == user)
    return user_data
There's no need to do this manually; creating and summarizing subsets like this is exactly what DataFrame.groupby() is for. Create your groupby:
grouped = df.groupby(['user_id', 'char_id'])
Then you can apply a function to each subset. It sounds like you want either rolling_mean or expanding_mean, both of which are already available in pandas:
df['cum_average'] = grouped['rating'].apply(pd.expanding_mean)
# New column now contains the average rating for each subset,
# including all values that have been seen so far.
df
Out[43]:
user_id char_id rating cum_average
0 100 33 3 3
1 100 44 2 2
2 100 33 1 2
3 100 44 4 3
4 111 55 5 5
5 111 44 4 4
6 111 55 5 5
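Note that pd.expanding_mean was removed in later pandas releases; with a recent version the same idea can be written roughly as follows (a sketch):

# Sketch: modern equivalent of pd.expanding_mean within each group
df['cum_average'] = grouped['rating'].transform(lambda s: s.expanding().mean())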
Using a larger randomly generated dataset to demonstrate rolling_mean():
import random
n_rows = 100  # arbitrary sample size for the demo

df = pd.DataFrame({
    'user_id': [random.choice([100, 111, 112]) for n in range(n_rows)],
    'char_id': [random.choice([33, 44, 55]) for n in range(n_rows)],
    'rating': [random.choice([1, 2, 3, 4, 5]) for n in range(n_rows)]
})
grouped = df.groupby(['user_id', 'char_id'])
df['cum_average'] = grouped['rating'].apply(pd.rolling_mean, window=7)
# Output. The rolling average will be NaN until enough values have been
# observed for that subset; you can change this using the
# min_periods argument to rolling_mean.
df.sort(columns=['user_id', 'char_id'])
char_id rating user_id cum_average
3 33 1 100 NaN
19 33 2 100 NaN
22 33 5 100 NaN
34 33 1 100 NaN
47 33 1 100 NaN
48 33 1 100 NaN
49 33 1 100 1.714286
51 33 4 100 2.142857
55 33 2 100 2.142857
60 33 2 100 1.714286
66 33 2 100 1.857143
...
etc.
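Likewise, pd.rolling_mean is gone in current pandas; the grouped 7-period rolling average can be written along these lines (a sketch):

# Sketch: modern equivalent of the grouped rolling mean
df['cum_average'] = grouped['rating'].transform(lambda s: s.rolling(window=7).mean())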
Try this:
"df" is the dataFrame
mean=pd.rolling_mean(df.rating, 7)
I have a pandas dataframe with 1408 lines of data. My goal is to compare the largest and smallest numbers associated with a given weekday during one week to the next week's number on the same day of the week on which the prior largest/smallest occurred. Essentially, I want to look at quintiles (since there are 5 days in a business week), ranks 1 and 5, and see how they change from week to week, building a CDF of the numbers associated with each weekday.
To clean the data, I need to remove 18 weeks in total: every week in the dataframe associated with a holiday, plus the entire week following the week in which the holiday occurred.
After this, I think I should insert a column in the dataframe that labels every date in the file (there are 6 years of data) with Monday through Friday. The reason for labeling M-F is so that I can sort the numbers associated with each day of the week in ascending order and query on the day of the week.
Methodological suggestions on either 1. or 2. or both would be immensely appreciated.
Thank you!
#2 seems like it's best tackled with a combination of df.groupby() and apply() on the resulting GroupBy object. Perhaps an example is the best way to explain.
Given a dataframe:
In [53]: df
Out[53]:
Value
2012-08-01 61
2012-08-02 52
2012-08-03 89
2012-08-06 44
2012-08-07 35
2012-08-08 98
2012-08-09 64
2012-08-10 48
2012-08-13 100
2012-08-14 95
2012-08-15 14
2012-08-16 55
2012-08-17 58
2012-08-20 11
2012-08-21 28
2012-08-22 95
2012-08-23 18
2012-08-24 81
2012-08-27 27
2012-08-28 81
2012-08-29 28
2012-08-30 16
2012-08-31 50
In [54]: def rankdays(df):
   .....:     if len(df) != 5:
   .....:         return pandas.Series()
   .....:     return pandas.Series(df.Value.rank(), index=df.index.weekday)
   .....:
In [52]: df.groupby(lambda x: x.week).apply(rankdays).unstack()
Out[52]:
0 1 2 3 4
32 2 1 5 4 3
33 5 4 1 2 3
34 1 3 5 2 4
35 2 5 3 1 4
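For #1, nothing above handles the holiday weeks; a rough sketch with a recent pandas version, assuming df is indexed by date as in the example above (the holiday list below is a hypothetical placeholder, substitute your own calendar):

# Sketch: drop every week that contains a holiday, plus the following week
holidays = pd.to_datetime(['2012-08-06'])        # hypothetical holiday list
week_ids = df.index.to_period('W')               # label each row with its week
bad_weeks = []
for h in holidays:
    bad_weeks += [h.to_period('W'), h.to_period('W') + 1]   # holiday week and the week after
df_clean = df[~week_ids.isin(bad_weeks)]
# label the remaining rows with their weekday name for later sorting/querying (step 2)
df_clean = df_clean.assign(weekday=df_clean.index.day_name())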