Python Pandas - Dynamic matching of different date indices

I have two dataframes with different time series data (see example below). Dataframe1 contains multiple daily observations per month, whereas Dataframe2 contains only one observation per month.
What I want to do is align the data in Dataframe2 with the last day of every month in Dataframe1. The last day per month in Dataframe1 is not necessarily the last day of that calendar month.
I'm grateful for any hint on how to tackle this problem efficiently, as the dataframes can be quite large.
Dataframe1
----------------------------------
date A B
1980-12-31 152.799 209.132
1981-01-01 152.799 209.132
1981-01-02 152.234 209.517
1981-01-05 152.895 211.790
1981-01-06 155.131 214.023
1981-01-07 152.596 213.044
1981-01-08 151.232 211.810
1981-01-09 150.518 210.887
1981-01-12 149.899 210.340
1981-01-13 147.588 207.621
1981-01-14 148.231 208.076
1981-01-15 148.521 208.676
1981-01-16 148.931 209.278
1981-01-19 149.824 210.372
1981-01-20 149.849 210.454
1981-01-21 150.353 211.644
1981-01-22 149.398 210.042
1981-01-23 148.748 208.654
1981-01-26 148.879 208.355
1981-01-27 148.671 208.431
1981-01-28 147.612 207.525
1981-01-29 147.153 206.595
1981-01-30 146.330 205.558
1981-02-02 145.779 206.635
Dataframe2
---------------------------------
date C D
1981-01-13 53.4 56.5
1981-02-15 52.2 60.0
1981-03-15 51.8 58.0
1981-04-14 51.8 59.5
1981-05-16 50.7 58.0
1981-06-15 50.3 59.5
1981-07-15 50.6 53.5
1981-08-17 50.1 44.5
1981-09-12 50.6 38.5

To provide a readable example, I prepared test data as follows:
df1 - A couple of observations from January and February:
date A B
0 1981-01-02 152.234 209.517
1 1981-01-07 152.596 213.044
2 1981-01-13 147.588 207.621
3 1981-01-20 151.232 211.810
4 1981-01-27 150.518 210.887
5 1981-02-05 149.899 210.340
6 1981-02-14 152.895 211.790
7 1981-02-16 155.131 214.023
8 1981-02-21 180.000 200.239
df2 - Your data, also from January and February:
date C D
0 1981-01-13 53.4 56.5
1 1981-02-15 52.2 60.0
Both dataframes have a date column of datetime type.
Start by getting the last observation in each month from df1:
res1 = df1.groupby(df1.date.dt.to_period('M')).tail(1)
The result, for my data, is:
date A B
4 1981-01-27 150.518 210.887
8 1981-02-21 180.000 200.239
Then, to join the observations, the join must be performed on the
whole month period, not the exact date. To do this, run:
res = pd.merge(res1.assign(month=res1['date'].dt.to_period('M')),
               df2.assign(month=df2['date'].dt.to_period('M')),
               how='left', on='month', suffixes=('_1', '_2'))
The result is:
date_1 A B month date_2 C D
0 1981-01-27 150.518 210.887 1981-01 1981-01-13 53.4 56.5
1 1981-02-21 180.000 200.239 1981-02 1981-02-15 52.2 60.0
If you want the merge to include only months where there is at least
one observation in both df1 and df2, drop the how parameter.
Its default value is 'inner', which is the correct mode in this case.
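Putting it all together, here is a minimal end-to-end sketch using the sample frames above:
import pandas as pd

# Sample frames from above
df1 = pd.DataFrame({
    'date': pd.to_datetime(['1981-01-02', '1981-01-07', '1981-01-13',
                            '1981-01-20', '1981-01-27', '1981-02-05',
                            '1981-02-14', '1981-02-16', '1981-02-21']),
    'A': [152.234, 152.596, 147.588, 151.232, 150.518,
          149.899, 152.895, 155.131, 180.000],
    'B': [209.517, 213.044, 207.621, 211.810, 210.887,
          210.340, 211.790, 214.023, 200.239]})
df2 = pd.DataFrame({
    'date': pd.to_datetime(['1981-01-13', '1981-02-15']),
    'C': [53.4, 52.2],
    'D': [56.5, 60.0]})

# Last observation per month, then an inner merge on the month period
res1 = df1.groupby(df1['date'].dt.to_period('M')).tail(1)
res = pd.merge(res1.assign(month=res1['date'].dt.to_period('M')),
               df2.assign(month=df2['date'].dt.to_period('M')),
               on='month', suffixes=('_1', '_2'))
print(res)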

When you have a sample dataframe, you can provide code to reproduce it. Simply print each column as a list (steps 1 and 2) and use those lists to build the dataframe in code (steps 3 and 4).
import pandas as pd
# Step 1: create your dataframe, and print each column as a list, copy-paste into code example below.
df_1 = pd.read_csv('dataset1.csv')
print(list(df_1['date']))
print(list(df_1['A']))
print(list(df_1['B']))
# Step 2: create your dataframe, and print each column as a list, copy-paste into code example below.
df_2 = pd.read_csv('dataset2.csv')
print(list(df_2['date']))
print(list(df_2['C']))
print(list(df_2['D']))
# Step 3: create sample dataframe ... good if you can provide this in your future questions
df_1 = pd.DataFrame({
    'date': ['12/31/1980', '1/1/1981', '1/2/1981', '1/5/1981', '1/6/1981',
             '1/7/1981', '1/8/1981', '1/9/1981', '1/12/1981', '1/13/1981',
             '1/14/1981', '1/15/1981', '1/16/1981', '1/19/1981', '1/20/1981',
             '1/21/1981', '1/22/1981', '1/23/1981', '1/26/1981', '1/27/1981',
             '1/28/1981', '1/29/1981', '1/30/1981', '2/2/1981'],
    'A': [152.799, 152.799, 152.234, 152.895, 155.131,
          152.596, 151.232, 150.518, 149.899, 147.588,
          148.231, 148.521, 148.931, 149.824, 149.849,
          150.353, 149.398, 148.748, 148.879, 148.671,
          147.612, 147.153, 146.33, 145.779],
    'B': [209.132, 209.132, 209.517, 211.79, 214.023,
          213.044, 211.81, 210.887, 210.34, 207.621,
          208.076, 208.676, 209.278, 210.372, 210.454,
          211.644, 210.042, 208.654, 208.355, 208.431,
          207.525, 206.595, 205.558, 206.635]
})
# Step 4: create sample dataframe ... good if you can provide this in your future questions
df_2 = pd.DataFrame({
    'date': ['1/13/1981', '2/15/1981', '3/15/1981', '4/14/1981', '5/16/1981',
             '6/15/1981', '7/15/1981', '8/17/1981', '9/12/1981'],
    'C': [53.4, 52.2, 51.8, 51.8, 50.7, 50.3, 50.6, 50.1, 50.6],
    'D': [56.5, 60.0, 58.0, 59.5, 58.0, 59.5, 53.5, 44.5, 38.5]
})
# Step 5: make sure the date field is actually a date, not a string
df_1['date'] = pd.to_datetime(df_1['date']).dt.date
# Step 6: create new column with year and month
df_1['date_year_month'] = pd.to_datetime(df_1['date']).dt.to_period('M')
# Step 7: create boolean mask that grabs the max date for each year-month
mask_last_day_month = df_1.groupby('date_year_month')['date'].transform('max') == df_1['date']
# Step 8: create new dataframe with only last day of month
df_1_max = df_1.loc[mask_last_day_month]
print('here is dataframe 1 with only last day in the month')
print(df_1_max)
print()
# Step 9: make sure the date field is actually a date, not a string
df_2['date'] = pd.to_datetime(df_2['date']).dt.date
# Step 10: create new column with year and month
df_2['date_year_month'] = pd.to_datetime(df_2['date']).dt.to_period('M')
print('here is the original dataframe 2')
print(df_2)
print()
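From here, one natural final step (a sketch, not shown in the original steps) is to merge the two frames on the shared date_year_month column:
# Step 11 (sketch): align df_2 with the month-end rows of df_1
df_merged = pd.merge(df_1_max, df_2, on='date_year_month',
                     how='left', suffixes=('_1', '_2'))
print(df_merged)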

Related

How to interpolate only over a specific window?

I have a dataset that follows a weekly indexation, and a list of dates that I need to get interpolated data for. For example, I have the following df with weekly aggregation:
data value
1/01/2021 10
7/01/2021 10
14/01/2021 10
28/01/2021 10
and a list of dates that do not coincide with the df indexed dates, for example:
list_dates = [12/01/2021, 13/01/2021 ...]
I need to get what the interpolated values would be for every date in list_dates, but within a given window (for example: using only 4 values in the df to calculate the interpolation, split between before and after, i.e. the 2 dates before the list date and the 2 dates after it).
To get the interpolated value of the list date 12/01/2021 in the list, I would need to use:
1/1/2021
7/1/2021
14/1/2021
28/1/2021
The output would then be:
data value
1/01/2021 10
7/01/2021 10
12/01/2021 10
13/01/2021 10
14/01/2021 10
28/01/2021 10
I have successfully coded an example of this, but it fails when there are multiple consecutive NaNs (e.g. 12/01 and 13/01). I also can't concat each interpolated value before running the next one in the list, as that would use an interpolated date to calculate the next interpolated date (e.g. using 12/01 to calculate 13/01).
Any advice on how to do this?
Use interpolate to get the expected outcome, but first you have to prepare your dataframe as shown below.
I slightly modified your input data to demonstrate interpolation with a DatetimeIndex (method='time'):
import pandas as pd

# Input data
df = pd.DataFrame({'data': ['1/01/2021', '7/01/2021', '14/01/2021', '28/01/2021'],
                   'value': [10, 10, 17, 10]})
list_dates = ['12/01/2021', '13/01/2021']
# Conversion of dates
df['data'] = pd.to_datetime(df['data'], format='%d/%m/%Y')
new_dates = pd.to_datetime(list_dates, format='%d/%m/%Y')
# Set datetime column as index and append new dates
df = df.set_index('data')
df = df.reindex(df.index.append(new_dates)).sort_index()
# Interpolate with method='time'
df['value'] = df['value'].interpolate(method='time')
Output:
>>> df
value
2021-01-01 10.0
2021-01-07 10.0
2021-01-12 15.0 # <- time interpolation
2021-01-13 16.0 # <- time interpolation
2021-01-14 17.0 # <- changed from 10 to 17
2021-01-28 10.0
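Note that interpolate only uses the original non-NaN observations as anchors, so consecutive inserted dates (12/01 and 13/01 here) are both computed from the 07/01 and 14/01 values rather than from each other. If you only need the values for the inserted dates, you can select them afterwards:
# Values for just the inserted dates
print(df.loc[new_dates, 'value'])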

How to get mean of last month in pandas

I have a data set where the first column is the Date, the second column is the Collaborator and the third column is the price paid.
I want to get the mean price paid by each Collaborator for the previous month. I want to return a table that looks like this:
I used some solutions like rolling, but I could only get the past X days, not the past month.
Pandas has a built-in method .rolling
x = 3 # This is where you define the number of previous entries
df.rolling(x).mean() # Apply the mean
Hence:
df['LastMonthMean'] = df['Price'].rolling(x).mean()
I'm not sure exactly how you want to calculate your mean, but I hope this helps.
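If your data has a DatetimeIndex, rolling also accepts a time-based window, which is closer to "the previous month" than a fixed row count. A small sketch (the daily index and Price column are assumptions):
import numpy as np
import pandas as pd

# Hypothetical daily data with a DatetimeIndex
idx = pd.date_range('2021-01-01', periods=90, freq='D')
df = pd.DataFrame({'Price': np.random.uniform(5., 25., len(idx))}, index=idx)

# Mean over the past 30 days rather than the past x rows
df['LastMonthMean'] = df['Price'].rolling('30D').mean()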
I would first add a month column and then use groupby to aggregate:
import pandas as pd

df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2, 2],
    'collaborator': [1, 2, 3, 1, 2, 3],
    'price': [100, 200, 300, 400, 500, 600]
})
df.groupby(['collaborator', 'month']).mean()
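To get each collaborator's previous month's mean rather than the current one, one option (a sketch on the same toy frame) is to shift the grouped means within each collaborator:
# Mean per collaborator and month, then shift by one month per collaborator
monthly = df.groupby(['collaborator', 'month'])['price'].mean().reset_index()
monthly['last_month_mean'] = monthly.groupby('collaborator')['price'].shift(1)
print(monthly)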
The rolling() method would have to be applied to the DataFrame grouped by Collaborator to obtain the mean sale price of every collaborator in the previous month.
Because the data would be grouped and summarised, the number of data points would not match the original dataset, thus not allowing you to easily append the result to the original dataset.
If you use a DatetimeIndex in your DataFrame it will be considered a time series and then you can resample() the data more easily.
I have produced a replicable solution below, based on your initial question, in which I resample the data and append the last month's mean to it. Thanks to @akilat90 for the function to generate random dates within a range.
import pandas as pd
import numpy as np

def random_dates(start, end, n=10):
    # Function copied from @akilat90
    # Available on https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
    start_u = pd.to_datetime(start).value // 10**9
    end_u = pd.to_datetime(end).value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

size = 1000
index = random_dates(start='2021-01-01', end='2021-06-30', n=size).sort_values()
collaborators = np.random.randint(low=1, high=4, size=size)
prices = np.random.uniform(low=5., high=25., size=size)
data = pd.DataFrame({'Collaborator': collaborators,
                     'Price': prices}, index=index)
monthly_mean = data.groupby('Collaborator').resample('M')['Price'].mean()
data_final = pd.merge(data, monthly_mean, how='left',
                      left_on=['Collaborator', data.index.month],
                      right_on=[monthly_mean.index.get_level_values('Collaborator'),
                                monthly_mean.index.get_level_values(1).month + 1])
data_final.index = data.index
data_final = data_final.drop('key_1', axis=1)
data_final.columns = ['Collaborator', 'Price', 'LastMonthMean']
This is the output:
Collaborator Price LastMonthMean
2021-01-31 04:26:16 2 21.838910 NaN
2021-01-31 05:33:04 2 19.164086 NaN
2021-01-31 12:32:44 2 24.949444 NaN
2021-01-31 12:58:02 2 8.907224 NaN
2021-01-31 14:43:07 1 7.446839 NaN
2021-01-31 18:38:11 3 6.565208 NaN
2021-02-01 00:08:25 2 24.520149 15.230642
2021-02-01 09:25:54 2 20.614261 15.230642
2021-02-01 09:59:48 2 10.879633 15.230642
2021-02-02 10:12:51 1 22.134549 14.180087
2021-02-02 17:22:18 2 24.469944 15.230642
As you can see, the records in January 2021, the first month in this time series, do not have a valid Last Month Mean, unlike the records in February.

Pandas: group columns into a time series

Consider this set of data:
data = [{'Year':'1959:01','0':138.89,'1':139.39,'2':139.74,'3':139.69,'4':140.68,'5':141.17},
{'Year':'1959:07','0':141.70,'1':141.90,'2':141.01,'3':140.47,'4':140.38,'5':139.95},
{'Year':'1960:01','0':139.98,'1':139.87,'2':139.75,'3':139.56,'4':139.61,'5':139.58}]
How can I convert to Pandas time series, like this:
Year Value
1959-01 138.89
1959-02 139.39
1959-03 139.74
...
1959-07 141.70
1959-08 141.90
...
Code
df = pd.DataFrame(data).set_index('Year').stack().droplevel(1)
df.index = pd.date_range(start=pd.to_datetime(df.index, format='%Y:%m')[0],
                         periods=len(df.index), freq='M').to_period('M')
df = df.to_frame().reset_index().rename(columns={'index': 'Year', 0: 'Value'})
Explanation
Convert the df to a Series using stack, dropping the level which is not required.
Then reset the index to the desired range; since we need the output in monthly frequency, do that using to_period.
The last step is to convert the Series back to a frame and rename the columns.
Output as required
Year Value
0 1959-01 138.89
1 1959-02 139.39
2 1959-03 139.74
3 1959-04 139.69
4 1959-05 140.68
5 1959-06 141.17
6 1959-07 141.70
7 1959-08 141.90
8 1959-09 141.01
9 1959-10 140.47
10 1959-11 140.38
11 1959-12 139.95
12 1960-01 139.98
13 1960-02 139.87
14 1960-03 139.75
15 1960-04 139.56
16 1960-05 139.61
17 1960-06 139.58
Here is one way:
s = pd.DataFrame(data).set_index("Year").stack()
s.index = pd.Index([pd.to_datetime(start, format="%Y:%m") + pd.DateOffset(months=int(off))
                    for start, off in s.index], name="Year")
df = s.to_frame("Value")
First we set Year as the index and stack the values next to it. Then we build a new index from the current one, parsing each date and adding the other level as a month offset. Lastly we convert back to a frame, naming the new column Value.
to get
>>> df
Value
Year
1959-01-01 138.89
1959-02-01 139.39
1959-03-01 139.74
1959-04-01 139.69
1959-05-01 140.68
1959-06-01 141.17
1959-07-01 141.70
1959-08-01 141.90
1959-09-01 141.01
1959-10-01 140.47
1959-11-01 140.38
1959-12-01 139.95
1960-01-01 139.98
1960-02-01 139.87
1960-03-01 139.75
1960-04-01 139.56
1960-05-01 139.61
1960-06-01 139.58
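If you prefer the year-month form from the question (1959-01) over full dates, you can collapse the index to monthly periods afterwards:
# Collapse the daily dates to year-month periods
df.index = df.index.to_period('M')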

What is the best way to calculate maximum and minimum values for one list, for each distinct value of another list in Python?

Say I have three lists of paired numeric data in Python. The lists are for the day of year (number between 1-365), the hour of day (number between 0-24), and the corresponding temperature at that time. I have provided example lists below:
day_of_year = [1,1,1,1,1,1,1,1,1,1,1,1] # day = Jan 1 in this example
hour_of_day = [2,4,6,8,10,12,14,16,18,20,22,24]
temperature = [23.1,22.0,24.1,26.5,23.8,40.1,32.7,41.3,29.4,36.4,22.0,24.1]
I have these hourly paired data for a location for an entire year (I've just shown simplified lists above). So for each day I have 24 day_of_year values (the same number repeated; in this example, day = 1) and 24 temperature values, since they're hourly data. I'm trying to design a for loop that lets me iterate through these data to calculate and use the maximum and minimum temperature for each day of year, since another function in my code needs to call on those values. What would be the best way to reference all the temperature values where day_of_year is the same, to calculate max and min temperatures for every day?
I have a function that takes the following inputs:
minimum_temp_today, minimum_temp_tomorrow, maximum_temp_today, maximum_temp_yesterday
I need to figure out how to pull out those values for each day of the year. I am looking for suggestions on the best way to do this. Any suggestions/tips would be super appreciated!
There are a lot of ways you could approach this, depending on what data structures you want to use. If you don't care about when the min and max occur, then personally I'd do something like this.
from collections import defaultdict

daily_temps = defaultdict(list)
for day, value in zip(day_of_year, temperature):
    daily_temps[day].append(value)

ranges = dict()
for day, values in daily_temps.items():
    ranges[day] = (min(values), max(values))
Basically, you're constructing an intermediate dict that maps each day of the year to a list of all the measurements for that day. Then in the second step you use that dict to create your final dict which maps each day of the year to a tuple which is the minimum and maximum value recorded for that day.
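For example, once ranges is built, the four inputs your function needs can be looked up per day (day 100 is just a hypothetical day number):
# Look up the function inputs for a given day (hypothetical day 100)
today = 100
min_today, max_today = ranges[today]
min_tomorrow, _ = ranges[today + 1]
_, max_yesterday = ranges[today - 1]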
You could use pandas which does this quite efficiently. I am using pandas 1.0.1. We end up using named aggregation for this task.
import pandas as pd
df = pd.DataFrame({'day_of_year': day_of_year, 'hour_of_day': hour_of_day, 'temperature': temperature})
print(df)
day_of_year hour_of_day temperature
0 1 2 23.1
1 1 4 22.0
2 1 6 24.1
3 1 8 26.5
4 1 10 23.8
5 1 12 40.1
6 1 14 32.7
7 1 16 41.3
8 1 18 29.4
9 1 20 36.4
10 1 22 22.0
11 1 24 24.1
(df.groupby('day_of_year')
   .agg(min_temp=('temperature', 'min'),
        max_temp=('temperature', 'max'))
   .reset_index()
   .to_dict('records'))
[{'day_of_year': 1, 'min_temp': 22.0, 'max_temp': 41.3}]
Now suppose we have data for more than one day.
day_of_year min_temp max_temp
0 1.0 22.0 41.3
1 2.0 24.0 26.0
2 3.0 24.5 42.3
grouped = df.groupby('day_of_year') \
            .agg(min_temp=('temperature', 'min'),
                 max_temp=('temperature', 'max')) \
            .reset_index()
tmrw = grouped.shift(-1) \
              .rename(columns={'min_temp': 'min_temp_tmrw',
                               'max_temp': 'max_temp_tmrw'}) \
              .drop('day_of_year', axis=1)
pd.concat([grouped, tmrw], axis=1).to_dict('records')
[{'day_of_year': 1.0,
'min_temp': 22.0,
'max_temp': 41.3,
'min_temp_tmrw': 24.0,
'max_temp_tmrw': 26.0},
{'day_of_year': 2.0,
'min_temp': 24.0,
'max_temp': 26.0,
'min_temp_tmrw': 24.5,
'max_temp_tmrw': 42.3},
{'day_of_year': 3.0,
'min_temp': 24.5,
'max_temp': 42.3,
'min_temp_tmrw': nan,
'max_temp_tmrw': nan}]

Passing string to dataframe iloc

I have a ProductDf which has many versions of the same product. I want to filter the last iteration of the product. So I did this as below:
productIndexDf = ProductDf.groupby('productId').apply(
    lambda x: x['startDtTime'].reset_index()).reset_index()
productToPick = productIndexDf.groupby('productId')['index'].max()
# Get the value of productToPick into a string
productIndex = productToPick.to_string(header=False,
                                       index=False).replace('\n', ' ')
productIndex = productIndex.split()
productIndex = list(map(int, productIndex))
productIndex.sort()
productIndexStr = ','.join(str(e) for e in productIndex)
Once I get that in a Series, if I call iloc manually with literal integers it works:
filteredProductDf = ProductDf.iloc[[7,8],:]
If I pass it the string, I get an error:
filteredProductDf = ProductDf.iloc[productIndexStr,:]
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
I also tried this:
filteredProductDf = ProductDf[productIndexStr]
But then I get this issue:
KeyError: '7,8'
The Pandas DataFrame iloc method works only with integer-based indexing. If you want to use string values as an index for accessing data from a pandas dataframe, you have to use the DataFrame loc method.
You can learn more about these methods from these links:
Use of Pandas Dataframe iloc method
Use of Pandas Dataframe loc method
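As a minimal illustration of the difference (toy data, not the asker's frame); note also that since productIndex above is already a list of integers, passing that list directly to iloc would work:
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
print(df.iloc[[0, 2]])     # positional: first and third rows
print(df.loc[['a', 'c']])  # label-based: rows 'a' and 'c'
# df.iloc['0,2'] would raise an error: iloc does not accept strings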
OK, I think you are overcomplicating it.
Given a dataframe that looks like this:
avgPrice productId startDtTime totalSold
0 42.5 A001 01/05/2018 100
1 55.5 A001 02/05/2018 150
2 48.5 A001 03/05/2018 300
3 42.5 A002 01/05/2018 220
4 53.5 A002 02/05/2018 250
I assume that you are interested in rows 2 and 4 (the last value for each respective productId). In pandas the easiest way would be to use drop_duplicates() with the param keep='last'. Consider this example:
import pandas as pd
d = {'startDtTime': {0: '01/05/2018', 1: '02/05/2018',
                     2: '03/05/2018', 3: '01/05/2018', 4: '02/05/2018'},
     'totalSold': {0: 100, 1: 150, 2: 300, 3: 220, 4: 250},
     'productId': {0: 'A001', 1: 'A001', 2: 'A001', 3: 'A002', 4: 'A002'},
     'avgPrice': {0: 42.5, 1: 55.5, 2: 48.5, 3: 42.5, 4: 53.5}
     }
# Recreate dataframe
ProductDf = pd.DataFrame(d)
# Convert column with dates to datetime objects
ProductDf['startDtTime'] = pd.to_datetime(ProductDf['startDtTime'])
# Sort values by productId and startDtTime to ensure correct order
ProductDf.sort_values(by=['productId','startDtTime'], inplace=True)
# Drop the duplicates
ProductDf.drop_duplicates(['productId'], keep='last', inplace=True)
print(ProductDf)
And you get:
avgPrice productId startDtTime totalSold
2 48.5 A001 2018-03-05 300
4 53.5 A002 2018-02-05 250
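An equivalent alternative on the original frame (before the in-place drop), if you prefer selecting rows over mutating, is to take the row with the latest startDtTime per product:
# Rows with the latest startDtTime per productId
latest = ProductDf.loc[ProductDf.groupby('productId')['startDtTime'].idxmax()]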
