Ultimate goal description
My goal is to compute the average of two time series (red and green) stored in pandas DataFrames. Both time series have the same columns, but they differ in their precise time points. What I want to implement is a function average which computes the average time series of the two given series, such that if a value is missing at a particular time point, it is interpolated. For example:
import pandas as pd
green_df = pd.DataFrame({'A': [4, 2, 5], 'B': [1, 2, 3]}, index=[1, 3, 6])
red_df = pd.DataFrame({'A': [4, 2.5, 8, 2, 4], 'B': [4, 2, 2, 4, 1]}, index=[1, 2, 4, 5, 6])
average_grey_df = pd.DataFrame({'A': [4, 2.7, 3.75, 5.5, 3, 4.5], 'B': [...]}, index= [1, 2, 3, 4, 5, 6])
assert average_grey_df == average(green_df, red_df)
This is easier to see graphically (values are shown for column A, but the same should be done for all columns; the precise values are just illustrative):
Approach
So far I have not been able to find a completely working solution. I was thinking about dividing it into three steps:
(1) extend each time series with the time points of the other one, so that the missing data are NaN:
           |  A                |  A
        ---+-----           ---+-----
         1 |  4              1 |  4
         2 | nan             2 | 2.5
green:   3 |  2      red:    3 | nan
         4 | nan             4 |  8
         5 | nan             5 |  2
         6 |  5              6 |  4
(2) fill in the missing data by interpolating both dataframes (direct usage of the DataFrame interpolate method)
(3) finally, compute the average of the two time series as follows:
averages = (green_df.stack() + red_df.stack()) / 2
average_grey_df = averages.unstack()
Additionally, the dropna method can be used to drop any remaining NaNs. Maybe there is also a better approach I haven't discovered.
Question
I was not able to figure out how to do part (1) at all. I checked methods like join, merge and concat and their various examples, but none of them seems to do the job. Any suggestions? I am also open to other approaches.
Thank you
To perform step (1) you can do this:
# union of the indexes
union_idx = green_df.index.union(red_df.index)

# reindex with the union
green_df = green_df.reindex(union_idx)
red_df = red_df.reindex(union_idx)

# the interpolation
green_df = green_df.interpolate(method='linear', limit_direction='forward', axis=0)
red_df = red_df.interpolate(method='linear', limit_direction='forward', axis=0)

# average the two frames row by row
grey_df = pd.concat([green_df, red_df])
grey_df = grey_df.groupby(level=0).mean()
I get the following (I didn't pay attention to displaying the correct colors):
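Wrapped into a reusable function matching the question's average (a sketch based on the steps above):

def average(df1, df2):
    # reindex on the union of the time points, interpolate the gaps,
    # then average the two frames row by row
    union_idx = df1.index.union(df2.index)
    df1 = df1.reindex(union_idx).interpolate(method='linear', limit_direction='forward', axis=0)
    df2 = df2.reindex(union_idx).interpolate(method='linear', limit_direction='forward', axis=0)
    return pd.concat([df1, df2]).groupby(level=0).mean()

average_grey_df = average(green_df, red_df)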
You can merge the two DataFrames; from there, you can interpolate the NaN values:
green_df = pd.DataFrame({'A': [4, 2, 5], 'B': [1, 2, 3]}, index=[1, 3, 6])
red_df = pd.DataFrame({'A': [4, 2.5, 8, 2, 4], 'B': [4, 2, 2, 4, 1]}, index=[1, 2, 4, 5, 6])

# outer join on the index keeps every time point from both frames
combined_df = pd.merge(green_df, red_df, suffixes=('_green', '_red'),
                       left_index=True, right_index=True, how='outer')

# fill the gaps, then average the matching columns
combined_df = combined_df.interpolate()
combined_df['A_avg'] = combined_df[["A_green", "A_red"]].mean(axis=1)
combined_df['B_avg'] = combined_df[["B_green", "B_red"]].mean(axis=1)
These can then be plotted using .plot():
combined_df[['A_green', 'A_red', 'A_avg']].plot(color=['green', 'red', 'gray'])
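If you want a frame shaped like the question's average_grey_df, the averaged columns can be pulled out and renamed (a small follow-up, with column names taken from the example above):

average_grey_df = combined_df[['A_avg', 'B_avg']].rename(columns={'A_avg': 'A', 'B_avg': 'B'})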
Related
Given a DataFrame that represents instances of called customers:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]})
The data is ordered by time, such that every customer is a time series and every customer has different timestamps. Thus I need a column that consists of the ranked time points:
df_2 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5],
                     "call_nr" : [0, 1, 2, 0, 1, 0, 1, 2, 3, 0, 0, 1]})
After trying different approaches I came up with this to create call_nr:
np.concatenate([np.arange(df_1["customer_id"].value_counts().loc[i]) for i in df_1["customer_id"].unique()])
It works, but I doubt this is best practice. Is there a better solution?
A simpler solution would be to groupby your 'customer_id' and use cumcount:
>>> df_1.groupby('customer_id').cumcount()
0 0
1 1
2 2
3 0
4 1
5 0
6 1
7 2
8 3
9 0
10 0
11 1
which you can assign back as a column in your dataframe.
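For example (this reproduces the call_nr column from df_2 above):

df_1['call_nr'] = df_1.groupby('customer_id').cumcount()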
I am stuck on an issue with a massive pandas table. I would like to get a column that flags where two series cross.
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 1, 2, 8]})
I would like to add a column to my DataFrame to get a result like this one:
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 1, 2, 8],
                   'C': [0, -1, 0, 1]})
So basically, C should be:
0 when there is no cross between series B and A
-1 when series B crosses below series A
1 when series B crosses above series A
I need a vectorized calculation because my real table has more than one million rows.
Thank you
You can compute the relative position of the 2 columns with lt (True where A is less than B), convert to integer, and take the diff: a step from 0 to 1 marks B crossing above A, and a step from 1 to 0 marks B crossing below A:
m = df['A'].lt(df['B'])                                      # True where A < B
df['C'] = m.astype(int).diff().fillna(0, downcast='infer')   # +1 / -1 at the crossings, 0 elsewhere
output:
A B C
0 1 10 0
1 2 1 -1
2 3 2 0
3 4 8 1
visual of A/B:
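A line plot of the two columns shows B crossing below A at row 1 and back above it at row 3 (a possible way to reproduce the visual):

df[['A', 'B']].plot()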
I have a huge data set with columns like "Eas_1", "Eas_2", and so on up to "Eas_40", and "Nor_1" to "Nor_40". I want to automatically create multiple separate data sets, each consisting of all columns that end with the same number (i.e. grouped by the number in the column name), with that number stored as the values of a new column (Bin).
My data frame:
df = pd.DataFrame({
    "Eas_1": [3, 4, 9, 1],
    "Eas_2": [4, 5, 10, 2],
    "Nor_1": [9, 7, 9, 2],
    "Nor_2": [10, 8, 10, 3],
    "Error_1": [2, 5, 1, 6],
    "Error_2": [5, 0, 3, 2],
})
I don't know how to create the Bin column and fill it with the column-name numbers, but I could separate the data sets manually like this:
df1 = df.filter(regex='_1')
df2 = df.filter(regex='_2')
This would take a lot of effort, plus I would have to change the script every time I get new data. This is how I imagine the end result:
df1 = pd.DataFrame({
    "Eas_1": [3, 4, 9, 1],
    "Nor_1": [9, 7, 9, 2],
    "Error_1": [2, 5, 1, 6],
    "Bin": [1, 1, 1, 1],
})
Thanks in advance!
You can extract the suffixes with .str.extract, then groupby on those:
suffixes = df.columns.str.extract(r'(\d+)$', expand=False)

for label, data in df.groupby(suffixes, axis=1):
    print('-' * 10, label, '-' * 10)
    print(data)
Note: to collect your dataframes, you can do:
dfs = [data for _, data in df.groupby(suffixes, axis=1)]
# access the second dataframe
dfs[1]
Output:
---------- 1 ----------
Eas_1 Nor_1 Error_1
0 3 9 2
1 4 7 5
2 9 9 1
3 1 2 6
---------- 2 ----------
Eas_2 Nor_2 Error_2
0 4 10 5
1 5 8 0
2 10 10 3
3 2 3 2
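If you also need the Bin column from the desired output, a possible sketch is to assign the group label (the extracted suffix) to each sub-frame; int(label) assumes the suffix is purely numeric:

dfs = [data.assign(Bin=int(label)) for label, data in df.groupby(suffixes, axis=1)]
dfs[0]   # Eas_1, Nor_1, Error_1 plus a Bin column equal to 1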
I have a dataframe that has duplicated time indices, and I would like to get the mean across all observations over the previous 2 days (I do not want to drop any observations; they are all information that I need). I've checked the pandas documentation and read previous posts on Stack Overflow (such as Apply rolling mean function on data frames with duplicated indices in pandas), but could not find a solution. Here's an example of what my data frame looks like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],'t': [1, 2, 3, 2, 1, 2, 2, 3, 4],'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
 t   v2
 1   -
 2   -
 3   4.167
 4   5
 5   6.667
A rough proposal: concatenate 2 copies of the input frame in which the values in 't' are replaced by 't+1' and 't+2' respectively. This way, the meaning of the column 't' becomes "the target day".
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4, 4, 4],
                   't':  [1, 2, 3, 2, 1, 2, 2, 3, 4],
                   'v1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
n = df.shape[0]
incr = pd.DataFrame({'id': [0] * n, 't': [1] * n, 'v1': [0] * n})  # +1 in 't'

df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1]  # drop the days that lack full values for the 2 previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
v2
t
3 4.166667
4 5.000000
5 6.666667
Thank you for all the help. I ended up using groupby + a rolling 2-day window, and then dropping duplicates (keeping the last observation).
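For reference, a minimal sketch of how that could look, assuming the integer day t can be treated as a daily timestamp (the closed='left' window excludes the current day, and days with no rows of their own, such as t=5 in the toy data, will not appear):

tmp = (df.assign(date=pd.to_datetime(df['t'], unit='D'))
         .sort_values('date')
         .set_index('date'))

# mean of v1 over the previous 2 days, excluding the current day
tmp['v2'] = tmp['v1'].rolling('2D', closed='left').mean()

# duplicated timestamps share the same window, so one row per day is enough
v2 = tmp.drop_duplicates(subset='t', keep='last').set_index('t')['v2']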
I have a dataset of stores with 2D locations at daily timestamps. I am trying to match up each row with weather measurements made at stations at other locations, also with daily timestamps, such that the Euclidean distance between each store and its matched station is minimized. The weather measurements have not been made daily, and the station positions may vary, so this is a matter of finding the closest station for each specific store on each specific day.
I realize that I can construct nested loops to perform the matching, but I am wondering if anyone here can think of some neat way of using pandas dataframe operations to accomplish this. A toy example dataset is shown below. For simplicity, it has static weather station positions.
store_df = pd.DataFrame({
    'store_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'x': [1, 1, 1, 4, 4, 4, 4, 4, 4],
    'y': [1, 1, 1, 1, 1, 1, 4, 4, 4],
    'date': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

weather_station_df = pd.DataFrame({
    'station_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'weather': [20, 21, 19, 17, 16, 18, 19, 17],
    'x': [0, 0, 0, 5, 5, 3, 3, 3],
    'y': [2, 2, 2, 1, 1, 3, 3, 3],
    'date': [1, 2, 3, 1, 3, 1, 2, 3]})
The data below is the desired outcome. I have included station_id only for clarification.
store_id date station_id weather
0 1 1 1 20
1 1 2 1 21
2 1 3 1 19
3 2 1 2 17
4 2 2 3 19
5 2 3 2 16
6 3 1 3 18
7 3 2 3 19
8 3 3 3 17
The idea of the solution is to build the table of all combinations,
df = store_df.merge(weather_station_df, on='date', suffixes=('_store', '_station'))
calculate the squared distance (the square root is not needed, since it does not change where the minimum is)
df['dist'] = (df.x_store - df.x_station)**2 + (df.y_store - df.y_station)**2
and choose the minimum per group:
(df.groupby(['store_id', 'date'])
   .apply(lambda x: x.loc[x.dist.idxmin(), ['station_id', 'weather']])
   .reset_index())
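A possible apply-free variant of the last step (a sketch using idxmin per group and then plain row selection):

idx = df.groupby(['store_id', 'date'])['dist'].idxmin()
df.loc[idx, ['store_id', 'date', 'station_id', 'weather']].reset_index(drop=True)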
If you have a lot of data, you can do the join per group:
import numpy as np

def distance(x1, x2, y1, y2):
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)
# Join on date to get all combinations of stores and stations per day
df_all = store_df.merge(weather_station_df, on=['date'])

# Apply the distance formula to each combination
# (after the merge, x_x/y_x come from store_df and x_y/y_y from weather_station_df)
df_all['distances'] = distance(df_all['x_y'], df_all['x_x'], df_all['y_y'], df_all['y_x'])

# Get the minimum distance for each day per store_id
df_mins = df_all.groupby(['date', 'store_id'])['distances'].min().reset_index()

# Use the resulting minimums to get the station_id matching the min distance
closest_stations_df = df_mins.merge(df_all, on=['date', 'store_id', 'distances'], how='left')

# Filter out the unnecessary columns
result_df = closest_stations_df[['store_id', 'date', 'station_id', 'weather', 'distances']].sort_values(['store_id', 'date'])
Edited to use a vectorized distance formula.
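To match the desired output exactly, the helper distances column can then be dropped and the index reset, for example:

print(result_df.drop(columns='distances').reset_index(drop=True))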