Joining pandas dataframes based on minimization of distance - python

I have a dataset of stores with 2D locations at daily timestamps. I am trying to match up each row with weather measurements made at stations at some other locations, also with daily timestamps, such that the Euclidean distance between each store and its matched station is minimized. The weather measurements have not been performed daily, and the station positions may vary, so this is a matter of finding the closest station for each specific store on each specific day.
I realize that I can construct nested loops to perform the matching, but I am wondering if anyone here can think of some neat way of using pandas dataframe operations to accomplish this. A toy example dataset is shown below. For simplicity, it has static weather station positions.
import pandas as pd

store_df = pd.DataFrame({
    'store_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'x': [1, 1, 1, 4, 4, 4, 4, 4, 4],
    'y': [1, 1, 1, 1, 1, 1, 4, 4, 4],
    'date': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

weather_station_df = pd.DataFrame({
    'station_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'weather': [20, 21, 19, 17, 16, 18, 19, 17],
    'x': [0, 0, 0, 5, 5, 3, 3, 3],
    'y': [2, 2, 2, 1, 1, 3, 3, 3],
    'date': [1, 2, 3, 1, 3, 1, 2, 3]})
The data below is the desired outcome. I have included station_id only for clarification.
store_id date station_id weather
0 1 1 1 20
1 1 2 1 21
2 1 3 1 19
3 2 1 2 17
4 2 2 3 19
5 2 3 2 16
6 3 1 3 18
7 3 2 3 19
8 3 3 3 17

The idea of the solution is to build the table of all combinations,
df = store_df.merge(weather_station_df, on='date', suffixes=('_store', '_station'))
calculate the squared distance (taking the square root is unnecessary, since it does not change which station is closest),
df['dist'] = (df.x_store - df.x_station)**2 + (df.y_store - df.y_station)**2
and choose the minimum per group:
df.groupby(['store_id', 'date']).apply(lambda x: x.loc[x.dist.idxmin(), ['station_id', 'weather']]).reset_index()
If you have a lot of dates, you can do the join per date group instead of building the full combination table at once; a sketch of that follows.
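A rough sketch of that per-date variant (illustrative only, reusing store_df and weather_station_df from the question):
import pandas as pd

results = []
for date, stores in store_df.groupby('date'):
    # Only pair stores with stations observed on the same date
    stations = weather_station_df[weather_station_df['date'] == date]
    pairs = stores.merge(stations, on='date', suffixes=('_store', '_station'))
    pairs['dist'] = (pairs.x_store - pairs.x_station) ** 2 + (pairs.y_store - pairs.y_station) ** 2
    # One row per store: the station with the smallest distance on this date
    best = pairs.loc[pairs.groupby('store_id')['dist'].idxmin(),
                     ['store_id', 'date', 'station_id', 'weather']]
    results.append(best)

matched = pd.concat(results, ignore_index=True)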

import numpy as np

def distance(x1, x2, y1, y2):
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)

# Join on date to get all combinations of stores and stations per day
df_all = store_df.merge(weather_station_df, on=['date'])
# Apply the distance formula to each combination (default merge suffixes: _x = store, _y = station)
df_all['distances'] = distance(df_all['x_y'], df_all['x_x'], df_all['y_y'], df_all['y_x'])
# Get the minimum distance per store_id for each day
df_mins = df_all.groupby(['date', 'store_id'])['distances'].min().reset_index()
# Use the resulting minimums to get the station_id matching the min distance
closest_stations_df = df_mins.merge(df_all, on=['date', 'store_id', 'distances'], how='left')
# Filter out the unnecessary columns
result_df = closest_stations_df[['store_id', 'date', 'station_id', 'weather', 'distances']].sort_values(['store_id', 'date'])
Edited to use a vectorized distance formula.
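One caveat worth adding: if two stations are tied at the minimum distance for a store on a given day, the merge back on 'distances' keeps both rows, so you may want to deduplicate the result, for example:
# Keep a single match per store and day in case of distance ties
result_df = result_df.drop_duplicates(subset=['store_id', 'date'], keep='first')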

Related

Fill panel data with ranked timepoints in pandas

Given a DataFrame that represents instances of called customers:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]})
The data is ordered by time such that every customer is a time-series and every customer has different timestamps. Thus I need a column that consists of the ranked timepoints:
df_2 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5],
                     "call_nr" : [0, 1, 2, 0, 1, 0, 1, 2, 3, 0, 0, 1]})
After trying different approaches I came up with this to create call_nr:
np.concatenate([np.arange(df_1["customer_id"].value_counts().loc[i]) for i in df_1["customer_id"].unique()])
It works, but I doubt this is best practice. Is there a better solution?
A simpler solution would be to groupby your 'customer_id' and use cumcount:
>>> df_1.groupby('customer_id').cumcount()
0 0
1 1
2 2
3 0
4 1
5 0
6 1
7 2
8 3
9 0
10 0
11 1
which you can assign back as a column in your dataframe.
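For example, to write it back under the column name used in the question:
df_1["call_nr"] = df_1.groupby("customer_id").cumcount()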

Trying to make new column of values from second dataframe based on values in two other columns

I have two dataframes, df1 has columns with month and year associated with the data and df2 has month (in numbers) as the headers and years as the index values.
I'm then trying to populate a new column in df1 with the values from df2 that correspond to each row's month/year. I have tried the .loc function, but I'm not sure whether it's meant to populate a whole column or just return one value at a time.
df1
other data  month  year
xyz            12  1966
xyz             1  1997
df2
index   1  2  3  4  5  ....  12
1929    x  y  z  x  y  ....   z
1930    x  y  z  x  y  ....   z
...     x  y  z  x  y  ....   z
1966    x  y  z  x  y  ....   z
1997    x  y  z  x  y  ....   z
and I want a new column to be added to df1 like this, based on values from df2:
other data  month  year  df2_value
xyz            12  1966  z
xyz             1  1997  x
so far I have been trying this:
df1['df2_value'] = df2.loc[df1['year'], df1['month']]
but I'm getting this key error:
KeyError: "None of [Int64Index([12, 1, 2, 3, 2, 2, 3, 2, 4, 1, 1, 2, 3, 2, 1, 2, 2,\n
2, 2, 2, 12, 3, 1, 2, 12, 1, 2, 11, 3, 1, 2, 1, 3, 12,\n
4, 3, 2, 1, 3, 2, 11, 12, 10, 12, 2, 4, 3, 1, 4, 1, 1,\n
2, 3, 1, 2, 4, 2, 2, 2, 4, 2, 3, 12, 9, 12, 3, 2, 3,\n
1, 2, 3, 11, 11, 4],\n dtype='int64')] are in the [columns]"
I have changed the month and year columns in df1 to object type instead of integer, but that didn't change the error. This is my first time trying to use .loc, so I could be missing something very obvious, or maybe I need to use an entirely different function?
Just stack df2, reset the index and merge
df1.merge(df2.stack().reset_index(),
          left_on=['year', 'month'],
          right_on=['index', 'level_1'])
other data month year index level_1 0
0 xyz 12 1966 1966 12 z
1 xyz 1 1997 1997 1 x
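As a small follow-up: the stacked values come back under the column name 0, and the join keys 'index' and 'level_1' are kept, so you may want to rename and drop them, for example:
merged = (df1.merge(df2.stack().reset_index(),
                    left_on=['year', 'month'],
                    right_on=['index', 'level_1'])
             .rename(columns={0: 'df2_value'})      # give the looked-up values a proper name
             .drop(columns=['index', 'level_1']))   # drop the helper key columns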

Automatically create multiple python datasets based on column names

I have a huge data set with columns like "Eas_1", "Eas_2", and so on up to "Eas_40", and "Nor_1" to "Nor_40". I want to automatically create multiple separate data sets, each consisting of all columns that end with the same number (grouped by the number in the column name), with that number pasted as the values of a new column (Bin).
My data frame:
import pandas as pd

df = pd.DataFrame({
    "Eas_1": [3, 4, 9, 1],
    "Eas_2": [4, 5, 10, 2],
    "Nor_1": [9, 7, 9, 2],
    "Nor_2": [10, 8, 10, 3],
    "Error_1": [2, 5, 1, 6],
    "Error_2": [5, 0, 3, 2],
})
I don't know how to create the Bin column and paste the column name numbers into it, but I could separate the data sets manually like this:
df1 = df.filter(regex='_1')
df2 = df.filter(regex='_2')
This would take a lot of effort for me, plus I would have to change the script every time I get new data. This is how I imagine the end result:
df1 = pd.DataFrame({
    "Eas_1": [3, 4, 9, 1],
    "Nor_1": [9, 7, 9, 2],
    "Error_1": [2, 5, 1, 6],
    "Bin": [1, 1, 1, 1],
})
Thanks in advance!
You can extract the suffixes with .str.extract, then groupby on those (column-wise, with axis=1):
suffixes = df.columns.str.extract(r'(\d+)$', expand=False)
for label, data in df.groupby(suffixes, axis=1):
    print('-' * 10, label, '-' * 10)
    print(data)
Note: to collect your dataframes, you can do:
dfs = [data for _, data in df.groupby(suffixes, axis=1)]
# access the second dataframe
dfs[1]
Output:
---------- 1 ----------
Eas_1 Nor_1 Error_1
0 3 9 2
1 4 7 5
2 9 9 1
3 1 2 6
---------- 2 ----------
Eas_2 Nor_2 Error_2
0 4 10 5
1 5 8 0
2 10 10 3
3 2 3 2
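A possible follow-up, building on the same loop, to also add the Bin column the question asks for (here dfs is a dict keyed by suffix rather than a list):
dfs = {}
for label, data in df.groupby(suffixes, axis=1):
    # label is the extracted suffix string, e.g. '1'; store it as an integer Bin column
    dfs[label] = data.assign(Bin=int(label))

dfs['1']
#    Eas_1  Nor_1  Error_1  Bin
# 0      3      9        2    1
# 1      4      7        5    1
# 2      9      9        1    1
# 3      1      2        6    1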

How to apply rolling mean function while keeping all the observations with duplicated indices in time

I have a dataframe that has duplicated time indices, and I would like to get the mean across all observations for the previous 2 days (I do not want to drop any observations; they all carry information that I need). I've checked the pandas documentation and read previous posts on Stack Overflow (such as Apply rolling mean function on data frames with duplicated indices in pandas), but could not find a solution. Here's an example of what my data frame looks like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],'t': [1, 2, 3, 2, 1, 2, 2, 3, 4],'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
t  v2
1  -
2  -
3  4.167
4  5
5  6.667
A rough proposal: concatenate two copies of the input frame, in which the values in 't' are replaced by 't+1' and 't+2' respectively. This way, the meaning of the column 't' becomes "the target day".
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],
't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
n = df.shape[0]
incr = pd.DataFrame({'id': [0]*n, 't': [1]*n, 'v1': [0]*n})  # +1 in 't'
df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1]  # Drop the days that lack full values for the 2 previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
v2
t
3 4.166667
4 5.000000
5 6.666667
Thank you for all the help. I ended up using groupby + rolling ('2D'), and then dropping duplicates (keeping the last observation).
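Roughly, a sketch of that idea (not my exact code), assuming 't' can be mapped onto a datetime index so a time-based 2-day window works; note this only yields values for days that appear in the data, and day 2 gets a partial one-day window instead of '-':
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4, 4, 4],
                   't':  [1, 2, 3, 2, 1, 2, 2, 3, 4],
                   'v1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# Map the integer day onto an arbitrary datetime index so time-based rolling works
df['date'] = pd.to_datetime(df['t'], unit='D', origin='2022-01-01')
df = df.sort_values('date').set_index('date')

# Mean over a 2-day window that excludes the current day (closed='left'),
# computed across all observations regardless of id
roll = df['v1'].rolling('2D', closed='left').mean()

# Every observation on the same day gets the same value, so keep one row per day
v2 = roll.groupby(level=0).last()
# day 1 -> NaN, day 2 -> 3.0 (only one prior day), day 3 -> 4.167, day 4 -> 5.0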

Python Pandas Choosing Random Sample of Groups from Groupby

What is the best way to get a random sample of the elements of a groupby? As I understand it, a groupby is just an iterable over groups.
The standard way I would do this for an iterable, if I wanted to select N = 200 elements, is:
rand = random.sample(data, N)
If you attempt the above where data is a groupby object, the elements of the resulting list are tuples, because iterating a groupby yields (key, group) pairs.
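A tiny made-up example to illustrate (the two-group frame here is hypothetical):
import random
import pandas as pd

df = pd.DataFrame({'some_key': [0, 0, 1, 1], 'val': [1, 2, 3, 4]})
grouped = df.groupby('some_key')

pairs = random.sample(list(grouped), 1)   # -> [(key, sub-DataFrame)]
sampled = pd.concat(g for _, g in pairs)  # keep only the DataFrames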
I found an example for randomly selecting the elements of a single-key groupby (from How to access pandas groupby dataframe by key); however, it does not work with a multi-key groupby.
# create groupby object
grouped = df.groupby('some_key')

# pick N dataframes and grab their indices
sampled_df_i = random.sample(list(grouped.indices), N)

# grab the groups using the groupby object's get_group method
df_list = map(lambda df_i: grouped.get_group(df_i), sampled_df_i)

# optionally - turn it all back into a single dataframe object
sampled_df = pd.concat(df_list, axis=0, join='outer')
You can take a random sample of the unique values of df.some_key.unique(), use that to slice the df, and finally groupby on the result:
In [337]:
df = pd.DataFrame({'some_key': [0,1,2,3,0,1,2,3,0,1,2,3],
'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [338]:
print(df[df.some_key.isin(random.sample(list(df.some_key.unique()), 2))].groupby('some_key').mean())
val
some_key
0 1.000000
2 3.666667
If there are more than one groupby keys:
In [358]:
df = pd.DataFrame({'some_key1':[0,1,2,3,0,1,2,3,0,1,2,3],
'some_key2':[0,0,0,0,1,1,1,1,2,2,2,2],
'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [359]:
gby = df.groupby(['some_key1', 'some_key2'])
In [360]:
print(gby.mean().loc[random.sample(list(gby.indices.keys()), 2)])
val
some_key1 some_key2
1 1 5
3 2 8
But if you are just going to get the values of each group, you don't even need to groupby; a MultiIndex will do:
In [372]:
idx = random.sample(list(set(pd.MultiIndex.from_product((df.some_key1, df.some_key2)).tolist())), 2)
print(df.set_index(['some_key1', 'some_key2']).loc[idx])
val
some_key1 some_key2
2 0 3
3 1 5
I feel like lower-level numpy operations are cleaner:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"some_key": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8],
}
)
ids = df["some_key"].unique()
ids = np.random.choice(ids, size=2, replace=False)
ids
# > array([3, 2])
df.loc[df["some_key"].isin(ids)]
# > some_key val
# 2 2 3
# 3 3 4
# 6 2 1
# 7 3 5
# 10 2 7
# 11 3 8
Although this question was asked and answered long ago, I think the following is cleaner:
import pandas as pd
df = pd.DataFrame(
{
"some_key1": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"some_key2": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]
}
)
# Set the number of samples by group
n_samples_by_group = 1
samples_by_group = df \
.groupby(by=["some_key1", "some_key2"]) \
.sample(n_samples_by_group)
