Python Pandas Choosing Random Sample of Groups from Groupby - python

What is the best way to get a random sample of the elements of a groupby? As I understand it, a groupby is just an iterable over groups.
The standard way I would do this for an iterable, if I wanted to select N = 200 elements is:
rand = random.sample(data, N)
If you attempt the above where data is a 'grouped' the elements of the resultant list are tuples for some reason.
I found the below example for randomly selecting the elements of a single key groupby, however this does not work with a multi-key groupby. From, How to access pandas groupby dataframe by key
create groupby object
grouped = df.groupby('some_key')
pick N dataframes and grab their indices
sampled_df_i = random.sample(grouped.indices, N)
grab the groups using the groupby object 'get_group' method
df_list = map(lambda df_i: grouped.get_group(df_i),sampled_df_i)
optionally - turn it all back into a single dataframe object
sampled_df = pd.concat(df_list, axis=0, join='outer')

You can take a randoms sample of the unique values of df.some_key.unique(), use that to slice the df and finally groupby on the resultant:
In [337]:
df = pd.DataFrame({'some_key': [0,1,2,3,0,1,2,3,0,1,2,3],
'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [338]:
print df[df.some_key.isin(random.sample(df.some_key.unique(),2))].groupby('some_key').mean()
val
some_key
0 1.000000
2 3.666667
If there are more than one groupby keys:
In [358]:
df = pd.DataFrame({'some_key1':[0,1,2,3,0,1,2,3,0,1,2,3],
'some_key2':[0,0,0,0,1,1,1,1,2,2,2,2],
'val': [1,2,3,4,1,5,1,5,1,6,7,8]})
In [359]:
gby = df.groupby(['some_key1', 'some_key2'])
In [360]:
print gby.mean().ix[random.sample(gby.indices.keys(),2)]
val
some_key1 some_key2
1 1 5
3 2 8
But if you are just going to get the values of each group, you don't even need to groubpy, MultiIndex will do:
In [372]:
idx = random.sample(set(pd.MultiIndex.from_product((df.some_key1, df.some_key2)).tolist()),
2)
print df.set_index(['some_key1', 'some_key2']).ix[idx]
val
some_key1 some_key2
2 0 3
3 1 5

I feel like lower-level numpy operations are cleaner:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"some_key": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8],
}
)
ids = df["some_key"].unique()
ids = np.random.choice(ids, size=2, replace=False)
ids
# > array([3, 2])
df.loc[df["some_key"].isin(ids)]
# > some_key val
# 2 2 3
# 3 3 4
# 6 2 1
# 7 3 5
# 10 2 7
# 11 3 8

Although this question was asked and answered long ago, I think the following is cleaner:
import pandas as pd
df = pd.DataFrame(
{
"some_key1": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"some_key2": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]
}
)
# Set the number of samples by group
n_samples_by_group = 1
samples_by_group = df \
.groupby(by=["some_key1", "some_key2"]) \
.sample(n_samples_by_group)

Related

How to generate numeric mapping for categorical columns in pandas?

I want to manipulate categorical data using pandas data frame and then convert them to numpy array for model training.
Say I have the following data frame in pandas.
import pandas as pd
df2 = pd.DataFrame({"c1": ['a','b',None], "c2": ['d','e','f']})
>>> df2
c1 c2
0 a d
1 b e
2 None f
And now I want "compress the categories" horizontally as the following:
compressed_categories
0 c1-a, c2-d <--- this could be a string, ex. "c1-a, c2-d" or array ["c1-a", "c2-d"] or categorical data
1 c1-b, c2-e
2 c1-nan, c2-f
Next I want to generate a dictionary/vocabulary based on the unique occurrences plus "nan" columns in compressed_categories, ex:
volcab = {
"c1-a": 0,
"c1-b": 1,
"c1-c": 2,
"c1-nan": 3,
"c2-d": 4,
"c2-e": 5,
"c2-f": 6,
"c2-nan": 7,
}
So I can further numerically encoding then as follows:
compressed_categories_numeric
0 [0, 4]
1 [1, 5]
2 [3, 6]
So my ultimate goal is to make it easy to convert them to numpy array for each row and thus I can further convert it to tensor.
input_data = np.asarray(df['compressed_categories_numeric'].tolist())
then I can train my model using input_data.
Can anyone please show me an example how to make this series of conversion? Thanks in advance!
To build volcab dictionary and compressed_categories_numeric, you can use:
df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
volcab = {k: v for v, k in enumerate(np.unique(df3))}
df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)
Output:
>>> volcab
{'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}
>>> df2
c1 c2 compressed_categories_numeric
0 a d [0, 3]
1 b e [1, 4]
2 None f [2, 5]
>>> np.array(df2['compressed_categories_numeric'].tolist())
array([[0, 3],
[1, 4],
[2, 5]])

Fill panel data with ranked timepoints in pandas

Given a DataFrame that represents instances of called customers:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]})
The data is ordered by time such that every customer is a time-series and every customer has different timestamps. Thus I need a column that consists of the ranked timepoints:
df_2 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5],
"call_nr" : [0,1,2,0,1,0,1,2,3,0,0,1]})
After trying different approaches I came up with this to create call_nr:
np.concatenate([np.arange(df["customer_id"].value_counts().loc[i]) for i in df["customer_id"].unique()])
It works, but I doubt this is best practice. Is there a better solution?
A simpler solution would be to groupby your 'customer_id' and use cumcount:
>>> df_1.groupby('customer_id').cumcount()
0 0
1 1
2 2
3 0
4 1
5 0
6 1
7 2
8 3
9 0
10 0
11 1
which you can assign back as a column in your dataframe

How to apply rolling mean function while keeping all the observations with duplicated indices in time

I have a dataframe that has duplicated time indices and I would like to get the mean across all for the previous 2 days (I do not want to drop any observations; they are all information that I need). I've checked pandas documentation and read previous posts on Stackoverflow (such as Apply rolling mean function on data frames with duplicated indices in pandas), but could not find a solution. Here's an example of how my data frame look like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],'t': [1, 2, 3, 2, 1, 2, 2, 3, 4],'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
t
v2
1
-
2
-
3
4.167
4
5
5
6.667
A rough proposal to concatenate 2 copies of the input frame in which values in 't' are replaced respectively by values of 't+1' and 't+2'. This way, the meaning of the column 't' becomes "the target day".
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],
't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
len = df.shape[0]
incr = pd.DataFrame({'id': [0]*len, 't': [1]*len, 'v1':[0]*len}) # +1 in 't'
df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1] # Drop the days that have no full values for the 2 previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
v2
t
3 4.166667
4 5.000000
5 6.666667
Thank you for all the help. I ended up using groupby + rolling (2 Day), and then drop duplicates (keep the last observation).

How to sort column names in pandas dataframe by specifying keywords

Specify any keyword in list or dict format as follows
Is it possible to sort columns in a data frame?
df = pd.DataFrame ({
"col_cc_7": [0, 0, 0],
"col_aa_7": [1, 1, 1],
"col_bb_7": [2, 2, 2]})
# before
col_cc_7, col_aa_7, col_bb_7
0, 1, 2
0, 1, 2
0, 1, 2
# sort
custom_sort_key = ["aa", "bb", "cc"]
# ... sort codes ...
# after
col_aa_7, col_bb_7, col_cc_7
1, 2, 0
1, 2, 0
1, 2, 0
For me, your question is a little confusing.
If you only want to sort your columns values, a simple google search would do the trick, if not, I could not understand the question.
df= df.sort_values(by=['col','col2', "col3"],ascending=[True,True,False])
The by= sets the order of the sorting, and the ascending is self explanatory.
We can split by the middle value and create a dictionary of your columns, then apply a sort before we assign this back. I've added some extra columns not in your sort to show what will happen to them.
df = pd.DataFrame ({
"col_cc_7": [0, 0, 0],
"col_aa_7": [1, 1, 1],
"col_bb_7": [2, 2, 2],
"col_ee_7": [2, 2, 2],
"col_dd_7": [2, 2, 2]})
custom_sort_key = ["bb", "cc", "aa"]
col_dict = dict(zip(df.columns,[x.split('_')[1] for x in df.columns.tolist()]))
#{'col_cc_7': 'cc',
# 'col_aa_7': 'aa',
# 'col_bb_7': 'bb',
# 'col_ee_7': 'ee',
# 'col_dd_7': 'dd'}
d = {v:k for k,v in enumerate(custom_sort_key)}
# this will only work on python 3.6 +
new_cols = dict(sorted(col_dict.items(), key=lambda x: d.get(x[1], float('inf'))))
df[new_cols.keys()]
col_bb_7 col_cc_7 col_aa_7 col_ee_7 col_dd_7
0 2 0 1 2 2
1 2 0 1 2 2
2 2 0 1 2 2

Joining pandas dataframes based on minimization of distance

I have a dataset of stores with 2D locations at daily timestamps. I am trying to match up each row with weather measurements made at stations at some other locations, also with daily timestamps, such that the Cartesian distance between each store and matched station is minimized. The weather measurements have not been performed daily, and the station positions may vary, so this is a matter of finding the closest station for each specific store at each specific day.
I realize that I can construct nested loops to perform the matching, but I am wondering if anyone here can think of some neat way of using pandas dataframe operations to accomplish this. A toy example dataset is shown below. For simplicity, it has static weather station positions.
store_df = pd.DataFrame({
'store_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'x': [1, 1, 1, 4, 4, 4, 4, 4, 4],
'y': [1, 1, 1, 1, 1, 1, 4, 4, 4],
'date': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
weather_station_df = pd.DataFrame({
'station_id': [1, 1, 1, 2, 2, 3, 3, 3],
'weather': [20, 21, 19, 17, 16, 18, 19, 17],
'x': [0, 0, 0, 5, 5, 3, 3, 3],
'y': [2, 2, 2, 1, 1, 3, 3, 3],
'date': [1, 2, 3, 1, 3, 1, 2, 3]})
The data below is the desired outcome. I have included station_id only for clarification.
store_id date station_id weather
0 1 1 1 20
1 1 2 1 21
2 1 3 1 19
3 2 1 2 17
4 2 2 3 19
5 2 3 2 16
6 3 1 3 18
7 3 2 3 19
8 3 3 3 17
The idea of the solution is to build the table of all combinations,
df = store_df.merge(weather_station_df, on='date', suffixes=('_store', '_station'))
calculate the distance
df['dist'] = (df.x_store - df.x_station)**2 + (df.y_store - df.y_station)**2
and choose the minimum per group:
df.groupby(['store_id', 'date']).apply(lambda x: x.loc[x.dist.idxmin(), ['station_id', 'weather']]).reset_index()
If you have a lot of date the you can do the join per group.
import math
import numpy as np
def distance(x1, x2, y1, y2):
return np.sqrt((x2-x1)**2 + (y2-y1)**2)
#Join On Date to get all combinations of store and stations per day
df_all = store_df.merge(weather_station_df, on=['date'])
#Apply distance formula to each combination
df_all['distances'] = distance(df_all['x_y'], df_all['x_x'], df_all['y_y'], df_all['y_x'])
#Get Minimum distance for each day Per store_id
df_mins = df_all.groupby(['date', 'store_id'])['distances'].min().reset_index()
#Use resulting minimums to get the station_id matching the min distances
closest_stations_df = df_mins.merge(df_all, on=['date', 'store_id', 'distances'], how='left')
#filter out the unnecessary columns
result_df = closest_stations_df[['store_id', 'date', 'station_id', 'weather', 'distances']].sort_values(['store_id', 'date'])
edited: To use vectorized distance formula

Categories