I have two dataframes that look as follows:
import pandas as pd
import io
train_data="""input_example,user_id
example0.npy, jane
example1.npy, bob
example4.npy, alice
example5.npy, jane
example3.npy, bob
example2.npy, bob
"""
user_data="""user_data,user_id
data_jane0.npy, jane
data_jane1.npy, jane
data_bob0.npy, bob
data_bob1.npy, bob
data_alice0.npy, alice
data_alice1.npy, alice
data_alice2.npy, alice
"""
train_df = pd.read_csv(io.StringIO(train_data), sep=",")
user_df = pd.read_csv(io.StringIO(user_data), sep=",")
Suppose that the train_df table is many thousands of entries long, i.e., there are 1000s of unique "exampleN.npy" files. I was wondering if there was a straightforward way to merge the train_df and user_df tables where each row of the joined table matches on the key user_id but is subsampled from user_df.
Here is one example of a resulting dataframe (I'm trying to do uniform sampling, so there are many possible result dataframes):
>>> result_df
input_example user_data user_id
0 example0.npy data_jane0.npy jane
1 example1.npy data_bob1.npy bob
2 example4.npy data_alice0.npy alice
3 example5.npy data_jane1.npy jane
4 example3.npy data_bob0.npy bob
5 example2.npy data_bob0.npy bob
That is, the user_data column is filled with a random choice of filename based on the corresponding user_id.
I know one could write this with some multi-line, for-loop, query-based approach, but perhaps there is a faster way using built-in Pandas functions, e.g., "sample", "merge", "join", or "combine".
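For reference, a rough sketch of the loop-style baseline alluded to above (the files_by_user lookup and the seeded rng are only illustrative):
import numpy as np

rng = np.random.default_rng(0)
# map each user_id to its candidate files, then draw one file per training row
files_by_user = user_df.groupby("user_id")["user_data"].apply(list).to_dict()
result_df = train_df.copy()
result_df["user_data"] = [rng.choice(files_by_user[u]) for u in result_df["user_id"]]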
You can sample by groups in user_df and then join that with train_df.
e.g.,
# this samples by fraction, so each row within a group is equally likely to be drawn
user_df = user_df.groupby("user_id").sample(frac=0.5, replace=True)
user_data user_id
6 data_alice2.npy alice
4 data_alice0.npy alice
3 data_bob1.npy bob
0 data_jane0.npy jane
or
# this samples 2 rows per group (with replacement, so duplicates are possible)
user_df = user_df.groupby("user_id").sample(n=2, replace=True)
user_data user_id
6 data_alice2.npy alice
4 data_alice0.npy alice
2 data_bob0.npy bob
2 data_bob0.npy bob
0 data_jane0.npy jane
1 data_jane1.npy jane
Join
pd.merge(train_df, user_df)
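Putting the two pieces together, one possible combination (sampling with n=1 here so each training row gets exactly one file; with frac=0.5 or n=2 the merge would instead duplicate training rows):
sampled = user_df.groupby("user_id").sample(n=1)              # one random file per user
result_df = pd.merge(train_df, sampled, on="user_id", how="left")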
I don't know if it is possible to sample during the merge without first merging both in full, but the following avoids a multi-line for loop:
merged = (
    train_df.merge(user_df, on="user_id", how="left")
            .groupby("input_example", as_index=False)
            .apply(lambda x: x.sample(1))
            .reset_index(drop=True)
)
1. Merge the two together on "user_id", keeping only the keys that appear in the left frame.
2. Group by "input_example", assuming these are all unique (otherwise one could group on both columns of train_df).
3. Take a sample of size 1 from each group.
4. Reset the index.
Sampling second, after the merge, means that rows with the same user_id will not necessarily get the same user_data (whereas sampling user_df first would give every row with a given user_id the same user_data).
I think I figured out a solution myself; it's a one-liner, but conceptually it's the same as what @Rawson suggested. First, I do a left merge, which results in a table with many duplicates. Then I shuffle all the rows to give it randomness. Finally, I drop the duplicates. If I add "sort_index", the resulting table has the same ordering as the original table.
I'm able to use the random_state kwarg to switch up which user_data file is used. See here:
>>> train_df.merge(user_df, on='user_id', how='left').sample(frac=1, random_state=0).drop_duplicates('input_example').sort_index()
input_example user_id user_data
1 example0.npy jane data_jane1.npy
2 example1.npy bob data_bob0.npy
6 example4.npy alice data_alice2.npy
8 example5.npy jane data_jane1.npy
10 example3.npy bob data_bob1.npy
11 example2.npy bob data_bob0.npy
>>> train_df.merge(user_df, on='user_id', how='left').sample(frac=1, random_state=1).drop_duplicates('input_example').sort_index()
input_example user_id user_data
1 example0.npy jane data_jane1.npy
2 example1.npy bob data_bob0.npy
4 example4.npy alice data_alice0.npy
7 example5.npy jane data_jane0.npy
10 example3.npy bob data_bob1.npy
12 example2.npy bob data_bob1.npy
Related
I would like to add a column to an existing dataframe that compares every row in the dataframe against every other row and lists the number of duplicate values. (I don't want to remove any of the rows, even if they are entirely duplicated by another row.)
The duplicates column should show something like this:
Name Name1 Name2 Name3 Name4 Duplicates
Mark Doug Jim Tom Alex 5
Mark Doug Jim Tom Peter 4
Mark Jim Doug Tom Alex 5
Josh Jesse Jim Tom Alex 3
Adam Cam Max Matt James 0
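For reference, the example above as a constructed dataframe (the Duplicates column is the desired output, not part of the input):
import pandas as pd

df = pd.DataFrame(
    [['Mark', 'Doug', 'Jim', 'Tom', 'Alex'],
     ['Mark', 'Doug', 'Jim', 'Tom', 'Peter'],
     ['Mark', 'Jim', 'Doug', 'Tom', 'Alex'],
     ['Josh', 'Jesse', 'Jim', 'Tom', 'Alex'],
     ['Adam', 'Cam', 'Max', 'Matt', 'James']],
    columns=['Name', 'Name1', 'Name2', 'Name3', 'Name4'],
)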
IIUC, you can convert your dataframe to an array of sets, then use numpy broadcasting to compare each combination (except the diagonal) and get the max intersection:
import numpy as np

a = df.agg(set, axis=1).to_numpy()  # one set of names per row
b = a & a[:, None]                  # pairwise intersections via numpy broadcasting
np.fill_diagonal(b, {})             # len({}) == 0, so a row never counts against itself
df['Duplicates'] = [max(map(len, x)) for x in b]
output:
Name Name1 Name2 Name3 Name4 Duplicates
0 Mark Doug Jim Tom Alex 5
1 Mark Doug Jim Tom Peter 4
2 Mark Jim Doug Tom Alex 5
3 Josh Jesse Jim Tom Alex 3
4 Adam Cam Max Matt James 0
Something you can use is the DataFrame's handy-dandy groupby method. It lets you group your dataframe by a specified subset of attributes/columns and use the size() or count() method to get the number of rows in each group:
# group pandas.DataFrame df by Name1-Name4 and summarize with the group size
duplicates = (
    df.groupby(['Name1', 'Name2', 'Name3', 'Name4'])
      .size()
      .sort_values(ascending=False)
      .reset_index(name='duplicates')
)
Note: I use the reset_index() method here to return a DataFrame instead of a Series. I use the sort_values() method to order the dataframe by the size (in this case, the number of rows in each group).
This is something I came up with; it's not an optimal solution, but it works!
unique_values_dict = {}  # dictionary to store the unique values of each column
columns = df.columns     # all columns are kept in the columns variable

for c in columns:  # for each column, find the distribution and store it in a dictionary
    unique_values = [(i, j) for i, j in df[c].value_counts().items()]
    unique_values_dict[c] = unique_values

df['duplicates'] = 0  # add a column called 'duplicates' with 0 as the default value

def count_duplicates(row):
    '''
    row: each row of the dataframe
    returns: sum of duplicate items in the row, across the df
    '''
    dups = row['duplicates']
    for c in columns:
        # print(c, row[c], unique_values_dict[c][1][0])
        if row[c] == unique_values_dict[c][1][0]:
            dups += unique_values_dict[c][1][1]
        # print(dups)
    return dups

df['duplicates'] = df.apply(lambda row: count_duplicates(row), axis=1)
In Python, I have a df that looks like this:
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
And a df that looks like this
Name ID
Dan 1
Hallie 2
Cam 2
Lacy 2
Ryan 3
Colt 4
Tia 4
How can I merge the dfs so that the ID column looks like this:
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
Dan 4
Hallie 5
Cam 5
Lacy 5
Ryan 6
Colt 7
Tia 7
This is just a minimal reproducible example; my actual data set has thousands of values. I'm basically merging data frames and want the IDs in numerical order (continuing from the previous data frame) instead of restarting from one each time. I know that I could reset the index if ID were a unique identifier, but in this case more than one person can have the same ID. How can I account for that?
From the example you provided, you can see that the final dataframe can be obtained by adding the maximum ID of the first df to the IDs of the second and then concatenating them. To illustrate:
Name  df2  final_df
Dan   1    4
The value in final_df is obtained as 1 + (the max ID from df1, i.e. 3), and the same shift is applied to every entry of the second dataframe.
Code:
import pandas as pd
df = pd.DataFrame({'Name':['Anna','Polly','Sarah','Max','Kate','Ally','Steve'],'ID':[1,1,2,3,3,3,3]})
df1 = pd.DataFrame({'Name':['Dan','Hallie','Cam','Lacy','Ryan','Colt','Tia'],'ID':[1,2,2,2,3,4,4]})
max_df = df['ID'].max()
df1['ID'] = df1['ID'].apply(lambda x: x+max_df)
final_df = pd.concat([df,df1])
print(final_df)
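Equivalently, a small variant of the same idea without apply; ignore_index renumbers the concatenated row index:
df1['ID'] = df1['ID'] + df['ID'].max()   # vectorized version of the apply above
final_df = pd.concat([df, df1], ignore_index=True)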
I have a dataframe with customer transactions:
customer_id transaction_amount
0 bob 12
1 bob 34
2 bob 56
3 bob 23
4 mary 12
5 mary 34
6 mary 3
7 mary 53
8 mary 23
9 mary 12
10 mary 5
11 jim 2
12 jim 5
I want to find customers with at least 5 transactions.
What I'm doing works, but it looks very messy; it's a very ugly groupby.
I groupby customer_id, count the transaction_amount.
df.groupby('customer_id')['transaction_amount'].count()
customer_id
bob 4
jim 2
mary 7
Then I create a mask using the same groupby, but I add the filter >= 5, then get the index of the result.
mask = (df.groupby('customer_id')['transaction_amount'].count()>=5)
df.groupby('customer_id')['transaction_amount'].count()[mask].index
This gives me my result:
Index(['mary'], dtype='object', name='customer_id')
Surely there's a tidier way to do this? Two groupbys on top of each other just feels wrong.
This is an operation I do a lot in work, so I'd like to know if it can be tidier or more "pythonic" I guess.
Assuming that each row is a unique transaction for its customer_id, you don't need groupby at all. You can use value_counts:
s = df['customer_id'].value_counts()
s[s >= 5].index
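If the matching rows are wanted rather than just the customer names, the result can be fed back in with isin (a short follow-up sketch):
frequent = s[s >= 5].index
df_frequent = df[df['customer_id'].isin(frequent)]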
You can avoid groupby by using pandas.Series.value_counts in the following way; consider this simple example:
import pandas as pd
df = pd.DataFrame({'name':['bob','bob','mary','mary','mary','jim']})
cnt = df.name.value_counts()
morethan3 = list(cnt.where(cnt>=3).dropna().index)
print(morethan3)
output
['mary']
Explanation: value_counts() returns a pandas.Series whose index holds the investigated objects and whose values are their numbers of occurrences. I replace all values which are not >= 3 with NaN using .where, then jettison the NaNs using .dropna, and finally take whatever index remains.
I have a list of pandas dataframes with two columns, basically a class and a value:
df1:
Name  Count
Bob   10
John  20

df2:
Name  Count
Mike  30
Bob   40
The same "Name" might appear in several dataframes or not at all, and the list contains over 100 dataframes. Within each dataframe, however, all "Names" are unique.
What I need is to iterate over all the dataframes and create one big one that contains every "Name" together with the total sum of its "Count" across all the dataframes, like so:
result:
Name  Count
Bob   50
John  20
Mike  30
Bob's counts are summed; the others are not, as they only appear once. Is there an efficient way to do this when there are many dataframes?
do pd.concat then groupby:
df = pd.concat(dfs) # where dfs is a list of dataframes
then you can do
gp = df.groupby(['Name'])['Count'].sum()
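End to end with the example frames from the question (a sketch; as_index=False turns the grouped result back into the two-column table shown above):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Bob', 'John'], 'Count': [10, 20]})
df2 = pd.DataFrame({'Name': ['Mike', 'Bob'], 'Count': [30, 40]})

dfs = [df1, df2]   # in practice, the list of 100+ dataframes
result = pd.concat(dfs).groupby('Name', as_index=False)['Count'].sum()
print(result)
#    Name  Count
# 0   Bob     50
# 1  John     20
# 2  Mike     30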
You can do the following (if some Names are contained in only one dataframe, fill_value=0 ensures they still get a value):
df1.set_index('Name').add(df2.set_index('Name'), fill_value=0).reset_index()
>>> Name Count
0 Bob 50.0
1 John 20.0
2 Mike 30.0
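For a whole list of dataframes, the same idea extends with functools.reduce (a sketch; with very many frames the concat + groupby approach above is probably simpler and faster):
from functools import reduce

# dfs is the list of dataframes, each with a Name and a Count column
total = reduce(
    lambda acc, d: acc.add(d.set_index('Name'), fill_value=0),
    dfs[1:],
    dfs[0].set_index('Name'),
).reset_index()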
I have a dataframe where I want to see the zero counts. It's essentially a duplicate of this question: Pandas Groupby How to Show Zero Counts in DataFrame
But unfortunately the answer there does not work for me. Whenever I try the MultiIndex.from_product approach, I get the error:
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
which is because I have a large number of unique values in the groupby columns. I've confirmed that the same script works for much smaller dataframes with fewer unique indices (and therefore fewer elements in df.index.levels[i].values).
Here's an idea on the dataframe that I'm working with:
user1 user2 hour
-------------------
Alice Bob 0
Alice Carol 1
Alice Bob 13
Bob Eve 2
to
user1 user2 hour count
-------------------------------
Alice Bob 0 1
Alice Bob 1 0
Alice Bob 2 0
and so on, but what I get is:
user1 user2 hour count
-------------------------------
Alice Bob 0 1
Alice Bob 13 1
Alice Carol 1 1
However, I have ~1.2M unique combinations of user1-user2, so MultiIndex.from_product doesn't work.
EDIT: Here's the code I used for some dummy dataframe. It works for the dummy case, but not for the larger case:
import pandas as pd
df = pd.DataFrame({'id':[1,1,2,2,3,3],'hour':[0,1,0,0,1,1], 'to_count': [20,10,5,4,17,6]})
print(df)
agg_df = df.groupby(['id', 'hour']).agg({'to_count': 'count'})
print(df.groupby(['id', 'hour']).agg({'to_count':'count'}))
print(len(agg_df.index.levels))
levels = [agg_df.index.levels[i].values for i in range(len(agg_df.index.levels))]
levels[-1] = [0,1,2]
print(len(levels))
print(agg_df.index.names)
new_index = pd.MultiIndex.from_product(levels, names=agg_df.index.names)
# Reindex the agg_df and fill empty values with zero (NaN by default)
agg_df = agg_df.reindex(new_index, fill_value=0)
# Reset index
agg_df = agg_df.reset_index()
Is there a better way to show zero counts for groupby in large pandas dataframes?
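One possible direction, sketched under the assumption that zero rows are only needed for the observed user1-user2 pairs (not the full cross product of all users): unstack the hour level with fill_value=0, reindex the hour columns to the full 0-23 range, and stack back:
counts = df.groupby(['user1', 'user2', 'hour']).size()
# complete only the hour dimension: observed pairs x 24 hours,
# instead of all user1 x all user2 x 24 hours
full = (
    counts.unstack('hour', fill_value=0)
          .reindex(columns=pd.Index(range(24), name='hour'), fill_value=0)
          .stack()
          .rename('count')
          .reset_index()
)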