I have two dataframes with the same columns:
Dataframe 1:
attr_1 attr_77 ... attr_8
userID
John 1.2501 2.4196 ... 1.7610
Charles 0.0000 1.0618 ... 1.4813
Genarito 2.7037 4.6707 ... 5.3583
Mark 9.2775 6.7638 ... 6.0071
Dataframe 2:
attr_1 attr_77 ... attr_8
petID
Firulais 1.2501 2.4196 ... 1.7610
Connie 0.0000 1.0618 ... 1.4813
PopCorn 2.7037 4.6707 ... 5.3583
I want to generate a correlation and p-value dataframe of all possible combinations; this would be the result:
userID petID Correlation p-value
0 John Firulais 0.091447 1.222927e-02
1 John Connie 0.101687 5.313359e-03
2 John PopCorn 0.178965 8.103919e-07
3 Charles Firulais -0.078460 3.167896e-02
The problem is that the Cartesian product generates more than 3 million tuples, taking minutes to finish. This is my code; I've written two alternatives:
First of all, initial DataFrames:
df1 = pd.DataFrame({
'userID': ['John', 'Charles', 'Genarito', 'Mark'],
'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]
}).set_index('userID')
df2 = pd.DataFrame({
'petID': ['Firulais', 'Connie', 'PopCorn'],
'attr_1': [1.2501, 0.0, 2.7037],
'attr_77': [2.4196, 1.0618, 4.6707],
'attr_8': [1.7610, 1.4813, 5.3583]
}).set_index('petID')
Option 1:
# Pre-allocate space
df1_keys = df1.index
res_row_count = len(df1_keys) * df2.values.shape[0]
users = np.empty(res_row_count, dtype='object')
pets = np.empty(res_row_count, dtype='object')
coff = np.empty(res_row_count)
p_value = np.empty(res_row_count)
i = 0
for df1_key in df1_keys:
df1_values = df1.loc[df1_key, :].values
for df2_key in df2.index:
df2_values = df2.loc[df2_key, :]
pearson_res = pearsonr(df1_values, df2_values)
users[i] = df1_key
pets[i] = df2_key
coff[i] = pearson_res[0]
p_value[i] = pearson_res[1]
i += 1
# After loop, creates the resulting Dataframe
return pd.DataFrame(data={
'userID': users,
'petID': pets,
'Correlation': coff,
'p-value': p_value
})
Option 2 (slower), from here:
# Makes a merge between all the tuples
def df_crossjoin(df1_file_path, df2_file_path):
df1, df2 = prepare_df(df1_file_path, df2_file_path)
df1['_tmpkey'] = 1
df2['_tmpkey'] = 1
res = pd.merge(df1, df2, on='_tmpkey').drop('_tmpkey', axis=1)
res.index = pd.MultiIndex.from_product((df1.index, df2.index))
df1.drop('_tmpkey', axis=1, inplace=True)
df2.drop('_tmpkey', axis=1, inplace=True)
return res
# Computes Pearson Coefficient for all the tuples
def compute_pearson(row):
values = np.split(row.values, 2)
return pearsonr(values[0], values[1])
result = df_crossjoin(mrna_file, mirna_file).apply(compute_pearson, axis=1)
Is there a faster way to solve such a problem with Pandas? Or will I have no option but to parallelize the iterations?
Edit:
As the size of the dataframes increases, the second option results in a better runtime, but it still takes seconds to finish.
Thanks in advance
Of all the alternatives tested, the one that gave me the best results was the following:
The iteration over all pairs was built with itertools.product(), and all iterations over both iterrows() were run on a Pool of parallel processes (using a map function).
To give it a little more performance, the function compute_row_cython was compiled with Cython, as advised in this section of the Pandas documentation:
In the cython_modules.pyx file:
from scipy.stats import pearsonr
import numpy as np
def compute_row_cython(row):
(df1_key, df1_values), (df2_key, df2_values) = row
cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]
Then I set up the setup.py:
from distutils.core import setup
from Cython.Build import cythonize
setup(name='Compiled Pearson',
      ext_modules=cythonize("cython_modules.pyx"))
Finally I compiled it with: python setup.py build_ext --inplace
The final code then looked like this:
import itertools
import multiprocessing
from cython_modules import compute_row_cython
NUM_CORES = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(NUM_CORES)
# Calls to Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows())))
pool.close()
end_values = list(res)
pool.join()
Neither Dask nor the merge function with apply gave me better results, not even after optimizing the apply with Cython. In fact, that alternative with those two methods gave me memory errors; when implementing the solution with Dask I had to generate several partitions, which degraded performance because it had to perform many I/O operations.
The solution with Dask can be found in my other question.
Here's another method using the same cross join, but with the built-in pandas method DataFrame.corrwith and scipy.stats.ttest_ind. Since this is a less "loopy" implementation, it should perform better.
from scipy.stats import ttest_ind
mrg = df1.reset_index().assign(key=1).merge(df2.reset_index().assign(key=1), on='key').drop(columns='key')
x = mrg.filter(like='_x').rename(columns=lambda x: x.rsplit('_', 1)[0])
y = mrg.filter(like='_y').rename(columns=lambda x: x.rsplit('_', 1)[0])
df = mrg[['userID', 'petID']].join(x.corrwith(y, axis=1).rename('Correlation'))
df['p_value'] = ttest_ind(x, y, axis=1)[1]
userID petID Correlation p_value
0 John Firulais 1.000000 1.000000
1 John Connie 0.641240 0.158341
2 John PopCorn 0.661040 0.048041
3 Charles Firulais 0.641240 0.158341
4 Charles Connie 1.000000 1.000000
5 Charles PopCorn 0.999660 0.020211
6 Genarito Firulais 0.661040 0.048041
7 Genarito Connie 0.999660 0.020211
8 Genarito PopCorn 1.000000 1.000000
9 Mark Firulais -0.682794 0.006080
10 Mark Connie -0.998462 0.003865
11 Mark PopCorn -0.999569 0.070639
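For reference, the p-value produced by ttest_ind above is not the same as the one pearsonr returns. If the pearsonr-style p-value is needed, the whole cross product can also be vectorised with plain NumPy. This is only a sketch, not from the answers above, assuming df1 and df2 are shaped as in the question, with the attributes as columns and userID/petID as the index:
import numpy as np
import pandas as pd
from scipy import stats

def pairwise_pearson(df1, df2):
    a = df1.to_numpy(dtype=float)
    b = df2.to_numpy(dtype=float)
    k = a.shape[1]  # number of attributes per row
    # Standardise each row so that a dot product divided by k gives Pearson's r
    # (rows with zero variance will produce NaN)
    a = (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, keepdims=True)
    b = (b - b.mean(axis=1, keepdims=True)) / b.std(axis=1, keepdims=True)
    r = a @ b.T / k  # (len(df1), len(df2)) matrix of correlations
    # Two-sided p-value from the t distribution, equivalent to what pearsonr reports
    t = r * np.sqrt((k - 2) / np.clip(1 - r ** 2, 1e-15, None))
    p = 2 * stats.t.sf(np.abs(t), k - 2)
    idx = pd.MultiIndex.from_product([df1.index, df2.index], names=['userID', 'petID'])
    return pd.DataFrame({'Correlation': r.ravel(), 'p-value': p.ravel()}, index=idx).reset_index()

result = pairwise_pearson(df1, df2)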
Related
I have the following pandas dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"bird_type": ["falcon", "crane", "crane", "falcon"],
"avg_speed": [np.random.randint(50, 200) for _ in range(4)],
"no_of_birds_observed": [np.random.randint(3, 10) for _ in range(4)],
"reliability_of_data": [np.random.rand() for _ in range(4)],
}
)
# The dataframe looks like this.
bird_type avg_speed no_of_birds_observed reliability_of_data
0 falcon 66 3 0.553841
1 crane 159 8 0.472359
2 crane 158 7 0.493193
3 falcon 161 7 0.585865
Now, I would like to have the weighted average (weighted by no_of_birds_observed) of the avg_speed and reliability_of_data variables. For that I have a simple function as follows, which calculates the weighted average.
def func(data, numbers):
ans = 0
for a, b in zip(data, numbers):
ans = ans + a*b
ans = ans / sum(numbers)
return ans
How can I apply the function of func to both average speed and reliability variables?
I expect the answer to be a dataframe like follows
bird_type avg_speed no_of_birds_observed reliability_of_data
0 falcon 132.5 10 0.5762578
# how: (66*3 + 161*7)/(3+7)   (3+7)   (0.553841*3 + 0.585865*7)/(3+7)
1 crane 158.53 15 0.4820815
# how: (159*8 + 158*7)/(8+7)   (8+7)   (0.472359*8 + 0.493193*7)/(8+7)
I saw this question, but could not generalize the solution or understand it completely. I thought of not asking the question, but according to this blog post by SO and this meta question, with a different example, I think this question can be considered a "borderline duplicate". An answer will benefit me, and others will probably also find it useful. So I finally decided to ask.
Don't use a function with apply, rather perform a classical aggregation:
cols = ['avg_speed', 'reliability_of_data']
# multiply relevant columns by no_of_birds_observed
# aggregate everything as sum
out = (df[cols].mul(df['no_of_birds_observed'], axis=0)
.combine_first(df)
.groupby('bird_type').sum()
)
# divide the relevant columns by the sum of no_of_birds_observed
out[cols] = out[cols].div(out['no_of_birds_observed'], axis=0)
Output:
avg_speed no_of_birds_observed reliability_of_data
bird_type
crane 158.533333 15 0.482082
falcon 132.500000 10 0.576258
If you want to aggregate with GroupBy.agg, the weights parameter of np.average can use no_of_birds_observed selected via DataFrame.loc:
# for correct output, a default (or unique-valued) index is needed
df = df.reset_index(drop=True)
f = lambda x: np.average(x, weights= df.loc[x.index, 'no_of_birds_observed'])
df1 = (df.groupby('bird_type', sort=False, as_index=False)
.agg(avg=('avg_speed',f),
no_of_birds=('no_of_birds_observed','sum'),
reliability_of_data=('reliability_of_data', f)))
print (df1)
bird_type avg no_of_birds reliability_of_data
0 falcon 132.500000 10 0.576258
1 crane 158.533333 15 0.482082
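A third variant, not from either answer above, just a sketch: compute both weighted averages per group inside a single apply, which avoids relying on a unique index:
import numpy as np
import pandas as pd

cols = ['avg_speed', 'no_of_birds_observed', 'reliability_of_data']
out = (df.groupby('bird_type', sort=False)[cols]
         .apply(lambda g: pd.Series({
             # np.average handles the weighting; the weights come from the group itself
             'avg_speed': np.average(g['avg_speed'], weights=g['no_of_birds_observed']),
             'no_of_birds_observed': g['no_of_birds_observed'].sum(),
             'reliability_of_data': np.average(g['reliability_of_data'], weights=g['no_of_birds_observed']),
         })))
print(out)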
I have many different tables that all have different column names and each refer to an outcome, like glucose, insulin, leptin etc (except keep in mind that the tables are all gigantic and messy with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example (ignore that the function makes little sense). The code below works, but instead of copying and pasting final_report["outcome"] = ... over and over again, I would like to run the find_result function over each of glucose, insulin, and leptin and add "glucose_result", "insulin_result" and "leptin_result" to final_report in one or a few lines.
Thanks in advance.
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
glucose = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
insulin = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
leptin = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id':ids,
'start':start,
'end':end})
def find_result(subject, start, end, df):
df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by = "timepoint")
return df["timepoint"].nunique()
final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}
for name, df in input_dfs.items():
final_report[f"{name}_result"] = final_report.apply(
lambda x: find_result(x['id'], x['start'], x['end'], df),
axis=1
)
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
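If the row-wise apply ever becomes a bottleneck, a merge-based variant is possible as well. This is only a sketch, not from the answer above, and it assumes the same column names as in the question:
def window_nunique(df, report):
    # Attach each subject's window, keep timepoints inside it, count distinct timepoints per id
    merged = df.merge(report[['id', 'start', 'end']], on='id')
    inside = merged[(merged['timepoint'] >= merged['start']) & (merged['timepoint'] <= merged['end'])]
    return inside.groupby('id')['timepoint'].nunique()

for name, df in input_dfs.items():
    final_report[f"{name}_result"] = (final_report['id']
                                      .map(window_nunique(df, final_report))
                                      .fillna(0)
                                      .astype(int))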
I am trying to add new columns to a huge pandas dataframe. I wrote a function to add the new columns and can now loop over the dataframe. This works, but since the dataframe is so big it takes quite a while. So I tried to use the multiprocessing module to speed things up, but was not able to make it run.
Below is a MWE. I guess pool.map() cannot change the dataframe directly and I need to save the new columns first somewhere else. Note: In the "real" code I will add more than 100 new columns and those are also based on values in other dataframes (so I guess apply is not possible).
import pandas as pd
import numpy as np
from multiprocessing import Pool
df = pd.DataFrame({"Value1" : [1,2,3], "Value2" : [9,8,7]})
def make_new_columns(i):
df.loc[i, 'mean'] = np.mean([df.loc[i, 'Value1'], df.loc[i, 'Value2']])
df.loc[i, 'sd'] = np.std([df.loc[i, 'Value1'], df.loc[i, 'Value2']])
df.loc[i, 'cv'] = df.loc[i, 'mean'] / df.loc[i, 'sd']
# With a for loop it is working
# for i in range(len(df)):
# make_new_columns(i)
# With multiprocessing it isn't
pool = Pool()
pool.map(make_new_columns, range(len(df)))
Thanks for your input.
EDIT:
To give a bit more background. I have a data.frame containing tennis match data (Match_Table) which looks a bit like this:
Match_Table:
Date Player_1 Player_2 Winner Aces_1 Aces_2 [...]
----------------------------------------------------------------------
20200528 Thomas Peter Thomas 6 2
20200526 Peter Michael Peter 8 3
20200524 Donald Bill Bill 3 12
...
Now, I am interested in statistics for a specific matchup. For example: "What was the win rate of e.g. Peter in the last 100 games?", "How many aces did he score on average?", "How many aces did his opponent score?", "How was his win rate against e.g. Bill in the last 100 games?", ...
I also need these statistics for different dates in the past (e.g. what was Peter's win rate in January 2018?). Therefore, I build a second table with the required information (Statistic_Table):
Statistic_Table:
Date Player1 Player2
----------------------------------------------------------------------
202002 Thomas Peter
202002 Peter Michael
201905 Donald Bill
...
Then I wrote a function which filters the Match_Table and calculates all missing columns of Statistic_Table. I can now loop over each row, so it results in this:
Date Player Opponent Winrate Winrate_vs avgAces [...]
-------------------------------------------------------------
202002 Thomas Peter 0.47 0.45 4.5
202002 Peter Michael 0.54 0.64 8.4
201905 Donald Bill 0.63 0.78 6.5
...
Everything works fine. But since for every cell in my quite large Statistic_Table I have to subset another table and calculate statistics (not only means or rates but also weighted averages, etc.), it takes several hours. That would be acceptable, since I only need to create the table once. Still, if I could split the workload across cores it would be faster, and also easier in case I have to adjust some parameters.
I also looked into the possibility of using some apply method or optimizing the code, but since I (hopefully) only need to generate the table once, I don't want to lose too much time on this. Thus, multiprocessing seemed an easy solution, especially since I have access to powerful computers.
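For reference, one way to get such per-matchup statistics without filtering Match_Table row by row is to reshape it to one row per (player, match) and then aggregate with groupby. This is only a rough sketch, not from the question, and the column names are assumed from the example table above:
import pandas as pd

# Build one row per (player, match); date cutoffs can be applied by filtering on Date first
halves = []
for player, opponent, aces in [('Player_1', 'Player_2', 'Aces_1'),
                               ('Player_2', 'Player_1', 'Aces_2')]:
    half = Match_Table[['Date', player, opponent, 'Winner', aces]].rename(
        columns={player: 'Player', opponent: 'Opponent', aces: 'Aces'})
    halves.append(half)
long_form = pd.concat(halves, ignore_index=True)
long_form['Won'] = long_form['Player'] == long_form['Winner']

# Win rate and average aces per player, and win rate per specific matchup
per_player = long_form.groupby('Player').agg(Winrate=('Won', 'mean'), avgAces=('Aces', 'mean'))
per_matchup = long_form.groupby(['Player', 'Opponent']).agg(Winrate_vs=('Won', 'mean'))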
There is a better module for your use case than multiprocessing. Use Ray.
import pandas as pd
import numpy as np
import ray
ray.init()
@ray.remote
class DataFrameActor:
def __init__(self, df):
self.df = df.copy()
def make_new_columns(self, i):
self.df.loc[i, 'mean'] = np.mean([self.df.loc[i, 'Value1'], self.df.loc[i, 'Value2']])
self.df.loc[i, 'sd'] = np.std([self.df.loc[i, 'Value1'], self.df.loc[i, 'Value2']])
self.df.loc[i, 'cv'] = self.df.loc[i, 'mean'] / self.df.loc[i, 'sd']
def to_df(self):
return self.df
@ray.remote
def worker(_df_actor, value):
_df_actor.make_new_columns.remote(i=value)
df = pd.DataFrame({"Value1" : [1,2,3], "Value2" : [9,8,7]})
df_actor = DataFrameActor.remote(df)
[worker.remote(df_actor, j) for j in range(len(df))]
print(ray.get(df_actor.to_df.remote()))
Assuming the functions in your MWE represent what you want to do in your real frame, you should work column-wise.
df['mean'] = df[['Value1', 'Value2']].mean(axis=1)
df['sd'] = df[['Value1', 'Value2']].std(axis=1)
df['cv'] = df['mean'] / df['sd']
Below is timing code (where df is built with more rows and values are randomly drawn integers)
import pandas as pd
import numpy as np
from multiprocessing import Pool
n_rows = 2000
df = pd.DataFrame({"Value1" : np.random.randint(1, high=100, size=n_rows),
"Value2" : np.random.randint(1, high=100, size=n_rows)})
# Function now takes df as input, so no global variables are changed
def make_new_columns(df, i):
df.loc[i, 'mean'] = np.mean([df.loc[i, 'Value1'], df.loc[i, 'Value2']])
df.loc[i, 'sd'] = np.std([df.loc[i, 'Value1'], df.loc[i, 'Value2']])
df.loc[i, 'cv'] = df.loc[i, 'mean'] / df.loc[i, 'sd']
return df
# cell-wise construction: 1.98 s
%%timeit
df_2 = df.copy()
# With a for loop it is working
for i in range(len(df_2)):
make_new_columns(df_2, i)
# column-wise construction: 2.4 ms
%%timeit
df_2= df.copy()
df_2['mean'] = df_2[['Value1', 'Value2']].mean(axis=1)
df_2['sd'] = df_2[['Value1', 'Value2']].std(axis=1)
df_2['cv'] = df_2['mean'] / df_2['sd']
I have a pandas dataframe with a format exactly like the one in this question and I'm trying to achieve the same result. In my case, I am calculating the fuzz ratio between the row's index and its corresponding column.
If I try this code (based on the answer to the linked question)
def get_similarities(x):
return x.index + x.name
test_df = test_df.apply(get_similarities)
the concatenation of the row index and col name happens cell-wise, just as intended. Running type(test_df) returns pandas.core.frame.DataFrame, as expected.
However, if I adapt the code to my scenario like so
def get_similarities(x):
return fuzz.partial_ratio(x.index, x.name)
test_df = test_df.apply(get_similarities)
it doesn't work. Instead of a dataframe, I get back a series (the return type of that function is an int).
I don't understand why the two samples don't behave the same, nor how to fix my code so that it returns a dataframe with the fuzz ratio for each cell, computed between that cell's row index and column name.
What about the following approach?
Assuming that we have two lists of strings:
In [245]: set1
Out[245]: ['car', 'bike', 'sidewalk', 'eatery']
In [246]: set2
Out[246]: ['walking', 'caring', 'biking', 'eating']
Solution:
In [247]: from itertools import product
In [248]: res = np.array([fuzz.partial_ratio(*tup) for tup in product(set1, set2)])
In [249]: res = pd.DataFrame(res.reshape(len(set1), -1), index=set1, columns=set2)
In [250]: res
Out[250]:
walking caring biking eating
car 33 100 0 33
bike 25 25 75 25
sidewalk 73 20 22 36
eatery 17 33 0 50
There is a way to accomplish this via DataFrame.apply with some row manipulations.
Assuming test_df is as follows:
In [73]: test_df
Out[73]:
walking caring biking eating
car carwalking carcaring carbiking careating
bike bikewalking bikecaring bikebiking bikeeating
sidewalk sidewalkwalking sidewalkcaring sidewalkbiking sidewalkeating
eatery eaterywalking eaterycaring eaterybiking eateryeating
In [74]: def get_ratio(row):
...: return row.index.to_series().apply(lambda x: fuzz.partial_ratio(x,
...: row.name))
...:
In [75]: test_df.apply(get_ratio)
Out[75]:
walking caring biking eating
car 33 100 0 33
bike 25 25 75 25
sidewalk 73 20 22 36
eatery 17 33 0 50
It took some digging, but I figured it out. The problem comes from the fact that DataFrame.apply is either applied column-wise or row-wise, not cell by cell. So your get_similarities function is actually getting access to an entire row or column of data at a time! By default it gets the entire column -- so to solve your problem, you just have to make a get_similarities function that returns a list where you manually call fuzz.partial_ratio on each element, like this:
import pandas as pd
from fuzzywuzzy import fuzz
def get_similarities(x):
    l = []
    for rname in x.index:
        print("Getting ratio for %s and %s" % (rname, x.name))
        score = fuzz.partial_ratio(rname, x.name)
        print("Score %s" % score)
        l.append(score)
    print(len(l))
    print()
    return l
a = pd.DataFrame([[1,2],[3,4]],index=['apple','banana'], columns=['aple','banada'])
c = a.apply(get_similarities,axis=0)
print(c)
print(type(c))
I left my print statements in there so you can see for yourself what the DataFrame.apply call is doing; that's when it clicked for me.
I have a function which takes in data for a particular year and returns a dataframe.
For example:
df
year fruit license grade
1946 apple XYZ 1
1946 orange XYZ 1
1946 apple PQR 3
1946 orange PQR 1
1946 grape XYZ 2
1946 grape PQR 1
..
2014 grape LMN 1
Note:
1) a specific license value will exist only for a particular year and only once for a particular fruit (e.g. XYZ only for 1946 and only once for apple, orange and grape).
2) Grade values are categorical.
I realize the below function isn't very efficient to achieve its intended goals,
but this is what I am currently working with.
def func(df, year):
#1. Filter out only the data for the year needed
df_year=df[df['year']==year]
'''
2. Transform DataFrame to the form:
XYZ PQR .. LMN
apple 1 3 1
orange 1 1 3
grape 2 1 1
Note that 'LMN' is just used for representation purposes.
It won't logically appear here because it can only appear for the year 2014.
'''
df_year = df_year.pivot(index='fruit',columns='license',values='grade')
#3. Remove all fruits that have ANY NaN values
df_year=df_year.dropna(axis=1, how="any")
#4. Some additional filtering
#5. Function to calculate similarity between fruits
def similarity_score(fruit1, fruit2):
agreements=np.sum( ( (fruit1 == 1) & (fruit2 == 1) ) | \
( (fruit1 == 3) & (fruit2 == 3) ))
disagreements=np.sum( ( (fruit1 == 1) & (fruit2 == 3) ) |\
( (fruit1 == 3) & (fruit2 == 1) ))
return (((agreements - disagreements) / float(len(fruit1))) + 1) / 2
#6. Create Network dataframe
network_df=pd.DataFrame(columns=['Source','Target','Weight'])
for i,c in enumerate(combinations(df_year,2)):
c1=df_year[[c[0]]].values.tolist()
c2=df_year[[c[1]]].values.tolist()
c1=[item for sublist in c1 for item in sublist]
c2=[item for sublist in c2 for item in sublist]
network_df.loc[i] = [c[0],c[1],similarity_score(c1,c2)]
return network_df
Running the above gives:
df_1946=func(df,1946)
df_1946.head()
Source Target Weight
Apple Orange 0.6
Apple Grape 0.3
Orange Grape 0.7
I want to flatten the above to a single row:
(Apple,Orange) (Apple,Grape) (Orange,Grape)
1946 0.6 0.3 0.7
Note the above will not have 3 columns, but in fact around 5000 columns.
Eventually, I want to stack the transformed dataframe rows to get something like:
df_all_years
(Apple,Orange) (Apple,Grape) (Orange,Grape)
1946 0.6 0.3 0.7
1947 0.7 0.25 0.8
..
2015 0.75 0.3 0.65
What is the best way to do this?
I would rearrange the computation a bit differently.
Instead of looping over the years:
for year in range(1946, 2015):
partial_result = func(df, year)
and then concatenating the partial results, you can get
better performance by doing as much work as possible on the whole DataFrame, df,
before calling df.groupby(...). Also, if you can express the computation in terms of builtin aggregators such as sum and count, the computation can be done more quickly than if you use custom functions with groupby/apply.
import itertools as IT
import numpy as np
import pandas as pd
np.random.seed(2017)
def make_df():
N = 10000
df = pd.DataFrame({'fruit': np.random.choice(['Apple', 'Orange', 'Grape'], size=N),
'grade': np.random.choice([1,2,3], p=[0.7,0.1,0.2], size=N),
'year': np.random.choice(range(1946,1950), size=N)})
df['manufacturer'] = (df['year'].astype(str) + '-'
+ df.groupby(['year', 'fruit'])['fruit'].cumcount().astype(str))
df = df.sort_values(by=['year'])
return df
def similarity_score(df):
"""
Compute the score between each pair of columns in df
"""
agreements = {}
disagreements = {}
for col in IT.combinations(df,2):
fruit1 = df[col[0]].values
fruit2 = df[col[1]].values
agreements[col] = ( ( (fruit1 == 1) & (fruit2 == 1) )
| ( (fruit1 == 3) & (fruit2 == 3) ))
disagreements[col] = ( ( (fruit1 == 1) & (fruit2 == 3) )
| ( (fruit1 == 3) & (fruit2 == 1) ))
agreements = pd.DataFrame(agreements, index=df.index)
disagreements = pd.DataFrame(disagreements, index=df.index)
numerator = agreements.astype(int)-disagreements.astype(int)
grouped = numerator.groupby(level='year')
total = grouped.sum()
count = grouped.count()
score = ((total/count) + 1)/2
return score
df = make_df()
df2 = df.set_index(['year','fruit','manufacturer'])['grade'].unstack(['fruit'])
df2 = df2.dropna(axis=0, how="any")
print(similarity_score(df2))
yields
Grape Orange
Apple Apple Grape
year
1946 0.629111 0.650426 0.641900
1947 0.644388 0.639344 0.633039
1948 0.613117 0.630566 0.616727
1949 0.634176 0.635379 0.637786
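To get the exact single-row-per-year layout asked for, the two-level column index of the score frame can be flattened into "(fruit1,fruit2)" labels. A small sketch, assuming the similarity_score output above:
score = similarity_score(df2)
# Collapse the MultiIndex columns into the "(A,B)" labels requested in the question
score.columns = ['({},{})'.format(a, b) for a, b in score.columns]
print(score)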
Here's one way of doing a pandas routine to pivot the table in the way you refer to. It handles ~5,000 columns (arising combinatorially from two initially separate categorical classes) quickly enough; the bottleneck step took about 20 s on my quad-core MacBook, though for much larger scaling there are definitely faster strategies. The data in this example is pretty sparse (5K columns, with 5K random samples spread over 70 years [1947-2016]), so execution time might be some seconds longer with a fuller dataframe.
from itertools import chain
import pandas as pd
import numpy as np
import random # using python3 .choices()
import re
# Make bivariate data w/ 5000 total combinations (1000x5 categories)
# Also choose 5,000 randomly; some combinations may have >1 values or NaN
random_sample_data = np.array(
[random.choices(['Apple', 'Orange', 'Lemon', 'Lime'] +
['of Fruit' + str(i) for i in range(1000)],
k=5000),
random.choices(['Grapes', 'Are Purple', 'And Make Wine',
'From the Yeast', 'That Love Sugar'],
k=5000),
[random.random() for _ in range(5000)]]
).T
df = pd.DataFrame(random_sample_data, columns=[
"Source", "Target", "Weight"])
df['Year'] = random.choices(range(1947, 2017), k=df.shape[0])
# Three views of resulting df in jupyter notebook:
df
df[df.Year == 1947]
df.groupby(["Source", "Target"]).count().unstack()
To flatten the grouped-by-year data, since groupby requires a function to be applied, you can use a temporary df intermediary to:
1. push all df.groupby("Year") data into individual rows, with a separate object per the two columns "Target" + "Source" (to later expand by) plus "Weight";
2. use zip and pd.core.reshape.util.cartesian_product to create an empty, properly shaped pivot df which will be the final table, filled from df_temp.
e.g.,
df_temp = df.groupby("Year").apply(
lambda s: pd.DataFrame([(s.Target, s.Source, s.Weight)],
columns=["Target", "Source", "Weight"])
).sort_index()
df_temp.index = df_temp.index.droplevel(1) # reduce MultiIndex to 1-d
# Predetermine all possible pairwise column category combinations
product_ts = [*zip(*(pd.core.reshape.util.cartesian_product(
[df.Target.unique(), df.Source.unique()])
))]
ts_combinations = [str(x + ' ' + y) for (x, y) in product_ts]
ts_combinations
Finally, use a simple nested for loop (again, not the fastest, though pd.DataFrame.iterrows helps a bit, as shown). Because of the random sampling with replacement I had to handle multiple values, so you would probably want to remove the conditional below the second for loop; that is the step where, for each year, the three separate objects are zipped into a single row of cells via the pivoted ("Weight") x ("Target"-"Source") relation.
df_pivot = pd.DataFrame(np.zeros((70, 5000)),
columns=ts_combinations)
df_pivot.index = df_temp.index
for year, values in df_temp.iterrows():
for (target, source, weight) in zip(*values):
bivar_pair = str(target + ' ' + source)
curr_weight = df_pivot.loc[year, bivar_pair]
if curr_weight == 0.0:
df_pivot.loc[year, bivar_pair] = [weight]
# append additional values if encountered
elif type(curr_weight) == list:
df_pivot.loc[year, bivar_pair] = str(curr_weight +
[weight])
# Spotcheck:
# Verifies matching data in pivoted table vs. original for Target+Source
# combination "And Make Wine of Fruit614" across all 70 years 1947-2016
df
df_pivot['And Make Wine of Fruit614']
df[(df.Year == 1947) & (df.Target == 'And Make Wine') & (df.Source == 'of Fruit614')]