I have the following pandas dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"bird_type": ["falcon", "crane", "crane", "falcon"],
"avg_speed": [np.random.randint(50, 200) for _ in range(4)],
"no_of_birds_observed": [np.random.randint(3, 10) for _ in range(4)],
"reliability_of_data": [np.random.rand() for _ in range(4)],
}
)
# The dataframe looks like this.
bird_type avg_speed no_of_birds_observed reliability_of_data
0 falcon 66 3 0.553841
1 crane 159 8 0.472359
2 crane 158 7 0.493193
3 falcon 161 7 0.585865
Now, I would like to have the weighted average (according to the number_of_birds_surveyed) for the average_speed and reliability variables. For that I have a simple function as follows, which calculates the weighted average.
def func(data, numbers):
ans = 0
for a, b in zip(data, numbers):
ans = ans + a*b
ans = ans / sum(numbers)
return ans
How can I apply the function of func to both average speed and reliability variables?
I expect the answer to be a dataframe like follows
bird_type avg_speed no_of_birds_observed reliability_of_data
0 falcon 132.5 10 0.5762578
# how (66*3 + 161*7)/(3+7) (3+10) (0.553841×3+0.585865×7)/(3+7)
1 crane 158.53 15 0.4820815
# how (159*8 + 158*7)/(8+7) (8+7) (0.472359×8+0.493193×7)/(8+7)
I saw this question, but could not generalize the solution / understand it completely. I thought of not asking the question, but according to this blog post by SO and this meta question, with a different example, I think this question can be considered a "borderline duplicate". An answer will benefit me and probably some others will also find this useful. So finally decided to ask.
Don't use a function with apply, rather perform a classical aggregation:
cols = ['avg_speed', 'reliability_of_data']
# multiply relevant columns by no_of_birds_observed
# aggregate everything as sum
out = (df[cols].mul(df['no_of_birds_observed'], axis=0)
.combine_first(df)
.groupby('bird_type').sum()
)
# divide the relevant columns by the sum of no_of_birds_observed
out[cols] = out[cols].div(out['no_of_birds_observed'], axis=0)
Output:
avg_speed no_of_birds_observed reliability_of_data
bird_type
crane 158.533333 15 0.482082
falcon 132.500000 10 0.576258
If want aggregate by GroupBy.agg for weights parameter is used no_of_birds_observed by DataFrame.loc:
#for correct ouput need default (or unique values) index
df = df.reset_index(drop=True)
f = lambda x: np.average(x, weights= df.loc[x.index, 'no_of_birds_observed'])
df1 = (df.groupby('bird_type', sort=False, as_index=False)
.agg(avg=('avg_speed',f),
no_of_birds=('no_of_birds_observed','sum'),
reliability_of_data=('reliability_of_data', f)))
print (df1)
bird_type avg no_of_birds reliability_of_data
0 falcon 132.500000 10 0.576258
1 crane 158.533333 15 0.482082
Related
I have a dataframe that looks like this:
index Rod_1 label
0 [[1.94559799] [1.94498416] [1.94618273] ... [1.8941952 ] [1.89461277] [1.89435902]] F0
1 [[1.94129488] [1.94268905] [1.94327065] ... [1.93593512] [1.93689935] [1.93802091]] F0
2 [[1.94034818] [1.93996006] [1.93940095] ... [1.92700882] [1.92514855] [1.92449449]] F0
3 [[1.95784532] [1.96333782] [1.96036528] ... [1.94958261] [1.95199495] [1.95308231]] F2
Each cell in the Rod_1 column has an array of 12 million values. I'm trying the calculate the difference between every two values in this array to remove seasonality. That way my model will perform better, potentially.
This is the code that I've written:
interval = 1
for j in range(0, len(df_all['Rod_1'])):
for i in range(1, len(df_all['Rod_1'][0])):
df_all['Rod_1'][j][i - interval] = df_all['Rod_1'][j][i] - df_all['Rod_1'][j][i - interval]
I have 45 rows, and as I said each cell has 12 million values, so it takes 20 min to for my laptop to calculate this. Is there a faster way to do this?
Thanks in advance.
This should be much faster, I've tested up till 1M elements per cell for 10 rows which took 1.5 seconds to calculate the diffs (but a lot longer to make the test table)
import pandas as pd
import numpy as np
import time
#Create test data
np.random.seed(1)
num_rows = 10
rod1_array_lens = 5 #I tried with this at 1000000
possible_labels = ['F0','F1']
df = pd.DataFrame({
'Rod_1':[[[np.random.randint(10)] for _ in range(rod1_array_lens)] for _ in range(num_rows)],
'label':np.random.choice(possible_labels, num_rows)
})
#flatten Rod_1 from [[1],[2],[3]] --> [1,2,3]
#then use np.roll to make the diffs, throwing away the last element since it rolls over
start = time.time() #starting timing now
df['flat_Rod_1'] = df['Rod_1'].apply(lambda v: np.array([z for x in v for z in x]))
df['diffs'] = df['flat_Rod_1'].apply(lambda v: (np.roll(v,-1)-v)[:-1])
print('Took',time.time()-start,'to calculate diff')
I have a CSV file that looks like below, this is same like my last question but this is by using Pandas.
Group Sam Dan Bori Son John Mave
A 0.00258844 0.983322 1.61479 1.2785 1.96963 10.6945
B 0.0026034 0.983305 1.61198 1.26239 1.9742 10.6838
C 0.0026174 0.983294 1.60913 1.24543 1.97877 10.6729
D 0.00263062 0.983289 1.60624 1.22758 1.98334 10.6618
E 0.00264304 0.98329 1.60332 1.20885 1.98791 10.6505
I have a function like below
def getnewno(value):
value = value + 30
if value > 40 :
value = value - 20
else:
value = value
return value
I want to send all these values to the getnewno function and get a newvalue and update the CSV file. How can this be accomplished in Pandas.
Expected output:
Group Sam Dan Bori Son John Mave
A 30.00258844 30.983322 31.61479 31.2785 31.96963 20.6945
B 30.0026034 30.983305 31.61198 31.26239 31.9742 20.6838
C 30.0026174 30.983294 31.60913 31.24543 31.97877 20.6729
D 30.00263062 30.983289 31.60624 31.22758 31.98334 20.6618
E 30.00264304 30.98329 31.60332 31.20885 31.98791 20.6505
The following should give you what you desire.
Applying a function
Your function can be simplified and here expressed as a lambda function.
It's then a matter of applying your function to all of the columns. There are a number of ways to do so. The first idea that comes to mind is to loop over df.columns. However, we can do better than this by using the applymap or transform methods:
import pandas as pd
# Read in the data from file
df = pd.read_csv('data.csv',
sep='\s+',
index_col=0)
# Simplified function with which to transform data
getnewno = lambda value: value + 10 if value > 10 else value + 30
# Looping over columns
#for col in df.columns:
# df[col] = df[col].apply(getnewno)
# Apply to all columns without loop
df = df.applymap(getnewno)
# Write out updated data
df.to_csv('data_updated.csv')
Using broadcasting
You can achieve your result using broadcasting and a little boolean logic. This avoids looping over any columns, and should ultimately prove faster and less memory intensive (although if your dataset is small any speed-up would be negligible):
import pandas as pd
df = pd.read_csv('data.csv',
sep='\s+',
index_col=0)
df += 30
make_smaller = df > 40
df[make_smaller] -= 20
First of all, your getnewno function looks too complicated... it can be simplified to e.g.:
def getnewno(value):
if value + 30 > 40:
return value - 20
else:
return value
you can even change value + 30 > 40 to value > 10.
Or even a oneliner if you want:
getnewno = lambda value: value-20 if value > 10 else value
Having the function you can apply it to specific values/columns. For example, if want you to create a column Mark_updated basing on Mark column, it should look like this (I assume your pandas DataFrame is called df):
df['Mark_updated'] = df['Mark'].apply(getnewno)
Use the mask function to do an if-else solution, before writing the data to csv
res = (df
.select_dtypes('number')
.add(30)
#the if-else comes in here
#if any entry in the dataframe is greater than 40, subtract 20 from it
#else leave as is
.mask(lambda x: x>40, lambda x: x.sub(20))
)
#insert the group column back
res.insert(0,'Group',df.Group.array)
write to csv
res.to_csv(filename)
Group Sam Dan Bori Son John Mave
0 A 30.002588 30.983322 31.61479 31.27850 31.96963 20.6945
1 B 30.002603 30.983305 31.61198 31.26239 31.97420 20.6838
2 C 30.002617 30.983294 31.60913 31.24543 31.97877 20.6729
3 D 30.002631 30.983289 31.60624 31.22758 31.98334 20.6618
4 E 30.002643 30.983290 31.60332 31.20885 31.98791 20.6505
def stack_plot(data, xtick, col2='project_is_approved', col3='total'):
ind = np.arange(data.shape[0])
plt.figure(figsize=(20,5))
p1 = plt.bar(ind, data[col3].values)
p2 = plt.bar(ind, data[col2].values)
plt.ylabel('Projects')
plt.title('Number of projects aproved vs rejected')
plt.xticks(ind, list(data[xtick].values))
plt.legend((p1[0], p2[0]), ('total', 'accepted'))
plt.show()
def univariate_barplots(data, col1, col2='project_is_approved', top=False):
# Count number of zeros in dataframe python: https://stackoverflow.com/a/51540521/4084039
temp = pd.DataFrame(project_data.groupby(col1)[col2].agg(lambda x: x.eq(1).sum())).reset_index()
# Pandas dataframe grouby count: https://stackoverflow.com/a/19385591/4084039
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']
temp.sort_values(by=['total'],inplace=True, ascending=False)
if top:
temp = temp[0:top]
stack_plot(temp, xtick=col1, col2=col2, col3='total')
print(temp.head(5))
print("="*50)
print(temp.tail(5))
univariate_barplots(project_data, 'school_state', 'project_is_approved', False)
Error:
SpecificationError Traceback (most recent call last)
<ipython-input-21-2cace8f16608> in <module>()
----> 1 univariate_barplots(project_data, 'school_state', 'project_is_approved', False)
<ipython-input-20-856fcc83737b> in univariate_barplots(data, col1, col2, top)
4
5 # Pandas dataframe grouby count: https://stackoverflow.com/a/19385591/4084039
----> 6 temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
7 print (temp['total'].head(2))
8 temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']
~\AppData\Roaming\Python\Python36\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
251 # but not the class list / tuple itself.
252 func = _maybe_mangle_lambdas(func)
--> 253 ret = self._aggregate_multiple_funcs(func)
254 if relabeling:
255 ret.columns = columns
~\AppData\Roaming\Python\Python36\site-packages\pandas\core\groupby\generic.py in _aggregate_multiple_funcs(self, arg)
292 # GH 15931
293 if isinstance(self._selected_obj, Series):
--> 294 raise SpecificationError("nested renamer is not supported")
295
296 columns = list(arg.keys())
SpecificationError: **nested renamer is not supported**
change
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']
to
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg(total='count')).reset_index()['total']
temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg(Avg='mean')).reset_index()['Avg']
reason: in new pandas version named aggregation is the recommended replacement for the deprecated “dict-of-dicts” approach to naming the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).
source: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html
This error also happens if a column specified in the aggregation function dict does not exist in the dataframe:
In [190]: group = pd.DataFrame([[1, 2]], columns=['A', 'B']).groupby('A')
In [195]: group.agg({'B': 'mean'})
Out[195]:
B
A
1 2
In [196]: group.agg({'B': 'mean', 'non-existing-column': 'mean'})
...
SpecificationError: nested renamer is not supported
I found the way: Instead of going like
g2 = df.groupby(["Description","CustomerID"],as_index=False).agg({'Quantity':{"maxQ":np.max,"minQ":np.min,"meanQ":np.mean}})
g2.columns = ["Description","CustomerID","maxQ","minQ",'meanQ']
Do as follows:
g2 = df.groupby(["Description","CustomerID"],as_index=False).agg({'Quantity':{np.max,np.min,np.mean}})
g2.columns = ["Description","CustomerID","maxQ","minQ",'meanQ']
I had the same error and this is how I resolved it!
Do you get the same error if you change
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
to
temp['total'] = project_data.groupby(col1)[col2].agg(total=('total','count')).reset_index()['total']
Instead of using .agg({'total':'count'})), you can pass name with the function as a list of tuple like .agg([('total', 'count')])and use the same for Avg also. Hope it would work.
I have got the similar issue as #akshay jindal, but I check the documentation as suggested by #artikay Khanna, the problem solved, some functions has been adjusted, the old is deprecated. Here is the code warning provided per last time execute.
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version. Use named aggregation instead.
>>> grouper.agg(name_1=func_1, name_2=func_2)
"""Entry point for launching an IPython kernel.
Therefore, I will suggest try
grouper.agg(name_1=func_1, name_2=func_2)
Hope this will help
Not a very elegant solution but this one works. As renaming the column is deprecated with the way you are doing. But there is work around. Create a temporary variable 'approved' , store the col2 in it. Because when you apply agg function , the original column values will change with column name. You can preserve the column name but then values in those column will change. So in order to preserve the original dataframe and to have two new columns with desired names, you can use the following code.
approved = temp[col2]
temp = pd.DataFrame(project_data.groupby(col1)[col2].agg([('Avg','mean'),('total','count')]).reset_index())
temp[col2] = approved
P.S : Seems like an assignment of AAIC, I am working on same :)
Sometimes it's convenient to keep an aggdict of how each column should be transformed under aggregation that will work with different column sets and different group by columns. You can do this with the new syntax fairly easily by unpacking the dict with **. Here's a minimal working example for simple data.
dfx=pd.DataFrame(columns=["A","B","C"],data=np.random.randint(0,5,size=(10,3)))
#dfx
#
# A B C
#0 4 4 1
#1 2 4 4
#2 1 3 3
#3 2 4 3
#4 1 2 1
#5 0 4 2
#6 2 3 4
#7 1 0 2
#8 2 1 4
#9 3 0 3
Maybe when you agg you want the first "A", the last "B", the mean "C" and sometimes your pipeline has a "D" (but not this time) that you also want the mean of.
aggdict = {"A":lambda x: x.iloc[0], "B": lambda x: x.iloc[-1], "C" : "mean" , "D":lambda x: "mean"}
You can build a simple dict like the old days and then unpack it with ** filtering on the relevant keys:
gb_col="C"
gbc = dfx.groupby(gb_col).agg(**{k:(k,v) for k,v in aggdict.items() if k in dfx.columns and k != gb_col})
# A B
#C
#1 4 2
#2 0 0
#3 1 4
#4 2 3
And then you can slice and dice how you want with the same syntax:
mygb = lambda gb_col: dfx.groupby(gb_col).agg(**{k:(k,v) for k,v in aggdict.items() if k in dfx.columns and k != gb_col})
allgb = [mygb(c) for c in dfx.columns]
I have tried alll the solutions and turned out to be the error with the name. If your column name has some inbuilt keywords such as "in", "is",etc., It is throwing error. In my case, My column name is "Points in Polygon" and I have resolved the issue by renaming the column to "Points"
#Rishi's solution worked for me. The original name of the column in my dataframe was net_value_budgeted_rate, which was essentially dollar value of the sale. I changed it to dollars and it worked.
Info = pd.DataFrame(df.groupby("school_state").agg(Approved=("project_is_approved",lambda x: x.eq(1).sum()),Total=("project_is_approved","count"),Avg=("project_is_approved","mean"))).reset_index().sort_values(by=["Total"],ascending=False).head()
You can break this into individual commands for better readability.
I have two dataframes with the same columns:
Dataframe 1:
attr_1 attr_77 ... attr_8
userID
John 1.2501 2.4196 ... 1.7610
Charles 0.0000 1.0618 ... 1.4813
Genarito 2.7037 4.6707 ... 5.3583
Mark 9.2775 6.7638 ... 6.0071
Dataframe 2:
attr_1 attr_77 ... attr_8
petID
Firulais 1.2501 2.4196 ... 1.7610
Connie 0.0000 1.0618 ... 1.4813
PopCorn 2.7037 4.6707 ... 5.3583
I want to generate a correlation and p-value dataframe of all posible combinations, this would be the result:
userId petID Correlation p-value
0 John Firulais 0.091447 1.222927e-02
1 John Connie 0.101687 5.313359e-03
2 John PopCorn 0.178965 8.103919e-07
3 Charles Firulais -0.078460 3.167896e-02
The problem is that the cartesian product generates more than 3 million tuples. Taking minutes to finish. This is my code, I've written two alternatives:
First of all, initial DataFrames:
df1 = pd.DataFrame({
'userID': ['John', 'Charles', 'Genarito', 'Mark'],
'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]
}).set_index('userID')
df2 = pd.DataFrame({
'petID': ['Firulais', 'Connie', 'PopCorn'],
'attr_1': [1.2501, 0.0, 2.7037],
'attr_77': [2.4196, 1.0618, 4.6707],
'attr_8': [1.7610, 1.4813, 5.3583]
}).set_index('petID')
Option 1:
# Pre-allocate space
df1_keys = df1.index
res_row_count = len(df1_keys) * df2.values.shape[0]
genes = np.empty(res_row_count, dtype='object')
mature_mirnas = np.empty(res_row_count, dtype='object')
coff = np.empty(res_row_count)
p_value = np.empty(res_row_count)
i = 0
for df1_key in df1_keys:
df1_values = df1.loc[df1_key, :].values
for df2_key in df2.index:
df2_values = df2.loc[df2_key, :]
pearson_res = pearsonr(df1_values, df2_values)
users[i] = df1_key
pets[i] = df2_key
coff[i] = pearson_res[0]
p_value[i] = pearson_res[1]
i += 1
# After loop, creates the resulting Dataframe
return pd.DataFrame(data={
'userID': users,
'petID': pets,
'Correlation': coff,
'p-value': p_value
})
Option 2 (slower), from here:
# Makes a merge between all the tuples
def df_crossjoin(df1_file_path, df2_file_path):
df1, df2 = prepare_df(df1_file_path, df2_file_path)
df1['_tmpkey'] = 1
df2['_tmpkey'] = 1
res = pd.merge(df1, df2, on='_tmpkey').drop('_tmpkey', axis=1)
res.index = pd.MultiIndex.from_product((df1.index, df2.index))
df1.drop('_tmpkey', axis=1, inplace=True)
df2.drop('_tmpkey', axis=1, inplace=True)
return res
# Computes Pearson Coefficient for all the tuples
def compute_pearson(row):
values = np.split(row.values, 2)
return pearsonr(values[0], values[1])
result = df_crossjoin(mrna_file, mirna_file).apply(compute_pearson, axis=1)
Is there a faster way to solve such a problem with Pandas? Or I'll have no more option than parallelize the iterations?
Edit:
As the size of the dataframe increases the second option results in a better runtime, but It's still taking seconds to finish.
Thanks in advance
Of all the alternatives tested, the one that gave me the best results was the following:
An iteration product was made with
itertools.product().
All the iterations on both iterrows were performed on a Pool of
parallel processes (using a map function).
To give it a little more performance, the function compute_row_cython was compiled with Cython as it is advised in this section of the Pandas documentation:
In the cython_modules.pyx file:
from scipy.stats import pearsonr
import numpy as np
def compute_row_cython(row):
(df1_key, df1_values), (df2_key, df2_values) = row
cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]
Then I set up the setup.py:
from distutils.core import setup
from Cython.Build import cythonize
setup(name='Compiled Pearson',
ext_modules=cythonize("cython_modules.pyx")
Finally I compiled it with: python setup.py build_ext --inplace
The final code was left, then:
import itertools
import multiprocessing
from cython_modules import compute_row_cython
NUM_CORES = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(NUM_CORES)
# Calls to Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows()))
pool.close()
end_values = list(res)
pool.join()
Neither Dask, nor the merge function with the apply used gave me better results. Not even optimizing the apply with Cython. In fact, this alternative with those two methods gave me memory error, when implementing the solution with Dask I had to generate several partitions, which degraded the performance as it had to perform many I/O operations.
The solution with Dask can be found in my other question.
Here's another method using same cross join but using the built in pandas method DataFrame.corrwith and scipy.stats.ttest_ind. Since we use less "loopy" implementation, this should perform better.
from scipy.stats import ttest_ind
mrg = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop(columns='key')
x = mrg.filter(like='_x').rename(columns=lambda x: x.rsplit('_', 1)[0])
y = mrg.filter(like='_y').rename(columns=lambda x: x.rsplit('_', 1)[0])
df = mrg[['userID', 'petID']].join(x.corrwith(y, axis=1).rename('Correlation'))
df['p_value'] = ttest_ind(x, y, axis=1)[1]
userID petID Correlation p_value
0 John Firulais 1.000000 1.000000
1 John Connie 0.641240 0.158341
2 John PopCorn 0.661040 0.048041
3 Charles Firulais 0.641240 0.158341
4 Charles Connie 1.000000 1.000000
5 Charles PopCorn 0.999660 0.020211
6 Genarito Firulais 0.661040 0.048041
7 Genarito Connie 0.999660 0.020211
8 Genarito PopCorn 1.000000 1.000000
9 Mark Firulais -0.682794 0.006080
10 Mark Connie -0.998462 0.003865
11 Mark PopCorn -0.999569 0.070639
I have a pandas dataframe with a format exactly like the one in this question and I'm trying to achieve the same result. In my case, I am calculating the fuzz-ratio between the row's index and it's corresponding col.
If I try this code (based on the answer to the linked question)
def get_similarities(x):
return x.index + x.name
test_df = test_df.apply(get_similarities)
the concatenation of the row index and col name happens cell-wise, just as intended. Running type(test_df) returns pandas.core.frame.DataFrame, as expected.
However, if I adapt the code to my scenario like so
def get_similarities(x):
return fuzz.partial_ratio(x.index, x.name)
test_df = test_df.apply(get_similarities)
it doesn't work. Instead of a dataframe, I get back a series (the return type of that function is an int)
I don't understand why the two samples would not behave the same nor how to fix my code so it returns a dataframe, with the fuzzy.ratio for each cell between the a row's index for that cell and the col name for that cell.
what about the following approach?
assuming that we have two sets of strings:
In [245]: set1
Out[245]: ['car', 'bike', 'sidewalk', 'eatery']
In [246]: set2
Out[246]: ['walking', 'caring', 'biking', 'eating']
Solution:
In [247]: from itertools import product
In [248]: res = np.array([fuzz.partial_ratio(*tup) for tup in product(set1, set2)])
In [249]: res = pd.DataFrame(res.reshape(len(set1), -1), index=set1, columns=set2)
In [250]: res
Out[250]:
walking caring biking eating
car 33 100 0 33
bike 25 25 75 25
sidewalk 73 20 22 36
eatery 17 33 0 50
There is a way to accomplish this via DataFrame.apply with some row manipulations.
Assuming the 'test_df` is as follows:
In [73]: test_df
Out[73]:
walking caring biking eating
car carwalking carcaring carbiking careating
bike bikewalking bikecaring bikebiking bikeeating
sidewalk sidewalkwalking sidewalkcaring sidewalkbiking sidewalkeating
eatery eaterywalking eaterycaring eaterybiking eateryeating
In [74]: def get_ratio(row):
...: return row.index.to_series().apply(lambda x: fuzz.partial_ratio(x,
...: row.name))
...:
In [75]: test_df.apply(get_ratio)
Out[75]:
walking caring biking eating
car 33 100 0 33
bike 25 25 75 25
sidewalk 73 20 22 36
eatery 17 33 0 50
It took some digging, but I figured it out. The problem comes from the fact that DataFrame.apply is either applied column-wise or row-wise, not cell by cell. So your get_similarities function is actually getting access to an entire row or column of data at a time! By default it gets the entire column -- so to solve your problem, you just have to make a get_similarities function that returns a list where you manually call fuzz.partial_ratio on each element, like this:
import pandas as pd
from fuzzywuzzy import fuzz
def get_similarities(x):
l = []
for rname in x.index:
print "Getting ratio for %s and %s" % (rname, x.name)
score = fuzz.partial_ratio(rname,x.name)
print "Score %s" % score
l.append(score)
print len(l)
print
return l
a = pd.DataFrame([[1,2],[3,4]],index=['apple','banana'], columns=['aple','banada'])
c = a.apply(get_similarities,axis=0)
print c
print type(c)
I left my print statements in their so you can see what the DataFrame.apply call is doing for yourself -- that's when it clicked for me.