cosine_similarity giving different answer for dataframe and subset of dataframe - python

I have the following piece of code in my recommendation system, and it gives different outputs.
Scenario 1:
a = df[df.index == 5031]
b = df[df.index == 9365]
print(cosine_similarity(a,b)) #0.33
Scenario 2:
cosine_sim = cosine_similarity(df)
print(cosine_sim[5031][9365]) #0.25
I think the output for both scenarios should be the same, and scenario 1 seems more accurate given the data.
Can anyone help with this?
Dataframe looks like this.

You are mixing label-based indexing with location-based (positional) indexing.
In scenario 1 you select the vectors by their index labels:
# labels 5031 and 9365
a = df[df.index == 5031]
b = df[df.index == 9365]
The matrix returned by sklearn.metrics.pairwise.cosine_similarity knows nothing about the index labels.
Thus, before you read values out of the matrix, you need to translate the labels into location-based (positional) indices:
idx_a = df.index.get_loc(5031)
idx_b = df.index.get_loc(9365)
cosine_sim[idx_a][idx_b]
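To see that the two approaches agree once the translation is done, here is a minimal, self-contained sketch; the toy data and labels are made up, not the question's DataFrame:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy data: row labels deliberately differ from row positions.
df = pd.DataFrame([[1, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0]], index=[5031, 9365, 42])

# Scenario 1: select the two rows by label.
a = df[df.index == 5031]
b = df[df.index == 9365]
print(cosine_similarity(a, b)[0][0])

# Scenario 2: the full matrix is positional, so translate labels first.
cosine_sim = cosine_similarity(df)
idx_a = df.index.get_loc(5031)   # 0
idx_b = df.index.get_loc(9365)   # 1
print(cosine_sim[idx_a][idx_b])  # same value as Scenario 1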

Related

Speeding up loop over dataframes

I have written the code given below. There are two Pandas dataframes: df contains the columns timestamp_milli and pressure, and df2 contains the columns timestamp_milli and acceleration_z. Both dataframes have around 100'000 rows. In the code shown below, for each row of df I search for the rows of df2 whose time difference to that row's timestamp lies within a given range and is minimal.
Unfortunately the code is extremely slow. Moreover, I'm getting the following message, originating from the line df_temp["timestamp_milli"] = df_temp["timestamp_milli"] - row["timestamp_milli"]:
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame. Try using .loc[row_indexer,col_indexer] =
value instead
How can I speedup the code and solve the warning?
acceleration = []
pressure = []
for index, row in df.iterrows():
    mask = (df2["timestamp_milli"] >= (row["timestamp_milli"] - 5)) & (df2["timestamp_milli"] <= (row["timestamp_milli"] + 5))
    df_temp = df2[mask]
    # Select closest point
    if len(df_temp) > 0:
        df_temp["timestamp_milli"] = df_temp["timestamp_milli"] - row["timestamp_milli"]
        df_temp["timestamp_milli"] = df_temp["timestamp_milli"].abs()
        df_temp = df_temp.loc[df_temp["timestamp_milli"] == df_temp["timestamp_milli"].min()]
        for index2, row2 in df_temp.iterrows():
            pressure.append(row["pressure"])
            acc = row2["acceleration_z"]
            acceleration.append(acc)
I have faced a similar problem; using itertuples instead of iterrows shows a significant reduction in time.
See the discussion of why iterrows has these performance issues.
Hope this helps.
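For reference, a fully vectorized alternative is pandas.merge_asof (a different technique from the itertuples suggestion above). This is only a sketch: it assumes both frames are sorted by timestamp_milli and that taking a single nearest match per row is acceptable:
import pandas as pd

# Sketch: both frames must be sorted on the merge key.
merged = pd.merge_asof(
    df.sort_values("timestamp_milli"),
    df2.sort_values("timestamp_milli"),
    on="timestamp_milli",
    direction="nearest",  # closest timestamp in df2, before or after
    tolerance=5,          # only accept matches within +/- 5 ms
)

# Rows of df with no match within the tolerance get NaN here,
# whereas the original loop simply skips them.
pressure = merged["pressure"].tolist()
acceleration = merged["acceleration_z"].tolist()
Because nothing is assigned to a slice, this also avoids the SettingWithCopyWarning.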

Drop Pandas DataFrame lines according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('Group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 programmatically (in other languages it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max=16
Bad=[]
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df=Sample[0].loc[Sample[0]['Group']==n]  # This is inelegant, but trying to work
                                                 # with Sample[1] in the for doesn't work
        if (df['Value'].max()>my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet the specified criterion.
def my_func(row,grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value']>16).any():
        return 'Bad Row'
    else:
        return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
Group Group_Value Bad_Row
0 1 57 Good Row
1 2 63 Bad Row
Based on dubbbdan's idea, here is code that works:
my_max=16

def my_func(row,grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value']>my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x,grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row']!=0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis = 1, inplace = True)
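For comparison, the same filtering can be done without a row-wise apply, using a per-group maximum and a boolean mask. This is just a sketch that reuses the variable names above:
my_max = 16

for Sample in SampleList:
    # Maximum 'Value' per group in the element-level frame...
    group_max = Sample[0].groupby('Group')['Value'].max()
    # ...keep only the groups whose maximum stays within the limit.
    keep = group_max[group_max <= my_max].index
    Sample[1] = Sample[1][Sample[1]['Group'].isin(keep)]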

Comparing Values of Two arrays in Python

I have a large dataset I'm trying to read with Pandas. I am trying to split the values of one of the columns into two parts and check whether there are any overlapping values between these sets. With the code below, the result is that some values overlap between array 'b' and array 'c'. I want to get those values specifically but don't know how. Can anybody point me in the right direction?
df = pd.read_csv('....csv')
df2 = df[df['Freq']>= 280]
a=df2['Ring'].values
b=df2['Ring'].drop_duplicates().values
df3 = df[df['Freq']<= 280]
df3['Ring'].values
c=df3['Ring'].drop_duplicates().values
if np.all(b) == np.all(c):
    print("They are overlapping")
else:
    print("They are not overlapping")
Based on the example provided, you can do the following:
import numpy as np
np.intersect1d(b, c)
or you can also do something like:
cond = df['Freq'] >= 280
np.intersect1d(df[cond]['Ring'], df[~cond]['Ring'])
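A tiny, self-contained illustration with made-up values, just to show what intersect1d returns:
import numpy as np

b = np.array([1, 3, 5, 7])
c = np.array([2, 3, 5, 8])

print(np.intersect1d(b, c))  # [3 5] -- exactly the overlapping values
Note that the original check np.all(b) == np.all(c) only compares whether all elements of each array are truthy, so it says nothing about overlap.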

How can I speed up an iterative function on my large pandas dataframe?

I am quite new to pandas and I have a pandas dataframe of about 500,000 rows filled with numbers. I am using Python 2.x and am currently defining and calling the method shown below on it. It sets a predicted value equal to the corresponding value in series 'B' if two adjacent values in series 'A' are the same. However, it is running extremely slowly, about 5 rows are processed per second, and I want to find a way to accomplish the same result more quickly.
def myModel(df):
    A_series = df['A']
    B_series = df['B']
    seriesLength = A_series.size
    # Make a new empty column in the dataframe to hold the predicted values
    df['predicted_series'] = np.nan
    # Make a new empty column to store whether or not
    # the prediction matches B
    df['wrong_prediction'] = np.nan
    prev_B = B_series[0]
    for x in range(1, seriesLength):
        prev_A = A_series[x-1]
        prev_B = B_series[x-1]
        # set the predicted value to equal B if A has two equal values in a row
        if A_series[x] == prev_A:
            if df['predicted_series'][x] > 0:
                df['predicted_series'][x] = df['predicted_series'][x-1]
            else:
                df['predicted_series'][x] = B_series[x-1]
Is there a way to vectorize this or just make it run faster? Under the current circumstances, it is projected to take many hours. Should it really be taking this long? It doesn't seem like 500,000 rows should be giving my program that much trouble.
Something like this should work as you described:
df['predicted_series'] = np.where(A_series.shift() == A_series, B_series, df['predicted_series'])
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
This will get rid of the for loop and set predicted_series to the value of B when A is equal to previous A.
edit:
per your comment, change your initialization of predicted_series to be all NAN and then front fill the values:
df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df.predicted_series = df.predicted_series.fillna(method='ffill')
For fastest speed, modifying ayhan's answer a bit performs best:
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'].shift())
That will give you your forward-filled values and run faster than my original recommendation.
Solution
df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()
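As a quick sanity check, here is the last formulation (previous B wherever A repeats) on a tiny frame; the data is invented purely for illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3],
                   'B': [10, 20, 30, 40, 50, 60]})

# Wherever A equals the previous A, take the previous B; otherwise NaN.
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B.shift(), np.nan)
print(df)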

Transition Matrix in Dataframe Not Passing the Value

I am trying to implement a transition matrix.
Both the data and the transition matrix are in pandas DataFrames.
states_mat = pd.DataFrame(None, index=range(0,24), columns=range(0,24))

def states_update(data):
    states_vec = data['hr']
    # Do nothing if there is no sequence
    if len(states_vec) < 2:
        return
    for i in xrange(1, len(states_vec)):
        prev = states_vec[i-1]
        curr = states_vec[i]
        states_mat[curr][prev] += 1
Data are of type int64.
It is not updating the +1 count as I wanted. I believe it is some kind of type issue, but I'm not sure how to force the type. I am using a DataFrame for my data because I want to use groupby to split the data and apply the above function. Any suggestions?
OK, so the first problem, and the one whose fix resolved your issue, is that you created your states_mat dataframe with a default value of None, which becomes numpy.NaN.
You cannot add an integer to a NaN:
In [24]:
NaN + 1
Out[24]:
nan
So change the DataFrame construction to:
states_mat = pd.DataFrame(0, index=range(0,24), columns=range(0,24))
Chained subindexing is probably fine in this case, but you could also have used loc, which would work as well:
states_mat.loc[curr, prev] += 1
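Putting both fixes together, here is a sketch of the corrected setup (zero-initialized counts and a loc-based increment). The 'hr' column and the 0-23 state range come from the question; the rest is illustrative:
import pandas as pd

# Counts start at 0 instead of NaN, so += 1 accumulates correctly.
states_mat = pd.DataFrame(0, index=range(24), columns=range(24))

def states_update(data):
    # Plain list, so the iteration below is positional and unambiguous.
    states_vec = data['hr'].tolist()
    # Do nothing if there is no sequence
    if len(states_vec) < 2:
        return
    for prev, curr in zip(states_vec, states_vec[1:]):
        states_mat.loc[curr, prev] += 1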
