Calculating Spatial Distance returns operand error - python

This is a follow-up question to my previous question.
I have a dataframe like this:
Company_id year dummy_1 dummy_2 dummy_3 dummy_4 dummy_5
1 1990 1 0 1 1 1
1 1991 0 0 1 1 0
1 1992 0 0 1 1 0
1 1993 1 0 1 1 0
1 1994 0 1 1 1 0
1 1995 0 0 1 1 0
1 1996 0 0 1 1 1
I created a NumPy array column by:
df = df.assign(vector = df.iloc[:, -5:].values.tolist())
df['vector'] = df['vector'].apply(np.array)
I want to compare each company's distinctiveness in terms of its strategic practices relative to rivals over the last 5 years. Here is the code that I use:
df.sort_values('year', ascending=False)
# These will be our lists of differences.
diffs = []
# Loop over all unique dates
for date in df.year.unique():
    # Only take dates earlier than the current date.
    compare_df = df.loc[df.year - date <= 5].copy()
    # Loop over each company for this date
    for row in df.loc[df.year == date].itertuples():
        # If no data is available, use NaNs.
        if compare_df.empty:
            diffs.append(float('nan'))
        # Calculate cosine and fill in otherwise
        else:
            compare_df['distinctivness'] = spatial.distance.cosine(np.array(compare_df.vector), np.array(row.vector))
            row_of_interest = compare_df.distinctivness.mean()
            diffs.append(row_of_interest.distinctivness.values[0])
However, I get:
compare_df['distinctivness'] = spatial.distance.cosine(np.array(compare_df.vector) - np.array(row.vector))
ValueError: operands could not be broadcast together with shapes (29254,) (93,)
How could I get rid of this problem?
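The broadcast fails because np.array(compare_df.vector) produces a 1-D object array with one element per row rather than a 2-D matrix, and spatial.distance.cosine only accepts a pair of 1-D vectors in the first place. A minimal sketch of the comparison step, assuming the goal is the mean cosine distance between the focal row's vector and every row of compare_df:

import numpy as np
from scipy import spatial

# Sketch only: stack the per-row vectors into a real 2-D array so the
# shapes line up, then compare all rows against the focal vector at once.
vectors = np.stack(compare_df['vector'].to_numpy())   # shape (n_rows, 5)
focal = np.asarray(row.vector).reshape(1, -1)         # shape (1, 5)
# cdist computes all pairwise cosine distances in one call.
distances = spatial.distance.cdist(vectors, focal, metric='cosine')
diffs.append(distances.mean())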


Find the number of rows a value is away from a specific row value in a Python DataFrame

I have the following dataframe, df, and I would like to add the 'distance' column to it, such that:
date        active  distance
01/09/2022  1       0
02/09/2022  0       1
05/09/2022  0       2
06/09/2022  0       3
07/09/2022  0       4
08/09/2022  1       0
09/09/2022  0       1
Here, the distance is how far away each row is from the previous '1' in the active column, measured in business days (one row per business day). I have tried using the following:
df['distance'] = np.where(
    df['active'] == 1, 0, df['distance'].shift(1, fill_value=0).astype(int) + 1
)
But it seems that Python does not like me referencing a column as I am defining it. I also tried to define a function for this, but I am unsure how to do so with .shift(), which seems necessary in order to take the previous value and add to it.
Other variations of the above code do not seem to work either, since Python really wants to concatenate the shifted value and the 1 instead of just summing them.
Create groups by comparing 'active' with 1 and taking Series.cumsum, then number the rows within each group with GroupBy.cumcount:
df['distance'] = df.groupby(df['active'].eq(1).cumsum()).cumcount()
print (df)
date active distance
0 01/09/2022 1 0
1 02/09/2022 0 1
2 05/09/2022 0 2
3 06/09/2022 0 3
4 07/09/2022 0 4
5 08/09/2022 1 0
6 09/09/2022 0 1
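To see why this works, it helps to print the intermediate grouper for the sample data above (a small sketch):

print(df['active'].eq(1).cumsum().tolist())
# [1, 1, 1, 1, 1, 2, 2] -> every 1 in 'active' opens a new group label,
# and cumcount() restarts its 0, 1, 2, ... numbering inside each label.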
Your column can be entirely defined from the "active" column. Your formula is the same as:
count_up = pd.Series(np.arange(len(df)), index=df.index)
distance = count_up - count_up.where(df.active).ffill()
Use cumsum to mark the active groups.
g = (df['active']==1).cumsum()
df = df.assign(distance=g.groupby(g).transform(lambda x: range(len(x))))
print(df)
Result
date active distance
0 01/09/2022 1 0
1 02/09/2022 0 1
2 05/09/2022 0 2
3 06/09/2022 0 3
4 07/09/2022 0 4
5 08/09/2022 1 0
6 09/09/2022 0 1
There are surely myriads of approaches that all get the same result. Here are six of them:
# ======================================================================
# ----------------------------------------------------------------------
# Provided in other answers (and fixed if necessary)
# Using merely pandas own methods:
df['distance'] = df.groupby(df['active'].eq(1).cumsum()).cumcount()
# nice pure pandas and short one - in my eyes the best choice
print(df)
# -------------------------------
cnt = pd.Series(np.arange(df.shape[0]), index=df.index)
distance = (cnt - cnt.where(df.active.astype(bool)).ffill()).astype(int)
df['distance'] = distance
# a much longer pure pandas one
print(df)
# -------------------------------
g = (df['active']==1).cumsum()
df = df.assign(distance=g.groupby(g).transform(lambda x: range(len(x))))
# using in addition a function as replacement for .cumcount()
print(df)
# ======================================================================
# ----------------------------------------------------------------------
# Using a loop over values in column 'active':
d = []
c = -1
for i in df['active']:
    c += 1
    if i:
        c = 0
    d.append(c)
df["distance"] = d
print(df)
# ----------------------------------------------------------------------
# Using a function
c = -1
def f(i):
    global c
    if i:
        c = 0
    else:
        c += 1
    return c
# -------------------------------
# with a list comprehension:
df['distance'] = [f(i) for i in df['active']]
print(df)
# -------------------------------
# or pandas apply() function:
df['distance'] = df['active'].apply(f)
print(df)
Below is one of them, including the full code with data:
import pandas as pd
import numpy as np
df_print = """\
date active
01/09/2022 1
02/09/2022 0
05/09/2022 0
06/09/2022 0
07/09/2022 0
08/09/2022 1
09/09/2022 0"""
open('df_print', 'w').write(df_print)
df = pd.read_table('df_print', sep=r'\s+')  # index_col = 0)
print(df)
distance = []
counter = -1
for index, row in df.iterrows():
    if row['active']:
        counter = 0
        distance.append(counter)
        continue
    counter += 1
    distance.append(counter)
df["distance"] = distance
print(df)
gives:
date active
0 01/09/2022 1
1 02/09/2022 0
2 05/09/2022 0
3 06/09/2022 0
4 07/09/2022 0
5 08/09/2022 1
6 09/09/2022 0
date active distance
0 01/09/2022 1 0
1 02/09/2022 0 1
2 05/09/2022 0 2
3 06/09/2022 0 3
4 07/09/2022 0 4
5 08/09/2022 1 0
6 09/09/2022 0 1

Check if n consecutive elements equals x and any previous element is greater than x

I have a pandas dataframe with 6-minute readings. I want to mark each row as either NF or DF.
NF = rows with 5 consecutive entries being 0 and at least one prior reading being greater than 0
DF = All other rows that do not meet the NF rule
[[4,6,7,2,1,0,0,0,0,0]
[6,0,0,0,0,0,2,2,2,5]
[0,0,0,0,0,0,0,0,0,0]
[0,0,0,0,0,4,6,7,2,1]]
Expected Result:
[NF, NF, DF, DF]
Can I use a sliding window for this? What is a good pythonic way of doing this?
A vectorised NumPy solution using two conditions operating on a truth matrix:
- it uses the fact that True is 1, so cumsum() can be used
- the position of the 5th zero should be 4 places higher than the 1st
- if you just want the array, np.where() gives that without assigning it back to a dataframe column
- another test case, [1,0,0,0,0,1,0,0,0,0], was added, where there are many zeros but not 5 consecutive
df = pd.DataFrame([[4,6,7,2,1,0,0,0,0,0],
                   [6,0,0,0,0,0,2,2,2,5],
                   [0,0,0,0,0,0,0,0,0,0],
                   [1,0,0,0,0,1,0,0,0,0],
                   [0,0,0,0,0,4,6,7,2,1]])
df = df.assign(res=np.where(
    # five consecutive zeros
    ((np.argmax(np.cumsum(df.eq(0).values, axis=1) == 1, axis=1) + 4) ==
     np.argmax(np.cumsum(df.eq(0).values, axis=1) == 5, axis=1)) &
    # first zero somewhere other than 0th position
    (np.argmax(df.eq(0).values, axis=1) > 0),
    "NF", "DF")
)
   0  1  2  3  4  5  6  7  8  9 res
0  4  6  7  2  1  0  0  0  0  0  NF
1  6  0  0  0  0  0  2  2  2  5  NF
2  0  0  0  0  0  0  0  0  0  0  DF
3  1  0  0  0  0  1  0  0  0  0  DF
4  0  0  0  0  0  4  6  7  2  1  DF
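To make the cumsum/argmax trick concrete, here is a small sketch of the 5th-zero check on a single row:

import numpy as np

row = np.array([6, 0, 0, 0, 0, 0, 2, 2, 2, 5])
z = np.cumsum(row == 0)      # running count of zeros seen so far
print(z.tolist())            # [0, 1, 2, 3, 4, 5, 5, 5, 5, 5]
first = np.argmax(z == 1)    # index of the 1st zero -> 1
fifth = np.argmax(z == 5)    # index of the 5th zero -> 5
print(first + 4 == fifth)    # True -> the five zeros are consecutive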

Pandas - scoring column

I have data about product sales (1 column per product) at the customer level (1 row per customer).
I'm assessing which customers are more likely to be interested in a specific product. I have a list of the 10 most correlated products (and I have this for multiple products, so I'm trying to build a scalable approach).
I'm trying to score all customers based on how many of those 10 products they buy.
Let's say my list is:
prod_x_corr_prod
How can I create a scoring column (say prox_x_propensity) which goes through the 10 relevant columns, for every row, and for each column with a value > 0 adds 1?
For instance, if customer Y bought 3 of the products correlated with product X, he would have a score of 3 in the "prox_x_score" column.
EDIT: thanks to all of you for the feedback.
For customer 5 I would get a 2, while for 1, 2 and 3 I would get 1. For 4, 0.
You can do:
df['prox_x_score'] = (df[prod_x_corr_prod] > 0).sum(axis=1)
Example with dummy data:
import numpy as np
import pandas as pd
prod_x_corr_prod = ["prod{}".format(i) for i in range(1, 11)]
df = pd.DataFrame({col:np.random.choice([0,1], size=5) for col in prod_x_corr_prod})
df['prox_x_score'] = (df[prod_x_corr_prod] > 0).sum(axis=1)
print(df)
Output:
prod1 prod10 prod2 prod3 prod4 prod5 prod6 prod7 prod8 prod9 \
0 1 1 1 0 0 1 1 1 1 0
1 1 1 1 0 1 0 0 1 1 0
2 1 1 1 1 0 1 0 0 1 0
3 0 0 0 0 0 0 1 0 1 0
4 0 0 0 0 0 0 0 1 1 0
prox_x_score
0 7
1 6
2 6
3 2
4 2

How can I add new columns using another dataframe (related to string columns) in Pandas

Confusing title, let me explain. I have 2 dataframes like this:
A dataframe named df1 looks like this (with millions of rows in the original):
id  text                            c1
1   Hello world how are you people  1
2   Hello people I am fine people   1
3   Good Morning people             -1
4   Good Evening                    -1
Dataframe named df2 looks like this:
Word     count  Points  Percentage
hello    2      2       100
world    1      1       100
how      1      1       100
are      1      1       100
you      1      1       100
people   3      1       33.33
I        1      1       100
am       1      1       100
fine     1      1       100
Good     2      -2      -100
Morning  1      -1      -100
Evening  1      -1      -100
df2 columns explanation:
count means the total number of times that word appeared in df1
points is points given to each word by some kind of algorithm
percentage = points/count*100
Now, I want to add 40 new columns in df1, according to the points & percentage. They will look like this:
perc_-90_2 perc_-80_2 perc_-70_2 perc_-60_2 perc_-50_2 perc_-40_2 perc_-20_2 perc_-10_2 perc_0_2 perc_10_2 perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 perc_80_2 perc_90_2
perc_-90_1 perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 perc_-20_1 perc_-10_1 perc_0_1 perc_10_1 perc_20_1 perc_30_1 perc_40_1 perc_50_1 perc_60_1 perc_70_1 perc_80_1 perc_90_1
Let me break it down. The column name contains 3 parts:
1.) perc is just a string; it means nothing.
2.) Numbers from the range -90 to +90. For example, -90 here means the percentage in df2 is -90. Now for example, if a word has a percentage value in the range 81-90, then there will be a value of 1 in that row, in the column named perc_80_xx. The xx is the third part.
3.) The third part is the count. Here I want two types of counts: 1 and 2. As in the example in point 2, if the word count is in the range 0 to 1, then the value will be 1 in the perc_80_1 column. If the word count is 2 or more, then the value will be 1 in the perc_80_2 column.
I hope it is not too confusing.
Use:
#change previous answer - add id for matching
df2 = (df.drop_duplicates(['id','Word'])
         .groupby('Word', sort=False)
         .agg({'c1':['sum','size'], 'id':'first'}))
df2.columns = df2.columns.map(''.join)
df2 = df2.reset_index()
df2 = df2.rename(columns={'c1sum':'Points','c1size':'Totalcount','idfirst':'id'})
df2['Percentage'] = df2['Points'] / df2['Totalcount'] * 100

s1 = df2['Percentage'].div(10).astype(int).mul(10).astype(str)
s2 = np.where(df2['Totalcount'] == 1, '1', '2')
#s2 = np.where(df1['Totalcount'].isin([0,1]), '1', '2')
#create column by joining the parts
df2['new'] = 'perc_' + s1 + '_' + s2

#create indicator DataFrame
df3 = pd.get_dummies(df2[['id','new']].drop_duplicates().set_index('id'),
                     prefix='',
                     prefix_sep='').max(level=0)
print (df3)

#reindex to add missing columns
c = 'perc_' + pd.Series(np.arange(-100, 110, 10).astype(str)) + '_'
cols = (c + '1').append(c + '2')

#join to original df1
df = df1.join(df3.reindex(columns=cols, fill_value=0), on='id')
print (df)
id text c1 perc_-100_1 perc_-90_1 \
0 1 Hello world how are you people 1 0 0
1 2 Hello people I am fine people 1 0 0
2 3 Good Morning people -1 1 0
3 4 Good Evening -1 1 0
perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 ... perc_10_2 \
0 0 0 0 0 0 ... 0
1 0 0 0 0 0 ... 0
2 0 0 0 0 0 ... 0
3 0 0 0 0 0 ... 0
perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 \
0 0 1 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
perc_80_2 perc_90_2 perc_100_2
0 0 0 1
1 0 0 0
2 0 0 0
3 0 0 0
[4 rows x 45 columns]
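The bucket naming step is easier to follow in isolation. A small sketch of what div(10).astype(int).mul(10) does (the sample values here are made up):

import pandas as pd

perc = pd.Series([33.33, 100.0, -100.0, 85.0])
print(perc.div(10).astype(int).mul(10).tolist())
# [30, 100, -100, 80] -> truncates toward zero to the nearest 10;
# this number becomes the middle part of the perc_<bucket>_<count> name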

Vectorizing conditional count in Pandas

I have a Pandas script that counts the number of readmissions to hospital within 30 days based on a few conditions. I wonder if it could be vectorized to improve performance. I've experimented with df.rolling().apply, but so far without luck.
Here's a table with contrived data to illustrate:
ID VISIT_NO ARRIVED LEFT HAD_A_MASSAGE BROUGHT_A_FRIEND
1 1 29/02/1996 01/03/1996 0 1
1 2 01/12/1996 04/12/1996 1 0
2 1 20/09/1996 21/09/1996 1 0
3 1 27/06/1996 28/06/1996 1 0
3 2 04/07/1996 06/07/1996 0 1
3 3 16/07/1996 18/07/1996 0 1
4 1 21/02/1996 23/02/1996 0 1
4 2 29/04/1996 30/04/1996 1 0
4 3 02/05/1996 02/05/1996 0 1
4 4 02/05/1996 03/05/1996 0 1
5 1 03/10/1996 05/10/1996 1 0
5 2 07/10/1996 08/10/1996 0 1
5 3 10/10/1996 11/10/1996 0 1
First, I create a dictionary with IDs:
ids = massage_df[massage_df['HAD_A_MASSAGE'] == 1]['ID']
id_dict = {id:0 for id in ids}
Everybody in this table has had a massage, but in my real dataset, not all people are so lucky.
Next, I run this bit of code:
for grp, df in massage_df.groupby(['ID']):
    date_from = df.loc[df[df['HAD_A_MASSAGE']==1].index, 'LEFT']
    date_to = date_from + DateOffset(days=30)
    mask = ((date_from.values[0] < df['ARRIVED']) &
            (df['ARRIVED'] <= date_to.values[0]) &
            (df['BROUGHT_A_FRIEND'] == 1))
    if len(df[mask]) > 0:
        id_dict[df['ID'].iloc[0]] = len(df[mask])
Basically, I want to count the number of times when someone originally came in for a massage (single or with a friend) and then came back within 30 days with a friend. The expected results for this table would be a total of 6 readmissions for IDs 3, 4 and 5.
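For reference, one possible way to drop the per-ID Python loop entirely. This is only a sketch, assuming ARRIVED and LEFT are already datetime columns and that the relevant massage is each ID's earliest one; the intermediate names first_massage, left and counts are made up here:

import pandas as pd

# Date the first massage ended, per ID (IDs with no massage get NaT).
first_massage = (massage_df.loc[massage_df['HAD_A_MASSAGE'] == 1]
                 .groupby('ID')['LEFT'].min())

# Broadcast that discharge date back onto every visit row of the same ID.
left = massage_df['ID'].map(first_massage)

# Visits arriving with a friend within 30 days of that discharge.
mask = ((massage_df['ARRIVED'] > left) &
        (massage_df['ARRIVED'] <= left + pd.Timedelta(days=30)) &
        (massage_df['BROUGHT_A_FRIEND'] == 1))

counts = massage_df.loc[mask].groupby('ID').size()
print(counts.to_dict())   # {3: 2, 4: 2, 5: 2} for the sample data above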
