How to calculate a pairwise co-occurrence matrix from a dataframe? - python

I have a dataframe with about 800,000 rows and 16 columns; below is an example from the data:
import pandas as pd
import datetime
start = datetime.datetime.now()
print('Starting time,'+str(start))
dict1 = {'id': ['person1','person2','person3','person4','person5'],
         'food1': ['A','A','A','C','D'],
         'food2': ['B','C','B','A','B'],
         'food3': ['','D','C','',''],
         'food4': ['','','D','','']}
demo = pd.DataFrame(dict1)
demo
>>>Out[13]
Starting time,2022-11-30 12:08:41.414807
id food1 food2 food3 food4
0 person1 A B
1 person2 A C D
2 person3 A B C D
3 person4 C A
4 person5 D B
My ideal result format is as follows:
>>>Out[14]
A B C D
A 0 2 3 2
B 2 0 1 2
C 3 1 0 2
D 2 2 2 0
I did the following:
I've searched through Stack Overflow and Google, but so far haven't come across an answer that helps with my problem.
I tried to code it myself. My idea was to first build each pairing, then combine everything into a string, and finally count the duplicates, but it's still a work in progress given my coding ability. Also, a "new" combination formed by the second element of one pair and the first element of another pair may cause errors when counting duplicates.
Thank you for your help.
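For reference, the pairing idea described above can be implemented directly; below is a minimal sketch using itertools.combinations and collections.Counter (the names pairs, labels and cooc are illustrative). Sorting each person's foods makes (A, B) and (B, A) the same pair, which sidesteps the duplicate-combination problem mentioned, and the diagonal stays 0 as in the ideal output:
from collections import Counter
from itertools import combinations

# count each unordered food pair once per person
pairs = Counter()
for _, row in demo.iloc[:, 1:].iterrows():
    foods = sorted({v for v in row if v})   # set drops repeats; '' is falsy
    pairs.update(combinations(foods, 2))

labels = sorted({f for pair in pairs for f in pair})
cooc = pd.DataFrame(0, index=labels, columns=labels)
for (a, b), n in pairs.items():
    cooc.loc[a, b] = cooc.loc[b, a] = n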

You could try this:
import numpy as np

# 0/1 indicator of which foods each person has ('' treated as missing)
out = (pd.get_dummies(demo.iloc[:, 1:].replace('', np.nan).stack())
         .groupby(level=0).sum().ne(0).astype(int))
final = out.T.dot(out).astype(float)   # co-occurrence via matrix product
np.fill_diagonal(final.values, np.nan)
>>>final
A B C D
A NaN 2.0 3.0 2.0
B 2.0 NaN 1.0 2.0
C 3.0 1.0 NaN 2.0
D 2.0 2.0 2.0 NaN

If I understand your goal correctly, you can start from the unique values:
# unique food values, ignoring the id column and empty-string placeholders
stacked = demo[[x for x in demo.columns if 'id' not in x]].stack()
uniques = stacked[stacked.ne('')].unique()
pd.DataFrame(index=uniques, columns=uniques)
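This only builds an empty scaffold. A hedged way to fill it (assuming the empty strings were excluded as above) is to count, for each pair of foods, the rows in which both appear:
foods = demo.iloc[:, 1:]
mat = pd.DataFrame(0, index=uniques, columns=uniques)
for a in uniques:
    for b in uniques:
        if a != b:
            # people whose row contains both food a and food b
            mat.loc[a, b] = (foods.eq(a).any(axis=1) & foods.eq(b).any(axis=1)).sum()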

Related

Pandas groupby multiple columns to compare values

My df looks like this: (There are dozens of other columns in the df but these are the three I am focused on)
Param Value Limit
A 1.50 1
B 2.50 1
C 2.00 2
D 2.00 2.5
E 1.50 2
I am trying to use pandas to count how many [Value] entries are less than [Limit] per [Param], hoping to get a list like this:
Param Count
A 1
B 1
C 1
D 0
E 0
I've tried a few methods, the first being:
value_count = df.loc[df['Value'] < df['Limit']].count()
but this just gives the full count per column in the df.
I've also tried the groupby function, which I think is the right idea, by creating a subset of the df with the chosen columns:
df_below_limit = df[df['Value'] < df['Limit']]
df_below_limit.groupby('Param')['Value'].count()
This is nearly what I want, but it excludes the Params with a zero count, which I also need. I'm not sure how to get the list in the form I need.
Assuming you want the count per Param, you can use:
out = df['Value'].ge(df['Limit']).groupby(df['Param']).sum()
output:
Param
A 1
B 2
C 1
D 0
E 0
dtype: int64
used input (with a duplicated row "B" for the example):
Param Value Limit
0 A 1.5 1.0
1 B 2.5 1.0
2 B 2.5 1.0
3 C 2.0 2.0
4 D 2.0 2.5
5 E 1.5 2.0
As a DataFrame:
df['Value'].ge(df['Limit']).groupby(df['Param']).sum().reset_index(name='Count')
# or
df['Value'].ge(df['Limit']).groupby(df['Param']).agg(Count='sum').reset_index()
output:
Param Count
0 A 1
1 B 2
2 C 1
3 D 0
4 E 0
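Note that ge counts rows where Value is greater than or equal to Limit, which is what the expected output reflects; if you literally want the strict "less than" count from the title, the same pattern should work with lt:
# count rows where Value is strictly below Limit, per Param
df['Value'].lt(df['Limit']).groupby(df['Param']).sum().reset_index(name='Count')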

Pandas fill row values using previous period

One additional note to describe the problem better: in the actual data set there is also a column called store, and the table can be grouped by store, date & product. When I tried the pivot solution and the cartesian product solution, it did not work. Is there a solution that could work for 3 grouping columns? Also, the table has millions of rows.
Assuming a data frame with the following format:
d = {'product': ['a', 'b', 'c', 'a', 'b'],
     'amount': [1, 2, 3, 5, 2],
     'date': ['2020-6-6', '2020-6-6', '2020-6-6', '2020-6-7', '2020-6-7']}
df = pd.DataFrame(data=d)
print(df)
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 c 3 2020-6-6
3 a 5 2020-6-7
4 b 2 2020-6-7
Product c is no longer present on the date 2020-6-7. I want to be able to calculate things like the percent change or difference in the amount of each product.
For example: df['diff'] = df.groupby('product')['amount'].diff()
But in order for this to work and show, for example, that the difference for c is -3 and -100%, c would need to be present on the next date with the amount set to 0.
This is the result I am looking for:
print(df)
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 c 3 2020-6-6
3 a 5 2020-6-7
4 b 2 2020-6-7
5 c 0 2020-6-7
Please note this is just a snippet of the data frame; in reality there may be many date periods. I am only looking to fill in the product and amount for the first date after it has been removed, not all later dates.
What is the best way to go about this?
Let us try pivot then unstack:
out = (df.pivot(index='product', columns='date', values='amount')
         .fillna(0).unstack().reset_index(name='amount'))
date product amount
0 2020-6-6 a 1.0
1 2020-6-6 b 2.0
2 2020-6-6 c 3.0
3 2020-6-7 a 5.0
4 2020-6-7 b 2.0
5 2020-6-7 c 0.0
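With the zeros exposed, the diff and percent-change calculations from the question should then work on this output; a quick hedged check:
out['diff'] = out.groupby('product')['amount'].diff()
out['pct'] = out.groupby('product')['amount'].pct_change()  # c: -3.0 and -1.0 (i.e. -100%)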
You could use the complete function from pyjanitor to explicitly expose the missing values, combined with fillna to fill them with 0:
# pip install pyjanitor
import janitor
df.complete(['date', 'product']).fillna(0)
date product amount
0 2020-6-6 a 1.0
1 2020-6-6 b 2.0
2 2020-6-6 c 3.0
3 2020-6-7 a 5.0
4 2020-6-7 b 2.0
5 2020-6-7 c 0.0
Another way is to create a cartesian product of your products & dates, then join that to your main dataframe to fill in the missing rows.
# df['date'] = pd.to_datetime(df['date'])
# ensure you have a proper datetime object
s = pd.merge(df[['product']].drop_duplicates().assign(ky=-1),
             df[['date']].drop_duplicates().assign(ky=-1),
             on=['ky']).drop(columns='ky')
df1 = pd.merge(df, s, on=['product', 'date'], how='outer').fillna(0)
print(df1)
product amount date
0 a 1.0 2020-06-06
1 b 2.0 2020-06-06
2 c 3.0 2020-06-06
3 a 5.0 2020-06-07
4 b 2.0 2020-06-07
5 c 0.0 2020-06-07
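Regarding the addendum about a third store column: a hedged sketch (assuming columns named store, date and product, per the question's note) builds the full MultiIndex of combinations and reindexes against it, which avoids the pairwise merges:
# all combinations of store x date x product
full = pd.MultiIndex.from_product(
    [df['store'].unique(), df['date'].unique(), df['product'].unique()],
    names=['store', 'date', 'product'])
out3 = (df.set_index(['store', 'date', 'product'])
          .reindex(full, fill_value=0)
          .reset_index())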

Python Dataframe filling up non existing

I was wondering if there is an efficient way to add rows to a DataFrame that, e.g., contain the average or a predefined value whenever there are not enough rows for a specific value in another column. Since the description of the problem is not the best, you can find an example below:
Say we have the Dataframe
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
And we want to have 2 rows for each client A, B, C, D, no matter whether these 2 rows already exist or not. So for clients A and B we can just copy the rows. For C we want to add a row with Client = C, NumberOfProducts = average of the existing rows = 9, and ID is not of interest (so we could set it to ID = smallest existing one - 1 = 0; any other value, even NaN, would also be possible). For client D there does not exist a single row, so we want to add 2 rows where NumberOfProducts equals the constant 2.5. The output should then look like this:
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
C 9 0
D 2.5 NaN
D 2.5 NaN
What I have done so far is to loop through the dataframe and add rows where necessary. Since this is pretty inefficient, any better solution would be highly appreciated.
Use:
clients = ['A','B','C','D']
N = 2
#keep only clients from the list and at most 2 rows per client
df = df[df['Client'].isin(clients)].groupby('Client').head(N)
#create a helper counter and reshape by unstack
df1 = df.set_index(['Client', df.groupby('Client').cumcount()]).unstack()
#if a client has only 1 row, fill the second NumberOfProducts from the first
df1[('NumberOfProducts', 1)] = df1[('NumberOfProducts', 1)].fillna(df1[('NumberOfProducts', 0)])
#... and the second ID from the first, minus 1
df1[('ID', 1)] = df1[('ID', 1)].fillna(df1[('ID', 0)] - 1)
#add missing clients by reindex
df1 = df1.reindex(clients)
#fill NumberOfProducts for missing clients with the constant 2.5
df1['NumberOfProducts'] = df1['NumberOfProducts'].fillna(2.5)
print(df1)
NumberOfProducts ID
0 1 0 1
Client
A 1.0 5.0 2.0 1.0
B 1.0 6.0 2.0 1.0
C 9.0 9.0 1.0 0.0
D 2.5 2.5 NaN NaN
#reshape back to the original format
df2 = df1.stack().reset_index(level=1, drop=True).reset_index()
print(df2)
Client NumberOfProducts ID
0 A 1.0 2.0
1 A 5.0 1.0
2 B 1.0 2.0
3 B 6.0 1.0
4 C 9.0 1.0
5 C 9.0 0.0
6 D 2.5 NaN
7 D 2.5 NaN
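A hedged alternative sketch, which matches the "average of existing rows" wording more literally and leaves the new IDs as NaN (which the question allows), reindexes on (Client, occurrence) pairs directly:
idx = pd.MultiIndex.from_product([clients, range(N)], names=['Client', None])
alt = df.set_index(['Client', df.groupby('Client').cumcount()]).reindex(idx)
# fill missing NumberOfProducts with the client's mean, then the constant 2.5
alt['NumberOfProducts'] = (alt.groupby(level=0)['NumberOfProducts']
                              .transform(lambda s: s.fillna(s.mean()))
                              .fillna(2.5))
alt = alt.reset_index(level=1, drop=True).reset_index()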

How to replace missing values with group mode in Pandas?

I followed the method in this post to replace missing values with the group mode, but encountered an "IndexError: index out of bounds".
df['SIC'] = df.groupby('CIK').SIC.apply(lambda x: x.fillna(x.mode()[0]))
I guess this is probably because some groups have all missing values and do not have a mode. Is there a way to get around this? Thank you!
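A minimal guarded variant of that one-liner (assuming the error indeed comes from groups whose values are all missing) computes the group mode with transform, which aligns with the original index, and falls back to NaN when the mode is empty:
import numpy as np

# mode() returns an empty Series for all-NaN groups; guard against that
group_mode = df.groupby('CIK')['SIC'].transform(
    lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
df['SIC'] = df['SIC'].fillna(group_mode)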
mode is quite difficult, given that there really isn't any agreed upon way to deal with ties. Plus it's typically very slow. Here's one way that will be "fast". We'll define a function that calculates the mode for each group, then we can fill the missing values afterwards with a map. We don't run into issues with missing groups, though for ties we arbitrarily choose the modal value that comes first when sorted:
def fast_mode(df, key_cols, value_col):
    """
    Calculate a column mode, by group, ignoring null values.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame over which to calculate the mode.
    key_cols : list of str
        Columns to groupby for calculation of mode.
    value_col : str
        Column for which to calculate the mode.

    Returns
    -------
    pandas.DataFrame
        One row for the mode of value_col per key_cols group. If ties,
        returns the one which is sorted first.
    """
    return (df.groupby(key_cols + [value_col]).size()
              .to_frame('counts').reset_index()
              .sort_values('counts', ascending=False)
              .drop_duplicates(subset=key_cols)
              .drop(columns='counts'))
Sample data df:
CIK SIK
0 C 2.0
1 C 1.0
2 B NaN
3 B 3.0
4 A NaN
5 A 3.0
6 C NaN
7 B NaN
8 C 1.0
9 A 2.0
10 D NaN
11 D NaN
12 D NaN
Code:
df.loc[df.SIK.isnull(), 'SIK'] = df.CIK.map(fast_mode(df, ['CIK'], 'SIK').set_index('CIK').SIK)
Output df:
CIK SIK
0 C 2.0
1 C 1.0
2 B 3.0
3 B 3.0
4 A 2.0
5 A 3.0
6 C 1.0
7 B 3.0
8 C 1.0
9 A 2.0
10 D NaN
11 D NaN
12 D NaN
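If the groups with no observed values at all (like D above) should also be filled, one hedged follow-up is to fall back to the overall mode:
# fill the remaining NaNs with the global mode of the column
df['SIK'] = df['SIK'].fillna(df['SIK'].mode()[0])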

populate missing values for multiple columns with multiple values

I have gone through posts about filling multiple pandas columns in one go; however, my problem is a little different, in the sense that I need to populate a missing column's value from another specific column, and do that for multiple columns in one go.
Eg: I can use the commands as below individually to fill the NA's
result1_copy['BASE_B'] = np.where(pd.isnull(result1_copy['BASE_B']), result1_copy['BASE_S'], result1_copy['BASE_B'])
result1_copy['QWE_B'] = np.where(pd.isnull(result1_copy['QWE_B']), result1_copy['QWE_S'], result1_copy['QWE_B'])
However, if I were to try populating it in one go, it does not work:
result1_copy['BASE_B','QWE_B'] = result1_copy['BASE_B', 'QWE_B'].fillna(result1_copy['BASE_S','QWE_S'])
Do we know why?
Please note I have only used 2 columns here for simplicity; however, I have tens of columns to impute, and they are either object, float or datetime.
Are the datatypes the issue here?
You need to add [] to select multiple columns as a DataFrame, and rename the columns so that they align:
d = {'BASE_S': 'BASE_B', 'QWE_S': 'QWE_B'}
result1_copy[['BASE_B','QWE_B']] = (result1_copy[['BASE_B','QWE_B']]
    .fillna(result1_copy[['BASE_S','QWE_S']].rename(columns=d)))
More dynamic solution:
L = ['BASE_','QWE_']
orig = ['{}B'.format(x) for x in L]
new = ['{}S'.format(x) for x in L]
d = dict(zip(new, orig))
result1_copy[orig] = (result1_copy[orig]
    .fillna(result1_copy[new].rename(columns=d)))
Another solution, if the B and S columns match up by prefix:
for x in ['BASE_', 'QWE_']:
    result1_copy[x + 'B'] = result1_copy[x + 'B'].fillna(result1_copy[x + 'S'])
Sample:
result1_copy = pd.DataFrame({'A': list('abcdef'),
                             'BASE_B': [np.nan, 5, 4, 5, 5, np.nan],
                             'QWE_B': [np.nan, 8, 9, 4, 2, np.nan],
                             'BASE_S': [1, 3, 5, 7, 1, 0],
                             'QWE_S': [5, 3, 6, 9, 2, 4],
                             'F': list('aaabbb')})
print(result1_copy)
A BASE_B BASE_S F QWE_B QWE_S
0 a NaN 1 a NaN 5
1 b 5.0 3 a 8.0 3
2 c 4.0 5 a 9.0 6
3 d 5.0 7 b 4.0 9
4 e 5.0 1 b 2.0 2
5 f NaN 0 b NaN 4
d = {'BASE_S':'BASE_B', 'QWE_S':'QWE_B'}
result1_copy[['BASE_B','QWE_B']] = (result1_copy[['BASE_B','QWE_B']]
    .fillna(result1_copy[['BASE_S','QWE_S']].rename(columns=d)))
print(result1_copy)
A BASE_B BASE_S F QWE_B QWE_S
0 a 1.0 1 a 5.0 5
1 b 5.0 3 a 8.0 3
2 c 4.0 5 a 9.0 6
3 d 5.0 7 b 4.0 9
4 e 5.0 1 b 2.0 2
5 f 0.0 0 b 4.0 4
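Since the question mentions tens of columns to impute, a hedged generalization derives the pairs from the column names instead of hard-coding the prefixes (assuming the _B/_S naming convention shown):
# map every *_S column to its matching *_B column, then fill in one go
b_cols = [c for c in result1_copy.columns if c.endswith('_B')]
d = {c[:-1] + 'S': c for c in b_cols if c[:-1] + 'S' in result1_copy.columns}
result1_copy[list(d.values())] = (result1_copy[list(d.values())]
    .fillna(result1_copy[list(d.keys())].rename(columns=d)))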
