Build matrix of dummy indicators - python

I have a pandas dataframe which looks like the following:
team_id  skill_id  inventor_id
      1         A         Jack
      1         B         Jack
      1         A         Jill
      1         B         Jill
      2         A         Jack
      2         B         Jack
      2         A          Joe
      2         B          Joe
So inventors can repeat across teams. I want to turn this dataframe into a matrix A of dummy indicators (I have included column names below for clarity; they wouldn't form part of the matrix). For this example, A =
Jack_A  Jack_B  Jill_A  Jill_B  Joe_A  Joe_B
     1       0       1       0      0      0
     0       1       0       1      0      0
     1       0       0       0      1      0
     0       1       0       0      0      1
So each row corresponds to one (team_id x skill_id) combination, and each entry of the matrix is equal to one if that (inventor_id x skill_id) observation belongs to that team.
I tried creating an array of numpy zeros and thought of a double dictionary to map each (team_id x skill), (inventor_id x skill) combination to an A_ij entry. However, I believe this cannot be the most efficient method.
I need the method to be memory efficient, as I have 220,000 (inventor x team x skill) observations. (So the dimension of the real df is (220,000, 3), not (8, 3) as in the example.)
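For reference, a minimal sketch of one memory-friendly route: build a scipy CSR matrix straight from pandas category codes, so the dense matrix is never materialised. The construction of df below is an assumption matching the example above, not code from the original post:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# assumed construction of the example frame from the question
df = pd.DataFrame({
    "team_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "skill_id": list("ABABABAB"),
    "inventor_id": ["Jack", "Jack", "Jill", "Jill", "Jack", "Jack", "Joe", "Joe"],
})

# integer codes: one row per (team, skill), one column per (inventor, skill)
rows = (df["team_id"].astype(str) + "_" + df["skill_id"]).astype("category").cat.codes
cols = (df["inventor_id"] + "_" + df["skill_id"]).astype("category").cat.codes

# one nonzero entry per observation; only the nonzeros are stored
A = csr_matrix((np.ones(len(df)), (rows, cols)))
print(A.toarray())  # densify only for inspection on small examples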

In addition to @Ben.T's great answer, I figured out another approach which lets me stay memory efficient.
from sklearn.feature_extraction import DictVectorizer

# Set the identifier for each row
inventor_data["team_id"] = inventor_data["team_id"].astype(str)
inventor_data["inv_skill_id"] = inventor_data["inventor_id"] + inventor_data["skill_id"]
inventor_data["team_skill_id"] = inventor_data["team_id"] + inventor_data["skill_id"]

# DictVectorizer requires a dictionary input
teams = list(inventor_data.groupby('team_skill_id')['inv_skill_id'].agg(dict))

# Re-key each dict as {inv_skill_id: 1}
for team_id, team in enumerate(teams):
    teams[team_id] = {v: 1 for k, v in team.items()}

vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(teams)
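As a side note on the design choice: DictVectorizer defaults to sparse=True, in which case it returns a scipy sparse matrix, so dropping sparse=False should reduce memory further whenever downstream code accepts sparse input.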

IIUC, you can use crosstab:
print(
    pd.crosstab(
        index=[df['team_id'], df['skill_id']],
        columns=[df['inventor_id'], df['skill_id']]
    )#.to_numpy()
)
# inventor_id      Jack   Jill   Joe
# skill_id         A  B   A  B   A  B
# team_id skill_id
# 1       A        1  0   1  0   0  0
#         B        0  1   0  1   0  0
# 2       A        1  0   0  0   1  0
#         B        0  1   0  0   0  1
and if you just want the matrix, then uncomment .to_numpy() in the above code.
Note: if you have some skills that are not shared between teams or inventors, you may need to reindex with all the possibilities, so do:
pd.crosstab(
    index=[df['team_id'], df['skill_id']],
    columns=[df['inventor_id'], df['skill_id']]
).reindex(
    index=pd.MultiIndex.from_product(
        [df['team_id'].unique(), df['skill_id'].unique()]),
    columns=pd.MultiIndex.from_product(
        [df['inventor_id'].unique(), df['skill_id'].unique()]),
    fill_value=0
)#.to_numpy()
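One caveat: crosstab materialises the full dense table. If that is too large for the real data, a possible follow-up (an untested sketch, not part of the original answer) is to convert the result to sparse dtypes before working with it:
ct = pd.crosstab(index=[df['team_id'], df['skill_id']],
                 columns=[df['inventor_id'], df['skill_id']])
ct_sparse = ct.astype(pd.SparseDtype("int", 0))  # zeros are no longer stored densely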

Related

pandas: replace values in column with the last character in the column name

I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})
I am trying to replace the non-zero values with the character at index 5 of the column name (which is the numeric part of the column name), so the output should be:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following, but it does not work:
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, when I use the full column names instead, the code works, so I am not sure which part is wrong:
print(df.replace(1, pd.Series(df.columns, df.columns)))
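(A likely explanation, for what it's worth: when the value argument of DataFrame.replace is a Series, it is aligned on column labels, so the replacement Series must be indexed by the column names; an index of '1', '2', ... matches no column, and nothing gets replaced.)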
Since you're dealing with 1s and 0s, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
You could use string multiplication on a boolean array to place the strings based on the condition, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))

Converting data to matrix by group in Python

I want to create a matrix for each observation in my dataset.
Each row should correspond to a disease group (i.e. xx, yy, kk). Example data:
id xx_z xx_y xx_a yy_b yy_c kk_t kk_r kk_m kk_y
1 1 1 0 0 1 0 0 1 1
2 0 0 1 0 0 1 1 0 1
Given that there are 3 disease groups and a maximum of 4 diseases per group in the dataset, each matrix should be 3 x 4, and the output should look like:
id  matrix
         xx_z  xx_y  xx_a  null
1   xx  [  1     1     0     0
         yy_b  yy_c  null  null
    yy     0     1     0     0
         kk_t  kk_r  kk_m  kk_y
    kk     0     0     1     1  ]
2       [  0  0  1  0
           0  0  0  0
           1  1  0  1  ]
Please note that I do not know the exact number of diseases per disease group in advance. How could I do this in Python pandas?
P.S. I just need a nested matrix structure for each observation; later I will compare the matrices of different observations, e.g. the Jaccard similarity of the matrices for observation id == 1 and observation id == 2.
Ok, how about something like this:
# make a copy just in case
d = df[:]
# get the groups, in case you don't have them already
groups = list({col.split('_')[0] for col in d.columns})
# define grouping condition (here, groups would be 'xx', 'yy', 'kk')
gb = d.groupby(d.columns.map(lambda x: x.split('_')[0]), axis=1)
# aggregate the values of each group into a list, saved as an extra column
for g in groups:
    d[g] = gb.get_group(g).values.tolist()
# now aggregate to a list of lists
d['matrix'] = d[groups].values.tolist()
# convert each list of lists to a matrix, padding short rows with zeros
d['matrix'] = d['matrix'].apply(
    lambda x: pd.DataFrame.from_records(x).fillna(0).astype(int).values)
# for the desired output
d[['matrix']]
Not the most elegant, but I'm hoping it does the job :)
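Following up on the P.S. above, a minimal sketch of a Jaccard comparison between two observations' matrices, assuming they share a shape (the zero-padding above guarantees this):
import numpy as np

def jaccard(m1, m2):
    # Jaccard similarity of two equal-shaped binary matrices
    m1, m2 = np.asarray(m1, dtype=bool), np.asarray(m2, dtype=bool)
    union = (m1 | m2).sum()
    return (m1 & m2).sum() / union if union else 1.0

# e.g. compare observations id == 1 and id == 2
# jaccard(d['matrix'].iloc[0], d['matrix'].iloc[1])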

How can I check the sparsity of a Pandas DataFrame?

In Pandas, how can I check how sparse a DataFrame is? Is there any function available, or will I need to write my own?
For now, I have this:
df = pd.DataFrame({'a':[1,0,1,1,3], 'b':[0,0,0,0,1], 'c':[4,0,0,0,0], 'd':[0,0,3,0,0]})
a b c d
0 1 0 4 0
1 0 0 0 0
2 1 0 0 3
3 1 0 0 0
4 3 1 0 0
sparsity = sum((df == 0).astype(int).sum())/df.size
This divides the number of zeros by the total number of elements; in this example it's 0.65.
I wanted to know if there is a better way to do this, and whether there is any function that gives more information about the sparsity (like NaNs, or any other prominent number like -1).
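For the "more information" part, here is a small sketch; the function name and the sentinel list are mine, not a pandas built-in:
import pandas as pd

def sparsity_report(df, sentinels=(0, -1)):
    # fraction of cells equal to each sentinel value, plus the NaN fraction
    report = {f"frac_{s}": (df == s).to_numpy().mean() for s in sentinels}
    report["frac_nan"] = df.isna().to_numpy().mean()
    return report

# sparsity_report(df) -> {'frac_0': 0.65, 'frac_-1': 0.0, 'frac_nan': 0.0}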
One idea for your solution is to convert to a numpy array, compare, and use mean:
a = (df.to_numpy() == 0).mean()
print (a)
0.65
If you want to use Sparse dtypes, it is possible to use:
#convert each column to SparseArray
sparr = df.apply(pd.arrays.SparseArray)
print (sparr)
a b c d
0 1 0 4 0
1 0 0 0 0
2 1 0 0 3
3 1 0 0 0
4 3 1 0 0
print (sparr.dtypes)
a Sparse[int64, 0]
b Sparse[int64, 0]
c Sparse[int64, 0]
d Sparse[int64, 0]
dtype: object
print (sparr.sparse.density)
0.35
As of September 16th, 2021 (and, I want to say, good for any version > 0.25.0, released July 2019) the sparse accessor gives DataFrame.sparse.density, which is exactly what you're looking for.
Of course, in order to do that you need to actually convert to a sparse DataFrame: df.astype(pd.SparseDtype("int", 0))
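A minimal usage sketch with the example frame from this question:
sdf = df.astype(pd.SparseDtype("int", 0))
print(sdf.sparse.density)  # 0.35, so the zero-fraction is 1 - 0.35 = 0.65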

Python - Attempting to create binary features from a column with lists of strings

It was hard for me to come up with a clear title, but an example should make things clearer.
Index C1
1 [dinner]
2 [brunch, food]
3 [dinner, fancy]
Now, I'd like to create a set of binary features for each of the unique values in this column.
The example above would turn into:
Index C1 dinner brunch fancy food
1 [dinner] 1 0 0 0
2 [brunch, food] 0 1 0 1
3 [dinner, fancy] 1 0 1 0
Any help would be much appreciated.
For a performant solution, I recommend creating a new DataFrame by listifying your column.
pd.get_dummies(pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
brunch dinner fancy food
0 0 1 0 0
1 1 0 0 1
2 0 1 1 0
This is going to be so much faster than apply(pd.Series).
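(The speed-up comes from df.C1.tolist() handing the whole column to the DataFrame constructor in one call, whereas apply(pd.Series) builds a new Series object per row in a Python-level loop.)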
This works assuming lists don't have more of the same value (eg., ['dinner', ..., 'dinner']). If they do, then you'll need an extra groupby step:
(pd.get_dummies(
     pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
   .groupby(level=0, axis=1)
   .sum())
Well, if your data is like this, then what you're looking for isn't "binary" anymore.
Maybe using MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.C1),
             columns=mlb.classes_,
             index=df.Index).reset_index()
Out[970]:
Index brunch dinner fancy food
0 1 0 1 0 0
1 2 1 0 0 1
2 3 0 1 1 0
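Note that, as far as I know, MultiLabelBinarizer treats each list as a set, so a repeated value within a row still yields a 1 rather than a count.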

Python Selecting and Adding row values of columns in dataframe to create an aggregated dataframe

I need to process my dataframe in Python such that I add up the numeric values of the numeric columns for the rows that lie between the 'None' marker rows of the dataframe.
The dataframe can be created using
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['a', 0, 1, 0, 0, 0, 0, 'i'],
                            ['b', 1, 0, 0, 0, 0, 0, 'j'],
                            ['c', 0, 0, 1, 0, 0, 0, 'k'],
                            ['None', 0, 0, 0, 1, 0, 0, 'l'],
                            ['e', 0, 0, 0, 0, 1, 0, 'm'],
                            ['f', 0, 1, 0, 0, 0, 0, 'n'],
                            ['None', 0, 0, 0, 1, 0, 0, 'o'],
                            ['h', 0, 0, 0, 0, 1, 0, 'p']]),
                  columns=[0, 1, 2, 3, 4, 5, 6, 7],
                  index=[0, 1, 2, 3, 4, 5, 6, 7])
I need to add up all rows that occur before each 'None' entry and move the aggregated rows to a new dataframe.
Your dataframe's dtypes are messed up because you assigned the values from a single numpy array: an array accepts only one type, so all the ints were pushed to strings. We need to convert them back first:
df = df.apply(pd.to_numeric, errors='ignore')  # convert numeric strings back
df['newkey'] = df[0].eq('None').cumsum()  # create the group key with cumsum
# drop the 'None' rows, then aggregate each group
df.loc[df[0].ne('None'), :].groupby('newkey').agg(
    lambda x: x.sum() if x.dtype == 'int64' else x.head(1))
Out[742]:
0 1 2 3 4 5 6 7
newkey
0 a 1 1 1 0 0 0 i
1 e 0 1 0 0 1 0 m
2 h 0 0 0 0 1 0 p
You can also specify the agg funcs
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
0 1 2 3 4 5 6 7
0
0.0 a 1 1 1 1 0 0 i
1.0 e 0 1 0 1 1 0 m
2.0 h 0 0 0 0 1 0 p
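Note that the two approaches treat the 'None' rows themselves differently: the first drops them before aggregating, while this one (via cumsum().shift()) folds each 'None' row into the preceding group, which is why column 4 shows 1 here but 0 in the output above.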
