Pandas crosstab on very large matrix?

I have a dataframe of dimensions (42 million rows, 6 columns) that I need to crosstab to get counts of specific events for each person in the dataset. The result will be a very large sparse matrix of roughly 1.5 million rows by 36,000 columns. When I try this with the pandas crosstab (pd.crosstab) function, I run out of memory on my system. Is there some way to do this crosstab in chunks and join the resulting dataframes? To be clear, each row of the crosstab counts the number of times an event occurred for a person (i.e. each row is a person, and each column entry is the count of times that person participated in a specific event). The ultimate goal is to factor the resulting person-event matrix using PCA/SVD.

Setup
import numpy as np
import pandas as pd

source_0 = [*'ABCDEFGHIJ']
source_1 = [*'abcdefghij']
np.random.seed([3, 1415])
df = pd.DataFrame({
    'source_0': np.random.choice(source_0, 100),
    'source_1': np.random.choice(source_1, 100),
})
df
source_0 source_1
0 A b
1 C b
2 H f
3 D a
4 I h
.. ... ...
95 C f
96 F a
97 I j
98 I d
99 J b
Use pd.factorize to get an integer factorization... and unique values:
# Each row becomes a (source_0, source_1) tuple; factorize maps every
# distinct tuple to an integer code (ij) and returns the unique tuples (tups).
ij, tups = pd.factorize(list(zip(*map(df.get, df))))
# bincount over the codes counts how often each unique pair occurs.
result = dict(zip(tups, np.bincount(ij)))
This is already a compact form. But you can convert it to a pandas.Series and unstack to verify it is what we want.
pd.Series(result).unstack(fill_value=0)
a b c d e f g h i j
A 2 1 0 0 0 1 0 2 1 1
B 0 1 0 0 0 1 0 1 0 1
C 0 3 1 3 0 2 0 0 0 0
D 3 0 0 2 0 0 1 3 0 2
E 3 0 0 1 0 1 2 5 0 0
F 4 0 2 1 1 1 1 1 1 0
G 0 2 1 0 0 2 3 0 3 1
H 1 3 2 0 2 1 1 1 0 2
I 2 2 1 1 2 0 1 2 0 2
J 0 1 1 0 1 1 0 1 0 1
Using sparse
from scipy.sparse import csr_matrix

# Factorize each column separately: integer codes plus the unique labels
# for the row axis (r) and the column axis (c).
i, r = pd.factorize(df['source_0'])
j, c = pd.factorize(df['source_1'])
ij, tups = pd.factorize(list(zip(i, j)))

# Build the sparse count matrix directly from the pair counts.
a = csr_matrix((np.bincount(ij), tuple(zip(*tups))))
b = pd.DataFrame.sparse.from_spmatrix(a, r, c).sort_index().sort_index(axis=1)
b
a b c d e f g h i j
A 2 1 0 0 0 1 0 2 1 1
B 0 1 0 0 0 1 0 1 0 1
C 0 3 1 3 0 2 0 0 0 0
D 3 0 0 2 0 0 1 3 0 2
E 3 0 0 1 0 1 2 5 0 0
F 4 0 2 1 1 1 1 1 1 0
G 0 2 1 0 0 2 3 0 3 1
H 1 3 2 0 2 1 1 1 0 2
I 2 2 1 1 2 0 1 2 0 2
J 0 1 1 0 1 1 0 1 0 1
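Back to the question's scale problem: the same codes-plus-counts idea extends to data read in chunks, and the resulting sparse matrix can be fed straight to a sparse-aware SVD. Below is a minimal sketch, assuming scikit-learn is available; the chunks just split the toy df from the Setup (with real data this would be e.g. pd.read_csv(..., chunksize=...)), and the pure-Python inner loop is illustrative rather than tuned:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Simulate chunked input by splitting the toy frame from the Setup.
chunks = np.array_split(df, 4)

row_ids, col_ids, counts = {}, {}, {}
for chunk in chunks:
    for p, e in zip(chunk['source_0'], chunk['source_1']):
        i = row_ids.setdefault(p, len(row_ids))   # stable person -> row code
        j = col_ids.setdefault(e, len(col_ids))   # stable event -> column code
        counts[i, j] = counts.get((i, j), 0) + 1

rows, cols = zip(*counts)
mat = csr_matrix((list(counts.values()), (rows, cols)),
                 shape=(len(row_ids), len(col_ids)))

# TruncatedSVD accepts scipy sparse input directly, so the big
# person-by-event matrix never has to be densified for PCA/SVD.
svd = TruncatedSVD(n_components=5, random_state=0)
embedding = svd.fit_transform(mat)   # shape: (n_persons, n_components)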

Related

Is it possible to create multiple columns that are based on each other?

Let's say I have this DF:
Range = np.arange(0,10,1)
Table = pd.DataFrame({"Row": Range})
Now I'd like to add 2 new columns that are based on each other. For example:
Table["A"]=np.where(Table["B"].shift(1)>0,1,0)
Table["B"]=np.where(Table["A"]==1,1,0)
In Excel, for example, a circular dependency like this doesn't raise an error.
I wonder whether that is possible on a df in Python.
Thanks.
You can try an iterative approach: initialize B to something (e.g., all items equal to None, 1, or 0, or some other initialization pattern), set A as in your question, then loop, updating A and B until both converge or a maximum iteration threshold is reached:
import pandas as pd
import numpy as np

Range = np.arange(0, 4, 1)
Table = pd.DataFrame({"Row": Range})
print(Table)

for pair in [(np.nan, np.nan), (None, None), (0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f'\npair is {pair}')
    Table["B"] = np.where(Table["Row"] % 2, pair[0], pair[1])
    Table["A"] = np.where(Table["B"].shift(1) > 0, 1, 0)
    i = 0
    prevA, prevB = None, None
    while prevA is None or not prevB.equals(Table["B"]) or not prevA.equals(Table["A"]):
        if i >= 20:
            print('max iterations exceeded')
            break
        print(f'iteration {i}')
        print(Table)
        i += 1
        prevA, prevB = Table["A"].copy(), Table["B"].copy()
        Table["A"] = np.where(Table["B"].shift(1) > 0, 1, 0)
        Table["B"] = np.where(Table["A"] == 1, 1, 0)
Note that with the example logic from the OP, this does indeed seem to converge for a few initialization patterns for B (at least it does using a shorter input array of length 4).
Sample output:
pair is (nan, nan)
iteration 0
Row B A
0 0 NaN 0
1 1 NaN 0
2 2 NaN 0
3 3 NaN 0
iteration 1
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
pair is (None, None)
iteration 0
Row B A
0 0 None 0
1 1 None 0
2 2 None 0
3 3 None 0
iteration 1
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
pair is (0, 0)
iteration 0
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
pair is (0, 1)
iteration 0
Row B A
0 0 1 0
1 1 0 1
2 2 1 0
3 3 0 1
iteration 1
Row B A
0 0 0 0
1 1 1 1
2 2 0 0
3 3 1 1
iteration 2
Row B A
0 0 0 0
1 1 0 0
2 2 1 1
3 3 0 0
iteration 3
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 1 1
iteration 4
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
pair is (1, 0)
iteration 0
Row B A
0 0 0 0
1 1 1 0
2 2 0 1
3 3 1 0
iteration 1
Row B A
0 0 0 0
1 1 0 0
2 2 1 1
3 3 0 0
iteration 2
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 1 1
iteration 3
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
pair is (1, 1)
iteration 0
Row B A
0 0 1 0
1 1 1 1
2 2 1 1
3 3 1 1
iteration 1
Row B A
0 0 0 0
1 1 1 1
2 2 1 1
3 3 1 1
iteration 2
Row B A
0 0 0 0
1 1 0 0
2 2 1 1
3 3 1 1
iteration 3
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 1 1
iteration 4
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0

Convert one-hot encoded data-frame columns into one column

In the pandas data frame, the one-hot encoded vectors are present as columns, i.e:
Rows A B C D E
0 0 0 0 1 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 1 0 0 0 0
5 0 0 0 0 1
How can I convert these columns into a single dataframe column by label encoding them in Python? i.e.:
Rows A
0 4
1 3
2 2
3 4
4 1
5 5
I also need a suggestion on this: some rows have multiple 1s; how do we handle those rows, since we can have only one category at a time?
Try with argmax
#df=df.set_index('Rows')
df['New']=df.values.argmax(1)+1
df
Out[231]:
A B C D E New
Rows
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
argmax is the way to go; adding another way using idxmax and get_indexer:
df['New'] = df.columns.get_indexer(df.idxmax(1))+1
#df.idxmax(1).map(df.columns.get_loc)+1
print(df)
Rows A B C D E New
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
I also need a suggestion on this: some rows have multiple 1s; how do we
handle those rows, since we can have only one category at a time?
In this case you can dot your DataFrame of dummies with an array of the powers of 2 (one per column). This ensures that every unique combination of dummies (A, A+B, A+B+C, B+C, ...) gets a unique category label. (A few rows were added at the bottom to illustrate the unique encoding.)
df['Category'] = df.dot(2**np.arange(df.shape[1]))
A B C D E Category
Rows
0 0 0 0 1 0 8
1 0 0 1 0 0 4
2 0 1 0 0 0 2
3 0 0 0 1 0 8
4 1 0 0 0 0 1
5 0 0 0 0 1 16
6 1 0 0 0 1 17
7 0 1 0 0 1 18
8 1 1 0 0 1 19
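Each bit of the combined code corresponds to one dummy column, so a code can be decoded back to its combination. A small sketch using the column order above (dummy_cols is spelled out explicitly, since df now also holds the Category column):
dummy_cols = ['A', 'B', 'C', 'D', 'E']
code = 19                                        # row 8 above: A + B + E
present = [c for k, c in enumerate(dummy_cols) if code >> k & 1]
print(present)                                   # ['A', 'B', 'E']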
Another readable solution on top of the other great solutions; it works for any nonzero marker (not just 1s), as long as each row has exactly one nonzero entry:
df['variables'] = np.where(df.values)[1]+1
output:
A B C D E variables
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
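If you are on pandas 1.5 or newer, pd.from_dummies inverts get_dummies directly and recovers the column labels themselves. A quick sketch on a tiny frame (note it raises if a row has more than one 1, unless default_category is passed):
import pandas as pd

dummies = pd.DataFrame({'A': [0, 0, 1], 'B': [1, 0, 0], 'C': [0, 1, 0]})
labels = pd.from_dummies(dummies)   # one column holding B, C, A
print(labels)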

Cumulative count in a pandas df

I am trying to export a cumulative count based off two columns in a pandas df.
An example is the df below. I'm trying to export a count based off Value and Count, so when the count increases I want to attribute that to the adjacent value.
import pandas as pd
d = ({
    'Value' : ['A','A','B','C','D','A','B','A'],
    'Count' : [0,1,1,2,3,3,4,5],
})
df = pd.DataFrame(d)
I have used this:
for val in ['A','B','C','D']:
    cond = df.Value.eq(val) & df.Count.eq(int)
    df.loc[cond, 'Count_' + val] = cond[cond].cumsum()
If I replace int with a specific number it returns the count, but I need this to work for any number, since the Count column keeps increasing.
My intended output is:
Value Count A_Count B_Count C_Count D_Count
0 A 0 0 0 0 0
1 A 1 1 0 0 0
2 B 1 1 0 0 0
3 C 2 1 0 1 0
4 D 3 1 0 1 1
5 A 3 1 0 1 1
6 B 4 1 1 1 1
7 A 5 2 1 1 1
The count increases on the second row, so Value A gets 1. The count increases again on row 4, and it's the first time for Value C, so C gets 1. The same applies to rows 5 and 7. The count increases on row 8, so A becomes 2.
You could use str.get_dummies together with diff and cumsum:
In [262]: df['Value'].str.get_dummies().multiply(df['Count'].diff().gt(0), axis=0).cumsum()
Out[262]:
A B C D
0 0 0 0 0
1 1 0 0 0
2 1 0 0 0
3 1 0 1 0
4 1 0 1 1
5 1 0 1 1
6 1 1 1 1
7 2 1 1 1
Which, joined back onto the original frame, is:
In [266]: df.join(df['Value'].str.get_dummies()
                    .multiply(df['Count'].diff().gt(0), axis=0)
                    .cumsum().add_suffix('_Count'))
Out[266]:
Value Count A_Count B_Count C_Count D_Count
0 A 0 0 0 0 0
1 A 1 1 0 0 0
2 B 1 1 0 0 0
3 C 2 1 0 1 0
4 D 3 1 0 1 1
5 A 3 1 0 1 1
6 B 4 1 1 1 1
7 A 5 2 1 1 1
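To see the moving parts, the same chain can be written step by step; this is just a decomposition of the one-liner above:
increased = df['Count'].diff().gt(0)       # True where Count rose vs. the previous row
dummies = df['Value'].str.get_dummies()    # one 0/1 indicator column per Value
# zero out indicator rows where the count did not increase, then running-total
counts = dummies.multiply(increased, axis=0).cumsum()
out = df.join(counts.add_suffix('_Count'))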

Find first row with condition after each row satisfying another condition

in pandas I have the following data frame:
a b
0 0
1 1
2 1
0 0
1 0
2 1
Now I want to do the following:
Create a new column c, and for each row where a = 0, fill c with 1. Then c should keep being filled with 1s up to and including the first subsequent row where b = 1 (and this is where I'm stuck), so the output should look like this:
a b c
0 0 1
1 1 1
2 1 0
0 0 1
1 0 1
2 1 1
Thanks!
It seems you need:
df['c'] = df.groupby(df.a.eq(0).cumsum())['b'].cumsum().le(1).astype(int)
print (df)
a b c
0 0 0 1
1 1 1 1
2 2 1 0
3 0 0 1
4 1 0 1
5 2 1 1
Detail:
print (df.a.eq(0).cumsum())
0 1
1 1
2 1
3 2
4 2
5 2
Name: a, dtype: int32
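Continuing the breakdown, the grouped running sum and the final test look like this (values derived from the frame above):
s = df.groupby(df.a.eq(0).cumsum())['b'].cumsum()
# s is 0, 1, 2, 0, 0, 1 -- the running count of b == 1 within each a == 0 block
# le(1) keeps every row up to and including the block's first b == 1
df['c'] = s.le(1).astype(int)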

How to set index to an existing dataframe in the form of cartesian product?

I have a list. I want to set_index the dataframe with the cartesian product of the list values and the dataframe's rows, i.e.
li = ['A','B']
df = pd.DataFrame([[0,0,0],[1,1,1],[2,2,2]])
I want the resulting dataframe to be like
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
How can I do this?
Option 1
pd.concat with the keys argument
pd.concat([df] * len(li), keys=li)
0 1 2
A 0 0 0 0
1 1 1 1
2 2 2 2
B 0 0 0 0
1 1 1 1
2 2 2 2
To replicate your output exactly:
pd.concat([df] * len(li), keys=li).reset_index(1, drop=True)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
Option 2
np.tile and np.repeat
pd.DataFrame(np.tile(df, [len(li), 1]), np.repeat(li, len(df)), df.columns)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
Use MultiIndex.from_product with reindex:
mux = pd.MultiIndex.from_product([li, df.index])
df = df.reindex(mux, level=1).reset_index(level=1, drop=True)
print (df)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
Or you can use this more roundabout approach:
li = [['A','B']]
df['New']=li*len(df)
df.set_index([0,1,2])['New'].apply(pd.Series).stack().to_frame().rename(columns={0:'keys'})\
.reset_index().drop('level_3',1).sort_values('keys')
Out[698]:
0 1 2 keys
0 0 0 0 A
2 1 1 1 A
4 2 2 2 A
1 0 0 0 B
3 1 1 1 B
5 2 2 2 B
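On pandas 1.2 or newer, a cross merge is another readable route. A sketch using the original li and df from the question ('key' is just an illustrative column name):
out = (pd.DataFrame({'key': li})
         .merge(df, how='cross')
         .set_index('key'))
out.index.name = None
print(out)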
