How to avoid a memory overflow with a pivot table? - python

I have two medium-sized datasets which look like:
books_df.head()
ISBN Book-Title Book-Author
0 0195153448 Classical Mythology Mark P. O. Morford
1 0002005018 Clara Callan Richard Bruce Wright
2 0060973129 Decision in Normandy Carlo D'Este
3 0374157065 Flu: The Story of the Great Influenza Pandemic... Gina Bari Kolata
4 0393045218 The Mummies of Urumchi E. J. W. Barber
and
ratings_df.head()
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
And I want to get a pivot table like this:
ISBN 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
User-ID
1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I've tried:
R_df = ratings_df.pivot(index = 'User-ID', columns ='ISBN', values = 'Book-Rating').fillna(0) # Memory overflow
which fails with:
MemoryError:
and this:
R_df = ratings_df.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().unstack()
which fails with the same error.
I want to use it for singular value decomposition and matrix factorization.
Any ideas?
The dataset I'm working with is: http://www2.informatik.uni-freiburg.de/~cziegler/BX/

One option is to use pandas Sparse functionality, since your data here is (very) sparse:
In [11]: df
Out[11]:
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
In [12]: res = df.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().astype('Sparse[int]')
In [13]: res.unstack(fill_value=0)
Out[13]:
ISBN 0155061224 034545104X 0446520802 052165615X 0521795028
User-ID
276725 0 0 0 0 0
276726 5 0 0 0 0
276727 0 0 0 0 0
276729 0 0 0 3 6
In [14]: _.dtypes
Out[14]:
ISBN
0155061224 Sparse[int64, 0]
034545104X Sparse[int64, 0]
0446520802 Sparse[int64, 0]
052165615X Sparse[int64, 0]
0521795028 Sparse[int64, 0]
dtype: object
My understanding is that you can then use this with scipy e.g. for SVD:
In [15]: res.unstack(fill_value=0).sparse.to_coo()
Out[15]:
<4x5 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
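For example, a truncated SVD can then be run directly on that COO matrix (a minimal sketch, assuming scipy is installed; svds wants a float matrix and k has to be smaller than both matrix dimensions):
from scipy.sparse.linalg import svds

# convert the sparse frame to a float COO matrix, then take a truncated SVD
coo = res.unstack(fill_value=0).sparse.to_coo().astype(float)
u, s, vt = svds(coo, k=2)  # k latent factors; use a larger k on the full dataset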

Related

Pandas: create new (sub-level) columns in a multi-index dataframe and assign values

Suppose we are given a DataFrame like the following one:
import pandas as pd
import numpy as np
a = ['a', 'b']
b = ['i', 'ii']
mi = pd.MultiIndex.from_product([a,b], names=['first', 'second'])
A = pd.DataFrame(np.zeros([3,4]), columns=mi)
first a b
second i ii i ii
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
I would like to create a new column iii under every first-level column and assign the values of a new array (of matching size). I tried the following, to no avail.
A.loc[:,pd.IndexSlice[:,'iii']] = np.arange(6).reshape(3,-1)
The result should look like this:
a b
i ii iii i ii iii
0 0.0 0.0 0.0 0.0 0.0 1.0
1 0.0 0.0 2.0 0.0 0.0 3.0
2 0.0 0.0 4.0 0.0 0.0 5.0
Since you have a MultiIndex in the columns, I recommend creating an additional DataFrame to append, then concatenating it back:
appenddf = pd.DataFrame(np.arange(6).reshape(3, -1),
                        index=A.index,
                        columns=pd.MultiIndex.from_product([A.columns.levels[0], ['iii']]))
appenddf
a b
iii iii
0 0 1
1 2 3
2 4 5
A=pd.concat([A,appenddf],axis=1).sort_index(level=0,axis=1)
A
first a b
second i ii iii i ii iii
0 0.0 0.0 0 0.0 0.0 1
1 0.0 0.0 2 0.0 0.0 3
2 0.0 0.0 4 0.0 0.0 5
Another workable solution
for i, x in enumerate(A.columns.levels[0]):
    A[x, 'iii'] = np.arange(6).reshape(3, -1)[:, i]
A
first a b a b
second i ii i ii iii iii
0 0.0 0.0 0.0 0.0 0 1
1 0.0 0.0 0.0 0.0 2 3
2 0.0 0.0 0.0 0.0 4 5
# here I did not add `sort_index`

How to add some columns to a dataframe?

I have a dataframe like this
value
msno features days
B num_50 1 0
C num_100 3 1
A num_100 400 2
I used
df = df.unstack(level=-1,fill_value = '0')
df = df.unstack(level=-1,fill_value = '0')
df = df.stack()
then df looks like:
value
days 1 3 400
msno features
B num_50 0 0 0
num_100 0 0 0
C num_50 0 0 0
num_100 0 1 0
A num_50 0 0 0
num_100 0 0 2
Now I want to fill this df with 0, but still keep the original data, like this:
value
days 1 2 3 4 ... 400
msno features
B num_50 0 0 0 0 ... 0
num_100 0 0 0 0 ... 0
C num_50 0 0 0 0 ... 0
num_100 0 0 1 0 ... 0
A num_50 0 0 0 0 ... 0
num_100 0 0 0 0 ... 2
I want to add all the columns from 1 to 400 and fill them with 0.
Could someone tell me how to do that?
By using reindex
df.columns=df.columns.droplevel()
df.reindex(columns=list(range(1,20))).fillna(0)
Out[414]:
days 1 2 3 4 5 6 7 8 9 10 11 12 13 \
msno features
A num_100 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B num_100 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C num_100 0 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
days 14 15 16 17 18 19
msno features
A num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0
B num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0
C num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0
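For the full 1-400 range asked for in the question, the same idea scales; reindex also takes fill_value directly, so the separate fillna can be dropped (a sketch under the same setup):
df.columns = df.columns.droplevel()
df = df.reindex(columns=range(1, 401), fill_value=0)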

"Cannot reindex from a duplicate axis" when groupby.apply() on MultiIndex columns

I'm playing around with computing subtotals within a DataFrame that looks like this (note the MultiIndex):
0 1 2 3 4 5
A 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
B 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
I can successfully add the subtotals with the following code:
(
df
.groupby(level=0)
.apply(
lambda df: pd.concat(
[df.xs(df.name), df.sum().to_frame('Total').T]
)
)
)
And it looks like this:
0 1 2 3 4 5
A 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
Total 0.0 0.0 0.0 0.0 0.0 0.0
B 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
Total 0.0 0.0 0.0 0.0 0.0 0.0
However, when I work with the transposed DataFrame, it does not work. The DataFrame looks like:
A B
1 2 1 2
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
And I use the following code:
(
df2
.groupby(level=0, axis=1)
.apply(
lambda df: pd.concat(
[df.xs(df.name, axis=1), df.sum(axis=1).to_frame('Total')],
axis=1
)
)
)
I have specified axis=1 everywhere I can think of, but I get an error:
ValueError: cannot reindex from a duplicate axis
I would expect the output to be:
A B
1 2 Total 1 2 Total
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0
Is this a bug? Or have I not specified the axis correctly everywhere? As a workaround, I can obviously transpose the DataFrame, produce the totals, and transpose back, but I'd like to know why it's not working here, and submit a bug report if necessary.
The problem DataFrame can be generated with:
df2 = pd.DataFrame(
np.zeros([6, 4]),
columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
)
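The transpose workaround mentioned above would look roughly like this (a sketch, reusing the lambda that already works on the row axis):
subtotals = (
    df2.T
    .groupby(level=0)
    .apply(
        lambda df: pd.concat(
            [df.xs(df.name), df.sum().to_frame('Total').T]
        )
    )
    .T
)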

Joining dataframes in Pandas

I'm getting the error:
TypeError: 'method' object is not subscriptable
when I try to join two Pandas dataframes.
I can't see what is wrong with them!
For context, I'm working on the Kaggle Titanic problem:
titanic_df.head()
Out[102]:
Survived Pclass SibSp Parch Fare has_cabin C Q title Person
0 0 3 1 0 7.2500 1 0.0 0.0 Mr male
1 1 1 1 0 71.2833 0 1.0 0.0 Mrs female
2 1 3 0 0 7.9250 1 0.0 0.0 Miss female
3 1 1 1 0 53.1000 0 0.0 0.0 Mrs female
4 0 3 0 0 8.0500 1 0.0 0.0 Mr male
In [103]:
sns.barplot(x=titanic_df["Survived"],y=titanic_df["title"])
Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b5edb00>
In [125]:
title_dummies=pd.get_dummies(titanic_df["title"])
title_dummies=title_dummies.drop([" Don"," Rev"," Dr"," Col"," Capt"," Jonkheer"," Major"," Mr"],axis=1)
title_dummies.head()
Out[125]:
Lady Master Miss Mlle Mme Mrs Ms Sir the Countess
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [126]:
titanic_df=title_dummies.join[titanic_df]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-126-a0e0fe306754> in <module>()
----> 1 titanic_df=title_dummies.join[titanic_df]
TypeError: 'method' object is not subscriptable
You need to change [] to () when calling the DataFrame.join method:
titanic_df=title_dummies.join(titanic_df)
print (titanic_df)
Lady Master Miss Mlle Mme Mrs Ms Sir the Countess Survived \
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1
3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
Pclass SibSp Parch Fare has_cabin C Q title Person
0 3 1 0 7.2500 1 0.0 0.0 Mr male
1 1 1 0 71.2833 0 1.0 0.0 Mrs female
2 3 0 0 7.9250 1 0.0 0.0 Miss female
3 1 1 0 53.1000 0 0.0 0.0 Mrs female
4 3 0 0 8.0500 1 0.0 0.0 Mr male
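An equivalent way to put the two frames side by side (a sketch) is pd.concat along the columns:
titanic_df = pd.concat([title_dummies, titanic_df], axis=1)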

Reshape dataframe to have the same index as another dataframe

I have two dataframes:
dayData
power_comparison final_average_delta_power calculated_power
1 0.0 0.0 0
2 0.0 0.0 0
3 0.0 0.0 0
4 0.0 0.0 0
5 0.0 0.0 0
7 0.0 0.0 0
and
historicPower
power
0 0.0
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
I'm trying to reindex the historicPower dataframe to have the same shape as the dayData dataframe (so in this example it would look like):
power
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
The dataframes in reality will be a lot larger, with different shapes.
I think you can use reindex if the index has no duplicates:
historicPower = historicPower.reindex(dayData.index)
print (historicPower)
power
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
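If dayData ever contains index labels that are missing from historicPower, reindex's fill_value keeps those rows from turning into NaN (a sketch):
historicPower = historicPower.reindex(dayData.index, fill_value=0)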
