Joining dataframes in Pandas - python

I'm getting the error:
TypeError: 'method' object is not subscriptable
When I try to join two Pandas dataframes...
Can't see what is wrong with them!
For context, I'm working on the Kaggle Titanic problem:
titanic_df.head()
Out[102]:
Survived Pclass SibSp Parch Fare has_cabin C Q title Person
0 0 3 1 0 7.2500 1 0.0 0.0 Mr male
1 1 1 1 0 71.2833 0 1.0 0.0 Mrs female
2 1 3 0 0 7.9250 1 0.0 0.0 Miss female
3 1 1 1 0 53.1000 0 0.0 0.0 Mrs female
4 0 3 0 0 8.0500 1 0.0 0.0 Mr male
In [103]:
sns.barplot(x=titanic_df["Survived"],y=titanic_df["title"])
Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b5edb00>
In [125]:
title_dummies=pd.get_dummies(titanic_df["title"])
title_dummies=title_dummies.drop([" Don"," Rev"," Dr"," Col"," Capt"," Jonkheer"," Major"," Mr"],axis=1)
title_dummies.head()
Out[125]:
Lady Master Miss Mlle Mme Mrs Ms Sir the Countess
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [126]:
titanic_df=title_dummies.join[titanic_df]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-126-a0e0fe306754> in <module>()
----> 1 titanic_df=title_dummies.join[titanic_df]
TypeError: 'method' object is not subscriptable

You need to change [] to () when calling DataFrame.join. Since join is a method, title_dummies.join[titanic_df] tries to subscript the method object itself, which is exactly what the TypeError reports:
titanic_df=title_dummies.join(titanic_df)
print(titanic_df)
Lady Master Miss Mlle Mme Mrs Ms Sir the Countess Survived \
0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
1 1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
2 2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1
3 3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
4 4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
Pclass SibSp Parch Fare has_cabin C Q title Person
0 3 1 0 7.2500 1 0.0 0.0 Mr male
1 1 1 0 71.2833 0 1.0 0.0 Mrs female
2 3 0 0 7.9250 1 0.0 0.0 Miss female
3 1 1 0 53.1000 0 0.0 0.0 Mrs female
4 3 0 0 8.0500 1 0.0 0.0 Mr male
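As a side note (my addition, not part of the original answer): since both frames share the same default RangeIndex, pd.concat along the columns axis gives the same result:
titanic_df = pd.concat([title_dummies, titanic_df], axis=1)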

Related

Pandas long reshape for several variables

I want to reshape my long dataframe to wide, grouped by Session. In this example Session runs from 1 to 10.
Session Tube Window Counts Length
0 1 1 1 0.0 0.0
1 1 1 2 0.0 0.0
2 1 1 3 0.0 0.0
3 1 1 4 0.0 0.0
4 1 1 5 0.0 0.0
... ... ... ... ... ...
17995 10 53 36 0.0 0.0
17996 10 53 37 0.0 0.0
17997 10 53 38 0.0 0.0
17998 10 53 39 0.0 0.0
17999 10 53 40 0.0 0.0
What I am expecting is something like:
Session Tube Window Counts_1 Length_1 Session Counts_2 Length_2
0 1 1 1 0.0 0.0 0 2 0.0 0.0
1 1 1 2 0.0 0.0 1 2 0.0 0.0
2 1 1 3 0.0 0.0 2 2 0.0 0.0
3 1 1 4 0.0 0.0 3 2 0.0 0.0
4 1 1 5 0.0 0.0 4 2 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
17995 10 53 36 0.0 0.0
I could not find a solution. What I tried leads to a completely wide dataset (one row per Session):
df['idx'] = df.groupby('Session').cumcount() + 1
df = df.pivot_table(index=['Session'], columns='idx',
                    values=['Counts', 'Length'], aggfunc='first')
df = df.sort_index(axis=1, level=1)
df.columns = [f'{x}_{y}' for x, y in df.columns]
df = df.reset_index()
Session Counts_1 Length_1 Counts_2 Length_2 Counts_3 Length_3 Counts_4 Length_4 Counts_5 Length_5 ... Length_1795 Counts_1796 Length_1796 Counts_1797 Length_1797 Counts_1798 Length_1798 Counts_1799 Length_1799 Counts_1800 Length_1800
0 1 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000
1 2 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000
2 3 0.0 6.892889 0.0 2.503830 0.0 3.108580 0.0 5.188438 0.0 9.779242 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000
3 4 1.0 12.787159 0.0 13.847412 7.0 44.928269 0.0 48.511435 2.0 33.264356 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000
4 5 0.0 13.345436 2.0 27.415005 20.0 83.130315 19.0 85.475996 2.0 10.147958 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000
5 6 2.0 13.141503 8.0 22.965002 5.0 48.737279 15.0 85.403915 1.0 17.414609 ... 0.000000 6.0 12.399834 0.0 0.710808 0.0 0.000000 0.0 1.661978 0.0 0.000000
6 7 1.0 7.852842 0.0 13.613426 14.0 46.148978 23.0 87.446535 0.0 13.759176 ... 2.231295 8.0 39.022340 1.0 7.304392 3.0 9.228959 0.0 6.885822 0.0 1.606200
7 8 0.0 0.884018 3.0 35.323813 8.0 32.846301 10.0 71.691744 0.0 4.310296 ... 2.753615 6.0 25.003670 6.0 22.113324 0.0 0.615790 0.0 11.812815 2.0 9.991712
8 9 4.0 24.700817 13.0 31.637755 3.0 30.312104 5.0 50.490115 0.0 3.830024 ... 5.977912 11.0 44.305738 1.0 13.523643 0.0 1.374856 1.0 9.066218 1.0 8.376995
9 10 0.0 17.651236 10.0 44.311858 29.0 55.415964 12.0 43.457016 1.0 41.503212 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000
You could try to pivot your dataframe after building a custom row index per session (note that recent pandas versions require keyword arguments for pivot):
df2 = (df.assign(index=df.groupby(['Session']).cumcount())
         .pivot(index='index', columns='Session',
                values=['Tube', 'Window', 'Counts', 'Length'])
         .rename_axis(index=None))
With your sample data it would give:
Tube Window Counts Length
Session 1 10 1 10 1 10 1 10
0 1.0 53.0 1.0 36.0 0.0 0.0 0.0 0.0
1 1.0 53.0 2.0 37.0 0.0 0.0 0.0 0.0
2 1.0 53.0 3.0 38.0 0.0 0.0 0.0 0.0
3 1.0 53.0 4.0 39.0 0.0 0.0 0.0 0.0
4 1.0 53.0 5.0 40.0 0.0 0.0 0.0 0.0
Not bad, but we have a MultiIndex for the columns, and in the wrong order. Let us go further:
df2.columns = df2.columns.to_flat_index()
df2 = df2.reindex(columns=sorted(df2.columns, key=lambda x: x[1]))
We now have:
(Tube, 1) (Window, 1) ... (Counts, 10) (Length, 10)
0 1.0 1.0 ... 0.0 0.0
1 1.0 2.0 ... 0.0 0.0
2 1.0 3.0 ... 0.0 0.0
3 1.0 4.0 ... 0.0 0.0
4 1.0 5.0 ... 0.0 0.0
Last step:
df2 = df2.rename(columns=lambda x: '_'.join(str(i) for i in x))
to finally get:
Tube_1 Window_1 Counts_1 ... Window_10 Counts_10 Length_10
0 1.0 1.0 0.0 ... 36.0 0.0 0.0
1 1.0 2.0 0.0 ... 37.0 0.0 0.0
2 1.0 3.0 0.0 ... 38.0 0.0 0.0
3 1.0 4.0 0.0 ... 39.0 0.0 0.0
4 1.0 5.0 0.0 ... 40.0 0.0 0.0
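Putting the steps together, a compact version of the same approach (a sketch under the same assumptions):
out = (df.assign(idx=df.groupby('Session').cumcount())
         .pivot(index='idx', columns='Session',
                values=['Tube', 'Window', 'Counts', 'Length'])
         .rename_axis(index=None))
# flatten the (name, session) column tuples, then group columns by session
out.columns = [f'{name}_{session}' for name, session in out.columns]
out = out.reindex(columns=sorted(out.columns,
                                 key=lambda c: int(c.rsplit('_', 1)[1])))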

How to cope with memory overflow when building a pivot table?

I have two medium-sized datasets which look like:
books_df.head()
ISBN Book-Title Book-Author
0 0195153448 Classical Mythology Mark P. O. Morford
1 0002005018 Clara Callan Richard Bruce Wright
2 0060973129 Decision in Normandy Carlo D'Este
3 0374157065 Flu: The Story of the Great Influenza Pandemic... Gina Bari Kolata
4 0393045218 The Mummies of Urumchi E. J. W. Barber
and
ratings_df.head()
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
And I want to get a pivot table like this:
ISBN 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
User-ID
1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I've tried:
R_df = ratings_df.pivot(index='User-ID', columns='ISBN', values='Book-Rating').fillna(0)
which failed with:
MemoryError:
and this:
R_df = ratings_df.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().unstack()
which failed the same way.
I want to use it for singular value decomposition and matrix factorization.
Any ideas?
The dataset I'm working with is: http://www2.informatik.uni-freiburg.de/~cziegler/BX/
One option is to use pandas Sparse functionality, since your data here is (very) sparse:
In [11]: df
Out[11]:
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
In [12]: res = df.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().astype('Sparse[int]')
In [13]: res.unstack(fill_value=0)
Out[13]:
ISBN 0155061224 034545104X 0446520802 052165615X 0521795028
User-ID
276725 0 0 0 0 0
276726 5 0 0 0 0
276727 0 0 0 0 0
276729 0 0 0 3 6
In [14]: _.dtypes
Out[14]:
ISBN
0155061224 Sparse[int64, 0]
034545104X Sparse[int64, 0]
0446520802 Sparse[int64, 0]
052165615X Sparse[int64, 0]
0521795028 Sparse[int64, 0]
dtype: object
My understanding is that you can then use this with scipy e.g. for SVD:
In [15]: res.unstack(fill_value=0).sparse.to_coo()
Out[15]:
<4x5 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
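A minimal sketch of that SVD step (my addition; assumes k is chosen smaller than both matrix dimensions):
from scipy.sparse.linalg import svds

coo = res.unstack(fill_value=0).sparse.to_coo()
# svds needs a floating-point matrix and 1 <= k < min(coo.shape)
u, s, vt = svds(coo.asfptype(), k=2)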

Getting column name where a condition matches in a row

I have a pandas dataframe which looks like this:
A B C D E F G H I
1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Now, for each row, I have to check which column contains 1 and then record this column name in a new column. The final dataframe would look like this:
A B C D E F G H I IsTrue
1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 B
2 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 A
3 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 B
Is there a faster, more pythonic way to do it?
Here's one way using DataFrame.dot:
df['isTrue'] = df.astype(bool).dot(df.columns)
A B C D E F G H I isTrue
1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 B
2 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 A
3 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 B
For even better performance you can use:
df['isTrue'] = df.columns[df.to_numpy().argmax(1)]
What you described is the definition of idxmax:
>>> df.idxmax(1)
1 B
2 A
3 B
dtype: object
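One caveat (my addition, not part of the answers above): the approaches differ when a row contains more than one 1. dot concatenates every matching column name, while idxmax and argmax return only the first:
import pandas as pd

df = pd.DataFrame({'A': [1.0, 0.0], 'B': [1.0, 1.0]})
print(df.astype(bool).dot(df.columns))  # row 0 -> 'AB', row 1 -> 'B'
print(df.idxmax(axis=1))                # row 0 -> 'A',  row 1 -> 'B'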

How to add some columns to a dataframe?

I have a dataframe like this:
value
msno features days
B num_50 1 0
C num_100 3 1
A num_100 400 2
I used
df = df.unstack(level=-1,fill_value = '0')
df = df.unstack(level=-1,fill_value = '0')
df = df.stack()
then df looks like:
value
days 1 3 400
msno features
B num_50 0 0 0
num_100 0 0 0
C num_50 0 0 0
num_100 0 1 0
A num_50 0 0 0
num_100 0 0 2
Now I want to fill this df with 0 but still keep the original data, like this:
value
days 1 2 3 4 ... 400
msno features
B num_50 0 0 0 0 ... 0
num_100 0 0 0 0 ... 0
C num_50 0 0 0 0 ... 0
num_100 0 0 1 0 ... 0
A num_50 0 0 0 0 ... 0
num_100 0 0 0 0 ... 2
I want to add all the columns from 1 to 400 and fill them with 0.
Could someone tell me how to do that?
By using reindex (demonstrated here with range(1, 20) to keep the output readable):
df.columns = df.columns.droplevel()
df.reindex(columns=list(range(1, 20))).fillna(0)
Out[414]:
days 1 2 3 4 5 6 7 8 9 10 11 12 13 \
msno features
A num_100 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B num_100 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C num_100 0 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
days 14 15 16 17 18 19
msno features
A num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0
B num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0
C num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0

Reshape dataframe to have the same index as another dataframe

I have two dataframes:
dayData
power_comparison final_average_delta_power calculated_power
1 0.0 0.0 0
2 0.0 0.0 0
3 0.0 0.0 0
4 0.0 0.0 0
5 0.0 0.0 0
7 0.0 0.0 0
and
historicPower
power
0 0.0
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
I'm trying to reindex the historicPower dataframe to have the same shape as the dayData dataframe (so in this example it would look like):
power
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
The dataframes in reality will be a lot larger, with different shapes.
I think you can use reindex if the index has no duplicates:
historicPower = historicPower.reindex(dayData.index)
print (historicPower)
power
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
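If dayData.index ever contains labels missing from historicPower, reindex inserts NaN rows; a sketch (my addition) filling those with zeros instead:
historicPower = historicPower.reindex(dayData.index, fill_value=0)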
