Pandas long reshape for several variables - python

I want to reshape my long dataframe to wide, splitting it by Session. In this example Session runs from 1 to 10.
Session Tube Window Counts Length
0 1 1 1 0.0 0.0
1 1 1 2 0.0 0.0
2 1 1 3 0.0 0.0
3 1 1 4 0.0 0.0
4 1 1 5 0.0 0.0
... ... ... ... ... ...
17995 10 53 36 0.0 0.0
17996 10 53 37 0.0 0.0
17997 10 53 38 0.0 0.0
17998 10 53 39 0.0 0.0
17999 10 53 40 0.0 0.0
What I am expecting is something like:
Session Tube Window Counts_1 Length_1 Session Counts_2 Length_2
0 1 1 1 0.0 0.0 0 2 0.0 0.0
1 1 1 2 0.0 0.0 1 2 0.0 0.0
2 1 1 3 0.0 0.0 2 2 0.0 0.0
3 1 1 4 0.0 0.0 3 2 0.0 0.0
4 1 1 5 0.0 0.0 4 2 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
17995 10 53 36 0.0 0.0
I could not find a solution. What I tried leads to a completely wide dataset:
df['idx'] = df.groupby('Session').cumcount() + 1
df = df.pivot_table(index=['Session'], columns='idx',
                    values=['Counts', 'Length'], aggfunc='first')
df = df.sort_index(axis=1, level=1)
df.columns = [f'{x}_{y}' for x, y in df.columns]
df = df.reset_index()
Session Counts_1 Length_1 Counts_2 Length_2 Counts_3 Length_3 Counts_4 Length_4 Counts_5 Length_5 ... Length_1795 Counts_1796 Length_1796 Counts_1797 Length_1797 Counts_1798 Length_1798 Counts_1799 Length_1799 Counts_1800 Length_1800
0 1 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000
1 2 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000
2 3 0.0 6.892889 0.0 2.503830 0.0 3.108580 0.0 5.188438 0.0 9.779242 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000
3 4 1.0 12.787159 0.0 13.847412 7.0 44.928269 0.0 48.511435 2.0 33.264356 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000
4 5 0.0 13.345436 2.0 27.415005 20.0 83.130315 19.0 85.475996 2.0 10.147958 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000
5 6 2.0 13.141503 8.0 22.965002 5.0 48.737279 15.0 85.403915 1.0 17.414609 ... 0.000000 6.0 12.399834 0.0 0.710808 0.0 0.000000 0.0 1.661978 0.0 0.000000
6 7 1.0 7.852842 0.0 13.613426 14.0 46.148978 23.0 87.446535 0.0 13.759176 ... 2.231295 8.0 39.022340 1.0 7.304392 3.0 9.228959 0.0 6.885822 0.0 1.606200
7 8 0.0 0.884018 3.0 35.323813 8.0 32.846301 10.0 71.691744 0.0 4.310296 ... 2.753615 6.0 25.003670 6.0 22.113324 0.0 0.615790 0.0 11.812815 2.0 9.991712
8 9 4.0 24.700817 13.0 31.637755 3.0 30.312104 5.0 50.490115 0.0 3.830024 ... 5.977912 11.0 44.305738 1.0 13.523643 0.0 1.374856 1.0 9.066218 1.0 8.376995
9 10 0.0 17.651236 10.0 44.311858 29.0 55.415964 12.0 43.457016 1.0 41.503212 ... 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000

You could try to pivot your dataframe after building a custom index per session:
df2 = df.assign(index=df.groupby('Session').cumcount()).pivot(
    index='index', columns='Session',
    values=['Tube', 'Window', 'Counts', 'Length']).rename_axis(index=None)
With your sample data it gives:
Tube Window Counts Length
Session 1 10 1 10 1 10 1 10
0 1.0 53.0 1.0 36.0 0.0 0.0 0.0 0.0
1 1.0 53.0 2.0 37.0 0.0 0.0 0.0 0.0
2 1.0 53.0 3.0 38.0 0.0 0.0 0.0 0.0
3 1.0 53.0 4.0 39.0 0.0 0.0 0.0 0.0
4 1.0 53.0 5.0 40.0 0.0 0.0 0.0 0.0
Not bad, but we have a MultiIndex for the columns, and in the wrong order. Let us go further:
df2.columns = df2.columns.to_flat_index()
df2 = df2.reindex(columns=sorted(df2.columns, key=lambda x: x[1]))
We now have:
(Tube, 1) (Window, 1) ... (Counts, 10) (Length, 10)
0 1.0 1.0 ... 0.0 0.0
1 1.0 2.0 ... 0.0 0.0
2 1.0 3.0 ... 0.0 0.0
3 1.0 4.0 ... 0.0 0.0
4 1.0 5.0 ... 0.0 0.0
Last step:
df2 = df2.rename(columns=lambda x: '_'.join(str(i) for i in x))
to finally get:
Tube_1 Window_1 Counts_1 ... Window_10 Counts_10 Length_10
0 1.0 1.0 0.0 ... 36.0 0.0 0.0
1 1.0 2.0 0.0 ... 37.0 0.0 0.0
2 1.0 3.0 0.0 ... 38.0 0.0 0.0
3 1.0 4.0 0.0 ... 39.0 0.0 0.0
4 1.0 5.0 0.0 ... 40.0 0.0 0.0
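As a side note, the reorder-then-flatten steps can also be written with sort_index before flattening. A sketch, assuming the same df2 produced by the pivot above:
df2 = df2.sort_index(axis=1, level='Session', sort_remaining=False)
df2.columns = [f'{name}_{session}' for name, session in df2.columns]
sort_remaining=False keeps the Tube/Window/Counts/Length order intact within each session, so the result matches the key-based sort shown above.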


Slice multi-index pandas dataframe by date

Say I have the following multi-index dataframe:
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo', 'foo']),
          pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
                          '2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'])]
df = pd.DataFrame(np.zeros((8, 4)), index=arrays)
0 1 2 3
bar 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 0.0 0.0 0.0 0.0
2020-01-04 0.0 0.0 0.0 0.0
foo 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 0.0 0.0 0.0 0.0
2020-01-04 0.0 0.0 0.0 0.0
How do I select only the part of this dataframe where the first index level is 'bar' and the date is greater than 2020-01-02, so that I can add 1 to this part?
To be clearer, the expected output would be:
0 1 2 3
bar 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 1.0 1.0 1.0 1.0
2020-01-04 1.0 1.0 1.0 1.0
foo 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 0.0 0.0 0.0 0.0
2020-01-04 0.0 0.0 0.0 0.0
I managed to slice it by the first index level:
df.loc['bar']
But then I am not able to apply the condition on the date.
It is possible to compare each index level separately, combine the masks, and then set 1; the : in DataFrame.loc selects all columns:
m1 = df.index.get_level_values(0) == 'bar'
m2 = df.index.get_level_values(1) > '2020-01-02'
df.loc[m1 & m2, :] = 1
print(df)
0 1 2 3
bar 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 1.0 1.0 1.0 1.0
2020-01-04 1.0 1.0 1.0 1.0
foo 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 0.0 0.0 0.0 0.0
2020-01-04 0.0 0.0 0.0 0.0
# give your index levels names
df.index = df.index.set_names(["names", "dates"])
# get the indices that match the condition
index = df.query('names == "bar" and dates > "2020-01-02"').index
# assign 1 to the relevant rows
# IndexSlice makes slicing MultiIndexes easier; here, though, it might be seen as overkill
idx = pd.IndexSlice
df.loc[idx[index], :] = 1
0 1 2 3
names dates
bar 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 1.0 1.0 1.0 1.0
2020-01-04 1.0 1.0 1.0 1.0
foo 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 0.0 0.0 0.0 0.0
2020-01-04 0.0 0.0 0.0 0.0

add new elements to existing cosine similarity matrix

I calculated a cosine similarity matrix with cosine_similarity from sklearn.metrics.pairwise.
Matrix:
2414514 413915 419480 473104 534621 609406 654913 654914 \
2414514 1.000000 0.0 0.0 0.0 0.0 0.0 0.755929 0.755929
413915 0.000000 1.0 0.0 0.0 0.0 1.0 0.000000 0.000000
419480 0.000000 0.0 1.0 1.0 1.0 0.0 0.000000 0.000000
473104 0.000000 0.0 1.0 1.0 1.0 0.0 0.000000 0.000000
534621 0.000000 0.0 1.0 1.0 1.0 0.0 0.000000 0.000000
609406 0.000000 1.0 0.0 0.0 0.0 1.0 0.000000 0.000000
654913 0.755929 0.0 0.0 0.0 0.0 0.0 1.000000 1.000000
654914 0.755929 0.0 0.0 0.0 0.0 0.0 1.000000 1.000000
668130 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000
668743 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000
679691 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000
707669 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000
749049 0.000000 1.0 0.0 0.0 0.0 1.0 0.000000 0.000000
770946 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000
668130 668743 679691 707669 749049 770946
2414514 0.0 0.0 0.0 0.0 0.0 0.0
413915 0.0 0.0 0.0 0.0 1.0 0.0
419480 0.0 0.0 0.0 0.0 0.0 0.0
473104 0.0 0.0 0.0 0.0 0.0 0.0
534621 0.0 0.0 0.0 0.0 0.0 0.0
609406 0.0 0.0 0.0 0.0 1.0 0.0
654913 0.0 0.0 0.0 0.0 0.0 0.0
654914 0.0 0.0 0.0 0.0 0.0 0.0
668130 1.0 1.0 0.0 1.0 0.0 0.0
668743 1.0 1.0 0.0 1.0 0.0 0.0
679691 0.0 0.0 1.0 0.0 0.0 1.0
707669 1.0 1.0 0.0 1.0 0.0 0.0
749049 0.0 0.0 0.0 0.0 1.0 0.0
770946 0.0 0.0 1.0 0.0 0.0 1.0
But every day I get new items. Is there a way to update the existing matrix with the new items without recalculating all items?
You can compute only the similarity of the newly added vectors against the already existing ones, use the fact that cosine similarity is symmetric, and concatenate the result to the previous matrix:
****X
****X
****X
****X
XXXX1
where the *s are the original similarity matrix, the Xs are the newly computed similarities between the new item and the existing ones, and the bottom-right entry is the new item's similarity with itself (which is 1).
According to the documentation:
sklearn.metrics.pairwise.cosine_similarity(X, Y=None, dense_output=True)
Compute cosine similarity between samples in X and Y.
This means that every day you can compute the new data Y against the already existing X, for which you already have the cosine_similarity, and then combine the results.
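A minimal sketch of that incremental update, assuming hypothetical arrays old_vectors / new_vectors that hold the feature vectors behind the existing matrix old_sim and today's additions:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def extend_similarity_matrix(old_sim, old_vectors, new_vectors):
    # similarities of the new rows against the old ones: shape (n_new, n_old)
    cross = cosine_similarity(new_vectors, old_vectors)
    # similarities among the new rows themselves: shape (n_new, n_new)
    new_block = cosine_similarity(new_vectors)
    # assemble the block matrix [[old_sim, cross.T], [cross, new_block]]
    top = np.hstack([old_sim, cross.T])
    bottom = np.hstack([cross, new_block])
    return np.vstack([top, bottom])
Only the cross block and the (usually small) new-vs-new block are computed; the original matrix is reused as-is.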

Getting column name where a condition matches in a row

I have a pandas dataframe which looks like this:
A B C D E F G H I
1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Now, for each row, I have to check which column contains 1 and then record this column name in a new column. The final dataframe would look like this:
A B C D E F G H I IsTrue
1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 B
2 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 A
3 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 B
Is there a faster, more pythonic way to do it?
Here's one way using DataFrame.dot:
df['isTrue'] = df.astype(bool).dot(df.columns)
A B C D E F G H I isTrue
1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 B
2 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 A
3 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 B
For even better performance you can use:
df['isTrue'] = df.columns[df.to_numpy().argmax(1)]
What you described is exactly the definition of idxmax:
>>> df.idxmax(1)
1 B
2 A
3 B
dtype: object
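And to record it in the requested new column:
df['IsTrue'] = df.idxmax(axis=1)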

"Cannot reindex from a duplicate axis" when groupby.apply() on MultiIndex columns

I'm playing around with computing subtotals within a DataFrame that looks like this (note the MultiIndex):
0 1 2 3 4 5
A 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
B 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
I can successfully add the subtotals with the following code:
(
    df
    .groupby(level=0)
    .apply(
        lambda df: pd.concat(
            [df.xs(df.name), df.sum().to_frame('Total').T]
        )
    )
)
And it looks like this:
0 1 2 3 4 5
A 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
Total 0.0 0.0 0.0 0.0 0.0 0.0
B 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
Total 0.0 0.0 0.0 0.0 0.0 0.0
However, when I work with the transposed DataFrame, it does not work. The DataFrame looks like:
A B
1 2 1 2
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
And I use the following code:
(
    df2
    .groupby(level=0, axis=1)
    .apply(
        lambda df: pd.concat(
            [df.xs(df.name, axis=1), df.sum(axis=1).to_frame('Total')],
            axis=1
        )
    )
)
I have specified axis=1 everywhere I can think of, but I get an error:
ValueError: cannot reindex from a duplicate axis
I would expect the output to be:
A B
1 2 Total 1 2 Total
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0
Is this a bug? Or have I not specified the axis correctly everywhere? As a workaround, I can obviously transpose the DataFrame, produce the totals, and transpose back, but I'd like to know why it's not working here, and submit a bug report if necessary.
The problem DataFrame can be generated with:
df2 = pd.DataFrame(
    np.zeros([6, 4]),
    columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
)
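For reference, the transpose workaround mentioned above can be written as a sketch that reuses the row-wise code which does work:
totals = (
    df2.T
    .groupby(level=0)
    .apply(lambda d: pd.concat([d.xs(d.name), d.sum().to_frame('Total').T]))
    .T
)
This produces the expected column layout by computing the subtotals on the rows of the transpose and flipping the result back.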

Joining dataframes in Pandas

I'm getting the error:
TypeError: 'method' object is not subscriptable
When I try to join two Pandas dataframes...
Can't see what is wrong with them!
For info, I'm doing the Kaggle Titanic problem:
titanic_df.head()
Out[102]:
Survived Pclass SibSp Parch Fare has_cabin C Q title Person
0 0 3 1 0 7.2500 1 0.0 0.0 Mr male
1 1 1 1 0 71.2833 0 1.0 0.0 Mrs female
2 1 3 0 0 7.9250 1 0.0 0.0 Miss female
3 1 1 1 0 53.1000 0 0.0 0.0 Mrs female
4 0 3 0 0 8.0500 1 0.0 0.0 Mr male
In [103]:
sns.barplot(x=titanic_df["Survived"],y=titanic_df["title"])
Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b5edb00>
In [125]:
title_dummies = pd.get_dummies(titanic_df["title"])
title_dummies = title_dummies.drop([" Don", " Rev", " Dr", " Col", " Capt", " Jonkheer", " Major", " Mr"], axis=1)
title_dummies.head()
Out[125]:
Lady Master Miss Mlle Mme Mrs Ms Sir the Countess
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [126]:
titanic_df=title_dummies.join[titanic_df]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-126-a0e0fe306754> in <module>()
----> 1 titanic_df=title_dummies.join[titanic_df]
TypeError: 'method' object is not subscriptable
You need to change [] to () in the DataFrame.join call, since join is a method:
titanic_df = title_dummies.join(titanic_df)
print(titanic_df)
Lady Master Miss Mlle Mme Mrs Ms Sir the Countess Survived \
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1
3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
Pclass SibSp Parch Fare has_cabin C Q title Person
0 3 1 0 7.2500 1 0.0 0.0 Mr male
1 1 1 0 71.2833 0 1.0 0.0 Mrs female
2 3 0 0 7.9250 1 0.0 0.0 Miss female
3 1 1 0 53.1000 0 0.0 0.0 Mrs female
4 3 0 0 8.0500 1 0.0 0.0 Mr male
