I have a dataframe that looks like this:
df
Out[262]:
klass varde
0 1.0 53.801840
1 1.0 58.524591
2 1.0 51.879057
3 1.0 48.391662
4 1.0 48.451202
5 1.0 53.072189
6 1.0 55.418486
7 1.0 56.561995
8 1.0 59.161386
9 1.0 53.033094
0 1.0 52.438421
1 1.0 53.554198
2 1.0 38.968125
3 1.0 53.895055
4 1.0 55.335374
5 1.0 48.885893
6 1.0 48.173335
7 1.0 45.083425
8 1.0 50.846878
9 1.0 47.132339
0 2.0 88.804034
1 2.0 105.083136
2 2.0 96.204701
3 2.0 94.890052
4 2.0 90.846715
5 2.0 99.433425
6 2.0 113.972773
7 2.0 94.816123
8 2.0 114.141583
9 2.0 91.235912
0 2.0 104.331863
1 2.0 106.283919
2 2.0 105.769039
3 2.0 97.678197
4 2.0 106.136627
5 2.0 90.884468
6 2.0 104.920153
7 2.0 81.463938
8 2.0 107.859278
9 2.0 90.248085
I want to reshape the dataframe so that the 'varde' values with the same index and the same value in column 'klass' are placed beside each other, like this:
klass varde varde
0 1.0 53.801840 52.438421
1 1.0 58.524591 53.554198
2 1.0 51.879057 38.968125
3 1.0 48.391662 53.895055
4 1.0 48.451202 55.335374
5 1.0 53.072189 48.885893
6 1.0 55.418486 48.173335
7 1.0 56.561995 45.083425
8 1.0 59.161386 50.846878
9 1.0 53.033094 47.132339
0 2.0 88.804034 104.331863
1 2.0 105.083136 106.283919
2 2.0 96.204701 105.769039
3 2.0 94.890052 97.678197
4 2.0 90.846715 106.136627
5 2.0 99.433425 90.884468
6 2.0 113.972773 104.920153
7 2.0 94.816123 81.463938
8 2.0 114.141583 107.859278
9 2.0 91.235912 90.248085
I'm really stuck on this...
We can chain several commands; first name the repeating index "id" so it can be grouped on:
>>> df.rename_axis("id").groupby(["id", "klass"])['varde'].apply(list).apply(pd.Series).reset_index()
id klass 0 1
0 0 1.0 53.801840 52.438421
1 0 2.0 88.804034 104.331863
2 1 1.0 58.524591 53.554198
3 1 2.0 105.083136 106.283919
4 2 1.0 51.879057 38.968125
5 2 2.0 96.204701 105.769039
6 3 1.0 48.391662 53.895055
7 3 2.0 94.890052 97.678197
8 4 1.0 48.451202 55.335374
9 4 2.0 90.846715 106.136627
10 5 1.0 53.072189 48.885893
11 5 2.0 99.433425 90.884468
12 6 1.0 55.418486 48.173335
13 6 2.0 113.972773 104.920153
14 7 1.0 56.561995 45.083425
15 7 2.0 94.816123 81.463938
16 8 1.0 59.161386 50.846878
17 8 2.0 114.141583 107.859278
18 9 1.0 53.033094 47.132339
19 9 2.0 91.235912 90.248085
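For reference, a pivot-based sketch of the same reshape; this assumes pandas >= 1.1 (for the list-valued pivot index) and that the repeating index is first named "id":
# number the repeated (id, klass) observations 0, 1, ... then pivot them into columns
tmp = df.rename_axis("id").reset_index()
tmp["obs"] = tmp.groupby(["id", "klass"]).cumcount()
wide = tmp.pivot(index=["id", "klass"], columns="obs", values="varde").reset_index()
The resulting value columns are named 0 and 1; renaming them is left out here.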
Assume we have a table that looks like the following:
id  week_num  people  date     level  a  b
1   1         20      1990101  1      2  3
1   2         30      1990108  1      2  3
1   3         40      1990115  1      2  3
1   5         100     1990129  1      2  3
1   7         100     1990212  1      2  3
week_num skips "4" and "6" because the corresponding "people" is 0. However, we want all the rows included, like the following table.
id  week_num  people  date     level  a  b
1   1         20      1990101  1      2  3
1   2         30      1990108  1      2  3
1   3         40      1990115  1      2  3
1   4         0       1990122  1      2  3
1   5         100     1990129  1      2  3
1   6         0       1990205  1      2  3
1   7         100     1990212  1      2  3
The date starts at 1990101, and each row adds 7 days to the previous one when the week_num is continuous (e.g. 1, 2 is continuous; 1, 3 is not).
How can we use Python (pandas) to achieve this goal?
Note: each id has 10 week_nums (1, 2, 3, ..., 10); the output must include all week_nums with the corresponding "people" and "date".
Update: other columns like "level", "a", "b" should stay the same even when we add the skipped week_nums.
This assumes that the date restarts at 1990-01-01 for each id:
import itertools
import pandas as pd

# reindex to get all combinations of ids and week numbers
df_full = (df.set_index(["id", "week_num"])
             .reindex(list(itertools.product([1, 2], range(1, 11))))
             .reset_index())
# fill people with zero
df_full = df_full.fillna({"people": 0})
# forward fill the other columns
cols_ffill = ["level", "a", "b"]
df_full[cols_ffill] = df_full[cols_ffill].ffill()
# reconstruct the date from the week number, starting from 1990-01-01 for each id
df_full["date"] = pd.to_datetime("1990-01-01") + (df_full.week_num - 1) * pd.Timedelta("1w")
df_full
# out:
id week_num people date level a b
0 1 1 20.0 1990-01-01 1.0 2.0 3.0
1 1 2 30.0 1990-01-08 1.0 2.0 3.0
2 1 3 40.0 1990-01-15 1.0 2.0 3.0
3 1 4 0.0 1990-01-22 1.0 2.0 3.0
4 1 5 100.0 1990-01-29 1.0 2.0 3.0
5 1 6 0.0 1990-02-05 1.0 2.0 3.0
6 1 7 100.0 1990-02-12 1.0 2.0 3.0
7 1 8 0.0 1990-02-19 1.0 2.0 3.0
8 1 9 0.0 1990-02-26 1.0 2.0 3.0
9 1 10 0.0 1990-03-05 1.0 2.0 3.0
10 2 1 0.0 1990-01-01 1.0 2.0 3.0
11 2 2 0.0 1990-01-08 1.0 2.0 3.0
12 2 3 0.0 1990-01-15 1.0 2.0 3.0
13 2 4 0.0 1990-01-22 1.0 2.0 3.0
14 2 5 0.0 1990-01-29 1.0 2.0 3.0
15 2 6 0.0 1990-02-05 1.0 2.0 3.0
16 2 7 0.0 1990-02-12 1.0 2.0 3.0
17 2 8 0.0 1990-02-19 1.0 2.0 3.0
18 2 9 0.0 1990-02-26 1.0 2.0 3.0
19 2 10 0.0 1990-03-05 1.0 2.0 3.0
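If the ids shouldn't be hardcoded, the same grid can be built from the data; a sketch under the same ten-week assumption:
# build the full (id, week_num) grid from the ids actually present in df
full_index = pd.MultiIndex.from_product(
    [df["id"].unique(), range(1, 11)], names=["id", "week_num"]
)
df_full = df.set_index(["id", "week_num"]).reindex(full_index).reset_index()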
I have a dataframe, for example:
    A  B  C
0   1
1   1
2   1
3   1  2
4   1  2
5   1  2
6      2  3
7      2  3
8      2  3
9         3
10        3
11        3
And I would like to remove the NaN values from each column to get this result:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
Is there an easy way to do that?
You can apply a custom sorting function to each column that doesn't actually sort numerically; it just moves all the NaN values to the end of the column (sorted is stable, so the non-NaN order is preserved). Then, dropna:
import numpy as np
df = df.apply(lambda x: sorted(x, key=lambda v: isinstance(v, float) and np.isnan(v))).dropna()
Output:
>>> df
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 1.0 2.0 3.0
5 1.0 2.0 3.0
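For a self-contained run, the frame can be rebuilt from the question's values; a sketch:
import numpy as np
import pandas as pd

# rebuild the question's frame: each column holds six consecutive non-NaN values
df = pd.DataFrame({
    "A": [1.0] * 6 + [np.nan] * 6,
    "B": [np.nan] * 3 + [2.0] * 6 + [np.nan] * 3,
    "C": [np.nan] * 6 + [3.0] * 6,
})
# stable sort per column: non-NaN keys (False) keep their order, NaN keys (True) sink
df = df.apply(lambda x: sorted(x, key=lambda v: isinstance(v, float) and np.isnan(v))).dropna()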
Given
>>> df
A B C
0 1.0 NaN NaN
1 1.0 NaN NaN
2 1.0 NaN NaN
3 1.0 2.0 NaN
4 1.0 2.0 NaN
5 1.0 2.0 NaN
6 NaN 2.0 3.0
7 NaN 2.0 3.0
8 NaN 2.0 3.0
9 NaN NaN 3.0
10 NaN NaN 3.0
11 NaN NaN 3.0
use the following; it works here because every column has the same number of non-NaN values, so the resulting arrays align:
>>> df.apply(lambda s: s.dropna().to_numpy())
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 1.0 2.0 3.0
5 1.0 2.0 3.0
I'm in this situation; my df is like this:
A B
0 0.0 2.0
1 3.0 4.0
2 NaN 1.0
3 2.0 NaN
4 NaN 1.0
5 4.8 NaN
6 NaN 1.0
and I want to apply this line of code:
df['A'] = df['B'].fillna(df['A'])
and I expect a workflow and final output like this:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 NaN NaN
4 1.0 1.0
5 NaN NaN
6 1.0 1.0
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
but I receive this error:
TypeError: Unsupported type Series
probably because each time there is an NA it tries to fill it with the whole series, and not with the single element at the same index in the B column.
I receive the same error with a syntax like this:
df['C'] = df['B'].fillna(df['A'])
so the problem does not seem to be that I'm first changing the values of A with the ones of B and then trying to fill the B NAs with the values of a column that is technically the same as B.
I'm in a Databricks environment and I'm working with Koalas dataframes, but they behave like the pandas ones.
Can you help me?
Another option
Suppose the following dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={
    'State': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath", "Doom-Dooma",
                   "Guntur", "Labac-Silchar", "Numaligarh", "Sibsagar", "Munger-Jamalpu"],
    'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
    'Apr-21': [121.1, 118.3, 131.5, np.nan, 128.2, 128.2, 115.4, 115.1, np.nan, 118.3],
})
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 NaN
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 NaN
9 10 Munger-Jamalpu 117.7 118.3
Then
df.loc[(df["Mar-21"].notnull()) & (df["Apr-21"].isna()), "Apr-21"] = df["Mar-21"]
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 123.7
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 117.7
9 10 Munger-Jamalpu 117.7 118.3
IIUC:
try with max() (note this relies on the B value being the larger one whenever both columns are present):
df['A'] = df[['A','B']].max(axis=1)
output of df:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
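For completeness, the usual pandas idiom for "take B where present, otherwise keep A" is combine_first; a sketch, not verified against Koalas:
# take the B value where it exists, fall back to the existing A value otherwise
df['A'] = df['B'].combine_first(df['A'])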
I've got df as follows:
a b
0 1 NaN
1 2 NaN
2 1 1.0
3 4 NaN
4 9 1.0
5 6 NaN
6 5 2.0
7 8 NaN
8 9 2.0
I'd like fill nan's only between numbers to get df like this:
a b
0 1 NaN
1 2 NaN
2 1 1.0
3 4 1.0
4 9 1.0
5 6 NaN
6 5 2.0
7 8 2.0
8 9 2.0
and then create two new dataframes:
a b
2 1 1.0
3 4 1.0
4 9 1.0
a b
6 5 2.0
7 8 2.0
8 9 2.0
meaning select all columns and rows with the filled-out NaNs only.
My idea for the first part, the one about filling out the NaNs, is to create a separate dataframe with row indexes like:
2 1.0
4 1.0
6 2.0
8 2.0
and based on that create ranges of row indexes to fill out.
My question is: is there a more pythonic function for this part, the one replacing the NaNs?
How about
df[df.b.ffill()==df.b.bfill()].ffill()
results in
# a b
# 2 1 1.0
# 3 4 1.0
# 4 9 1.0
# 6 5 2.0
# 7 8 2.0
# 8 9 2.0
Explanation:
df['c'] = df.b.ffill()
df['d'] = df.b.bfill()
# a b c d
# 0 1 NaN NaN 1.0
# 1 2 NaN NaN 1.0
# 2 1 1.0 1.0 1.0
# 3 4 NaN 1.0 1.0
# 4 9 1.0 1.0 1.0
# 5 6 NaN 1.0 2.0
# 6 5 2.0 2.0 2.0
# 7 8 NaN 2.0 2.0
# 8 9 2.0 2.0 2.0
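To cover the question's second part, the filled selection can then be split into one frame per b value; a sketch, assuming each distinct value of b marks one contiguous block:
# keep rows lying between equal b values, fill them, then split on b
out = df[df.b.ffill() == df.b.bfill()].ffill()
frames = [grp for _, grp in out.groupby('b')]
# frames[0] holds the b == 1.0 block, frames[1] the b == 2.0 block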
I have a dataset like this:
Block Vector
blk_-1 0.0 2 3, 0.5 3 8, 0.7 33 5
blk_-2 1.0 4 1, 2.0 2 4
blk_-3 0.0 0 0, 6.0 0 7
blk_-4 8.0 3 0, 7.0 5 8
blk_-5 9.0 0 5, 5.0 0 2, 5.2 3 2, 5.9 5 3
dat = {'Block': ['blk_-1', 'blk_-2', 'blk_-3', 'blk_-4', 'blk_-5'],
       'Vector': ['0.0 2 3, 0.5 3 8, 0.7 33 5',
                  '1.0 4 1, 2.0 2 4',
                  '0.0 0 0, 6.0 0 7',
                  '8.0 3 0, 7.0 5 8',
                  '9.0 0 5, 5.0 0 2, 5.2 3 2, 5.9 5 3']}
I want to get:
Block Vector
blk_-1 0.0 2 3
blk_-1 0.5 3 8
blk_-1 0.7 33 5
blk_-2 1.0 4 1
blk_-2 2.0 2 4
blk_-3 0.0 0 0
blk_-3 6.0 0 7
blk_-4 8.0 3 0
blk_-4 7.0 5 8
blk_-5 9.0 0 5
blk_-5 5.0 0 2
blk_-5 5.2 3 2
blk_-5 5.9 5 3
I tried:
df['Vector'] = df['Vector'].apply(lambda x : list(map(str, x.split(','))))
df.Vector.apply(pd.Series) \
.merge(df, left_index = True, right_index = True) \
.drop(["Vector"], axis = 1)
and got:
0 1 2 3 Block
0 0.0 2 3 0.5 3 8 0.7 33 5 NaN blk_-1
1 1.0 4 1 2.0 2 4 NaN NaN blk_-2
2 0.0 0 0 6.0 0 7 NaN NaN blk_-3
3 8.0 3 0 7.0 5 8 NaN NaN blk_-4
4 9.0 0 5 5.0 0 2 5.2 3 2 5.9 5 3 blk_-5
I'm stuck at this point. Waiting for your ideas and comments :)
You can use split, explode and join.
df[['Block']].join(df.Vector.str.split(',').explode())
Block Vector
0 blk_-1 0.0 2 3
0 blk_-1 0.5 3 8
0 blk_-1 0.7 33 5
1 blk_-2 1.0 4 1
1 blk_-2 2.0 2 4
2 blk_-3 0.0 0 0
2 blk_-3 6.0 0 7
3 blk_-4 8.0 3 0
3 blk_-4 7.0 5 8
4 blk_-5 9.0 0 5
4 blk_-5 5.0 0 2
4 blk_-5 5.2 3 2
4 blk_-5 5.9 5 3
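Note that splitting on ',' leaves a leading space on every piece after the first; if that matters, a strip can be chained on (a sketch):
# strip the leading space that str.split(',') leaves on the exploded pieces
df[['Block']].join(df.Vector.str.split(',').explode().str.strip())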
Solution for pandas 0.25+: split the column with Series.str.split and assign it back with DataFrame.assign, use DataFrame.explode, and last, for a default index, add DataFrame.reset_index with drop=True:
df = pd.DataFrame(dat)
df = (df.assign(Vector=df['Vector'].str.split(','))
        .explode('Vector')
        .reset_index(drop=True))
print (df)
Block Vector
0 blk_-1 0.0 2 3
1 blk_-1 0.5 3 8
2 blk_-1 0.7 33 5
3 blk_-2 1.0 4 1
4 blk_-2 2.0 2 4
5 blk_-3 0.0 0 0
6 blk_-3 6.0 0 7
7 blk_-4 8.0 3 0
8 blk_-4 7.0 5 8
9 blk_-5 9.0 0 5
10 blk_-5 5.0 0 2
11 blk_-5 5.2 3 2
12 blk_-5 5.9 5 3
Version for older pandas: use pop + split + stack + reset_index + rename to create a new Series, then join it back to the original:
df = (df.join(df.pop('Vector')
                .str.split(',', expand=True)
                .stack()
                .reset_index(level=1, drop=True)
                .rename('Vector'))
        .reset_index(drop=True))
print (df)
Block Vector
0 blk_-1 0.0 2 3
1 blk_-1 0.5 3 8
2 blk_-1 0.7 33 5
3 blk_-2 1.0 4 1
4 blk_-2 2.0 2 4
5 blk_-3 0.0 0 0
6 blk_-3 6.0 0 7
7 blk_-4 8.0 3 0
8 blk_-4 7.0 5 8
9 blk_-5 9.0 0 5
10 blk_-5 5.0 0 2
11 blk_-5 5.2 3 2
12 blk_-5 5.9 5 3
For versions lower than 0.25:
final = (df.merge(df['Vector'].str.split(',', expand=True)
                              .stack()
                              .reset_index(0, name='Vector'),
                  left_index=True, right_on='level_0', suffixes=('_x', ''))
           .drop(['level_0', 'Vector_x'], axis=1))
print(final)
Block Vector
0 blk_-1 0.0 2 3
1 blk_-1 0.5 3 8
2 blk_-1 0.7 33 5
0 blk_-2 1.0 4 1
1 blk_-2 2.0 2 4
0 blk_-3 0.0 0 0
1 blk_-3 6.0 0 7
0 blk_-4 8.0 3 0
1 blk_-4 7.0 5 8
0 blk_-5 9.0 0 5
1 blk_-5 5.0 0 2
2 blk_-5 5.2 3 2
3 blk_-5 5.9 5 3