How to "unconcatenate" a dataframe in Pandas?


I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
time Group blocks
0 1 A 4
1 2 A 7
2 3 A 12
3 4 A 17
4 5 A 21
5 6 A 26
6 7 A 33
7 8 A 39
8 9 A 48
9 10 A 59
.... .... ....
36 35 A 231
37 1 B 1
38 2 B 1.5
39 3 B 3
40 4 B 5
41 5 B 6
.... .... ....
911 35 Z 349
This DataFrame holds several time-series-like sequences stacked on top of each other, with time running from min=1 to max=35. Each Group covers the range time=1 to time=35.
I would like to split this DataFrame into one column per group: Group A, Group B, Group C, etc.
How does one "unconcatenate" this DataFrame?

Is that what you want?
In [84]: df.pivot_table(index='time', columns='Group')
Out[84]:
blocks
Group A B
time
1 4.0 1.0
2 7.0 1.5
3 12.0 3.0
4 17.0 5.0
5 21.0 6.0
6 26.0 NaN
7 33.0 NaN
8 39.0 NaN
9 48.0 NaN
10 59.0 NaN
35 231.0 NaN
data:
In [86]: df
Out[86]:
time Group blocks
0 1 A 4.0
1 2 A 7.0
2 3 A 12.0
3 4 A 17.0
4 5 A 21.0
5 6 A 26.0
6 7 A 33.0
7 8 A 39.0
8 9 A 48.0
9 10 A 59.0
36 35 A 231.0
37 1 B 1.0
38 2 B 1.5
39 3 B 3.0
40 4 B 5.0
41 5 B 6.0
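The reshaping above can be reproduced on a small frame. This is only a sketch: the six rows are made up to mirror the question's layout, and the names wide and by_group are chosen here for illustration. It uses pivot, which raises an error if a (time, Group) pair occurs more than once, whereas pivot_table would average such duplicates.

```python
import pandas as pd

# Small stand-in for the question's data (values invented for illustration).
df = pd.DataFrame({
    "time": [1, 2, 3, 1, 2, 3],
    "Group": ["A", "A", "A", "B", "B", "B"],
    "blocks": [4, 7, 12, 1, 1.5, 3],
})

# One column per group, indexed by time.
wide = df.pivot(index="time", columns="Group", values="blocks")
print(wide)

# Alternative: one separate DataFrame per group, keyed by group name.
by_group = {name: part.reset_index(drop=True) for name, part in df.groupby("Group")}
print(by_group["B"])
```

If a group has no row for some time value, the corresponding cell in wide is NaN, as in the output above.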

Related

Pandas merge 2 dataframes

I am trying to merge two DataFrames.
df1
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
df2
Date B
07.01.2021 14
08.01.2021 27
09.01.2021 28
10.01.2021 29
11.01.2021 30
12.01.2021 31
13.01.2021 32
Both DataFrames share one identical row (although there could be several overlapping rows).
So I want to get df3 that looks as follows:
df3
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
08.01.2021 NaN 27 NaN
09.01.2021 NaN 28 NaN
10.01.2021 NaN 29 NaN
11.01.2021 NaN 30 NaN
12.01.2021 NaN 31 NaN
13.01.2021 NaN 32 NaN
I've tried
df3 = df1.merge(df2, on='Date', how='outer')
but it gives extra A, B, C columns. Could you give me some idea of how to get df3?
Thanks a lot.

Do an outer merge without specifying on. By default, on is the intersection of the columns of the two DataFrames, which in this case is ['Date', 'B']:
df3 = df1.merge(df2, how='outer')
df3:
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
7 08.01.2021 NaN 27 NaN
8 09.01.2021 NaN 28 NaN
9 10.01.2021 NaN 29 NaN
10 11.01.2021 NaN 30 NaN
11 12.01.2021 NaN 31 NaN
12 13.01.2021 NaN 32 NaN
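To see why leaving out on helps, here is a small sketch comparing the two calls. The two-row frames are shortened stand-ins for df1 and df2, and the names default_on and date_only are made up for this example.

```python
import pandas as pd

# Shortened stand-ins for the question's df1 and df2.
df1 = pd.DataFrame({"Date": ["06.01.2021", "07.01.2021"],
                    "A": [6, 7], "B": [13, 14], "C": [19, 20]})
df2 = pd.DataFrame({"Date": ["07.01.2021", "08.01.2021"], "B": [14, 27]})

# No `on`: pandas joins on every shared column, here Date and B.
default_on = df1.merge(df2, how="outer")
print(default_on.columns.tolist())

# on="Date": B is not a join key, so it is duplicated with suffixes.
date_only = df1.merge(df2, on="Date", how="outer")
print(date_only.columns.tolist())
```

Note that joining on both Date and B only collapses the overlapping row when its B value is the same in both frames. If the values differed, the outer merge would keep two rows for that date.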

Assuming you always want to keep the first full version of a row, you can concat df2 onto the end of df1 and drop duplicates on the Date column.
pd.concat([df1,df2]).drop_duplicates(subset='Date')
Output
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
1 08.01.2021 NaN 27 NaN
2 09.01.2021 NaN 28 NaN
3 10.01.2021 NaN 29 NaN
4 11.01.2021 NaN 30 NaN
5 12.01.2021 NaN 31 NaN
6 13.01.2021 NaN 32 NaN
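The output above keeps the original index labels, so 1 to 6 appear twice. A possible variant, sketched here on shortened stand-ins for df1 and df2, renumbers the rows; the name combined is made up for this example.

```python
import pandas as pd

# Shortened stand-ins for the question's df1 and df2.
df1 = pd.DataFrame({"Date": ["06.01.2021", "07.01.2021"],
                    "A": [6, 7], "B": [13, 14], "C": [19, 20]})
df2 = pd.DataFrame({"Date": ["07.01.2021", "08.01.2021"], "B": [14, 27]})

combined = (
    pd.concat([df1, df2], ignore_index=True)
    .drop_duplicates(subset="Date", keep="first")  # df1's row wins on overlap
    .reset_index(drop=True)                        # renumber rows 0..n-1
)
print(combined)
```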

Get surface weighted average of multiple columns in pandas dataframe

I want to take the surface-weighted average of the columns in my dataframe. I have two surface-columns and two U-value-columns. I want to create an extra column 'U_av' (surface-weighted-average U-value) and U_av = (A1*U1 + A2*U2) / (A1+A2). If NaN occurs in one of the columns, NaN should be returned.
Initial df:
ID A1 A2 U1 U2
0 14 2 1.0 10.0 11
1 16 2 2.0 12.0 12
2 18 2 3.0 24.0 13
3 20 2 NaN 8.0 14
4 22 4 5.0 84.0 15
5 24 4 6.0 84.0 16
Desired Output:
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.33
1 16 2 2.0 12.0 12 12
2 18 2 3.0 24.0 13 17.4
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.66
5 24 4 6.0 84.0 16 43.2
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame({"ID": [14,16,18,20,22,24],
                   "A1": [2,2,2,2,4,4], "U1": [10,12,24,8,84,84],
                   "A2": [1,2,3,np.nan,5,6], "U2": [11,12,13,14,15,16]})
print(df)
# the mean of the two columns U1 and U2, dropping NaN, is easy ((U1+U2)/2 in this case)
# but what to do for the surface-weighted mean (U_av = (A1*U1 + A2*U2) / (A1+A2))?
df.loc[:,'Umean'] = df[['U1','U2']].dropna().mean(axis=1)
EDIT:
Adding to the solutions below:
df["U_av"] = (df.A1.mul(df.U1) + df.A2.mul(df.U2)).div(df[['A1','A2']].sum(axis=1))

I hope I understood you correctly:
df['U_av'] = (df['A1']*df['U1'] + df['A2']*df['U2']) / (df['A1']+df['A2'])
df
ID A1 U1 A2 U2 U_av
0 14 2 10 1.0 11 10.333333
1 16 2 12 2.0 12 12.000000
2 18 2 24 3.0 13 17.400000
3 20 2 8 NaN 14 NaN
4 22 4 84 5.0 15 45.666667
5 24 4 84 6.0 16 43.200000

Try this code:
numerator = df.A1.mul(df.U1) + (df.A2.mul(df.U2))
denominator = df.A1.add(df.A2)
df["U_av"] = numerator.div(denominator)
df
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.333333
1 16 2 2.0 12.0 12 12.000000
2 18 2 3.0 24.0 13 17.400000
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.666667
5 24 4 6.0 84.0 16 43.200000
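If there are more than two surface/U-value pairs, the same formula can be written over whole blocks of columns. The following is a sketch under the assumption that the surface columns and U-value columns are listed in matching order; the variable names areas and u_values are made up here. NumPy's sum propagates NaN, so a missing surface yields NaN as requested.

```python
import numpy as np
import pandas as pd

# First four rows of the question's data.
df = pd.DataFrame({"ID": [14, 16, 18, 20],
                   "A1": [2, 2, 2, 2], "U1": [10, 12, 24, 8],
                   "A2": [1, 2, 3, np.nan], "U2": [11, 12, 13, 14]})

# Column lists must be in matching order: A1 pairs with U1, A2 with U2.
areas = df[["A1", "A2"]].to_numpy(dtype=float)
u_values = df[["U1", "U2"]].to_numpy(dtype=float)

# Weighted average per row; any NaN in a row makes the result NaN.
df["U_av"] = (areas * u_values).sum(axis=1) / areas.sum(axis=1)
print(df)
```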

Convert dataframe of floats to integers in pandas?

How do I convert every numeric element of my pandas dataframe to an integer? I have not seen any documentation online for how to do so, which is surprising given Pandas is so popular...

If every column of your DataFrame is numeric, simply use astype directly.
df.astype(int)
If not, use select_dtypes first to select the numeric columns.
df.select_dtypes(np.number).astype(int)
df = pd.DataFrame({'col1': [1.,2.,3.,4.], 'col2': [10.,20.,30.,40.]})
col1 col2
0 1.0 10.0
1 2.0 20.0
2 3.0 30.0
3 4.0 40.0
>>> df.astype(int)
col1 col2
0 1 10
1 2 20
2 3 30
3 4 40
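When the frame also contains non-numeric columns, one option is to convert only the float columns and leave the rest alone. This is a sketch on made-up data (the label column and the names converted and rounded are assumptions for illustration). Keep in mind that astype to an integer type truncates toward zero, so round first if that is not what you want.

```python
import pandas as pd

# Made-up frame with two float columns and one text column.
df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [10.7, 20.2], "label": ["x", "y"]})

float_cols = df.select_dtypes(include="float64").columns

# Truncating conversion: 10.7 becomes 10.
converted = df.astype({col: "int64" for col in float_cols})
print(converted)

# Rounding before converting: 10.7 becomes 11.
rounded = df.assign(**{col: df[col].round().astype("int64") for col in float_cols})
print(rounded)
```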

You can use apply for this purpose:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(1.0, 20.0), 'B':np.arange(101.0, 120.0)})
print(df)
A B
0 1.0 101.0
1 2.0 102.0
2 3.0 103.0
3 4.0 104.0
4 5.0 105.0
5 6.0 106.0
6 7.0 107.0
7 8.0 108.0
8 9.0 109.0
9 10.0 110.0
10 11.0 111.0
11 12.0 112.0
12 13.0 113.0
13 14.0 114.0
14 15.0 115.0
15 16.0 116.0
16 17.0 117.0
17 18.0 118.0
18 19.0 119.0
df2 = df.apply(lambda a: [int(b) for b in a])
print(df2)
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
15 16 116
16 17 117
17 18 118
18 19 119
A better approach is to change the type at the level of series:
for col in df.columns:
    if df[col].dtype == np.float64:
        df[col] = df[col].astype('int')
print(df)
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
15 16 116
16 17 117
17 18 118
18 19 119

Try this:
column_types = dict(df.dtypes)
for column in df.columns:
    if column_types[column] == 'float64':
        df[column] = df[column].astype('int')
        df[column] = df[column].apply(lambda x: int(x))

How to sum each row and replace each value in row with sum?

If I have a data frame like this:
1 2 3 4 5
1 2 4 5 NaN 3
2 3 5 6 1 2
3 3 1 1 1 1
How do I sum each row and replace the values in that row with the sum so I get something like this:
1 2 3 4 5
1 14 14 14 NaN 14
2 17 17 17 17 17
3 7 7 7 7 7

Use mask to replace all non-missing values with the row sum:
df = df.mask(df.notnull(), df.sum(axis=1), axis=0)
print (df)
1 2 3 4 5
1 14 14 14 NaN 14
2 17 17 17 17.0 17
3 7 7 7 7.0 7
Or use numpy.broadcast_to with numpy.where:
arr = df.values
a = np.broadcast_to(np.nansum(arr, axis=1)[:, None], df.shape)
df = pd.DataFrame(np.where(np.isnan(arr), np.nan, a),
                  index=df.index, columns=df.columns)
# alternative
df[:] = np.where(np.isnan(arr), np.nan, a)
print (df)
1 2 3 4 5
1 14.0 14.0 14.0 NaN 14.0
2 17.0 17.0 17.0 17.0 17.0
3 7.0 7.0 7.0 7.0 7.0

Using mul:
df.notnull().replace(False,np.nan).mul(df.sum(1),axis=0).astype(float)
1 2 3 4 5
1 14.0 14.0 14.0 NaN 14.0
2 17.0 17.0 17.0 17.0 17.0
3 7.0 7.0 7.0 7.0 7.0
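None of the snippets above builds the input frame, so here is a self-contained sketch of the mask approach on the question's data. The frame is created with dtype=float for simplicity, and the name result is made up for this example.

```python
import numpy as np
import pandas as pd

# The question's data, with integer row and column labels.
df = pd.DataFrame(
    [[2, 4, 5, np.nan, 3],
     [3, 5, 6, 1, 2],
     [3, 1, 1, 1, 1]],
    index=[1, 2, 3], columns=[1, 2, 3, 4, 5], dtype=float,
)

row_sums = df.sum(axis=1)  # NaN is skipped by default

# Replace every non-missing cell with its row's sum; NaN cells stay NaN.
result = df.mask(df.notna(), row_sums, axis=0)
print(result)
```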

fill pandas missing places with different method for different column

I have a pandas DataFrame df, and I want the final output DataFrame final_df shown below:
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 12 a 25 4
3 13 a 29 5
In [18]: final_df
Out[18]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 0.0 5.0
In [19]: dates=[10,11,12,13,14]
That is, as you can see, I want to add the missing dates, filling the cost column with 0 and the prev column with the value from the previous date. Because a single date may contain multiple symbols, I am using pivot_table.
If I use ffill:
In [12]: df.pivot_table(index="Date",columns="symbol").reindex(dates,method="ffill").stack().reset_index()
Out[12]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 30.0 9.0
3 11 b 33.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 29.0 5.0
This gives almost the final data structure (it has 7 rows, like final_df), except for the cost column, where it copies the previous value but I want 0 there.
So I tried to fill the missing values of different columns with different methods, but that causes a problem:
In [13]: df1=df.pivot_table(index="Date",columns="symbol").reindex(dates)
In [14]: df1["cost"]=df1["cost"].fillna(0)
In [15]: df1["prev"]=df1["prev"].ffill()
In [16]: df1.stack().reset_index()
Out[16]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 12 b 0.0 10.0
6 13 a 29.0 5.0
7 13 b 0.0 10.0
8 14 a 0.0 5.0
9 14 b 0.0 10.0
As you can see, the output contains rows with symbol "b" for dates 12, 13 and 14. I don't want those, because the initial DataFrame had no data with symbol "b" for dates 12 and 13, and I want to keep it that way. There must also be no "b" row for the new date 14, since it follows 13.
So how can I solve this problem and get the final_df output?
EDIT
Here is another example to check the program.
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 14 a 29 5
In [18]: dates=range(10,17)
In [19]: final_df
Out[19]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 11 a 0 9
3 11 b 0 10
4 12 a 0 9
5 12 b 0 10
6 13 a 0 9
7 13 b 0 10
8 14 a 29 5
9 15 a 0 5
10 16 a 0 5
Solution
I have found this way to solve the problem. The trick is to mark the missing places in the initial pivot_table with a placeholder, fill the remaining gaps, and remove the marked places at the end.
In [44]: df1=df.pivot_table(index="Date",columns='symbol',fill_value="missing").reindex(dates)
In [45]: df1["cost"]= df1["cost"].fillna(0)
In [46]: df1["prev"]=df1["prev"].ffill()
In [47]: df1.stack().replace(to_replace="missing",value=np.nan).dropna().reset_index()
Out[47]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 0.0 9.0
5 12 b 0.0 10.0
6 13 a 0.0 9.0
7 13 b 0.0 10.0
8 14 a 29.0 5.0
9 15 a 0.0 5.0
10 16 a 0.0 5.0
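A possible alternative that avoids the string placeholder is to track with a boolean mask which (Date, symbol) pairs should exist. This is only a sketch using the first example's data. It assumes the first entry of dates is present in df, and the helper to_long and the names present, cost_filled and prev_filled are made up here.

```python
import pandas as pd

df = pd.DataFrame({"Date": [10, 10, 12, 13], "symbol": ["a", "b", "a", "a"],
                   "cost": [30, 33, 25, 29], "prev": [9, 10, 4, 5]})
dates = [10, 11, 12, 13, 14]

cost = df.pivot(index="Date", columns="symbol", values="cost")
prev = df.pivot(index="Date", columns="symbol", values="prev")

# True where a (Date, symbol) pair should appear: new dates inherit the
# set of symbols of the most recent date that exists in df.
present = cost.notna().reindex(dates, method="ffill")

cost_filled = cost.reindex(dates).fillna(0).where(present)
prev_filled = prev.reindex(dates).ffill().where(present)

def to_long(wide, name):
    # Hypothetical helper: wide (Date x symbol) table back to long format.
    return (wide.rename_axis(index="Date").reset_index()
                .melt(id_vars="Date", var_name="symbol", value_name=name))

final_df = (
    to_long(cost_filled, "cost")
    .merge(to_long(prev_filled, "prev"), on=["Date", "symbol"])
    .dropna()                          # drop pairs masked out by `present`
    .sort_values(["Date", "symbol"])
    .reset_index(drop=True)
)
print(final_df)
```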
