I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
time Group blocks
0 1 A 4
1 2 A 7
2 3 A 12
3 4 A 17
4 5 A 21
5 6 A 26
6 7 A 33
7 8 A 39
8 9 A 48
9 10 A 59
.... .... ....
36 35 A 231
37 1 B 1
38 2 B 1.5
39 3 B 3
40 4 B 5
41 5 B 6
.... .... ....
911 35 Z 349
This is a dataframe with multiple time series-esque data, from min=1 to max=35. Each Group has a relationship in the range time=1 to time=35 .
I would like to segment this dataframe into columns Group A, Group B, Group C, etc.
How does one "unconcatenate" this dataframe?
is that what you want?
In [84]: df.pivot_table(index='time', columns='Group')
Out[84]:
blocks
Group A B
time
1 4.0 1.0
2 7.0 1.5
3 12.0 3.0
4 17.0 5.0
5 21.0 6.0
6 26.0 NaN
7 33.0 NaN
8 39.0 NaN
9 48.0 NaN
10 59.0 NaN
35 231.0 NaN
data:
In [86]: df
Out[86]:
time Group blocks
0 1 A 4.0
1 2 A 7.0
2 3 A 12.0
3 4 A 17.0
4 5 A 21.0
5 6 A 26.0
6 7 A 33.0
7 8 A 39.0
8 9 A 48.0
9 10 A 59.0
36 35 A 231.0
37 1 B 1.0
38 2 B 1.5
39 3 B 3.0
40 4 B 5.0
41 5 B 6.0
Related
I am trying merge 2 dataframes.
df1
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
df2
Date B
07.01.2021 14
08.01.2021 27
09.01.2021 28
10.01.2021 29
11.01.2021 30
12.01.2021 31
13.01.2021 32
Both dataframes have one same row (although there could be several overlappings).
So I want to get df3 that looks as follows:
df3
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
08.01.2021 Nan 27 Nan
09.01.2021 Nan 28 Nan
10.01.2021 Nan 29 Nan
11.01.2021 Nan 30 Nan
12.01.2021 Nan 31 Nan
13.01.2021 Nan 32 Nan
I've tried
df3=df1.merge(df2, on='Date', how='outer') but it gives extra A,B,C columns. Could you give some idea how to get df3?
Thanks a lot.
merge outer without specifying on (default on is the intersection of columns between the two DataFrames in this case ['Date', 'B']):
df3 = df1.merge(df2, how='outer')
df3:
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
7 08.01.2021 NaN 27 NaN
8 09.01.2021 NaN 28 NaN
9 10.01.2021 NaN 29 NaN
10 11.01.2021 NaN 30 NaN
11 12.01.2021 NaN 31 NaN
12 13.01.2021 NaN 32 NaN
Assuming you always want to keep the first full version, you can concat the df2 on the end of df1 and drop duplicates on the Date column.
pd.concat([df1,df2]).drop_duplicates(subset='Date')
Output
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
1 08.01.2021 NaN 27 NaN
2 09.01.2021 NaN 28 NaN
3 10.01.2021 NaN 29 NaN
4 11.01.2021 NaN 30 NaN
5 12.01.2021 NaN 31 NaN
6 13.01.2021 NaN 32 NaN
I want to take the surface-weighted average of the columns in my dataframe. I have two surface-columns and two U-value-columns. I want to create an extra column 'U_av' (surface-weighted-average U-value) and U_av = (A1*U1 + A2*U2) / (A1+A2). If NaN occurs in one of the columns, NaN should be returned.
Initial df:
ID A1 A2 U1 U2
0 14 2 1.0 10.0 11
1 16 2 2.0 12.0 12
2 18 2 3.0 24.0 13
3 20 2 NaN 8.0 14
4 22 4 5.0 84.0 15
5 24 4 6.0 84.0 16
Desired Output:
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.33
1 16 2 2.0 12.0 12 12
2 18 2 3.0 24.0 13 17.4
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.66
5 24 4 6.0 84.0 16 43.2
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame({"ID": [14,16,18,20,22,24],
"A1": [2,2,2,2,4,4],
"U1": [10,12,24,8,84,84],
"A2": [1,2,3,np.nan,5,6],
"U2": [11,12,13,14,15,16]})
print(df)
#the mean of two columns U1 and U2 and dropping NaN is easy (U1+U2/2 in this case)
#but what to do for the surface-weighted mean (U_av = (A1*U1 + A2*U2) / (A1+A2))?
df.loc[:,'Umean'] = df[['U1','U2']].dropna().mean(axis=1)
EDIT:
adding to the solutions below:
df["U_av"] = (df.A1.mul(df.U1) + df.A2.mul(df.U2)).div(df[['A1','A2']].sum(axis=1))
Hope I got you correct:
df['U_av'] = (df['A1']*df['U1'] + df['A2']*df['U2']) / (df['A1']+df['A2'])
df
ID A1 U1 A2 U2 U_av
0 14 2 10 1.0 11 10.333333
1 16 2 12 2.0 12 12.000000
2 18 2 24 3.0 13 17.400000
3 20 2 8 NaN 14 NaN
4 22 4 84 5.0 15 45.666667
5 24 4 84 6.0 16 43.200000
Try this code:
numerator = df.A1.mul(df.U1) + (df.A2.mul(df.U2))
denominator = df.A1.add(df.A2)
df["U_av"] = numerator.div(denominator)
df
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.333333
1 16 2 2.0 12.0 12 12.000000
2 18 2 3.0 24.0 13 17.400000
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.666667
5 24 4 6.0 84.0 16 43.200000
How do I convert every numeric element of my pandas dataframe to an integer? I have not seen any documentation online for how to do so, which is surprising given Pandas is so popular...
If you have a data frame of ints, simply use astype directly.
df.astype(int)
If not, use select_dtypes first to select numeric columns.
df.select_dtypes(np.number).astype(int)
df = pd.DataFrame({'col1': [1.,2.,3.,4.], 'col2': [10.,20.,30.,40.]})
col1 col2
0 1.0 10.0
1 2.0 20.0
2 3.0 30.0
3 4.0 40.0
>>> df.astype(int)
col1 col2
0 1 10
1 2 20
2 3 30
3 4 40
You can use apply for this purpose:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(1.0, 20.0), 'B':np.arange(101.0, 120.0)})
print(df)
A B
0 1.0 101.0
1 2.0 102.0
2 3.0 103.0
3 4.0 104.0
4 5.0 105.0
5 6.0 106.0
6 7.0 107.0
7 8.0 108.0
8 9.0 109.0
9 10.0 110.0
10 11.0 111.0
11 12.0 112.0
12 13.0 113.0
13 14.0 114.0
14 15.0 115.0
15 16.0 116.0
16 17.0 117.0
17 18.0 118.0
18 19.0 119.0
df2 = df.apply(lambda a: [int(b) for b in a])
print(df2)
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
15 16 116
16 17 117
17 18 118
18 19 119
A better approach is to change the type at the level of series:
for col in df.columns:
if df[col].dtype == np.float64:
df[col] = df[col].astype('int')
print(df)
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
15 16 116
16 17 117
17 18 118
18 19 119
Try this:
column_types = dict(df.dtypes)
for column in df.columns:
if column_types[column] == 'float64':
df[column] = df[column].astype('int')
df[column] = df[column].apply(lambda x: int(x))
If I have a data frame like this:
1 2 3 4 5
1 2 4 5 NaN 3
2 3 5 6 1 2
3 3 1 1 1 1
How do I sum each row and replace the values in that row with the sum so I get something like this:
1 2 3 4 5
1 14 14 14 NaN 14
2 17 17 17 17 17
3 7 7 7 7 7
Use mask for replace all non missing values by sum:
df = df.mask(df.notnull(), df.sum(axis=1), axis=0)
print (df)
1 2 3 4 5
1 14 14 14 NaN 14
2 17 17 17 17.0 17
3 7 7 7 7.0 7
Or use numpy.broadcast_to with numpy.where:
arr = df.values
a = np.broadcast_to(np.nansum(arr, axis=1)[:, None], df.shape)
df = pd.DataFrame(np.where(np.isnan(arr), np.nan, a),
index=df.index, columns=df.columns)
#alternative
df[:] = np.where(np.isnan(arr), np.nan, a)
print (df)
1 2 3 4 5
1 14.0 14.0 14.0 NaN 14.0
2 17.0 17.0 17.0 17.0 17.0
3 7.0 7.0 7.0 7.0 7.0
Using mul
df.notnull().replace(False,np.nan).mul(df.sum(1),axis=0).astype(float)
1 2 3 4 5
1 14.0 14.0 14.0 NaN 14.0
2 17.0 17.0 17.0 17.0 17.0
3 7.0 7.0 7.0 7.0 7.0
I have a pandas dataframe df and I want the final output dataframe final_df as
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 12 a 25 4
3 13 a 29 5
In [18]: final_df
Out[18]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 0.0 5.0
In [19]: dates=[10,11,12,13,14]
That is as you can see I want to fill up the missing dates and fill the corresponding values with 0 for cost column but for column prev I want to fill it with the value from previous date. As the single date may contains multiple symbol I am using the pivot_table.
If I use the ffill
In [12]: df.pivot_table(index="Date",columns="symbol").reindex(dates,method="ffill").stack().reset_index()
Out[12]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 30.0 9.0
3 11 b 33.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 29.0 5.0
This gives almost final data structure (it has 7 rows as final_df) except for cost column where it copies previous data but I want 0 there.
So I tried to fill missing values of different columns with different method, but that gives a problem, like
In [13]: df1=df.pivot_table(index="Date",columns="symbol").reindex(dates)
In [14]: df1["cost"]=df1["cost"].fillna(0)
In [15]: df1["prev"]=df1["prev"].ffill()
In [16]: df1.stack().reset_index()
Out[16]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 12 b 0.0 10.0
6 13 a 29.0 5.0
7 13 b 0.0 10.0
8 14 a 0.0 5.0
9 14 b 0.0 10.0
As you can see in output there is data with symbol "b" for date 12,13,14 but I don't want that because in initial dataframe there was no data data with symbol "b" for date 12,13 and I want to keep it that way and also there must not be one in new date 14 as it follows 13.
So how can I solve this problem and get the final_df output?
EDIT
Here is another example to check the program.
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 14 a 29 5
In [18]: dates=range(10,17)
In [19]: final_df
Out[19]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 11 a 0 9
3 11 b 0 10
4 12 a 0 9
5 12 b 0 10
6 13 a 0 9
7 13 b 0 10
8 14 a 29 5
9 15 a 0 5
10 16 a 0 5
Solution
I have found this way to the problem. Here I using a trick that keeps track of the missing places in in the initial pivot_table and removes finally.
In [44]: df1=df.pivot_table(index="Date",columns='symbol',fill_value="missing").reindex(dates)
In [45]: df1["cost"]= df1["cost"].fillna(0)
In [46]: df1["prev"]=df1["prev"].ffill()
In [47]: df1.stack().replace(to_replace="missing",value=np.nan).dropna().reset_index()
Out[47]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 0.0 9.0
5 12 b 0.0 10.0
6 13 a 0.0 9.0
7 13 b 0.0 10.0
8 14 a 29.0 5.0
9 15 a 0.0 5.0
10 16 a 0.0 5.0