I have a pandas dataframe df and I want the final output dataframe final_df to look like this:
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 12 a 25 4
3 13 a 29 5
In [18]: final_df
Out[18]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 0.0 5.0
In [19]: dates=[10,11,12,13,14]
That is, as you can see, I want to fill in the missing dates, filling the corresponding values with 0 for the cost column, but for the prev column I want to fill with the value from the previous date. Since a single date may contain multiple symbols, I am using pivot_table.
If I use ffill:
In [12]: df.pivot_table(index="Date",columns="symbol").reindex(dates,method="ffill").stack().reset_index()
Out[12]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 30.0 9.0
3 11 b 33.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 29.0 5.0
This gives almost the final data structure (it has 7 rows, like final_df), except for the cost column, where it copies the previous data but I want 0 there.
So I tried to fill the missing values of each column with a different method, but that creates a new problem:
In [13]: df1=df.pivot_table(index="Date",columns="symbol").reindex(dates)
In [14]: df1["cost"]=df1["cost"].fillna(0)
In [15]: df1["prev"]=df1["prev"].ffill()
In [16]: df1.stack().reset_index()
Out[16]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 12 b 0.0 10.0
6 13 a 29.0 5.0
7 13 b 0.0 10.0
8 14 a 0.0 5.0
9 14 b 0.0 10.0
As you can see, the output contains rows with symbol "b" for dates 12, 13 and 14, but I don't want those: in the initial dataframe there was no data with symbol "b" for dates 12 and 13, and I want to keep it that way; likewise there must be none on the new date 14, as it follows 13.
So how can I solve this problem and get the final_df output?
EDIT
Here is another example to check the program.
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 14 a 29 5
In [18]: dates=range(10,17)
In [19]: final_df
Out[19]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 11 a 0 9
3 11 b 0 10
4 12 a 0 9
5 12 b 0 10
6 13 a 0 9
7 13 b 0 10
8 14 a 29 5
9 15 a 0 5
10 16 a 0 5
Solution
I have found this way to solve the problem. It uses a trick: mark the places that are missing in the initial pivot_table with a sentinel value, and remove them at the end.
In [44]: df1=df.pivot_table(index="Date",columns='symbol',fill_value="missing").reindex(dates)
In [45]: df1["cost"]= df1["cost"].fillna(0)
In [46]: df1["prev"]=df1["prev"].ffill()
In [47]: df1.stack().replace(to_replace="missing",value=np.nan).dropna().reset_index()
Out[47]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 0.0 9.0
5 12 b 0.0 10.0
6 13 a 0.0 9.0
7 13 b 0.0 10.0
8 14 a 29.0 5.0
9 15 a 0.0 5.0
10 16 a 0.0 5.0
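For reference, here are the same steps as a self-contained script, using the data from the second example above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Date": [10, 10, 14],
                   "symbol": ["a", "b", "a"],
                   "cost": [30, 33, 29],
                   "prev": [9, 10, 5]})
dates = range(10, 17)

# The sentinel marks cells that are absent in the *original* pivot,
# so they can be told apart from the rows added for the new dates.
df1 = df.pivot_table(index="Date", columns="symbol",
                     fill_value="missing").reindex(dates)
df1["cost"] = df1["cost"].fillna(0)   # new dates: cost becomes 0
df1["prev"] = df1["prev"].ffill()     # new dates: prev carried forward
# ffill also propagates the sentinel, so every (Date, symbol) pair that
# should not exist is dropped in one go.
final_df = (df1.stack()
               .replace(to_replace="missing", value=np.nan)
               .dropna()
               .reset_index())
print(final_df)
```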
Related
I have a df with a column
Value
3
8
10
15
I would like to obtain a dataframe that has 'interpolated' From and To values, as follows:
Value From To
Nan 0 3
3 3 5
Nan 5 8
8 8 10
10 10 12
Nan 12 15
15 15 17
The increment is always 2 when a value exists.
I found a solution
This is the starting dataframe:
df = pd.DataFrame({'Value' : [0, 2,5,7,9,14,21,25]})
df['From'] = df['Value']
df['To'] = df['Value'] + 2
>>> print(df)
Value From To
0 0 0 2
1 2 2 4
2 5 5 7
3 7 7 9
4 9 9 11
5 14 14 16
6 21 21 23
7 25 25 27
And this code fills in the From and To columns so that there are no empty intervals:
new_rows = []
for i in range(len(df) - 1):
    if df.loc[i, 'To'] != df.loc[i + 1, 'Value']:
        # gap between this row's To and the next row's Value
        new_row = {'Value': np.nan, 'From': df.loc[i, 'To'], 'To': df.loc[i + 1, 'Value']}
    else:
        new_row = {'Value': np.nan, 'From': np.nan, 'To': np.nan}
    new_rows.append(new_row)
# DataFrame.append was removed in pandas 2.0; use pd.concat instead
df1 = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
df1.dropna(how='all', inplace=True)
df1.sort_values(by=['To'], inplace=True)
df1
>>> df1
Value From To
0 0.0 0.0 2.0
1 2.0 2.0 4.0
9 NaN 4.0 5.0
2 5.0 5.0 7.0
3 7.0 7.0 9.0
4 9.0 9.0 11.0
12 NaN 11.0 14.0
5 14.0 14.0 16.0
13 NaN 16.0 21.0
6 21.0 21.0 23.0
14 NaN 23.0 25.0
7 25.0 25.0 27.0
It is probably improvable....
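One possible improvement is a vectorized sketch that builds the gap rows without the explicit loop, assuming the same starting df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [0, 2, 5, 7, 9, 14, 21, 25]})
df['From'] = df['Value']
df['To'] = df['Value'] + 2

nxt = df['Value'].shift(-1)            # next row's Value
mask = df['To'].ne(nxt) & nxt.notna()  # rows whose To leaves a gap
gaps = pd.DataFrame({'Value': np.nan,
                     'From': df.loc[mask, 'To'],
                     'To': nxt[mask]})
df1 = (pd.concat([df, gaps], ignore_index=True)
         .sort_values('To')
         .reset_index(drop=True))
print(df1)
```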
I want to take the surface-weighted average of the columns in my dataframe. I have two surface-columns and two U-value-columns. I want to create an extra column 'U_av' (surface-weighted-average U-value) and U_av = (A1*U1 + A2*U2) / (A1+A2). If NaN occurs in one of the columns, NaN should be returned.
Initial df:
ID A1 A2 U1 U2
0 14 2 1.0 10.0 11
1 16 2 2.0 12.0 12
2 18 2 3.0 24.0 13
3 20 2 NaN 8.0 14
4 22 4 5.0 84.0 15
5 24 4 6.0 84.0 16
Desired Output:
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.33
1 16 2 2.0 12.0 12 12
2 18 2 3.0 24.0 13 17.4
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.66
5 24 4 6.0 84.0 16 43.2
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame({"ID": [14,16,18,20,22,24],
"A1": [2,2,2,2,4,4],
"U1": [10,12,24,8,84,84],
"A2": [1,2,3,np.nan,5,6],
"U2": [11,12,13,14,15,16]})
print(df)
# the plain mean of the two columns U1 and U2, dropping NaN, is easy ((U1+U2)/2 in this case),
# but what to do for the surface-weighted mean (U_av = (A1*U1 + A2*U2) / (A1+A2))?
df.loc[:,'Umean'] = df[['U1','U2']].dropna().mean(axis=1)
EDIT:
adding to the solutions below:
df["U_av"] = (df.A1.mul(df.U1) + df.A2.mul(df.U2)).div(df[['A1','A2']].sum(axis=1))
Hope I got you correct:
df['U_av'] = (df['A1']*df['U1'] + df['A2']*df['U2']) / (df['A1']+df['A2'])
df
ID A1 U1 A2 U2 U_av
0 14 2 10 1.0 11 10.333333
1 16 2 12 2.0 12 12.000000
2 18 2 24 3.0 13 17.400000
3 20 2 8 NaN 14 NaN
4 22 4 84 5.0 15 45.666667
5 24 4 84 6.0 16 43.200000
Try this code:
numerator = df.A1.mul(df.U1) + (df.A2.mul(df.U2))
denominator = df.A1.add(df.A2)
df["U_av"] = numerator.div(denominator)
df
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.333333
1 16 2 2.0 12.0 12 12.000000
2 18 2 3.0 24.0 13 17.400000
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.666667
5 24 4 6.0 84.0 16 43.200000
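If you prefer NumPy, numpy.average accepts per-element weights, and a NaN in any weight or value propagates to the result, which is the behaviour asked for. A sketch using the same df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [14, 16, 18, 20, 22, 24],
                   "A1": [2, 2, 2, 2, 4, 4],
                   "U1": [10, 12, 24, 8, 84, 84],
                   "A2": [1, 2, 3, np.nan, 5, 6],
                   "U2": [11, 12, 13, 14, 15, 16]})

# Row-wise weighted average: (A1*U1 + A2*U2) / (A1 + A2)
df["U_av"] = np.average(df[["U1", "U2"]].to_numpy(),
                        weights=df[["A1", "A2"]].to_numpy(),
                        axis=1)
print(df)
```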
Hello, I have a data set that I need to clean.
I am processing it in 3 different ways (forward fill, backward fill, and drop).
My code works as is, but I have a problem that I have been trying to solve with no luck:
If "forward fill" is enabled and the first row contains -1, then the drop function should be run, and if "backward fill" is enabled and the last row contains -1, then the drop function should be run as well.
I tried a lot of if statements, but I always get the "the truth value of a Series is ambiguous" error.
def load_measurements(filename, fmode):
    RD = pd.read_csv(filename, names=["Year","Month","Day","hour","Minute","Second","Zone1","Zone2","Zone3","Zone4"])
    # Forward fill replaces rows with -1 with the previous data row
    if fmode == 'forward fill':
        if RD.loc[6, 0] == -1:
            RD = "blah"
        else:
            RD = RD.replace(-1, np.nan).ffill()
    # Drop deletes all rows where the data is -1
    if fmode == "drop":
        RD = RD[RD.Zone2 != -1]
        RD = RD[RD.Zone1 != -1]
        RD = RD[RD.Zone3 != -1]
        RD = RD[RD.Zone4 != -1]
        print(RD)
    # Backward fill replaces rows with -1 with the next data row
    if fmode == 'backward fill':
        RD = RD.replace(-1, np.nan).bfill()
    # Splits RD into tvec and data
    tvec = RD.iloc[:, 0:6]
    data = RD.iloc[:, 6:]
    return (data, tvec)

print(load_measurements("test.csv", 'forward fill'))
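The "truth value of a Series is ambiguous" error comes from putting a whole Series in an if statement; reduce it to a single boolean first. A minimal sketch (shortened to the four zone columns):

```python
import pandas as pd

RD = pd.DataFrame({"Zone1": [-1, 5, 3], "Zone2": [2, 6, -1],
                   "Zone3": [3, 1, -1], "Zone4": [4, 8, 6]})

# if RD.iloc[0] == -1:  ->  ValueError: truth value of a Series is ambiguous
first_row_has_missing = (RD.iloc[0] == -1).any()   # reduce to one bool
if first_row_has_missing:
    RD = RD[(RD != -1).all(axis=1)]                # fall back to dropping
print(RD)
```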
As I finally realized, you want something like this:
Note that I replaced pandas.DataFrame.bfill() and pandas.DataFrame.ffill() with pandas.DataFrame.fillna(method="bfill") and pandas.DataFrame.fillna(method="ffill") respectively; these are supposed to be aliases, but bfill() didn't give the required result for me, so I guess it is some bug.
import numpy as np
import pandas as pd

def load_measurements(filename, fmode):
    RD = pd.read_csv(filename, names=["Year","Month","Day","hour","Minute","Second","Zone1","Zone2","Zone3","Zone4"], sep=',')
    print(RD)
    RD = RD.replace(-1, np.nan)
    # Forward fill replaces NaN cells with the previous data row
    if fmode == 'forward fill':
        if RD.iloc[0, 6:].isna().sum() == 0:
            RD = RD.fillna(method='ffill')
        else:
            print("\nThere were -1 in the first row!\n")
    # Backward fill replaces NaN cells with the next data row
    elif fmode == 'backward fill':
        if RD.iloc[-1, 6:].isna().sum() == 0:
            RD = RD.fillna(method='bfill')
        else:
            print("\nThere were -1 in the last row!\n")
    RD = RD.dropna()
    # Splits RD into tvec and data
    tvec = RD.iloc[:, 0:6]
    data = RD.iloc[:, 6:]
    return (data, tvec)

data, tvec = load_measurements("test.csv", 'backward fill')
print(data)
print(tvec)
Out("forward fill"):
Year Month Day hour Minute Second Zone1 Zone2 Zone3 Zone4
0 2006 1 15 4 0 0 -1.0 2.0 3.0 4.0
1 2006 2 11 6 1 0 5.0 6.0 1.0 8.0
2 2006 4 21 8 2 0 3.0 -1.0 -1.0 6.0
3 2006 7 14 9 3 0 2.0 3.0 4.0 5.0
4 2006 10 2 9 4 0 3.0 2.0 5.0 -1.0
There were -1 in the first row!
Zone1 Zone2 Zone3 Zone4
1 5.0 6.0 1.0 8.0
3 2.0 3.0 4.0 5.0
Year Month Day hour Minute Second
1 2006 2 11 6 1 0
3 2006 7 14 9 3 0
Out("drop"):
Year Month Day hour Minute Second Zone1 Zone2 Zone3 Zone4
0 2006 1 15 4 0 0 -1.0 2.0 3.0 4.0
1 2006 2 11 6 1 0 5.0 6.0 1.0 8.0
2 2006 4 21 8 2 0 3.0 -1.0 -1.0 6.0
3 2006 7 14 9 3 0 2.0 3.0 4.0 5.0
4 2006 10 2 9 4 0 3.0 2.0 5.0 -1.0
Zone1 Zone2 Zone3 Zone4
1 5.0 6.0 1.0 8.0
3 2.0 3.0 4.0 5.0
Year Month Day hour Minute Second
1 2006 2 11 6 1 0
3 2006 7 14 9 3 0
Out("backward fill"):
Year Month Day hour Minute Second Zone1 Zone2 Zone3 Zone4
0 2006 1 15 4 0 0 -1.0 2.0 3.0 4.0
1 2006 2 11 6 1 0 5.0 6.0 1.0 8.0
2 2006 4 21 8 2 0 3.0 -1.0 -1.0 6.0
3 2006 7 14 9 3 0 2.0 3.0 4.0 5.0
4 2006 10 2 9 4 0 3.0 2.0 5.0 -1.0
There were -1 in the last row!
Zone1 Zone2 Zone3 Zone4
1 5.0 6.0 1.0 8.0
3 2.0 3.0 4.0 5.0
Year Month Day hour Minute Second
1 2006 2 11 6 1 0
3 2006 7 14 9 3 0
I have two dataframes of different size:
df1 = pd.DataFrame({'A':[1,2,None,4,None,6,7,8,None,10], 'B':[11,12,13,14,15,16,17,18,19,20]})
df1
A B
0 1.0 11
1 2.0 12
2 NaN 13
3 4.0 14
4 NaN 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
df2 = pd.DataFrame({'A':[2,3,4,5,6,8], 'B':[12,13,14,15,16,18]})
df2['A'] = df2['A'].astype(float)
df2
A B
0 2.0 12
1 3.0 13
2 4.0 14
3 5.0 15
4 6.0 16
5 8.0 18
I need to fill the missing values (and only them) in column A of the first dataframe with values from the second dataframe, joined on the common key in column B. It is equivalent to this SQL query:
UPDATE df1 JOIN df2
ON df1.B = df2.B
SET df1.A = df2.A WHERE df1.A IS NULL;
I tried answers to similar questions from this site, but they do not work as I need:
df1.fillna(df2)
A B
0 1.0 11
1 2.0 12
2 4.0 13
3 4.0 14
4 6.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
df1.combine_first(df2)
A B
0 1.0 11
1 2.0 12
2 4.0 13
3 4.0 14
4 6.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
Intended output is:
A B
0 1.0 11
1 2.0 12
2 3.0 13
3 4.0 14
4 5.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
How do I get this result?
You were right about using combine_first(), except that both dataframes must share the same index, and the index must be the column B:
df1.set_index('B').combine_first(df2.set_index('B')).reset_index()
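An alternative that fills only the missing values in place, using B as a lookup key via Series.map (assuming B is unique in df2):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, None, 4, None, 6, 7, 8, None, 10],
                    'B': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]})
df2 = pd.DataFrame({'A': [2.0, 3.0, 4.0, 5.0, 6.0, 8.0],
                    'B': [12, 13, 14, 15, 16, 18]})

# Look up A in df2 keyed by B, and use it only where df1.A is missing
df1['A'] = df1['A'].fillna(df1['B'].map(df2.set_index('B')['A']))
print(df1)
```

Keys with no match in df2 (like B=19 here) simply stay NaN.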
I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
time Group blocks
0 1 A 4
1 2 A 7
2 3 A 12
3 4 A 17
4 5 A 21
5 6 A 26
6 7 A 33
7 8 A 39
8 9 A 48
9 10 A 59
.... .... ....
36 35 A 231
37 1 B 1
38 2 B 1.5
39 3 B 3
40 4 B 5
41 5 B 6
.... .... ....
911 35 Z 349
This is a dataframe with multiple time-series-like groups, from min=1 to max=35. Each Group has data in the range time=1 to time=35.
I would like to segment this dataframe into columns Group A, Group B, Group C, etc.
How does one "unconcatenate" this dataframe?
Is that what you want?
In [84]: df.pivot_table(index='time', columns='Group')
Out[84]:
blocks
Group A B
time
1 4.0 1.0
2 7.0 1.5
3 12.0 3.0
4 17.0 5.0
5 21.0 6.0
6 26.0 NaN
7 33.0 NaN
8 39.0 NaN
9 48.0 NaN
10 59.0 NaN
35 231.0 NaN
data:
In [86]: df
Out[86]:
time Group blocks
0 1 A 4.0
1 2 A 7.0
2 3 A 12.0
3 4 A 17.0
4 5 A 21.0
5 6 A 26.0
6 7 A 33.0
7 8 A 39.0
8 9 A 48.0
9 10 A 59.0
36 35 A 231.0
37 1 B 1.0
38 2 B 1.5
39 3 B 3.0
40 4 B 5.0
41 5 B 6.0
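A small follow-up: the pivot_table result has a (value, Group) column MultiIndex; if you want plain per-group columns, you can select the 'blocks' level or flatten the columns. A sketch with minimal sample data:

```python
import pandas as pd

df = pd.DataFrame({'time': [1, 2, 3, 1, 2],
                   'Group': ['A', 'A', 'A', 'B', 'B'],
                   'blocks': [4.0, 7.0, 12.0, 1.0, 1.5]})

pivoted = df.pivot_table(index='time', columns='Group')
# Keep just the 'blocks' level -> columns are simply 'A', 'B'
per_group = pivoted['blocks']
# Or flatten the MultiIndex into names like 'blocks_A', 'blocks_B'
pivoted.columns = [f"{val}_{grp}" for val, grp in pivoted.columns]
print(per_group)
print(pivoted)
```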