Create two columns (interval) from a column in pandas - python

I have a df with a column
Value
3
8
10
15
I would like to obtain a dataframe with 'interpolated' From and To values, as follows:
Value  From  To
  NaN     0   3
    3     3   5
  NaN     5   8
    8     8  10
   10    10  12
  NaN    12  15
   15    15  17
The increment is always 2 when a value exists.

I found a solution
This is the starting dataframe:
df = pd.DataFrame({'Value': [0, 2, 5, 7, 9, 14, 21, 25]})
df['From'] = df['Value']
df['To'] = df['Value'] + 2
>>> print(df)
   Value  From  To
0      0     0   2
1      2     2   4
2      5     5   7
3      7     7   9
4      9     9  11
5     14    14  16
6     21    21  23
7     25    25  27
And this code builds the From and To columns with no empty intervals:
import numpy as np
import pandas as pd

new_rows = []
for i in range(len(df) - 1):
    if df.loc[i, 'To'] != df.loc[i + 1, 'Value']:
        # a gap: fill it with a NaN-valued interval from this To to the next Value
        new_row = {'Value': np.nan, 'From': df.loc[i, 'To'], 'To': df.loc[i + 1, 'Value']}
    else:
        new_row = {'Value': np.nan, 'From': np.nan, 'To': np.nan}
    new_rows.append(new_row)

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df1 = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
df1.dropna(how='all', inplace=True)
df1.sort_values(by=['To'], inplace=True)
>>> df1
    Value  From    To
0     0.0   0.0   2.0
1     2.0   2.0   4.0
9     NaN   4.0   5.0
2     5.0   5.0   7.0
3     7.0   7.0   9.0
4     9.0   9.0  11.0
12    NaN  11.0  14.0
5    14.0  14.0  16.0
13    NaN  16.0  21.0
6    21.0  21.0  23.0
14    NaN  23.0  25.0
7    25.0  25.0  27.0
It could probably be improved.
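As one possible improvement, here is a vectorized sketch of the same idea that builds the gap rows with shift instead of a Python loop (a sketch, assuming the same starting df as above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [0, 2, 5, 7, 9, 14, 21, 25]})
df['From'] = df['Value']
df['To'] = df['Value'] + 2

# a gap exists wherever a row's To does not reach the next row's Value
nxt = df['Value'].shift(-1)
gaps = df.loc[df['To'].ne(nxt) & nxt.notna(), ['To']].rename(columns={'To': 'From'})
gaps['To'] = nxt[gaps.index].values
gaps['Value'] = np.nan

df1 = (pd.concat([df, gaps], ignore_index=True)
         .sort_values('To')
         .reset_index(drop=True))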

Related

Get surface weighted average of multiple columns in pandas dataframe

I want to take the surface-weighted average of the columns in my dataframe. I have two surface columns and two U-value columns. I want to create an extra column 'U_av' (surface-weighted average U-value), where U_av = (A1*U1 + A2*U2) / (A1+A2). If NaN occurs in one of the columns, NaN should be returned.
Initial df:
   ID  A1   A2    U1  U2
0  14   2  1.0  10.0  11
1  16   2  2.0  12.0  12
2  18   2  3.0  24.0  13
3  20   2  NaN   8.0  14
4  22   4  5.0  84.0  15
5  24   4  6.0  84.0  16
Desired output:
   ID  A1   A2    U1  U2   U_av
0  14   2  1.0  10.0  11  10.33
1  16   2  2.0  12.0  12     12
2  18   2  3.0  24.0  13   17.4
3  20   2  NaN   8.0  14    NaN
4  22   4  5.0  84.0  15  45.66
5  24   4  6.0  84.0  16   43.2
Code:
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [14, 16, 18, 20, 22, 24],
                   "A1": [2, 2, 2, 2, 4, 4],
                   "U1": [10, 12, 24, 8, 84, 84],
                   "A2": [1, 2, 3, np.nan, 5, 6],
                   "U2": [11, 12, 13, 14, 15, 16]})

print(df)
# the mean of two columns U1 and U2, dropping NaN, is easy ((U1+U2)/2 in this case)
# but what to do for the surface-weighted mean (U_av = (A1*U1 + A2*U2) / (A1+A2))?
df.loc[:, 'Umean'] = df[['U1', 'U2']].dropna().mean(axis=1)
EDIT:
adding to the solutions below:
df["U_av"] = (df.A1.mul(df.U1) + df.A2.mul(df.U2)).div(df[['A1','A2']].sum(axis=1))

Hope I got you right:
df['U_av'] = (df['A1']*df['U1'] + df['A2']*df['U2']) / (df['A1']+df['A2'])
df
   ID  A1  U1   A2  U2       U_av
0  14   2  10  1.0  11  10.333333
1  16   2  12  2.0  12  12.000000
2  18   2  24  3.0  13  17.400000
3  20   2   8  NaN  14        NaN
4  22   4  84  5.0  15  45.666667
5  24   4  84  6.0  16  43.200000
Try this code:
numerator = df.A1.mul(df.U1) + (df.A2.mul(df.U2))
denominator = df.A1.add(df.A2)
df["U_av"] = numerator.div(denominator)
df
   ID  A1   A2    U1  U2       U_av
0  14   2  1.0  10.0  11  10.333333
1  16   2  2.0  12.0  12  12.000000
2  18   2  3.0  24.0  13  17.400000
3  20   2  NaN   8.0  14        NaN
4  22   4  5.0  84.0  15  45.666667
5  24   4  6.0  84.0  16  43.200000
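If more than two (A, U) pairs are involved, the same computation generalizes; a minimal sketch, where the column lists are illustrative and the df above is assumed:

import numpy as np

a_cols = ['A1', 'A2']   # surface columns; extend with A3 etc. as needed
u_cols = ['U1', 'U2']   # U-value columns

weights = df[a_cols].to_numpy()
values = df[u_cols].to_numpy()
# NaN in any weight or value propagates to NaN, as required
df['U_av'] = (weights * values).sum(axis=1) / weights.sum(axis=1)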

How to reset cumprod when NAs are in a pandas column

I have 2 columns in a dataframe and I want to calculate the cumprod for both, but the cumprod needs to restart once it sees an NA in the cell.
I have tried using cumprod straightforwardly, but it's not getting me the correct values because the cumprod is continuous and does not restart when the NA shows up.
Here is an example df:
index  col1  col2
0         2     4
1         6     4
2         1    na
3         2     7
4        na     6
5        na     8
6         5    na
7         8     9
8         3     2
Here is my desired output:
index  col1  col2
0         2     4
1        12    16
2        12    na
3        24     7
4        na    42
5        na   336
6         5    na
7        40     9
8       120    18
Here is a solution that operates on each column and concats back together, since the masks are different for each column.
pd.concat(
    [df[col].groupby(df[col].isnull().cumsum()).cumprod() for col in df.columns],
    axis=1)
    col1   col2
0    2.0    4.0
1   12.0   16.0
2   12.0    NaN
3   24.0    7.0
4    NaN   42.0
5    NaN  336.0
6    5.0    NaN
7   40.0    9.0
8  120.0   18.0
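For intuition, the grouping key isna().cumsum() is a counter that increments at every NA, so each stretch between NAs forms its own group. A minimal sketch on col1 from the example:

import numpy as np
import pandas as pd

s = pd.Series([2, 6, 1, 2, np.nan, np.nan, 5, 8, 3])   # col1 from the example
key = s.isna().cumsum()
print(key.tolist())                       # [0, 0, 0, 0, 1, 2, 2, 2, 2]
print(s.groupby(key).cumprod().tolist())  # [2.0, 12.0, 12.0, 24.0, nan, nan, 5.0, 40.0, 120.0]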
A slightly more efficient approach is to calculate the grouper mask all at once and use zip
m = df.isnull().cumsum()
pd.concat(
    [df[col].groupby(mask).cumprod() for col, mask in zip(df.columns, m.values.T)],
    axis=1)
Here's a similar solution with dict comprehension and the default constructor
pd.DataFrame({c: df[c].groupby(df[c].isna().cumsum()).cumprod() for c in df.columns})
    col1   col2
0    2.0    4.0
1   12.0   16.0
2   12.0    NaN
3   24.0    7.0
4    NaN   42.0
5    NaN  336.0
6    5.0    NaN
7   40.0    9.0
8  120.0   18.0
You can use groupby with isna and cumsum to get groups to cumprod over in each column, using apply:
df.apply(lambda x: x.groupby(x.isna().cumsum()).cumprod())
Output:
        col1   col2
index
0        2.0    4.0
1       12.0   16.0
2       12.0    NaN
3       24.0    7.0
4        NaN   42.0
5        NaN  336.0
6        5.0    NaN
7       40.0    9.0
8      120.0   18.0
Here is a solution without operating column by column:
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 4], [6, 4], [1, np.nan], [2, 7], [np.nan, 6],
                   [np.nan, 8], [5, np.nan], [8, 9], [3, 2]],
                  columns=['col1', 'col2'])
df_cumprod = df.cumprod()
# at each NaN, capture the running product accumulated so far and carry it
# forward; dividing by it restarts the product after every NaN
adjust_factor = df_cumprod.ffill().where(df_cumprod.isnull()).ffill().fillna(1)
print(df_cumprod / adjust_factor)
    col1   col2
0    2.0    4.0
1   12.0   16.0
2   12.0    NaN
3   24.0    7.0
4    NaN   42.0
5    NaN  336.0
6    5.0    NaN
7   40.0    9.0
8  120.0   18.0

Join dataframes by key - repeated data as new columns

I'm facing the following situation. I have two dataframes, let's say df1 and df2, and I need to join them by a key (ID_ed, ID). The second dataframe may have more than one occurrence of the key, and what I need is to join the two dataframes and add the repeated occurrences of the key as new columns (as shown in the image).
I tried merge = df2.join(df1, lsuffix='_ZID', rsuffix='_IID', how='left') and concat operations, but no luck so far. It seems that only the last occurrence is preserved (as if it was overwriting the data).
Any help with this is really appreciated, and thanks in advance.
Another approach is to create a serial counter for the ID_ed column, then either set_index and unstack, or call pivot_table with 'first' as the aggregation. This is fairly similar to this SO answer.
Generate the data
import pandas as pd
import numpy as np
a = [['ID_ed', 'color'], [1, 5], [2, 8], [3, 7]]
b = [['ID', 'code'], [1, 1], [1, 5],
     [2, np.nan], [2, 20], [2, 74],
     [3, 10], [3, 98], [3, 85],
     [3, 21], [3, 45]]
df1 = pd.DataFrame(a[1:], columns=a[0])
df2 = pd.DataFrame(b[1:], columns=b[0])
print(df1)
   ID_ed  color
0      1      5
1      2      8
2      3      7
print(df2)
   ID  code
0   1   1.0
1   1   5.0
2   2   NaN
3   2  20.0
4   2  74.0
5   3  10.0
6   3  98.0
7   3  85.0
8   3  21.0
9   3  45.0
First the merge and unstack
# Merge and add a serial counter column
df = df1.merge(df2, how='inner', left_on='ID_ed', right_on='ID')
df['counter'] = df.groupby('ID_ed').cumcount()+1
print(df)
   ID_ed  color  ID  code  counter
0      1      5   1   1.0        1
1      1      5   1   5.0        2
2      2      8   2   NaN        1
3      2      8   2  20.0        2
4      2      8   2  74.0        3
5      3      7   3  10.0        1
6      3      7   3  98.0        2
7      3      7   3  85.0        3
8      3      7   3  21.0        4
9      3      7   3  45.0        5
# Set index and unstack
dfu = (df.set_index(['ID_ed', 'color', 'counter'])
         .unstack()
         .swaplevel(1, 0, axis=1)
         .sort_index(level=0, axis=1)
         .add_prefix('counter_'))
print(dfu)
counter        counter_1               counter_2              \
              counter_ID counter_code  counter_ID counter_code
ID_ed color
1     5              1.0          1.0         1.0          5.0
2     8              2.0          NaN         2.0         20.0
3     7              3.0         10.0         3.0         98.0

counter        counter_3               counter_4               counter_5
              counter_ID counter_code  counter_ID counter_code  counter_ID counter_code
ID_ed color
1     5              NaN          NaN         NaN          NaN         NaN          NaN
2     8              2.0         74.0         NaN          NaN         NaN          NaN
3     7              3.0         85.0         3.0         21.0         3.0         45.0
Next generate the pivot table
# Pivot table with 'first' aggregation
dfp = pd.pivot_table(df, index=['ID_ed', 'color'],
                     columns=['counter'],
                     values=['ID', 'code'],
                     aggfunc='first')
print(dfp)
              ID                       code
counter        1    2    3    4    5      1     2     3     4     5
ID_ed color
1     5      1.0  1.0  NaN  NaN  NaN    1.0   5.0   NaN   NaN   NaN
2     8      2.0  2.0  2.0  NaN  NaN    NaN  20.0  74.0   NaN   NaN
3     7      3.0  3.0  3.0  3.0  3.0   10.0  98.0  85.0  21.0  45.0
Finally rename the columns and slice by partial column name
# Rename columns
level_1_names = list(dfp.columns.get_level_values(1))
level_0_names = list(dfp.columns.get_level_values(0))
new_cnames = [b+'_'+str(f) for f, b in zip(level_1_names, level_0_names)]
dfp.columns = new_cnames
# Slice by new column names
print(dfp.loc[:, dfp.columns.str.contains('code')].reset_index(drop=False))
   ID_ed  color  code_1  code_2  code_3  code_4  code_5
0      1      5     1.0     5.0     NaN     NaN     NaN
1      2      8     NaN    20.0    74.0     NaN     NaN
2      3      7    10.0    98.0    85.0    21.0    45.0
I'd use cumcount and pivot_table:
In [11]: df1
Out[11]:
   ID  color
0   1      5
1   2      8
2   3      7
In [12]: df2
Out[12]:
   ID  code
0   1   1.0
1   1   5.0
2   2   NaN
3   2  20.0
4   2  74.0
In [13]: res = df1.merge(df2)  # merges on the shared column names by default
In [14]: res
Out[14]:
   ID  color  code
0   1      5   1.0
1   1      5   5.0
2   2      8   NaN
3   2      8  20.0
4   2      8  74.0
In [15]: res['count'] = res.groupby('ID').cumcount()
In [16]: res.pivot_table('code', ['ID', 'color'], 'count')
Out[16]:
count 0 1 2
ID color
1 5 1.0 5.0 NaN
2 8 NaN 20.0 74.0
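If flat column names are wanted from that pivot table, one possible follow-up (the code_N naming is illustrative):

out = res.pivot_table('code', ['ID', 'color'], 'count')
out.columns = [f'code_{c + 1}' for c in out.columns]  # count starts at 0
print(out.reset_index())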

How to sum each row and replace each value in row with sum?

If I have a data frame like this:
   1  2  3    4  5
1  2  4  5  NaN  3
2  3  5  6    1  2
3  3  1  1    1  1
How do I sum each row and replace the values in that row with the sum so I get something like this:
    1   2   3    4   5
1  14  14  14  NaN  14
2  17  17  17   17  17
3   7   7   7    7   7
Use mask to replace all non-missing values with the row sum:
df = df.mask(df.notnull(), df.sum(axis=1), axis=0)
print (df)
    1   2   3     4   5
1  14  14  14   NaN  14
2  17  17  17  17.0  17
3   7   7   7   7.0   7
Or use numpy.broadcast_to with numpy.where:
arr = df.values
a = np.broadcast_to(np.nansum(arr, axis=1)[:, None], df.shape)
df = pd.DataFrame(np.where(np.isnan(arr), np.nan, a),
                  index=df.index, columns=df.columns)
# alternative
df[:] = np.where(np.isnan(arr), np.nan, a)
print (df)
      1     2     3     4     5
1  14.0  14.0  14.0   NaN  14.0
2  17.0  17.0  17.0  17.0  17.0
3   7.0   7.0   7.0   7.0   7.0
Using mul:
df.notnull().replace(False, np.nan).mul(df.sum(1), axis=0).astype(float)
      1     2     3     4     5
1  14.0  14.0  14.0   NaN  14.0
2  17.0  17.0  17.0  17.0  17.0
3   7.0   7.0   7.0   7.0   7.0
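The same idea also works with where, the inverse of mask (keep values where the condition holds, replace elsewhere). A minimal sketch on the example frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'1': [2, 3, 3], '2': [4, 5, 1], '3': [5, 6, 1],
                   '4': [np.nan, 1, 1], '5': [3, 2, 1]}, index=[1, 2, 3])
# keep the NaNs, overwrite everything else with the row sum
print(df.where(df.isna(), df.sum(axis=1), axis=0))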

Pandas: Fill missing dataframe values from other dataframe

I have two dataframes of different size:
df1 = pd.DataFrame({'A':[1,2,None,4,None,6,7,8,None,10], 'B':[11,12,13,14,15,16,17,18,19,20]})
df1
      A   B
0   1.0  11
1   2.0  12
2   NaN  13
3   4.0  14
4   NaN  15
5   6.0  16
6   7.0  17
7   8.0  18
8   NaN  19
9  10.0  20
df2 = pd.DataFrame({'A':[2,3,4,5,6,8], 'B':[12,13,14,15,16,18]})
df2['A'] = df2['A'].astype(float)
df2
     A   B
0  2.0  12
1  3.0  13
2  4.0  14
3  5.0  15
4  6.0  16
5  8.0  18
I need to fill the missing values (and only them) in column A of the first dataframe with values from the second dataframe, matched on the common key in column B. It is equivalent to this SQL query:
UPDATE df1 JOIN df2
ON df1.B = df2.B
SET df1.A = df2.A WHERE df1.A IS NULL;
I tried to use answers to similar questions from this site, but they do not work as I need:
df1.fillna(df2)
      A   B
0   1.0  11
1   2.0  12
2   4.0  13
3   4.0  14
4   6.0  15
5   6.0  16
6   7.0  17
7   8.0  18
8   NaN  19
9  10.0  20
df1.combine_first(df2)
      A   B
0   1.0  11
1   2.0  12
2   4.0  13
3   4.0  14
4   6.0  15
5   6.0  16
6   7.0  17
7   8.0  18
8   NaN  19
9  10.0  20
Intended output is:
      A   B
0   1.0  11
1   2.0  12
2   3.0  13
3   4.0  14
4   5.0  15
5   6.0  16
6   7.0  17
7   8.0  18
8   NaN  19
9  10.0  20
How do I get this result?
You were right about using combine_first(), except that both dataframes must share the same index, and the index must be the column B:
df1.set_index('B').combine_first(df2.set_index('B')).reset_index()
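A quick check of that answer (reset_index puts B back as the first column, so the columns are reordered to match the intended output):

out = (df1.set_index('B')
          .combine_first(df2.set_index('B'))
          .reset_index()[['A', 'B']])
print(out)

This reproduces the intended output above, with row 8 (B = 19) left as NaN because 19 has no match in df2.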
