If I have a simple pandas DataFrame like this:
frame = pd.DataFrame(np.arange(1, 13).reshape((3, 4)), columns=list('abcd'), index=list('123'))
I want to find the max value in each row, use it to locate the value in the same column but the next row, and put that value in a new column.
So the above DataFrame looks like this (with d2 changed to 3):
   a   b   c   d
1  1   2   3   4
2  5   6   7   3
3  9  10  11  12
So, conceptually: the first row is scanned, 4 is identified as the largest number, and 3 is found as the number in the same column but at the next index. Similarly, for row 2, 7 is the largest number and 11 is the next number in that column. So 3 and 11 should be added to a new column like this:
   a   b   c   d  Next
1  1   2   3   4   NaN
2  5   6   7   3     3
3  9  10  11  12    11
I started by making a function like this, but it only finds the max values:
f = lambda x: x.max()
row_max = frame.apply(f, axis='columns')
frame['Next'] = row_max
Based on your edit, you can use np.argmax:
i = np.arange(len(df))
j = np.argmax(df.values, axis=1)        # column position of each row's max
df['next'] = df.shift(-1).values[i, j]  # value one row below, in that column
   a   b   c   d  next
1  1   2   3   4   3.0
2  5   6   7   3  11.0
3  9  10  11  12   NaN
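Putting the answer together as a runnable sketch (the frame construction mirrors the question, with d2 set to 3 as described):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(1, 13).reshape((3, 4)),
                     columns=list('abcd'), index=list('123'))
frame.loc['2', 'd'] = 3  # the "d2 changed to 3" edit from the question

i = np.arange(len(frame))
j = np.argmax(frame.values, axis=1)           # column position of each row's max
frame['next'] = frame.shift(-1).values[i, j]  # value one row below, same column
print(frame)
```

Row 1's max is in column d, so `next` picks up 3 from row 2; row 2's max is in column c, so it picks up 11 from row 3; the last row has no row below it, hence NaN.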
I need to fill missing values in a few columns with each column's mean. I wrote this to do it one by one:
df2['A'].fillna(df1['A'].mean(), inplace=True)
df2['B'].fillna(df1['B'].mean(), inplace=True)
df2['C'].fillna(df1['C'].mean(), inplace=True)
Any other ways I can fill them all in one line of code?
You can do it with a single instruction:
cols = ['A', 'B', 'C']
df[cols] = df[cols].fillna(df[cols].mean())
Or, to apply it to all numeric columns, use select_dtypes:
cols = df.select_dtypes('number').columns
df[cols] = df[cols].fillna(df[cols].mean())
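For instance, with two small hypothetical frames (df1 supplies the means, df2 has the holes, as in the question), the one-liner behaves like the three separate fillna calls:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 3.0], 'B': [2.0, 4.0], 'C': [5.0, 7.0]})
df2 = pd.DataFrame({'A': [np.nan, 1.0], 'B': [2.0, np.nan], 'C': [np.nan, 6.0]})

cols = ['A', 'B', 'C']
# fill df2's NaNs with column means computed from df1 (A=2, B=3, C=6)
df2[cols] = df2[cols].fillna(df1[cols].mean())
print(df2)
```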
Note: I strongly discourage using the inplace parameter; it is likely to disappear in a future version of pandas.
[df2[c].fillna(df1[c].mean(), inplace=True) for c in df2.columns]
There are a few options for working with NaNs in a DataFrame. I'll explain some of them...
Given this example df:
   A    B    C
0  1    5    10
1  2    NaN  11
2  NaN  NaN  12
3  4    8    NaN
4  NaN  9    14
Example 1: fill all columns with mean
df = df.fillna(df.mean())
Result:
   A        B        C
0  1        5        10
1  2        7.33333  11
2  2.33333  7.33333  12
3  4        8        11.75
4  2.33333  9        14
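As a quick check, here is a sketch that builds the example df and fills it with the column means:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                   'B': [5, np.nan, np.nan, 8, 9],
                   'C': [10, 11, 12, np.nan, 14]})
# column means: A = 7/3 ≈ 2.33333, B = 22/3 ≈ 7.33333, C = 47/4 = 11.75
filled = df.fillna(df.mean())
print(filled)
```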
Example 2: fill some columns with median
df[["A","B"]] = df[["A","B"]].fillna(df.median())
Result:
   A  B  C
0  1  5  10
1  2  8  11
2  2  8  12
3  4  8  NaN
4  2  9  14
Example 3: fill all columns using ffill()
Explanation: Missing values are replaced with the most recent available value in the same column. So, the value of the preceding row in the same column is used to fill in the blanks.
df = df.ffill()  # fillna(method='ffill') is deprecated in recent pandas
Result:
   A  B  C
0  1  5  10
1  2  5  11
2  2  5  12
3  4  8  12
4  4  9  14
Example 4: fill all columns using bfill()
Explanation: Missing values are filled with the next available value below them in the same column, meaning the values are filled from the bottom to the top. Basically, you're replacing each missing value with the next known non-missing value.
df = df.bfill()  # fillna(method='bfill') is deprecated in recent pandas
Result:
   A    B  C
0  1    5  10
1  2    8  11
2  4    8  12
3  4    8  14
4  NaN  9  14
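The direction of the two fills is easiest to see on a single Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.ffill().tolist())  # previous value propagates downwards
print(s.bfill().tolist())  # next value propagates upwards
```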
If you want to DROP the missing values instead of filling them, you can do this:
Option 1: remove rows with one or more missing values
df = df.dropna(how="any")
Result:
   A  B  C
0  1  5  10
Option 2: remove rows with all missing values
df = df.dropna(how="all")
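A minimal sketch of the two dropna options on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan], 'B': [2, 3, np.nan]})
any_dropped = df.dropna(how='any')  # keeps only the fully populated row 0
all_dropped = df.dropna(how='all')  # drops only row 2, which is all NaN
```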
The goal is to put the highest digit seen so far in a new column, increasing row by row within a given group of letters. The expected correct values were entered by me manually in the column "col_ok". The only thing I have achieved so far is assigning the highest value of each group to every row of that group, and this result is in the fourth column, called "cumulatively". However, for the first row in group "A" this is not correct, because according to the assumptions described, the correct value is "1". The same goes for the values in the second and third rows; only the value in the fourth row is right, and the value in the fifth row is wrong again. Forgive the inconsistency of my post; I'm not an IT specialist and English is not my first language. Thanks in advance for your support.
df = pd.read_csv('C:/Users/.../a.csv', names=['group_letter', 'digit', 'col_ok'],
                 index_col=0)
df = df.assign(cumulatively = df.groupby('group_letter')['col_ok'].transform('max'))
print(df)
group_letter digit col_ok cumulatively
A 1 1 5
A 3 3 5
A 2 2 5
A 5 5 5
A 1 5 5
B 1 1 3
B 2 2 3
B 1 2 3
B 1 2 3
B 3 3 3
C 5 5 6
C 6 6 6
C 1 6 6
C 2 6 6
C 3 6 6
D 4 4 7
D 3 4 7
D 2 4 7
D 5 5 7
D 7 7 7
IIUC use:
df = df.assign(cumulatively = df.groupby('group_letter')['col_ok'].cummax())
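A self-contained sketch of the idea, computing the running maximum within each group (here applied to the digit column, since col_ok is the hand-made target; the result is the group-wise cumulative maximum):

```python
import pandas as pd

df = pd.DataFrame({'group_letter': list('AAAAA') + list('BBBBB'),
                   'digit': [1, 3, 2, 5, 1, 1, 2, 1, 1, 3]})
# running maximum of digit, restarting for each group of letters
df['cumulatively'] = df.groupby('group_letter')['digit'].cummax()
print(df)
```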
I have this pandas Dataframe :
A B C
20 6 7
5 3.8 9
34 4 1
I want to split a row into several rows when its value in A is greater than 10: repeat the row with A set to 10 for each full ten, and keep the remainder in one more row. So the DataFrame should finally look like this:
A B C
10 6 7
10 6 7
5 3.8 9
10 4 1
10 4 1
10 4 1
4 4 1
Is there a way to do this elegantly in pandas, or will I have to loop over the rows and do it manually?
I have already browsed similar queries on StackOverflow, but none of them does exactly what I want.
Use:
# create a default index
df = df.reset_index(drop=True)
# floor division and modulo of A by 10
a = df['A'] // 10
b = df['A'] % 10
# keep one copy of each row whose remainder is not 0
df2 = df.loc[df.index.repeat(b.ne(0).astype(int))].copy()
# replace the values of A with the remainders, mapped by index
df2['A'] = df2.index.map(b.get)
# repeat each row floor(A / 10) times and assign the scalar 10 to A
df1 = df.loc[df.index.repeat(a)].assign(A=10)
# join together, sort the index and create a default RangeIndex
# (DataFrame.append was removed in pandas 2, so use pd.concat)
df = pd.concat([df1, df2]).sort_index().reset_index(drop=True)
print(df)
A B C
0 10 6.0 7
1 10 6.0 7
2 5 3.8 9
3 10 4.0 1
4 10 4.0 1
5 10 4.0 1
6 4 4.0 1
I need to remove rows from a pandas.DataFrame that satisfy an unusual condition: if there is another row that is exactly the same except that it has a NaN value in column "C", I want to remove the NaN row.
Given a table:
A B C D
1 2 NaN 3
1 2 50 3
10 20 NaN 30
5 6 7 8
I need to remove the first row, since it has NaN in column C but there is an otherwise identical row (the second) with a real value in column C.
However, the third row must stay, because no other row has the same A, B and D values.
How do I do this with pandas? Thank you!
You can achieve this using drop_duplicates.
Initial DataFrame:
df=pd.DataFrame(columns=['a','b','c','d'], data=[[1,2,None,3],[1,2,50,3],[10,20,None,30],[5,6,7,8]])
df
a b c d
0 1 2 NaN 3
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
Then you can sort the DataFrame by column c; this pushes the NaNs to the bottom of the column:
df = df.sort_values(['c'])
df
a b c d
3 5 6 7 8
1 1 2 50 3
0 1 2 NaN 3
2 10 20 NaN 30
Then remove the duplicates, taking every column except c into account and keeping the first row encountered:
df1 = df.drop_duplicates(['a','b','d'], keep='first')
a b c d
3 5 6 7 8
1 1 2 50 3
2 10 20 NaN 30
But this is only valid if the NaNs are confined to column c.
You can try fillna along with drop_duplicates, using the filled frame only to decide which rows to keep and indexing back into the original:
df.loc[df.bfill().ffill().drop_duplicates(subset=['A', 'B', 'D'], keep='last').index]
This also handles the scenario where the A, B and D values are the same but both rows have non-NaN values in C.
You get
    A   B    C   D
1   1   2   50   3
2  10  20  NaN  30
3   5   6    7   8
This feels right to me
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
notnans = df.C.notnull()
df[notdups | notnans]
A B C D
1 1 2 50.0 3
2 10 20 NaN 30
3 5 6 7.0 8
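A runnable version of this last approach on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 10, 5], 'B': [2, 2, 20, 6],
                   'C': [np.nan, 50, np.nan, 7], 'D': [3, 3, 30, 8]})
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)  # no twin on A/B/D
notnans = df['C'].notna()                                           # real value in C
out = df[notdups | notnans]  # drops only row 0: a duplicate whose C is NaN
print(out)
```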
I'm facing a problem with applying a function to a DataFrame (to model a solar collector based on annual hourly weather data).
Suppose I have the following (simplified) DataFrame:
df2:
A B C
0 11 13 5
1 6 7 4
2 8 3 6
3 4 8 7
4 0 1 7
Now I have defined a function that takes each row as input to create a new column called D, but I want the function to also take the last calculated value of D as input (except, of course, for the first row, where no previous value of D exists).
def Funct(x):
    D = x['A'] + x['B'] + x['C'] + (x - 1)['D']  # (x - 1) is not valid; it stands for 'the previous row'
I know that the function above is not working, but it gives an idea of what I want.
So to summarise:
Create a function that creates a new column in the dataframe and takes the value of the new column one row above it as input
Can somebody help me?
Thanks in advance.
It sounds like you are calculating a cumulative sum. In that case, use cumsum:
In [45]: df['D'] = (df['A']+df['B']+df['C']).cumsum()
In [46]: df
Out[46]:
A B C D
0 11 13 5 29
1 6 7 4 46
2 8 3 6 63
3 4 8 7 82
4 0 1 7 90
[5 rows x 4 columns]
Are you looking for this?
You can use shift to align the previous row with current row and then you can do your operation.
In [7]: df
Out[7]:
a b
1 1 1
2 2 2
3 3 3
4 4 4
[4 rows x 2 columns]
In [8]: df['c'] = df['b'].shift(1)  # first row will be NaN
In [9]: df
Out[9]:
a b c
1 1 1 NaN
2 2 2 1
3 3 3 2
4 4 4 3
[4 rows x 3 columns]
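If the dependence on the previous row's D were more complicated than a plain sum (so cumsum no longer applies), a simple loop over the rows is a reasonable fallback. This is a sketch, assuming the first row starts from a previous value of 0:

```python
import pandas as pd

df = pd.DataFrame({'A': [11, 6, 8, 4, 0],
                   'B': [13, 7, 3, 8, 1],
                   'C': [5, 4, 6, 7, 7]})

d, prev = [], 0  # assumed starting value for the first row
for row in df.itertuples(index=False):
    prev = row.A + row.B + row.C + prev  # replace with any recurrence in prev
    d.append(prev)
df['D'] = d
print(df)
```

For this additive recurrence the result matches the cumsum answer above; the loop only earns its keep when the recurrence is non-linear.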