Maximum value in a given group, increasing row by row - python

The goal is to put, in a new column, the highest digit seen so far within each group of letters, increasing row by row. The expected, correct result was entered by me manually in the column "col_ok". The only thing I have achieved so far is assigning the highest value of the whole group to every row, and this result is in the fourth column, called "cumulatively". However, for the first row in group "A" this is not correct, because according to the rule described the correct value is 1; similarly for the values in the second and third rows. Only the value in the fourth row is correct, but the value in the fifth row is not. Forgive the inconsistency of my post; I'm not an IT specialist and I don't know English well. Thanks in advance for your support.
df = pd.read_csv('C:/Users/.../a.csv',names=['group_letter', 'digit', 'col_ok'] ,
index_col=0,)
df = df.assign(cumulatively = df.groupby('group_letter')['col_ok'].transform('max'))
print(df)
group_letter digit col_ok cumulatively
A 1 1 5
A 3 3 5
A 2 2 5
A 5 5 5
A 1 5 5
B 1 1 3
B 2 2 3
B 1 2 3
B 1 2 3
B 3 3 3
C 5 5 6
C 6 6 6
C 1 6 6
C 2 6 6
C 3 6 6
D 4 4 7
D 3 4 7
D 2 4 7
D 5 5 7
D 7 7 7

IIUC use:
df = df.assign(cumulatively = df.groupby('group_letter')['digit'].cummax())
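A quick check of the idea (a minimal sketch with the group "A" rows typed in by hand, since the original CSV is not available):

```python
import pandas as pd

# Group "A" rows reconstructed from the question's table
df = pd.DataFrame({
    'group_letter': ['A', 'A', 'A', 'A', 'A'],
    'digit':        [1, 3, 2, 5, 1],
})

# Running maximum of 'digit' within each group, row by row
df['cumulatively'] = df.groupby('group_letter')['digit'].cummax()
print(df['cumulatively'].tolist())  # [1, 3, 3, 5, 5]
```

Unlike transform('max'), which broadcasts the final group maximum to every row, cummax only looks at rows seen so far within the group.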

Related

How to create a new variable based on the fourth observation of a different variable

From this table
ID, Date, Value
A Jan01 5
A Feb01 10
A Mar03 9
A Apr02 7
A Jan01 2
B Feb01 3
B Mar01 6
B Mar01 9
B Mar02 5
Desired table:
ID, Date, Value, New_Variable
A Jan01 5 7
A Feb01 10 7
A Mar03 9 7
A Apr02 7 7
A Jan01 2 5
B Feb01 3 5
B Mar01 6 5
B Mar01 9 5
B Mar02 5 5
I know I can do
df.groupby('ID')['Value'].transform('first')
if I want to take the first value, what about the other rows? like the fourth or the fifth?
We can group the dataframe by ID, then transform the Value column with nth to select the nth value from each group.
df['new_col'] = df.groupby('ID')['Value'].transform('nth', n=3)
print(df)
ID Date Value new_col
0 A Jan01 5 7
1 A Feb01 10 7
2 A Mar03 9 7
3 A Apr02 7 7
4 A Jan01 2 7
5 B Feb01 3 5
6 B Mar01 6 5
7 B Mar01 9 5
8 B Mar02 5 5
Note: The n value is zero-based, so in order to select the 4th row you have to specify n=3.
One idea is to add a ranking column that shows each row's position within its group. For example:
df['rank'] = df.groupby('ID').cumcount()
This way you know which row holds the 4th place for each ID.
fourth_place = df[df['rank']==3]
so that you can create a mapping
mapping = fourth_place.set_index('ID')['Value']
which can be used in creating the new column
df['New_Variable'] = df['ID'].map(mapping)
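Putting the ranking idea together end to end (a minimal sketch with the sample data from the question typed in by hand):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':    ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Date':  ['Jan01', 'Feb01', 'Mar03', 'Apr02', 'Jan01',
              'Feb01', 'Mar01', 'Mar01', 'Mar02'],
    'Value': [5, 10, 9, 7, 2, 3, 6, 9, 5],
})

# Rank each row within its ID group (0-based)
df['rank'] = df.groupby('ID').cumcount()

# Take the 4th row (rank == 3) of each group and map its Value back by ID
mapping = df[df['rank'] == 3].set_index('ID')['Value']
df['New_Variable'] = df['ID'].map(mapping)
print(df['New_Variable'].tolist())  # [7, 7, 7, 7, 7, 5, 5, 5, 5]
```

Every A row receives 7 (the 4th A value) and every B row receives 5 (the 4th B value), regardless of how many rows each group has.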

Add all column values repeated of one data frame to other in pandas

Having two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to add all df2 column values to df1, repeated down every row, creating the following result. It is assumed that the two data frames do not share any column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the column names are strings, it is possible to use DataFrame.assign, unpacking the Series created by selecting the first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is to repeat the values along df1.index with DataFrame.reindex and then use DataFrame.join (this works here because the first index value of df2 matches the first value of df1.index):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If there are no missing values in the original df, it is possible to forward-fill the missing values in a last step, but note the joined columns are changed to floats, thanks @Dishin H Goyan:
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0
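The assign-based variant keeps the integer dtypes, which can be verified with the two frames from the question (a minimal sketch):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'c': [7], 'd': [8]})

# Unpack the single row of df2 as keyword arguments to assign;
# each scalar value is broadcast down every row of df1
df = df1.assign(**df2.iloc[0])
print(df)
```

Because scalars are broadcast rather than aligned and forward-filled, no NaN is ever introduced and the c/d columns stay integers.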

Assign the frequency of each value to dataframe with new column

I am trying to set up a DataFrame that contains a column called Frequency.
For every row, this column should show how often that row's value appears in a specific column of the dataframe. Something like this:
Index Category Frequency
0 1 1
1 3 2
2 3 2
3 4 1
4 7 3
5 7 3
6 7 3
7 8 1
This is just an example
I already tried it with value_counts(); however, I only receive a value in the last row where each number appears.
In the case of the example
Index Category Frequency
0 1 1
1 3 N.A
2 3 2
3 4 1
4 7 N.A
5 7 N.A
6 7 3
7 8 1
It is very important that the column has the same number of rows as the dataframe, preferably appended to the same dataframe.
df['Frequency'] = df.groupby('Category')['Category'].transform('count')
Use pandas.Series.map:
df['Frequency'] = df['Category'].map(df['Category'].value_counts())
or pandas.Series.replace:
df['Frequency'] = df['Category'].replace(df['Category'].value_counts())
Output:
Index Category Frequency
0 0 1 1
1 1 3 2
2 2 3 2
3 3 4 1
4 4 7 3
5 5 7 3
6 6 7 3
7 7 8 1
Details
df['Category'].value_counts()
7 3
3 2
4 1
1 1
8 1
Name: Category, dtype: int64
Using value_counts you get a Series whose index holds the elements of Category and whose values are the counts. So you can use map or pandas.Series.replace to create a Series with the category values replaced by their counts, and finally assign this Series to the Frequency column.
You can also do it using groupby, like below:
df.groupby("Category") \
.apply(lambda g: g.assign(frequency = len(g))) \
.reset_index(level=0, drop=True)
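The map-based and groupby-based approaches give the same result, which can be checked with the Category values from the question (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({'Category': [1, 3, 3, 4, 7, 7, 7, 8]})

# value_counts gives one count per category; map broadcasts it to each row
df['Frequency'] = df['Category'].map(df['Category'].value_counts())

# groupby/transform('size') is an equivalent one-liner
df['Frequency2'] = df.groupby('Category')['Category'].transform('size')

print(df['Frequency'].tolist())  # [1, 2, 2, 1, 3, 3, 3, 1]
```

Both columns have exactly one entry per row of the original dataframe, so they can be assigned directly without any alignment gaps.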

Refer to next index in pandas

If I had a simple pandas DataFrame like this:
frame = pd.DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'), index=list('123'))
I want to find the max value in each row, use it to locate the value in the same column but in the next row, and add this value to a new column.
So the above DataFrame looks like this (with d2 changed to 3):
a b c d
1 1 2 3 4
2 5 6 7 3
3 9 10 11 12
So, conceptually the first row should be scanned, 4 is identified as the largest number, then 3 is found as the number within the same column but in the next index. Similarly for row 2: 7 is the largest number, and 11 is the next number in that column. So 3 and 11 should get added to a new column like this:
a b c d Next
1 1 2 3 4 NaN
2 5 6 7 3 3
3 9 10 11 12 11
I started by making a function like this, but it only finds the max values.
f = lambda x: x.max()
max = frame.apply(f, axis='columns')
frame['Next'] = max
Based on your edit, you can use np.argmax:
i = np.arange(len(df))
j = pd.Series(np.argmax(df.values, axis=1))
df['next'] = df.shift(-1).values[i, j]
a b c d next
1 1 2 3 4 3.0
2 5 6 7 3 11.0
3 9 10 11 12 NaN
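The indexing trick can be checked with the frame from the question (a minimal sketch; note this answer places each row's "next" value on the current row rather than the row below):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 3], [9, 10, 11, 12]],
    columns=list('abcd'), index=list('123'))

i = np.arange(len(df))                  # row positions 0..n-1
j = np.argmax(df.values, axis=1)        # column position of each row's max
df['next'] = df.shift(-1).values[i, j]  # value below each max; NaN for last row
print(df['next'].tolist())              # [3.0, 11.0, nan]
```

shift(-1) moves every row up by one, so indexing the shifted values at (row, argmax-column) picks out the entry directly beneath each row's maximum.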

Python: Applying a function to DataFrame taking input from the new calculated column

I'm facing a problem with applying a function to a DataFrame (to model a solar collector based on annual hourly weather data).
Suppose I have the following (simplified) DataFrame:
df2:
A B C
0 11 13 5
1 6 7 4
2 8 3 6
3 4 8 7
4 0 1 7
Now I have defined a function that takes each row as input to create a new column called D, but I want the function to also take the last calculated value of D as input (except of course for the first row, where no previous value of D exists).
def Funct(x):
    D = x['A'] + x['B'] + x['C'] + (x-1)['D']
I know that the function above is not working, but it gives an idea of what I want.
So to summarise:
Create a function that creates a new column in the dataframe and takes the value of the new column one row above it as input
Can somebody help me?
Thanks in advance.
It sounds like you are calculating a cumulative sum. In that case, use cumsum:
In [45]: df['D'] = (df['A']+df['B']+df['C']).cumsum()
In [46]: df
Out[46]:
A B C D
0 11 13 5 29
1 6 7 4 46
2 8 3 6 63
3 4 8 7 82
4 0 1 7 90
[5 rows x 4 columns]
Are you looking for this?
You can use shift to align the previous row with current row and then you can do your operation.
In [7]: df
Out[7]:
a b
1 1 1
2 2 2
3 3 3
4 4 4
[4 rows x 2 columns]
In [8]: df['c'] = df['b'].shift(1) #First row will be Nan
In [9]: df
Out[9]:
a b c
1 1 1 NaN
2 2 2 1
3 3 3 2
4 4 4 3
[4 rows x 3 columns]
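To confirm that the recursive definition D[i] = A[i] + B[i] + C[i] + D[i-1] is exactly a cumulative sum, here is a minimal sketch comparing cumsum against an explicit Python loop on the data from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [11, 6, 8, 4, 0],
                   'B': [13, 7, 3, 8, 1],
                   'C': [5, 4, 6, 7, 7]})

# Vectorised: cumulative sum of the row totals
df['D'] = (df['A'] + df['B'] + df['C']).cumsum()

# Explicit recursion for comparison: D[i] = row_sum[i] + D[i-1]
prev, loop_values = 0, []
for total in (df['A'] + df['B'] + df['C']):
    prev += total
    loop_values.append(prev)

print(df['D'].tolist())  # [29, 46, 63, 82, 90]
print(loop_values)       # identical
```

When the dependence on the previous D value is more complicated than a running sum (for example, a temperature that feeds back non-linearly into the next hour's collector output), a plain loop or shift-based iteration like the one above is the fallback, since no single vectorised pandas primitive covers arbitrary recursions.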
