Secondary row value of highest rolling sums pandas - python

I am trying to get the value of one column according to the highest cumulative sum of a different column. My dataframe looks like this:
df = pd.DataFrame({'constant': ['a', 'b', 'b', 'c', 'c', 'd', 'a'], 'value': [1, 3, 1, 5, 1, 9, 2]})
indx constant value
0 a 1
1 b 3
2 b 1
3 c 5
4 c 1
5 d 9
6 a 2
I am trying to add a new field containing the constant that has the highest cumulative sum of value up to that point in the dataframe. The final dataframe would look like this:
indx constant value new_field
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d
As you can see, at index 1, a has the highest cumulative sum of value for all prior rows. At index 2, b has the highest cumulative sum of value for all prior rows, and so on.
Anyone have a solution?

As presented, you just need a shift. However, try the following for other scenarios.
Steps
Find the cumulative maximum of 'value'.
Where the cumulative max equals df['value'], copy the 'constant'; otherwise make it NaN.
Forward-filling the NaNs then propagates the constant that corresponds to the most recent maximum.
Outcome
df=df.assign(new_field=(np.where(df['value']==df['value'].cummax(), df['constant'], np.nan))).ffill()
df=df.assign(new_field=df['new_field'].shift())
constant value new_field
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d

I think you should try and approach this as a pivot table, which would allow you to use np.argmax over the column axis.
# this accumulates `value` down the index for each value of `constant`
X = df.pivot_table(
    index=df.index,
    columns=['constant'],
    values='value'
).fillna(0.0).cumsum(axis=0)
# now you get the column positions that maximize the cumulative value across the column axis - i.e., the "winner"
colix = np.argmax(X.values, axis=1)
# you can fetch corresponding column names using this argmax index
df['winner'] = np.r_[[np.nan], X.columns[colix].values[:-1]]
# and there you go
df
constant value winner
0 a 1 NaN
1 b 3 a
2 b 1 b
3 c 5 b
4 c 1 c
5 d 9 c
6 a 2 d
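The same per-constant running totals can also be built without pivot_table. The following is a minimal sketch, not the answerer's code: it assumes pd.get_dummies to spread value over one column per constant and idxmax(axis=1) in place of np.argmax, and the name running is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'constant': ['a', 'b', 'b', 'c', 'c', 'd', 'a'],
                   'value': [1, 3, 1, 5, 1, 9, 2]})

# One column per constant; each row carries `value` only in its own constant's column.
# The cumulative sum then gives every constant's running total at every row.
running = (pd.get_dummies(df['constant'], dtype=float)
             .mul(df['value'], axis=0)
             .cumsum())

# Label of the leading constant at each row, shifted by one so that
# each row only reflects the rows before it (the first row has no history).
df['new_field'] = running.idxmax(axis=1).shift()
print(df)
```

On the question's data this should reproduce the desired new_field column. If two constants are tied, idxmax picks the first column label, which matches what np.argmax does in the answer above.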

You should be a little more careful, since values can be negative and decrease the cumulative sum. Here is what you probably need to do:
df["cumsum"] = df["value"].cumsum()
df["cummax"] = df["cumsum"].cummax()
df["new"] = np.where(df["cumsum"] == df["cummax"], df['constant'], np.nan)
df["new"] = df.ffill()["new"].shift()
df

Related

How can a duplicate row be dropped with some condition [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 9 months ago.
Simple DataFrame:
df = pd.DataFrame({'A': [1,1,2,2], 'B': [0,1,2,3], 'C': ['a','b','c','d']})
df
A B C
0 1 0 a
1 1 1 b
2 2 2 c
3 2 3 d
I wish for every value (groupby) of column A, to get the value of column C, for which column B is maximum. For example for group 1 of column A, the maximum of column B is 1, so I want the value "b" of column C:
A C
0 1 b
1 2 d
No need to assume column B is sorted, performance is of top priority, then elegance.
Check with sort_values +drop_duplicates
df.sort_values('B').drop_duplicates(['A'],keep='last')
Out[127]:
A B C
1 1 1 b
3 2 3 d
df.groupby('A').apply(lambda x: x.loc[x['B'].idxmax(), 'C'])
# A
#1 b
#2 d
Use idxmax to find the index where B is maximal, then select column C within that group (using a lambda function).
Here's a little fun with groupby and nlargest:
(df.set_index('C')
   .groupby('A')['B']
   .nlargest(1)
   .index
   .to_frame()
   .reset_index(drop=True))
A C
0 1 b
1 2 d
Or, sort_values, groupby, and last:
df.sort_values('B').groupby('A')['C'].last().reset_index()
A C
0 1 b
1 2 d
Similar solution to @Jondiedoop, but avoids the apply:
u = df.groupby('A')['B'].idxmax()
df.loc[u, ['A', 'C']].reset_index(drop=True)
A C
0 1 b
1 2 d
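Since the question says column B need not be sorted, here is a small sketch of the idxmax approach on a shuffled copy of the data. The helper name max_row_per_group and its parameters are hypothetical, introduced only for illustration.

```python
import pandas as pd

def max_row_per_group(frame, group_col, rank_col, keep_cols):
    """Hypothetical helper: for each group, keep the row where rank_col is largest."""
    # idxmax returns, per group, the index label of the row holding the maximum
    winners = frame.groupby(group_col)[rank_col].idxmax()
    return frame.loc[winners, keep_cols].reset_index(drop=True)

# Same rows as in the question, but deliberately out of order
df = pd.DataFrame({'A': [2, 1, 2, 1],
                   'B': [3, 0, 2, 1],
                   'C': ['d', 'a', 'c', 'b']})

result = max_row_per_group(df, 'A', 'B', ['A', 'C'])
print(result)
```

When a group has several rows sharing the maximum, idxmax keeps the first one it encounters.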

How to get first n records of groups based on column value

I am wondering how I can use groupby and head to get the first n values of a group of records, where n is encoded in a column in the original dataframe.
import pandas as pd
df = pd.DataFrame({"A": [1] * 4 + [2] * 3, "B": list(range(1, 8))})
gp = df.groupby("A").head(2)
print(gp)
This will return the first 2 records of each group. How would I go ahead if I wanted the first 1 of group 1, and the first 2 of group 2, as encoded in column A?
Desired outcome:
A B
0 1 1
4 2 5
5 2 6
We can create a sequential counter using groupby + cumcount to number the rows within each group of column A, then build a boolean mask that is True where the counter is less than or equal to the value encoded in column A. Filtering with this mask keeps the required rows:
df[df.groupby('A').cumcount().add(1).le(df['A'])]
A B
0 1 1
4 2 5
5 2 6
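The one-liner above can be unpacked into its intermediate steps. This is only a sketch for inspection; the names position and mask are made up here.

```python
import pandas as pd

df = pd.DataFrame({"A": [1] * 4 + [2] * 3, "B": list(range(1, 8))})

# 1-based position of every row inside its group of A
position = df.groupby("A").cumcount() + 1

# Keep a row only while its position does not exceed the n stored in A
mask = position <= df["A"]

result = df[mask]
print(result)
```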
Here is a solution that calls DataFrame.head inside a custom function; the group key from A is available as x.name, so each group keeps its first A rows:
gp = df.groupby("A", group_keys=False).apply(lambda x: x.head(x.name))
print(gp)
A B
0 1 1
4 2 5
5 2 6
If you need to filter by the order in which the A values appear instead, the solution is:
df = pd.DataFrame({"A": [8] * 4 + [6] * 3, "B": list(range(1, 8))})
d = {v: k for k, v in enumerate(df.A.unique(), 1)}
gp = df.groupby("A", group_keys=False, sort=False).apply(lambda x: x.head(d[x.name]))
print(gp)
A B
0 8 1
4 6 5
5 6 6
df_ = pd.concat([gp[1].head(i+1) for i, gp in enumerate(df.groupby("A"))])
# print(df_)
A B
0 1 1
4 2 5
5 2 6

When to use .count() and .value_counts() in Pandas?

I am learning pandas. I'm not sure when to use the .count() function and when to use .value_counts().
count() is used to count the number of non-NA/null observations across the given axis. It works with non-floating type data as well.
As an example, create a dataframe df:
df = pd.DataFrame({"A": [10, 8, 12, None, 5, 3],
                   "B": [-1, None, 6, 4, None, 3],
                   "C": ["Shreyas", "Aman", "Apoorv", np.nan, "Kunal", "Ayush"]})
Count the non-NA values in each column (axis=0 runs down the rows):
df.count(axis = 0)
Output:
A 5
B 4
C 5
dtype: int64
Count the non-NA/null values in each row (axis=1 runs across the columns):
df.count(axis = 1)
Output:
0 3
1 2
2 3
3 1
4 2
5 3
dtype: int64
value_counts() function returns Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
So for the example shown below
s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts()
The output would be:
3.0 2
4.0 1
2.0 1
1.0 1
dtype: int64
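To put the two methods side by side on the same series, here is a short sketch. The variable names are illustrative, and dropna=False is shown only to make the NA handling visible.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# count(): a single number, the non-NA entries
n_non_null = s.count()

# value_counts(): one entry per distinct value, NA excluded by default
per_value = s.value_counts()

# dropna=False adds a separate entry for the missing values
with_nan = s.value_counts(dropna=False)

print(n_non_null)
print(per_value)
print(with_nan)
```

The counts returned by value_counts() add up to the number returned by count(), because both skip NA values by default.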
value_counts() aggregates the data and counts each unique value. You can achieve the same with groupby, which is a more general function for aggregating data in pandas.
count() simply returns the number of non-NaN/null values in the column (Series) you apply it to.
df = pd.DataFrame({'Id': ['A', 'B', 'B', 'C', 'D', 'E', 'F', 'F'],
                   'Value': [10, 20, 15, 5, 35, 20, 10, 25]})
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 C 5
4 D 35
5 E 20
6 F 10
7 F 25
# Value counts
df['Id'].value_counts()
F 2
B 2
C 1
A 1
D 1
E 1
Name: Id, dtype: int64
# Same operation but with groupby
df.groupby('Id')['Id'].count()
Id
A 1
B 2
C 1
D 1
E 1
F 2
Name: Id, dtype: int64
# Count()
df['Id'].count()
8
Example with NaN values and count:
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 NaN 5
4 D 35
5 E 20
6 F 10
7 F 25
df['Id'].count()
7
count() returns the total number of non-null values in the series.
value_counts() returns a series of the number of times each unique non-null value appears, sorted from most to least frequent.
As usual, an example is the best way to convey this:
ser = pd.Series(list('aaaabbbccdef'))
ser
>
0 a
1 a
2 a
3 a
4 b
5 b
6 b
7 c
8 c
9 d
10 e
11 f
dtype: object
ser.count()
>
12
ser.value_counts()
>
a 4
b 3
c 2
f 1
d 1
e 1
dtype: int64
Note that a dataframe also has a count() method, which returns a series holding the (scalar) count() of each column in the df. Older pandas versions had no DataFrame.value_counts(); it was added in pandas 1.1.0, where it counts unique rows rather than individual values.

Python: given list of columns and list of values, return subset of dataframe that meets all criteria

I have a dataframe like the following.
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
Assume that column A will always be in the dataframe, but sometimes there could be only column B, columns B and C, or a larger number of columns.
I have written code that saves the column names (other than A) in a list, and the unique permutations of the values in the other columns in another list. For instance, in this example, columns B and C are saved into col:
col = ['B','C']
The permutations in the simple df are 1,7; 2,8; 3,9. For simplicity assume one permutation is saved as follows:
permutation = [2,8]
How do I select the entire rows (and only those) that equal that permutation?
Right now, I am using:
df[df[col].isin(permutation)]
Unfortunately, I don't get the values in column A.
I know how to drop the values that are NaN later, but how should I do this to keep it dynamic? Sometimes there will be multiple columns. Ultimately, I'll run through a loop and save the different iterations based upon multiple permutations in the columns other than A.
Use the intersection of boolean series (where both conditions are true) - first setup code:
import pandas as pd
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
col = ['B','C']
permutation = [2,8]
And here's the solution for this limited example:
>>> df[(df[col[0]] == permutation[0]) & (df[col[1]] == permutation[1])]
A B C
1 Jean 2 8
3 Sue 2 8
To break that down:
>>> b, c = col
>>> per_b, per_c = permutation
>>> column_b_matches = df[b] == per_b
>>> column_c_matches = df[c] == per_c
>>> intersection = column_b_matches & column_c_matches
>>> df[intersection]
A B C
1 Jean 2 8
3 Sue 2 8
Additional columns and values
To take any number of columns and values, I would create a function:
def select_rows(df, columns, values):
    if not columns or not values:
        raise Exception('must pass columns and values')
    if len(columns) != len(values):
        raise Exception('columns and values must be same length')
    intersection = True
    for c, v in zip(columns, values):
        intersection &= df[c] == v
    return df[intersection]
and to use it:
>>> select_rows(df, col, permutation)
A B C
1 Jean 2 8
3 Sue 2 8
Or you can coerce the permutation to an array and accomplish this with a single comparison, assuming numeric values:
import numpy as np
def select_rows(df, columns, values):
    # compare each listed column with the value in the same position
    return df[(df[columns] == np.array(values)).all(axis=1)]
But this does not work with your code sample as given
I figured out a solution. Aaron's above works well if I only have two columns. I need a solution that works regardless of the size of the df (as size will be 3-7 columns).
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
permutation = [2,8]
col = ['B','C']
interim = df[col].isin(permutation)
df[df.index.isin(interim[(interim != 0).all(1)].index)]
You can do it this way:
In [77]: permutation = np.array([0,2,2])
In [78]: col
Out[78]: ['a', 'b', 'c']
In [79]: df.loc[(df[col] == permutation).all(axis=1)]
Out[79]:
a b c
10 0 2 2
15 0 2 2
16 0 2 2
Your solution will not always work properly:
sample DF:
In [71]: df
Out[71]:
a b c
0 0 2 1
1 1 1 1
2 0 1 2
3 2 0 1
4 0 1 0
5 2 0 0
6 2 0 0
7 0 1 0
8 2 1 0
9 0 0 0
10 0 2 2
11 1 0 1
12 2 1 1
13 1 0 0
14 2 1 0
15 0 2 2
16 0 2 2
17 1 0 2
18 0 1 1
19 1 2 0
In [67]: col = ['a','b','c']
In [68]: permutation = [0,2,2]
In [69]: interim = df[col].isin(permutation)
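Pay attention to the result: rows 5, 6 and 9 are returned even though they do not equal [0, 2, 2], because isin only checks that each value appears somewhere in the list: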
In [70]: df[df.index.isin(interim[(interim != 0).all(1)].index)]
Out[70]:
a b c
5 2 0 0
6 2 0 0
9 0 0 0
10 0 2 2
15 0 2 2
16 0 2 2
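The difference between the two approaches can be reproduced on a frame as small as the one in the question. This is a sketch under one assumption: Sally's row is altered to hold the values 8 and 2 (the permutation reversed), which is not in the original data.

```python
import numpy as np
import pandas as pd

# Assumption for illustration: Sally holds the permutation in reversed order
df = pd.DataFrame({'A': ['Bob', 'Jean', 'Sally', 'Sue'],
                   'B': [1, 2, 8, 2],
                   'C': [7, 8, 2, 8]})
col = ['B', 'C']
permutation = [2, 8]

# Positional match: B must equal 2 and C must equal 8
by_position = df[(df[col] == np.array(permutation)).all(axis=1)]

# Membership match: every value merely has to appear somewhere in the list
by_membership = df[df[col].isin(permutation).all(axis=1)]

print(by_position)
print(by_membership)
```

The membership version also returns Sally, whose values are 8 and 2, so the positional comparison is the safer choice when the order of the values matters.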

pandas - Going from aggregated format to long format

If I would go from a long format to a grouped aggregated format I would simply do:
s = pd.DataFrame(['a','a','a','a','b','b','c'], columns=['value'])
s.groupby('value').size()
value
a 4
b 2
c 1
dtype: int64
Now if I wanted to revert that aggregation and go from a grouped format to a long format, how would I go about doing that? I guess I could loop through the grouped series and repeat 'a' 4 times and 'b' 2 times etc.
Is there a better way to do this in pandas or any other Python package?
Thankful for any hints
Perhaps .transform can help with this:
s['size'] = s.groupby('value')['value'].transform('size')
s
yielding:
value size
0 a 4
1 a 4
2 a 4
3 a 4
4 b 2
5 b 2
6 c 1
Another and rather simple approach is to use np.repeat (assuming s2 is the aggregated series):
In [17]: np.repeat(s2.index.values, s2.values)
Out[17]: array(['a', 'a', 'a', 'a', 'b', 'b', 'c'], dtype=object)
In [18]: pd.DataFrame(np.repeat(s2.index.values, s2.values), columns=['value'])
Out[18]:
value
0 a
1 a
2 a
3 a
4 b
5 b
6 c
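For a self-contained round trip, the sketch below builds the aggregated series from the question's data and expands it again. It uses Index.repeat, the pandas counterpart of np.repeat; the name long_again is made up for illustration.

```python
import pandas as pd

s = pd.DataFrame(['a', 'a', 'a', 'a', 'b', 'b', 'c'], columns=['value'])

# Aggregate: one entry per distinct value with its number of occurrences
s2 = s.groupby('value').size()

# Expand: repeat every index label as many times as its count
long_again = pd.DataFrame({'value': s2.index.repeat(s2.to_numpy())})
print(long_again)
```

Because groupby sorts the group keys, the expanded frame comes back in sorted order, which is not necessarily the original row order.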
There might be something cleaner, but here's an approach. First, store your groupby results in a dataframe and rename the columns.
agg = s.groupby('value').size().reset_index()
agg.columns = ['key', 'count']
Then, build a frame with columns that track the count for each letter.
counts = agg['count'].apply(lambda x: pd.Series([0] * x))
counts['key'] = agg['key']
In [107]: counts
Out[107]:
0 1 2 3 key
0 0 0 0 0 a
1 0 0 NaN NaN b
2 0 NaN NaN NaN c
Finally, this can be melted and the nulls dropped to get your desired frame.
In [108]: pd.melt(counts, id_vars='key').dropna()[['key']]
Out[108]:
key
0 a
1 b
2 c
3 a
4 b
6 a
9 a
