With this df as a base, I want the following output:
Everything should be aggregated by column 0: the strings from column 1 should be collected, and the numbers from column 2 should be summed whenever the strings from column 1 have the same name.
With the following code I could aggregate the strings, but without summing the numbers:
df2 = df1.groupby([0]).agg(lambda x: ','.join(set(x))).reset_index()
df2
Avoid an arbitrary number of columns
Your desired output suggests you want an arbitrary number of columns, dependent on the number of values in column 1 for each group in column 0. This is anti-Pandas: the library is strongly geared towards an arbitrary number of rows, so series-wise operations are preferred.
So you can just use groupby + sum to store all the information you require.
import pandas as pd

df = pd.DataFrame({0: ['2008-04_E.pdf']*3,
                   1: ['Mat1', 'Mat2', 'Mat2'],
                   2: [3, 1, 1]})
df_sum = df.groupby([0, 1]).sum().reset_index()
print(df_sum)
               0     1  2
0  2008-04_E.pdf  Mat1  3
1  2008-04_E.pdf  Mat2  2
But if you insist...
If you insist on your unusual requirement, you can achieve it as follows via df_sum calculated as above.
key = df_sum.groupby(0)[1].cumcount().add(1).map('Key{}'.format).rename('key')
res = df_sum.set_index([0, key]).unstack('key').reset_index()
res.columns = res.columns.droplevel(0)
print(res)
                  Key1  Key2 Key1 Key2
0  2008-04_E.pdf  Mat1  Mat2    3    2
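If the repeated Key1/Key2 headers are a problem, a possible variant (just a sketch, building on key and df_sum as defined above) keeps both column levels and joins them into unique names instead of dropping one:
# sketch: keep both column levels so the repeated Key1/Key2 headers stay distinguishable
res2 = df_sum.set_index([0, key]).unstack('key').reset_index()
res2.columns = [f'{a}_{b}' if b else str(a) for a, b in res2.columns]
# res2.columns is now ['0', '1_Key1', '1_Key2', '2_Key1', '2_Key2']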
This seems like a 2-step process. It also requires that each group in column 0 has the same number of unique values in column 1. First, group by the columns you want grouped:
df_grouped = df.groupby([0,1]).sum().reset_index()
Then reshape to the form you want:
def group_to_row(group):
    # sort within the group so the columns come out in a consistent order
    group = group.sort_values(1)
    output = []
    for i, row in group[[1, 2]].iterrows():
        output += row.tolist()
    return pd.DataFrame(data=[output])
df_output = df_grouped.groupby(0).apply(group_to_row).reset_index()
This is untested, but it is also quite a non-standard form, so unfortunately I don't think there's a standard Pandas function for this.
Related
I am transitioning from Excel to Python and finding the process a little daunting. I have a pandas dataframe and cannot find how to count the total of each cluster of '1's per row, grouped by each ID (example data below).
ID 20-21 19-20 18-19 17-18 16-17 15-16 14-15 13-14 12-13 11-12
0 335344 0 0 1 1 1 0 0 0 0 0
1 358213 1 1 0 1 1 1 1 0 1 0
2 358249 0 0 0 0 0 0 0 0 0 0
3 365663 0 0 0 1 1 1 1 1 0 0
The result of the above, in the format
ID
<heading of the last column in the cluster where a '1' occurs>: <count of '1's in that cluster>
would be:
335344
16-17: 3
358213
19-20: 2
14-15: 4
12-13: 1
365663
13-14: 5
There are more than 11,000 rows of data, and I would like to output the result to a txt file. I have been unable to find any examples of how identical values are clustered by row with a count for each cluster, but I am probably not using the correct Python terminology. I would be grateful if someone could point me in the right direction. Thanks in advance.
The first step is to use DataFrame.set_index with DataFrame.stack to reshape. Then create consecutive groups by comparing the values with their Series.shift-ed counterparts for inequality and taking a cumulative sum with Series.cumsum into a new column g. Finally, filter the rows equal to 1 and aggregate with named aggregation via GroupBy.agg, using GroupBy.last and GroupBy.size:
df = df.set_index('ID').stack().reset_index(name='value')
df['g'] = df['value'].ne(df['value'].shift()).cumsum()
df1 = (df[df['value'].eq(1)].groupby(['ID', 'g'])
         .agg(a=('level_1', 'last'), b=('level_1', 'size'))
         .reset_index(level=1, drop=True)
         .reset_index())
print(df1)
       ID      a  b
0  335344  16-17  3
1  358213  19-20  2
2  358213  14-15  4
3  358213  12-13  1
4  365663  13-14  5
Finally, to write to a txt file, use DataFrame.to_csv:
df1.to_csv('file.txt', index=False)
If you need your custom format in the text file, use:
with open("file.txt", "w") as f:
    for i, g in df1.groupby('ID'):
        f.write(f"{i}\n")
        for a, b in g[['a', 'b']].to_numpy():
            f.write(f"\t{a}: {b}\n")
You just need to use the sum method and specify which axis you would like to sum on. To get the sum of each row, create a new Series equal to the sum across each row.
# create a new series equal to the sum of values across each row
df['sum'] = df.sum(axis=1)  # axis=1 sums across the columns, giving one total per row
The best method for getting the sum of each column depends on how you want to use that information, but in general the core is just to call the sum method on the Series and assign the result to a variable.
# sum a column and assign result to variable
foo = df['20-21'].sum() # default axis=0
bar = df['16-17'].sum() # default axis=0
print(foo) # returns 1
print(bar) # returns 3
You can get the sum of each column using a for loop and add the results to a dictionary. Here is a quick function I put together that should get the sum of each column and return a dictionary of the results, so you know which total belongs to which column. The two inputs are 1) the dataframe and 2) a list of any column names you would like to ignore.
def get_df_col_sum(frame: pd.DataFrame, ignore: list) -> dict:
    """Get the sum of each column in a dataframe as a dictionary"""
    # get list of headers in dataframe
    dfcols = frame.columns.tolist()
    # create a blank dictionary to store results
    dfsums = {}
    # loop through each column and add its sum to the dictionary
    for dfcol in dfcols:
        if dfcol not in ignore:
            dfsums.update({dfcol: frame[dfcol].sum()})
    return dfsums
I then ran the following code
# read excel to dataframe
df = pd.read_excel(test_file)
# ignore the ID column
ignore_list = ['ID']
# get sum for each column
res_dict = get_df_col_sum(df, ignore_list)
print(res_dict)
and got the following result.
{'20-21': 1, '19-20': 1, '18-19': 1, '17-18': 3, '16-17': 3, '15-16': 2, '14-15': 2, '13-14': 1, '12-13': 1, '11-12': 0}
Sources: Sum by row, Pandas Sum, Add pairs to dictionary
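For reference, a more compact equivalent (just a sketch, assuming the same df and ignore_list as above) drops the ignored columns and sums the rest directly:
# sketch: same per-column totals without the explicit loop
res_dict = df.drop(columns=ignore_list).sum().to_dict()
print(res_dict)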
I need to add the number of unique values in column C (right table) to the related row in the left table based on the values in common column A (as shown in the picture):
Thank you in advance.
Group by column A in the second dataset and calculate the count of unique values in column C. Merge it with the first dataset on column A. Rename column C to C-count if needed:
>>> count_df = df2.groupby('A', as_index=False).C.nunique()
>>> output = pd.merge(df1, count_df, on='A')
>>> output.rename(columns={'C':'C-count'}, inplace=True)
>>> output
A B C-count
0 2 22 3
1 3 23 2
2 5 21 1
3 1 24 1
4 6 21 1
Use DataFrameGroupBy.nunique with Series.map to create the new column in df1:
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
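One possible caveat (a hedged sketch, assuming some values of A in df1 might have no matching rows in df2): the default inner merge above drops such rows, while the map approach leaves NaN; a left merge with fillna keeps every row of df1 and fills the missing counts with 0:
count_df = df2.groupby('A', as_index=False)['C'].nunique().rename(columns={'C': 'C-count'})
output = df1.merge(count_df, on='A', how='left').fillna({'C-count': 0})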
This may not be the most efficient way of doing it, so be careful if your dataframes are large.
Define the following function:
def c_value(a_value, right_table):
    c_ids = []
    for index, row in right_table.iterrows():
        if row['A'] == a_value:
            if row['C'] not in c_ids:
                c_ids.append(row['C'])
    return len(c_ids)
For this function I'm assuming that right_table is a pandas.DataFrame.
Now, do the following to build the new column (assuming that the left table is also a pandas.DataFrame):
new_column = []
for index, row in left_table.iterrows():
    new_column.append(c_value(row['A'], right_table))
left_table["C-count"] = new_column
After this, the left_table DataFrame should be the one desired (as far as I understand what you need).
I'm trying to translate a technical analysis operator from another proprietary language to Python using dataframes, but I got stuck on a problem that seems rather simple, yet I can't solve it the pandas way. To simplify the problem, let's take the example of this dataframe:
d = {'value1': [0,1,2,3], 'value2': [4,5,6,7]}
df = pd.DataFrame(data=d)
which results in the following dataframe:
What I want to achieve is this:
which in pseudocode I would achieve in the following way:
value1 = [0, 1, 2, 3]
value2 = [4, 5, 6, 7]
result = []
for i in range(len(value1)):
    calculation = value1[i] * value2[i]
    lookback = value1[i]
    for j in range(lookback):
        calculation -= value2[j]
    result.append(calculation)
How would I tackle this in a dataframe context? The only similar approach I found in the documentation is https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#, but there is no mention of interacting with or manipulating the Series contained in the columns/rows.
df['result'] = df.value1 * df.value2 - (df.value2.cumsum() - df.value2)
df
Output
   value1  value2  result
0       0       4       0
1       1       5       1
2       2       6       3
3       3       7       6
Explanation
We calculate the cumulative sum of value2 and subtract the current value2 from it, leaving the sum of all previous value2 entries; that total is then subtracted from the product of value1 and value2.
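To make this concrete, here is a small sketch (assuming the example df from the question) that shows the intermediate quantity being subtracted:
import pandas as pd

d = {'value1': [0, 1, 2, 3], 'value2': [4, 5, 6, 7]}
df = pd.DataFrame(data=d)

# sum of value2 over all rows strictly before the current one
prior_sum = df.value2.cumsum() - df.value2
print(prior_sum.tolist())         # [0, 4, 9, 15]

df['result'] = df.value1 * df.value2 - prior_sum
print(df['result'].tolist())      # [0, 1, 3, 6]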
This solution should work even if the first column value1 contains arbitrary integers rather than the increasing sequence 0, 1, 2, ..., and it follows the pseudocode provided by the OP.
You should just ensure that every value in value1 is a valid row position for the dataframe (that is, no integer greater than the number of rows minus one), which is also required by the pseudocode.
import pandas as pd

d = {'value1': [0, 1, 2, 3], 'value2': [4, 5, 6, 7]}
df = pd.DataFrame(data=d)

csum2 = df["value2"].cumsum()
# sum2 holds the sum of value2[0:lookback] for each row, where lookback = value1
df["sum2"] = [csum2.iloc[v] - df["value2"].iloc[v] for v in df["value1"]]
df["result"] = df["value1"] * df["value2"] - df["sum2"]
df.drop("sum2", axis=1, inplace=True)
To explain: I save in an additional column "sum2" the result of the inner loop in the pseudocode (for j in range(lookback):), so that I can then perform the main operation to get the "result" column.
At the end df is:
   value1  value2  result
0       0       4       0
1       1       5       1
2       2       6       3
3       3       7       6
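As a sanity check (just a sketch, reusing df from above), the OP's pseudocode can be brute-forced and compared against the result column:
# sketch: brute-force the pseudocode and verify it matches df['result']
expected = []
for i in range(len(df)):
    calc = df.loc[i, 'value1'] * df.loc[i, 'value2']
    for j in range(df.loc[i, 'value1']):   # lookback = value1[i]
        calc -= df.loc[j, 'value2']
    expected.append(calc)
assert expected == df['result'].tolist()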
I'm trying to do a conditional count across records in a pandas dataframe. I'm new to Python and have a working solution using a for loop, but running it on a large dataframe with ~200k rows takes a long time. I believe there is a better way by defining a function and using apply, but I'm having trouble figuring it out.
Here's a simple example.
Create a pandas dataframe with two columns:
import pandas as pd
data = {'color': ['blue', 'green', 'yellow', 'blue', 'green', 'yellow', 'orange', 'purple', 'red', 'red'],
        'weight': [4, 5, 6, 4, 1, 3, 9, 8, 4, 1]}
df = pd.DataFrame(data)
# for each row, count the number of other rows with the same color and a lesser weight
counts = []
for i in df.index:
    c = df.loc[i, 'color']
    w = df.loc[i, 'weight']
    ct = len(df.loc[(df['color'] == c) & (df['weight'] < w)])
    counts.append(ct)
df['counts, same color & less weight'] = counts
For each record, the 'counts, same color & less weight' column is intended to get a count of the other records in the df with the same color and a lesser weight. For example, the result for row 0 (blue, 4) is zero because no other records with color=='blue' have lesser weight. The result for row 1 (green, 5) is 1 because row 4 is also color=='green' but weight==1.
How do I define a function that can be applied to the dataframe to achieve the same?
I'm familiar with apply, for example to square the weight column I'd use:
df['weight squared'] = df['weight'].apply(lambda x: x**2)
... but I'm unclear how to use apply to do a conditional calculation that refers to the entire df.
Thanks in advance for any help.
We can do transform with min within groupby:
df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
0 0
1 1
2 1
3 0
4 0
5 0
6 0
7 0
8 1
9 0
Name: weight, dtype: int64
# df['counts, same color & less weight'] = df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
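If an actual count is needed rather than the 0/1 flag above, here are two hedged sketches, assuming the df from the question: the apply version the OP asked about, where the lambda can reference the whole df, and a vectorized version using a rank trick.
# sketch 1: apply row-wise; the lambda closes over the full df
df['counts, same color & less weight'] = df.apply(
    lambda row: ((df['color'] == row['color']) & (df['weight'] < row['weight'])).sum(),
    axis=1)

# sketch 2: vectorized; rank(method='min') equals 1 + number of strictly smaller
# weights within the same color, so subtracting 1 gives the count
df['counts, same color & less weight'] = (
    df.groupby('color')['weight'].rank(method='min').sub(1).astype(int))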
I have a dataframe with multiple columns. One of the columns (denoted as B in the example) works as a trigger, i.e.,
I have to drop all rows after the first value bigger than 0.5. However, I have to keep that first value.
An example is given below. All rows after 0.59 (which is the first value that satisfies the condition of being bigger than 0.5) are deleted.
initial_df = pd.DataFrame([[1,0.4], [5,0.43], [4,0.59], [11,0.41], [9,0.61]], columns = ['A', 'B'])
Below, the blue box indicates the trigger and the red box the values that have to be dropped.
In the end, the goal is to obtain the following dataframe:
Is it possible to do this in pandas in an efficient way (not using a for loop)?
You can use np.where with Boolean indexing to extract the positional index of the first value matching a condition. Then feed this to iloc:
idx = np.where(df['B'].gt(0.5))[0][0]
res = df.iloc[:idx+1]
print(res)
A B
0 1 0.40
1 5 0.43
2 4 0.59
For very large dataframes where the condition is likely to be met early on, it is more efficient to use next with a generator expression to calculate idx:
idx = next((idx for idx, val in enumerate(df['B']) if val > 0.5), len(df.index))
For better performance, see Efficiently return the index of the first value satisfying condition in array.
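Another possible variant (a sketch under the same assumptions) computes the first position with argmax on the boolean mask, falling back to the full frame when nothing exceeds 0.5:
mask = df['B'].gt(0.5).to_numpy()
# argmax returns the first True position; guard the all-False case explicitly
idx = mask.argmax() if mask.any() else len(df)
res = df.iloc[:idx + 1]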
So this works if your index coincides with the positional (iloc) index:
first_occurence = initial_df[initial_df.B>0.5].index[0]
initial_df.iloc[:first_occurence+1]
EDIT: this is a more general solution
first_occurence = initial_df.index.get_loc(initial_df[initial_df.B>0.5].iloc[0].name)
final_df = initial_df.iloc[:first_occurence+1]
I found a solution similar to the one shown by jpp:
indices = initial_df.index
trigger = initial_df[initial_df.B > 0.5].index[0]
initial_df[initial_df.index.isin(indices[indices<=trigger])]
Since the real dataframe has multiple indices, this is the only solution that I found.
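A further index-agnostic option (a hedged sketch, assuming the same initial_df) is a boolean cummax: mark everything from the first trigger onwards, shift by one so the trigger row itself is kept, and filter:
mask = initial_df['B'].gt(0.5).cummax().shift(fill_value=False)
final_df = initial_df[~mask]
print(final_df)
#    A     B
# 0  1  0.40
# 1  5  0.43
# 2  4  0.59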
I am assuming you want to delete all rows where the "B" column value is less than 0.5.
Try this:
initial_df = pd.DataFrame([[1, 0.4], [5, 0.43], [4, 0.59], [11, 0.41], [9, 0.61]], columns=['A', 'B'])
final_df = initial_df[initial_df['B'] >= 0.5]
The resulting dataframe, final_df, is:
A B
2 4 0.59
4 9 0.61