Conditionally setting rows in pandas groupby - python

I have a (simplified) dataframe like:
+--------+-----------+-------+
| type | estimated | value |
+--------+-----------+-------+
| type_a | TRUE | 1 |
| type_a | TRUE | 2 |
| type_a | | 3 |
| type_b | | 4 |
| type_b | | 5 |
| type_b | | 6 |
+--------+-----------+-------+
I'd like to group and sum it into two rows:
+--------+-----------+-------+
| type | estimated | value |
+--------+-----------+-------+
| type_a | TRUE | 6 |
| type_b | | 15 |
+--------+-----------+-------+
However, I want the grouped row's 'estimated' column to be TRUE if any of the rows that formed it were estimated. If I include 'estimated' in the groupby, the rows won't be grouped together.
My idea was to iterate through each group, e.g. (pseudocode)
grouped = df.groupby('type')
for name, group in grouped:
    group['flag'] = 0
    for _, row in group.iterrows():
        if row['estimated'] == True:
            group['flag'] = 1
Then after grouping I could set all the rows with non-zero 'flag' to an estimated = True.
I'm having some trouble figuring out how to iterate through rows of groups, and the solution seems pretty hacky. Also you shouldn't edit something you're iterating over. Is there a solution/better way?

You want groupby with agg:
df.groupby('type').agg(dict(estimated='any', value='sum')).reset_index()
     type  value  estimated
0  type_a      6       True
1  type_b     15      False
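As a minimal runnable sketch of the same approach (assuming the blank 'estimated' cells are stored as False; empty strings or NaN are also treated as falsy by any):
import pandas as pd

# Sample data from the question; blanks in 'estimated' represented as False (assumption).
df = pd.DataFrame({
    "type": ["type_a", "type_a", "type_a", "type_b", "type_b", "type_b"],
    "estimated": [True, True, False, False, False, False],
    "value": [1, 2, 3, 4, 5, 6],
})

# 'any' flags a group as estimated if any of its rows were estimated; 'sum' totals the values.
print(df.groupby("type").agg({"estimated": "any", "value": "sum"}).reset_index())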

Related

Getting different Values when using groupby(column)["id"].nunique and trying to add a column using transform

I'm trying to count the unique values per group in a dataset and add them as a new column to another table. The first step works, but the second produces wrong values.
When I use the following code
unique_id_per_column = source_table.groupby("disease").some_id.nunique()
I'll get
| | disease | some_id |
|---:|:------------------------|--------:|
| 0 | disease1 | 121 |
| 1 | disease2 | 1 |
| 2 | disease3 | 5 |
| 3 | disease4 | 9 |
| 4 | disease5 | 77 |
These numbers seem to check out, but I want to add them to another table, which already has a column of values per group.
So I used the following code
table["unique_ids"] = source_table.groupby("disease").uniqe_id.transform("nunique")
and I get the following table, with wrong numbers for every row except the first.
| | disease |some_id | unique_ids |
|---:|:------------------------|-------:|------------------:|
| 0 | disease1 | 151 | 121 |
| 1 | disease2 | 1 | 121 |
| 2 | disease3 | 5 | 121 |
| 3 | disease4 | 9 | 121 |
| 4 | disease5 | 91 | 121 |
I expected to get the same results as in the first table. Does anyone know why the number from the first row is repeated instead of the correct numbers?
Use Series.map if you need to create the column in another DataFrame:
s = source_table.groupby("disease").some_id.nunique()
table["unique_ids"] = table["disease"].map(s)

Pyspark, Two dataframes groupBy at the same time and apply pandasUDF

I know that @pandas_udf(schema, PandasUDFType.GROUPED_MAP) should be used under a pyspark_dataframe.groupBy(something).apply().
What if I need two dataframes inside this pandas_udf? For example, I have a dataframe A:
| id | cluster| value |
|:---- |:------:| -----:|
| 1 | A | 3 |
| 2 | A | 5 |
| 3 | B | 7 |
| 4 | B | 5 |
And then I have dataframe B:
| id | cluster |
|:---- |:------:|
| 5 | A |
| 6 | B |
And my desired output dataframe is:
| id | cluster| pred |
|:---- |:------:| -----:|
| 5 | A | 5 |
| 6 | B | 6 |
Here 'pred' is the mean of 'value' within the matching cluster of A (i.e. from A.groupBy('cluster')). I want to achieve this through a pandas UDF, so I can only do B.groupBy('cluster').apply(my_pandasUDF). So I'm wondering: can my_pandasUDF take two inputs like this?
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def my_pandasUDF(df, A_df=A):
    means = A_df['value'].mean(axis=0)
    df['pred'] = means
    return df

B.groupBy('cluster').apply(my_pandasUDF).show()
Can this code give my desired result? If not, how can I do that? Thank you so much.
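One common way to get this result without passing a second dataframe into the UDF is to aggregate A separately and join the result onto B. A sketch under that assumption (the names A and B match the question; the join-based approach is an alternative, not the UDF-based plan above):
from pyspark.sql import functions as F

# Assumption: compute the per-cluster mean of A's 'value' once, then join it onto B.
cluster_means = A.groupBy("cluster").agg(F.mean("value").alias("pred"))

B.join(cluster_means, on="cluster", how="left").select("id", "cluster", "pred").show()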

Filtering all the rows until a max is reached in each group

I would like to filter a data frame with many users' attempts on some tests. I have sorted the data frame by ID and date. The problem is I don't know how to keep, for each user, only the rows up to that user's maximum; I want to drop the rows that come after the maximum point for every user.
For example:
| user | score | date |
|------|-------|------|
| A | 5 | 2021-11-14 10:22:13.854 |
| A | 7 | 2021-11-14 10:25:03.044 |
| B | 4 | 2021-11-16 19:01:42.005 |
| B | 7 | 2021-11-16 19:04:21.859 |
| B | 6 | 2021-11-16 19:06:52.372 |
I want to filter the data frame so that for user B only the first two rows are kept (since the third row's score is lower than the maximum for this user).
The result would be:
| user | score | date |
|------|-------|------|
| A | 5 | 2021-11-14 10:22:13.854 |
| A | 7 | 2021-11-14 10:25:03.044 |
| B | 4 | 2021-11-16 19:01:42.005 |
| B | 7 | 2021-11-16 19:04:21.859 |
This should work:
df.groupby('user').apply(lambda g: g.head(g['score'].argmax()+1)).reset_index(drop=True)
Because:
- first, group by user/ID
- then, for each group, get the position of the maximum score (if there are multiple, argmax picks the first occurrence)
- and return the rows up to and including that row
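A quick runnable sketch of this answer against the question's sample data (the dataframe construction here is added for illustration):
import pandas as pd

df = pd.DataFrame({
    "user":  ["A", "A", "B", "B", "B"],
    "score": [5, 7, 4, 7, 6],
    "date":  ["2021-11-14 10:22:13.854", "2021-11-14 10:25:03.044",
              "2021-11-16 19:01:42.005", "2021-11-16 19:04:21.859",
              "2021-11-16 19:06:52.372"],
})

# argmax() gives the position of the first maximum inside each group,
# and head(position + 1) keeps everything up to and including that row.
result = df.groupby('user').apply(lambda g: g.head(g['score'].argmax() + 1)).reset_index(drop=True)
print(result)  # user B's last row (score 6) is dropped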

How to dimensionalize a pandas dataframe

I'm looking for a more elegant way of doing this, other than a for-loop and unpacking manually...
Imagine I have a dataframe that looks like this
| id | value | date | name |
| -- | ----- | ---------- | ---- |
| 1 | 5 | 2021-04-05 | foo |
| 1 | 6 | 2021-04-06 | foo |
| 5 | 7 | 2021-04-05 | bar |
| 5 | 9 | 2021-04-06 | bar |
If I wanted to dimensionalize this, I could split it up into two different tables. One, perhaps, would contain "meta" information about the person, and the other would serve as "records" that all relate back to one person... a pretty simple idea as far as SQL-ian ideas go...
The resulting tables would look like this...
Meta
| id | name |
| -- | ---- |
| 1 | foo |
| 5 | bar |
Records
| id | value | date |
| -- | ----- | ---------- |
| 1 | 5 | 2021-04-05 |
| 1 | 6 | 2021-04-06 |
| 5 | 7 | 2021-04-05 |
| 5 | 9 | 2021-04-06 |
My question is, how can I achieve this "dimensionalizing" of a dataframe with pandas, without having to write a for loop on the unique id key field and unpacking manually?
Think about this not as "splitting" the existing dataframe, but as creating two new dataframes from the original. You can do this in a couple of lines:
meta = df[['id','name']].drop_duplicates() #Select the relevant columns and remove duplicates
records = df.drop("name", axis=1) #Replicate the original dataframe but drop the name column
You could drop_duplicates on a subset of columns (the ones you want to keep). For the second dataframe, you can drop the name column:
df1 = df.drop_duplicates(['id', 'name']).loc[:,['id', 'name']] # perigon's answer is simpler with df[['id','name']].drop_duplicates()
df2 = df.drop('name', axis=1)
df1, df2
Output:
( id name
0 1 foo
2 5 bar,
id value date
0 1 5 2021-04-05
1 1 6 2021-04-06
2 5 7 2021-04-05
3 5 9 2021-04-06)
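As a quick follow-up sketch (not part of either answer), the meta and records frames from the first answer can be joined back together on id, which is the SQL-style relationship the question describes:
# Reassemble the original rows by joining records back to meta on the shared key.
rebuilt = records.merge(meta, on="id", how="left")
print(rebuilt[["id", "value", "date", "name"]])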

Pandas sort_values, duplicate values in sort column

How does pandas treat equal values in the column it is sorting by?
dataFrame1
a | b | c |
--|---|---|
1 | 2 | 2 |
2 | 1 | 6 |
2 | 1 | 5 |
3 | 4 | 2 |
If I run dataFrame1.sort_values(by=['a'], ascending=True)
How does it treat the duplicate values in a?
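As far as I know, with the default kind='quicksort' the relative order of rows that tie on 'a' is not guaranteed; passing kind='stable' (or kind='mergesort') keeps tied rows in their original order. A minimal sketch with the question's data:
import pandas as pd

dataFrame1 = pd.DataFrame({"a": [1, 2, 2, 3], "b": [2, 1, 1, 4], "c": [2, 6, 5, 2]})

# kind='stable' preserves the original order of the two rows where a == 2;
# the default quicksort makes no such guarantee.
print(dataFrame1.sort_values(by=['a'], ascending=True, kind='stable'))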
