pandas.DataFrame.groupby(['date', 'some_category']).agg([np.sum, np.size]) produces a count that is repeated for each sum column. Is it possible to output just a single count column when passing a list of aggregate functions?
a = df_all.groupby(['date','some_category']).sum()
b = df_all.groupby(['date','some_category']).size()
pd.concat([a,b], axis=1)
produces basically what I want but seems awkward.
df.pivot_table(index=['date', 'some_category'], aggfunc=['sum', 'size']) is what I was looking for. This produces a single size column (though I am not sure why it is labeled '0') rather than a repeated (identical) size column for each summed column. Thanks all, I learned some useful things along the way.
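For illustration, here is a minimal sketch of that pivot_table call on made-up data (the value1 and value2 columns are hypothetical, and a reasonably recent pandas is assumed):

import pandas as pd

# hypothetical example data
df_all = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "some_category": ["a", "b", "a", "a"],
    "value1": [1, 2, 3, 4],
    "value2": [10, 20, 30, 40],
})

# 'sum' produces one column per value column, while 'size' shows up only once
out = df_all.pivot_table(index=["date", "some_category"], aggfunc=["sum", "size"])
print(out)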
How do I get from a dataframe with many repeated rows to one that lists each unique row together with its total count?
I know I can count using len() by putting the condition inside, but the actual data is really big, around 14k rows. So it is not realistic to find each unique row by hand and then count its occurrences. Is there an easier solution?
You can use groupby on all columns and count (pick one column):
df.groupby(df.columns.tolist())['Date'].count()
Use groupby, size and reset_index:
df.groupby(list(df.columns)).size().reset_index(name="total")
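For example, on a small made-up frame with one repeated row (the column names are hypothetical):

import pandas as pd

# hypothetical data with one duplicated row
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "City": ["NY", "NY", "LA"],
})

print(df.groupby(list(df.columns)).size().reset_index(name="total"))
#          Date City  total
# 0  2020-01-01   NY      2
# 1  2020-01-02   LA      1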
I would like to drop duplicates from a dataframe in a balanced way. Currently df.drop_duplicates() has the parameter keep, where you can choose to keep the first or the last occurrence. Instead of that, I would like to keep the occurrences in a distributed way.
For example, I have this dataframe with two columns, text and category, which looks balanced but has duplicates:
Applying drop_duplicates() and plotting again looks like this:
df = df.drop_duplicates(subset='text')
df['Category'].value_counts().plot(kind='bar')
The expected result is the dataframe without the duplicates, with all the columns. But instead of keeping the last occurrence or the first one (because that will produce an unbalanced dataframe, since most occurrences could fall in the first or last category), keep the dataframe as balanced as possible (depending on the number of duplicates there could be some odd categories, that is, it does not need to be 100% balanced).
To maintain balance after dropping duplicates, you need to downsample all categories greater than the minimum (neg_hp). However, I assume sadness might be a better minimum point. You can do that with:
df = df.drop_duplicates(subset='text')
max_sample = len(df[df['Category'].eq('sadness')])
Then sample max_sample from all other categories
df[df['Category'].eq('confident')].sample(max_sample)
That way, all your categories will have the same value_counts.
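Putting the pieces together, here is a self-contained sketch of the same idea; the example data, category names, and random_state are made up. It downsamples every category to the size of the smallest one left after dropping duplicate texts (GroupBy.sample needs pandas >= 1.1):

import pandas as pd

# hypothetical data: duplicated texts spread unevenly over categories
df = pd.DataFrame({
    "text": ["a", "a", "b", "c", "d", "e", "f", "f"],
    "Category": ["sadness", "joy", "joy", "joy", "sadness", "neg_hp", "neg_hp", "neg_hp"],
})

df = df.drop_duplicates(subset="text")

# the smallest remaining category sets the sample size
n_target = df["Category"].value_counts().min()

# downsample every category to that size
balanced = df.groupby("Category").sample(n=n_target, random_state=0)
print(balanced["Category"].value_counts())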
I'm working with tables containing a number of image-derived features, with a row of features for every timepoint of an image series, and multiple images. An image identifier is contained in one column. To condense this (primarily to be used in a parallel coordinates plot), I want to reduce/aggregate the columns to a single value. The reduce operation has to be chosen depending on the feature in the column (for example mean, max, or something custom). DataFrame.agg seemed like the natural thing to do, but it gives me a table with multiple rows. Right now I'm doing something like this:
result_df = pd.DataFrame()
for col in df.columns:
    if col in ReduceThisColumnByMean:
        result_df[col] = [df[col].mean()]  # wrap in a list to create a single row
    elif col in ReduceThisColumnByMax:
        result_df[col] = [df[col].max()]
This seems like a detour to me, and might not scale well (not a big concern, as the number of reduce operations will most probably not grow beyond a few). Is there a more pandas-esque way to aggregate multiple columns with specific operations into a single row?
You can select the columns by list, take the mean and the max, join them together with concat, and finally convert the resulting Series to a one-row DataFrame with Series.to_frame and transpose:
result_df = pd.concat([df[ReduceThisColumnByMean].mean(),
df[ReduceThisColumnByMax].max()]).to_frame().T
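A quick illustration with made-up feature columns (the list names ReduceThisColumnByMean and ReduceThisColumnByMax come from the question; everything else here is hypothetical):

import pandas as pd

# hypothetical image-feature table
df = pd.DataFrame({
    "mean_intensity": [0.2, 0.4, 0.6, 0.8],
    "max_diameter": [3.0, 5.0, 2.0, 7.0],
})

ReduceThisColumnByMean = ["mean_intensity"]
ReduceThisColumnByMax = ["max_diameter"]

result_df = pd.concat([df[ReduceThisColumnByMean].mean(),
                       df[ReduceThisColumnByMax].max()]).to_frame().T
print(result_df)
#    mean_intensity  max_diameter
# 0             0.5           7.0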
I am at my wit's end as I am writing this. This is probably an incredibly small issue, but I've not been able to get around it. Here's what is going on:
1. I have a dataframe df with 80 columns.
2. Performing value_counts().count() over df iteratively, I am able to print the column names and the number of unique values in each column.
3. Here's the problem: what I also want to do is sum up the count() of unique values across all the columns. Essentially I will need just one number. So basically, if column1 had 10 uniques, column2 had 5, and column3 had 3, I am expecting the sum() to be 18.
About #2, here's what works (simple for loop) -
def counting_unique_values_in_df(df):
    for evry_colm in df:
        print(evry_colm, "-", df[evry_colm].value_counts().count())
That works. It prints in this format: column name - number of unique values.
Now, alongside that, I'd like to print the sum of the unique values. Whatever I tried, it either prints the unique count of the last column (which is incidentally 2) or prints something random. I know it's something to do with the for loop, but I can't seem to figure out what.
I also know that in order to get what I want, which is essentially sum(df[evry_colm].value_counts().count()), I will need to convert df[evry_colm].value_counts().count() to a series, or even a dataframe, but I am stuck with that too!
Thanks in advance for your help.
You could use nunique, which returns a series across all your columns, which you can then sum:
df.nunique().sum()
My first instinct was to do it by series with a list comprehension
sum([df[col].nunique() for col in list(df)])
but this is slower and less Pandorable!
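To see both on a small made-up frame:

import pandas as pd

# hypothetical data: 3 unique values in col1, 2 in col2
df = pd.DataFrame({"col1": [1, 2, 2, 3], "col2": ["a", "a", "b", "b"]})

print(df.nunique())                          # per-column unique counts
print(df.nunique().sum())                    # 5
print(sum(df[col].nunique() for col in df))  # 5, the list-comprehension route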
I just wanted to know what the difference is between the functions performed by these two.
Data:
import pandas as pd
df = pd.DataFrame({"ID":["A","B","A","C","A","A","C","B"], "value":[1,2,4,3,6,7,3,4]})
reset_index():
df_group1 = df.groupby("ID").sum().reset_index()
as_index=False:
df_group2 = df.groupby("ID", as_index=False).sum()
Both of them give the exact same output.
  ID  value
0  A     18
1  B      6
2  C      6
Can anyone tell me what is the difference and any example illustrating the same?
When you use as_index=False, you indicate to groupby() that you don't want to set the column ID as the index (duh!). When both implementations yield the same results, use as_index=False, because it will save you some typing and an unnecessary pandas operation ;)
However, sometimes you want to apply more complicated operations on your groups. On those occasions, you might find that one is more suited than the other.
Example 1: You want to sum the values of three variables (i.e. columns) in a group on both axes.
Using as_index=True allows you to apply a sum over axis=1 without specifying the names of the columns, and then sum the values over axis=0. When the operation is finished, you can use reset_index(drop=True/False) to get the dataframe back into the right form.
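A minimal sketch of that pattern (the column names v1, v2, v3 are made up):

import pandas as pd

# hypothetical frame with three value columns
df = pd.DataFrame({
    "ID": ["A", "B", "A", "C"],
    "v1": [1, 2, 3, 4],
    "v2": [10, 20, 30, 40],
    "v3": [100, 200, 300, 400],
})

# with as_index=True (the default) only the value columns remain,
# so sum(axis=1) needs no column names
totals = df.groupby("ID").sum().sum(axis=1)
totals = totals.reset_index(name="total")  # back to a regular two-column frame
print(totals)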
Example 2: You need to set a value for the group based on the columns in the groupby().
Setting as_index=False allows you to check the condition on a regular column instead of on an index, which is often much easier.
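For instance, reusing the example data from above (the threshold and the flag column are purely illustrative):

import pandas as pd

df = pd.DataFrame({"ID": ["A", "B", "A", "C", "A", "A", "C", "B"],
                   "value": [1, 2, 4, 3, 6, 7, 3, 4]})

# as_index=False keeps ID as a regular column, so conditions can be
# written against columns instead of against the index
grouped = df.groupby("ID", as_index=False).sum()
grouped.loc[grouped["value"] > 10, "flag"] = "large_group"  # hypothetical rule
print(grouped)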
At some point, you might come across a KeyError when applying operations on groups. In that case, it is often because you are trying to use, in your aggregate function, a column that is currently an index of your GroupBy object.