df.Last_3mth_Avg.isnull().groupby([df['ShopID'],df['ProductID']]).sum().astype(int).reset_index(name='count')
The code above helps me see the number of null values by ShopID and ProductID. My question: df.Last_3mth_Avg.isnull() returns a Series, so how can groupby([df['ShopID'],df['ProductID']]) be applied to it afterwards?
I used the solution from:
Pandas count null values in a groupby function
You should filter your df first:
df[df.Last_3mth_Avg.isnull()].groupby(['ShopID','ProductID']).agg('count')
There are two ways to use groupby:
The common way is to call it on the DataFrame, so you just name the columns in the by= parameter.
The second way is to call it on a Series but pass equal-sized Series in the by= parameter. This is rarely used; it helps when you want to apply a conversion to a specific column and group by it in the same line.
So the code line above should work.
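A minimal sketch of both modes, using made-up data with the question's column names:

import pandas as pd

df = pd.DataFrame({
    "ShopID": [1, 1, 2],
    "ProductID": ["a", "a", "b"],
    "Last_3mth_Avg": [10.0, None, None],
})

# Way 1: groupby on the DataFrame, naming the key columns in by=.
print(df.groupby(["ShopID", "ProductID"])["Last_3mth_Avg"].count())

# Way 2: groupby on a Series, passing equal-sized Series as keys.
# The boolean Series aligns row by row with the key Series.
nulls = (
    df["Last_3mth_Avg"].isnull()
    .groupby([df["ShopID"], df["ProductID"]])
    .sum()
    .astype(int)
    .reset_index(name="count")
)
print(nulls)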
Related
Comparing:
df.loc[:,'col1']
df.loc[:,['col1']]
Why does (2) create a DataFrame, while (1) creates a Series?
In principle, when the indexer is a list it can name more than one column, so it's natural for pandas to return a DataFrame, because only a DataFrame can hold more than one column. When it's a string rather than a list, pandas can safely say it refers to exactly one column, so returning a Series poses no problem. Take the two formats and two outcomes as a reasonable flexibility to get whichever you need, a Series or a DataFrame; sometimes you specifically need one of the two.
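A quick illustration with a toy frame:

import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

print(type(df.loc[:, "col1"]))              # Series
print(type(df.loc[:, ["col1"]]))            # DataFrame with one column
print(type(df.loc[:, ["col1", "col2"]]))    # DataFrame with two columns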
This is driving me crazy because it should be so simple and yet it's not working. It's a duplicate question and yet the answers from previous questions don't work.
My csv looks similar to this:
name,val1,val2,val3
ted,1,2,
bob,1,,
joe,,,4
I want to print the contents of row 'joe'. I use the line below and PyCharm gives me a KeyError.
print(df.loc['joe'])
The problem with your logic is that you have not told pandas which column it should search for 'joe' in.
print(df.loc[df['name'] == 'joe'])
or
print(df[df['name'] == 'joe'])
Using .loc with a bare label works only on the index.
If you just used pd.read_csv without specifying an index column, pandas uses an integer index by default. You can set name as the index if it is unique; then .loc will work:
df = df.set_index("name")
print(df.loc['joe'])
Another option, and the way .loc is usually used, is to name specifically which column you refer to:
print(df.loc[df["name"]=="joe"])
Note that the condition df["name"]=="joe" returns a Series with True/False for each row. df.loc[...] with that Series returns only the rows where the value is True, and therefore only the rows where name is "joe". Keep that in mind when, in the future, you try more complex conditioning on your dataframe using .loc.
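For instance, masks combine with & and | for more complex conditions; here is a small sketch built on the question's CSV:

import pandas as pd
from io import StringIO

csv = "name,val1,val2,val3\nted,1,2,\nbob,1,,\njoe,,,4"
df = pd.read_csv(StringIO(csv))

# Each side of & needs its own parentheses.
mask = (df["name"] == "joe") & (df["val3"].notnull())
print(df.loc[mask])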
I have a dataframe, df, that looks like
I need to add a column Weekday to it, which is obtained through the index.
What is the difference between using
df['Weekday']=df.index.weekday
and
df.loc[:,'Weekday'] = df.index.weekday
In this case, you won't have any issues.
But it is advisable to use the .loc functionality.
You can read about the difference in detail here
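As a hedged sketch (the data is made up): on a frame you created yourself the two spellings behave the same, but .loc is the pattern that stays safe once slices are involved.

import pandas as pd

df = pd.DataFrame(
    {"value": range(4)},
    index=pd.date_range("2024-01-01", periods=4),
)

# Equivalent here: both add the Weekday column.
df["Weekday"] = df.index.weekday
df.loc[:, "Weekday"] = df.index.weekday

# The difference shows up with chained indexing on a slice, which can
# raise SettingWithCopyWarning and silently fail; a single .loc call
# on the original frame avoids that.
df.loc[df["value"] > 1, "flag"] = True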
I have a dataframe with over 100 columns, I would like to check all pairs to see which are unique identifiers.
You can use drop_duplicates(subset), specifying the columns you would regard as possible identifiers in the subset argument.
Since you have so many columns it will probably be easiest to take all columns and subtract from them the ones you would disregard (such as value columns).
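A sketch of what that could look like; the frame and the "value" column name here are invented for illustration:

import itertools
import pandas as pd

df = pd.DataFrame({
    "id_a": [1, 1, 2, 2],
    "id_b": ["x", "y", "x", "y"],
    "value": [10, 20, 30, 40],
})

# Candidate identifier columns: all columns minus the value columns.
candidates = df.columns.difference(["value"])

# A pair uniquely identifies rows when dropping duplicates on it
# keeps every row.
for pair in itertools.combinations(candidates, 2):
    if len(df.drop_duplicates(subset=list(pair))) == len(df):
        print(pair, "is a unique identifier")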
You can use Counter from the collections module; see the doc.
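For example, a minimal Counter-based check, reusing the invented frame above: a pair is a unique identifier when no combination of its values appears more than once.

from collections import Counter

counts = Counter(zip(df["id_a"], df["id_b"]))
print(max(counts.values()) == 1)   # True -> (id_a, id_b) is unique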
I have a DataFrame that looks like this:
I need to calculate the percentage based on the count column, which I already did by following this answer.
The result is this:
Now I need to add the results for the groupby column back into the original DataFrame. I tried grouped.reset_index() and then adding it, but I get ValueError: cannot insert count, already exists, since the column used in the groupby is also used in the aggregation.
Can anyone help me to find a way to add the results back to the DataFrame?
You want to use transform; the answer you linked could be improved as well.
df = df.assign(
    NormalizedCount=df['count'] / df.groupby('suburb')['count'].transform('sum')
)
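Why this works: transform('sum') returns a result aligned to the original index, so the division broadcasts row by row and nothing needs to be merged back. A small made-up example (the suburb/count names follow the snippet above):

import pandas as pd

df = pd.DataFrame({
    "suburb": ["north", "north", "south"],
    "count": [2, 6, 4],
})

df = df.assign(
    NormalizedCount=df["count"] / df.groupby("suburb")["count"].transform("sum")
)
print(df)
#   suburb  count  NormalizedCount
# 0  north      2             0.25
# 1  north      6             0.75
# 2  south      4             1.00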