Find the sum of a column by grouping two columns [duplicate] - python

This question already has answers here:
Pandas DataFrame iterating over rows and sum
(1 answer)
Pandas sum by groupby, but exclude certain columns
(4 answers)
Closed 10 months ago.
For this dataset, I want to find the sum of Value(£) for each combination of the three columns Year, Length Group and Port of Landing. So, for example, one sum value will be for the year 2016, the Length Group 10m&Under and the Port of Landing Aberdaran.

Given the response you gave to @berkayln, I think you want to project that column back onto your original dataframe.
Does this suit your need?
df['sumPerYearLengthGroupPortOfLanding']=df.groupby(['Year','Length Group','Port of Landing'])['Value(£)'].transform(lambda x: x.sum())
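As a minimal sketch of how that broadcasts the group totals back onto every row (the data below is invented, and transform('sum') is equivalent to the lambda):
import pandas as pd

# Invented stand-in for the landings data
df = pd.DataFrame({
    'Year': [2016, 2016, 2016, 2017],
    'Length Group': ['10m&Under', '10m&Under', 'Over10m', '10m&Under'],
    'Port of Landing': ['Aberdaran', 'Aberdaran', 'Aberdaran', 'Aberdaran'],
    'Value(£)': [100, 250, 75, 300],
})

# Each row receives the total for its (Year, Length Group, Port) group
df['sumPerYearLengthGroupPortOfLanding'] = (
    df.groupby(['Year', 'Length Group', 'Port of Landing'])['Value(£)']
      .transform('sum')
)
print(df)  # the two 2016/10m&Under/Aberdaran rows both show 350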

You can try this one:
dataframe.groupby(['Year','Length Group','Port of Landing'])['Value(£)'].sum()
That should work.

You can use pd.DataFrame.groupby to aggregate the data.
# Change the order if you want a different hierarchy
grp_cols = ["Year", "Length Group", "Port of Landing"]
df.groupby(grp_cols)["Value(£)"].sum()
You can also do them one by one:
for col in grp_cols:
    print(df.groupby(col)["Value(£)"].sum())
You can also use .loc to get 2016 only.
df.loc[df.Year == 2016, "Value(£)"].sum()
The pd.DataFrame.groupby functionality lets you aggregate with functions other than .sum, including custom functions that operate on the sub-dataframes.
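For instance, a minimal sketch of a custom aggregation on invented data (the range of each year's values is just an arbitrary example):
import pandas as pd

df = pd.DataFrame({'Year': [2016, 2016, 2017], 'Value(£)': [100, 250, 300]})

# Custom aggregation: the spread (max - min) of each year's values
spread = df.groupby('Year')['Value(£)'].agg(lambda s: s.max() - s.min())
print(spread)  # 2016 -> 150, 2017 -> 0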

Drop only specified amount of duplicates pandas [duplicate]

This question already has answers here:
Keeping the last N duplicates in pandas
(2 answers)
Closed 11 months ago.
Pandas' drop_duplicates function can be given keep="first", "last", or False. I want to be able to keep N duplicates: instead of keeping just one (e.g. with "first" or "last") or none (with False), I want to keep a certain number of the duplicates.
Any help is appreciated!
Something like this could work, but you haven't specified whether you are using one or more column(s) to deduplicate:
n = 3
df.groupby('drop_dup_col').head(n)
This keeps the first three duplicates, based on a column value, counting from the top (head) of the dataframe. If you want to start from the bottom of the df, you can use .tail(n) instead.
Change n to the number of rows you want to keep, and change 'drop_dup_col' to the name of the column you are using to dedupe your df.
Multiple columns can be specified in groupby using:
df.groupby(['col1','col5'])
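A quick self-contained sketch of that multi-column case (placeholder column names and invented values):
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'a', 'a', 'b'],
    'col5': [1, 1, 1, 2],
    'val':  [10, 20, 30, 40],
})

n = 2
# Keep at most the first n rows for each (col1, col5) combination
print(df.groupby(['col1', 'col5']).head(n))  # the third ('a', 1) row is dropped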
Regarding the question in your comment:
It's a bit harder to implement, because if you want to delete, say, rows with 3 duplicates, there should also be a minimum of 3 duplicates; otherwise, rows with only 2 duplicates would also be deleted and no row would be kept.
n = 3
# Count how many times each value occurs
df['dup_count'] = df.groupby('drop_dup_col')['drop_dup_col'].transform('size')
# Rows belonging to groups with at least n duplicates
df2 = df.loc[df['dup_count'] >= n]
# Those rows appear twice in the concatenation, so keep=False drops them all
df3 = pd.concat([df, df2]).drop_duplicates(keep=False)
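To make the behaviour concrete, here's a small self-contained demo (toy values; the extra 'other' column keeps rows distinct, which this concat trick relies on):
import pandas as pd

df = pd.DataFrame({
    'drop_dup_col': ['a', 'a', 'a', 'b', 'b', 'c'],
    'other': [1, 2, 3, 4, 5, 6],
})

n = 3
df['dup_count'] = df.groupby('drop_dup_col')['drop_dup_col'].transform('size')
big_groups = df.loc[df['dup_count'] >= n]

# The 'a' rows appear twice after the concat, so keep=False removes them,
# leaving only the groups with fewer than n occurrences ('b' and 'c')
print(pd.concat([df, big_groups]).drop_duplicates(keep=False))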
I believe a combination of groupby and tail(N) should work for this.
In this case, if you want to keep 4 duplicates in df['myColumnDuplicates']:
df.groupby('myColumnDuplicates').tail(4)
To be more precise, and to complete @Stijn's answer: tail(n) keeps the last n duplicated values found, while head(n) keeps the first n duplicated values.
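A minimal illustration of the difference on invented data:
import pandas as pd

df = pd.DataFrame({'myColumnDuplicates': ['x'] * 5, 'row': range(5)})

print(df.groupby('myColumnDuplicates').head(2))  # rows 0 and 1 (the first two)
print(df.groupby('myColumnDuplicates').tail(2))  # rows 3 and 4 (the last two)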

Count values in one column based on the categories of another column [duplicate]

This question already has answers here:
Python: get a frequency count based on two columns (variables) in pandas dataframe some row appears
(3 answers)
Closed last year.
I'm working on a dataset where I want to count each value in the LearnCode column for each Age category. I've tried doing it with the groupby method but didn't manage to get it right; can anyone help with how to do it?
You can do this using a groupby on two columns:
results = df.groupby(by=['Age', 'LearnCode']).count()
This outputs a count for each ['Age', 'LearnCode'] pair.
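Since the original dataset isn't shown, here's a sketch on invented survey-like data; .size() counts rows even when there are no other columns, and pd.crosstab lays the same counts out as a table:
import pandas as pd

# Invented stand-in for the survey data
df = pd.DataFrame({
    'Age': ['18-24', '18-24', '25-34', '25-34', '25-34'],
    'LearnCode': ['School', 'Online', 'Online', 'Online', 'Books'],
})

# One count per (Age, LearnCode) pair
print(df.groupby(['Age', 'LearnCode']).size())

# The same counts with ages as rows and LearnCode values as columns
print(pd.crosstab(df['Age'], df['LearnCode']))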

Selecting columns - python [duplicate]

This question already has answers here:
Selecting non-adjacent columns by column number pandas [duplicate]
(1 answer)
Selecting a range of columns in a dataframe
(5 answers)
Closed 1 year ago.
How do I select multiple columns in Python using the .iloc function?
Let's say I have a data frame with X rows and 100 columns, and I would like to select the first 50 columns, then columns 75 to 80, and then columns 90 and 95.
So far I have read about two ways of selecting in Python: single columns, df = df1.iloc[:,[1,2,3]], and a range, df = df1.iloc[:,1:30]. But is there a way to combine them into a more complex selection?
I.e. in my example I would expect code like this:
df = df1.iloc[:,[1:50,75:80,90,95]]
But it does not work. I also tried different syntax (using brackets etc.) but cannot find the correct solution.
I believe you should try using np.r_. In this case, please try:
import numpy as np

df1.iloc[:, np.r_[1:50, 75:80, 90, 95]]
This should allow you to select multiple groups of columns.
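np.r_ concatenates slices and scalars into one integer array that .iloc accepts; note that Python slices exclude the stop value, so 1:50 selects columns 1 through 49. A quick sketch of what it produces:
import numpy as np

idx = np.r_[1:5, 8:10, 12]  # slices and scalars become one index array
print(idx)  # [ 1  2  3  4  8  9 12]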

How to select top n rows from each group after groupby in pandas? [duplicate]

This question already has answers here:
Sorting columns and selecting top n rows in each group pandas dataframe
(3 answers)
Closed 3 months ago.
I have a pandas dataframe with following shape
open_year, open_month, type, col1, col2, ....
I'd like to find the top types in each (year, month), so I first find the count of each type in each (year, month):
freq_df = df.groupby(['open_year','open_month','type']).size().reset_index()
freq_df.columns = ['open_year','open_month','type','count']
Then I want to find the top n types based on their frequency (i.e. count) for each (year, month). How can I do that?
I can use nlargest, but then I lose the type column:
freq_df.groupby(['open_year','open_month'])['count'].nlargest(5)
I'd recommend sorting your counts in descending order first; you can then call GroupBy.head:
(freq_df.sort_values('count', ascending=False)
.groupby(['open_year','open_month'], sort=False).head(5)
)
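A compact end-to-end sketch on fabricated counts, showing that the type column survives:
import pandas as pd

# Fabricated stand-in for freq_df
freq_df = pd.DataFrame({
    'open_year':  [2020, 2020, 2020, 2021],
    'open_month': [1, 1, 1, 2],
    'type':       ['a', 'b', 'c', 'a'],
    'count':      [5, 9, 2, 4],
})

top2 = (freq_df.sort_values('count', ascending=False)
               .groupby(['open_year', 'open_month'], sort=False)
               .head(2))
print(top2)  # 'b' and 'a' for 2020-01, 'a' for 2021-02, with 'type' retained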

Count number of unique values for multiple columns in Python [duplicate]

This question already has answers here:
Finding count of distinct elements in DataFrame in each column
(8 answers)
Closed 5 years ago.
How do I count the number of unique values in multiple columns in Python/pandas? I can do it for one column using the nunique function. I need something like:
print("Number of unique values in Var1", DF.var1.nunique(), sep="= ")
for all the variables in the dataset, maybe with a loop or an apply function. I tried a lot of things but failed to get what I wanted.
Thanks for the help!
You want to print the number of unique values per column, so use:
for k, v in df.nunique().to_dict().items():
    print('{}={}'.format(k, v))
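For reference, a self-contained run on toy data; df.nunique() already returns a Series, so .to_dict() is optional and you can iterate its items directly:
import pandas as pd

df = pd.DataFrame({
    'var1': [1, 2, 2, 3],
    'var2': ['a', 'a', 'b', 'b'],
})

for k, v in df.nunique().items():
    print('{}={}'.format(k, v))  # prints var1=3 then var2=2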
