Problem using groupby on a list column - python

I'm using the MovieLens 1M dataset to learn pandas, and I want to get some data based on the genres column.
A few rows of the dataframe look like this:
         movieid  title                                        genres                    rating  userid  gender  age  occupation  zipcode  timestamp
1000204     2198  Modulations (1998)                           [Documentary]                  5    5949       M   18          17    47901  958846401
1000205     2703  Broken Vessels (1998)                        [Drama]                        3    5675       M   35          14    30030  976029116
1000206     2845  White Boys (1999)                            [Drama]                        1    5780       M   18          17    92886  958153068
1000207     3607  One Little Indian (1973)                     [Comedy, Drama, Western]       5    5851       F   18          20    55410  957756608
1000208     2909  Five Wives, Three Secretaries and Me (1998)  [Documentary]                  4    5938       M   25           1    35401  957273353
I want to use df.groupby('genres') to group the dataframe, and then get the sum for each genre and the mean rating for each genre.
However, when I call df.groupby('genres').mean(), I get an error:
"TypeError: unhashable type: 'list'"
Please tell me why this error happens, and how I can use groupby on a column whose values are lists.
Thanks very much!

groupby does accept a list of column names, but df.groupby(['genres']).mean() won't fix this error. It occurs because the values in your genres column are Python lists, and lists are unhashable, so pandas cannot use them as group keys. Convert each list to a hashable type (a tuple or a joined string), or expand it into one row per genre, before grouping.
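A minimal sketch of the one-row-per-genre approach, using a toy frame shaped like the one in the question (explode requires pandas >= 0.25):

```python
import pandas as pd

# Toy frame: genres holds lists, which are unhashable as group keys
df = pd.DataFrame({
    "genres": [["Documentary"], ["Drama"], ["Comedy", "Drama", "Western"]],
    "rating": [5, 3, 5],
})

# explode turns each list element into its own row, giving a hashable
# string column that groupby can work with
exploded = df.explode("genres")
print(exploded.groupby("genres")["rating"].mean())
```

A movie with several genres then contributes its rating to each of those genres, which is usually what's wanted for per-genre averages.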

Related

Add a column in pandas based on sum of the subgroup values in another column [duplicate]

This question already has answers here:
How do I create a new column from the output of pandas groupby().sum()?
(4 answers)
Closed 12 days ago.
Here is a simplified version of my dataframe (the number of persons in my dataframe is way more than 3):
df = pd.DataFrame({'Person':['John','David','Mary','John','David','Mary'],
'Sales':[10,15,20,11,12,18],
})
Person Sales
0 John 10
1 David 15
2 Mary 20
3 John 11
4 David 12
5 Mary 18
I would like to add a column "Total" to this data frame, which is the sum of total sales per person
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
What would be the easiest way to achieve this?
I have tried
df.groupby('Person').sum()
but the shape of the output is not congruent with the shape of df.
Sales
Person
David 27
John 21
Mary 38
What you want is the transform method, which applies a function to each group and returns a result aligned with the original index:
df['Total'] = df.groupby('Person')['Sales'].transform('sum')
It gives as expected:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
df.groupby('Person')['Sales'].sum() alone returns a Series indexed by Person, so assigning it straight to df['Total'] leaves NaN wherever the indexes don't line up. Map the per-person sums back onto the Person column instead:
df['Total'] = df['Person'].map(df.groupby('Person')['Sales'].sum())
This adds a column to the dataframe with the total sales per person.
Your 'Person' column in the dataframe contains repeated values, so the output of a plain groupby cannot be attached to df as a new column directly. I would suggest making a new dataframe based on the sales sum. The code below will help you with that:
newDf = pd.DataFrame(df.groupby('Person')['Sales'].sum()).reset_index()
This will create a new dataframe with 'Person' and 'Sales' as columns.

Convert GroupBy object to Dataframe (pandas)

I am working with a large dataset which I've stored in a pandas dataframe. All of my methods I've written to operate on this dataset work on dataframes, but some of them don't work on GroupBy objects.
I've come to a point in my code where I would like to group all data by author name (which I was able to achieve easily via .groupby()). Unfortunately, this outputs a GroupBy object which isn't very useful to me when I want to use dataframe only methods.
I've searched tons of other posts but haven't found a satisfying answer... how do I convert this GroupBy object back into a DataFrame? (Note: it is much too large for me to manually select groups and concatenate them into a dataframe, I need something automated.)
Not exactly sure I understand, so if this isn't what you are looking for, please comment.
Creating a dataframe:
df = pd.DataFrame({'author':['gatsby', 'king', 'michener', 'michener','king','king', 'tolkein', 'gatsby'], 'b':range(13,21)})
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
#create the groupby object
dfg = df.groupby('author')
In [44]: dfg
Out[44]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002169D24DB20>
#show groupby works using count()
dfg.count()
b
author
gatsby 2
king 3
michener 2
tolkein 1
But I think this is what you want: how to revert dfg back to a dataframe. You just need to apply some function to it that doesn't change the data. This is one way:
df_reverted = dfg.apply(lambda x: x)
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
This is another way, and may be faster; note the dataframe names df and dfg:
df[dfg['b'].transform('count') > 0]
It computes a per-group count, keeps every group whose count is greater than zero (so everything), and uses the resulting boolean Series to index the original dataframe df.
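If the goal is instead an aggregated result as a plain DataFrame, calling .reset_index() on the aggregation flattens the group keys back into ordinary columns; a minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "author": ["gatsby", "king", "michener", "michener",
               "king", "king", "tolkein", "gatsby"],
    "b": range(13, 21),
})

# Aggregate per author, then turn the 'author' index level back into a column,
# yielding an ordinary DataFrame rather than a grouped/indexed result
counts = df.groupby("author").count().reset_index()
print(counts)
```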

Count occurrences in index

I have a dataframe like this:
brand1 brand2 brand3
survey 22 33 12
clothes 19 22 19
shoes 34 12 15
What I'd like to do is count how many clothes I have and how many shoes in total, not taking into consideration the categories. I'm not sure how to do this since "survey" is not a column.
I basically want this:
survey
clothes 100
shoes 100
Any advice would be helpful.
Try
df.sum(axis = 1)
This gives the row-wise sums as a Series indexed by the existing row labels (survey, clothes, shoes), so you can select the clothes and shoes totals from it directly; no extra dictionary is needed.
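A minimal sketch with the sample data above (note the real row totals differ from the placeholder 100s in the question):

```python
import pandas as pd

df = pd.DataFrame(
    {"brand1": [22, 19, 34], "brand2": [33, 22, 12], "brand3": [12, 19, 15]},
    index=["survey", "clothes", "shoes"],
)

# axis=1 sums across the columns, producing one total per row label
totals = df.sum(axis=1)
print(totals.loc[["clothes", "shoes"]])
```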

Is it possible to do full text search in pandas dataframe

Currently I'm using pandas DataFrame.filter to filter the records of the dataset. If I give one word, I get all the records matching that word. But if I give two words that are both present in the dataset, just not in the same record, I get an empty set. Is there any way, in pandas or another Python module, to search for multiple words that do not all appear in one record?
With a Python list comprehension we can build a full-text search by mapping, whereas pandas DataFrame.filter uses indexing. Is there any difference between mapping and indexing? If so, what is it, and which gives better performance?
CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
You can select records matching any of several values with isin (CustomerID is numeric here, so pass numbers rather than strings):
pokemon[pokemon['CustomerID'].isin([200, 5])]
Output:
CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
5 Female 31 17 40
200 Male 30 137 83
Name Qty.
0 Apple 3
1 Orange 4
2 Cake 5
Considering the above dataframe, if you want to find quantities of Apples and Oranges, you can do it like this:
result = df[df['Name'].isin(['Apple','Orange'])]
print (result)
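isin only matches whole values. For substring matching of several words that need not appear in the same record, one common sketch is a regex alternation with Series.str.contains (reusing the fruit frame above):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Apple", "Orange", "Cake"], "Qty.": [3, 4, 5]})

# Build the pattern "apple|orange": a row matches if its Name contains
# ANY of the search words, case-insensitively
words = ["apple", "orange"]
pattern = "|".join(words)
result = df[df["Name"].str.contains(pattern, case=False, regex=True)]
print(result)
```

Because each row is tested independently, the two words can come from different records and you still get all matching rows rather than an empty set.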

Finding most common values from a count in a pandas Jupyter notebook

I currently have a massive dataset with a large number of rows, and I want to create a smaller dataframe that pulls just two columns from the larger one, together with how many times each name occurred in that chapter (here, 'Occurrence').
The below code is what I am using
df1 = (Dec16.groupby(["BNF Chapter", "Name"]).size().reset_index(name="Occurrence"))
df1
It outputs this:
BNF Chapter Name Occurrence
1 Aluminium hydroxide 2
1 Aluminium hydroxide + Magnesium trisilicate 2
1 Alverine 702
.......
21 Polihexanide 2
21 Potassium hydroxide 32
21 Sesame oil 22
21 Sodium chloride 222
What I would like to get is the ten most frequently occurring names for a certain chapter, as the dataset is so large.
For example a dataframe that only pulls
The top 10 most common names in chapter 1
How would I go about doing this?
Many thanks!!!
You can use pandas.DataFrame.nlargest for this: filter df1 down to the chapter you want, then take the ten rows with the largest Occurrence. I hope this helps you out.
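A minimal sketch of that idea, using a few made-up rows shaped like the df1 in the question:

```python
import pandas as pd

# Hypothetical sample shaped like the df1 produced by the groupby above
df1 = pd.DataFrame({
    "BNF Chapter": [1, 1, 1, 21, 21],
    "Name": ["Aluminium hydroxide", "Alverine", "Senna",
             "Sesame oil", "Sodium chloride"],
    "Occurrence": [2, 702, 130, 22, 222],
})

# Keep only chapter 1, then take the ten rows with the largest Occurrence
top10_ch1 = df1[df1["BNF Chapter"] == 1].nlargest(10, "Occurrence")
print(top10_ch1)
```

If the chapter has fewer than ten names, nlargest simply returns them all, sorted by Occurrence in descending order.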
