This question already has answers here: How can I compute a histogram (frequency table) for a single Series? (4 answers)
I have a dataframe and I am looking at one column within it, called names:
array(['Katherine', 'Robert', 'Anne', nan, 'Susan', 'other'], dtype=object)
I am trying to make a call that tells me how many times each of these unique names shows up in the column, for example whether there are 223 instances of Katherine, and so on.
How do I do this? I know value_counts just shows 1 for each of these here, because in this sample each value is unique.
If I understand you correctly, you can use pandas.Series.value_counts.
Example:
import pandas as pd
import numpy as np
s = pd.Series(['Katherine', 'Robert', 'Anne', np.nan, 'Susan', 'other'])
s.value_counts()
Katherine 1
Robert 1
other 1
Anne 1
Susan 1
dtype: int64
The data you provided has only one of each name, so here is an example with multiple 'Katherine' entries:
s = pd.Series(['Katherine','Katherine','Katherine','Katherine', 'Robert', 'Anne', np.nan, 'Susan', 'other'])
s.value_counts()
Katherine 4
Robert 1
other 1
Anne 1
Susan 1
dtype: int64
When applied to your DataFrame, you would call this as follows:
df['names'].value_counts()
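By default value_counts drops NaN, so any missing names are not counted. If you also want the missing entries tallied, a small sketch (same names column assumed):
df['names'].value_counts(dropna=False)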
You could use groupby to achieve that (here with your names column):
df[['names']].groupby(['names']).agg(['count'])
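A closely related option is groupby with size, which returns a plain Series of counts instead of a nested count column (like value_counts, it skips NaN by default); a sketch assuming the same names column:
df.groupby('names').size()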
This question already has answers here: Get the row(s) which have the max value in groups using groupby (15 answers)
I am attempting to create a sub-table from a larger dataset which lists out the unique ID, the name of the person and the date they attended an appointment.
For example,
df = pd.DataFrame({'ID': ['abc', 'def', 'abc', 'abc'],
                   'name': ['Alex', 'Bertie', 'Alex', 'Alex'],
                   'date_attended': ['01/01/2021', '05/01/2021', '11/01/2021', '20/01/2021']})
What I would like is a dataframe that shows the last time Alex and Bertie attended a class. So my dataframe would look like:
name date_attended
Alex 20/01/2021
Bertie 05/01/2021
I'm really struggling with this! So far I have tried (based off a previous question I saw here):
max_date_list = ['ID','date_attended']
df = df.groupby(['ID'])[max_date_list].transform('max').size()
but I keep getting an error. I know this would involve a groupby but I can't figure out how to get the maximum date. Would anyone know how to do this?
Try sort_values by 'date_attended' and drop_duplicates by 'ID':
df['date_attended'] = pd.to_datetime(df['date_attended'], dayfirst=True)
df.sort_values('date_attended', ascending=False).drop_duplicates('ID')
Output:
ID name date_attended
3 abc Alex 2021-01-20
1 def Bertie 2021-01-05
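If you want only the name and date columns, as in your expected output, you can select them afterwards (a sketch under the same assumptions):
df.sort_values('date_attended', ascending=False).drop_duplicates('ID')[['name', 'date_attended']]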
To match your expected output format exactly, you might want to groupby "name". Note that date_attended is still a string here, so max compares lexicographically; that happens to give the right answer for this sample, but convert to datetime (as above) for reliable results:
>>> df.groupby("name")["date_attended"].max()
name
Alex 20/01/2021
Bertie 05/01/2021
Name: date_attended, dtype: object
Alternatively, if you might have different ID with the same name:
>>> df.groupby("ID").agg({"name": "first", "date_attended": "max"}).set_index("name")
date_attended
name
Alex 20/01/2021
Bertie 05/01/2021
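Another common pattern for "row with the max value per group" is idxmax, which keeps every column of the winning row. A sketch, assuming date_attended has already been converted with pd.to_datetime as above:
df.loc[df.groupby('ID')['date_attended'].idxmax()]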
This question already has answers here: How can I pivot a dataframe? (5 answers)
Situation
I have a dataframe attributes that holds some attribute information about cars in three columns:
attributes = pd.DataFrame({'brand': ['Honda Civic', 'Honda Civic', 'Honda Civic', 'Toyota Corolla', 'Toyota Corolla', 'Audi A4'],
                           'attributeName': ['wheels', 'doors', 'fuelType', 'wheels', 'color', 'wheels'],
                           'attributeValue': ['4', '2', 'hybrid', '4', 'red', '4']})
Expected result
result = {'brand': ['Honda Civic','Toyota Corolla','Audi A4'],
'wheels': ['4','4','4'],
'doors': ['2','',''],
'fuelType':['hybrid','',''],
'color': ['','red','']
}
How can I realize this? I want to transform the values from attributeName into columns, each holding the corresponding attributeValue for each brand/car in one row.
With get_dummies I get this kind of transformation, but only with True/False values, not with the original values.
This is a simple pivot:
attributes.pivot(index='brand',
columns='attributeName',
values='attributeValue').fillna('')
or, shorter, since your columns happen to be in the right order (note that this positional shorthand may not work in recent pandas versions, where pivot's arguments are keyword-only):
attributes.pivot(*attributes).fillna('')
To format it exactly as your provided output (except column order, please give details on that), you can use:
(attributes.pivot(index='brand', columns='attributeName', values='attributeValue')
.fillna('').rename_axis(None, axis=1)
.reset_index()
)
output:
brand color doors fuelType wheels
0 Audi A4 4
1 Honda Civic 2 hybrid 4
2 Toyota Corolla red 4
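Note that pivot raises a ValueError if your real data contains duplicate brand/attributeName pairs. In that case pivot_table with an aggregation function is a more forgiving sketch (here keeping the first value per pair):
attributes.pivot_table(index='brand', columns='attributeName',
                       values='attributeValue', aggfunc='first').fillna('')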
This question already has answers here: Multiple aggregations of the same column using pandas GroupBy.agg() (4 answers)
I have the following dataframe:
d = {'City' : ['Paris', 'London', 'NYC', 'Paris', 'NYC'], 'ppl' : [3000,4646,33543,85687568,34545]}
df = pd.DataFrame(data=d)
df_mean = df.groupby('City').mean()
Now I instead want to calculate the mean (and .std()) of just the ppl column, and have the city, mean, and std in my resulting dataframe (with the cities grouped, of course). If this is not possible, it would be OK to at least add the .std() column to my resulting dataframe.
You can use GroupBy.agg(), as follows:
df.groupby('City').agg({'ppl': ['mean', 'std']})
If you don't want the column City to be the index, you can do:
df.groupby('City').agg({'ppl': ['mean', 'std']}).reset_index()
or
df.groupby('City')['ppl'].agg(['mean','std']).reset_index()
Result:
City mean std
0 London 4646 NaN
1 NYC 34044 7.085210e+02
2 Paris 42845284 6.058814e+07
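If you prefer flat, explicitly named columns from the start, named aggregation (available since pandas 0.25) is another option; a sketch with hypothetical output column names:
df.groupby('City').agg(ppl_mean=('ppl', 'mean'), ppl_std=('ppl', 'std')).reset_index()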
Let's say I have a dataframe as follows:
d = {'name': ['spain', 'greece','belgium','germany','italy'], 'davalue': [3, 4, 6, 9, 3]}
df = pd.DataFrame(data=d)
index name davalue
0 spain 3
1 greece 4
2 belgium 6
3 germany 9
4 italy 3
I would like to aggregate and sum based on a list of strings in the name column. So for example, I may have: southern=['spain', 'greece', 'italy'] and northern=['belgium','germany'].
My goal is to aggregate by using sum, and obtain:
index name davalue
0 southern 10
1 northern 15
where 10=3+4+3 and 15=6+9
I imagined something like:
df.groupby(by=[['spain','greece','italy'],['belgium','germany']])
could exist. The docs say
A label or list of labels may be passed to group by the columns in self
but I'm not sure I understand what that means in terms of syntax.
I would build a dictionary and map:
d = {v:'southern' for v in southern}
d.update({v:'northern' for v in northern})
df['davalue'].groupby(df['name'].map(d)).sum()
Output:
name
northern 15
southern 10
Name: davalue, dtype: int64
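Names that appear in neither list map to NaN and are silently dropped by the groupby. If you would rather keep them under a catch-all label, a small sketch:
df['davalue'].groupby(df['name'].map(d).fillna('others')).sum()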
One way could be using np.select and using the result as a grouper:
import numpy as np
southern=['spain', 'greece', 'italy']
northern=['belgium','germany']
g = np.select([df.name.isin(southern),
df.name.isin(northern)],
['southern', 'northern'],
'others')
df.groupby(g)[['davalue']].sum()
davalue
northern 15
southern 10
df["regional_group"]=df.apply(lambda x: "north" if x["home_team_name"] in ['belgium','germany'] else "south",axis=1)
You create a new column by which you later groubpy.
df.groupby("regional_group")["davavalue"].sum()
This question already has answers here: Pandas Merging 101 (8 answers)
Problem: add a new column to a DataFrame and populate it with the values of a column from another DataFrame, depending on a condition, in one line of code, similar to a list comprehension.
Example code:
I create a DataFrame called df with some pupil information
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz',
'Maricopa', 'Yuma'])
Then I create a second DataFrame called df_extra, which has a string representation of the year:
extra_data = {'year': [2012, 2013, 2014],
'yr_string': ['twenty twelve','twenty thirteen','twenty fourteen']}
df_extra = pd.DataFrame(extra_data)
Now, how do I add the yr_string values as a new column to df, matched on the numerical year, in one line of code?
I can easily do this with a couple of for loops, but would really like to know if this is possible to do in one line, similar to list comprehensions?
I have searched questions already on here, but there is nothing discussing adding a new column to an existing DataFrame from another DataFrame based on a condition in one line.
You can merge the dataframe on the year column.
df.merge(df_extra, how='left', on=['year'])
# name reports year yr_string
# 0 Jason 4 2012 twenty twelve
# 1 Molly 24 2012 twenty twelve
# 2 Tina 31 2013 twenty thirteen
# 3 Jake 2 2014 twenty fourteen
# 4 Amy 3 2014 twenty fourteen
Basically this says "pull the data from df_extra into df anywhere that the year column matches in df". Note this will return a copy, not modify the dataframe in place.
List comprehensions are still Python loops (that might not be totally technically accurate). With the pandas.merge() method, you get to take advantage of the vectorized, optimized backend code that Pandas uses to operate on its dataframes. Should be faster.
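If you specifically want to add the column to df in place in one line, rather than produce a merged copy, Series.map is another sketch (it assumes the year values in df_extra are unique):
df['yr_string'] = df['year'].map(df_extra.set_index('year')['yr_string'])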