Populate Pandas dataframe with group_by calculations made in Pandas series - python

I have created a dataframe from a dictionary as follows:
my_dict = {'VehicleType':['Truck','Car','Truck','Car','Car'],'Colour':['Green','Green','Black','Yellow','Green'],'Year':[2002,2014,1975,1987,1987],'Frequency': [0,0,0,0,0]}
df = pd.DataFrame(my_dict)
So my dataframe df currently looks like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 0
1 Car Green 2014 0
2 Truck Black 1975 0
3 Car Yellow 1987 0
4 Car Green 1987 0
I'd like it to look like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
i.e., the Frequency column should represent the totals of VehicleType AND Colour combinations (but leaving out the Year column). So in row 4 for example, the 2 in the Frequency column tells you that there are a total of 2 rows with the combination of 'Car' and 'Green'.
This is essentially a 'Count' with 'Group By' calculation, and Pandas provides a way to do the calculation as follows:
grp_by_series = df.groupby(['VehicleType', 'Colour']).size()
grp_by_series
VehicleType Colour
Car Green 2
Yellow 1
Truck Black 1
Green 1
dtype: int64
What I'd like to do next is to extract the calculated group-by values from the pandas Series and put them into the Frequency column of the dataframe. I've tried various approaches, but without success.
The example I've given is hugely simplified - the dataframes I'm using are derived from genomic data and have hundreds of millions of rows, and will have several frequency columns based on various combinations of other columns, so ideally I need a solution which is fast and scales well.
Thanks for any help!

You are on a good path. You can continue like this:
# reset_index() turns the MultiIndex Series into a DataFrame;
# the counts land in a column named 0
grp_by_series = grp_by_series.reset_index()
# left-join the counts back onto the original key columns
res = df[['VehicleType', 'Colour']].merge(grp_by_series, how='left')
df['Frequency'] = res[0]
print(df)
Output:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
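If the unnamed 0 column feels fragile, size() can also name the count column directly via reset_index(name=...). A small sketch along the same lines; it drops the placeholder Frequency column first so the names don't collide:
counts = df.groupby(['VehicleType', 'Colour']).size().reset_index(name='Frequency')
# re-attach the named counts with an explicit join key
df = df.drop(columns='Frequency').merge(counts, on=['VehicleType', 'Colour'], how='left')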

I think a .transform() does what you want:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('count')
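Since the question mentions several frequency columns over different column combinations, transform makes each one a single line. A sketch: the extra combinations below are hypothetical stand-ins for the real ones, and 'size' is used instead of 'count' so rows with a missing Year still get counted:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('size')
# one line per additional frequency column (hypothetical combinations)
for combo in (['VehicleType'], ['Colour', 'Year']):
    df['Freq_' + '_'.join(combo)] = df.groupby(combo)['Year'].transform('size')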

Related

AttributeError: 'SeriesGroupBy' object has no attribute 'tolist'

In a pandas dataframe, I want to count how many times the value 1 appears in the Stroke column for each value in the Residence_type column. To count the 1s, I convert the Stroke column to a list, which I thought would be easier.
So, for example, the value Rural in Residence_type might have 300 ones in the Stroke column, and so on.
The data is something like this:
Residence_type Stroke
0 Rural 1
1 Urban 1
2 Urban 0
3 Rural 1
4 Rural 0
5 Urban 0
6 Urban 0
7 Urban 1
8 Rural 0
9 Rural 1
The code:
grpby_variable = data.groupby('stroke')
grpby_variable['Residence_type'].tolist().count(1)
The final goal is to find the difference between the number of times the value 1 appears for each value in the Residence_type column (Rural or Urban).
Am I doing it right? What is causing this error?
I'm not sure I fully understood what you need. The AttributeError occurs because a SeriesGroupBy object has no tolist method; you need a per-group aggregation instead of converting to a list. Try filtering for Stroke == 1, then grouping and counting:
df.query("Stroke==1").groupby('Residence_type')['Stroke'].agg('count').to_frame('Stroke_Count')
Stroke_Count
Residence_type
Rural 3
Urban 2
You could try the following if you need the difference between the categories:
df1 =df.query("Stroke==1").groupby('Residence_type')['Stroke'].agg('count').to_frame('Stroke_Count')
df1.loc['Diff'] = abs(df1.loc['Rural']-df1.loc['Urban'])
print(df1)
Stroke_Count
Residence_type
Rural 3
Urban 2
Diff 1
Assuming that Stroke only contains 1 or 0, you can do:
result_df = df.groupby('Residence_type').sum()
>>> result_df
Stroke
Residence_type
Rural 3
Urban 2
>>> result_df.Stroke['Rural'] - result_df.Stroke['Urban']
1
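If Stroke could ever hold values other than 0 and 1, plain summing would over-count; a hedged variant that counts exact 1s only:
# True/False per row, summed per residence type -> count of exact 1s
ones_per_type = df['Stroke'].eq(1).groupby(df['Residence_type']).sum()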

I need help creating a groupby in Pandas which aggregates on one column

I have a dataframe with three columns:
Colour
Person
Number of times worn
There are three colours, multiple names, and the number column indicates how many times a specific name wore a particular colour. The problem is that the same colour occurs multiple times for the same person. I am trying to do a groupby which sums up the total number, per colour per name. Any idea how I can perform a groupby which aggregates in this manner? Sorry if this is too vague!
I attach the sample data below for clarity.
Any help on how to neatly aggregate by colour would be great!
Colour Person Number of times worn
0 Red Tom 1
1 Red Tom 2
2 Red Tom 5
3 Blue Tom 7
4 Blue Tom 8
5 Green Tom 9
6 Red John 9
7 Red John 6
8 Green John 0
9 Green John 0
10 Orange John 5
11 Red John 4
12 Red Stanley 2
13 Orange Stanley 4
14 Green Stanley 5
15 Green Stanley 0
16 Green Stanley 6
17 Green Stanley 7
Thanks
You can also write it this way:
df.groupby(["Person", "Colour"])["n"].sum().reset_index()
Or this works like a charm as well:
df.groupby(["Person", "Colour"]).agg({"n": "sum"}).reset_index()
Note that reset_index(drop=True) would discard the Person and Colour keys entirely; plain reset_index() turns them back into regular columns, which is what you want here. Store the result in a variable rather than modifying the original dataframe.
You can group by multiple columns at the same time, like this:
df = pd.DataFrame({
    'Colour': ['red', 'red', 'red', 'red', 'blue', 'blue'],
    'Person': ['Tom', 'Tom', 'Tom', 'John', 'John', 'John'],
    'n': [1, 2, 4, 5, 6, 7]
})
df.groupby(['Person','Colour']).sum().reset_index()
Output:
Person Colour n
0 John blue 13
1 John red 5
2 Tom red 7
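If a wide, per-person view is easier to read, pivot_table computes the same sums in one step; a sketch using the example df above:
# rows are people, columns are colours, cells are the summed counts
df.pivot_table(index='Person', columns='Colour', values='n', aggfunc='sum', fill_value=0)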

Plot against dummy variables and grouped values

These are some values from the table I have:
country colour ...
1 Spain red
2 USA blue
3 Greece green
4 Italy white
5 USA red
6 USA blue
7 Spain red
I want to group the countries together and plot a chart with country on the x axis and the total count of each colour per country. For example, the USA has 2 blues and 1 red, Spain has 2 reds, etc. I would like this as a bar chart, produced with either matplotlib or seaborn.
I assume I would have to create dummy variables for the colour column, but I'm not sure how to plot a grouped column against dummy variables.
It would be much appreciated if you could show and explain the process. Thank you.
Try with crosstab:
pd.crosstab(df['country'], df['colour']).plot.bar()
Output: a grouped bar chart with one cluster of bars per country and one bar per colour within each cluster.
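Since the question also mentions seaborn: countplot builds the same grouped bar chart straight from the long-format frame, with no dummy variables needed. A sketch, assuming the same df:
import seaborn as sns
import matplotlib.pyplot as plt

# one cluster of bars per country, one bar per colour within each cluster
sns.countplot(data=df, x='country', hue='colour')
plt.show()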

Standardize values in a data-frame column

I have a dataframe df which looks like:
id colour response
1 blue curent
2 red loaning
3 yellow current
4 green loan
5 red currret
6 green loan
You can see the values in the response column are not uniform, and I would like to get them to snap to a standardized set of responses.
I also have a validation list validate which looks like
validate
current
loan
transfer
I would like to standardise the response column in df by matching the first three characters of each entry against the validate list.
So the eventual output would look like:
id colour response
1 blue current
2 red loan
3 yellow current
4 green loan
5 red current
6 green loan
I have tried to use fnmatch:
pattern = 'cur*'
fnmatch.filter(df, pattern) = 'current'
but can't change the values in the df.
If anyone could offer assistance it would be appreciated
Thanks
You could use map:
In [3664]: mapping = dict(zip(s.str[:3], s))
In [3665]: df.response.str[:3].map(mapping)
Out[3665]:
0 current
1 loan
2 current
3 loan
4 current
5 loan
Name: response, dtype: object
In [3666]: df['response2'] = df.response.str[:3].map(mapping)
In [3667]: df
Out[3667]:
id colour response response2
0 1 blue curent current
1 2 red loaning loan
2 3 yellow current current
3 4 green loan loan
4 5 red currret current
5 6 green loan loan
Here s is a Series of the validation values:
In [3650]: s
Out[3650]:
0 current
1 loan
2 transfer
Name: validate, dtype: object
Details
In [3652]: mapping
Out[3652]: {'cur': 'current', 'loa': 'loan', 'tra': 'transfer'}
The mapping can be a Series too (note the three-character prefixes must be the index, so the full words are the mapped-to values):
In [3678]: pd.Series(s.values, index=s.str[:3].values)
Out[3678]:
cur     current
loa        loan
tra    transfer
dtype: object
Fuzzy match?
from fuzzywuzzy import process

# take the best-scoring match from the validation list for each response
df['response2'] = [process.extract(x, val.validate, limit=1)[0][0] for x in df.response]
df
Out[867]:
id colour response response2
0 1 blue curent current
1 2 red loaning loan
2 3 yellow current current
3 4 green loan loan
4 5 red currret current
5 6 green loan loan
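A dependency-free variant of the same idea, sketched with the standard library's difflib; val is assumed to be the validation dataframe from above, and behaviour on messier data may differ from fuzzywuzzy's scoring:
import difflib

# pick the closest validate entry for each response; cutoff=0 forces a
# best match even when the similarity score is low
choices = list(val['validate'])
df['response2'] = [difflib.get_close_matches(x, choices, n=1, cutoff=0)[0]
                   for x in df['response']]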

Broaden pandas dataframe

I have data that looks like this:
Box,Code
Green,1221
Green,8391
Red,3709
Red,2911
Blue,9820
Blue,4530
Using a pandas dataframe, I'm wondering if it is possible to output something like this:
Box,Code1,Code2
Green,1221,8391
Red,3709,2911
Blue,9820,4530
My data always has an equal number of rows per 'Box'.
I've been experimenting with pivots and crosstabs (as well as stack and unstack) in pandas but haven't found anything that gets me to the 'broaden' result I'm looking for.
You can use groupby to collect lists and then the DataFrame constructor:
a = df.groupby('Box')['Code'].apply(list)
df = pd.DataFrame(a.values.tolist(), index=a.index).add_prefix('Code').reset_index()
print (df)
Box Code0 Code1
0 Blue 9820 4530
1 Green 1221 8391
2 Red 3709 2911
Or use cumcount as a helper key and pivot (the old top-level pd.pivot(index=..., columns=..., values=...) call form no longer works in current pandas, so the key is assigned as a column first):
g = df.groupby('Box').cumcount()
df = df.assign(g=g).pivot(index='Box', columns='g', values='Code').add_prefix('Code').reset_index()
print (df)
g Box Code0 Code1
0 Blue 9820 4530
1 Green 1221 8391
2 Red 3709 2911
And a similar solution with unstack:
df['g'] = df.groupby('Box').cumcount()
df = df.set_index(['Box', 'g'])['Code'].unstack().add_prefix('Code').reset_index()
print (df)
g Box Code0 Code1
0 Blue 9820 4530
1 Green 1221 8391
2 Red 3709 2911
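All three approaches lean on the stated guarantee of an equal number of rows per Box; a minimal sanity check on the original long-format df before reshaping (a sketch):
# every Box must appear the same number of times for the reshape to be rectangular
sizes = df.groupby('Box').size()
assert sizes.nunique() == 1, f'unequal group sizes:\n{sizes}'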
