Group by and find top n value_counts pandas - python

I have a dataframe of taxi data with two columns that looks like this:
Neighborhood    Borough        Time
Midtown         Manhattan      X
Melrose         Bronx          Y
Grant City      Staten Island  Z
Midtown         Manhattan      A
Lincoln Square  Manhattan      B
Basically, each row represents a taxi pickup in that neighborhood in that borough. Now, I want to find the top 5 neighborhoods in each borough with the most pickups. I tried this:
df['Neighborhood'].groupby(df['Borough']).value_counts()
Which gives me something like this:
borough
Bronx High Bridge 3424
Mott Haven 2515
Concourse Village 1443
Port Morris 1153
Melrose 492
North Riverdale 463
Eastchester 434
Concourse 395
Fordham 252
Wakefield 214
Kingsbridge 212
Mount Hope 200
Parkchester 191
......
Staten Island Castleton Corners 4
Dongan Hills 4
Eltingville 4
Graniteville 4
Great Kills 4
Castleton 3
Woodrow 1
How do I filter it so that I get only the top 5 from each? I know there are a few questions with a similar title but they weren't helpful to my case.

I think you can use nlargest - you can change 1 to 5:
s = df['Neighborhood'].groupby(df['Borough']).value_counts()
print(s)
Borough
Bronx Melrose 7
Manhattan Midtown 12
Lincoln Square 2
Staten Island Grant City 11
dtype: int64
print(s.groupby(level=0).nlargest(1))
Bronx Bronx Melrose 7
Manhattan Manhattan Midtown 12
Staten Island Staten Island Grant City 11
dtype: int64
Note: grouping on level=0 (the borough level) is what limits the result to the top n per borough; the repeated borough in the index is just the group key that nlargest prepends.
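If that repeated borough level is unwanted, one small sketch (assuming pandas 0.24 or newer for droplevel) is to drop it after taking the top 5 per borough:
# a sketch: top 5 neighborhoods per borough, then drop the duplicated
# borough level that groupby().nlargest() prepends as the group key
top5 = (df['Neighborhood'].groupby(df['Borough'])
                          .value_counts()
                          .groupby(level=0)
                          .nlargest(5)
                          .droplevel(0))
print(top5)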

You can do this in a single line by slightly extending your original groupby with 'nlargest':
>>> df.groupby(['Borough', 'Neighborhood']).Neighborhood.value_counts().nlargest(5)
Borough Neighborhood Neighborhood
Bronx Melrose Melrose 1
Manhattan Midtown Midtown 1
Manhatten Lincoln Square Lincoln Square 1
Midtown Midtown 1
Staten Island Grant City Grant City 1
dtype: int64

Solution: to get the top n from every group:
df.groupby(['Borough']).Neighborhood.value_counts().groupby(level=0, group_keys=False).head(5)
The .value_counts().nlargest(5) in the other answers only gives the overall top 5, not the top 5 per group, which doesn't make sense here either.
group_keys=False avoids a duplicated index level.
Because value_counts() already sorts each group in descending order, head(5) is all that is needed.

df['Neighborhood'].groupby(df['Borough']).value_counts().head(5)
head(5) returns the first 5 rows of the result.

Try this one (just change the number in head() to your choice):
# top 3 : total counts of 'Neighborhood' in each Borough
Z = df.groupby('Borough')['Neighborhood'].value_counts().groupby(level=0).head(3).sort_values(ascending=False).to_frame('counts').reset_index()
Z

You can also try the code below to get only the 10 largest values. 'country_code' and 'raised_amount_usd' are column names.
groupby_country_code = master_frame.groupby('country_code')
arr = groupby_country_code['raised_amount_usd'].sum().sort_values(ascending=False)[0:10]
print(arr)
[0:10] slices the first 10 entries of the sorted result; change the slice to whatever range you need.

Related

how to merge df's together with duplicate keys in pandas

I have the following two data frames, taken from Excel files:
df1 = 10,000 rows (the master list that has all unique supplier #s)
df2 = 670 rows
I am loading an Excel file (df2) that has zip, address, and state, and I want to match on that info and then add the supplier # from df1, so that I end up with one file that is still 670 rows but now also has the supplier number column.
Since there was no unique key between the two dataframes, I thought I could make one to merge on by joining three columns (zip, address, and state) with a "-". Maybe this is too risky for a match? df1 has a ton of duplicate addresses, zips, and states, so I couldn't do something like joining just zip and state.
df1 =
(10000 rows)
(unique)
supplier_num ZIP ADDRESS STATE CCCjoin
0 7100000 35481 14th street CA 35481-14th street-CA
1 7000005 45481 14th street CA 45481-14th street-CA
2 7000006 45482 140th circle CT 45482-140th circle-CT
3 7000007 35482 140th circle CT 35482-140th circle-CT
4 7000008 35483 13th road VT 35483-13th road-VT
...
df2 =
(670 rows)
ZIP ADDRESS STATE CCCjoin
0 35481 14th street CA 35481-14th street-CA
1 45481 14th street CA 45481-14th street-CA
2 45482 140th circle CT 45482-140th circle-CT
3 35482 140th circle CT 35482-140th circle-CT
4 35483 13th road VT 35483-13th road-VT
...
OUTPUT:
df3 =
(670 rows)
ZIP ADDRESS STATE Unique Key (Unique)supplier_num
0 35481 14th street CA 35481-14th street-CA 7100000
1 45481 14th street CA 45481-14th street-CA 7100005
2 45482 140th circle CT 45482-140th circle-CT 7100006
3 35482 140th circle CT 35482-140th circle-CT 7100007
4 35483 13th road VT 35483-13th road-VT 7100008
...
670 15483 13 baker road CA 15483-13 baker road-CA 7100009
I've looked around on here and found some helpful tricks, and I think I've made some progress. Here is some code that I tried:
df1['g'] = df1.groupby('CCCjoin').cumcount()
df2['g'] = df2.groupby('CCCjoin').cumcount()
Then I merge:
merged_table = pd.merge(df1, df2, on=['CCCjoin', 'g'], how='inner').drop('g', axis=1)
This sort of works. I get a match of 293 rows and I cross checked the supplier number and it matches the address.
What am I missing to get the 377 matches? Thanks in advance!
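One possibility worth checking (a hedged guess, not a confirmed fix): composite string keys often miss matches because of inconsistent casing or stray whitespace in the parts, so normalizing ZIP, ADDRESS, and STATE before building CCCjoin may recover more of the missing rows. The make_key helper below is hypothetical and assumes the column names shown above.
import pandas as pd

def make_key(frame):
    # normalize case and whitespace before concatenating the parts
    return (frame['ZIP'].astype(str).str.strip() + '-'
            + frame['ADDRESS'].str.strip().str.lower() + '-'
            + frame['STATE'].str.strip().str.upper())

df1['CCCjoin'] = make_key(df1)
df2['CCCjoin'] = make_key(df2)

# same cumcount trick as above to pair up duplicate keys one-to-one
df1['g'] = df1.groupby('CCCjoin').cumcount()
df2['g'] = df2.groupby('CCCjoin').cumcount()
merged_table = pd.merge(df1, df2, on=['CCCjoin', 'g'], how='inner').drop('g', axis=1)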

fill NAN values with Ratio Proportion

Suppose we have a DataFrame with two columns: the boroughs of NYC and the incidents occurring in those boroughs.
df['BOROUGH'].value_counts()
BROOKLYN 368129
QUEENS 315681
MANHATTAN 278583
BRONX 167083
STATEN ISLAND 50194
518,953 rows have null under BOROUGH.
df.shape
(1698623,2)
How can I allocate the null values in proportion to the existing borough counts?
For example:
df['BOROUGH'].value_counts()/df['BOROUGH'].value_counts().sum()
BROOKLYN 0.312061
QUEENS 0.267601
MANHATTAN 0.236153
BRONX 0.141635
STATEN ISLAND 0.042549
31% of the 518,953 nulls would become BROOKLYN = 160,875
27% of the 518,953 nulls would become QUEENS = 140,117
and so forth.....
After allocating the nulls proportionally, the requested df['BOROUGH'].value_counts() would be:
BROOKLYN 529004
QUEENS 455798
.......
You can use np.random.choice:
import numpy as np

# where the null values are
is_null = df['BOROUGH'].isna()
# obtain the distribution of non-null values
freq = df['BOROUGH'].value_counts(normalize=True)
# random sampling with corresponding frequencies
to_replace = np.random.choice(freq.index, p=freq, size=is_null.sum())
df.loc[is_null, 'BOROUGH'] = to_replace
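A quick follow-up check (just a sketch): after the assignment, the overall distribution should stay close to the original proportions and no nulls should remain.
# sanity check: proportions roughly unchanged, null count now 0
print(df['BOROUGH'].value_counts(normalize=True))
print(df['BOROUGH'].isna().sum())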

Add new column to dataframe with value_counts

I have two datasets:
- population: shows the population of US states, organized alphabetically.
- data: has more than 200,000 rows.
population.head()
state population
0 Alabama 4887871
1 Alaska 737438
2 Arizona 7171646
3 Arkansas 3013825
4 California 39557045
I'm trying to add a new column called "incidents" from the other data set.
I tried: population['incidents'] = data.state.value_counts().sort_index()
but I'm getting the following result:
state population incidents
0 Alabama 4887871 NaN
1 Alaska 737438 NaN
2 Arizona 7171646 NaN
3 Arkansas 3013825 NaN
4 California 39557045 NaN
What can I do to fix this?
EDIT:
data.state.value_counts().sort_index()
Alabama 5373
Alaska 1292
Arizona 2268
Arkansas 2753
California 15975
Colorado 3069
Connecticut 2984
Delaware 1643
District of Columbia 3091
Florida 14610
Georgia 8717
If you want to add a specific column from one dataset to the other, you do it like this:
population['incidents'] = data[['columntoappend']]
Your RHS (right-hand side) must be a single column, which in your case it is not.
https://www.google.com/amp/s/www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/amp/
The way to do this is as follows, provided the lengths of your indices are consistent:
population['incidents'] = [x for x in data.state.value_counts().sort_index()]
The NaN values come from index alignment: value_counts() returns a Series indexed by state name, while population has a default integer index, so nothing lines up when the Series is assigned. The list comprehension sidesteps the alignment and assigns one plain value to each row positionally.
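Another option (a hedged sketch, assuming the state spellings match in both frames) is to align by state name instead of relying on positional order, which also works when the two are not sorted the same way:
# map each state to its count in `data`; unmatched states become NaN
population['incidents'] = population['state'].map(data['state'].value_counts())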

identify records that make up 90% of total

I have a report that identifies key drivers of an overall number/trend. I would like to automate the ability to list/identify the underlying records based on a percentage of that number. For example, if the net change for sales of widgets in the South (region) is -5,000.00, but there are both positives and negatives, I would like to identify the underlying drivers that make up at least ~90% (-4,500.00) of that -5,000.00 total, from largest to smallest.
data
region OfficeLocation sales
South 1 -500
South 2 300
South 3 -1000
South 4 -2000
South 5 300
South 6 -700
South 7 -400
South 8 800
North 11 300
North 22 -400
North 33 1000
North 44 800
North 55 900
North 66 -800
For South, total sales is -3,200. I would like to identify/list the drivers that make up at least 90% of this move, in descending order. 90% of -3,200 is -2,880, and the sales for South offices 3 & 4 (-3,000) cover that, so the output for this request would be:
region OfficeLocation sales
South 3 -1000
South 4 -2000
For North, total sales is +1,800. I would like to identify/list the drivers that make up at least 90% of this move, in descending order. 90% of 1,800 is 1,620, and the sales for North offices 33 & 44 (1,800) cover that, so the output for this request would be:
region OfficeLocation sales
North 33 1000
North 44 800
The dataset above has both positive and negative trends for South/North. Any help you can provide would be greatly appreciated!
As mentioned in the comment, it isn't clear what to do in the 'North' case as the sum is positive there, but ignoring that, you could do something like the following:
In [200]: df[df.groupby('region').sales.apply(lambda g: g <= g.loc[(g.sort_values().cumsum() > 0.9*g.sum()).idxmin()])]
Out[200]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
13 North 66 -800
If, in the positive case, you want to find as few elements as possible that together have the property that they make up 90% of the sum of the sales, the above solution can be adopted as follows:
def is_driver(group):
    s = group.sum()
    if s > 0:
        group *= -1
        s *= -1
    a = group.sort_values().cumsum() > 0.9 * s
    return group <= group.loc[a.idxmin()]
In [168]: df[df.groupby('region').sales.apply(is_driver)]
Out[168]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
10 North 33 1000
12 North 55 900
Note that in the case of a tie, only one element is picked out.
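For readability, here is a hedged sketch of the same idea written as a helper: sort each region's rows so the ones pushing the total in its own direction come first, then keep rows until their cumulative share of the total first reaches 90%. The function name and the share parameter are illustrative, not from the original answer.
def top_drivers(group, share=0.9):
    total = group['sales'].sum()
    # negative total: most negative rows first; positive total: largest first
    ordered = group.sort_values('sales', ascending=(total < 0))
    covered = ordered['sales'].cumsum() / total
    # keep a row while the share covered before it is still below the threshold
    return ordered[covered.shift(fill_value=0) < share]

df.groupby('region', group_keys=False).apply(top_drivers)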

Pandas column is the sum if three criteria are met (similar to sumproduct)

I am trying to create a new column whose values are the sum of another column, but only if two other columns contain specific values.
origin_data_frame (df_o)
month state count
2015-12 Alabama 31359
2015-12 Alaska 245
2015-12 Arizona 2940
2015-12 Arkansas 4076
2015-12 California 119166
2015-12 Colorado 3265
2015-12 Connecticut 12190
2015-12 Delaware 297
2015-12 DC 16
....... ... ..
target_data_frame (df_t) ('counts' is not there):
level_0 level_1 Veterans, 2011-2015 counts
0 h_pct_vet California 1777410 <?>
1 h_pct_vet Texas 1539655 <?>
2 h_pct_vet Florida 1507738 <?>
3 h_pct_vet Pennsylvania 870770 <?>
4 h_pct_vet New York 828586 <?>
5 l_pct_vet Vermont 44708 <?>
6 l_pct_vet Wyoming 48505 <?>
the problem:
counts should be the sum of count where month is between '2011-01' and '2015-12' and state equals level_1.
I can get a sum for all count in the time frame:
counts_2011_2015 = df_o['count'][(df_o['month'] >= '2011-01-01') & (df_o['month'] <= '2015-12-31')].sum()
What I tried so far, without success:
df_t['counts'] = df_o['count'][(df_o['month'] >= '2011-01-01') & (df_o['month'] <= '2015-12-31') & (df_o['state'] == df_t['level_1'])].sum()
It raises: "ValueError: Can only compare identically-labeled Series objects".
What I have found so far (dropping indexes) has not been helpful, so I would be thankful if someone has an idea.
Try grouping them by state first and then merging with df_t:
# untested code
counts = (
    df_o[df_o.month.between("2011-01", "2015-12")]
    .groupby("state")["count"].sum()
    .reset_index(name="counts")
)
df_t.merge(counts, left_on="level_1", right_on="state", how="left")
An alternative to @pomber's solution, if you wish to avoid an explicit merge, is to align the indices, assign the Series from your groupby, then reset the index.
df_t = df_t.set_index('level_1')
df_t['counts'] = df_o.loc[df_o.month.between('2011-01', '2015-12')]\
                     .groupby('state')['count'].sum()
df_t = df_t.reset_index()
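Another sketch of the same alignment idea, using map() (assumption: the level_1 values match the state spellings in df_o), which avoids both the merge and the set_index/reset_index round trip:
sums = df_o.loc[df_o.month.between('2011-01', '2015-12')].groupby('state')['count'].sum()
df_t['counts'] = df_t['level_1'].map(sums)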
