fill NAN values with Ratio Proportion - python

Suppose we have a DataFrame with two columns: the Boroughs of NYC and the incidents that transpired in those boroughs.
df['BOROUGH'].value_counts()
BROOKLYN 368129
QUEENS 315681
MANHATTAN 278583
BRONX 167083
STATEN ISLAND 50194
518,953 rows have null under BOROUGH.
df.shape
(1698623,2)
How can I allocate the null values as Ratio Proportion of the Borough values?
For example:
df['BOROUGH'].value_counts()/df['BOROUGH'].value_counts().sum()
BROOKLYN 0.312061
QUEENS 0.267601
MANHATTAN 0.236153
BRONX 0.141635
STATEN ISLAND 0.042549
31% of the 518,953 nulls would become BROOKLYN = 160,875
27% of the 518,953 nulls would become QUEENS = 140,117
and so forth...
After allocating the nulls by ratio proportion, the requested result would be:
df['BOROUGH'].value_counts()
BROOKLYN 529004
QUEENS 455798
.......

You can use np.random.choice:
import numpy as np

# where the null values are
is_null = df['BOROUGH'].isna()
# obtain the distribution of non-null values
freq = df['BOROUGH'].value_counts(normalize=True)
# random sampling with corresponding frequencies
to_replace = np.random.choice(freq.index, p=freq, size=is_null.sum())
df.loc[is_null, 'BOROUGH'] = to_replace
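As a quick sanity check, here is a minimal runnable sketch with made-up data (the column name matches the question; the sizes and borough list are illustrative), showing that the filled column roughly reproduces the original proportions:
import numpy as np
import pandas as pd

# toy data: 10,000 known boroughs plus 5,000 missing values (sizes are illustrative)
rng = np.random.default_rng(0)
known = rng.choice(['BROOKLYN', 'QUEENS', 'MANHATTAN'], p=[0.5, 0.3, 0.2], size=10_000)
df = pd.DataFrame({'BOROUGH': list(known) + [None] * 5_000})

is_null = df['BOROUGH'].isna()
freq = df['BOROUGH'].value_counts(normalize=True)
df.loc[is_null, 'BOROUGH'] = np.random.choice(freq.index, p=freq, size=is_null.sum())

# the filled column's distribution should be close to `freq`
print(df['BOROUGH'].value_counts(normalize=True))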

Related

Is there a way to fill a column of a dataframe with values from a list, grouped by each individual item - Python

I have a dataset that I would like to explore, but it is not structured very well. The original Excel file had the Region and the Variety of grape in the same column; a Region was indicated as the heading for the rows beneath it by being in bold. When I loaded it into Python, you can't tell which rows are regions and which are grape varieties.
Ideally I want to have those two columns separated so my ideal dataframe looks like table 2.
What I have done so far is add an 'is_region' column and put "Yes" for values in the 'Region/Variety' column that match my list of regions.
import numpy as np
import pandas as pd

grapeyield_df = pd.read_excel(r'Regional varietal area, production and price data, 1999 to 2013.xlsx', 'Yield_since2001_tab', skiprows=[0])
regions = ['Adelaide Hills','Adelaide Plains','Alpine Valleys','Alpine Valleys/Beechworth','Australian Capital Territory','Barossa -other','Barossa Valley']
grapeyield_df["is_region"] = np.where(grapeyield_df["Region/Variety"].isin(regions), "Yes", "No")
Table 1:
Region/Variety   2011    is_region
Adelaide Hills   3452    Yes
Chardonnay       26357   No
Pinot Grigio     7876    No
Barossa Valley   7368    Yes
Table 2:
Region           Variety        2011
Adelaide Hills   Chardonnay     26357
Adelaide Hills   Pinot Grigio   7876
Barossa Valley   Chardonnay     8787
You could do the following:
df["Region"] = df["Region/Variety"].where(df["Region/Variety"].isin(regions)).ffill()
df = (
    df[df["Region/Variety"] != df["Region"]]
    .reset_index()
    .rename(columns={"Region/Variety": "Variety"})
)[["Region", "Variety", "2011"]]
The first step adds a "Region" column (using .where and .ffill). The second takes the rows of df that are not region rows, resets the index, renames the "Region/Variety" column to "Variety", and selects the columns in the requested order.
Result for df:
Region/Variety 2011
0 Adelaide Hills 3452
1 Chardonnay 26357
2 Pinot Grigio 7876
3 Barossa Valley 7368
4 Chardonnay 8787
is
Region Variety 2011
0 Adelaide Hills Chardonnay 26357
1 Adelaide Hills Pinot Grigio 7876
2 Barossa Valley Chardonnay 8787
A naive way of doing it would be to use pandas' ffill method:
# Split ambiguous column.
df['region'] = df['Region/Variety'].copy()
df['variety'] = df['Region/Variety'].copy()
df = df.drop('Region/Variety', axis=1)
# Shadow invalid regions.
idx_regions = df['region'].isin(regions)
df.loc[~idx_regions,'region'] = None
# fill up missing regions
df['region'] = df['region'].ffill()
# Drop superfluous rows.
df = df.loc[df['variety'] != df['region']]
Be aware that this drops some entries from your original df, i.e. cells [0,1] and [3,1]
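For reference, here is a minimal runnable setup for the snippets above, with the toy values from Table 1 typed in by hand (the real data would come from the Excel file):
import pandas as pd

regions = ['Adelaide Hills', 'Barossa Valley']
df = pd.DataFrame({
    'Region/Variety': ['Adelaide Hills', 'Chardonnay', 'Pinot Grigio',
                       'Barossa Valley', 'Chardonnay'],
    '2011': [3452, 26357, 7876, 7368, 8787],
})

# first approach from above: tag region rows, forward-fill, drop them
df['Region'] = df['Region/Variety'].where(df['Region/Variety'].isin(regions)).ffill()
out = (
    df[df['Region/Variety'] != df['Region']]
    .reset_index()
    .rename(columns={'Region/Variety': 'Variety'})
)[['Region', 'Variety', '2011']]
print(out)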

How to search in row and add those row values in python

                      Percentage
NaN                     1.576020
Redmond                 4.264524
England                 4.975278
England - Street XY     5.346106
Denmark Street x        7.601978
England – Street wy    11.773795
England – Street AU    13.936959
Redmond street COX     50.525340
Baharin                 0
I need to create another data frame which sums the Percentage of all rows starting with Redmond, all rows starting with England followed by a street name, all rows starting with England only, and so on. How can I do it using Python?
In the above case the output should be:
                      Percentage
NaN                     1.576020
Redmond                50.525340
England                 4.975278
England with street    11.773795
Denmark                 7.60
Baharin                 0
One way to do this:
df = df.reset_index()                                # move the labels into an 'index' column
m = df['index'].astype(str).str.contains('Street')   # rows whose label mentions a street
street_df = df.loc[m]
# sum the street rows per leading word (e.g. 'England', 'Denmark')
street_df = street_df.groupby(street_df['index'].str.split(' ').str[0]).agg({'Percentage': 'sum'}).reset_index()
street_df['index'] = street_df['index'] + ' with street'
result = pd.concat([df[~m], street_df])
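For completeness, a runnable version of the same idea with the sample values typed in by hand. One hedge: str.contains('Street') is case-sensitive, so a label like 'Redmond street COX' is not matched here; pass case=False if it should be grouped as well.
import pandas as pd

df = pd.DataFrame(
    {'Percentage': [1.576020, 4.264524, 4.975278, 5.346106, 7.601978,
                    11.773795, 13.936959, 50.525340, 0]},
    index=[None, 'Redmond', 'England', 'England - Street XY', 'Denmark Street x',
           'England – Street wy', 'England – Street AU', 'Redmond street COX', 'Baharin'],
)

df = df.reset_index()
m = df['index'].astype(str).str.contains('Street')   # pass case=False to also catch 'street'
street_df = df.loc[m]
street_df = (street_df.groupby(street_df['index'].str.split(' ').str[0])
             .agg({'Percentage': 'sum'})
             .reset_index())
street_df['index'] = street_df['index'] + ' with street'
result = pd.concat([df[~m], street_df])
print(result)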

pandas left join returning larger matrix and not working

I have 2 dataframes; the 1st, "station_anal", is below:
count Start station number
index
31623 17105 31623
31258 11432 31258
31201 10194 31201
31200 9505 31200
31247 9145 31247
2nd dataframe "vt" is:
Start station number Start station
0 31214 17th & Corcoran St NW
1 31104 Adams Mill & Columbia Rd NW
2 31221 18th & M St NW
3 31111 10th & U St NW
4 31260 23rd & E St NW
station_anal is size 486x2
vt size is 8000x2
my left join command is:
lj = pd.merge(station_anal, vt, how = 'left', on = 'Start station number')
The dtypes are the same for both columns, namely int64.
however lj returns:
lj.head()
count Start station number Start station
0 17105 31623 Columbus Circle / Union Station
1 17105 31623 Columbus Circle / Union Station
2 17105 31623 Columbus Circle / Union Station
3 17105 31623 Columbus Circle / Union Station
4 17105 31623 Columbus Circle / Union Station
of size 8000x3
This makes no sense to me, since my understanding is that a left join's result always has the same number of rows as the first dataframe, in this case 486.
Let's use map:
station_anal['Start Station'] = station_anal['Start station number']\
    .map(vt.set_index('Start station number')['Start station'])
Update: the 8000-row result suggests vt repeats the same 'Start station number' many times, and a left join emits one output row per matching key in the right frame. Drop the duplicates first, then map:
mapper = vt.drop_duplicates('Start station number')\
    .set_index('Start station number')['Start station']
station_anal['Start Station'] = station_anal['Start station number']\
.map(mapper)
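To illustrate why the result grew, here is a small self-contained sketch with made-up frames (only the column names come from the question): duplicate keys in the right frame fan the join out, while deduplicating before map gives a one-to-one lookup.
import pandas as pd

station_anal = pd.DataFrame({'count': [17105, 11432],
                             'Start station number': [31623, 31258]})
# vt lists the same station several times (e.g. once per trip record)
vt = pd.DataFrame({'Start station number': [31623, 31623, 31623, 31258],
                   'Start station': ['Columbus Circle / Union Station'] * 3 + ['Example St NW']})

lj = pd.merge(station_anal, vt, how='left', on='Start station number')
print(len(lj))       # 4 rows, not 2: each duplicate key produces an extra row

mapper = (vt.drop_duplicates('Start station number')
            .set_index('Start station number')['Start station'])
station_anal['Start Station'] = station_anal['Start station number'].map(mapper)
print(station_anal)  # back to 2 rows, one name per station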

Make row operations faster in pandas

I am doing a course on Coursera and I have a dataset to perform some operations on. I have gotten the answer to the problem but my answer takes time to compute.
Here is the original dataset and a sample screenshot is provided below.
The task is to convert the data from monthly values to quarterly values i.e. I need to sort of aggregate 2000-01, 2000-02, 2000-03 data to 2000-Q1 and so on. The new value for 2000-Q1 should be the mean of these three values.
Likewise 2000-04, 2000-05, 2000-06 would become 2000-Q2 and the new value should be the mean of 2000-04, 2000-05, 2000-06
Here is how I solved the problem.
First I defined a function quarter_rows() which takes a row of data (as a Series), loops through every third element using the column index, replaces some values (in place) with a mean computed as explained above, and returns the row:
import pandas as pd
import numpy as np
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
def quarter_rows(row):
    for i in range(0, len(row), 3):
        row.replace(row[i], np.mean(row[i:i+3]), inplace=True)
    return row
Now I do some subsetting and cleanup of the data to leave only what I need to work with
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
housing3 = housing.set_index(["State","RegionName"]).loc[:, '2000-01':]
I then used apply to apply the function to all rows.
housing3 = housing3.apply(quarter_rows, axis=1)
I get the expected result. A sample is shown below
But the whole process takes more than a minute to complete. The original dataframe has about 10370 columns.
I don't know if there is a way to speed things up in the for loop and apply functions. The bulk of the time is taken up in the for loop inside my quarter_rows() function.
I've tried python lambdas but every way I tried threw an exception.
I would really be interested in finding a way to get the mean using three consecutive values without using the for loop.
Thanks
I think that instead of apply you can use resample by quarters and aggregate the mean, but first convert the column names to month periods with to_period:
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
Testing:
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
#for testing select only the first 10 rows and the columns from Jan 2000 to Jun 2000
housing3 = housing.set_index(["State","RegionName"]).iloc[:10].loc[:, '2000-01':'2000-06']
print (housing3)
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
State RegionName
NY New York NaN NaN NaN NaN NaN NaN
CA Los Angeles 204400.0 207000.0 209800.0 212300.0 214500.0 216600.0
IL Chicago 136800.0 138300.0 140100.0 141900.0 143700.0 145300.0
PA Philadelphia 52700.0 53100.0 53200.0 53400.0 53700.0 53800.0
AZ Phoenix 111000.0 111700.0 112800.0 113700.0 114300.0 115100.0
NV Las Vegas 131700.0 132600.0 133500.0 134100.0 134400.0 134600.0
CA San Diego 219200.0 222900.0 226600.0 230200.0 234400.0 238500.0
TX Dallas 85100.0 84500.0 83800.0 83600.0 83800.0 84200.0
CA San Jose 364100.0 374000.0 384700.0 395700.0 407100.0 416900.0
FL Jacksonville 88000.0 88800.0 89000.0 88900.0 89600.0 90600.0
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
print (housing3)
2000Q1 2000Q2
State RegionName
NY New York NaN NaN
CA Los Angeles 207066.666667 214466.666667
IL Chicago 138400.000000 143633.333333
PA Philadelphia 53000.000000 53633.333333
AZ Phoenix 111833.333333 114366.666667
NV Las Vegas 132600.000000 134366.666667
CA San Diego 222900.000000 234366.666667
TX Dallas 84466.666667 83866.666667
CA San Jose 374266.666667 406566.666667
FL Jacksonville 88600.000000 89700.000000
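A hedged side note: in newer pandas versions the axis=1 argument to resample is deprecated, so an equivalent sketch (on a tiny made-up frame, since the CSV isn't included here) is to transpose, resample along the index, and transpose back:
import pandas as pd

# tiny illustrative frame with monthly columns (values are made up)
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]],
                  columns=['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'])
df.columns = pd.to_datetime(df.columns).to_period('M')

# transpose, resample along the (now row) PeriodIndex, transpose back
quarterly = df.T.resample('Q').mean().T
print(quarterly)   # 2000Q1 -> 2.0, 2000Q2 -> 5.0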

Group by and find top n value_counts pandas

I have a dataframe of taxi data with two columns that looks like this:
Neighborhood Borough Time
Midtown Manhattan X
Melrose Bronx Y
Grant City Staten Island Z
Midtown Manhattan A
Lincoln Square Manhattan B
Basically, each row represents a taxi pickup in that neighborhood in that borough. Now, I want to find the top 5 neighborhoods in each borough with the most number of pickups. I tried this:
df['Neighborhood'].groupby(df['Borough']).value_counts()
Which gives me something like this:
borough
Bronx High Bridge 3424
Mott Haven 2515
Concourse Village 1443
Port Morris 1153
Melrose 492
North Riverdale 463
Eastchester 434
Concourse 395
Fordham 252
Wakefield 214
Kingsbridge 212
Mount Hope 200
Parkchester 191
......
Staten Island Castleton Corners 4
Dongan Hills 4
Eltingville 4
Graniteville 4
Great Kills 4
Castleton 3
Woodrow 1
How do I filter it so that I get only the top 5 from each? I know there are a few questions with a similar title but they weren't helpful to my case.
I think you can use nlargest - you can change 1 to 5:
s = df['Neighborhood'].groupby(df['Borough']).value_counts()
print(s)
Borough
Bronx Melrose 7
Manhattan Midtown 12
Lincoln Square 2
Staten Island Grant City 11
dtype: int64
print(s.groupby(level=[0,1]).nlargest(1))
Bronx Bronx Melrose 7
Manhattan Manhattan Midtown 12
Staten Island Staten Island Grant City 11
dtype: int64
(Without specifying the level info, additional index levels were being created, so the levels are passed explicitly.)
You can do this in a single line by slightly extending your original groupby with 'nlargest':
>>> df.groupby(['Borough', 'Neighborhood']).Neighborhood.value_counts().nlargest(5)
Borough Neighborhood Neighborhood
Bronx Melrose Melrose 1
Manhattan Midtown Midtown 1
Manhatten Lincoln Square Lincoln Square 1
Midtown Midtown 1
Staten Island Grant City Grant City 1
dtype: int64
Solution: to get the top n from every group:
df.groupby(['Borough']).Neighborhood.value_counts().groupby(level=0, group_keys=False).head(5)
.value_counts().nlargest(5) in the other answers only gives the top 5 overall (a single group), which doesn't make sense to me either.
group_keys=False avoids a duplicated index.
Because value_counts() already sorts within each group, head(5) is all that's needed (a runnable sketch is below).
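Here is that sketch, using the small sample frame from the question (values typed in by hand; the Time column is omitted since it isn't used):
import pandas as pd

df = pd.DataFrame({
    'Neighborhood': ['Midtown', 'Melrose', 'Grant City', 'Midtown', 'Lincoln Square'],
    'Borough': ['Manhattan', 'Bronx', 'Staten Island', 'Manhattan', 'Manhattan'],
})

top = (df.groupby(['Borough']).Neighborhood.value_counts()
         .groupby(level=0, group_keys=False).head(5))
print(top)   # e.g. Bronx -> Melrose 1; Manhattan -> Midtown 2, Lincoln Square 1; ...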
df['Neighborhood'].groupby(df['Borough']).value_counts().head(5)
head() gets the top 5 rows in a data frame.
Try this one (just change the number in head() to your choice):
# top 3 : total counts of 'Neighborhood' in each Borough
Z = df.groupby('Borough')['Neighborhood'].value_counts().groupby(level=0).head(3).sort_values(ascending=False).to_frame('counts').reset_index()
Z
You can also try the code below to get only the top 10 values of the value counts.
'country_code' and 'raised_amount_usd' are column names.
groupby_country_code=master_frame.groupby('country_code')
arr=groupby_country_code['raised_amount_usd'].sum().sort_index()[0:10]
print(arr)
[0:10] slices the first 10 entries of the result; you can choose your own slice.
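One hedge on this last snippet: sort_index() orders the summed amounts by country code rather than by value, so the slice picks the first 10 alphabetically, not the largest. If the goal is the 10 biggest totals, a variant using the same assumed master_frame and column names would be:
top10 = (master_frame.groupby('country_code')['raised_amount_usd']
         .sum()
         .sort_values(ascending=False)
         .head(10))   # 10 largest totals
print(top10)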
