Add new column to dataframe with value_counts - python

I have two datasets:
-population: shows the population of USA states, organized alphabetically.
-data: has more than 200,000 rows
population.head()
state population
0 Alabama 4887871
1 Alaska 737438
2 Arizona 7171646
3 Arkansas 3013825
4 California 39557045
I'm trying to add a new column called "incidents" to population, using the counts from the other dataset.
I tried: population['incidents'] = data.state.value_counts().sort_index()
but I'm getting the following result:
state population incidents
0 Alabama 4887871 NaN
1 Alaska 737438 NaN
2 Arizona 7171646 NaN
3 Arkansas 3013825 NaN
4 California 39557045 NaN
What can I do to fix this?
EDIT:
data.state.value_counts().sort_index()
Alabama 5373
Alaska 1292
Arizona 2268
Arkansas 2753
California 15975
Colorado 3069
Connecticut 2984
Delaware 1643
District of Columbia 3091
Florida 14610
Georgia 8717

If you want to add a specific column from one dataset to the other, you can do it like this:
population['incidents'] = data[['columntoappend']]
The right-hand side (RHS) must be a single column, which it is not in your case.
https://www.google.com/amp/s/www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/amp/

The way to do this is as follows, provided that both sides have the same length and the same order:
population['incidents'] = [x for x in data.state.value_counts().sort_index()]
Your approach results in NaN because pandas aligns the assigned Series to the DataFrame by index label: the index of value_counts() holds state names, while population has the default integer index (0, 1, 2, ...), so no label matches. The list comprehension drops the index, so the values are assigned by position, one per row.
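Matching by state name rather than by position avoids relying on both frames listing the same states in the same order (the counts above include District of Columbia, for instance). A minimal sketch with toy data; the rows of `data` here are made up for illustration, and filling missing states with 0 is an assumption about what is wanted:

```python
import pandas as pd

population = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],
    "population": [4887871, 737438, 7171646],
})
# Toy stand-in for the large `data` frame; only the `state` column matters here.
data = pd.DataFrame(
    {"state": ["Alaska", "Alabama", "Alabama", "District of Columbia"]}
)

# Series indexed by state name, holding the number of rows per state.
counts = data["state"].value_counts()

# Look up each state's count by name; states absent from `data` become NaN,
# which is replaced by 0 here (an assumption, drop fillna to keep NaN).
population["incidents"] = population["state"].map(counts).fillna(0).astype(int)
print(population)
```

States that appear only in `data` (District of Columbia in this toy example) are simply ignored, so the lengths of the two frames do not need to agree.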

Related

How to divide a DataFrame into groups using index label and perform operation to find 3 largest in a particular column according to each index

I have a dataframe like this:
STNAME CTYNAME POPESTIMATE
Alabama Autauga County 54660
Alabama Baldwin County 183193
Alabama Barbour County 27341
Alabama Bibb County 22861
Alabama Blount County 57373
....... ............... .....
Wyoming Sweetwater County 43593
Wyoming Teton County 21297
Wyoming Uinta County 21102
....... ............. ......
....... ............. .....
and so on............
Here I have to find the three most populous counties (CTYNAME) in each state and sum their populations (POPESTIMATE); that sum is treated as the population of the state. From those sums (based only on the three most populous counties per state), I then have to find the three most populous states and print them as a list.
I have tried this with several methods from the pandas library, but nothing has worked for me.
Can someone please help me with this?
Splitting df:
df = df.groupby('STNAME',as_index=True)
print(df.apply(lambda s: pd.Series(s.nlargest(3).index)))
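One way the task described above could be approached is a groupby on STNAME that keeps the three largest POPESTIMATE values per state and sums them, followed by a second nlargest over the per-state sums. This is only a sketch: the Alabama and Wyoming rows come from the sample above, while "State X" and "State Y" and their numbers are made up so that there are more than three states to choose from.

```python
import pandas as pd

df = pd.DataFrame({
    "STNAME": ["Alabama"] * 5 + ["Wyoming"] * 3 + ["State X"] * 4 + ["State Y"] * 2,
    "CTYNAME": [
        "Autauga County", "Baldwin County", "Barbour County",
        "Bibb County", "Blount County",
        "Sweetwater County", "Teton County", "Uinta County",
        "X1", "X2", "X3", "X4",   # hypothetical counties
        "Y1", "Y2",               # hypothetical counties
    ],
    "POPESTIMATE": [
        54660, 183193, 27341, 22861, 57373,
        43593, 21297, 21102,
        10, 20, 30, 40,
        500000, 1,
    ],
})

# Per state: take the three largest county populations and add them up.
top3_sum = df.groupby("STNAME")["POPESTIMATE"].apply(lambda s: s.nlargest(3).sum())

# Three states with the largest sums, in descending order.
top_states = top3_sum.nlargest(3).index.tolist()
print(top_states)
```

States with fewer than three counties are handled as well, because nlargest(3) simply returns every row it has.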

Create dictionary from list when keys are embedded in list

I have a list of college towns with corresponding states in the U.S. I want to create a dataframe with two columns one for 'State' and the other 'RegionName'. The dataframe should look like this:
DataFrame( [ ["Alabama", "Auburn"], ["Alabama", "Troy"],
["Alabama", "Tuscaloosa"], ["Alabama", "Tuskegee"], ["Alaska",
"Fairbanks"], ["Arizona", "Flagstaff"], ["Arizona", "Tempe"], ["Arizona",
"Tucson"] ],
columns=["State", "RegionName"] )
The problem is I have a list with the States and RegionNames together, with the corresponding RegionNames following after the State name in the list like this:
['Alabama',
'Auburn','Troy','Tuscaloosa','Tuskegee',
'Alaska','Fairbanks',
'Arizona','Flagstaff','Tempe','Tucson']
I have been looking at examples and I am currently stuck on this. Any help would be greatly appreciated!
You may need to create the list of states first, then use where (as a mask) together with ffill to split the original single-column dataframe:
df['RegionName']=df.State
df.State=df.State.where(df.State.isin(States)).ffill()
df=df.loc[df.State!=df.RegionName]
df
Out[80]:
State RegionName
1 Alabama Auburn
2 Alabama Troy
3 Alabama Tuscaloosa
4 Alabama Tuskegee
6 Alaska Fairbanks
8 Arizona Flagstaff
9 Arizona Tempe
10 Arizona Tucson
Data Input
States=['Alabama','Alaska','Arizona']
l=['Alabama',
'Auburn','Troy','Tuscaloosa','Tuskegee',
'Alaska','Fairbanks',
'Arizona','Flagstaff','Tempe','Tucson']
df=pd.DataFrame(l,columns=['State'])
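Put in execution order (the data input has to run before the masking steps), the answer can be sketched as one self-contained script. It assumes, as the answer does, that the list of state names is known in advance; the name `towns` is only a more descriptive stand-in for `l`.

```python
import pandas as pd

States = ["Alabama", "Alaska", "Arizona"]
towns = ["Alabama",
         "Auburn", "Troy", "Tuscaloosa", "Tuskegee",
         "Alaska", "Fairbanks",
         "Arizona", "Flagstaff", "Tempe", "Tucson"]

df = pd.DataFrame(towns, columns=["State"])

# Keep a copy of every entry as the candidate region name.
df["RegionName"] = df["State"]

# Blank out everything that is not a state, then carry the last state forward.
df["State"] = df["State"].where(df["State"].isin(States)).ffill()

# Rows where both columns are equal are the state headers themselves; drop them.
df = df.loc[df["State"] != df["RegionName"]]
print(df)
```

Adding `.reset_index(drop=True)` at the end would renumber the rows from 0 if the gaps in the index are unwanted.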

Pandas column is the sum if three criteria are met (similar to sumproduct)

I am trying to create a new column whose values are the sum of another column, but only over rows where two columns contain specific values.
origin_data_frame (df_o)
month state count
2015-12 Alabama 31359
2015-12 Alaska 245
2015-12 Arizona 2940
2015-12 Arkansas 4076
2015-12 California 119166
2015-12 Colorado 3265
2015-12 Connecticut 12190
2015-12 Delaware 297
2015-12 DC 16
....... ... ..
target_data_frame (df_t) ('counts' is not there):
level_0 level_1 Veterans, 2011-2015 counts
0 h_pct_vet California 1777410 <?>
1 h_pct_vet Texas 1539655 <?>
2 h_pct_vet Florida 1507738 <?>
3 h_pct_vet Pennsylvania 870770 <?>
4 h_pct_vet New York 828586 <?>
5 l_pct_vet Vermont 44708 <?>
6 l_pct_vet Wyoming 48505 <?>
the problem:
counts should include a value that is the sum of count if month is between '2011-01' and '2015-12' and state equals "level_1".
I can get a sum for all count in the time frame:
counts_2011_2015 = df_o['count'][(df_o['month'] >= '2011-01-01') & (df_o['month'] <= '2015-12-31')].sum()
What I tried so far but without success:
df_t['counts'] = df_o['count'][(df_o['month'] >= '2011-01-01') & (df_o['month'] <= '2015-12-31') & (df_o['state'] == df_t['level_1'])].sum()
It raises a ValueError: "ValueError: Can only compare identically-labeled Series objects".
What I found so far (dropping indexes) is not helpful so I would be thankful if someone has an idea
Try grouping them by state first and then merging them with df_t:
# untested code
counts = (
    df_o[df_o.month.between("2011-01", "2015-12")]
    .groupby("state")["count"].sum()
    .reset_index(name="counts")
)
# after reset_index, "state" is a regular column of counts, so merge on it
df_t = df_t.merge(counts, left_on="level_1", right_on="state", how="left")
An alternative to @pomber's solution, if you wish to avoid an explicit merge, is to align the indices, assign the series from your groupby, then reset the index.
df_t = df_t.set_index('level_1')
df_t['counts'] = df_o.loc[df_o.month.between('2011-01', '2015-12')]\
.groupby('state')['count'].sum()
df_t = df_t.reset_index()
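The group-then-merge idea can be checked on a small frame. The sketch below is a toy version only: the California and Alabama counts for 2015-12 come from the question, all other rows are made up, and the month column is assumed to hold 'YYYY-MM' strings so that `between` compares them lexicographically.

```python
import pandas as pd

df_o = pd.DataFrame({
    "month": ["2010-12", "2015-11", "2015-12", "2015-12", "2016-01"],
    "state": ["California", "California", "California", "Alabama", "California"],
    "count": [5, 100000, 119166, 31359, 7],   # 5, 100000 and 7 are made-up values
})
df_t = pd.DataFrame({
    "level_0": ["h_pct_vet", "l_pct_vet"],
    "level_1": ["California", "Vermont"],
})

# Sum `count` per state, restricted to the requested months.
in_range = df_o["month"].between("2011-01", "2015-12")
counts = (
    df_o[in_range]
    .groupby("state")["count"].sum()
    .reset_index(name="counts")
)

# Left merge keeps every row of df_t; states without data get NaN.
merged = (
    df_t.merge(counts, left_on="level_1", right_on="state", how="left")
    .drop(columns="state")
)
print(merged)
```

Vermont has no rows in the toy `df_o`, so its `counts` value is NaN; a `fillna(0)` could be appended if zero is the preferred result.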

Pandas apply function using another DataFrame

I'm trying to use the apply function on my dataframe ('homes'), which has a MultiIndex ('State' and 'RegionName'). The function I use checks whether the combination of state and region name also appears in my other dataframe ('ut').
When applying this function:
homes['UT']=homes.apply(lambda row: 1 if
ut[(ut['State']==states[homes.iloc[row].name[0]]) &
(ut['RegionName']==homes.iloc[row].name[1])] else 0, axis=1)
I get an error saying, basically, that my index is out of bounds.
I tried a few things, like converting the other dataframe to two lists and checking whether the rows of my dataframe are in those lists, but I am still getting the same error.
Head of my ut dataframe:
State RegionName
1 Alabama Auburn
2 Alabama Florence
3 Alabama Jacksonville
4 Alabama Livingston
5 Alabama Montevallo
Head of my homes dataframe:
2000q1 2000q2 2000q3 2000q4
State RegionName
New York New York NaN NaN NaN NaN
California Los Angeles 207066.666667 214466.666667 220966.666667 226166.666667
Illinois Chicago 138400.000000 143633.333333 147866.666667 152133.333333
Pennsylvania Philadelphia 53000.000000 53633.333333 54133.333333 54700.000000
Arizona Phoenix 111833.333333 114366.666667 116000.000000 117400.000000
Any suggestions?
Found the answer, thanks to @user8505495.
The code should be like this:
homes['UT']=homes.apply(lambda row: 1 if (row.name[0]+', '+row.name[1] in ut['full'].values) else 0, axis=1)
I have no idea why it works, but it does. Thanks for all the help!
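As for why this works: with axis=1, apply passes each row as a Series whose `.name` is that row's index label, here the (State, RegionName) tuple, so `homes.iloc[row]` in the first attempt was indexing with a whole row rather than a position. The same membership test can also be written without apply. The sketch below is one possible variant, assuming both frames spell state names the same way; the Auburn price is a made-up value.

```python
import pandas as pd

ut = pd.DataFrame({
    "State": ["Alabama", "Alabama"],
    "RegionName": ["Auburn", "Florence"],
})

homes = pd.DataFrame(
    {"2000q1": [float("nan"), 111833.333333, 100000.0]},  # last value is hypothetical
    index=pd.MultiIndex.from_tuples(
        [("New York", "New York"), ("Arizona", "Phoenix"), ("Alabama", "Auburn")],
        names=["State", "RegionName"],
    ),
)

# Build (State, RegionName) pairs and test each index tuple for membership.
pairs = list(zip(ut["State"], ut["RegionName"]))
homes["UT"] = homes.index.isin(pairs).astype(int)
print(homes)
```

This compares whole tuples, so no helper column such as `ut['full']` is needed.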

Make row operations faster in pandas

I am doing a course on Coursera and I have a dataset to perform some operations on. I have gotten the answer to the problem but my answer takes time to compute.
Here is the original dataset and a sample screenshot is provided below.
The task is to convert the data from monthly values to quarterly values i.e. I need to sort of aggregate 2000-01, 2000-02, 2000-03 data to 2000-Q1 and so on. The new value for 2000-Q1 should be the mean of these three values.
Likewise 2000-04, 2000-05, 2000-06 would become 2000-Q2 and the new value should be the mean of 2000-04, 2000-05, 2000-06
Here is how I solved the problem.
First I defined a function quarter_rows() which takes a row of data (as a series), loops through every third element using column index, replaces some values (in-place) with a mean computed as explained above and returns the row
import pandas as pd
import numpy as np
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
def quarter_rows(row):
    for i in range(0, len(row), 3):
        row.replace(row[i], np.mean(row[i:i+3]), inplace=True)
    return row
Now I do some subsetting and cleanup of the data to leave only what I need to work with
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
housing3 = housing.set_index(["State","RegionName"]).loc[:, '2000-01' : ]  # .loc replaces .ix, which was removed in pandas 1.0
I then used apply to apply the function to all rows.
housing3 = housing3.apply(quarter_rows, axis=1)
I get the expected result. A sample is shown below
But the whole process takes more than a minute to complete. The original dataframe has about 10370 columns.
I don't know if there is a way to speed things up in the for loop and apply functions. The bulk of the time is taken up in the for loop inside my quarter_rows() function.
I've tried python lambdas but every way I tried threw an exception.
I would really be interested in finding a way to get the mean using three consecutive values without using the for loop.
Thanks
I think that instead of apply you can resample by quarter and aggregate with mean, but first convert the column names to monthly periods with to_period:
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
Testing:
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
# for testing, select only the first 10 rows and the columns from Jan 2000 to Jun 2000
housing3 = housing.set_index(["State","RegionName"]).loc[:, '2000-01' : '2000-06'].iloc[:10]
print (housing3)
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
State RegionName
NY New York NaN NaN NaN NaN NaN NaN
CA Los Angeles 204400.0 207000.0 209800.0 212300.0 214500.0 216600.0
IL Chicago 136800.0 138300.0 140100.0 141900.0 143700.0 145300.0
PA Philadelphia 52700.0 53100.0 53200.0 53400.0 53700.0 53800.0
AZ Phoenix 111000.0 111700.0 112800.0 113700.0 114300.0 115100.0
NV Las Vegas 131700.0 132600.0 133500.0 134100.0 134400.0 134600.0
CA San Diego 219200.0 222900.0 226600.0 230200.0 234400.0 238500.0
TX Dallas 85100.0 84500.0 83800.0 83600.0 83800.0 84200.0
CA San Jose 364100.0 374000.0 384700.0 395700.0 407100.0 416900.0
FL Jacksonville 88000.0 88800.0 89000.0 88900.0 89600.0 90600.0
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
print (housing3)
2000Q1 2000Q2
State RegionName
NY New York NaN NaN
CA Los Angeles 207066.666667 214466.666667
IL Chicago 138400.000000 143633.333333
PA Philadelphia 53000.000000 53633.333333
AZ Phoenix 111833.333333 114366.666667
NV Las Vegas 132600.000000 134366.666667
CA San Diego 222900.000000 234366.666667
TX Dallas 84466.666667 83866.666667
CA San Jose 374266.666667 406566.666667
FL Jacksonville 88600.000000 89700.000000
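Recent pandas releases deprecate the axis=1 argument of resample, so on a newer installation the same quarterly mean may need a different spelling. The sketch below is one possible alternative, not the answer's method: it maps each monthly column to its quarter and groups the transposed frame by that quarter. The two rows are taken from the test output above, and the "%Y-%m" column format is assumed.

```python
import pandas as pd

housing3 = pd.DataFrame(
    [[204400.0, 207000.0, 209800.0, 212300.0, 214500.0, 216600.0],
     [136800.0, 138300.0, 140100.0, 141900.0, 143700.0, 145300.0]],
    index=pd.MultiIndex.from_tuples(
        [("CA", "Los Angeles"), ("IL", "Chicago")],
        names=["State", "RegionName"],
    ),
    columns=["2000-01", "2000-02", "2000-03", "2000-04", "2000-05", "2000-06"],
)

# One quarterly period per monthly column, e.g. "2000-02" -> 2000Q1.
quarters = pd.to_datetime(housing3.columns, format="%Y-%m").to_period("Q")

# Transpose so months become rows, average within each quarter, transpose back.
quarterly = housing3.T.groupby(quarters).mean().T

# Plain string labels such as "2000Q1" for the resulting columns.
quarterly.columns = quarterly.columns.astype(str)
print(quarterly)
```

As with the resample version, no Python-level loop over rows is involved, which is where the original apply spent most of its time.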
