I am doing a course on Coursera and I have a dataset to perform some operations on. I have gotten the answer to the problem but my answer takes time to compute.
Here is the original dataset and a sample screenshot is provided below.
The task is to convert the data from monthly to quarterly values, i.e. aggregate the 2000-01, 2000-02 and 2000-03 columns into 2000-Q1, and so on. The new value for 2000-Q1 should be the mean of those three values.
Likewise, 2000-04, 2000-05 and 2000-06 become 2000-Q2, whose value is the mean of those three months.
Here is how I solved the problem.
First I defined a function quarter_rows() which takes a row of data (as a Series), loops through every third element by column index, replaces some values in place with the mean computed as explained above, and returns the row.
import pandas as pd
import numpy as np
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
def quarter_rows(row):
    for i in range(0, len(row), 3):
        row.replace(row[i], np.mean(row[i:i+3]), inplace=True)
    return row
Now I do some subsetting and cleanup of the data to leave only what I need to work with
p = ~housing.columns.str.contains('199') # mask of columns whose name does not contain '199'
housing = housing[housing.columns[p]]
housing3 = housing.set_index(["State","RegionName"]).loc[:, '2000-01':]
I then used apply to apply the function to all rows.
housing3 = housing3.apply(quarter_rows, axis=1)
I get the expected result. A sample is shown below
But the whole process takes more than a minute to complete. The original dataframe has about 10370 columns.
I don't know if there is a way to speed things up in the for loop and apply functions. The bulk of the time is taken up in the for loop inside my quarter_rows() function.
I've tried python lambdas but every way I tried threw an exception.
I would really be interested in finding a way to get the mean using three consecutive values without using the for loop.
Thanks
Instead of apply, I think you can use resample by quarter and aggregate with mean, but first convert the column names to monthly periods with to_period:
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
Testing:
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
p = ~housing.columns.str.contains('199') # mask of columns whose name does not contain '199'
housing = housing[housing.columns[p]]
# for testing, select only the first 10 rows and the columns from Jan 2000 to Jun 2000
housing3 = housing.set_index(["State","RegionName"]).iloc[:10].loc[:, '2000-01':'2000-06']
print (housing3)
                     2000-01   2000-02   2000-03   2000-04   2000-05   2000-06
State RegionName
NY    New York           NaN       NaN       NaN       NaN       NaN       NaN
CA    Los Angeles   204400.0  207000.0  209800.0  212300.0  214500.0  216600.0
IL    Chicago       136800.0  138300.0  140100.0  141900.0  143700.0  145300.0
PA    Philadelphia   52700.0   53100.0   53200.0   53400.0   53700.0   53800.0
AZ    Phoenix       111000.0  111700.0  112800.0  113700.0  114300.0  115100.0
NV    Las Vegas     131700.0  132600.0  133500.0  134100.0  134400.0  134600.0
CA    San Diego     219200.0  222900.0  226600.0  230200.0  234400.0  238500.0
TX    Dallas         85100.0   84500.0   83800.0   83600.0   83800.0   84200.0
CA    San Jose      364100.0  374000.0  384700.0  395700.0  407100.0  416900.0
FL    Jacksonville   88000.0   88800.0   89000.0   88900.0   89600.0   90600.0
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
print (housing3)
                           2000Q1         2000Q2
State RegionName
NY    New York                NaN            NaN
CA    Los Angeles   207066.666667  214466.666667
IL    Chicago       138400.000000  143633.333333
PA    Philadelphia   53000.000000   53633.333333
AZ    Phoenix       111833.333333  114366.666667
NV    Las Vegas     132600.000000  134366.666667
CA    San Diego     222900.000000  234366.666667
TX    Dallas         84466.666667   83866.666667
CA    San Jose      374266.666667  406566.666667
FL    Jacksonville   88600.000000   89700.000000
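As far as I know, recent pandas versions deprecate the axis argument of resample, so here is a minimal sketch of the same quarterly mean done by grouping the transposed frame instead. It is an alternative under that assumption, not the method used above; the frame `monthly` and the names `quarter_labels` and `quarterly` are made up for the illustration, and the two rows are copied from the sample output.

```python
import pandas as pd

# Two rows copied from the sample output; columns are monthly labels.
monthly = pd.DataFrame(
    {
        "2000-01": [204400.0, 136800.0],
        "2000-02": [207000.0, 138300.0],
        "2000-03": [209800.0, 140100.0],
        "2000-04": [212300.0, 141900.0],
        "2000-05": [214500.0, 143700.0],
        "2000-06": [216600.0, 145300.0],
    },
    index=pd.MultiIndex.from_tuples(
        [("CA", "Los Angeles"), ("IL", "Chicago")],
        names=["State", "RegionName"],
    ),
)

# Map each monthly column label to its quarter, e.g. "2000-02" -> "2000Q1".
quarter_labels = pd.PeriodIndex(monthly.columns, freq="M").asfreq("Q").astype(str)

# Group the transposed frame by quarter label, average, and transpose back.
quarterly = monthly.T.groupby(quarter_labels).mean().T
print(quarterly)
```

The grouping key is just an index of labels with one entry per monthly column, so no Python-level loop over rows is needed.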
I have a df with 3 columns, City, State, and MSA. Some of the MSA values are NaN. I would like to fill the MSA NaN values with a concatenation of City and State. I can fill MSA with City using df.MSA_CBSA.fillna(df.City, inplace=True), but some cities in different states have the same name.
City        State  MSA
Chicago     IL     Chicago MSA
Belleville  IL     NaN
Belleville  KS     NaN
Desired result:
City        State  MSA
Chicago     IL     Chicago MSA
Belleville  IL     Belleville IL
Belleville  KS     Belleville KS
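Keep using the vectorized operation you suggested. Note that the argument to fillna can itself be a Series built from other columns, so you can pass the concatenation of City and State (joined with a space to match the desired output):
df['MSA'] = df['MSA'].fillna(df['City'] + ' ' + df['State'])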
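Here is a small self-contained sketch of that idea, using the three rows from the question. It assumes the column is named MSA, as in the table, and joins City and State with a space to match the desired output.

```python
import numpy as np
import pandas as pd

# The three rows from the question; missing MSA values are NaN.
df = pd.DataFrame({
    "City": ["Chicago", "Belleville", "Belleville"],
    "State": ["IL", "IL", "KS"],
    "MSA": ["Chicago MSA", np.nan, np.nan],
})

# fillna aligns the replacement Series on the index, so each missing MSA
# is filled from the City and State of the same row.
df["MSA"] = df["MSA"].fillna(df["City"] + " " + df["State"])
print(df)
```

Rows that already have an MSA value are left untouched.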
I have two datasets:
- population: shows the population of US states, ordered alphabetically.
- data: has more than 200,000 rows.
population.head()
        state  population
0     Alabama     4887871
1      Alaska      737438
2     Arizona     7171646
3    Arkansas     3013825
4  California    39557045
I'm trying to add a new column called "incidents", computed from the other dataset.
I tried: population['incidents'] = data.state.value_counts().sort_index()
but I'm getting the following result:
        state  population  incidents
0     Alabama     4887871        NaN
1      Alaska      737438        NaN
2     Arizona     7171646        NaN
3    Arkansas     3013825        NaN
4  California    39557045        NaN
What can I do to fix this?
EDIT:
data.state.value_counts().sort_index()
Alabama 5373
Alaska 1292
Arizona 2268
Arkansas 2753
California 15975
Colorado 3069
Connecticut 2984
Delaware 1643
District of Columbia 3091
Florida 14610
Georgia 8717
If you want to add a specific column from one dataset to another, you can do it like this:
population['incidents'] = data[['columntoappend']]
The right-hand side (RHS) must be a single column, which in your case it is not.
https://www.google.com/amp/s/www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/amp/
The way to do this is as follows, provided that the lengths of your indices are consistent:
population['incidents'] = [x for x in data.state.value_counts().sort_index()]
I can't really explain why your approach results in NaN values, though. In any case, it would also be incorrect, since you're assigning an entire Series to each row of the population dataset. With the list comprehension, you're assigning one value to each row.
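One likely explanation for the NaN values is index alignment: when a Series is assigned to a column, pandas matches it to the rows by index label, and here the counts are labelled by state name while population is labelled 0, 1, 2, and so on. The sketch below illustrates this with a tiny made-up `data` frame (the incident rows are hypothetical) and uses `map` to look up each state by name, which does not depend on row order or length.

```python
import pandas as pd

population = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],
    "population": [4887871, 737438, 7171646],
})
# Hypothetical incident-level data: one row per incident.
data = pd.DataFrame({"state": ["Alaska", "Alabama", "Alabama", "Arizona", "Alabama"]})

counts = data["state"].value_counts()  # indexed by state name

# Assigning the Series directly aligns on index labels (0, 1, 2 vs. state
# names), so nothing matches and every row becomes NaN.
misaligned = population.assign(incidents=counts)

# map looks up each value of the state column in the index of counts.
population["incidents"] = population["state"].map(counts)
print(population)
```

A state that never appears in `data` would get NaN from `map`; appending `.fillna(0)` would turn that into a zero count if that is the intended meaning.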
I'm trying to use the apply function on my dataframe ('homes'), which has a MultiIndex ('State' and 'RegionName'). The function I use checks whether the combination of State and RegionName is matched in my other dataframe ('ut').
When applying this function:
homes['UT']=homes.apply(lambda row: 1 if
    ut[(ut['State']==states[homes.iloc[row].name[0]]) &
       (ut['RegionName']==homes.iloc[row].name[1])] else 0, axis=1)
I get an error saying basically that my index is out of bounds.
I tried a few things, like converting the other dataframe to two lists and checking whether the rows of my dataframe are in those lists, but I still get the same error.
My ut dataframe head:
     State    RegionName
1  Alabama        Auburn
2  Alabama      Florence
3  Alabama  Jacksonville
4  Alabama    Livingston
5  Alabama    Montevallo
My homes dataframe head:
                                  2000q1         2000q2         2000q3         2000q4
State        RegionName
New York     New York                NaN            NaN            NaN            NaN
California   Los Angeles   207066.666667  214466.666667  220966.666667  226166.666667
Illinois     Chicago       138400.000000  143633.333333  147866.666667  152133.333333
Pennsylvania Philadelphia   53000.000000   53633.333333   54133.333333   54700.000000
Arizona      Phoenix       111833.333333  114366.666667  116000.000000  117400.000000
Any suggestions?
Found the answer thanks to #user8505495.
The code should be like this:
homes['UT']=homes.apply(lambda row: 1 if (row.name[0]+', '+row.name[1] in ut['full'].values) else 0, axis=1)
I have no idea why it works, but it does. Thanks for all the help!
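A probable reason it works: with axis=1, apply passes each row as a Series, and that Series' `name` attribute is the row's index label, which for a MultiIndex is a (State, RegionName) tuple. The original attempt passed the whole row to `homes.iloc[...]`, which expects integer positions. The sketch below shows the `row.name` lookup and a vectorized alternative; the toy frames and the names `ut_pairs`, `via_apply` are assumptions for the illustration, and it compares tuples directly instead of using the `full` column.

```python
import pandas as pd

# Toy stand-ins for the frames in the question.
homes = pd.DataFrame(
    {"2000q1": [1.0, 2.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [("Alabama", "Auburn"), ("California", "Los Angeles"), ("Alabama", "Florence")],
        names=["State", "RegionName"],
    ),
)
ut = pd.DataFrame({
    "State": ["Alabama", "Alabama"],
    "RegionName": ["Auburn", "Florence"],
})

# Row-wise version: row.name is the (State, RegionName) tuple of that row.
ut_pairs = set(zip(ut["State"], ut["RegionName"]))
via_apply = homes.apply(lambda row: int(row.name in ut_pairs), axis=1)

# Vectorized version: test the whole index against the pairs at once.
homes["UT"] = homes.index.isin(pd.MultiIndex.from_frame(ut)).astype(int)
print(homes)
```

Both versions give the same flags; the `isin` form avoids calling a Python function once per row.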
Very new to pandas so any explanation with a solution is appreciated.
I have a dataframe such as
    Company                             Zip State City
1   *CBRE                               San Diego, CA 92101
4   1908 Brands                         Boulder, CO 80301
7   1st Infantry Division Headquarters  Fort Riley, KS
10  21st Century Healthcare, Inc.       Tempe 85282
15  AAA                                 Jefferson City, MO 65101-9564
I want to split the Zip State city column in my data into 3 different columns. Using the answer from this post Pandas DataFrame, how do i split a column into two I could accomplish this task if I didn't have my first column. Writing a regex to captures all companies just leads to me capturing everything in my data.
I also tried
foo = lambda x: pandas.Series([i for i in reversed(x.split())])
data_pretty = data['Zip State City'].apply(foo)
but this causes me to loose the company column and splits the names of the cities that are more than one word into separate columns.
How can I split my last column while keeping the company column data?
You can use the str.extract() method:
In [110]: df
Out[110]:
    Company                             Zip State City
1   *CBRE                               San Diego, CA 92101
4   1908 Brands                         Boulder, CO 80301
7   1st Infantry Division Headquarters  Fort Riley, KS
10  21st Century Healthcare, Inc.       Tempe 85282
15  AAA                                 Jefferson City, MO 65101-9564
In [112]: df[['City','State','ZIP']] = df['Zip State City'].str.extract(r'([^,\d]+)?[,]*\s*([A-Z]{2})?\s*([\d\-]{4,11})?', expand=True)
In [113]: df
Out[113]:
    Company                             Zip State City                 City            State  ZIP
1   *CBRE                               San Diego, CA 92101            San Diego       CA     92101
4   1908 Brands                         Boulder, CO 80301              Boulder         CO     80301
7   1st Infantry Division Headquarters  Fort Riley, KS                 Fort Riley      KS     NaN
10  21st Century Healthcare, Inc.       Tempe 85282                    Tempe           NaN    85282
15  AAA                                 Jefferson City, MO 65101-9564  Jefferson City  MO     65101-9564
From docs:
Series.str.extract(pat, flags=0, expand=None)
For each subject string in the Series, extract groups from the first
match of regular expression pat.
New in version 0.13.0.
Parameters:
pat : string
    Regular expression pattern with capturing groups
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
expand : bool, default False (new in version 0.18.0)
    If True, return DataFrame.
    If False, return Series/Index/DataFrame.
Returns: DataFrame with one row for each subject string, and one
column for each group. Any capture group names in regular expression
pat will be used for column names; otherwise capture group numbers
will be used. The dtype of each result column is always object, even
when no match is found. If expand=True and pat has only one capture
group, then return a Series (if subject is a Series) or Index (if
subject is an Index).
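As the docs excerpt notes, named capture groups become column names. The sketch below applies that to the same pattern, with group names City, State and ZIP added for the illustration, on three of the sample rows. It also strips whitespace from City, because the first group can capture a trailing space when no comma follows the city (as in "Tempe 85282").

```python
import pandas as pd

# Three of the sample rows from the question.
df = pd.DataFrame({
    "Company": ["*CBRE", "1st Infantry Division Headquarters",
                "21st Century Healthcare, Inc."],
    "Zip State City": ["San Diego, CA 92101", "Fort Riley, KS", "Tempe 85282"],
})

# Same pattern as above, but with named groups so the resulting columns
# are labelled City, State and ZIP automatically.
pattern = r"(?P<City>[^,\d]+)?[,]*\s*(?P<State>[A-Z]{2})?\s*(?P<ZIP>[\d\-]{4,11})?"
parts = df["Zip State City"].str.extract(pattern, expand=True)

# Remove whitespace the City group may have captured before the ZIP code.
parts["City"] = parts["City"].str.strip()

# Joining on the index keeps the Company column next to the new columns.
result = df.join(parts)
print(result)
```

Because only the one column is passed to `str.extract`, the Company column is never touched by the regex.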
How do I get the maximum within a subset of my dataframe in Pandas?
For example, when I do something like
statedata[statedata['state.region'] == 'Northeast'].ix[statedata['Murder'].idxmax()]
I get a KeyError that indicates that idxmax is returning the key for the global maximum, Alabama, rather than the maximum within the queried subset (from which that key is of course missing).
Is there a way to do this concisely in pandas?
For reference, the data used here is from R, using
data(state)
statedata = cbind(data.frame(state.x77), state.abb, state.area, state.center, state.division, state.name, state.region)
then exported from R and imported by Pandas.
You could use df.loc to select the sub-DataFrame:
import pandas as pd
import pandas.rpy.common as com
import rpy2.robjects as ro
r = ro.r
statedata = r('''cbind(data.frame(state.x77), state.abb, state.area, state.center,
state.division, state.name, state.region)''')
df = com.convert_robj(statedata)
df.columns = df.columns.to_series().str.replace('state.', '')
subdf = df.loc[df['region']=='Northeast', 'Murder']
print(subdf)
# Connecticut 3.1
# Maine 2.7
# Massachusetts 3.3
# New Hampshire 3.3
# New Jersey 5.2
# New York 10.9
# Pennsylvania 6.1
# Rhode Island 2.4
# Vermont 5.5
# Name: Murder, dtype: float64
print(subdf.idxmax())
prints
New York
To select the state with the highest murder rate (as of 1976) for each region:
In [24]: df.groupby('region')['Murder'].idxmax()
Out[24]:
region
North Central Michigan
Northeast New York
South Alabama
West Nevada
Name: Murder, dtype: object
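For readers without the R bridge, here is a minimal pandas-only sketch of the same idea. The Northeast values are copied from the output above; the Alabama value is a placeholder chosen only so that the global maximum lies outside the Northeast, and the variable names are made up for the illustration.

```python
import pandas as pd

# Northeast values come from the output above; the Alabama number is a
# placeholder, not the real figure.
df = pd.DataFrame(
    {
        "region": ["Northeast", "Northeast", "Northeast", "South"],
        "Murder": [10.9, 5.5, 2.7, 15.0],
    },
    index=["New York", "Vermont", "Maine", "Alabama"],
)

# Filter first, then call idxmax on the filtered column, so the returned
# label is guaranteed to exist in the subset.
northeast = df[df["region"] == "Northeast"]
top_state = northeast["Murder"].idxmax()
print(northeast.loc[top_state])

# The same lookup for every region at once.
per_region = df.groupby("region")["Murder"].idxmax()
print(per_region)
```

The KeyError in the question comes from calling idxmax on the unfiltered column: it returns the label of the global maximum, which is not a row of the Northeast subset.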