I have DF1 with several numeric columns and DF2 with a single numeric column
DF1:
Year Industrial Consumer Discretionary Technology Utilities Energy Materials Communications Consumer Staples Health Care #No L1 US Agg Financials China Agg EU Agg
2001 5.884277 6.013842 6.216585 6.640594 6.701400 8.488806 7.175017 6.334284 6.082113 0.000000 5.439149 4.193736 4.686188 4.294788
2002 5.697814 6.277471 5.241045 6.608475 6.983511 8.089475 7.399775 5.882947 5.818563 7.250000 4.877012 3.635425 4.334125 3.944324
2003 5.144356 6.503754 6.270268 5.737079 6.466985 8.122228 7.040089 5.461827 5.385670 5.611753 4.163365 2.888026 3.955665 3.464020
2004 5.436486 6.463149 4.500574 5.329104 5.863406 7.562982 6.521106 5.990889 4.874258 6.554348 4.384878 3.502861 4.556418 3.412025
2005 5.003606 6.108812 5.732764 5.543677 6.131144 7.239053 7.228042 5.421092 5.561518 NaN 4.660754 3.970243 3.944251 3.106951
2006 4.505980 6.017253 4.923927 5.955308 5.799030 7.425253 6.942308
DF2:
Year Values
2002 4.514752
2003 3.994849
2004 4.254575
2005 4.277520
2006 4.784476
etc..
The indexes are the same for the two DataFrames.
The goal is to create DF3 by subtracting DF2 from every single column of DF1 (DF2 - DF1 = DF3).
Wherever there is a NaN, the operation should be skipped.
Assuming "Year" is the index for both (if not, you can make it the index using set_index), you can use sub on axis:
df3 = df1.sub(df2['Values'], axis=0)
Output:
Industrial Consumer Discretionary Technology Utilities Energy \
Year
2001 NaN NaN NaN NaN NaN NaN
2002 1.183062 1.762719 0.726293 2.093723 2.468759 3.574723
2003 1.149507 2.508905 2.275419 1.742230 2.472136 4.127379
2004 1.181911 2.208574 0.245999 1.074529 1.608831 3.308407
2005 0.726086 1.831292 1.455244 1.266157 1.853624 2.961533
2006 -0.278496 1.232777 0.139451 1.170832 1.014554 2.640777
Materials Communications Consumer.1 Staples Health_Care US_Agg \
Year
2001 NaN NaN NaN NaN NaN NaN
2002 2.885023 1.368195 1.303811 2.735248 0.362260 -0.879327
2003 3.045240 1.466978 1.390821 1.616904 0.168516 -1.106823
2004 2.266531 1.736314 0.619683 2.299773 0.130303 -0.751714
2005 2.950522 1.143572 1.283998 NaN 0.383234 -0.307277
2006 2.157832 NaN NaN NaN NaN NaN
Financials China_Agg
Year
2001 NaN NaN
2002 -0.180627 -0.570428
2003 -0.039184 -0.530829
2004 0.301843 -0.842550
2005 -0.333269 -1.170569
2006 NaN NaN
If you want to subtract df1 from df2 instead, you can use rsub instead of sub. It's not clear which one you want, since you explain that you want df1 - df2 but your formula says the opposite.
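For reference, a minimal sketch of both directions, using small stand-in frames (the values are illustrative, not the full data from the question):
import pandas as pd
import numpy as np

# small stand-in frames with 'Year' as the index
df1 = pd.DataFrame({'Industrial': [5.88, 5.70, np.nan],
                    'Energy': [8.49, 8.09, 8.12]},
                   index=pd.Index([2001, 2002, 2003], name='Year'))
df2 = pd.DataFrame({'Values': [np.nan, 4.51, 3.99]},
                   index=pd.Index([2001, 2002, 2003], name='Year'))

df3 = df1.sub(df2['Values'], axis=0)     # df1 - df2 for every column
df3_r = df1.rsub(df2['Values'], axis=0)  # df2 - df1, the reversed subtraction
Both align on the Year index; wherever either side is NaN, the result is simply NaN rather than an error.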
I'm currently wrangling a big data set of about 2 million rows from Lyft for a Udacity project. The DataFrame looks like this:
id name latitude longitude
0 148.0 Horton St at 40th St 37.829705 -122.287610
1 376.0 Illinois St at 20th St 37.760458 -122.387540
2 453.0 Brannan St at 4th St 37.777934 -122.396973
3 182.0 19th Street BART Station 37.809369 -122.267951
4 237.0 Fruitvale BART Station 37.775232 -122.224498
5 NaN NaN 37.775232 -122.224498
As I try to show with the last row, I have a lot of NaN values for id and name; latitude and longitude, however, are almost never empty. My assumption is that I could extract the name from other rows that share the same combination of latitude and longitude.
Once I have the name, I would fill the NaN values for id using name:
dict_id = dict(zip(df['name'], df['id']))
df['id'] = df['id'].fillna(df['name'].map(dict_id))
However, I struggle because with latitude and longitude I have two values to match against the name.
You can left-merge the dataframe with a copy of itself after dropna, then rename the columns:
m = df.merge(df.dropna(subset=['name']), on=['latitude', 'longitude'],
             how='left', suffixes=('', '_y'))
out = (m.drop(['id', 'name'], axis=1)
        .rename(columns={'id_y': 'id', 'name_y': 'name'})
        .reindex(df.columns, axis=1))
id name latitude longitude
0 148.0 Horton St at 40th St 37.829705 -122.287610
1 376.0 Illinois St at 20th St 37.760458 -122.387540
2 453.0 Brannan St at 4th St 37.777934 -122.396973
3 182.0 19th Street BART Station 37.809369 -122.267951
4 237.0 Fruitvale BART Station 37.775232 -122.224498
5 237.0 Fruitvale BART Station 37.775232 -122.224498
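An alternative sketch, if you prefer to avoid the merge: build a (latitude, longitude) lookup from the rows that do have a name and fill the gaps from it. This assumes each coordinate pair maps to a single station.
# lookup of id/name keyed by the coordinate pair (first non-null occurrence wins)
lookup = (df.dropna(subset=['name'])
            .drop_duplicates(subset=['latitude', 'longitude'])
            .set_index(['latitude', 'longitude'])[['id', 'name']])

keys = pd.MultiIndex.from_frame(df[['latitude', 'longitude']])
df['id'] = df['id'].fillna(pd.Series(lookup['id'].reindex(keys).to_numpy(), index=df.index))
df['name'] = df['name'].fillna(pd.Series(lookup['name'].reindex(keys).to_numpy(), index=df.index))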
I have a dataframe of addresses with no zipcodes:
df1 = pd.DataFrame({'address1':['1 o\'toole st','2 main st','3 high street','5 foo street','10 foo street'],
'address2':['town1',np.nan,np.nan,'Bartown',np.nan],
'address3':[np.nan,'village','city','county2','county3']})
df1['zipcode']=''
df1
address1 address2 address3 zipcode
0 1 o'toole st town1 NaN
1 2 main st NaN village
2 3 high street NaN city
3 5 foo street Bartown county2
4 10 foo street NaN county3
And I have a second dataframe with addresses and zipcodes. Note that it happens to be in the same order as df1 here, but that's not the case in the real data I'm working with:
df2 = pd.DataFrame({'address1':['1 o\'toole st','2 main st','7 mill street','5 foo street','10 foo street'],
'address2':['town1','village','city','Bartown','county3'],
'address3':[np.nan,np.nan,np.nan,'county2','USA'],
'zipcode': ['er45','qw23','rt67','yu89','yu83']})
df2
address1 address2 address3 zipcode
0 1 o'toole st town1 NaN er45
1 2 main st village NaN qw23
2 7 mill street city NaN rt67
3 5 foo street Bartown county2 yu89
4 10 foo street county3 USA yu83
I want to check if the addresses in df1 are in df2 and, if so, drag the zipcodes into df1.
This is where I'm having a bit of trouble; I'm not sure if it's the best way to approach it.
What I've done so far is create a primary key for both dataframes from the first two lines of the address (address1 and address2), stripping all whitespace and non-alphanumeric characters and converting to lower case:
df1['key'] = (df1['address1'] + df1['address2']).str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True)
df2['key'] = (df2['address1'] + df2['address2']).str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True)
print(df1)
address1 address2 address3 zipcode key
0 1 o'toole st town1 NaN 1otoolesttown1
1 2 main st NaN village NaN
2 3 high street NaN city NaN
3 5 foo street Bartown county2 5foostreetbartown
4 10 foo street NaN county3 NaN
print(df2)
address1 address2 address3 zipcode key
0 1 o'toole st town1 NaN er45 1otoolesttown1
1 2 main st village NaN qw23 2mainstvillage
2 7 mill street city NaN rt67 7millstreetcity
3 5 foo street Bartown county2 yu89 5foostreetbartown
4 10 foo street county3 USA yu83 10foostreetcounty3
Now I'm going to use np.where to drag the info over to the empty zipcode column in df1, returning no_match where no matching address is found:
df1['zipcode'] = np.where(df1['key'].isin(df2['key']), df2['zipcode'], 'no_match')
print(df1)
address1 address2 address3 zipcode key
0 1 o'toole st town1 NaN er45 1otoolesttown1
1 2 main st NaN village no_match NaN
2 3 high street NaN city no_match NaN
3 5 foo street Bartown county2 yu89 5foostreetbartown
4 10 foo street NaN county3 no_match NaN
My problem is with the key created for df1. As you can see, some of them are NaN. This is due to the address formatting, which differs from df2; that's just how the datasets I'm working with are.
I tried to get around this problem by skipping any NaN and using the next address column instead, but I get a ValueError:
# add address1 + address2 if it's not null, otherwise use address3
df1['key'] = (df1['address1'] + (df1['address2'] if pd.notnull(df1['address2']) else df1['address3']))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any feedback or advice on how to tackle this is much appreciated. If there's an easier way to do this I'd love to know.
Use Series.fillna to replace the missing values with df1['address3']:
df1['key'] = df1['address1'] + df1['address2'].fillna(df1['address3'])
instead of:
df1['key'] = (df1['address1'] + (df1['address2'] if
pd.notnull(df1['address2']) else df1['address3']))
More information about your error can be found in the pandas docs on using if/truth statements with pandas.
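A possible end-to-end sketch building on that fix, cleaning both keys the same way as in the question and then mapping the zipcodes across (this assumes the keys in df2 are unique):
# helper to normalise an address key: lower-case and drop all non-word characters
def clean(s):
    return s.str.lower().str.replace(r'\W', '', regex=True)

df1['key'] = clean(df1['address1'] + df1['address2'].fillna(df1['address3']))
df2['key'] = clean(df2['address1'] + df2['address2'])

# map each df1 key to the matching df2 zipcode; unmatched keys become 'no_match'
df1['zipcode'] = df1['key'].map(df2.set_index('key')['zipcode']).fillna('no_match')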
I would first replace the NaN values with empty strings and concatenate the three address columns to get the whole address in one column, a bit like you did:
# filling NaN values
df1.fillna('', inplace=True)
df2.fillna('', inplace=True)
# concatenate the address columns
df1['address'] = df1['address1']+df1['address2']+df1['address3']
df2['address'] = df2['address1']+df2['address2']+df2['address3']
Then set the new 'address' column as the index in both DataFrames:
df1.set_index('address', inplace=True)
df2.set_index('address', inplace=True)
And finally add the zip code to df1:
df1['zipcode'] = df2['zipcode']
Here is the result:
address1 address2 address3 zipcode
address
1 o'toole sttown1 1 o'toole st town1 er45
2 main stvillage 2 main st village qw23
3 high streetcity 3 high street city NaN
5 foo streetBartowncounty2 5 foo street Bartown county2 yu89
10 foo streetcounty3 10 foo street county3 NaN
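If you then want a plain integer index back and the question's no_match convention for addresses without a match, a small follow-up (a sketch):
df1 = df1.reset_index(drop=True)                     # drop the helper 'address' index
df1['zipcode'] = df1['zipcode'].fillna('no_match')   # flag addresses with no zipcode found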
Your problem is this line:
df1['key'] = (df1['address1'] + (df1['address2'] if pd.notnull(df1['address2']) else df1['address3']))
The if used here causes the error, because pd.notnull generates a boolean Series while if requires a single boolean value.
You can solve it by using pandas.Series.where:
df1['key'] = ((df1['address1'] +
               df1['address2'].where(pd.notnull(df1['address2']), df1['address3']))
              .str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True))
This will generate a df1 with the keys you are looking for:
address1 address2 address3 key
0 1 o'toole st town1 NaN 1otoolesttown1
1 2 main st NaN village 2mainstvillage
2 3 high street NaN city 3highstreetcity
3 5 foo street Bartown county2 5foostreetbartown
4 10 foo street NaN county3 10foostreetcounty3
And now you can merge the zipcodes.
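For completeness, one way that merge could look (a sketch; df2 gets the same key treatment, and df1 still has the empty placeholder zipcode column from the question, so it is dropped first):
df2['key'] = ((df2['address1'] + df2['address2'])
              .str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True))

# left merge keeps every df1 row; addresses without a match end up with NaN zipcodes
out = (df1.drop(columns='zipcode')
          .merge(df2[['key', 'zipcode']], on='key', how='left'))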
I have some time series data that is mostly quarterly but reported in year-month-day format, for multiple variables and multiple countries. Some variables post on the last day of the quarter, while others post on a date close to the last day. I would like to perform a resample that aggregates each row to an end-of-quarter frequency. I have this:
Date Country Var1 Var2 Var3
2012-03-30 China 12 NaN 200
2012-03-31 China NaN 50 NaN
2012-06-28 China 13 NaN 199
2012-06-30 China NaN 48 NaN
2012-09-30 China 13 49 200
2012-12-31 China 12 50 201
What I want to see is
Date Country Var1 Var2 Var3
2012-03-31 China 12 50 200
2012-06-30 China 13 48 199
2012-09-30 China 13 49 200
2012-12-31 China 12 50 201
I tried a couple of different resample ideas. First I tried
df=df.groupby("Country").resample('Q').applymap(lambda x: df.shift(1) if math.isnan(x) else x)
Then I tried converting all the NaNs to zeros and aggregating by sum, but this is not ideal since I can no longer tell which values are actually zero and which were missing.
df=df.fillna(0)
df=df.groupby("Country").resample('Q').sum()
Here's a small example with my own dataframe doing what you want.
# creating the dataframe
df = pd.DataFrame(np.random.randn(8, 3), columns=['Var1', 'Var2', 'Var3'])
# adding NaN values (use .loc rather than chained indexing so the assignment sticks)
df.loc[1, 'Var1'] = np.nan
df.loc[5, 'Var1'] = np.nan
df.loc[4, 'Var2'] = np.nan
df.loc[6, 'Var2'] = np.nan
df
'''
Var1 Var2 Var3
0 -0.437551 -2.707623 0.726240
1 NaN 2.529733 0.484732
2 0.199278 -0.316516 -0.655426
3 0.732910 -0.638045 -0.706436
4 0.877915 NaN -1.141384
5 NaN -2.050228 2.091994
6 -1.119849 NaN 1.222602
7 0.406632 -2.255687 0.742452
'''
# backfilling values in Var2
df['Var2'] = df['Var2'].bfill()
# dropping the rows that still contain NaN (here only Var1 has them left)
df = df.dropna()
df
'''
Var1 Var2 Var3
0 -0.437551 -2.707623 0.726240
2 0.199278 -0.316516 -0.655426
3 0.732910 -0.638045 -0.706436
4 0.877915 -2.050228 -1.141384
6 -1.119849 -2.255687 1.222602
7 0.406632 -2.255687 0.742452
'''
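Applied back to the original quarterly data, a hedged sketch: aggregations such as first() on a groupby/resample ignore NaN, so resampling to quarter-end and taking the first non-null value per column collapses the near-duplicate rows within each quarter without ever treating missing values as zero.
# sketch, assuming 'Date' is (or can be parsed as) a regular datetime column
df['Date'] = pd.to_datetime(df['Date'])
out = (df.set_index('Date')
         .groupby('Country')
         .resample('Q')
         .first())   # first non-NaN value per column within each quarter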
I want to create a dataframe from an aggregate function. I thought that it would create a dataframe by default, as this solution states (Converting a Pandas GroupBy object to DataFrame), but it creates a series and I don't know why.
The dataframe is from Kaggle's San Francisco Salaries. My code:
df=pd.read_csv('Salaries.csv')
in: type(df)
out: pandas.core.frame.DataFrame
in: df.head()
out: EmployeeName JobTitle TotalPay TotalPayBenefits Year Status 2BasePay 2OvertimePay 2OtherPay 2Benefits 2Year
0 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 567595.43 567595.43 2011 NaN 167411.18 0.00 400184.25 NaN 2011-01-01
1 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 538909.28 538909.28 2011 NaN 155966.02 245131.88 137811.38 NaN 2011-01-01
2 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 335279.91 335279.91 2011 NaN 212739.13 106088.18 16452.60 NaN 2011-01-01
3 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 332343.61 332343.61 2011 NaN 77916.00 56120.71 198306.90 NaN 2011-01-01
4 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 326373.19 326373.19 2011 NaN 134401.60 9737.00 182234.59 NaN 2011-01-01
in: df2=df.groupby(['JobTitle'])['TotalPay'].mean()
type(df2)
out: pandas.core.series.Series
I want df2 to be a dataframe with the columns 'JobTitle' and 'TotalPay'.
Breaking down your code:
df2 = df.groupby(['JobTitle'])['TotalPay'].mean()
The groupby is fine. It's the ['TotalPay'] that is the misstep. It tells the groupby to execute the mean function only on the pd.Series df['TotalPay'] for each group defined by ['JobTitle']. Instead, refer to this column with [['TotalPay']]. Notice the double brackets: they select a pd.DataFrame rather than a pd.Series, so the aggregation returns a DataFrame.
Recap
df2 = df.groupby(['JobTitle'])[['TotalPay']].mean()
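Note that with the double brackets, JobTitle ends up as the index of df2. If you want it as an ordinary column next to TotalPay, two common variants (a sketch):
# keep the group key as a column instead of the index
df2 = df.groupby('JobTitle', as_index=False)['TotalPay'].mean()

# ...or aggregate first and then reset the index
df2 = df.groupby('JobTitle')[['TotalPay']].mean().reset_index()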