I have DF1 with several int columns and DF2 with 1 int column
DF1:
Year Industrial Consumer Discretionary Technology Utilities Energy Materials Communications Consumer Staples Health Care #No L1 US Agg Financials China Agg EU Agg
2001 5.884277 6.013842 6.216585 6.640594 6.701400 8.488806 7.175017 6.334284 6.082113 0.000000 5.439149 4.193736 4.686188 4.294788
2002 5.697814 6.277471 5.241045 6.608475 6.983511 8.089475 7.399775 5.882947 5.818563 7.250000 4.877012 3.635425 4.334125 3.944324
2003 5.144356 6.503754 6.270268 5.737079 6.466985 8.122228 7.040089 5.461827 5.385670 5.611753 4.163365 2.888026 3.955665 3.464020
2004 5.436486 6.463149 4.500574 5.329104 5.863406 7.562982 6.521106 5.990889 4.874258 6.554348 4.384878 3.502861 4.556418 3.412025
2005 5.003606 6.108812 5.732764 5.543677 6.131144 7.239053 7.228042 5.421092 5.561518 NaN 4.660754 3.970243 3.944251 3.106951
2006 4.505980 6.017253 4.923927 5.955308 5.799030 7.425253 6.942308
DF2:
Year Values
2002 4.514752
2003 3.994849
2004 4.254575
2005 4.277520
2006 4.784476
etc..
The indexes are the same for 2 DataFrames.
The goal is to create DF3 while subtracting DF2 from every single column from DF1. (DF2 - DF1 = DF3)
Anywhere where there is a nan, it should skip the math.
Assuming "Year" is the index for both (if not, you can make it the index using set_index), you can use sub on axis:
df3 = df1.sub(df2['Values'], axis=0)
Output:
Industrial Consumer Discretionary Technology Utilities Energy \
Year
2001 NaN NaN NaN NaN NaN NaN
2002 1.183062 1.762719 0.726293 2.093723 2.468759 3.574723
2003 1.149507 2.508905 2.275419 1.742230 2.472136 4.127379
2004 1.181911 2.208574 0.245999 1.074529 1.608831 3.308407
2005 0.726086 1.831292 1.455244 1.266157 1.853624 2.961533
2006 -0.278496 1.232777 0.139451 1.170832 1.014554 2.640777
Materials Communications Consumer.1 Staples Health_Care US_Agg \
Year
2001 NaN NaN NaN NaN NaN NaN
2002 2.885023 1.368195 1.303811 2.735248 0.362260 -0.879327
2003 3.045240 1.466978 1.390821 1.616904 0.168516 -1.106823
2004 2.266531 1.736314 0.619683 2.299773 0.130303 -0.751714
2005 2.950522 1.143572 1.283998 NaN 0.383234 -0.307277
2006 2.157832 NaN NaN NaN NaN NaN
Financials China_Agg
Year
2001 NaN NaN
2002 -0.180627 -0.570428
2003 -0.039184 -0.530829
2004 0.301843 -0.842550
2005 -0.333269 -1.170569
2006 NaN NaN
If you want to subtract df1 from df2 instead, you can use rsub instead of sub. It not clear which one you want since you explain that you want df1-df2 but your formula is the opposite.
Related
I cannot find a reason why when I assign scaled variable (which is non NaN) to the original DataFrame I get NaNs even though the index matches (years).
Can anyone help? I am leaving out details which I think are not necessary, happy to provide more details if needed.
So, given the following multi-index dataframe df:
value
country year
Canada 2007 1
2006 2
2005 3
United Kingdom 2007 4
2006 5
And the following series scaled:
2006 99
2007 54
2005 78
dtype: int64
You can assign it as a new column if reindexed and converted to a list first, like this:
df.loc["Canada", "new_values"] = scaled.reindex(df.loc["Canada", :].index).to_list()
print(df.loc["Canada", :])
# Output
value new_values
year
2007 1 54.0
2006 2 99.0
2005 3 78.0
I have a countrydf as below, in which each cell in the country column contains a list of the countries where the movie was released.
countrydf
id Country release_year
s1 [US] 2020
s2 [South Africa] 2021
s3 NaN 2021
s4 NaN 2021
s5 [India] 2021
I want to make a new df which look like this:
country_yeardf
Year US UK Japan India
1925 NaN NaN NaN NaN
1926 NaN NaN NaN NaN
1927 NaN NaN NaN NaN
1928 NaN NaN NaN NaN
It has the release year and the number of movies released in each country.
My solution is that: with a blank df like the second one, run a for loop to count the number of movies released and then modify the value in the cell relatively.
countrylist=['Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', ….]
for x in countrylist:
for j in list(range(0,8807)):
if x in countrydf.country[j]:
t=int (countrydf.release_year[j] )
country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
an error occurred which read:
TypeError Traceback (most recent call last)
<ipython-input-25-225281f8759a> in <module>()
1 for x in countrylist:
2 for j in li:
----> 3 if x in countrydf.country[j]:
4 t=int(countrydf.release_year[j])
5 country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
TypeError: argument of type 'float' is not iterable
I don’t know which one is of float type here, I have check the type of countrydf.country[j] and it returned int.
I was using pandas and I am just getting started with it. Can anyone please explain the error and suggest a solution for a df that I want to create?
P/s: my English is not so good so hop you guys understand.
Here is a solution using groupby
df = pd.DataFrame([['US', 2015], ['India', 2015], ['US', 2015], ['Russia', 2016]], columns=['country', 'year'])
country year
0 US 2015
1 India 2015
2 US 2015
3 Russia 2016
Now just groupby country and year and unstack the output:
df.groupby(['year', 'country']).size().unstack()
country India Russia US
year
2015 1.0 NaN 2.0
2016 NaN 1.0 NaN
Some alternative ways to achieve this in pandas without loops.
If the Country Column have more than 1 value in the list in each row, you can try the below:
>>df['Country'].str.join("|").str.get_dummies().groupby(df['release_year']).sum()
India South Africa US
release_year
2020 0 0 1
2021 1 1 0
Else if Country has just 1 value per row in the list as you have shown in the example, you can use crosstab
>>pd.crosstab(df['release_year'],df['Country'].str[0])
Country India South Africa US
release_year
2020 0 0 1
2021 1 1 0
I have a pivot table and I want to plot the values for the 12 months of each year for each town.
2010-01 2010-02 2010-03
City RegionName
Atlanta Downtown NaN NaN NaN
Midtown 194.263702 196.319964 197.946962
Alexandria Alexandria NaN NaN NaN
West
Landmark- NaN NaN NaN
Van Dom
How can I select only the values for each region of each town? I thought maybe it would be better to change the column names with years and months to datetime format and set them as index. How can I do this?
The result must be:
City RegionName
2010-01 Atlanta Downtown NaN
Midtown 194.263702
Alexandria Alexandria NaN
West
Landmark- NaN
Van Dom
Here's some similar dummy data to play with:
idx = pd.MultiIndex.from_arrays([['A','A', 'B','C','C'],
['A1','A2','B1','C1','C2']], names=['City','Region'])
idcol = pd.date_range('2012-01', freq='M', periods=12)
df = pd.DataFrame(np.random.rand(5,12), index=idx, columns=[t.strftime('%Y-%m') for t in idcol])
Let's see what we've got:
print(df.ix[:,:3])
2012-01 2012-02 2012-03
City Region
A A1 0.513709 0.941354 0.133290
A2 0.734199 0.005218 0.068914
B B1 0.043178 0.124049 0.603469
C C1 0.721248 0.483388 0.044008
C2 0.784137 0.864326 0.450250
Let's convert these to a datetime: df.columns = pd.to_datetime(df.columns)
Now to plot you just need to transpose:
df.T.plot()
Update after your updated your question:
Use stack, and then reorder if you want:
df = df.stack().reorder_levels([2,0,1])
df.head()
City Region
2012-01-01 A A1 0.513709
2012-02-01 A A1 0.941354
2012-03-01 A A1 0.133290
2012-04-01 A A1 0.324518
2012-05-01 A A1 0.554125
I want to create a dataframe from an aggregate function. I thought that it would create by default a dataframe as this solution states, but it creates a series and I don't know why (Converting a Pandas GroupBy object to DataFrame).
The dataframe is from Kaggle's San Francisco Salaries. My code:
df=pd.read_csv('Salaries.csv')
in: type(df)
out: pandas.core.frame.DataFrame
in: df.head()
out: EmployeeName JobTitle TotalPay TotalPayBenefits Year Status 2BasePay 2OvertimePay 2OtherPay 2Benefits 2Year
0 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 567595.43 567595.43 2011 NaN 167411.18 0.00 400184.25 NaN 2011-01-01
1 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 538909.28 538909.28 2011 NaN 155966.02 245131.88 137811.38 NaN 2011-01-01
2 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 335279.91 335279.91 2011 NaN 212739.13 106088.18 16452.60 NaN 2011-01-01
3 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 332343.61 332343.61 2011 NaN 77916.00 56120.71 198306.90 NaN 2011-01-01
4 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 326373.19 326373.19 2011 NaN 134401.60 9737.00 182234.59 NaN 2011-01-01
in: df2=df.groupby(['JobTitle'])['TotalPay'].mean()
type(df2)
out: pandas.core.series.Series
I want df2 to be a dataframe with the columns 'JobTitle' and 'TotalPlay'
Breaking down your code:
df2 = df.groupby(['JobTitle'])['TotalPay'].mean()
The groupby is fine. It's the ['TotalPay'] that is the misstep. That is telling the groupby to only execute the the mean function on the pd.Series df['TotalPay'] for each group defined in ['JobTitle']. Instead, you want to refer to this column with [['TotalPay']]. Notice the double brackets. Those double brackets say pd.DataFrame.
Recap
df2 = df2=df.groupby(['JobTitle'])[['TotalPay']].mean()
I have a dataframe with multiple index and would like to create a rolling sum of some data, but for each id in the index.
For instance, let us say I have two indexes (Firm and Year) and I have some data with name zdata. The working example is the following:
import pandas as pd
# generating data
firms = ['firm1']*5+['firm2']*5
years = [2000+i for i in range(5)]*2
zdata = [1 for i in range(10)]
# Creating the dataframe
mydf = pd.DataFrame({'firms':firms,'year':years,'zdata':zdata})
# Setting the two indexes
mydf.set_index(['firms','year'],inplace=True)
print(mydf)
zdata
firms year
firm1 2000 1
2001 1
2002 1
2003 1
2004 1
firm2 2000 1
2001 1
2002 1
2003 1
2004 1
And now, I would like to have a rolling sum that starts over for each firm. However, if I type
new_rolling_df=mydf.rolling(window=2).sum()
print(new_rolling_df)
zdata
firms year
firm1 2000 NaN
2001 2.0
2002 2.0
2003 2.0
2004 2.0
firm2 2000 2.0
2001 2.0
2002 2.0
2003 2.0
2004 2.0
It doesn't take into account the multiple index and just make a normal rolling sum. Anyone has an idea how I should do (especially since I have even more indexes than 2 (firm, worker, country, year)
Thanks,
Adrien
Option 1
mydf.unstack(0).rolling(2).sum().stack().swaplevel(0, 1).sort_index()
Option 2
mydf.groupby(level=0, group_keys=False).rolling(2).sum()