ignoring hierarchical index during matrix operations - python

In the last statement of this routine I get a TypeError:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Missouri'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'items': [5, 12, 6, 45, 0]}
frame = DataFrame(data)

def summary_pivot(df, row=['state'], column=['year'], value=['items'], func=len):
    return df.pivot_table(value, rows=row, cols=column,
                          margins=True, aggfunc=func, fill_value=0)

test = summary_pivot(frame)
In [545]: test
Out[545]:
          items
year       2000  2001  2002  All
state
Missouri      0     0     1    1
Nevada        0     1     0    1
Ohio          1     1     1    3
All           1     2     2    5
price = DataFrame(index=['Missouri', 'Ohio'], columns = ['price'], data = [200, 250])
In [546]: price
Out[546]:
          price
Missouri    200
Ohio        250
test * price
TypeError: can only call with other hierarchical index objects
How can I get past this error, so that I can correctly multiply the number of items in each state by the corresponding price?

In [659]: price = Series(index = ['Missouri', 'Ohio'], data = [200, 250])
In [660]: test1 = test.items
In [661]: test1.mul(price, axis='index')
Out[661]:
year      2000  2001  2002  All
All        NaN   NaN   NaN  NaN
Missouri     0     0   200  200
Nevada     NaN   NaN   NaN  NaN
Ohio       250   250   250  750
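For reference, the rows=/cols= keywords in the question come from an old pandas release; current versions of pivot_table take index= and columns= instead. A rough sketch of the same approach on a recent pandas (assuming frame from the question; untested against every version):

import pandas as pd

def summary_pivot(df, row=['state'], column=['year'], value=['items'], func=len):
    # index=/columns= replace the old rows=/cols= keywords
    return df.pivot_table(value, index=row, columns=column,
                          margins=True, aggfunc=func, fill_value=0)

test = summary_pivot(frame)
# test has hierarchical columns ('items', 2000), ('items', 2001), ...
# Selecting the 'items' level leaves plain year columns, after which
# row-wise multiplication aligns the price Series on the state index:
price = pd.Series([200, 250], index=['Missouri', 'Ohio'])
result = test['items'].mul(price, axis=0)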

Related

Group by, Pivot with multiple columns and condition count

I have the dataframe:
df = pd.DataFrame({
    "Agreement": ["Peace", "Peace", "Love", "Love", "Sun", "Sun", "Sun"],
    "country1": ["USA", "UK", "Germany", "Spain", "Italy", "India", "China"],
    "country2": ["Canada", "France", "Portugal", "Italy", "India", "Spain", "UK"],
    "EP1": [1, 0, 1, 0, 0, 1, 1],
    "EP2": [0, 0, 0, 0, 0, 0, 0],
    "EP3": [1, 0, 1, 0, 1, 1, 1]
})
I would like to group by or pivot so that I get the count of times a country is in an agreement with at least one EP equal to or greater than 1. I would like this as output:
df = pd.DataFrame({
    "Country": ["USA", "UK", "Germany", "Spain", "Italy", "India", "China", "Canada", "France", "Portugal"],
    "Agreement with at least one EP per country": [1, 1, 1, 1, 1, 2, 1, 1, 0, 1]
})
I have tried with pivot, group by, and loops, but I never reach the desired output. Thanks
Summarize the 'EPx' columns into a single boolean 'Agreement' flag, then flatten your dataframe so that each country gets its own row. Finally, group by 'Country' to count the number of agreements.
cols = ['country1', 'country2', 'Agreement']
out = (df.assign(Agreement=df.filter(like='EP').any(axis=1))[cols]
         .melt('Agreement', value_name='Country')
         .groupby('Country', sort=False)['Agreement'].sum().reset_index())
print(out)
# Output
    Country  Agreement
0       USA          1
1        UK          1
2   Germany          1
3     Spain          1
4     Italy          1
5     India          2
6     China          1
7    Canada          1
8    France          0
9  Portugal          1
Update
I am interested in the count of times a country is in a unique agreement with at least one EP equal to or greater than 1.
cols = ['country1', 'country2', 'Agreement']
out = (df.assign(Agreement=df.filter(like='EP').any(axis=1))[cols]
         .melt('Agreement', value_name='Country')
         .groupby('Country', sort=False)['Agreement'].max().astype(int).reset_index())
print(out)
# Output
    Country  Agreement
0       USA          1
1        UK          1
2   Germany          1
3     Spain          1
4     Italy          1
5     India          1
6     China          1
7    Canada          1
8    France          0
9  Portugal          1
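For readers less comfortable with melt, an equivalent sketch (not taken from the answer above) is to stack the two country columns with concat and then aggregate; it assumes the df from the question and a reasonably recent pandas:

import pandas as pd

flag = df.filter(like='EP').any(axis=1)   # True if any EP column is >= 1
long_df = pd.concat([
    df[['country1']].rename(columns={'country1': 'Country'}).assign(Agreement=flag),
    df[['country2']].rename(columns={'country2': 'Country'}).assign(Agreement=flag),
])
out = long_df.groupby('Country', sort=False)['Agreement'].sum().reset_index()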

How to Lookup from Different Dataframe into a Middle Column of First Dataframe

I have 2 DataFrames from 2 different csv files, and both files have, let's say, 5 columns each. I need to look up 1 column from the second DataFrame into the first DataFrame so that the first DataFrame will have 6 columns, with the lookup done using the ID.
Example are as below:
import pandas as pd

data = [[6661, 'Lily', 21, 5000, 'USA'], [6672, 'Mark', 32, 32500, 'Canada'], [6631, 'Rose', 20, 1500, 'London'],
        [6600, 'Jenifer', 42, 50000, 'London'], [6643, 'Sue', 27, 8000, 'Turkey']]
ds_main = pd.DataFrame(data, columns=['ID', 'Name', 'Age', 'Income', 'Country'])

data2 = [[6672, 'Mark', 'Shirt', 8.5, 2], [6643, 'Sue', 'Scraft', 2.0, 5], [6661, 'Lily', 'Blouse', 11.9, 2],
         [6600, 'Jenifer', 'Shirt', 9.8, 1], [6631, 'Rose', 'Pants', 4.5, 2]]
ds_rate = pd.DataFrame(data2, columns=['ID', 'Name', 'Product', 'Rate', 'Quantity'])
I wanted to look up the 'Rate' from ds_rate into ds_main. However, I wanted the rate to be placed in the middle of the ds_main DataFrame, so the result should have the 'Rate' column inserted between 'Age' and 'Income'.
I have tried using merge and insert, but I was still unable to get the result that I wanted. Is there any easy way to do it?
You could use set_index + loc to get "Rate" ordered according to the "ID" column of ds_main, then insert it:
ds_main.insert(3, 'Rate', ds_rate.set_index('ID')['Rate'].loc[ds_main['ID']].reset_index(drop=True))
Output:
     ID     Name  Age  Rate  Income Country
0  6661     Lily   21  11.9    5000     USA
1  6672     Mark   32   8.5   32500  Canada
2  6631     Rose   20   4.5    1500  London
3  6600  Jenifer   42   9.8   50000  London
4  6643      Sue   27   2.0    8000  Turkey
Assuming 'ID' is unique:
ds_main.iloc[:, :3].merge(ds_rate[['ID', 'Rate']]).join(ds_main.iloc[:, 3:])
     ID     Name  Age  Rate  Income Country
0  6661     Lily   21  11.9    5000     USA
1  6672     Mark   32   8.5   32500  Canada
2  6631     Rose   20   4.5    1500  London
3  6600  Jenifer   42   9.8   50000  London
4  6643      Sue   27   2.0    8000  Turkey
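Another sketch of the same idea (not from either answer above): build a Series keyed by 'ID' and use map, which again assumes 'ID' is unique in ds_rate:

rate_by_id = ds_rate.set_index('ID')['Rate']          # Rate indexed by ID
ds_main.insert(3, 'Rate', ds_main['ID'].map(rate_by_id))  # insert after 'Age'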

Operations on multiple data frame in PANDAS

I have several tables that look like this:
ID   YY   ZZ
2    97  826
2    78  489
4    47  751
4   110  322
6    67  554
6    88  714
code:
raw = {'ID': [2, 2, 4, 4, 6, 6],
       'YY': [97, 78, 47, 110, 67, 88],
       'ZZ': [826, 489, 751, 322, 554, 714]}
df = pd.DataFrame(raw)
For each of these dfs, I have to perform a number of operations: first group by ID, then extract the length and the mean of the ZZ column, and put the results into a new df.
The new df looks like this:
Cities  length  mean
Paris        0     0
Madrid       0     0
Berlin       0     0
Warsaw       0     0
London       0     0
code:
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
        'length': 0,
        'mean': 0}
df2 = pd.DataFrame(raw2)
I pulled out the average and the size of the individual groups:
df_grouped = df.groupby('ID').ZZ.size()
df_grouped2 = df.groupby('ID').ZZ.mean()
The problem occurs when trying to transfer the results to the new table, because it does not contain all the cities and the results must be matched by the appropriate key.
I tried to use a dictionary:
dic_cities = {"Paris": df_grouped.loc[2],
              "Madrid": df_grouped.loc[4],
              "Warsaw": df_grouped.loc[6],
              "Berlin": df_grouped.loc[8],
              "London": df_grouped.loc[10]}
Unfortunately, I'm receiving KeyError: 8
I have 19 dfs from which I have to extract this data, and the final tables have to look like this:
Cities  length   mean
Paris        2  657.5
Madrid       2  536.5
Berlin       0    0.0
Warsaw       2  634.0
London       0    0.0
Does anyone know how to deal with this using groupby and the dictionary, or know a better way to do it?
First, you should index df2 on 'Cities':
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
        'length': 0,
        'mean': 0}
df2 = pd.DataFrame(raw2).set_index('Cities')
Then you should reverse your dictionary:
dic_cities = {2: "Paris",
              4: "Madrid",
              6: "Warsaw",
              8: "Berlin",
              10: "London"}
Once this is done, the processing is as simple as a groupby:
import numpy as np

for i, sub in df.groupby('ID'):
    # write the [length, mean] of ZZ into the row of the matching city
    df2.loc[dic_cities[i]] = sub.ZZ.agg([len, np.mean]).tolist()
Which gives for df2:
        length   mean
Cities
Paris      2.0  657.5
Madrid     2.0  536.5
Berlin     0.0    0.0
Warsaw     2.0  634.0
London     0.0    0.0
See this:
import pandas as pd
# setup raw data
raw = {'ID': [2, 2, 4, 4, 6, 6,], 'YY': [97,78,47,110,67,88], 'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
# get mean values
mean_values = df.groupby('ID').mean()
# drop column
mean_values = mean_values.drop(['YY'], axis=1)
# get occurrence number
occurrence = df.groupby('ID').size()
# save data
result = pd.concat([occurrence, mean_values], axis=1, sort=False)
# rename columns
result.rename(columns={0:'length', 'ZZ':'mean'}, inplace=True)
# city data
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'], 'length': 0, 'mean': 0}
df2 = pd.DataFrame(raw2)
# rename indexes
df2 = df2.rename(index={0: 2, 1: 4, 2: 8, 3: 6, 4: 10})
# merge data
df2['length'] = result['length']
df2['mean'] = result['mean']
Output:
    Cities  length   mean
2    Paris     2.0  657.5
4   Madrid     2.0  536.5
8   Berlin     NaN    NaN
6   Warsaw     2.0  634.0
10  London     NaN    NaN
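A fully vectorized sketch of the same task (not taken from either answer above) is to aggregate once with groupby, relabel the IDs with the ID -> city mapping, and reindex so that cities with no rows get length 0 and mean 0; it assumes pandas >= 0.25 for the named aggregation and the df from the question:

dic_cities = {2: "Paris", 4: "Madrid", 6: "Warsaw", 8: "Berlin", 10: "London"}
stats = (df.groupby('ID')['ZZ']
           .agg(length='size', mean='mean')   # named aggregation
           .rename(index=dic_cities)
           .reindex(['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'], fill_value=0)
           .rename_axis('Cities'))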

iterate through df column and return value in dataframe based on row index, column reference

My goal is to compare each value from the column "year" against the appropriate year column (e.g. 1999, 2000). I then want to return the corresponding value from the corresponding column. For example, for Afghanistan (first row), year 2004, I want to find the column named "2004" and return the value from the row that contains Afghanistan.
Here is the table. For reference, this table is the result of a SQL join between educational attainment in a single defined year and a table of GDP per country for the years 1999 - 2010. My ultimate goal is to return the GDP from the year that the educational data is from.
country year men_ed_yrs women_ed_yrs total_ed_yrs 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
0 Afghanistan 2004 11 5 8 NaN NaN 2461666315 4128818042 4583648922 5285461999 6.275076e+09 7.057598e+09 9.843842e+09 1.019053e+10 1.248694e+10 1.593680e+10
1 Albania 2004 11 11 11 3414760915 3632043908 4060758804 4435078648 5746945913 7314865176 8.158549e+09 8.992642e+09 1.070101e+10 1.288135e+10 1.204421e+10 1.192695e+10
2 Algeria 2005 13 13 13 48640611686 54790060513 54744714110 56760288396 67863829705 85324998959 1.030000e+11 1.170000e+11 1.350000e+11 1.710000e+11 1.370000e+11 1.610000e+11
3 Andorra 2008 11 12 11 1239840270 1401694156 1484004617 1717563533 2373836214 2916913449 3.248135e+09 3.536452e+09 4.010785e+09 4.001349e+09 3.649863e+09 3.346317e+09
4 Anguilla 2008 11 11 11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
gdp_ed_list = []
for value in df_combined_column_named['year']:  # loops through each year in year column
    if value in df_combined_column_named.columns:  # compares year to column names
        idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist()  # supposed to get the index associated with value
        gdp_ed = df_combined_column_named.get_value(idx, value)  # get the value of the cell found at idx, value
        gdp_ed_list.append(gdp_ed)  # append to a list
Currently, my code is getting stuck at the .index.tolist() line. It is returning the error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-85-361acb97edd4> in <module>()
      2 for value in df_combined_column_named['year']: #loops through each year in year column
      3     if value in df_combined_column_named.columns: #compares year to column names
----> 4         idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist()
      5         gdp_ed = df_combined_column_named.get_value(idx, value)
      6         gdp_ed_list.append(gdp_ed)
KeyError: u'2004'
Any thoughts?
It looks like you are trying to match the value in the year column to the column labels and then extract the value in the corresponding cells. You could do that by looping through the rows (see below), but I think it would not be the fastest way.
Instead, you could use pd.melt to coalesce the columns with year-like labels into a single column, say, year_col:
In [38]: melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs', 'total_ed_yrs'], var_name='year_col')
In [39]: melted
Out[39]:
       country  year  men_ed_yrs  women_ed_yrs  total_ed_yrs year_col         value
0  Afghanistan  2004          11             5             8     1999           NaN
1      Albania  2004          11            11            11     1999  3.414761e+09
2      Algeria  2005          13            13            13     1999  4.864061e+10
3      Andorra  2008          11            12            11     1999  1.239840e+09
4     Anguilla  2008          11            11            11     1999           NaN
5  Afghanistan  2004          11             5             8     2000           NaN
...
The benefit of "melting" the DataFrame in this way is that you now have both year and year_col columns. The values you are looking for are in the rows where year equals year_col, and that is easy to obtain by using .loc:
In [41]: melted.loc[melted['year'] == melted['year_col']]
Out[41]:
         country  year  men_ed_yrs  women_ed_yrs  total_ed_yrs year_col  \
25   Afghanistan  2004          11             5             8     2004
26       Albania  2004          11            11            11     2004
32       Algeria  2005          13            13            13     2005
48       Andorra  2008          11            12            11     2008
49      Anguilla  2008          11            11            11     2008

           value
25  5.285462e+09
26  7.314865e+09
32  1.030000e+11
48  4.001349e+09
49           NaN
Thus, you could use
import numpy as np
import pandas as pd
nan = np.nan
df = pd.DataFrame({'1999': [nan, 3414760915.0, 48640611686.0, 1239840270.0, nan],
                   '2000': [nan, 3632043908.0, 54790060513.0, 1401694156.0, nan],
                   '2001': [2461666315.0, 4060758804.0, 54744714110.0, 1484004617.0, nan],
                   '2002': [4128818042.0, 4435078648.0, 56760288396.0, 1717563533.0, nan],
                   '2003': [4583648922.0, 5746945913.0, 67863829705.0, 2373836214.0, nan],
                   '2004': [5285461999.0, 7314865176.0, 85324998959.0, 2916913449.0, nan],
                   '2005': [6275076000.0, 8158549000.0, 103000000000.0, 3248135000.0, nan],
                   '2006': [7057598000.0, 8992642000.0, 117000000000.0, 3536452000.0, nan],
                   '2007': [9843842000.0, 10701010000.0, 135000000000.0, 4010785000.0, nan],
                   '2008': [10190530000.0, 12881350000.0, 171000000000.0, 4001349000.0, nan],
                   '2009': [12486940000.0, 12044210000.0, 137000000000.0, 3649863000.0, nan],
                   '2010': [15936800000.0, 11926950000.0, 161000000000.0, 3346317000.0, nan],
                   'country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Anguilla'],
                   'men_ed_yrs': [11, 11, 13, 11, 11],
                   'total_ed_yrs': [8, 11, 13, 11, 11],
                   'women_ed_yrs': [5, 11, 13, 12, 11],
                   'year': ['2004', '2004', '2005', '2008', '2008']})

melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs',
                              'total_ed_yrs'], var_name='year_col')
result = melted.loc[melted['year'] == melted['year_col']]
print(result)
Why was a KeyError raised:
The KeyError is being raised by df_combined_column_named['year'][value]. Suppose value is '2004'. Then df_combined_column_named['year'] is a Series containing string representations of years and indexed by integers (like 0, 1, 2, ...). df_combined_column_named['year'][value] fails because it attempts to index this Series with the string '2004' which is not in the integer index.
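A tiny illustration of that failure mode, with a made-up Series rather than the question's data (pandas imported as pd, as above):

year = pd.Series(['2004', '2004', '2005'])   # default integer index 0, 1, 2
year['2004']                                  # raises KeyError: '2004' is not in the index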
Alternatively, here is another way to achieve the goal by looping through the rows using iterrows. This is perhaps simpler to understand, but in general using iterrows is slow compared to other column-based Pandas-centric methods:
data = []
for idx, row in df.iterrows():
    # row[row['year']] picks the cell in the column whose label equals this row's year
    data.append((row['country'], row['year'], row[row['year']]))

result = pd.DataFrame(data, columns=['country', 'year', 'value'])
print(result)
prints
       country  year         value
0  Afghanistan  2004  5.285462e+09
1      Albania  2004  7.314865e+09
2      Algeria  2005  1.030000e+11
3      Andorra  2008  4.001349e+09
4     Anguilla  2008           NaN
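If the loop becomes a bottleneck, a vectorized sketch of the same per-row lookup (assuming the df built above, with the GDP columns labelled by year strings) is:

import numpy as np

col_pos = df.columns.get_indexer(df['year'])           # column position of each row's year
values = df.to_numpy()[np.arange(len(df)), col_pos]    # pick that cell from each row
result = pd.DataFrame({'country': df['country'], 'year': df['year'], 'value': values})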

How to encode a categorical variable (series) in the data frame in Python?

I have a dictionary of the following form:
{'CA': 'California', 'NV': 'Nevada', 'TX': 'Texas'}
I want to transform my data frame
{
    'state': ['California', 'California', 'Texas', 'Nevada', 'Texas'],
    'var': [100, 200, 300, 400, 500]
}
into
{
    'state': ['CA', 'CA', 'TX', 'NV', 'TX'],
    'var': [100, 200, 300, 400, 500]
}
What's the best way to do this?
If you reverse the keys and values in your dict, then you can just use map:
# to swap the keys and values:
new_map = dict(zip(my_dict.values(), my_dict.keys()))
then call map:
df.state = df.state.map(new_map)
This assumes that every value is present in the mapping; any value that is not found will end up as NaN.
So create dataframe:
In [12]:
df = pd.DataFrame({
    'state': ['California', 'California', 'Texas', 'Nevada', 'Texas'],
    'var': [100, 200, 300, 400, 500]
})
df
Out[12]:
        state  var
0  California  100
1  California  200
2       Texas  300
3      Nevada  400
4       Texas  500
[5 rows x 2 columns]
your dict:
my_dict = {'CA': 'California', 'NV': 'Nevada', 'TX': 'Texas'}
reverse the keys and values
new_dict = dict(zip(my_dict.values(), my_dict.keys()))
now call map to perform the lookup and assign back to state:
In [13]:
df.state = df.state.map(new_dict)
df
Out[13]:
  state  var
0    CA  100
1    CA  200
2    TX  300
3    NV  400
4    TX  500
[5 rows x 2 columns]
If you are worried that some values may not exist, then you can use get on the dict inside the map, so that missing values are handled gracefully and assigned None:
Set up a new df with 'New York':
In [19]:
df = pd.DataFrame({
    'state': ['California', 'California', 'Texas', 'Nevada', 'Texas', 'New York'],
    'var': [100, 200, 300, 400, 500, 600]
})
df
df
Out[19]:
        state  var
0  California  100
1  California  200
2       Texas  300
3      Nevada  400
4       Texas  500
5    New York  600
[6 rows x 2 columns]
Now call get instead:
In [25]:
df.state = df.state.map(lambda x: new_dict.get(x))
df
Out[25]:
  state  var
0    CA  100
1    CA  200
2    TX  300
3    NV  400
4    TX  500
5  None  600
[6 rows x 2 columns]
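As a small variation (a sketch, not part of the answer above): if you would rather keep the original label than get None for unmapped values, you can fall back with fillna, starting again from the frame built in In [19]:

df.state = df.state.map(new_dict).fillna(df.state)   # 'New York' stays 'New York'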
