Transpose subset of pandas dataframe into multi-indexed data frame - python

I have the following dataframe (shown in the original post as a screenshot of df.head(14)):
I'd like to transpose just the yr column and the ['WA_','BA_','IA_','AA_','NA_','TOM_']
variables by Label. The resulting dataframe should then be a multi-indexed frame with Label and WA_, BA_, etc. as the row index, and the column names will be 2010, 2011, etc. I've tried
transpose(), groupby(), pivot_table(), wide_to_long(),
and before I roll my own nested loop going line by line through this df I thought I'd ping the community (the desired per-Label layout was shown as a screenshot in the original post).
I feel like the answer is in one of those functions but I'm just missing it. Thanks for your help!

From what I can tell from your screenshots, you want WA_, BA_, etc. as rows and yr as columns, with Label remaining as a row index. If so, consider stack() and unstack():
import numpy as np
import pandas as pd

# sample data
labels = ["Albany County", "Big Horn County"]
n_per_label = 7
n_rows = n_per_label * len(labels)
years = np.arange(2010, 2017)
min_val = 10000
max_val = 40000
data = {"Label": sorted(np.array(labels * n_per_label)),
        "WA_": np.random.randint(min_val, max_val, n_rows),
        "BA_": np.random.randint(min_val, max_val, n_rows),
        "IA_": np.random.randint(min_val, max_val, n_rows),
        "AA_": np.random.randint(min_val, max_val, n_rows),
        "NA_": np.random.randint(min_val, max_val, n_rows),
        "TOM_": np.random.randint(min_val, max_val, n_rows),
        "yr": np.append(years, years)}
df = pd.DataFrame(data)
AA_ BA_ IA_ NA_ TOM_ WA_ Label yr
0 27757 23138 10476 20047 34015 12457 Albany County 2010
1 37135 30525 12296 22809 27235 29045 Albany County 2011
2 11017 16448 17955 33310 11956 19070 Albany County 2012
3 24406 21758 15538 32746 38139 39553 Albany County 2013
4 29874 33105 23106 30216 30176 13380 Albany County 2014
5 24409 27454 14510 34497 10326 29278 Albany County 2015
6 31787 11301 39259 12081 31513 13820 Albany County 2016
7 17119 20961 21526 37450 14937 11516 Big Horn County 2010
8 13663 33901 12420 27700 30409 26235 Big Horn County 2011
9 37861 39864 29512 24270 15853 29813 Big Horn County 2012
10 29095 27760 12304 29987 31481 39632 Big Horn County 2013
11 26966 39095 39031 26582 22851 18194 Big Horn County 2014
12 28216 33354 35498 23514 23879 17983 Big Horn County 2015
13 25440 28405 23847 26475 20780 29692 Big Horn County 2016
Now set Label and yr as indices.
df.set_index(["Label","yr"], inplace=True)
From here, unstack() will pivot the inner-most index to columns. Then, stack() can swing our value columns down into rows.
df.unstack().stack(level=0)
yr 2010 2011 2012 2013 2014 2015 2016
Label
Albany County AA_ 27757 37135 11017 24406 29874 24409 31787
BA_ 23138 30525 16448 21758 33105 27454 11301
IA_ 10476 12296 17955 15538 23106 14510 39259
NA_ 20047 22809 33310 32746 30216 34497 12081
TOM_ 34015 27235 11956 38139 30176 10326 31513
WA_ 12457 29045 19070 39553 13380 29278 13820
Big Horn County AA_ 17119 13663 37861 29095 26966 28216 25440
BA_ 20961 33901 39864 27760 39095 33354 28405
IA_ 21526 12420 29512 12304 39031 35498 23847
NA_ 37450 27700 24270 29987 26582 23514 26475
TOM_ 14937 30409 15853 31481 22851 23879 20780
WA_ 11516 26235 29813 39632 18194 17983 29692
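Since you mentioned pivot_table(): melting first and then pivoting gives the same shape. This is a sketch against the same sample data (run it before the set_index call), not tested against your real frame:
# melt the value columns long (melt names the former headers 'variable'),
# then pivot yr out to columns
out = (df.melt(id_vars=["Label", "yr"])
         .pivot_table(index=["Label", "variable"], columns="yr", values="value"))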

Related

calculate bad month from the given csv

I tried finding the five worst months from the data, but I'm not sure about the process and I'm very confused. The answer should be something like (June 2001, July 2002), but when I tried to solve it my answer wasn't as expected: only the data of January was sorted. This is the way I tried solving my question; the csv data file is also provided in the screenshot.
My solution is given below:
import pandas as pd

PATH = "tourist_arrival.csv"
df = pd.read_csv(PATH)
print(df.sort_values(by=['Jan.','Feb.','Mar.','Apr.','May.','Jun.','Jul.','Aug.','Sep.','Oct.','Nov.','Dec.'],
                     ascending=False))
Year,Jan.,Feb.,Mar.,Apr.,May.,Jun.,Jul.,Aug.,Sep.,Oct.,Nov.,Dec.,Total
1992,17451,27489,31505,30682,29089,22469,20942,27338,24839,42647,32341,27561,334353
1993,19238,23931,30818,20121,20585,19602,13588,21583,23939,42242,30378,27542,293567
1994,21735,24872,31586,27292,26232,22907,19739,27610,27959,39393,28008,29198,326531
1995,22207,28240,34219,33994,27843,25650,23980,27686,30569,46845,35782,26380,363395
1996,27886,29676,39336,36331,29728,26749,22684,29080,32181,47314,37650,34998,393613
1997,25585,32861,43177,35229,33456,26367,26091,35549,31981,56272,40173,35116,421857
1998,28822,37956,41338,41087,35814,29181,27895,36174,39664,62487,47403,35863,463684
1999,29752,38134,46218,40774,42712,31049,27193,38449,44117,66543,48865,37698,491504
2000,25307,38959,44944,43635,28363,26933,24480,34670,43523,59195,52993,40644,463646
2001,30454,38680,46709,39083,28345,13030,18329,25322,31170,41245,30282,18588,361237
2002,17176,20668,28815,21253,19887,17218,16621,21093,23752,35272,28723,24990,275468
2003,21215,24349,27737,25851,22704,20351,22661,27568,28724,45459,38398,33115,338132
2004,30988,35631,44290,33514,26802,19793,24860,33162,25496,43373,36381,31007,385297
2005,25477,20338,29875,23414,25541,22608,23996,36910,36066,51498,41505,38170,375398
2006,28769,25728,36873,21983,22870,26210,25183,33150,33362,49670,44119,36009,383926
2007,33192,39934,54722,40942,35854,31316,35437,44683,45552,70644,52273,42156,526705
2008,36913,46675,58735,38475,30410,24349,25427,40011,41622,66421,52399,38840,500277
2009,29278,40617,49567,43337,30037,31749,30432,44174,42771,72522,54423,41049,509956
2010,33645,49264,63058,45509,32542,33263,38991,54672,54848,79130,67537,50408,602867
2011,42622,56339,67565,59751,46202,46115,42661,71398,63033,96996,83460,60073,736215
2012,52501,66459,89151,69796,50317,53630,49995,71964,66383,86379,83173,63344,803092
2013,47846,67264,88697,65152,52834,54599,54011,68478,66755,99426,75485,57069,797616
melt your DataFrame and then sort_values:
output = (df.melt("Year", df.drop(["Year", "Total"], axis=1).columns, var_name="Month")
            .sort_values("value")
            .reset_index(drop=True))
>>> output
Year Month value
0 2001 Jun. 13030
1 1993 Jul. 13588
2 2002 Jul. 16621
3 2002 Jan. 17176
4 2002 Jun. 17218
.. ... ... ...
259 2012 Oct. 86379
260 2013 Mar. 88697
261 2012 Mar. 89151
262 2011 Oct. 96996
263 2013 Oct. 99426
[264 rows x 3 columns]
For just the 5 worst months, you can do:
>>> output.iloc[:5]
Year Month value
0 2001 Jun. 13030
1 1993 Jul. 13588
2 2002 Jul. 16621
3 2002 Jan. 17176
4 2002 Jun. 17218
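If you prefer not to sort the whole frame, DataFrame.nsmallest can pull the five smallest values directly; a sketch on the same data:
# drop the Total column, melt the months long, then take the 5 smallest values
worst5 = (df.drop(columns="Total")
            .melt("Year", var_name="Month")
            .nsmallest(5, "value"))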

Find top n elements in pandas dataframe column by keeping the grouping

I am trying to find the top 5 elements of the column total_petitions, but keeping the ordered grouping I did.
df = df[['fy', 'EmployerState', 'total_petitions']]
table = df.groupby(['fy','EmployerState']).mean()
table.nlargest(5, 'total_petitions')
sample output:
fy EmployerState total_petitions
2020 WA 7039.333333
2016 MD 2647.400000
2017 MD 2313.142857
... TX 2305.541667
2020 TX 2081.952381
desired output:
fy EmployerState total_petitions
2016 AL 3.875000
AR 225.333333
AZ 26.666667
CA 326.056604
CO 21.333333
... ... ...
2020 VA 36.714286
WA 7039.333333
WI 43.750000
WV 8986086.08
WY 1.000000
i.e., for each year, the total_petitions rows for the 5 states with the highest means
What you are looking for is a pivot table:
df = df.pivot_table(values='total_petitions', index=['fy', 'EmployerState'])
df = (df.groupby(level='fy')['total_petitions']
        .nlargest(5)
        .reset_index(level=0, drop=True)
        .reset_index())
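An alternative sketch with the same column names, using sort_values plus groupby().head() so no index resets are needed:
# mean petitions per (fy, EmployerState), kept as ordinary columns
table = df.groupby(['fy', 'EmployerState'], as_index=False)['total_petitions'].mean()
# sort descending, then keep the first 5 rows of each fy group
top5 = (table.sort_values('total_petitions', ascending=False)
             .groupby('fy').head(5)
             .sort_values(['fy', 'EmployerState']))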

Add a column of repeating numbers to existing dataframe

I have the following dataframe where each row is a unique state-city pair:
State City
NY Albany
NY NYC
MA Boston
MA Cambridge
I want to add a column of years ranging from 2000 to 2018:
State City Year
NY Albany 2000
NY Albany 2001
NY Albany 2002
...
NY Albany 2018
NY NYC 2000
NY NYC 2018
...
MA Cambridge 2018
I know I can create a list of numbers using Year = list(range(2000,2019))
Does anyone know how to put this list as a column in the dataframe for each state-city?
You could try adding it as a list and then performing explode. I think it should work:
df['Year'] = [list(range(2000,2019))] * len(df)
df = df.explode('Year')
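If your pandas is 1.2 or newer, a cross join does the same thing and keeps Year as an integer column (explode leaves it as object dtype); a sketch:
# one row per (State, City) x Year combination
years = pd.DataFrame({'Year': range(2000, 2019)})
df = df.merge(years, how='cross')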
One way is to use the DataFrame.stack() method.
Here is sample of your current data:
data = [['NY', 'Albany'],
['NY', 'NYC'],
['MA', 'Boston'],
['MA', 'Cambridge']]
cities = pd.DataFrame(data, columns=['State', 'City'])
print(cities)
# State City
# 0 NY Albany
# 1 NY NYC
# 2 MA Boston
# 3 MA Cambridge
First, make this into a multi-level index (this will end up in the final dataframe):
cities_index = pd.MultiIndex.from_frame(cities)
print(cities_index)
# MultiIndex([('NY', 'Albany'),
# ('NY', 'NYC'),
# ('MA', 'Boston'),
# ('MA', 'Cambridge')],
# names=['State', 'City'])
Now, make a dataframe with all the years in it (I only use 3 years for brevity):
years = list(range(2000, 2003))
n_cities = len(cities)
years_data = np.repeat(years, n_cities).reshape(len(years), n_cities).T
years_data = pd.DataFrame(years_data, index=cities_index)
years_data.columns.name = 'Year index'
print(years_data)
# Year index 0 1 2
# State City
# NY Albany 2000 2001 2002
# NYC 2000 2001 2002
# MA Boston 2000 2001 2002
# Cambridge 2000 2001 2002
Finally, use stack to transform this dataframe into a vertically-stacked series which I think is what you want:
years_by_city = years_data.stack().rename('Year')
print(years_by_city.head())
# State City Year index
# NY Albany 0 2000
# 1 2001
# 2 2002
# NYC 0 2000
# 1 2001
# Name: Year, dtype: int64
If you want to remove the index and have all the values as a dataframe just do
cities_and_years = years_by_city.reset_index()

create unique identifier in dataframe based on combination of columns, but only for duplicated rows

A corollary of the question here:
create unique identifier in dataframe based on combination of columns
In the foll. dataframe,
id Lat Lon Year Area State
50319 -36.0629 -62.3423 2019 90 Iowa
18873 -36.0629 -62.3423 2017 90 Iowa
18876 -36.0754 -62.327 2017 124 Illinois
18878 -36.0688 -62.3353 2017 138 Kansas
I want to create a new column which assigns a unique identifier based on whether the columns Lat, Lon and Area have the same values. E.g. in this case rows 1 and 2 have the same values in those columns and will be given the same unique identifier 0_Iowa where Iowa comes from the State column. However, if there is no duplicate for a row, then I just want to use the state name. The end result should look like this:
id Lat Lon Year Area State unique_id
50319 -36.0629 -62.3423 2019 90 Iowa 0_Iowa
18873 -36.0629 -62.3423 2017 90 Iowa 0_Iowa
18876 -36.0754 -62.327 2017 124 Illinois Illinois
18878 -36.0688 -62.3353 2017 138 Kansas Kansas
You can use np.where:
df['unique_id'] = np.where(df.duplicated(['Lat', 'Lon'], keep=False),
                           df.groupby(['Lat', 'Lon'], sort=False).ngroup().astype('str') + '_' + df['State'],
                           df['State'])
Or similar idea with pd.Series.where:
df['unique_id'] = (df.groupby(['Lat', 'Lon'], sort=False)
                     .ngroup().astype('str')
                     .add('_' + df['State'])
                     .where(df.duplicated(['Lat', 'Lon'], keep=False),
                            df['State']))
Output:
id Lat Lon Year Area State unique_id
0 50319 -36.0629 -62.3423 2019 90 Iowa 0_Iowa
1 18873 -36.0629 -62.3423 2017 90 Iowa 0_Iowa
2 18876 -36.0754 -62.3270 2017 124 Illinois Illinois
3 18878 -36.0688 -62.3353 2017 138 Kansas Kansas
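For reference, the same logic can be written with Series.mask, using transform('size') to flag the duplicated (Lat, Lon) pairs; a sketch:
# rows whose (Lat, Lon) pair appears more than once
dupes = df.groupby(['Lat', 'Lon'])['State'].transform('size') > 1
# stable group number per (Lat, Lon) pair
gid = df.groupby(['Lat', 'Lon'], sort=False).ngroup().astype(str)
# keep State as-is, but overwrite duplicated rows with '<group>_<State>'
df['unique_id'] = df['State'].mask(dupes, gid + '_' + df['State'])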

Summarize values in pandas data frames

I want to calculate the maximum value for each year and show the sector and that value. For example, from the screenshot, I would like to display:
2010: Telecom 781
2011: Tech 973
I have tried using:
df.groupby(['Year', 'Sector'])['Revenue'].max()
but this does not give me the name of Sector which has the highest value.
Try using idxmax and loc:
df.loc[df.groupby(['Sector','Year'])['Revenue'].idxmax()]
MVCE:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Sector': ['Telecom', 'Tech', 'Financial Service', 'Construction', 'Heath Care'] * 3,
                   'Year': [2010, 2011, 2012, 2013, 2014] * 3,
                   'Revenue': np.random.randint(101, 999, 15)})
df.loc[df.groupby(['Sector','Year'])['Revenue'].idxmax()]
Output:
Sector Year Revenue
3 Construction 2013 423
12 Financial Service 2012 838
9 Heath Care 2014 224
1 Tech 2011 466
5 Telecom 2010 843
Alternatively, use .sort_values + .tail, grouping on just Year (data from @Scott Boston's answer above):
df.sort_values('Revenue').groupby('Year').tail(1)
Output:
Sector Year Revenue
9 Heath Care 2014 224
3 Construction 2013 423
1 Tech 2011 466
12 Financial Service 2012 838
5 Telecom 2010 843
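Since the question only asks for the maximum per year, grouping on Year alone with idxmax also works; a sketch on the same data:
# one row per year: the row holding that year's maximum Revenue
df.loc[df.groupby('Year')['Revenue'].idxmax()]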
