I received the following tables side by side in one Excel file and need to combine them into one table with the column names Name, Blue, Green, Year.
The names are the same in both tables, the indicator is the same throughout each column, and there is one empty column between the two tables.
This is the sheet I received:
Name    Indicator  2016  2017  2018        Name    Indicator  2016  2017  2018
Name 1  blue        524   108   387        Name 1  green        92   872    90
Name 2  blue         77   274    50        Name 2  green       402   312   528
Name 3  blue        201   774    18        Name 3  green       457   827    20
Name 4  blue         35   100   129        Name 4  green       183   428   510
This is how the result should look:
Name    Blue  Green  Year
Name 1   524     92  2016
Name 1   108    872  2017
Name 1   387     90  2018
Name 2    77    402  2016
Name 2   274    312  2017
Name 2    50    528  2018
Name 3   201    457  2016
Name 3   774    827  2017
Name 3    18     20  2018
...
Is there any way to do this with pandas?
You can use merge in pandas to do this.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
Separate the data into two dataframes, call one df_one and the other df_two, then merge:
merged_df = df_one.merge(df_two, how="right")
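To get from there to the exact layout in the question, each block also has to be reshaped from one-column-per-year to one-row-per-year before merging. Below is a minimal sketch; the df_one/df_two literals are stand-ins for however you load the two blocks (e.g. pd.read_excel with a usecols range for each), and only the first two names are shown.

import pandas as pd

# Stand-ins for the two side-by-side blocks of the sheet.
df_one = pd.DataFrame({'Name': ['Name 1', 'Name 2'], 'Indicator': ['blue', 'blue'],
                       2016: [524, 77], 2017: [108, 274], 2018: [387, 50]})
df_two = pd.DataFrame({'Name': ['Name 1', 'Name 2'], 'Indicator': ['green', 'green'],
                       2016: [92, 402], 2017: [872, 312], 2018: [90, 528]})

# Reshape each block from wide (a column per year) to long (a row per year).
blue = (df_one.melt(id_vars=['Name', 'Indicator'], var_name='Year', value_name='Blue')
              .drop(columns='Indicator'))
green = (df_two.melt(id_vars=['Name', 'Indicator'], var_name='Year', value_name='Green')
               .drop(columns='Indicator'))

# Merge on Name and Year, then order the columns as requested.
merged_df = (blue.merge(green, on=['Name', 'Year'], how='right')
                 [['Name', 'Blue', 'Green', 'Year']]
                 .sort_values(['Name', 'Year'], ignore_index=True))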
I have two dataframes: data_df and geo_dimension_df.
I would like to take the index of geo_dimension_df, which I renamed to id, and make it a column on data_df called geo_id.
I'll be inserting both of these dataframes as tables into a database, and the id columns will be their primary keys while geo_id is a foreign key that will link data_df to geo_dimension_df.
As can be seen, the cbsa and name values can change over time. (Yuba City, CA -> Yuba City-Marysville, CA). Therefore, the geo_dimension_df is all the unique combinations of cbsa and name.
I need to compare the cbsa and name values in both dataframes and, where they match, set geo_dimension_df.id as the data_df.geo_id value.
I tried using merge for a bit, but got confused, so now I'm trying with apply and looking at it like an Excel vlookup across multiple column values, but having no luck. The following is my attempt, but it's a bit gibberish...
data_df['geo_id'] = data_df[['cbsa', 'name']]
    .apply(
        lambda x, y:
            geo_dimension_df
            .index[geo_dimension_df[['cbsa', 'name']]
            .to_list()
            == [x, y])
Below are the two original dataframes followed by the desired result. Thank you.
geo_dimension_df:
cbsa name
id
1 10180 Abilene, TX
2 10420 Akron, OH
3 10500 Albany, GA
4 10540 Albany, OR
5 10540 Albany-Lebanon, OR
...
519 49620 York-Hanover, PA
520 49660 Youngstown-Warren-Boardman, OH-PA
521 49700 Yuba City, CA
522 49700 Yuba City-Marysville, CA
523 49740 Yuma, AZ
data_df:
cbsa name month year units_total
id
1 10180 Abilene, TX 1 2004 22
2 10180 Abilene, TX 2 2004 12
3 10180 Abilene, TX 3 2004 44
4 10180 Abilene, TX 4 2004 32
5 10180 Abilene, TX 5 2004 21
...
67145 49740 Yuma, AZ 12 2018 68
67146 49740 Yuma, AZ 1 2019 86
67147 49740 Yuma, AZ 2 2019 99
67148 49740 Yuma, AZ 3 2019 99
67149 49740 Yuma, AZ 4 2019 94
Desired Outcome:
data_df (with geo_id foreign key column added):
cbsa name month year units_total geo_id
id
1 10180 Abilene, TX 1 2004 22 1
2 10180 Abilene, TX 2 2004 12 1
3 10180 Abilene, TX 3 2004 44 1
4 10180 Abilene, TX 4 2004 32 1
5 10180 Abilene, TX 5 2004 21 1
...
67145 49740 Yuma, AZ 12 2018 68 523
67146 49740 Yuma, AZ 1 2019 86 523
67147 49740 Yuma, AZ 2 2019 99 523
67148 49740 Yuma, AZ 3 2019 99 523
67149 49740 Yuma, AZ 4 2019 94 523
Note: I'll be dropping cbsa and name from data_df after this, in case anybody is curious as to why I'm duplicating data.
First, because the index is not a proper column, make it a column so that it can be used in a later merge:
geo_dimension_df['geo_id'] = geo_dimension_df.index
Next, join data_df and geo_dimension_df (note the double brackets, which select a sub-DataFrame of those three columns):
data_df = pd.merge(data_df,
                   geo_dimension_df[['cbsa', 'name', 'geo_id']],
                   on=['cbsa', 'name'],
                   how='left')
Finally, drop the column you added to the geo_dimension_df at the start:
geo_dimension_df.drop('geo_id', axis=1, inplace=True)
After doing this, geo_dimension_df's index column, id, will now appear on data_df under the column geo_id:
data_df:
cbsa name month year units_total geo_id
id
1 10180 Abilene, TX 1 2004 22 1
2 10180 Abilene, TX 2 2004 12 1
3 10180 Abilene, TX 3 2004 44 1
4 10180 Abilene, TX 4 2004 32 1
5 10180 Abilene, TX 5 2004 21 1
...
67145 49740 Yuma, AZ 12 2018 68 523
67146 49740 Yuma, AZ 1 2019 86 523
67147 49740 Yuma, AZ 2 2019 99 523
67148 49740 Yuma, AZ 3 2019 99 523
67149 49740 Yuma, AZ 4 2019 94 523
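As an aside, the same result can be obtained without the temporary geo_id column: reset_index() exposes the index as a column on a copy of geo_dimension_df, which can be renamed and merged directly. A minimal sketch:

# reset_index() operates on a copy, so geo_dimension_df keeps its index;
# as with the merge above, the merged frame gets a fresh RangeIndex.
data_df = data_df.merge(
    geo_dimension_df.reset_index().rename(columns={'id': 'geo_id'}),
    on=['cbsa', 'name'],
    how='left')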
Geography Age group 2016
0 Toronto All 1525
1 Toronto 1~7 5
2 Toronto 7~20 7
3 Toronto 20~40 500
4 Vancouver All 3000
5 Vancouver 1~7 10
6 Vancouver 7~20 565
7 Vancouver 20~40 564
...
NOTE: This is just an example; my dataframe contains different numbers.
I want to create a MultiIndex where the first level is Geography and the second is Age group.
Also, is it possible to group by these columns without applying an aggregation function at the end?
Output should be:
Geography Age group 2016
0 Toronto All 1525
1 1~7 5
2 7~20 7
3 20~40 500
4 Vancouver All 3000
5 1~7 10
6 7~20 565
7 20~40 564
...
In order to create a MultiIndex as specified, you can simply use DataFrame.set_index() (it returns a new DataFrame, so assign the result if you want to keep it):
df.set_index(['Geography', 'Age group'])
2016
Geography Age group
Toronto All 1525
1~7 5
7~20 7
20~40 500
Vancouver All 3000
1~7 10
7~20 565
20~40 564
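As for the second part of the question: groupby on its own computes nothing until an aggregation is applied, so you can hold on to the lazy GroupBy object and iterate over the groups instead of reducing them. A minimal sketch on the same frame:

# groupby returns a lazy GroupBy object; no function has run yet.
grouped = df.groupby('Geography')

# Iterating yields (group key, sub-DataFrame) pairs, untouched.
for city, frame in grouped:
    print(city)
    print(frame)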
Hi, I am a Stata user and I am trying to port my code to pandas. I have panel data as shown below, and I am looking for a command that creates a constant variable identifying the year and quarter each row falls in. In Stata such a command would be gen new_variable = yq(year, quarter).
My dataframe looks like this:
id year quarter
1 2007 1
1 2007 2
1 2007 3
1 2007 4
1 2008 1
1 2008 2
1 2008 3
1 2008 4
1 2009 1
1 2009 2
1 2009 3
1 2009 4
2 2007 1
2 2007 2
2 2007 3
2 2007 4
2 2008 1
2 2008 2
2 2008 3
2 2008 4
3 2009 2
3 2009 3
3 2010 2
3 2010 3
My expected output should look like this (the values inside new_variable are arbitrary; I am just looking for a value that is always the same for a given year and quarter, and consecutive across quarters):
id year quarter new_variable
1 2007 1 220
1 2007 2 221
1 2007 3 222
1 2007 4 223
1 2008 1 224
1 2008 2 225
1 2008 3 226
1 2008 4 227
1 2009 1 228
1 2009 2 229
1 2009 3 230
1 2009 4 231
2 2007 1 220
2 2007 2 221
2 2007 3 222
2 2007 4 223
2 2008 1 224
2 2008 2 225
2 2008 3 226
2 2008 4 227
3 2009 2 229
3 2009 3 230
3 2010 2 233
3 2010 3 234
My solution extends the idea of @johnchase: build a dictionary over the Cartesian product of year and quarter that maps the string representation year + quarter to consecutive integers.
import pandas as pd

# Unique years and quarters; unique() keeps order of first appearance,
# which is already sorted in this data.
ys = df['year'].unique()
qs = df['quarter'].unique()

# Cartesian product of all year/quarter combinations.
new_idx = pd.MultiIndex.from_product([ys, qs], names=['year', 'quarter'])

# String keys such as '20071', '20072', ...
yq = [''.join([str(a), str(b)]) for a, b in new_idx.values]
# yq
# ['20071', '20072', '20073', '20074',
#  '20081', '20082', '20083', '20084',
#  '20091', '20092', '20093', '20094',
#  '20101', '20102', '20103', '20104']

# Map each key to a consecutive integer, starting at 220.
mapper = {k: i + 220 for i, k in enumerate(yq)}
df['new_variable'] = df['year'].astype(str) + df['quarter'].astype(str)
df['new_variable'] = df['new_variable'].map(mapper)
df
id year quarter new_variable
0 1 2007 1 220
1 1 2007 2 221
2 1 2007 3 222
3 1 2007 4 223
4 1 2008 1 224
5 1 2008 2 225
6 1 2008 3 226
7 1 2008 4 227
8 1 2009 1 228
9 1 2009 2 229
10 1 2009 3 230
11 1 2009 4 231
12 2 2007 1 220
13 2 2007 2 221
14 2 2007 3 222
15 2 2007 4 223
16 2 2008 1 224
17 2 2008 2 225
18 2 2008 3 226
19 2 2008 4 227
20 3 2009 2 229
21 3 2009 3 230
22 3 2010 2 233
23 3 2010 3 234
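For what it's worth, Stata's yq() is just the count of quarters elapsed since 1960q1 (yq(1960, 1) == 0), so the same kind of constant can be computed with plain arithmetic and no mapper:

# Quarters since 1960q1, matching what Stata's yq(year, quarter) returns.
df['new_variable'] = (df['year'] - 1960) * 4 + (df['quarter'] - 1)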
I have the following:
result.head(4)
district end party start state type id.thomas current
564 1 1987 Democrat 1985-01-03 HI rep 2 1985
565 1 1993 Democrat 1991-01-03 HI rep 2 1991
566 1 1995 Democrat 1993-01-05 HI rep 2 2019
567 1 1997 Democrat 1995-01-04 HI rep 2 2017
I would like to change all values greater than 2014 in the column end to 2014. I'm not sure how to go about doing this.
Use clip_upper:
In [207]:
df['end'] = df['end'].clip_upper(1990)
df
Out[207]:
district end party start state type id.thomas current
564 1 1987 Democrat 1985-01-03 HI rep 2 1985
565 1 1990 Democrat 1991-01-03 HI rep 2 1991
566 1 1990 Democrat 1993-01-05 HI rep 2 2019
567 1 1990 Democrat 1995-01-04 HI rep 2 2017
So in your case df['end'] = df['end'].clip_upper(2014) should work
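Note that clip_upper was deprecated in pandas 0.24 and removed in 1.0; on modern versions the equivalent is the upper keyword of clip:

# Caps every value in the column at 2014.
df['end'] = df['end'].clip(upper=2014)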