I received the following tables side by side in one Excel file and need to combine them into one table with the column names Name, Blue, Green, Year.
The names are the same in both tables, the indicator is the same throughout each Indicator column, and there is one empty column between the two tables.
This is the sheet I received:
Name    Indicator  2016  2017  2018       Name    Indicator  2016  2017  2018
Name 1  blue        524   108   387       Name 1  green        92   872    90
Name 2  blue         77   274    50       Name 2  green       402   312   528
Name 3  blue        201   774    18       Name 3  green       457   827    20
Name 4  blue         35   100   129       Name 4  green       183   428   510
This is what it should look like:
Name    Blue  Green  Year
Name 1   524     92  2016
Name 1   108    872  2017
Name 1   387     90  2018
Name 2    77    402  2016
Name 2   274    312  2017
Name 2    50    528  2018
Name 3   201    457  2016
Name 3   774    827  2017
Name 3    18     20  2018
...      ...    ...   ...
Is there any way to do this with pandas?
You can use merge in pandas to do this:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
Separate the data into two dataframes, call one df_one and the other df_two, then merge:
merged_df = df_one.merge(df_two, how="right")
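Building on that idea, here is a minimal sketch of how the separating, reshaping and merging might look. It assumes the sheet is read with pandas.read_excel from a file called data.xlsx (the filename and the column positions are assumptions), with the blue table in the first five columns, the green table in the last five, and one empty column in between:

import pandas as pd

# Assumed filename and layout: blue block = first 5 columns, empty column, green block = last 5.
sheet = pd.read_excel("data.xlsx")

df_one = sheet.iloc[:, :5].copy()    # Name, Indicator, 2016, 2017, 2018 (blue)
df_two = sheet.iloc[:, 6:].copy()    # same headers again (green)
df_two.columns = df_one.columns      # drop the ".1" suffixes pandas adds to duplicate headers

# Reshape each block so the years become a single "Year" column,
# then merge the blue and green values on Name and Year.
blue = df_one.melt(id_vars=["Name", "Indicator"], var_name="Year", value_name="Blue")
green = df_two.melt(id_vars=["Name", "Indicator"], var_name="Year", value_name="Green")

merged_df = blue.drop(columns="Indicator").merge(
    green.drop(columns="Indicator"), on=["Name", "Year"], how="right")

merged_df = merged_df[["Name", "Blue", "Green", "Year"]].sort_values(["Name", "Year"])
print(merged_df)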
So I have been trying to use pandas to create a DataFrame that reports the number of graduates working at jobs that do require college degrees ('college_jobs') and at jobs that do not ('non_college_jobs').
Note: the name of the dataframe I am dealing with is recent_grads.
I tried the following code:
df1 = recent_grads.groupby(['major_category']).college_jobs.non_college_jobs.sum()
or
df1 = recent_grads.groupby(['major_category']).recent_grads['college_jobs','non_college_jobs'].sum()
or
df1 = recent_grads.groupby(['major_category']).recent_grads['college_jobs'],['non_college_jobs'].sum()
None of them worked! What am I supposed to do? Can somebody give me a simple explanation of this? I have been trying to read through the pandas documentation and did not find the explanation I wanted.
here is the head of the dataframe:
rank major_code major major_category \
0 1 2419 PETROLEUM ENGINEERING Engineering
1 2 2416 MINING AND MINERAL ENGINEERING Engineering
2 3 2415 METALLURGICAL ENGINEERING Engineering
3 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING Engineering
4 5 2405 CHEMICAL ENGINEERING Engineering
total sample_size men women sharewomen employed ... \
0 2339 36 2057 282 0.120564 1976 ...
1 756 7 679 77 0.101852 640 ...
2 856 3 725 131 0.153037 648 ...
3 1258 16 1123 135 0.107313 758 ...
4 32260 289 21239 11021 0.341631 25694 ...
part_time full_time_year_round unemployed unemployment_rate median \
0 270 1207 37 0.018381 110000
1 170 388 85 0.117241 75000
2 133 340 16 0.024096 73000
3 150 692 40 0.050125 70000
4 5180 16697 1672 0.061098 65000
p25th p75th college_jobs non_college_jobs low_wage_jobs
0 95000 125000 1534 364 193
1 55000 90000 350 257 50
2 50000 105000 456 176 0
3 43000 80000 529 102 0
4 50000 75000 18314 4440 972
[5 rows x 21 columns]
You could filter the initial DataFrame by the columns you're interested in and then perform the groupby and summation as below:
recent_grads[['major_category', 'college_jobs', 'non_college_jobs']].groupby('major_category').sum()
Alternatively, if you don't apply the initial column filter and simply call .sum() on recent_grads.groupby('major_category'), the sum will be applied to every numeric column.
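Equivalently, you can select the two columns after the groupby instead of before it; this is just another spelling of the same operation:

recent_grads.groupby('major_category')[['college_jobs', 'non_college_jobs']].sum()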
Here are two dataframes:
df1:
Index Number Name Amount
0 123 John 31
1 124 Alle 33
2 312 Amy 33
3 314 Holly 35
df2:
Index Number Name Amount
0 312 Amy 13
1 124 Alle 35
2 317 Jack 53
The resulting dataframe should look like this
result_df:
Index Number Name Amount Curr_amount
0 123 John 31 31
1 124 Alle 33 68
2 312 Amy 33 46
3 314 Holly 35 35
4 317 Jack 53
I have tried using pandas isin, but it only returns a boolean indicating whether each Number is present. Is there any way to do this efficiently?
Use merge with an outer join and then Series.add (or Series.sub if necessary):
df = df1.merge(df2, on=['Number','Name'], how='outer', suffixes=('','_curr'))
df['Amount_curr'] = df['Amount_curr'].add(df['Amount'], fill_value=0)
print (df)
Number Name Amount Amount_curr
0 123 John 31.0 31.0
1 124 Alle 33.0 68.0
2 312 Amy 33.0 46.0
3 314 Holly 35.0 35.0
4 317 Jack NaN 53.0
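For reference, a minimal sketch reconstructing the two example frames from the question, so the snippet above can be run as-is:

import pandas as pd

df1 = pd.DataFrame({'Number': [123, 124, 312, 314],
                    'Name': ['John', 'Alle', 'Amy', 'Holly'],
                    'Amount': [31, 33, 33, 35]})
df2 = pd.DataFrame({'Number': [312, 124, 317],
                    'Name': ['Amy', 'Alle', 'Jack'],
                    'Amount': [13, 35, 53]})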
My knowledge of Pandas is relatively limited, and I've accomplished a lot with a small foundation + all the help in SO. This is the first time I've found myself at a dead end.
I'm trying to find the most efficient way to do the following:
I have a single df of ~150000 rows, with ~40 columns.
Here is a sample dataframe to work with for investigating a solution:
UniqueID CST WEIGHT VOLUME PRODUCTIVITY
0 413-20012 3 123 12 1113
1 413-45365 1 889 75 6748
2 413-21165 8 554 13 4536
3 413-24354 1 387 35 7649
4 413-34658 2 121 88 2468
5 413-36889 4 105 76 3336
6 413-23457 5 355 42 7894
7 413-30089 5 146 10 9112
8 413-41158 5 453 91 4545
9 413-51015 9 654 66 2232
One of the columns is a unique ID; the remaining columns contain data corresponding to the object with that ID.
I've determined a merged-style relationship between the objects outside of the DF, and now need to paste data where that relationship exists, from a 'parent' ID to all of its 'child' IDs.
For example, if I've determined that 413-23457 is the parent of 413-20012 and 413-21165, I need to copy the parent's values in the WEIGHT, VOLUME, and PRODUCTIVITY columns (but not UniqueID or CST) to the child objects. I have also determined that 413-41158 is the parent of 413-45365 and 413-51015.
I have to do this for many sets of these types of associations across the dataframe.
I've attempted to adapt a lot of sample code for pasting between dataframes, but several of my requirements appear to make it difficult to find a useful enough example. I could also envision creating objects for everything using .iterrows(), then matching and pasting in a loop. But having resorted to .iterrows() for past solutions, and knowing how long it can take, I don't think that approach will scale to larger datasets.
Any help would be greatly appreciated.
Edit with additional content per suggested solution
If I rearrange the input dataframe so the rows are in a more random order, the suggested answer does not really do the trick (my fault for not making this test sample better reflect the actual dataset).
Starting Dataframe is:
UniqueID CST WEIGHT VOLUME PRODUCTIVITY
0 413-20012 3 123 12 1113
1 413-45365 1 889 75 6748
2 413-21165 8 554 13 4536
3 413-24354 1 387 35 7649
4 413-34658 2 121 88 2468
5 413-36889 4 105 76 3336
6 413-23457 5 355 42 7894
7 413-30089 5 146 10 9112
8 413-41158 5 453 91 4545
9 413-51015 9 654 66 2232
Current suggested solution is:
parent_child_dict = {
    '413-51015': '413-41158',
    '413-21165': '413-23457',
    '413-45365': '413-41158',
    '413-20012': '413-23457'
}

(df.merge(df.UniqueID
            .replace(parent_child_dict),
          on='UniqueID',
          how='right')
   .set_index(df.index)
   .assign(UniqueID=df.UniqueID,
           CST=df.CST)
)
Resulting Dataframe is:
UniqueID CST WEIGHT VOLUME PRODUCTIVITY
0 413-20012 3 387 35 7649
1 413-45365 1 121 88 2468
2 413-21165 8 105 76 3336
3 413-24354 1 355 42 7894
4 413-34658 2 355 42 7894
5 413-36889 4 355 42 7894
6 413-23457 5 146 10 9112
7 413-30089 5 453 91 4545
8 413-41158 5 453 91 4545
9 413-51015 9 453 91 4545
The results are not what I expected now that the rows are in a random order, and I don't understand some of what has happened. The row with UniqueID 413-45365 was intended to mirror the data for 413-41158, but instead has the values (121, 88, 2468), which belong to an unrelated row (413-34658) in the starting DF.
The first thing I would do is get your parent-child relationship into a dictionary, and then we can use replace and merge:
# create a dictionary of parent-child relationships
# (parent_objects, get_merge and get_object_info stand in for however you
#  determine the relationships outside of the DataFrame)
parent_child_dict = {}
for parent_id in parent_objects:
    children = get_merge(parent_id)
    for child in children:
        child_id = get_object_info(child)
        # update dict
        parent_child_dict[child_id] = parent_id
# parent_child_dict = {
# '413-20012': '413-23457',
# '413-21165': '413-23457',
# '413-45365': '413-41158',
# '413-51015': '413-41158'
# }
# merge and copy data back
(df.merge(df.UniqueID
            .replace(parent_child_dict),
          on='UniqueID',
          how='right')
   .set_index(df.index)
   .assign(UniqueID=df.UniqueID,
           CST=df.CST)
)
Output:
UniqueID CST WEIGHT VOLUME PRODUCTIVITY
1 413-23457 5 355 42 7894
2 413-20012 3 355 42 7894
3 413-21165 8 355 42 7894
4 413-24354 1 387 35 7649
5 413-34658 2 121 88 2468
6 413-36889 4 105 76 3336
7 413-30089 5 146 10 9112
9 413-41158 5 453 91 4545
10 413-45365 1 453 91 4545
11 413-51015 9 453 91 4545
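If row order can change (as in the edit above), one order-independent variant is to build a lookup indexed by UniqueID and copy the parent's values through it. This is only a sketch under the same column names and parent_child_dict as above, not the original approach:

value_cols = ['WEIGHT', 'VOLUME', 'PRODUCTIVITY']

# For each row, the ID to copy values from: the parent for children, the row's own ID otherwise.
source_ids = df['UniqueID'].replace(parent_child_dict)

# Index the value columns by UniqueID and pull the source row's values for every row,
# dropping the index with to_numpy() so assignment doesn't realign by label.
lookup = df.set_index('UniqueID')[value_cols]
df[value_cols] = lookup.loc[source_ids].to_numpy()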
I have a dataframe, and each time I sort it by values I want to pull the first index value as a string.
What I want my function to do is pull the country name at the top of the list; in this example it would pull 'United States' as a string. Because the country names are the index and not Series values, I can't just do summer_gold.iloc[0].
# Summer Gold Silver Bronze Total # Winter Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 Combined total ID
Afghanistan 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2 AFG
Algeria 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15 ALG
Argentina 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70 ARG
Armenia 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12 ARM
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12 ANZ
So if I were to sort based on number of Gold medals I'd get a dataframe that looks like:
# Summer Gold Silver Bronze Total # Winter Gold.1 \
United States 26 976 757 666 2399 22 96
Soviet Union 9 395 319 296 1010 9 78
Great Britain 27 236 272 272 780 22 10
France 27 202 223 246 671 22 31
China 9 201 146 126 473 10 12
Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 \
United States 102 84 282 48 1072 859
Soviet Union 57 59 194 18 473 376
Great Britain 4 12 26 49 246 276
France 31 47 109 49 233 254
China 22 19 53 19 213 168
Bronze.2 Combined total ID
United States 750 2681 USA
Soviet Union 355 1204 URS
Great Britain 284 806 GBR
France 293 780 FRA
China 145 526 CHN
So far my overall code looks like:
def answer_one():
    summer_gold = df.sort_values('Gold', ascending=False)
    summer_gold = summer_gold.iloc[0]
    return summer_gold
answer_one()
Output:
# Summer 26
Gold 976
Silver 757
Bronze 666
Total 2399
# Winter 22
Gold.1 96
Silver.1 102
Bronze.1 84
Total.1 282
# Games 48
Gold.2 1072
Silver.2 859
Bronze.2 750
Combined total 2681
ID USA
Name: United States, dtype: object
I want an output of 'United States' in this case, or the name of whichever country is at the top of my sorted dataframe.
After you've sorted your dataframe, you can access the first row's index label like:
df.index[0]
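Applied to the function from the question, that might look like this (a sketch using the same df and 'Gold' column as above):

def answer_one():
    # Sort by Gold medals and return the index label (the country name) of the top row.
    summer_gold = df.sort_values('Gold', ascending=False)
    return summer_gold.index[0]

answer_one()   # 'United States'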
I have the following DataFrame:
Year Sector Number Count
2015 AA 173 277
2015 AA 172 278
2015 AA 173 234
2015 BB 173 234
2015 BB 171 273
2015 BB 173 272
2015 CC 172 272
2015 CC 172 234
2015 CC 173 234
2015 CC 173 345
2016 AA 173 277
2016 AA 173 277
2016 BB 173 277
2016 BB 173 277
2016 CC 173 277
2016 CC 173 272
2016 CC 170 273
2016 CC 170 275
I need to calculate the 90th percentile value of 'Count' for each group of ['Year','Sector','Number'] and return the next highest record in the group.
For example:
In the group
2015 CC 172 272
2015 CC 172 234
2015 CC 173 234
2015 CC 173 345
The 90th percentile value is 323.1, using the np.percentile() function. I would want to return the value 345, which is the next highest in the group. Any help here?
You can implement it as a 5-step process:
Group by
Find the 90th percentile
Find all the values above it
Keep the id of the minimal one
Retrieve all the necessary ids
Assuming your dataframe is named df:
import numpy as np

ids = [data[data.Count >= np.percentile(data.Count, 90)].Count.idxmin()
       for group, data in df.groupby('Sector')]
df.loc[ids]
I'll break it down into steps:
1 - iterate over groups by Sector:
for group,data in df.groupby('Sector')
2 - find the percentile:
perc = np.percentile(data.Count,90)
3 - filter the values:
subdf = data[data.Count >= perc]
4 - find the id of the minimal value:
subdf.Count.idxmin()
5 - retrieve the rows for the collected ids:
df.loc[ids]
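If you need the full grouping mentioned in the question (['Year', 'Sector', 'Number']), the same pattern applies; a sketch:

import numpy as np

ids = [data[data.Count >= np.percentile(data.Count, 90)].Count.idxmin()
       for group, data in df.groupby(['Year', 'Sector', 'Number'])]
df.loc[ids]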