Adding columns and an index to summed values using pandas in Python

I have a .csv file; after reading it with pandas I get this output:
Year Month Brunei Darussalam ... Thailand Viet Nam Myanmar
348 2007 Jan 3813 ... 25863 12555 4887
349 2007 Feb 3471 ... 22575 11969 3749
350 2007 Mar 4547 ... 33087 14060 5480
351 2007 Apr 3265 ... 34500 15553 6838
352 2007 May 3641 ... 30555 14995 5295
.. ... ... ... ... ... ... ...
474 2017 Jul 5625 ... 48620 71153 12619
475 2017 Aug 4610 ... 40993 51866 10934
476 2017 Sep 5387 ... 39692 40270 9888
477 2017 Oct 4202 ... 61448 39013 11616
478 2017 Nov 5258 ... 39304 36964 11402
I use this to sum each country's column across all the years and display the top 3:
top3_country = new_df.iloc[0:, 2:9].sum(axis=0).sort_values(ascending=False).nlargest(3)
My output is this:
Indonesia 27572424
Malaysia 11337420
Philippines 6548622
I want to add column names and an index to the summed values, as if it were a new DataFrame, like this:
Countries Visitors
0 Indonesia 27572424
1 Malaysia 11337420
2 Philippines 6548622
Sorry, I am just starting to learn pandas; any help will be gladly appreciated.

Use Series.reset_index to get a two-column DataFrame, then set the new column names from a list:
top3_country = top3_country.reset_index()
top3_country.columns = ['Countries', 'Visitors']
Or use Series.rename_axis with Series.reset_index:
top3_country = top3_country.rename_axis('Countries').reset_index(name='Visitors')
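To see the second approach end to end, here is a minimal sketch where the OP's three totals are hard-coded into a toy Series (the numbers are just the ones shown above, for illustration):

```python
import pandas as pd

# Stand-in for the summed/sorted Series from the question
totals = pd.Series({'Indonesia': 27572424,
                    'Malaysia': 11337420,
                    'Philippines': 6548622})

# rename_axis names the index; reset_index turns it into a regular column
top3_country = totals.rename_axis('Countries').reset_index(name='Visitors')
print(top3_country)
```

This prints a two-column DataFrame with `Countries` and `Visitors` headers and a fresh 0..2 index.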

You can convert back to a pd.DataFrame, use reset_index, then rename. Change your code to:
import pandas as pd
top3_country = (pd.DataFrame(df.iloc[0:, 2:9].sum(axis=0).sort_values(ascending=False).nlargest(3))
                  .reset_index()
                  .rename(columns={'index': 'Countries', 0: 'visitors'}))
top3_country
Countries visitors
0 Indonesia 27572424
1 Malaysia 11337420
2 Philippines 6548622


Group by multiple columns in a pandas DataFrame and get the mean of a column

I have a DataFrame like this as input:
Country Year AvgTemperature
1826 Algeria 2000 43.9
1827 Algeria 2000 46.5
.
.
7826 Algeria 2016 72.2
7827 Algeria 2016 69.4
.
.
858661 Poland 2000 63.6
858662 Poland 2000 61.9
.
.
857763 Poland 2015 34.8
857764 Poland 2015 39.2
...
I want the output grouped by Year and Country, with the mean of the AvgTemperature column, like this:
Country Year AvgTemperature
1826 Algeria 2000 45.5
.
.
7826 Algeria 2016 70.9
.
.
858661 Poland 2000 62.8
.
.
857763 Poland 2015 37
...
So far I have tried this:
aggregation_functions = {'AvgTemperature': 'mean'}
df_new = df.groupby(df['Year', 'Country']).aggregate(aggregation_functions)
But I get this error: KeyError: ('Year', 'Country')
df['Year', 'Country'] tries to select a single column whose name is the tuple ('Year', 'Country'), hence the KeyError. Pass a list of column names to groupby instead:
df_new = df.groupby(['Year', 'Country']).aggregate(aggregation_functions)
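As an aside, a minimal sketch (with made-up temperatures) showing that passing as_index=False keeps the grouping keys as ordinary columns, which avoids a later reset_index:

```python
import pandas as pd

# Invented sample rows in the same shape as the question's data
df = pd.DataFrame({'Country': ['Algeria', 'Algeria', 'Poland', 'Poland'],
                   'Year': [2000, 2000, 2000, 2000],
                   'AvgTemperature': [43.9, 46.5, 63.6, 61.9]})

# as_index=False leaves Year/Country as columns instead of a MultiIndex
df_new = df.groupby(['Year', 'Country'], as_index=False).agg({'AvgTemperature': 'mean'})
print(df_new)
```

The result has one row per (Year, Country) pair with the mean temperature.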
# Import Module
import pandas as pd
# Data Import and Pre-Process
df = pd.DataFrame({'Country':['Algeria','Algeria','Algeria','Algeria','Poland','Poland','Poland','Poland'],
'Year':['2000','2000','2016','2016','2000','2000','2015','2015'],
'AvgTemperature':[43.9,46.5,72.2,69.4,63.6,61.9,34.8,39.2]})
df_v2 = df.groupby(['Country','Year'])['AvgTemperature'].mean().reset_index()
# Output Display
df_v2
Hi Ferdous, please try the code above; if you have any questions, let me know.
Thanks,
Leon

Calculate the worst months from a given CSV

I tried finding the five worst months in the data, but I'm not sure about the process and am quite confused. The answer should be something like (June 2001, July 2002), but when I tried to solve it my answer wasn't as expected: only the January data was sorted. The way I tried solving it is below, and the CSV data is also shown.
My solution is given below:
PATH = "tourist_arrival.csv"
df = pd.read_csv(PATH)
print(df.sort_values(by=['Jan.','Feb.','Mar.','Apr.','May.','Jun.','Jul.','Aug.','Sep.','Oct.','Nov.','Dec.'],ascending=False))
Year,Jan.,Feb.,Mar.,Apr.,May.,Jun.,Jul.,Aug.,Sep.,Oct.,Nov.,Dec.,Total
1992,17451,27489,31505,30682,29089,22469,20942,27338,24839,42647,32341,27561,334353
1993,19238,23931,30818,20121,20585,19602,13588,21583,23939,42242,30378,27542,293567
1994,21735,24872,31586,27292,26232,22907,19739,27610,27959,39393,28008,29198,326531
1995,22207,28240,34219,33994,27843,25650,23980,27686,30569,46845,35782,26380,363395
1996,27886,29676,39336,36331,29728,26749,22684,29080,32181,47314,37650,34998,393613
1997,25585,32861,43177,35229,33456,26367,26091,35549,31981,56272,40173,35116,421857
1998,28822,37956,41338,41087,35814,29181,27895,36174,39664,62487,47403,35863,463684
1999,29752,38134,46218,40774,42712,31049,27193,38449,44117,66543,48865,37698,491504
2000,25307,38959,44944,43635,28363,26933,24480,34670,43523,59195,52993,40644,463646
2001,30454,38680,46709,39083,28345,13030,18329,25322,31170,41245,30282,18588,361237
2002,17176,20668,28815,21253,19887,17218,16621,21093,23752,35272,28723,24990,275468
2003,21215,24349,27737,25851,22704,20351,22661,27568,28724,45459,38398,33115,338132
2004,30988,35631,44290,33514,26802,19793,24860,33162,25496,43373,36381,31007,385297
2005,25477,20338,29875,23414,25541,22608,23996,36910,36066,51498,41505,38170,375398
2006,28769,25728,36873,21983,22870,26210,25183,33150,33362,49670,44119,36009,383926
2007,33192,39934,54722,40942,35854,31316,35437,44683,45552,70644,52273,42156,526705
2008,36913,46675,58735,38475,30410,24349,25427,40011,41622,66421,52399,38840,500277
2009,29278,40617,49567,43337,30037,31749,30432,44174,42771,72522,54423,41049,509956
2010,33645,49264,63058,45509,32542,33263,38991,54672,54848,79130,67537,50408,602867
2011,42622,56339,67565,59751,46202,46115,42661,71398,63033,96996,83460,60073,736215
2012,52501,66459,89151,69796,50317,53630,49995,71964,66383,86379,83173,63344,803092
2013,47846,67264,88697,65152,52834,54599,54011,68478,66755,99426,75485,57069,797616
Melt your DataFrame and then sort_values:
output = (df.melt("Year", df.drop(["Year", "Total"], axis=1).columns, var_name="Month")
            .sort_values("value")
            .reset_index(drop=True))
>>> output
Year Month value
0 2001 Jun. 13030
1 1993 Jul. 13588
2 2002 Jul. 16621
3 2002 Jan. 17176
4 2002 Jun. 17218
.. ... ... ...
259 2012 Oct. 86379
260 2013 Mar. 88697
261 2012 Mar. 89151
262 2011 Oct. 96996
263 2013 Oct. 99426
[264 rows x 3 columns]
For just the 5 worst months, you can do:
>>> output.iloc[:5]
Year Month value
0 2001 Jun. 13030
1 1993 Jul. 13588
2 2002 Jul. 16621
3 2002 Jan. 17176
4 2002 Jun. 17218
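Equivalently, DataFrame.nsmallest can replace the sort_values + iloc pair. A sketch on two invented rows (only three month columns, to keep it short):

```python
import pandas as pd

# Two made-up years of monthly arrivals, standing in for the real CSV
df = pd.DataFrame({'Year': [2001, 2002],
                   'Jan.': [30454, 17176],
                   'Jun.': [13030, 17218],
                   'Jul.': [18329, 16621],
                   'Total': [61813, 51015]})

# Melt to long form, then take the 3 smallest values directly
melted = df.melt('Year', df.columns.drop(['Year', 'Total']), var_name='Month')
worst = melted.nsmallest(3, 'value')
print(worst)
```

nsmallest avoids sorting the whole frame when you only need the bottom few rows.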

Summarize values in pandas DataFrames

I want to calculate the maximum value for each year and show the sector and that value. For example, from the screenshot, I would like to display:
2010: Telecom 781
2011: Tech 973
I have tried using:
df.groupby(['Year', 'Sector'])['Revenue'].max()
but this does not give me the name of the Sector with the highest value.
Try using idxmax and loc:
df.loc[df.groupby(['Sector','Year'])['Revenue'].idxmax()]
MVCE:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Sector':['Telecom','Tech','Financial Service','Construction','Heath Care']*3,
'Year':[2010,2011,2012,2013,2014]*3,
'Revenue':np.random.randint(101,999,15)})
df.loc[df.groupby(['Sector','Year'])['Revenue'].idxmax()]
Output:
Sector Year Revenue
3 Construction 2013 423
12 Financial Service 2012 838
9 Heath Care 2014 224
1 Tech 2011 466
5 Telecom 2010 843
Alternatively, .sort_values + .tail, grouping on just Year. Data from @Scott Boston:
df.sort_values('Revenue').groupby('Year').tail(1)
Output:
Sector Year Revenue
9 Heath Care 2014 224
3 Construction 2013 423
1 Tech 2011 466
12 Financial Service 2012 838
5 Telecom 2010 843
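Since the question asks for the top sector per year, grouping on Year alone and taking idxmax gives that directly. A sketch with invented numbers matching the question's example:

```python
import pandas as pd

# Made-up revenue rows: two sectors per year
df = pd.DataFrame({'Sector': ['Telecom', 'Tech', 'Telecom', 'Tech'],
                   'Year': [2010, 2010, 2011, 2011],
                   'Revenue': [781, 500, 600, 973]})

# idxmax per Year returns the row label of each year's maximum Revenue;
# loc pulls back the full rows, Sector included
best = df.loc[df.groupby('Year')['Revenue'].idxmax()]
print(best)
```

This yields one row per year: Telecom 781 for 2010 and Tech 973 for 2011, as in the question's desired output.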

Use apply & lambda on a Series

I have this:
df.loc['United Kingdom']
It is a series:
Rank 4.000000e+00
Documents 2.094400e+04
Citable documents 2.035700e+04
Citations 2.060910e+05
Self-citations 3.787400e+04
Citations per document 9.840000e+00
H index 1.390000e+02
Energy Supply NaN
Energy Supply per Capita NaN
% Renewable's NaN
2006 2.419631e+12
2007 2.482203e+12
2008 2.470614e+12
2009 2.367048e+12
2010 2.403504e+12
2011 2.450911e+12
2012 2.479809e+12
2013 2.533370e+12
2014 2.605643e+12
2015 2.666333e+12
Name: United Kingdom, dtype: float64
Now, I want to run:
apply(lambda x: x['2015'] - x['2006'])
But it returned an error:
TypeError: 'float' object is not subscriptable
But if I get it separate:
df.loc['United Kingdom']['2015'] - df.loc['United Kingdom']['2006']
It worked okay.
How can I use apply and a lambda here?
Thanks.
PS: I want to apply it to a DataFrame:
Rank Documents Citable documents Citations Self-citations Citations per document H index Energy Supply Energy Supply per Capita % Renewable's ... 2008 2009 2010 2011 2012 2013 2014 2015 Citation Ratio Population
Country
China 1 127050 126767 597237 411683 4.70 138 NaN NaN NaN ... 4.997775e+12 5.459247e+12 6.039659e+12 6.612490e+12 7.124978e+12 7.672448e+12 8.230121e+12 8.797999e+12 0.689313 NaN
United States 2 96661 94747 792274 265436 8.20 230 NaN NaN NaN ... 1.501149e+13 1.459484e+13 1.496437e+13 1.520402e+13 1.554216e+13 1.577367e+13 1.615662e+13 1.654857e+13 0.335031 NaN
Japan 3 30504 30287 223024 61554 7.31 134 NaN NaN NaN ... 5.558527e+12 5.251308e+12 5.498718e+12 5.473738e+12 5.569102e+12 5.644659e+12 5.642884e+12 5.669563e+12 0.275997 NaN
United Kingdom 4 20944 20357 206091 37874 9.84 139 NaN NaN NaN ... 2.470614e+12 2.367048e+12 2.403504e+12 2.450911e+12 2.479809e+12 2.533370e+12 2.605643e+12 2.666333e+12 0.183773 NaN
The error occurs because apply on a Series passes each scalar element to the function, so x is a float and x['2015'] fails. If you want the difference across the whole DataFrame, just calculate it directly:
df['2015'] - df['2006']
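A minimal sketch of why this works: column arithmetic is vectorised, so the subtraction runs row by row with no apply needed (the figures below are made up, loosely in the shape of the GDP-style columns above):

```python
import pandas as pd

# Toy frame with year columns, standing in for the data in the question
df = pd.DataFrame({'2006': [2.4e12, 1.3e13],
                   '2015': [2.7e12, 1.65e13]},
                  index=['United Kingdom', 'United States'])

# One vectorised subtraction produces a Series indexed by country
growth = df['2015'] - df['2006']
print(growth)
```

The result is a Series aligned on the original index, one difference per row.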

Find key from value for Pandas Series

I have a dictionary whose values are entries of a pandas Series. I want to make a new Series that looks up each value and returns the associated key. Example:
import pandas as pd
df = pd.DataFrame({'season' : ['Nor 2014', 'Nor 2013', 'Nor 2013', 'Norv 2013',
'Swe 2014', 'Swe 2014', 'Swe 2013',
'Swe 2013', 'Sven 2013', 'Sven 2013', 'Norv 2014']})
nmdict = {'Norway' : [s for s in list(set(df.season)) if 'No' in s],
'Sweden' : [s for s in list(set(df.season)) if 'S' in s]}
Desired result with df['country'] as the new column name:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
Due to the nature of my data I must build nmdict manually, as shown. I've tried to reverse nmdict but couldn't, as the arrays are not the same length.
More importantly, I think my approach may be wrong. I'm coming from Excel and thinking of a vlookup solution, but according to this answer, I shouldn't be using the dictionary in this way.
Any answers appreciated.
I've done it in a verbose manner so you can follow along.
First, let's define a function that determines the 'country' value:
In [4]: def get_country(s):
...: if 'Nor' in s:
...: return 'Norway'
...: if 'S' in s:
...: return 'Sweden'
...: # return 'Default Country' # if you get unmatched values
In [5]: get_country('Sven')
Out[5]: 'Sweden'
In [6]: get_country('Norv')
Out[6]: 'Norway'
We can use map to run get_country on every element. pandas Series also have an apply() which works similarly*. Note that in Python 3, map returns an iterator, so wrap it in list() to materialise the values:
In [7]: list(map(get_country, df['season']))
Out[7]:
['Norway',
'Norway',
'Norway',
'Norway',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Norway']
Now we assign that result to the column called 'country'
In [8]: df['country'] = list(map(get_country, df['season']))
Let's view the final result:
In [9]: df
Out[9]:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
*With apply() here's how it would look:
In [16]: df['country'] = df['season'].apply(get_country)
In [17]: df
Out[17]:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
A more scalable country matcher
pseudo-code only :)
# Modify this as needed
country_matchers = {
'Norway': ['Nor', 'Norv'],
'Sweden': ['S', 'Swed'],
}
def get_country(s):
"""
Run the passed string s against "matchers" for each country
Return the first matched country
"""
for country, matchers in country_matchers.items():
for matcher in matchers:
if matcher in s:
return country
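The pseudo-code above is actually close to runnable already; here is a self-contained version exercised on a small frame made up from the question's season strings:

```python
import pandas as pd

# Same matcher idea as above: substrings that identify each country
country_matchers = {
    'Norway': ['Nor', 'Norv'],
    'Sweden': ['S', 'Swed'],
}

def get_country(s):
    """Return the first country whose matcher substring appears in s."""
    for country, matchers in country_matchers.items():
        for matcher in matchers:
            if matcher in s:
                return country

df = pd.DataFrame({'season': ['Nor 2014', 'Swe 2013', 'Sven 2013', 'Norv 2014']})
df['country'] = df['season'].apply(get_country)
print(df)
```

Matcher order matters: 'Norway' is checked first, so 'Nor'/'Norv' win before the broad 'S' matcher can fire.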
IIUC, I would do the following:
df['country'] = df['season'].apply(lambda x: 'Norway' if 'No' in x else 'Sweden' if 'S' in x else x)
You could create the country dictionary using a dictionary comprehension:
country_id = df.season.str.split().str.get(0).drop_duplicates()
country_dict = {c: ('Norway' if c.startswith('N') else 'Sweden') for c in country_id.values}
to get:
{'Nor': 'Norway', 'Swe': 'Sweden', 'Sven': 'Sweden', 'Norv': 'Norway'}
This works fine for two countries; otherwise you can apply a self-defined function in a similar way:
def country_dict(country_id):
if country_id.startswith('S'):
return 'Sweden'
elif country_id.startswith('N'):
return 'Norway'
elif country_id.startswith('XX'):
return ...
else:
return 'default'
Either way, map the dictionary to the country_id part of the season column, extracted using pandas string methods:
df['country'] = df.season.str.split().str.get(0).map(country_dict)
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
