Adding columns and an index to summed values using pandas in Python - python
I have a .csv file; after reading it with pandas I have this output:
Year Month Brunei Darussalam ... Thailand Viet Nam Myanmar
348 2007 Jan 3813 ... 25863 12555 4887
349 2007 Feb 3471 ... 22575 11969 3749
350 2007 Mar 4547 ... 33087 14060 5480
351 2007 Apr 3265 ... 34500 15553 6838
352 2007 May 3641 ... 30555 14995 5295
.. ... ... ... ... ... ... ...
474 2017 Jul 5625 ... 48620 71153 12619
475 2017 Aug 4610 ... 40993 51866 10934
476 2017 Sep 5387 ... 39692 40270 9888
477 2017 Oct 4202 ... 61448 39013 11616
478 2017 Nov 5258 ... 39304 36964 11402
I use this to get the sum for each country across all years, in order to display the top 3:
top3_country = new_df.iloc[0:, 2:9].sum(axis=0).sort_values(ascending=False).nlargest(3)
My output is this:
Indonesia 27572424
Malaysia 11337420
Philippines 6548622
I want to add column names and an index to the summed values, as if it were a new DataFrame, like this:
Countries Visitors
0 Indonesia 27572424
1 Malaysia 11337420
2 Philippines 6548622
Sorry, I am just starting to learn pandas; any help will be gladly appreciated.
Use Series.reset_index to get a two-column DataFrame, then set the new column names from a list:
top3_country = top3_country.reset_index()
top3_country.columns = ['Countries', 'Visitors']
Or use Series.rename_axis with Series.reset_index:
top3_country = top3_country.rename_axis('Countries').reset_index(name='Visitors')
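Both options can be checked on a tiny Series shaped like the result above (values copied from the question; this is just a minimal sketch):

```python
import pandas as pd

# A small Series shaped like the nlargest(3) result in the question
top3_country = pd.Series(
    [27572424, 11337420, 6548622],
    index=["Indonesia", "Malaysia", "Philippines"],
)

# Option 1: reset_index, then assign column names from a list
out1 = top3_country.reset_index()
out1.columns = ["Countries", "Visitors"]

# Option 2: name the index axis first, then reset with a value-column name
out2 = top3_country.rename_axis("Countries").reset_index(name="Visitors")

print(out1.equals(out2))  # both give the same two-column DataFrame
```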
You can convert back to a pd.DataFrame, use reset_index, and rename. Change your code to:
import pandas as pd
top3_country = (
    pd.DataFrame(df.iloc[:, 2:9].sum(axis=0).sort_values(ascending=False).nlargest(3))
    .reset_index()
    .rename(columns={'index': 'Countries', 0: 'visitors'})
)
top3_country
Countries visitors
0 Indonesia 27572424
1 Malaysia 11337420
2 Philippines 6548622
Related
Group by multiple columns in a pandas data frame and get the mean value of a column
I have a dataframe like this.

Input:

        Country  Year  AvgTemperature
1826    Algeria  2000            43.9
1827    Algeria  2000            46.5
...
7826    Algeria  2016            72.2
7827    Algeria  2016            69.4
...
858661   Poland  2000            63.6
858662   Poland  2000            61.9
...
857763   Poland  2015            34.8
857764   Poland  2015            39.2
...

I want the output grouped by Year and Country, with the mean of the AvgTemperature column, like this:

        Country  Year  AvgTemperature
1826    Algeria  2000            45.5
...
7826    Algeria  2016            70.9
...
858661   Poland  2000            62.8
...
857763   Poland  2015            37
...

So far I have tried this:

aggregation_functions = {'AvgTemperature': 'mean'}
df_new = df.groupby(df['Year', 'Country']).aggregate(aggregation_functions)

but I am getting this error:

KeyError: ('Year', 'Country')
Pass the column names as a list to groupby, instead of indexing the frame with a tuple:

df_new = df.groupby(['Year', 'Country']).aggregate(aggregation_functions)
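A minimal, self-contained sketch of the fix (toy data adapted from the question): `df['Year', 'Country']` looks up a single column named by the tuple `('Year', 'Country')`, which is what raised the KeyError; passing the list of names groups by both columns.

```python
import pandas as pd

# Toy data with two readings per (Year, Country) pair
df = pd.DataFrame({
    "Country": ["Algeria", "Algeria", "Poland", "Poland"],
    "Year": [2000, 2000, 2000, 2000],
    "AvgTemperature": [43.9, 46.5, 63.6, 61.9],
})

# Group by a list of column names; as_index=False keeps them as columns
df_new = df.groupby(["Year", "Country"], as_index=False).agg({"AvgTemperature": "mean"})
print(df_new)
```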
# Import module
import pandas as pd

# Data import and pre-process
df = pd.DataFrame({'Country': ['Algeria', 'Algeria', 'Algeria', 'Algeria', 'Poland', 'Poland', 'Poland', 'Poland'],
                   'Year': ['2000', '2000', '2016', '2016', '2000', '2000', '2015', '2015'],
                   'AvgTemperature': [43.9, 46.5, 72.2, 69.4, 63.6, 61.9, 34.8, 39.2]})
df_v2 = df.groupby(['Country', 'Year'])['AvgTemperature'].mean().reset_index()

# Output display
df_v2

Please try the code above; if you have any questions, let me know.
Calculate the worst months from the given CSV
I tried finding the five worst months from the data, but I'm not sure about the process and I'm very confused. The answer should be something like (June 2001, July 2002), but when I tried to solve it my answer wasn't as expected: only the January data was sorted. This is the way I tried solving it; the CSV data is included below.

PATH = "tourist_arrival.csv"
df = pd.read_csv(PATH)
print(df.sort_values(by=['Jan.','Feb.','Mar.','Apr.','May.','Jun.','Jul.','Aug.','Sep.','Oct.','Nov.','Dec.'],ascending=False))

Year,Jan.,Feb.,Mar.,Apr.,May.,Jun.,Jul.,Aug.,Sep.,Oct.,Nov.,Dec.,Total
1992,17451,27489,31505,30682,29089,22469,20942,27338,24839,42647,32341,27561,334353
1993,19238,23931,30818,20121,20585,19602,13588,21583,23939,42242,30378,27542,293567
1994,21735,24872,31586,27292,26232,22907,19739,27610,27959,39393,28008,29198,326531
1995,22207,28240,34219,33994,27843,25650,23980,27686,30569,46845,35782,26380,363395
1996,27886,29676,39336,36331,29728,26749,22684,29080,32181,47314,37650,34998,393613
1997,25585,32861,43177,35229,33456,26367,26091,35549,31981,56272,40173,35116,421857
1998,28822,37956,41338,41087,35814,29181,27895,36174,39664,62487,47403,35863,463684
1999,29752,38134,46218,40774,42712,31049,27193,38449,44117,66543,48865,37698,491504
2000,25307,38959,44944,43635,28363,26933,24480,34670,43523,59195,52993,40644,463646
2001,30454,38680,46709,39083,28345,13030,18329,25322,31170,41245,30282,18588,361237
2002,17176,20668,28815,21253,19887,17218,16621,21093,23752,35272,28723,24990,275468
2003,21215,24349,27737,25851,22704,20351,22661,27568,28724,45459,38398,33115,338132
2004,30988,35631,44290,33514,26802,19793,24860,33162,25496,43373,36381,31007,385297
2005,25477,20338,29875,23414,25541,22608,23996,36910,36066,51498,41505,38170,375398
2006,28769,25728,36873,21983,22870,26210,25183,33150,33362,49670,44119,36009,383926
2007,33192,39934,54722,40942,35854,31316,35437,44683,45552,70644,52273,42156,526705
2008,36913,46675,58735,38475,30410,24349,25427,40011,41622,66421,52399,38840,500277
2009,29278,40617,49567,43337,30037,31749,30432,44174,42771,72522,54423,41049,509956
2010,33645,49264,63058,45509,32542,33263,38991,54672,54848,79130,67537,50408,602867
2011,42622,56339,67565,59751,46202,46115,42661,71398,63033,96996,83460,60073,736215
2012,52501,66459,89151,69796,50317,53630,49995,71964,66383,86379,83173,63344,803092
2013,47846,67264,88697,65152,52834,54599,54011,68478,66755,99426,75485,57069,797616
melt your DataFrame and then sort_values:

output = (
    df.melt("Year", df.drop(["Year", "Total"], axis=1).columns, var_name="Month")
      .sort_values("value")
      .reset_index(drop=True)
)

>>> output
     Year Month  value
0    2001  Jun.  13030
1    1993  Jul.  13588
2    2002  Jul.  16621
3    2002  Jan.  17176
4    2002  Jun.  17218
..    ...   ...    ...
259  2012  Oct.  86379
260  2013  Mar.  88697
261  2012  Mar.  89151
262  2011  Oct.  96996
263  2013  Oct.  99426

[264 rows x 3 columns]

For just the 5 worst months, you can do:

>>> output.iloc[:5]
   Year Month  value
0  2001  Jun.  13030
1  1993  Jul.  13588
2  2002  Jul.  16621
3  2002  Jan.  17176
4  2002  Jun.  17218
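If you only need the k worst months, nsmallest avoids sorting the whole melted frame. A small sketch on toy data (column names copied from the question; the values are a hypothetical subset, not the full CSV):

```python
import pandas as pd

# Toy data in the same wide shape: one row per year, one column per month
df = pd.DataFrame({
    "Year": [2001, 2002],
    "Jan.": [30454, 17176],
    "Jun.": [13030, 17218],
    "Jul.": [18329, 16621],
    "Total": [61813, 51015],
})

# Reshape wide -> long: one (Year, Month, value) row per cell
long = df.melt(
    "Year",
    df.drop(["Year", "Total"], axis=1).columns,
    var_name="Month",
)

# nsmallest picks the k lowest values without a full sort
worst = long.nsmallest(3, "value").reset_index(drop=True)
print(worst)
```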
Summarize values in pandas data frames
I want to calculate the maximum value for each year and show the sector and that value. For example, from the screenshot, I would like to display:

2010: Telecom 781
2011: Tech 973

I have tried using:

df.groupby(['Year', 'Sector'])['Revenue'].max()

but this does not give me the name of the Sector which has the highest value.
Try using idxmax and loc:

df.loc[df.groupby(['Sector','Year'])['Revenue'].idxmax()]

MVCE:

import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({'Sector': ['Telecom', 'Tech', 'Financial Service', 'Construction', 'Heath Care']*3,
                   'Year': [2010, 2011, 2012, 2013, 2014]*3,
                   'Revenue': np.random.randint(101, 999, 15)})

df.loc[df.groupby(['Sector','Year'])['Revenue'].idxmax()]

Output:

               Sector  Year  Revenue
3        Construction  2013      423
12  Financial Service  2012      838
9          Heath Care  2014      224
1                Tech  2011      466
5             Telecom  2010      843
You can also use .sort_values + .tail, grouping on just Year. Data from @Scott Boston:

df.sort_values('Revenue').groupby('Year').tail(1)

Output:

               Sector  Year  Revenue
9          Heath Care  2014      224
3        Construction  2013      423
1                Tech  2011      466
12  Financial Service  2012      838
5             Telecom  2010      843
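A quick sketch of why sort-then-tail keeps exactly one row per year: after sorting by Revenue, the last row seen within each Year group is that year's maximum. (Toy data below is illustrative, not the question's screenshot.)

```python
import pandas as pd

# Toy data: two sectors per year
df = pd.DataFrame({
    "Sector": ["Telecom", "Tech", "Telecom", "Tech"],
    "Year": [2010, 2010, 2011, 2011],
    "Revenue": [781, 500, 600, 973],
})

# Sort ascending by Revenue, then keep the last (largest) row per Year
top_per_year = (
    df.sort_values("Revenue")
      .groupby("Year")
      .tail(1)
      .sort_values("Year")
      .reset_index(drop=True)
)
print(top_per_year)
```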
Use apply & lambda for a series
I have this:

df.loc['United Kingdom']

It is a series:

Rank                        4.000000e+00
Documents                   2.094400e+04
Citable documents           2.035700e+04
Citations                   2.060910e+05
Self-citations              3.787400e+04
Citations per document      9.840000e+00
H index                     1.390000e+02
Energy Supply                        NaN
Energy Supply per Capita             NaN
% Renewable's                        NaN
2006                        2.419631e+12
2007                        2.482203e+12
2008                        2.470614e+12
2009                        2.367048e+12
2010                        2.403504e+12
2011                        2.450911e+12
2012                        2.479809e+12
2013                        2.533370e+12
2014                        2.605643e+12
2015                        2.666333e+12
Name: United Kingdom, dtype: float64

Now, I want to use:

apply(lambda x: x['2015'] - x['2006'])

But it returned an error:

TypeError: 'float' object is not subscriptable

But if I compute it separately, it works okay:

df.loc['United Kingdom']['2015'] - df.loc['United Kingdom']['2006']

How could I use apply and lambda here? Thanks.

PS: I want to apply it to a DataFrame:

                Rank  Documents  Citable documents  Citations  Self-citations  \
Country
China              1     127050             126767     597237          411683
United States      2      96661              94747     792274          265436
Japan              3      30504              30287     223024           61554
United Kingdom     4      20944              20357     206091           37874

                Citations per document  H index  Energy Supply  \
Country
China                             4.70      138            NaN
United States                     8.20      230            NaN
Japan                             7.31      134            NaN
United Kingdom                    9.84      139            NaN

                Energy Supply per Capita  % Renewable's  ...          2008  \
Country
China                                NaN            NaN  ...  4.997775e+12
United States                        NaN            NaN  ...  1.501149e+13
Japan                                NaN            NaN  ...  5.558527e+12
United Kingdom                       NaN            NaN  ...  2.470614e+12

                        2009          2010          2011          2012  \
Country
China           5.459247e+12  6.039659e+12  6.612490e+12  7.124978e+12
United States   1.459484e+13  1.496437e+13  1.520402e+13  1.554216e+13
Japan           5.251308e+12  5.498718e+12  5.473738e+12  5.569102e+12
United Kingdom  2.367048e+12  2.403504e+12  2.450911e+12  2.479809e+12

                        2013          2014          2015  Citation Ratio  \
Country
China           7.672448e+12  8.230121e+12  8.797999e+12        0.689313
United States   1.577367e+13  1.615662e+13  1.654857e+13        0.335031
Japan           5.644659e+12  5.642884e+12  5.669563e+12        0.275997
United Kingdom  2.533370e+12  2.605643e+12  2.666333e+12        0.183773

                Population
Country
China                  NaN
United States          NaN
Japan                  NaN
United Kingdom         NaN
If you want to apply it against your whole dataframe, then just calculate it:

df['2015'] - df['2006']
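As a minimal illustration (toy values loosely based on the question's numbers), plain column arithmetic already returns the per-country difference as a Series, so no apply or lambda is needed:

```python
import pandas as pd

# Toy frame with year columns as strings, indexed by country
# (values are illustrative, rounded from the question)
df = pd.DataFrame(
    {"2006": [2.41e12, 1.38e13], "2015": [2.66e12, 1.65e13]},
    index=["United Kingdom", "United States"],
)

# Vectorized column subtraction: one result per country
growth = df["2015"] - df["2006"]
print(growth)
```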
Find key from value for Pandas Series
I have a dictionary whose values are in a pandas series. I want to make a new series that looks up a value in the series and returns a new series with the associated key. Example:

import pandas as pd
df = pd.DataFrame({'season': ['Nor 2014', 'Nor 2013', 'Nor 2013', 'Norv 2013', 'Swe 2014', 'Swe 2014',
                              'Swe 2013', 'Swe 2013', 'Sven 2013', 'Sven 2013', 'Norv 2014']})
nmdict = {'Norway': [s for s in list(set(df.season)) if 'No' in s],
          'Sweden': [s for s in list(set(df.season)) if 'S' in s]}

Desired result, with df['country'] as the new column name:

       season country
0    Nor 2014  Norway
1    Nor 2013  Norway
2    Nor 2013  Norway
3   Norv 2013  Norway
4    Swe 2014  Sweden
5    Swe 2014  Sweden
6    Swe 2013  Sweden
7    Swe 2013  Sweden
8   Sven 2013  Sweden
9   Sven 2013  Sweden
10  Norv 2014  Norway

Due to the nature of my data I must make the nmdict manually, as shown. I've tried this, but couldn't reverse my nmdict because the arrays are not the same length. More importantly, I think my approach may be wrong. I'm coming from Excel and thinking of a vlookup solution, but according to this answer, I shouldn't be using the dictionary in this way. Any answers appreciated.
I've done it in a verbose manner to allow you to follow through.

First, let's define a function that determines the value of 'country':

In [4]: def get_country(s):
   ...:     if 'Nor' in s:
   ...:         return 'Norway'
   ...:     if 'S' in s:
   ...:         return 'Sweden'
   ...:     # return 'Default Country'  # if you get unmatched values

In [5]: get_country('Sven')
Out[5]: 'Sweden'

In [6]: get_country('Norv')
Out[6]: 'Norway'

We can use map to run get_country on every row. Pandas Series also have an apply() which works similarly*. (On Python 3, map returns an iterator, so wrap it in list() before assigning.)

In [7]: list(map(get_country, df['season']))
Out[7]: ['Norway', 'Norway', 'Norway', 'Norway', 'Sweden', 'Sweden', 'Sweden', 'Sweden', 'Sweden', 'Sweden', 'Norway']

Now we assign that result to the column called 'country':

In [8]: df['country'] = list(map(get_country, df['season']))

Let's view the final result:

In [9]: df
Out[9]:
       season country
0    Nor 2014  Norway
1    Nor 2013  Norway
2    Nor 2013  Norway
3   Norv 2013  Norway
4    Swe 2014  Sweden
5    Swe 2014  Sweden
6    Swe 2013  Sweden
7    Swe 2013  Sweden
8   Sven 2013  Sweden
9   Sven 2013  Sweden
10  Norv 2014  Norway

*With apply(), here's how it would look:

In [16]: df['country'] = df['season'].apply(get_country)

In [17]: df
Out[17]:
       season country
0    Nor 2014  Norway
1    Nor 2013  Norway
2    Nor 2013  Norway
3   Norv 2013  Norway
4    Swe 2014  Sweden
5    Swe 2014  Sweden
6    Swe 2013  Sweden
7    Swe 2013  Sweden
8   Sven 2013  Sweden
9   Sven 2013  Sweden
10  Norv 2014  Norway

A more scalable country matcher (pseudo-code only :)):

# Modify this as needed
country_matchers = {
    'Norway': ['Nor', 'Norv'],
    'Sweden': ['S', 'Swed'],
}

def get_country(s):
    """
    Run the passed string s against "matchers" for each country.
    Return the first matched country.
    """
    for country, matchers in country_matchers.items():
        for matcher in matchers:
            if matcher in s:
                return country
IIUC, I would do the following:

df['country'] = df['season'].apply(lambda x: 'Norway' if 'No' in x else 'Sweden' if 'S' in x else x)
You could create the country dictionary using a dictionary comprehension:

country_id = df.season.str.split().str.get(0).drop_duplicates()
country_dict = {c: ('Norway' if c.startswith('N') else 'Sweden') for c in country_id.values}

to get:

{'Nor': 'Norway', 'Swe': 'Sweden', 'Sven': 'Sweden', 'Norv': 'Norway'}

This works fine for two countries; otherwise you can apply a self-defined function in a similar way:

def country_dict(country_id):
    if country_id.startswith('S'):
        return 'Sweden'
    elif country_id.startswith('N'):
        return 'Norway'
    elif country_id.startswith('XX'):
        return ...
    else:
        return 'default'

Either way, map the dictionary to the country_id part of the season column, extracted using pandas string methods:

df['country'] = df.season.str.split().str.get(0).map(country_dict)

       season country
0    Nor 2014  Norway
1    Nor 2013  Norway
2    Nor 2013  Norway
3   Norv 2013  Norway
4    Swe 2014  Sweden
5    Swe 2014  Sweden
6    Swe 2013  Sweden
7    Swe 2013  Sweden
8   Sven 2013  Sweden
9   Sven 2013  Sweden
10  Norv 2014  Norway
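Another route, closer to the asker's original nmdict, is to invert the {country: [seasons]} mapping into a flat {season: country} lookup and map it directly. A small sketch (data trimmed from the question; if a season ever matched several countries, the last country listed in nmdict would win):

```python
import pandas as pd

df = pd.DataFrame({"season": ["Nor 2014", "Swe 2013", "Norv 2013"]})

# The asker's dict shape: country -> list of matching season strings
nmdict = {
    "Norway": [s for s in set(df.season) if "No" in s],
    "Sweden": [s for s in set(df.season) if "S" in s],
}

# Invert {country: [seasons]} into a flat {season: country} lookup
lookup = {s: country for country, seasons in nmdict.items() for s in seasons}
df["country"] = df["season"].map(lookup)
print(df)
```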