pandas data frame add columns with function by year and company - python

I have a pandas DataFrame called 'sm' like the one below (with many more rows), and I want to calculate the sales growth rate by company and year and insert it into a new column named 'sg'.
fyear ch conm sale ipodate
0 1996 51.705 AAR CORP 589.328 NaN
1 1997 17.222 AAR CORP 782.123 NaN
2 1998 8.250 AAR CORP 918.036 NaN
3 1999 1.241 AAR CORP 1024.333 NaN
4 2000 13.809 AAR CORP 874.255 NaN
5 2001 34.522 AAR CORP 638.721 NaN
6 2002 29.154 AAR CORP 606.337 NaN
7 2003 41.010 AAR CORP 651.958 NaN
8 2004 40.508 AAR CORP 747.848 NaN
9 2005 121.738 AAR CORP 897.284 NaN
10 2006 83.317 AAR CORP 1061.169 NaN
11 2007 109.391 AAR CORP 1384.919 NaN
12 2008 112.505 AAR CORP 1423.976 NaN
13 2009 79.370 AAR CORP 1352.151 NaN
14 2010 57.433 AAR CORP 1775.782 NaN
15 2011 67.720 AAR CORP 2074.498 NaN
16 2012 75.300 AAR CORP 2167.100 NaN
17 2013 89.200 AAR CORP 2035.000 NaN
18 2014 54.700 AAR CORP 1594.300 NaN
19 2015 31.200 AAR CORP 1662.600 NaN
20 1997 64.000 AMERICAN AIRLINES GROUP INC 18570.000 NaN
21 1998 95.000 AMERICAN AIRLINES GROUP INC 19205.000 NaN
22 1999 85.000 AMERICAN AIRLINES GROUP INC 17730.000 NaN
23 2000 89.000 AMERICAN AIRLINES GROUP INC 19703.000 NaN
24 2001 120.000 AMERICAN AIRLINES GROUP INC 18963.000 NaN
115466 2014 290.500 ALLEGION PLC 2118.300 NaN
115467 2015 199.700 ALLEGION PLC 2068.100 NaN
115468 2016 312.400 ALLEGION PLC 2238.000 NaN
115470 2013 2.063 AGILITY HEALTH INC 63.052 NaN
115471 2014 1.301 AGILITY HEALTH INC 62.105 NaN
115472 2015 1.307 AGILITY HEALTH INC 62.328 NaN
115473 2013 109.819 NORDIC AMERICAN OFFSHORE NaN NaN
115474 2014 46.398 NORDIC AMERICAN OFFSHORE 52.789 NaN
115475 2015 5.339 NORDIC AMERICAN OFFSHORE 36.372 NaN
115476 2016 2.953 NORDIC AMERICAN OFFSHORE 16.249 NaN
115477 2011 2.040 DORIAN LPG LTD 34.571 20140508.0
115478 2012 1.042 DORIAN LPG LTD 38.662 20140508.0
115479 2013 279.132 DORIAN LPG LTD 29.634 20140508.0
115480 2014 204.821 DORIAN LPG LTD 104.129 20140508.0
115481 2015 46.412 DORIAN LPG LTD 289.208 20140508.0
115482 2013 948.684 NOMAD FOODS LTD 2074.842 NaN
115483 2014 855.541 NOMAD FOODS LTD 1816.239 NaN
115484 2015 671.846 NOMAD FOODS LTD 971.013 NaN
115485 2016 347.688 NOMAD FOODS LTD 2034.109 NaN
115487 2014 2638.000 ATHENE HOLDING LTD 4100.000 20161209.0
115488 2015 2720.000 ATHENE HOLDING LTD 2616.000 20161209.0
115489 2016 2459.000 ATHENE HOLDING LTD 4107.000 20161209.0
115490 2013 3.956 MIDATECH PHARMA PLC 0.244 NaN
115491 2014 47.240 MIDATECH PHARMA PLC 0.245 NaN
115492 2015 23.852 MIDATECH PHARMA PLC 2.028 NaN
115493 2016 21.723 MIDATECH PHARMA PLC 8.541 NaN
I implemented something like this (.ix is deprecated, so .loc is used below):
d = sm.loc[sm['conm'] == 'AAR CORP']
dt = d.loc[d.fyear == 1996, 'sale'].values[0]
dtp1 = d.loc[d.fyear == 1997, 'sale'].values[0]
sg = (dtp1 - dt) / dt * 100
d.loc[d.fyear == 1997, 'sg'] = sg
and it gives me the column at the end.
fyear ch conm sale ipodate sg
0 1996 51.705 AAR CORP 589.328 NaN NaN
1 1997 17.222 AAR CORP 782.123 NaN 32.71438
2 1998 8.250 AAR CORP 918.036 NaN NaN
I want the 'sg' column to be next to the 'sale' column, and I want to calculate this sales growth rate for every company and every year (1996-2015), inserting it into the row for year t. Currently I slice the frame by company name into small DataFrames and then calculate the growth rate, but since I have more than 9,000 unique company names, this method seems inefficient. Can I do this without slicing by every company name? Thank you in advance.

Take a look at pct_change:
df['sg'] = df['sale'].pct_change() * 100
Note that on its own this compares consecutive rows across company boundaries, so combine it with a groupby on 'conm' to keep companies separate.
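A minimal grouped sketch (a hypothetical toy frame standing in for `sm`): grouping by `conm` keeps the first year of each company as NaN instead of borrowing the previous company's last sale.

```python
import pandas as pd

# Toy stand-in for the `sm` frame; values are hypothetical.
df = pd.DataFrame({
    'fyear': [1996, 1997, 1998, 1997, 1998],
    'conm': ['AAR CORP', 'AAR CORP', 'AAR CORP', 'OTHER CO', 'OTHER CO'],
    'sale': [100.0, 150.0, 300.0, 50.0, 100.0],
})

df = df.sort_values(['conm', 'fyear'])
# pct_change within each company: the first row of each firm stays NaN.
df['sg'] = df.groupby('conm')['sale'].pct_change() * 100
print(df['sg'].tolist())  # [nan, 50.0, 100.0, nan, 100.0]
```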

# sort df first
df.sort_values(by=['conm', 'fyear'], inplace=True)
# groupby company, compute the rolling increase for each company,
# and insert it right after 'sale' (raw=True so y is a positional ndarray)
df.insert(df.columns.tolist().index('sale') + 1, 'sm',
          df.groupby(by=['conm'])['sale']
            .apply(lambda x: x.rolling(2).apply(
                lambda y: (y[1] - y[0]) / y[0] * 100, raw=True)))
df
Out[296]:
fyear ch conm sale sm ipodate
0 1996 51.705 AAR CORP 589.328 NaN NaN
1 1997 17.222 AAR CORP 782.123 32.714380 NaN
2 1998 8.250 AAR CORP 918.036 17.377446 NaN
3 1999 1.241 AAR CORP 1024.333 11.578740 NaN
4 2000 13.809 AAR CORP 874.255 -14.651290 NaN
5 2001 34.522 AAR CORP 638.721 -26.941110 NaN
6 2002 29.154 AAR CORP 606.337 -5.070132 NaN
7 2003 41.010 AAR CORP 651.958 7.524034 NaN
8 2004 40.508 AAR CORP 747.848 14.708003 NaN
9 2005 121.738 AAR CORP 897.284 19.982135 NaN
10 2006 83.317 AAR CORP 1061.169 18.264563 NaN
11 2007 109.391 AAR CORP 1384.919 30.508807 NaN
12 2008 112.505 AAR CORP 1423.976 2.820165 NaN
13 2009 79.370 AAR CORP 1352.151 -5.043975 NaN
14 2010 57.433 AAR CORP 1775.782 31.330155 NaN
15 2011 67.720 AAR CORP 2074.498 16.821659 NaN
16 2012 75.300 AAR CORP 2167.100 4.463827 NaN
17 2013 89.200 AAR CORP 2035.000 -6.095704 NaN
18 2014 54.700 AAR CORP 1594.300 -21.656020 NaN
19 2015 31.200 AAR CORP 1662.600 4.284012 NaN
20 1997 64.000 AMERICAN AIRLINES GROUP INC 18570.000 NaN NaN
21 1998 95.000 AMERICAN AIRLINES GROUP INC 19205.000 3.419494 NaN
22 1999 85.000 AMERICAN AIRLINES GROUP INC 17730.000 -7.680292 NaN
23 2000 89.000 AMERICAN AIRLINES GROUP INC 19703.000 11.128032 NaN
24 2001 120.000 AMERICAN AIRLINES GROUP INC 18963.000 -3.755773 NaN

Another possible solution (since it does not use apply(), it is potentially faster):
df['sm'] = (df.sort_values(['conm', 'fyear'])\
.groupby('conm')['sale']\
.diff()\
.shift(-1) / df['sale']).shift() * 100
This solution assumes that consecutive rows of the same company are always exactly one year apart.
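A quick cross-check of the diff/shift pipeline above on a toy frame (hypothetical values, already sorted by company and year) against a plain grouped pct_change:

```python
import pandas as pd

# Toy frame, already sorted by conm/fyear; values are hypothetical.
df = pd.DataFrame({
    'conm': ['A', 'A', 'A', 'B', 'B'],
    'fyear': [1996, 1997, 1998, 1996, 1997],
    'sale': [100.0, 150.0, 300.0, 50.0, 100.0],
})

# The diff/shift pipeline from the answer above.
sg = (df.sort_values(['conm', 'fyear'])
        .groupby('conm')['sale']
        .diff()
        .shift(-1) / df['sale']).shift() * 100

# Same numbers via a grouped pct_change, as a sanity check.
sg_check = df.groupby('conm')['sale'].pct_change() * 100
print(sg.tolist())  # [nan, 50.0, 100.0, nan, 100.0]
```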

Related

Identify which country won the most gold, in each olympic games

When I write the below codes in pandas
gold.groupby(['Games','country'])['Medal'].value_counts()
I get the result below. How do I extract the top medal winner for each Games? The result should list every Games together with the country that won the most medals and its medal tally.
Games country Medal
1896 Summer Australia Gold 2
Austria Gold 2
Denmark Gold 1
France Gold 5
Germany Gold 25
...
2016 Summer UK Gold 64
USA Gold 139
Ukraine Gold 2
Uzbekistan Gold 4
Vietnam Gold 1
Name: Medal, Length: 1101, dtype: int64
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal country notes
68 17294 Cai Yalin M 23.0 174.0 60.0 China CHN 2000 Summer 2000 Summer Sydney Shooting Shooting Men's Air Rifle, 10 metres Gold China NaN
77 17299 Cai Yun M 32.0 181.0 68.0 China-1 CHN 2012 Summer 2012 Summer London Badminton Badminton Men's Doubles Gold China NaN
87 17995 Cao Lei F 24.0 168.0 75.0 China CHN 2008 Summer 2008 Summer Beijing Weightlifting Weightlifting Women's Heavyweight Gold China NaN
104 18005 Cao Yuan M 17.0 160.0 42.0 China CHN 2012 Summer 2012 Summer London Diving Diving Men's Synchronized Platform Gold China NaN
105 18005 Cao Yuan M 21.0 160.0 42.0 China CHN 2016 Summer 2016 Summer Rio de Janeiro Diving Diving Men's Springboard Gold China NaN
The data: your data only included Chinese gold medal winners, so I added a row:
ID Name Sex Age Height Weight Team NOC \
0 17294 Cai Yalin M 23.0 174.0 60.0 China CHN
1 17299 Cai Yun M 32.0 181.0 68.0 China-1 CHN
2 17995 Cao Lei F 24.0 168.0 75.0 China CHN
3 18005 Cao Yuan M 17.0 160.0 42.0 China CHN
4 18005 Cao Yuan M 21.0 160.0 42.0 China CHN
5 292929 Serge de Gosson M 52.0 178.0 69.0 France FR
Games Year Season City Sport \
0 2000 Summer 2000 Summer Sydney Shooting
1 2012 Summer 2012 Summer London Badminton
2 2008 Summer 2008 Summer Beijing Weightlifting
3 2012 Summer 2012 Summer London Diving
4 2016 Summer 2016 Summer Rio de Janeiro Diving
5 2022 Summer 2022 Summer Stockholm Calisthenics
Event Medal country notes
0 Shooting Men's Air Rifle, 10 metres Gold China NaN
1 Badminton Men's Doubles Gold China NaN
2 Weightlifting Women's Heavyweight Gold China NaN
3 Diving Men's Synchronized Platform Gold China NaN
4 Diving Men's Springboard Gold China NaN
5 Planche Gold France NaN
You want to do exactly what you did, but sort the data and keep the top row:
gold.groupby(['Games','country'])['Medal'].value_counts().groupby(level=0, group_keys=False).head(1)
Which returns:
Games country Medal
2000 Summer China Gold 1
2008 Summer China Gold 1
2012 Summer China Gold 2
2016 Summer China Gold 1
2022 Summer France Gold 1
Name: Medal, dtype: int64
or as a dataframe:
GOLD_TOP = pd.DataFrame(gold.groupby(['Games','country'])['Medal'].value_counts().groupby(level=0, group_keys=False).head(1))
df_gold = df[df["Medal"]=="Gold"].groupby("Team").Medal.count().reset_index()
df_gold = df_gold.sort_values(by="Medal",ascending=False)[:8]
df_gold
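On a toy frame (hypothetical rows) the approach looks as follows. Note the extra sort_values(ascending=False): it guarantees that head(1) picks the largest tally per Games even when several countries appear in the same Games, since value_counts only sorts within each (Games, country) group.

```python
import pandas as pd

# Hypothetical miniature `gold` frame.
gold = pd.DataFrame({
    'Games': ['2000 Summer', '2000 Summer', '2000 Summer',
              '2012 Summer', '2012 Summer'],
    'country': ['China', 'China', 'USA', 'France', 'France'],
    'Medal': ['Gold'] * 5,
})

counts = gold.groupby(['Games', 'country'])['Medal'].value_counts()
# Sort by count first so head(1) per Games keeps the biggest winner.
top = counts.sort_values(ascending=False).groupby(level=0, group_keys=False).head(1)
print(top)
```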

Set "Year" column to individual columns to create a panel

I am trying to reshape the following dataframe into panel-data form, so that each year in the "Year" column becomes an individual column.
Out[34]:
Award Year 0
State
Alabama 2003 89
Alabama 2004 92
Alabama 2005 108
Alabama 2006 81
Alabama 2007 71
... ...
Wyoming 2011 4
Wyoming 2012 2
Wyoming 2013 1
Wyoming 2014 4
Wyoming 2015 3
[648 rows x 2 columns]
I want the years to each be individual columns; this is an example:
Out[48]:
State 2003 2004 2005 2006
0 NewYork 10 10 10 10
1 Alabama 15 15 15 15
2 Washington 20 20 20 20
I have read up on stack/unstack, but I don't think I want a multi-level index as a result. I have been looking through the documentation at to_frame etc., but I can't see what I am looking for.
If anyone can help that would be great!
Use set_index with append=True then select the column 0 and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
Pivot Table can help.
df2 = pd.pivot_table(df,values='0', columns='AwardYear', index=['State'])
df2
Result:
AwardYear 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
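Both answers produce the same wide table; here is a toy sketch (hypothetical values) showing the two routes side by side. Note that pivot_table aggregates with mean by default, so its result comes back as floats:

```python
import pandas as pd

# Hypothetical long-form frame like the one in the question.
df = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Wyoming', 'Wyoming'],
    'Award Year': [2003, 2004, 2003, 2004],
    '0': [89, 92, 5, 4],
})

# Route 1: set_index + unstack.
wide = df.set_index(['State', 'Award Year'])['0'].unstack()

# Route 2: pivot_table (aggregates with mean by default, hence floats).
pivoted = pd.pivot_table(df, values='0', columns='Award Year', index='State')
print(wide)
```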

Why is that when I use pandas to scrape a table from a website it skips the middle columns and only prints the first 2 and last 2

I am currently working on a program that scrapes Yahoo Finance Earnings Calendar Page and stores the data in a file. I am able to scrape the data but I am confused as to why it only scrapes the first 2 and last 2 columns. I also tried to do the same with a table on Wikipedia for List of S&P 500 Companies and am running into the same problem. Any help is appreciated.
Yahoo Finance Code
import csv
import pandas as pd
earnings = pd.read_html('https://finance.yahoo.com/calendar/earnings?day=2019-11-19')[0]
fileName = "testFile"
with open(fileName + ".csv", mode='w') as csv_file:
writer = csv.writer(csv_file)
writer.writerow([earnings])
print(earnings)
Wikipedia Code
import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url) # Returns list of all tables on page
sp500_table = tables[0] # Select table of interest
print(sp500_table)
~EDIT~
Here is the output I get from the Yahoo Finance Code
" Symbol Company ... Reported EPS Surprise(%)
0 WUBA 58.com Inc ... NaN NaN
1 ARMK Aramark ... NaN NaN
2 AFMD Affimed NV ... NaN NaN
3 NJR New Jersey Resources Corp ... NaN NaN
4 ECCB Eagle Point Credit Company Inc ... NaN NaN
5 TOUR Tuniu Corp ... NaN NaN
6 EIC Eagle Point Income Company Inc ... NaN NaN
7 KSS Kohls Corp ... NaN NaN
8 JKS JinkoSolar Holding Co Ltd ... NaN NaN
9 DL China Distance Education Holdings Ltd ... NaN NaN
10 TJX TJX Companies Inc ... NaN NaN
11 HD Home Depot Inc ... NaN NaN
12 PAGS PagSeguro Digital Ltd ... NaN NaN
13 ESE ESCO Technologies Inc ... NaN NaN
14 RADA Rada Electronic Industries Ltd ... NaN NaN
15 RADA Rada Electronic Industries Ltd ... NaN NaN
16 DAVA Endava PLC ... NaN NaN
17 FALC FalconStor Software Inc ... NaN NaN
18 GVP GSE Systems Inc ... NaN NaN
19 TDG TransDigm Group Inc ... NaN NaN
20 PPDF PPDAI Group Inc ... NaN NaN
21 GRBX Greenbox Pos ... NaN NaN
22 THMO Thermogenesis Holdings Inc ... NaN NaN
23 MMS Maximus Inc ... NaN NaN
24 NXTD NXT-ID Inc ... NaN NaN
25 URBN Urban Outfitters Inc ... NaN NaN
26 SINT SINTX Technologies Inc ... NaN NaN
27 ORNC Oranco Inc ... NaN NaN
28 LAIX LAIX Inc ... NaN NaN
29 MDT Medtronic PLC ... NaN NaN
[30 rows x 6 columns]"
Here is the output I get from Wikipedia Code
Symbol Security ... CIK Founded
0 MMM 3M Company ... 66740 1902
1 ABT Abbott Laboratories ... 1800 1888
2 ABBV AbbVie Inc. ... 1551152 2013 (1888)
3 ABMD ABIOMED Inc ... 815094 1981
4 ACN Accenture plc ... 1467373 1989
5 ATVI Activision Blizzard ... 718877 2008
6 ADBE Adobe Systems Inc ... 796343 1982
7 AMD Advanced Micro Devices Inc ... 2488 1969
8 AAP Advance Auto Parts ... 1158449 1932
9 AES AES Corp ... 874761 1981
10 AMG Affiliated Managers Group Inc ... 1004434 1993
11 AFL AFLAC Inc ... 4977 1955
12 A Agilent Technologies Inc ... 1090872 1999
13 APD Air Products & Chemicals Inc ... 2969 1940
14 AKAM Akamai Technologies Inc ... 1086222 1998
15 ALK Alaska Air Group Inc ... 766421 1985
16 ALB Albemarle Corp ... 915913 1994
17 ARE Alexandria Real Estate Equities ... 1035443 1994
18 ALXN Alexion Pharmaceuticals ... 899866 1992
19 ALGN Align Technology ... 1097149 1997
20 ALLE Allegion ... 1579241 1908
21 AGN Allergan, Plc ... 1578845 1983
22 ADS Alliance Data Systems ... 1101215 1996
23 LNT Alliant Energy Corp ... 352541 1917
24 ALL Allstate Corp ... 899051 1931
25 GOOGL Alphabet Inc Class A ... 1652044 1998
26 GOOG Alphabet Inc Class C ... 1652044 1998
27 MO Altria Group Inc ... 764180 1985
28 AMZN Amazon.com Inc. ... 1018724 1994
29 AMCR Amcor plc ... 1748790 NaN
.. ... ... ... ... ...
475 VIAB Viacom Inc. ... 1339947 NaN
476 V Visa Inc. ... 1403161 NaN
477 VNO Vornado Realty Trust ... 899689 NaN
478 VMC Vulcan Materials ... 1396009 NaN
479 WAB Wabtec Corporation ... 943452 NaN
480 WMT Walmart ... 104169 NaN
481 WBA Walgreens Boots Alliance ... 1618921 NaN
482 DIS The Walt Disney Company ... 1001039 NaN
483 WM Waste Management Inc. ... 823768 1968
484 WAT Waters Corporation ... 1000697 1958
485 WEC Wec Energy Group Inc ... 783325 NaN
486 WCG WellCare ... 1279363 NaN
487 WFC Wells Fargo ... 72971 NaN
488 WELL Welltower Inc. ... 766704 NaN
489 WDC Western Digital ... 106040 NaN
490 WU Western Union Co ... 1365135 1851
491 WRK WestRock ... 1636023 NaN
492 WY Weyerhaeuser ... 106535 NaN
493 WHR Whirlpool Corp. ... 106640 1911
494 WMB Williams Cos. ... 107263 NaN
495 WLTW Willis Towers Watson ... 1140536 NaN
496 WYNN Wynn Resorts Ltd ... 1174922 NaN
497 XEL Xcel Energy Inc ... 72903 1909
498 XRX Xerox ... 108772 1906
499 XLNX Xilinx ... 743988 NaN
500 XYL Xylem Inc. ... 1524472 NaN
501 YUM Yum! Brands Inc ... 1041061 NaN
502 ZBH Zimmer Biomet Holdings ... 1136869 NaN
503 ZION Zions Bancorp ... 109380 NaN
504 ZTS Zoetis ... 1555280 NaN
[505 rows x 9 columns]
As you can see in both examples, the printed table omits the columns in the middle and only displays the first and last two.
~EDIT#2~
Making this change to the code now displays all columns, but it does so in two separate tables instead. Any idea why it does this?
fileName = "yahooFinance_Pandas"
with pd.option_context('display.max_columns', None): # more options can be specified also
with open(fileName + ".csv", mode='w') as csv_file:
writer = csv.writer(csv_file)
writer.writerow([earnings])
OUTPUT
" Symbol Company Earnings Call Time \
0 WUBA 58.com Inc Before Market Open
1 ARMK Aramark Before Market Open
2 AFMD Affimed NV TAS
3 NJR New Jersey Resources Corp Before Market Open
4 ECCB Eagle Point Credit Company Inc Before Market Open
5 TOUR Tuniu Corp Before Market Open
6 EIC Eagle Point Income Company Inc Before Market Open
7 KSS Kohls Corp Before Market Open
8 JKS JinkoSolar Holding Co Ltd Before Market Open
9 DL China Distance Education Holdings Ltd After Market Close
10 TJX TJX Companies Inc Before Market Open
11 HD Home Depot Inc Before Market Open
12 PAGS PagSeguro Digital Ltd TAS
13 ESE ESCO Technologies Inc After Market Close
14 RADA Rada Electronic Industries Ltd TAS
15 RADA Rada Electronic Industries Ltd Before Market Open
16 DAVA Endava PLC TAS
17 FALC FalconStor Software Inc After Market Close
18 GVP GSE Systems Inc TAS
19 TDG TransDigm Group Inc Before Market Open
20 PPDF PPDAI Group Inc Before Market Open
21 GRBX Greenbox Pos Time Not Supplied
22 THMO Thermogenesis Holdings Inc After Market Close
23 MMS Maximus Inc TAS
24 NXTD NXT-ID Inc TAS
25 URBN Urban Outfitters Inc After Market Close
26 SINT SINTX Technologies Inc Time Not Supplied
27 ORNC Oranco Inc Time Not Supplied
28 LAIX LAIX Inc After Market Close
29 MDT Medtronic PLC TAS
EPS Estimate Reported EPS Surprise(%)
0 0.82 NaN NaN
1 0.69 NaN NaN
2 -0.17 NaN NaN
3 0.28 NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 0.86 NaN NaN
8 0.83 NaN NaN
9 0.33 NaN NaN
10 0.66 NaN NaN
11 2.52 NaN NaN
12 0.29 NaN NaN
13 1.06 NaN NaN
14 -0.02 NaN NaN
15 -0.02 NaN NaN
16 21.21 NaN NaN
17 NaN NaN NaN
18 0.03 NaN NaN
19 5.16 NaN NaN
20 0.26 NaN NaN
21 NaN NaN NaN
22 -0.12 NaN NaN
23 0.94 NaN NaN
24 NaN NaN NaN
25 0.57 NaN NaN
26 NaN NaN NaN
27 NaN NaN NaN
28 -0.32 NaN NaN
29 1.28 NaN NaN "
~EDIT#3~
Made this change as you requested, @Alex:
earnings.to_csv(r'C:\Users\akkir\Desktop\pythonSelenium\export_dataframe.csv', index = None)
OUTPUT
Symbol,Company,Earnings Call Time,EPS Estimate,Reported EPS,Surprise(%)
ATTO,Atento SA,TAS,0.09,0.03,-66.67
ALPN,Alpine Immune Sciences Inc,TAS,-0.68,-0.62,8.82
ALPN,Alpine Immune Sciences Inc,Time Not Supplied,-0.68,-0.62,8.82
HOLI,Hollysys Automation Technologies Ltd,TAS,0.48,0.49,2.08
IDSA,Industrial Services of America Inc,After Market Close,,,
AGRO,Adecoagro SA,TAS,-0.01,,
ATOS,Atossa Genetics Inc,TAS,-0.52,-0.36,30.77
AXAS,Abraxas Petroleum Corp,TAS,0.03,0.02,-33.33
ACIU,AC Immune SA,TAS,0.17,0.25,47.06
ARCO,Arcos Dorados Holdings Inc,TAS,0.08,0.13,62.5
WTER,Alkaline Water Company Inc,Time Not Supplied,-0.07,-0.07,
ALNA,Allena Pharmaceuticals Inc,Before Market Open,-0.49,-0.57,-16.33
AEYE,AudioEye Inc,TAS,-0.26,-0.27,-3.85
APLT,Applied Therapeutics Inc,Before Market Open,-0.49,-0.63,-28.57
ALT,Altimmune Inc,TAS,-0.19,-0.73,-284.21
ABEOW,Abeona Therapeutics Inc,TAS,,,
ACER,Acer Therapeutics Inc,After Market Close,-0.57,-0.52,8.77
SRNN,Southern Banc Company Inc,Time Not Supplied,,,
SPB,Spectrum Brands Holdings Inc,Before Market Open,1.11,1.13,1.8
BIOC,Biocept Inc,TAS,-0.27,-0.25,7.41
IDXG,Interpace Biosciences Inc,TAS,-0.19,-0.19,
GTBP,GT Biopharma Inc,After Market Close,,,
MTNB,Matinas BioPharma Holdings Inc,Time Not Supplied,-0.03,-0.03,
MTNB,Matinas BioPharma Holdings Inc,TAS,-0.03,-0.03,
XELB,Xcel Brands Inc,After Market Close,0.12,0.06,-50.0
BBI,Brickell Biotech Inc,After Market Close,,,
SNBP,Sun Biopharma Inc,Before Market Open,,,
BZH,Beazer Homes USA Inc,TAS,0.51,0.08,-84.31
SELB,Selecta Biosciences Inc,TAS,-0.33,-0.26,21.21
BEST,BEST Inc,Before Market Open,,0.01,
CBPO,China Biologic Products Holdings Inc,TAS,0.88,1.4,59.09
TPCS,TechPrecision Corp,TAS,,,
LK,Luckin Coffee Inc,Before Market Open,-0.37,-0.32,13.51
CYD,China Yuchai International Ltd,Before Market Open,0.45,0.17,-62.22
CCF,Chase Corp,After Market Close,,,
SMCI,Super Micro Computer Inc,After Market Close,,,
AUMN,Golden Minerals Co,TAS,,,
PGR,Progressive Corp,Before Market Open,1.3,1.33,2.31
PUMP,ProPetro Holding Corp,TAS,0.51,0.33,-35.29
CPLG,CorePoint Lodging Inc,TAS,-0.44,-0.22,50.0
CHNG,Change Healthcare Inc,After Market Close,0.27,0.27,
NOVC,Novation Companies Inc,Time Not Supplied,,,
WFCF,Where Food Comes From Inc,Before Market Open,,,
CYCCP,Cyclacel Pharmaceuticals Inc,After Market Close,,,
ISCO,International Stem Cell Corp,Before Market Open,,,
CPA,Copa Holdings SA,TAS,2.23,2.45,9.87
CSCO,Cisco Systems Inc,TAS,0.81,0.84,3.7
GMDA,Gamida Cell Ltd,TAS,-0.36,-0.3,16.67
CHRA,Charah Solutions Inc,TAS,-0.05,-0.11,-120.0
MNI,McClatchy Co,TAS,-1.01,-0.16,84.16
ENSV,Enservco Corp,TAS,-0.06,-0.1,-66.67
TK,Teekay Corp,TAS,,,
SANW,S&W Seed Co,TAS,-0.15,-0.15,
SANW,S&W Seed Co,Before Market Open,-0.15,-0.15,
CMCM,Cheetah Mobile Inc,TAS,0.14,0.49,250.0
CYRN,Cyren Ltd,TAS,-0.07,-0.06,14.29
CATS,Catasys Inc,TAS,-0.32,-0.52,-62.5
GLAD,Gladstone Capital Corp,TAS,0.21,0.21,
PING,Ping Identity Holding Corp,After Market Close,0.01,0.13,1200.0
CRWS,Crown Crafts Inc,Before Market Open,0.18,0.18,
CTRP,Ctrip.Com International Ltd,After Market Close,0.29,,
GFF,Griffon Corp,After Market Close,0.33,0.4,21.21
CLIR,Clearsign Technologies Corp,After Market Close,,,
DMAC,DiaMedica Therapeutics Inc,After Market Close,,,
DSSI,Diamond S Shipping Inc,Time Not Supplied,-0.12,-0.19,-58.33
DSSI,Diamond S Shipping Inc,TAS,-0.12,-0.19,-58.33
DYAI,Dyadic International Inc,After Market Close,,,
ONE,OneSmart International Education Group Ltd,Before Market Open,,,
EFOI,Energy Focus Inc,Before Market Open,-0.15,-0.08,46.67
EDAP,Edap Tms SA,TAS,0.04,0.03,-25.0
EYEN,Eyenovia Inc,Before Market Open,-0.34,-0.29,14.71
EQS,EQUUS Total Return Inc,After Market Close,,,
SENR,Strategic Environmental & Energy Resources Inc,Before Market Open,,,
EPSN,Epsilon Energy Ltd,TAS,,,
GRMM,Grom Social Enterprises Inc,Before Market Open,,,
ECOR,"electroCore, Inc.",TAS,-0.31,-0.36,-16.13
SD,SandRidge Energy Inc,TAS,,,
ENR,Energizer Holdings Inc,TAS,0.81,0.93,14.81
ELMD,Electromed Inc,TAS,0.01,0.12,1100.0
EVK,Ever-Glory International Group Inc,TAS,,,
FTEK,Fuel Tech Inc,After Market Close,-0.03,-0.05,-66.67
FVRR,Fiverr International Ltd,Before Market Open,-0.19,-0.12,36.84
SGRP,SPAR Group Inc,TAS,,,
NSEC,National Security Group Inc,Time Not Supplied,,,
SNDL,Sundial Growers Inc,TAS,-0.08,,
SNDL,Sundial Growers Inc,Before Market Open,-0.08,,
TCOM,Trip.com Group Ltd,TAS,,,
RAVE,Rave Restaurant Group Inc,TAS,,,
SLGG,Super League Gaming Inc,After Market Close,-0.36,-0.43,-19.44
HI,Hillenbrand Inc,After Market Close,0.73,0.76,4.11
HROW,Harrow Health Inc,TAS,-0.24,-0.29,-20.83
NVGS,Navigator Holdings Ltd,TAS,-0.07,-0.01,85.71
INFU,InfuSystem Holdings Inc,Before Market Open,,,
OSW,OneSpaWorld Holdings Ltd,Before Market Open,0.12,0.11,-8.33
VIPS,Vipshop Holdings Ltd,TAS,0.17,0.25,47.06
PRTH,Priority Technology Holdings Inc,After Market Close,-0.12,-0.08,33.33
TGC,Tengasco Inc,TAS,,,
PRSP,Perspecta Inc,After Market Close,0.51,0.54,5.88
REED,Reed's Inc,After Market Close,-0.11,-0.14,-27.27
WSTL,Westell Technologies Inc,After Market Close,,,
As far as I can tell, this has nothing to do with the data and everything to do with its representation. Only the first and last few columns are printed, to keep the output from being massive and difficult to read. You can even see at the end of your output that your DataFrame has 9 columns.
Take a look here if you want to print the entire thing. You could also use .info to get some general information on your columns.
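The "two separate tables" in EDIT#2 are just pandas wrapping wide output to the display width; a sketch (hypothetical frame) showing how the display options control that, and that to_csv is unaffected either way:

```python
import pandas as pd

# Hypothetical wide frame; display options change only what print() shows.
df = pd.DataFrame({'col{}'.format(i): [1, 2] for i in range(12)})

# With the default display width pandas wraps wide output into several blocks,
# which is what looks like "two separate tables"; widening the display avoids it.
with pd.option_context('display.max_columns', None, 'display.width', 1000):
    print(df)

# to_csv ignores display options entirely: every column is always written.
header = df.to_csv(index=False).splitlines()[0]
print(header.count(',') + 1)  # 12
```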
Thanks to @AlexanderCécile for the help regarding this issue.
For those interested in how he fixed my issue the code is below.
import pandas as pd
from datetime import date
# Display options affect only printing; to_csv always writes every column.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
earnings = pd.read_html('https://finance.yahoo.com/calendar/earnings?day=2019-11-13')[0]
earnings.to_csv(r'C:\Users\<user>\Desktop\earnings_{}.csv'.format(date.today()), index=None)

pandas add columns conditions with groupby and on another column values

I have a pandas.DataFrame called companysubset, shown below (the actual data is much longer).
conm fyear dvpayout industry firmycount ipodate
46078 CAESARS ENTERTAINMENT CORP 2003 0.226813 Services 22 19891213.0
46079 CAESARS ENTERTAINMENT CORP 2004 0.226813 Services 22 19891213.0
46080 CAESARS ENTERTAINMENT CORP 2005 0.226813 Services 22 19891213.0
46091 CAESARS ENTERTAINMENT CORP 2016 0.226813 Services 22 19891213.0
114620 CAESARSTONE LTD 2010 0.487543 Manufacturing 10 20120322.0
114621 CAESARSTONE LTD 2011 0.487543 Manufacturing 10 20120322.0
114622 CAESARSTONE LTD 2012 0.487543 Manufacturing 10 20120322.0
114623 CAESARSTONE LTD 2013 0.487543 Manufacturing 10 20120322.0
114624 CAESARSTONE LTD 2014 0.487543 Manufacturing 10 20120322.0
114625 CAESARSTONE LTD 2015 0.487543 Manufacturing 10 20120322.0
114626 CAESARSTONE LTD 2016 0.487543 Manufacturing 10 20120322.0
132524 CAFEPRESS INC 2010 0.000000 Retail Trade 7 20120329.0
132525 CAFEPRESS INC 2011 0.000000 Retail Trade 7 20120329.0
132526 CAFEPRESS INC 2012 -0.000000 Retail Trade 7 20120329.0
132527 CAFEPRESS INC 2013 -0.000000 Retail Trade 7 20120329.0
132528 CAFEPRESS INC 2014 -0.000000 Retail Trade 7 20120329.0
132529 CAFEPRESS INC 2015 -0.000000 Retail Trade 7 20120329.0
132530 CAFEPRESS INC 2016 -0.000000 Retail Trade 7 20120329.0
120049 CAI INTERNATIONAL INC 2005 0.000000 Services 12 20070516.0
120050 CAI INTERNATIONAL INC 2006 0.000000 Services 12 20070516.0
3896 CALAMP CORP 1999 -0.000000 Manufacturing 23 NaN
3897 CALAMP CORP 2000 0.000000 Manufacturing 23 NaN
3898 CALAMP CORP 2001 0.000000 Manufacturing 23 NaN
3899 CALAMP CORP 2002 0.000000 Manufacturing 23 NaN
21120 CALATLANTIC GROUP INC 1995 -0.133648 Construction 22 NaN
21121 CALATLANTIC GROUP INC 1996 -0.133648 Construction 22 NaN
21122 CALATLANTIC GROUP INC 1997 -0.133648 Construction 22 NaN
21123 CALATLANTIC GROUP INC 1998 -0.133648 Construction 22 NaN
21124 CALATLANTIC GROUP INC 1999 -0.133648 Construction 22 NaN
21125 CALATLANTIC GROUP INC 2000 -0.133648 Construction 22 NaN
21126 CALATLANTIC GROUP INC 2001 -0.133648 Construction 22 NaN
21127 CALATLANTIC GROUP INC 2002 -0.133648 Construction 22 NaN
21128 CALATLANTIC GROUP INC 2003 -0.133648 Construction 22 NaN
1) I want to calculate the quartile of each company's dvpayout within its industry, and add a column called dv indicating whether it is in Q1, Q2, Q3 or Q4.
I came up with this code, but it does not work.
pd.cut(companysubset['dvpayout'].mean(), bins=[0,25,75,100], labels=False)
2) I want to add a column called age if there is an ipodate. The value should be the largest fyear minus the year of ipodate (e.g. 2016 - 1989 for CAESARS ENTERTAINMENT CORP).
The results data frame I want to see is like below.
conm fyear dvpayout industry firmycount ipodate dv age
46078 CAESARS ... 2003 0.226813 Services 22 19891213.0 Q2 27
46079 CAESARS ... 2004 0.226813 Services 22 19891213.0 Q2 27
46080 CAESARS ... 2005 0.226813 Services 22 19891213.0 Q2 27
46091 CAESARS ... 2016 0.226813 Services 22 19891213.0 Q2 27
114620 CAESARSTONE LTD 2010 0.487543 Manufacturing 10 20120322.0 Q3 4
114621 CAESARSTONE LTD 2011 0.487543 Manufacturing 10 20120322.0 Q3 4
114622 CAESARSTONE LTD 2012 0.487543 Manufacturing 10 20120322.0 Q3 4
114623 CAESARSTONE LTD 2013 0.487543 Manufacturing 10 20120322.0 Q3 4
114624 CAESARSTONE LTD 2014 0.487543 Manufacturing 10 20120322.0 Q3 4
114625 CAESARSTONE LTD 2015 0.487543 Manufacturing 10 20120322.0 Q3 4
114626 CAESARSTONE LTD 2016 0.487543 Manufacturing 10 20120322.0 Q3 4
132524 CAFEPRESS INC 2010 0.000000 Retail Trade 7 20120329.0 Q1 4
132525 CAFEPRESS INC 2011 0.000000 Retail Trade 7 20120329.0 Q1 4
132526 CAFEPRESS INC 2012 -0.000000 Retail Trade 7 20120329.0 Q1 4
132527 CAFEPRESS INC 2013 -0.000000 Retail Trade 7 20120329.0 Q1 4
132528 CAFEPRESS INC 2014 -0.000000 Retail Trade 7 20120329.0 Q1 4
132529 CAFEPRESS INC 2015 -0.000000 Retail Trade 7 20120329.0 Q1 4
132530 CAFEPRESS INC 2016 -0.000000 Retail Trade 7 20120329.0 Q1 4
120049 CAI INTERNATIONAL INC 2006 0.000000 Services 12 20070516.0 Q1 0
120050 CAI INTERNATIONAL INC 2007 0.000000 Services 12 20070516.0 Q1 0
3896 CALAMP CORP 1999 -0.000000 Manufacturing 23 NaN Q1 NaN
3897 CALAMP CORP 2000 0.000000 Manufacturing 23 NaN Q1 NaN
3898 CALAMP CORP 2001 0.000000 Manufacturing 23 NaN Q1 NaN
3899 CALAMP CORP 2002 0.000000 Manufacturing 23 NaN Q1 NaN
21120 CALATLANTIC GROUP INC 1995 -0.133648 Construction 22 NaN Q1 NaN
21121 CALATLANTIC GROUP INC 1996 -0.133648 Construction 22 NaN Q1 NaN
21122 CALATLANTIC GROUP INC 1997 -0.133648 Construction 22 NaN Q1 NaN
21123 CALATLANTIC GROUP INC 1998 -0.133648 Construction 22 NaN Q1 NaN
21124 CALATLANTIC GROUP INC 1999 -0.133648 Construction 22 NaN Q1 NaN
21125 CALATLANTIC GROUP INC 2000 -0.133648 Construction 22 NaN Q1 NaN
21126 CALATLANTIC GROUP INC 2001 -0.133648 Construction 22 NaN Q1 NaN
21127 CALATLANTIC GROUP INC 2002 -0.133648 Construction 22 NaN Q1 NaN
21128 CALATLANTIC GROUP INC 2003 -0.133648 Construction 22 NaN Q1 NaN
Thanks in advance!!!!
The age column can be generated with:
Code
df.set_index(['conm'], inplace=True)
df['age'] = df.groupby(level=0).apply(
lambda x: max(x.fyear) - round(x.ipodate.iloc[0]/10000-0.5))
Test Code:
df = pd.read_fwf(StringIO(
u"""
ID conm fyear ipodate
46078 CAESARS ENTERTAINMENT 2003 19891213.0
46079 CAESARS ENTERTAINMENT 2004 19891213.0
46080 CAESARS ENTERTAINMENT 2005 19891213.0
46091 CAESARS ENTERTAINMENT 2016 19891213.0
114620 CAESARSTONE LTD 2010 20120322.0
114621 CAESARSTONE LTD 2011 20120322.0
114622 CAESARSTONE LTD 2012 20120322.0
114623 CAESARSTONE LTD 2013 20120322.0
114624 CAESARSTONE LTD 2014 20120322.0
114625 CAESARSTONE LTD 2015 20120322.0
114626 CAESARSTONE LTD 2016 20120322.0
132524 CAFEPRESS INC 2010 20120329.0
132525 CAFEPRESS INC 2011 20120329.0
132526 CAFEPRESS INC 2012 20120329.0
132527 CAFEPRESS INC 2013 20120329.0
132528 CAFEPRESS INC 2014 20120329.0
132529 CAFEPRESS INC 2015 20120329.0
132530 CAFEPRESS INC 2016 20120329.0
120049 CAI INTERNATIONAL INC 2005 20070516.0
120050 CAI INTERNATIONAL INC 2006 20070516.0
3897 CALAMP CORP 2000 NaN
3898 CALAMP CORP 2001 NaN
3896 CALAMP CORP 1999 NaN
3899 CALAMP CORP 2002 NaN
21120 CALATLANTIC GROUP INC 1995 NaN
21121 CALATLANTIC GROUP INC 1996 NaN
21122 CALATLANTIC GROUP INC 1997 NaN
21123 CALATLANTIC GROUP INC 1998 NaN
21124 CALATLANTIC GROUP INC 1999 NaN
21125 CALATLANTIC GROUP INC 2000 NaN
21126 CALATLANTIC GROUP INC 2001 NaN
21127 CALATLANTIC GROUP INC 2002 NaN
21128 CALATLANTIC GROUP INC 2003 NaN"""),
header=1)
df.set_index(['conm'], inplace=True)
df['age'] = df.groupby(level=0).apply(
lambda x: max(x.fyear) - round(x.ipodate.iloc[0]/10000-0.5))
print(df)
Results:
ID fyear ipodate age
conm
CAESARS ENTERTAINMENT 46078 2003 19891213.0 27.0
CAESARS ENTERTAINMENT 46079 2004 19891213.0 27.0
CAESARS ENTERTAINMENT 46080 2005 19891213.0 27.0
CAESARS ENTERTAINMENT 46091 2016 19891213.0 27.0
CAESARSTONE LTD 114620 2010 20120322.0 4.0
CAESARSTONE LTD 114621 2011 20120322.0 4.0
CAESARSTONE LTD 114622 2012 20120322.0 4.0
CAESARSTONE LTD 114623 2013 20120322.0 4.0
CAESARSTONE LTD 114624 2014 20120322.0 4.0
CAESARSTONE LTD 114625 2015 20120322.0 4.0
CAESARSTONE LTD 114626 2016 20120322.0 4.0
CAFEPRESS INC 132524 2010 20120329.0 4.0
CAFEPRESS INC 132525 2011 20120329.0 4.0
CAFEPRESS INC 132526 2012 20120329.0 4.0
CAFEPRESS INC 132527 2013 20120329.0 4.0
CAFEPRESS INC 132528 2014 20120329.0 4.0
CAFEPRESS INC 132529 2015 20120329.0 4.0
CAFEPRESS INC 132530 2016 20120329.0 4.0
CAI INTERNATIONAL INC 120049 2005 20070516.0 -1.0
CAI INTERNATIONAL INC 120050 2006 20070516.0 -1.0
CALAMP CORP 3897 2000 NaN NaN
CALAMP CORP 3898 2001 NaN NaN
CALAMP CORP 3896 1999 NaN NaN
CALAMP CORP 3899 2002 NaN NaN
CALATLANTIC GROUP INC 21120 1995 NaN NaN
CALATLANTIC GROUP INC 21121 1996 NaN NaN
CALATLANTIC GROUP INC 21122 1997 NaN NaN
CALATLANTIC GROUP INC 21123 1998 NaN NaN
CALATLANTIC GROUP INC 21124 1999 NaN NaN
CALATLANTIC GROUP INC 21125 2000 NaN NaN
CALATLANTIC GROUP INC 21126 2001 NaN NaN
CALATLANTIC GROUP INC 21127 2002 NaN NaN
CALATLANTIC GROUP INC 21128 2003 NaN NaN
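The quartile ('dv') column from part 1) of the question isn't covered above; a minimal sketch, assuming equal-frequency quartiles of dvpayout within each industry, using groupby + pd.qcut (column and label names taken from the question, data values hypothetical):

```python
import pandas as pd

# Hypothetical frame: one dvpayout per company within a single industry.
df = pd.DataFrame({
    'conm': list('ABCDEFGH'),
    'industry': ['Services'] * 8,
    'dvpayout': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
})

# Equal-frequency quartiles per industry, labelled Q1..Q4.
df['dv'] = (df.groupby('industry')['dvpayout']
              .transform(lambda s: pd.qcut(s, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])))
print(df['dv'].tolist())
```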

Python Pandas pivot with values equal to simple function of specific column

import pandas as pd
olympics = pd.read_csv('olympics.csv')
Edition NOC Medal
0 1896 AUT Silver
1 1896 FRA Gold
2 1896 GER Gold
3 1900 HUN Bronze
4 1900 GBR Gold
5 1900 DEN Bronze
6 1900 USA Gold
7 1900 FRA Bronze
8 1900 FRA Silver
9 1900 USA Gold
10 1900 FRA Silver
11 1900 GBR Gold
12 1900 SUI Silver
13 1900 ZZX Gold
14 1904 HUN Gold
15 1904 USA Bronze
16 1904 USA Gold
17 1904 USA Silver
18 1904 CAN Gold
19 1904 USA Silver
I can pivot the data frame to get an aggregate value:
pivot = olympics.pivot_table(index='Edition', columns='NOC', values='Medal', aggfunc='count')
NOC AUT CAN DEN FRA GBR GER HUN SUI USA ZZX
Edition
1896 1.0 NaN NaN 1.0 NaN 1.0 NaN NaN NaN NaN
1900 NaN NaN 1.0 3.0 2.0 NaN 1.0 1.0 2.0 1.0
1904 NaN 1.0 NaN NaN NaN NaN 1.0 NaN 4.0 NaN
Rather than having the total number of medals in values=, I am interested in having a tuple (a triple) of (#Gold, #Silver, #Bronze), with (0,0,0) instead of NaN.
How do I do that succinctly and elegantly?
No need to use pivot_table, as unstack is perfectly fine with a tuple for a value:
use value_counts to count all medals
create a multi-index for all combinations of countries, editions, and medals
reindex with fill_value=0
counts = df.groupby(['Edition', 'NOC']).Medal.value_counts()
mux = pd.MultiIndex.from_product(
[c.values for c in counts.index.levels], names=counts.index.names)
counts = counts.reindex(mux, fill_value=0).unstack('Medal')
counts = counts[['Bronze', 'Silver', 'Gold']]
pd.Series([tuple(l) for l in counts.values.tolist()], counts.index).unstack()
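Run end to end on a toy medals frame (hypothetical rows), the recipe above gives one (#Bronze, #Silver, #Gold) tuple per edition and country, with (0, 0, 0) where a country won nothing:

```python
import pandas as pd

# Hypothetical miniature medals frame.
df = pd.DataFrame({
    'Edition': [1896, 1896, 1900, 1900, 1900],
    'NOC': ['FRA', 'GER', 'FRA', 'FRA', 'USA'],
    'Medal': ['Gold', 'Gold', 'Silver', 'Bronze', 'Gold'],
})

counts = df.groupby(['Edition', 'NOC']).Medal.value_counts()
# All (Edition, NOC, Medal) combinations, missing ones filled with 0.
mux = pd.MultiIndex.from_product(
    [c.values for c in counts.index.levels], names=counts.index.names)
counts = counts.reindex(mux, fill_value=0).unstack('Medal')
counts = counts[['Bronze', 'Silver', 'Gold']]
result = pd.Series([tuple(l) for l in counts.values.tolist()], counts.index).unstack()
print(result.loc[1896, 'FRA'])  # (0, 0, 1)
```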
