pandas: add columns with conditions using groupby and another column's values - python
I have a pandas.DataFrame called companysubset, shown below; the actual data is much longer.
conm fyear dvpayout industry firmycount ipodate
46078 CAESARS ENTERTAINMENT CORP 2003 0.226813 Services 22 19891213.0
46079 CAESARS ENTERTAINMENT CORP 2004 0.226813 Services 22 19891213.0
46080 CAESARS ENTERTAINMENT CORP 2005 0.226813 Services 22 19891213.0
46091 CAESARS ENTERTAINMENT CORP 2016 0.226813 Services 22 19891213.0
114620 CAESARSTONE LTD 2010 0.487543 Manufacturing 10 20120322.0
114621 CAESARSTONE LTD 2011 0.487543 Manufacturing 10 20120322.0
114622 CAESARSTONE LTD 2012 0.487543 Manufacturing 10 20120322.0
114623 CAESARSTONE LTD 2013 0.487543 Manufacturing 10 20120322.0
114624 CAESARSTONE LTD 2014 0.487543 Manufacturing 10 20120322.0
114625 CAESARSTONE LTD 2015 0.487543 Manufacturing 10 20120322.0
114626 CAESARSTONE LTD 2016 0.487543 Manufacturing 10 20120322.0
132524 CAFEPRESS INC 2010 0.000000 Retail Trade 7 20120329.0
132525 CAFEPRESS INC 2011 0.000000 Retail Trade 7 20120329.0
132526 CAFEPRESS INC 2012 -0.000000 Retail Trade 7 20120329.0
132527 CAFEPRESS INC 2013 -0.000000 Retail Trade 7 20120329.0
132528 CAFEPRESS INC 2014 -0.000000 Retail Trade 7 20120329.0
132529 CAFEPRESS INC 2015 -0.000000 Retail Trade 7 20120329.0
132530 CAFEPRESS INC 2016 -0.000000 Retail Trade 7 20120329.0
120049 CAI INTERNATIONAL INC 2005 0.000000 Services 12 20070516.0
120050 CAI INTERNATIONAL INC 2006 0.000000 Services 12 20070516.0
3896 CALAMP CORP 1999 -0.000000 Manufacturing 23 NaN
3897 CALAMP CORP 2000 0.000000 Manufacturing 23 NaN
3898 CALAMP CORP 2001 0.000000 Manufacturing 23 NaN
3899 CALAMP CORP 2002 0.000000 Manufacturing 23 NaN
21120 CALATLANTIC GROUP INC 1995 -0.133648 Construction 22 NaN
21121 CALATLANTIC GROUP INC 1996 -0.133648 Construction 22 NaN
21122 CALATLANTIC GROUP INC 1997 -0.133648 Construction 22 NaN
21123 CALATLANTIC GROUP INC 1998 -0.133648 Construction 22 NaN
21124 CALATLANTIC GROUP INC 1999 -0.133648 Construction 22 NaN
21125 CALATLANTIC GROUP INC 2000 -0.133648 Construction 22 NaN
21126 CALATLANTIC GROUP INC 2001 -0.133648 Construction 22 NaN
21127 CALATLANTIC GROUP INC 2002 -0.133648 Construction 22 NaN
21128 CALATLANTIC GROUP INC 2003 -0.133648 Construction 22 NaN
1) I want to calculate the quartile of dvpayout within each industry, and add a column called dv indicating whether it falls in Q1, Q2, Q3 or Q4.
I came up with this code, but it does not work:
pd.cut(companysubset['dvpayout'].mean(), bins=[0,25,75,100], labels=False)
2) I want to add a column called age whenever there is an ipodate. The value should be the largest fyear minus the year of the ipodate (e.g. 2016 - 1989 for CAESARS ENTERTAINMENT CORP).
The resulting data frame I want to see is like below.
conm fyear dvpayout industry firmycount ipodate dv age
46078 CAESARS ... 2003 0.226813 Services 22 19891213.0 Q2 27
46079 CAESARS ... 2004 0.226813 Services 22 19891213.0 Q2 27
46080 CAESARS ... 2005 0.226813 Services 22 19891213.0 Q2 27
46091 CAESARS ... 2016 0.226813 Services 22 19891213.0 Q2 27
114620 CAESARSTONE LTD 2010 0.487543 Manufacturing 10 20120322.0 Q3 4
114621 CAESARSTONE LTD 2011 0.487543 Manufacturing 10 20120322.0 Q3 4
114622 CAESARSTONE LTD 2012 0.487543 Manufacturing 10 20120322.0 Q3 4
114623 CAESARSTONE LTD 2013 0.487543 Manufacturing 10 20120322.0 Q3 4
114624 CAESARSTONE LTD 2014 0.487543 Manufacturing 10 20120322.0 Q3 4
114625 CAESARSTONE LTD 2015 0.487543 Manufacturing 10 20120322.0 Q3 4
114626 CAESARSTONE LTD 2016 0.487543 Manufacturing 10 20120322.0 Q3 4
132524 CAFEPRESS INC 2010 0.000000 Retail Trade 7 20120329.0 Q1 4
132525 CAFEPRESS INC 2011 0.000000 Retail Trade 7 20120329.0 Q1 4
132526 CAFEPRESS INC 2012 -0.000000 Retail Trade 7 20120329.0 Q1 4
132527 CAFEPRESS INC 2013 -0.000000 Retail Trade 7 20120329.0 Q1 4
132528 CAFEPRESS INC 2014 -0.000000 Retail Trade 7 20120329.0 Q1 4
132529 CAFEPRESS INC 2015 -0.000000 Retail Trade 7 20120329.0 Q1 4
132530 CAFEPRESS INC 2016 -0.000000 Retail Trade 7 20120329.0 Q1 4
120049 CAI INTERNATIONAL INC 2006 0.000000 Services 12 20070516.0 Q1 0
120050 CAI INTERNATIONAL INC 2007 0.000000 Services 12 20070516.0 Q1 0
3896 CALAMP CORP 1999 -0.000000 Manufacturing 23 NaN Q1 NaN
3897 CALAMP CORP 2000 0.000000 Manufacturing 23 NaN Q1 NaN
3898 CALAMP CORP 2001 0.000000 Manufacturing 23 NaN Q1 NaN
3899 CALAMP CORP 2002 0.000000 Manufacturing 23 NaN Q1 NaN
21120 CALATLANTIC GROUP INC 1995 -0.133648 Construction 22 NaN Q1 NaN
21121 CALATLANTIC GROUP INC 1996 -0.133648 Construction 22 NaN Q1 NaN
21122 CALATLANTIC GROUP INC 1997 -0.133648 Construction 22 NaN Q1 NaN
21123 CALATLANTIC GROUP INC 1998 -0.133648 Construction 22 NaN Q1 NaN
21124 CALATLANTIC GROUP INC 1999 -0.133648 Construction 22 NaN Q1 NaN
21125 CALATLANTIC GROUP INC 2000 -0.133648 Construction 22 NaN Q1 NaN
21126 CALATLANTIC GROUP INC 2001 -0.133648 Construction 22 NaN Q1 NaN
21127 CALATLANTIC GROUP INC 2002 -0.133648 Construction 22 NaN Q1 NaN
21128 CALATLANTIC GROUP INC 2003 -0.133648 Construction 22 NaN Q1 NaN
Thanks in advance!!!!
The age column can be generated with:
Code
df.set_index(['conm'], inplace=True)
df['age'] = df.groupby(level=0).apply(
    lambda x: max(x.fyear) - round(x.ipodate.iloc[0] / 10000 - 0.5))
Test Code:
import pandas as pd
from io import StringIO

df = pd.read_fwf(StringIO(
u"""
ID conm fyear ipodate
46078 CAESARS ENTERTAINMENT 2003 19891213.0
46079 CAESARS ENTERTAINMENT 2004 19891213.0
46080 CAESARS ENTERTAINMENT 2005 19891213.0
46091 CAESARS ENTERTAINMENT 2016 19891213.0
114620 CAESARSTONE LTD 2010 20120322.0
114621 CAESARSTONE LTD 2011 20120322.0
114622 CAESARSTONE LTD 2012 20120322.0
114623 CAESARSTONE LTD 2013 20120322.0
114624 CAESARSTONE LTD 2014 20120322.0
114625 CAESARSTONE LTD 2015 20120322.0
114626 CAESARSTONE LTD 2016 20120322.0
132524 CAFEPRESS INC 2010 20120329.0
132525 CAFEPRESS INC 2011 20120329.0
132526 CAFEPRESS INC 2012 20120329.0
132527 CAFEPRESS INC 2013 20120329.0
132528 CAFEPRESS INC 2014 20120329.0
132529 CAFEPRESS INC 2015 20120329.0
132530 CAFEPRESS INC 2016 20120329.0
120049 CAI INTERNATIONAL INC 2005 20070516.0
120050 CAI INTERNATIONAL INC 2006 20070516.0
3897 CALAMP CORP 2000 NaN
3898 CALAMP CORP 2001 NaN
3896 CALAMP CORP 1999 NaN
3899 CALAMP CORP 2002 NaN
21120 CALATLANTIC GROUP INC 1995 NaN
21121 CALATLANTIC GROUP INC 1996 NaN
21122 CALATLANTIC GROUP INC 1997 NaN
21123 CALATLANTIC GROUP INC 1998 NaN
21124 CALATLANTIC GROUP INC 1999 NaN
21125 CALATLANTIC GROUP INC 2000 NaN
21126 CALATLANTIC GROUP INC 2001 NaN
21127 CALATLANTIC GROUP INC 2002 NaN
21128 CALATLANTIC GROUP INC 2003 NaN"""),
header=1)
df.set_index(['conm'], inplace=True)
df['age'] = df.groupby(level=0).apply(
    lambda x: max(x.fyear) - round(x.ipodate.iloc[0] / 10000 - 0.5))
print(df)
Results:
ID fyear ipodate age
conm
CAESARS ENTERTAINMENT 46078 2003 19891213.0 27.0
CAESARS ENTERTAINMENT 46079 2004 19891213.0 27.0
CAESARS ENTERTAINMENT 46080 2005 19891213.0 27.0
CAESARS ENTERTAINMENT 46091 2016 19891213.0 27.0
CAESARSTONE LTD 114620 2010 20120322.0 4.0
CAESARSTONE LTD 114621 2011 20120322.0 4.0
CAESARSTONE LTD 114622 2012 20120322.0 4.0
CAESARSTONE LTD 114623 2013 20120322.0 4.0
CAESARSTONE LTD 114624 2014 20120322.0 4.0
CAESARSTONE LTD 114625 2015 20120322.0 4.0
CAESARSTONE LTD 114626 2016 20120322.0 4.0
CAFEPRESS INC 132524 2010 20120329.0 4.0
CAFEPRESS INC 132525 2011 20120329.0 4.0
CAFEPRESS INC 132526 2012 20120329.0 4.0
CAFEPRESS INC 132527 2013 20120329.0 4.0
CAFEPRESS INC 132528 2014 20120329.0 4.0
CAFEPRESS INC 132529 2015 20120329.0 4.0
CAFEPRESS INC 132530 2016 20120329.0 4.0
CAI INTERNATIONAL INC 120049 2005 20070516.0 -1.0
CAI INTERNATIONAL INC 120050 2006 20070516.0 -1.0
CALAMP CORP 3897 2000 NaN NaN
CALAMP CORP 3898 2001 NaN NaN
CALAMP CORP 3896 1999 NaN NaN
CALAMP CORP 3899 2002 NaN NaN
CALATLANTIC GROUP INC 21120 1995 NaN NaN
CALATLANTIC GROUP INC 21121 1996 NaN NaN
CALATLANTIC GROUP INC 21122 1997 NaN NaN
CALATLANTIC GROUP INC 21123 1998 NaN NaN
CALATLANTIC GROUP INC 21124 1999 NaN NaN
CALATLANTIC GROUP INC 21125 2000 NaN NaN
CALATLANTIC GROUP INC 21126 2001 NaN NaN
CALATLANTIC GROUP INC 21127 2002 NaN NaN
CALATLANTIC GROUP INC 21128 2003 NaN NaN
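The dv column from part 1 isn't covered by the answer above. A minimal sketch using groupby plus pd.qcut on a made-up frame (the column names match the question; the toy data and the choice of four equal-frequency bins are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    'conm': ['A', 'B', 'C', 'D', 'E', 'F'],
    'industry': ['Services'] * 3 + ['Manufacturing'] * 3,
    'dvpayout': [0.0, 0.2, 0.5, 0.1, 0.3, 0.9],
})

# Rank within each industry first so qcut never sees duplicate bin
# edges, then cut the ranks into four equal-frequency bins.
df['dv'] = df.groupby('industry')['dvpayout'].transform(
    lambda s: pd.qcut(s.rank(method='first'), 4,
                      labels=['Q1', 'Q2', 'Q3', 'Q4']))
```

On real data with many ties (e.g. the zero payouts in the question), ranking before pd.qcut avoids the "bin edges must be unique" error.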
Related
Pandas DataFrame: Dropping rows after meeting conditions in columns
I have a large panel data set in a pandas DataFrame:

import pandas as pd
df = pd.read_csv('Qs_example_data.csv')
df.head()

       ID  Year     DOB status   YOD
   223725  1991  1975.0     No  2021
   223725  1992  1975.0     No  2021
   223725  1993  1975.0     No  2021
   223725  1994  1975.0     No  2021
   223725  1995  1975.0     No  2021

I want to drop rows based on the following condition: if the value in YOD matches the value in Year, then all rows after that matching row for that ID are dropped; the same applies once a Yes is observed in the status column for that ID. For example, ID 68084329 has the value 2012 in both the Year and YOD columns on row 221930, so all rows after 221930 for 68084329 should be dropped.

df.loc[df['ID'] == 68084329]

              ID  Year     DOB status   YOD
221910  68084329  1991  1942.0     No  2012
221911  68084329  1992  1942.0     No  2012
221912  68084329  1993  1942.0     No  2012
221913  68084329  1994  1942.0     No  2012
221914  68084329  1995  1942.0     No  2012
221915  68084329  1996  1942.0     No  2012
221916  68084329  1997  1942.0     No  2012
221917  68084329  1998  1942.0     No  2012
221918  68084329  1999  1942.0     No  2012
221919  68084329  2000  1942.0     No  2012
221920  68084329  2001  1942.0     No  2012
221921  68084329  2002  1942.0     No  2012
221922  68084329  2003  1942.0     No  2012
221923  68084329  2004  1942.0     No  2012
221924  68084329  2005  1942.0     No  2012
221925  68084329  2006  1942.0     No  2012
221926  68084329  2007  1942.0     No  2012
221927  68084329  2008  1942.0     No  2012
221928  68084329  2010  1942.0     No  2012
221929  68084329  2011  1942.0     No  2012
221930  68084329  2012  1942.0    Yes  2012
221931  68084329  2013  1942.0     No  2012
221932  68084329  2014  1942.0     No  2012
221933  68084329  2015  1942.0     No  2012
221934  68084329  2016  1942.0     No  2012
221935  68084329  2017  1942.0     No  2012

I have a lot of IDs with rows that need to be dropped according to the above condition. How do I do this?
The following code should also work:

result = df[0:0]
ids = []
for i in df.ID:
    if i not in ids:
        ids.append(i)
for k in ids:
    temp = df[df.ID == k]
    for j in range(len(temp)):
        result = pd.concat([result, temp.iloc[j:j+1, :]])
        if temp.iloc[j, :]['status'] == 'Yes':
            break
print(result)
This should do. From your wording, it wasn't clear whether you need to drop all the rows after you encounter a Yes for that ID, or just the rows where you encounter a Yes. I assumed you need to drop all the rows after you encounter a Yes for that ID.

import pandas as pd

def __get_nos__(df):
    return df.iloc[0:(df['Status'] != 'Yes').values.argmin(), :]

df = pd.DataFrame()
df['ID'] = [12345678]*10 + [13579]*10
df['Year'] = list(range(2000, 2010))*2
df['DOB'] = list(range(2000, 2010))*2
df['YOD'] = list(range(2000, 2010))*2
df['Status'] = ['No']*5 + ['Yes']*5 + ['No']*7 + ['Yes']*3

"""
df
          ID  Year   DOB   YOD Status
0   12345678  2000  2000  2000     No
1   12345678  2001  2001  2001     No
2   12345678  2002  2002  2002     No
3   12345678  2003  2003  2003     No
4   12345678  2004  2004  2004     No
5   12345678  2005  2005  2005    Yes
6   12345678  2006  2006  2006    Yes
7   12345678  2007  2007  2007    Yes
8   12345678  2008  2008  2008    Yes
9   12345678  2009  2009  2009    Yes
10     13579  2000  2000  2000     No
11     13579  2001  2001  2001     No
12     13579  2002  2002  2002     No
13     13579  2003  2003  2003     No
14     13579  2004  2004  2004     No
15     13579  2005  2005  2005     No
16     13579  2006  2006  2006     No
17     13579  2007  2007  2007    Yes
18     13579  2008  2008  2008    Yes
19     13579  2009  2009  2009    Yes
"""

df.groupby('ID').apply(lambda x: __get_nos__(x)).reset_index(drop=True)

"""
Output
          ID  Year   DOB   YOD Status
0      13579  2000  2000  2000     No
1      13579  2001  2001  2001     No
2      13579  2002  2002  2002     No
3      13579  2003  2003  2003     No
4      13579  2004  2004  2004     No
5      13579  2005  2005  2005     No
6      13579  2006  2006  2006     No
7   12345678  2000  2000  2000     No
8   12345678  2001  2001  2001     No
9   12345678  2002  2002  2002     No
10  12345678  2003  2003  2003     No
11  12345678  2004  2004  2004     No
"""
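Both answers above loop or apply per group. A vectorized sketch of the same idea on toy data (I assume the terminating row itself should be kept, as in the question's example, where row 221930 survives):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':     [1, 1, 1, 1, 2, 2, 2],
    'Year':   [2000, 2001, 2002, 2003, 2000, 2001, 2002],
    'YOD':    [2002, 2002, 2002, 2002, 2005, 2005, 2005],
    'status': ['No', 'No', 'Yes', 'No', 'No', 'No', 'No'],
})

# A row terminates its ID when Year == YOD or status == 'Yes'.
hit = (df['Year'] == df['YOD']) | (df['status'] == 'Yes')

# Cumulative hit count per ID, shifted down one row: it is still 0 on
# the terminating row itself, so that row survives the filter.
keep = hit.groupby(df['ID']).cumsum().groupby(df['ID']).shift(fill_value=0) == 0
result = df[keep]
```

Here ID 1 keeps its rows through the 'Yes' in 2002 and drops 2003, while ID 2 (no hit) is kept in full.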
Spread the "Year" column's values into individual columns to create a panel
I am trying to reshape the following dataframe into panel-data form by moving the "Year" column so that each year becomes an individual column.

Out[34]:
         Award Year    0
State
Alabama        2003   89
Alabama        2004   92
Alabama        2005  108
Alabama        2006   81
Alabama        2007   71
...             ...  ...
Wyoming        2011    4
Wyoming        2012    2
Wyoming        2013    1
Wyoming        2014    4
Wyoming        2015    3

[648 rows x 2 columns]

I want the years to each be individual columns; this is an example:

Out[48]:
        State  2003  2004  2005  2006
0     NewYork    10    10    10    10
1     Alabama    15    15    15    15
2  Washington    20    20    20    20

I have read up on stack/unstack but I don't think I want a multilevel index as a result. I have been looking through the documentation at to_frame etc. but I can't see what I am looking for. If anyone can help that would be great!
Use set_index with append=True, then select the column '0' and use unstack to reshape:

df = df.set_index('Award Year', append=True)['0'].unstack()

Result:

Award Year  2003  2004   2005  2006  2007  2011  2012  2013  2014  2015
State
Alabama     89.0  92.0  108.0  81.0  71.0   NaN   NaN   NaN   NaN   NaN
Wyoming      NaN   NaN    NaN   NaN   NaN   4.0   2.0   1.0   4.0   3.0
Pivot table can help:

df2 = pd.pivot_table(df, values='0', columns='AwardYear', index=['State'])
df2

Result:

AwardYear  2003  2004   2005  2006  2007  2011  2012  2013  2014  2015
State
Alabama    89.0  92.0  108.0  81.0  71.0   NaN   NaN   NaN   NaN   NaN
Wyoming     NaN   NaN    NaN   NaN   NaN   4.0   2.0   1.0   4.0   3.0
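For reference, a self-contained version of the reshape with made-up numbers (note that pivot_table aggregates duplicate State/Year pairs with the mean by default):

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Wyoming', 'Wyoming'],
    'Year':  [2003, 2004, 2003, 2004],
    'Award': [89, 92, 4, 2],
})

# One row per state, one column per year.
wide = df.pivot_table(values='Award', index='State', columns='Year')
print(wide)
```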
Why is it that when I use pandas to scrape a table from a website, it skips the middle columns and only prints the first 2 and last 2?
I am currently working on a program that scrapes the Yahoo Finance Earnings Calendar page and stores the data in a file. I am able to scrape the data, but I am confused as to why it only scrapes the first 2 and last 2 columns. I also tried to do the same with a table on Wikipedia (List of S&P 500 Companies) and am running into the same problem. Any help is appreciated.

Yahoo Finance code:

import csv
import pandas as pd

earnings = pd.read_html('https://finance.yahoo.com/calendar/earnings?day=2019-11-19')[0]
fileName = "testFile"
with open(fileName + ".csv", mode='w') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([earnings])
print(earnings)

Wikipedia code:

import pandas as pd

url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url)  # Returns list of all tables on page
sp500_table = tables[0]     # Select table of interest
print(sp500_table)

~EDIT~
Here is the output I get from the Yahoo Finance code:

   Symbol                                Company  ...  Reported EPS  Surprise(%)
0    WUBA                             58.com Inc  ...           NaN          NaN
1    ARMK                                Aramark  ...           NaN          NaN
2    AFMD                             Affimed NV  ...           NaN          NaN
3     NJR              New Jersey Resources Corp  ...           NaN          NaN
4    ECCB         Eagle Point Credit Company Inc  ...           NaN          NaN
5    TOUR                             Tuniu Corp  ...           NaN          NaN
6     EIC         Eagle Point Income Company Inc  ...           NaN          NaN
7     KSS                             Kohls Corp  ...           NaN          NaN
8     JKS              JinkoSolar Holding Co Ltd  ...           NaN          NaN
9      DL  China Distance Education Holdings Ltd  ...           NaN          NaN
10    TJX                      TJX Companies Inc  ...           NaN          NaN
11     HD                         Home Depot Inc  ...           NaN          NaN
12   PAGS                  PagSeguro Digital Ltd  ...           NaN          NaN
13    ESE                  ESCO Technologies Inc  ...           NaN          NaN
14   RADA         Rada Electronic Industries Ltd  ...           NaN          NaN
15   RADA         Rada Electronic Industries Ltd  ...           NaN          NaN
16   DAVA                             Endava PLC  ...           NaN          NaN
17   FALC                FalconStor Software Inc  ...           NaN          NaN
18    GVP                        GSE Systems Inc  ...           NaN          NaN
19    TDG                    TransDigm Group Inc  ...           NaN          NaN
20   PPDF                        PPDAI Group Inc  ...           NaN          NaN
21   GRBX                           Greenbox Pos  ...           NaN          NaN
22   THMO             Thermogenesis Holdings Inc  ...           NaN          NaN
23    MMS                            Maximus Inc  ...           NaN          NaN
24   NXTD                             NXT-ID Inc  ...           NaN          NaN
25   URBN                   Urban Outfitters Inc  ...           NaN          NaN
26   SINT                 SINTX Technologies Inc  ...           NaN          NaN
27   ORNC                             Oranco Inc  ...           NaN          NaN
28   LAIX                               LAIX Inc  ...           NaN          NaN
29    MDT                          Medtronic PLC  ...           NaN          NaN

[30 rows x 6 columns]

Here is the output I get from the Wikipedia code:

    Symbol                         Security  ...      CIK      Founded
0      MMM                       3M Company  ...    66740         1902
1      ABT              Abbott Laboratories  ...     1800         1888
2     ABBV                      AbbVie Inc.  ...  1551152  2013 (1888)
3     ABMD                      ABIOMED Inc  ...   815094         1981
4      ACN                    Accenture plc  ...  1467373         1989
5     ATVI              Activision Blizzard  ...   718877         2008
6     ADBE                Adobe Systems Inc  ...   796343         1982
7      AMD       Advanced Micro Devices Inc  ...     2488         1969
8      AAP               Advance Auto Parts  ...  1158449         1932
9      AES                         AES Corp  ...   874761         1981
10     AMG    Affiliated Managers Group Inc  ...  1004434         1993
11     AFL                        AFLAC Inc  ...     4977         1955
12       A         Agilent Technologies Inc  ...  1090872         1999
13     APD     Air Products & Chemicals Inc  ...     2969         1940
14    AKAM          Akamai Technologies Inc  ...  1086222         1998
15     ALK             Alaska Air Group Inc  ...   766421         1985
16     ALB                   Albemarle Corp  ...   915913         1994
17     ARE  Alexandria Real Estate Equities  ...  1035443         1994
18    ALXN          Alexion Pharmaceuticals  ...   899866         1992
19    ALGN                 Align Technology  ...  1097149         1997
20    ALLE                         Allegion  ...  1579241         1908
21     AGN                    Allergan, Plc  ...  1578845         1983
22     ADS            Alliance Data Systems  ...  1101215         1996
23     LNT              Alliant Energy Corp  ...   352541         1917
24     ALL                    Allstate Corp  ...   899051         1931
25   GOOGL             Alphabet Inc Class A  ...  1652044         1998
26    GOOG             Alphabet Inc Class C  ...  1652044         1998
27      MO                 Altria Group Inc  ...   764180         1985
28    AMZN                  Amazon.com Inc.  ...  1018724         1994
29    AMCR                        Amcor plc  ...  1748790          NaN
..     ...                              ...  ...      ...          ...
475   VIAB                      Viacom Inc.  ...  1339947          NaN
476      V                        Visa Inc.  ...  1403161          NaN
477    VNO             Vornado Realty Trust  ...   899689          NaN
478    VMC                 Vulcan Materials  ...  1396009          NaN
479    WAB               Wabtec Corporation  ...   943452          NaN
480    WMT                          Walmart  ...   104169          NaN
481    WBA         Walgreens Boots Alliance  ...  1618921          NaN
482    DIS          The Walt Disney Company  ...  1001039          NaN
483     WM            Waste Management Inc.  ...   823768         1968
484    WAT               Waters Corporation  ...  1000697         1958
485    WEC             Wec Energy Group Inc  ...   783325          NaN
486    WCG                         WellCare  ...  1279363          NaN
487    WFC                      Wells Fargo  ...    72971          NaN
488   WELL                   Welltower Inc.  ...   766704          NaN
489    WDC                  Western Digital  ...   106040          NaN
490     WU                 Western Union Co  ...  1365135         1851
491    WRK                         WestRock  ...  1636023          NaN
492     WY                     Weyerhaeuser  ...   106535          NaN
493    WHR                  Whirlpool Corp.  ...   106640         1911
494    WMB                    Williams Cos.  ...   107263          NaN
495   WLTW             Willis Towers Watson  ...  1140536          NaN
496   WYNN                 Wynn Resorts Ltd  ...  1174922          NaN
497    XEL                  Xcel Energy Inc  ...    72903         1909
498    XRX                            Xerox  ...   108772         1906
499   XLNX                           Xilinx  ...   743988          NaN
500    XYL                       Xylem Inc.  ...  1524472          NaN
501    YUM                  Yum! Brands Inc  ...  1041061          NaN
502    ZBH           Zimmer Biomet Holdings  ...  1136869          NaN
503   ZION                    Zions Bancorp  ...   109380          NaN
504    ZTS                           Zoetis  ...  1555280          NaN

[505 rows x 9 columns]

As you can see, in both examples the table conveniently omits the columns in the middle and only displays the first and last 2.

~EDIT#2~
Making this change to the code now displays all columns, but it does so in two separate tables instead. Any idea as to why it does this?
fileName = "yahooFinance_Pandas"
with pd.option_context('display.max_columns', None):  # more options can be specified also
    with open(fileName + ".csv", mode='w') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow([earnings])

OUTPUT

   Symbol                                Company    Earnings Call Time  \
0    WUBA                             58.com Inc    Before Market Open
1    ARMK                                Aramark    Before Market Open
2    AFMD                             Affimed NV                   TAS
3     NJR              New Jersey Resources Corp    Before Market Open
4    ECCB         Eagle Point Credit Company Inc    Before Market Open
5    TOUR                             Tuniu Corp    Before Market Open
6     EIC         Eagle Point Income Company Inc    Before Market Open
7     KSS                             Kohls Corp    Before Market Open
8     JKS              JinkoSolar Holding Co Ltd    Before Market Open
9      DL  China Distance Education Holdings Ltd    After Market Close
10    TJX                      TJX Companies Inc    Before Market Open
11     HD                         Home Depot Inc    Before Market Open
12   PAGS                  PagSeguro Digital Ltd                   TAS
13    ESE                  ESCO Technologies Inc    After Market Close
14   RADA         Rada Electronic Industries Ltd                   TAS
15   RADA         Rada Electronic Industries Ltd    Before Market Open
16   DAVA                             Endava PLC                   TAS
17   FALC                FalconStor Software Inc    After Market Close
18    GVP                        GSE Systems Inc                   TAS
19    TDG                    TransDigm Group Inc    Before Market Open
20   PPDF                        PPDAI Group Inc    Before Market Open
21   GRBX                           Greenbox Pos     Time Not Supplied
22   THMO             Thermogenesis Holdings Inc    After Market Close
23    MMS                            Maximus Inc                   TAS
24   NXTD                             NXT-ID Inc                   TAS
25   URBN                   Urban Outfitters Inc    After Market Close
26   SINT                 SINTX Technologies Inc     Time Not Supplied
27   ORNC                             Oranco Inc     Time Not Supplied
28   LAIX                               LAIX Inc    After Market Close
29    MDT                          Medtronic PLC                   TAS

    EPS Estimate  Reported EPS  Surprise(%)
0           0.82           NaN          NaN
1           0.69           NaN          NaN
2          -0.17           NaN          NaN
3           0.28           NaN          NaN
4            NaN           NaN          NaN
5            NaN           NaN          NaN
6            NaN           NaN          NaN
7           0.86           NaN          NaN
8           0.83           NaN          NaN
9           0.33           NaN          NaN
10          0.66           NaN          NaN
11          2.52           NaN          NaN
12          0.29           NaN          NaN
13          1.06           NaN          NaN
14         -0.02           NaN          NaN
15         -0.02           NaN          NaN
16         21.21           NaN          NaN
17           NaN           NaN          NaN
18          0.03           NaN          NaN
19          5.16           NaN          NaN
20          0.26           NaN          NaN
21           NaN           NaN          NaN
22         -0.12           NaN          NaN
23          0.94           NaN          NaN
24           NaN           NaN          NaN
25          0.57           NaN          NaN
26           NaN           NaN          NaN
27           NaN           NaN          NaN
28         -0.32           NaN          NaN
29          1.28           NaN          NaN

~EDIT#3~
Made this change as you requested #Alex

earnings.to_csv(r'C:\Users\akkir\Desktop\pythonSelenium\export_dataframe.csv', index=None)

OUTPUT

Symbol,Company,Earnings Call Time,EPS Estimate,Reported EPS,Surprise(%)
ATTO,Atento SA,TAS,0.09,0.03,-66.67
ALPN,Alpine Immune Sciences Inc,TAS,-0.68,-0.62,8.82
ALPN,Alpine Immune Sciences Inc,Time Not Supplied,-0.68,-0.62,8.82
HOLI,Hollysys Automation Technologies Ltd,TAS,0.48,0.49,2.08
IDSA,Industrial Services of America Inc,After Market Close,,,
AGRO,Adecoagro SA,TAS,-0.01,,
ATOS,Atossa Genetics Inc,TAS,-0.52,-0.36,30.77
AXAS,Abraxas Petroleum Corp,TAS,0.03,0.02,-33.33
ACIU,AC Immune SA,TAS,0.17,0.25,47.06
ARCO,Arcos Dorados Holdings Inc,TAS,0.08,0.13,62.5
WTER,Alkaline Water Company Inc,Time Not Supplied,-0.07,-0.07,
ALNA,Allena Pharmaceuticals Inc,Before Market Open,-0.49,-0.57,-16.33
AEYE,AudioEye Inc,TAS,-0.26,-0.27,-3.85
APLT,Applied Therapeutics Inc,Before Market Open,-0.49,-0.63,-28.57
ALT,Altimmune Inc,TAS,-0.19,-0.73,-284.21
ABEOW,Abeona Therapeutics Inc,TAS,,,
ACER,Acer Therapeutics Inc,After Market Close,-0.57,-0.52,8.77
SRNN,Southern Banc Company Inc,Time Not Supplied,,,
SPB,Spectrum Brands Holdings Inc,Before Market Open,1.11,1.13,1.8
BIOC,Biocept Inc,TAS,-0.27,-0.25,7.41
IDXG,Interpace Biosciences Inc,TAS,-0.19,-0.19,
GTBP,GT Biopharma Inc,After Market Close,,,
MTNB,Matinas BioPharma Holdings Inc,Time Not Supplied,-0.03,-0.03,
MTNB,Matinas BioPharma Holdings Inc,TAS,-0.03,-0.03,
XELB,Xcel Brands Inc,After Market Close,0.12,0.06,-50.0
BBI,Brickell Biotech Inc,After Market Close,,,
SNBP,Sun Biopharma Inc,Before Market Open,,,
BZH,Beazer Homes USA Inc,TAS,0.51,0.08,-84.31
SELB,Selecta Biosciences Inc,TAS,-0.33,-0.26,21.21
BEST,BEST Inc,Before Market Open,,0.01,
CBPO,China Biologic Products Holdings Inc,TAS,0.88,1.4,59.09
TPCS,TechPrecision Corp,TAS,,,
LK,Luckin Coffee Inc,Before Market Open,-0.37,-0.32,13.51
CYD,China Yuchai International Ltd,Before Market Open,0.45,0.17,-62.22
CCF,Chase Corp,After Market Close,,,
SMCI,Super Micro Computer Inc,After Market Close,,,
AUMN,Golden Minerals Co,TAS,,,
PGR,Progressive Corp,Before Market Open,1.3,1.33,2.31
PUMP,ProPetro Holding Corp,TAS,0.51,0.33,-35.29
CPLG,CorePoint Lodging Inc,TAS,-0.44,-0.22,50.0
CHNG,Change Healthcare Inc,After Market Close,0.27,0.27,
NOVC,Novation Companies Inc,Time Not Supplied,,,
WFCF,Where Food Comes From Inc,Before Market Open,,,
CYCCP,Cyclacel Pharmaceuticals Inc,After Market Close,,,
ISCO,International Stem Cell Corp,Before Market Open,,,
CPA,Copa Holdings SA,TAS,2.23,2.45,9.87
CSCO,Cisco Systems Inc,TAS,0.81,0.84,3.7
GMDA,Gamida Cell Ltd,TAS,-0.36,-0.3,16.67
CHRA,Charah Solutions Inc,TAS,-0.05,-0.11,-120.0
MNI,McClatchy Co,TAS,-1.01,-0.16,84.16
ENSV,Enservco Corp,TAS,-0.06,-0.1,-66.67
TK,Teekay Corp,TAS,,,
SANW,S&W Seed Co,TAS,-0.15,-0.15,
SANW,S&W Seed Co,Before Market Open,-0.15,-0.15,
CMCM,Cheetah Mobile Inc,TAS,0.14,0.49,250.0
CYRN,Cyren Ltd,TAS,-0.07,-0.06,14.29
CATS,Catasys Inc,TAS,-0.32,-0.52,-62.5
GLAD,Gladstone Capital Corp,TAS,0.21,0.21,
PING,Ping Identity Holding Corp,After Market Close,0.01,0.13,1200.0
CRWS,Crown Crafts Inc,Before Market Open,0.18,0.18,
CTRP,Ctrip.Com International Ltd,After Market Close,0.29,,
GFF,Griffon Corp,After Market Close,0.33,0.4,21.21
CLIR,Clearsign Technologies Corp,After Market Close,,,
DMAC,DiaMedica Therapeutics Inc,After Market Close,,,
DSSI,Diamond S Shipping Inc,Time Not Supplied,-0.12,-0.19,-58.33
DSSI,Diamond S Shipping Inc,TAS,-0.12,-0.19,-58.33
DYAI,Dyadic International Inc,After Market Close,,,
ONE,OneSmart International Education Group Ltd,Before Market Open,,,
EFOI,Energy Focus Inc,Before Market Open,-0.15,-0.08,46.67
EDAP,Edap Tms SA,TAS,0.04,0.03,-25.0
EYEN,Eyenovia Inc,Before Market Open,-0.34,-0.29,14.71
EQS,EQUUS Total Return Inc,After Market Close,,,
SENR,Strategic Environmental & Energy Resources Inc,Before Market Open,,,
EPSN,Epsilon Energy Ltd,TAS,,,
GRMM,Grom Social Enterprises Inc,Before Market Open,,,
ECOR,"electroCore, Inc.",TAS,-0.31,-0.36,-16.13
SD,SandRidge Energy Inc,TAS,,,
ENR,Energizer Holdings Inc,TAS,0.81,0.93,14.81
ELMD,Electromed Inc,TAS,0.01,0.12,1100.0
EVK,Ever-Glory International Group Inc,TAS,,,
FTEK,Fuel Tech Inc,After Market Close,-0.03,-0.05,-66.67
FVRR,Fiverr International Ltd,Before Market Open,-0.19,-0.12,36.84
SGRP,SPAR Group Inc,TAS,,,
NSEC,National Security Group Inc,Time Not Supplied,,,
SNDL,Sundial Growers Inc,TAS,-0.08,,
SNDL,Sundial Growers Inc,Before Market Open,-0.08,,
TCOM,Trip.com Group Ltd,TAS,,,
RAVE,Rave Restaurant Group Inc,TAS,,,
SLGG,Super League Gaming Inc,After Market Close,-0.36,-0.43,-19.44
HI,Hillenbrand Inc,After Market Close,0.73,0.76,4.11
HROW,Harrow Health Inc,TAS,-0.24,-0.29,-20.83
NVGS,Navigator Holdings Ltd,TAS,-0.07,-0.01,85.71
INFU,InfuSystem Holdings Inc,Before Market Open,,,
OSW,OneSpaWorld Holdings Ltd,Before Market Open,0.12,0.11,-8.33
VIPS,Vipshop Holdings Ltd,TAS,0.17,0.25,47.06
PRTH,Priority Technology Holdings Inc,After Market Close,-0.12,-0.08,33.33
TGC,Tengasco Inc,TAS,,,
PRSP,Perspecta Inc,After Market Close,0.51,0.54,5.88
REED,Reed's Inc,After Market Close,-0.11,-0.14,-27.27
WSTL,Westell Technologies Inc,After Market Close,,,
As far as I can tell, this has nothing to do with the data and everything to do with the representation. Only the first and last two columns are printed so as to keep the output from being massive and difficult to read. You can even see at the end of your output that your DataFrame has 9 columns. Take a look here if you want to print the entire thing. You could also use .info() to get some general information on your columns.
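For completeness, the display settings involved look like this (the option names are real pandas options; the frame is made up):

```python
import pandas as pd

# pandas truncates wide frames to the first/last few columns when
# printing; raising these display options shows everything.
pd.set_option('display.max_columns', None)  # no cap on columns shown
pd.set_option('display.width', 1000)        # don't wrap at the default width

df = pd.DataFrame([list(range(30))])
print(df)  # all 30 columns appear, with no "..." placeholder
```

Note these options only affect printing; methods like to_csv always write every column regardless.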
Thanks to #AlexanderCécile for the help regarding this issue. For those interested in how he fixed my issue, the code is below.

import pandas as pd
from datetime import date

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

earnings = pd.read_html('https://finance.yahoo.com/calendar/earnings?day=2019-11-13')[0]
earnings.to_csv(r'C:\Users\<user>\Desktop\earnings_{}.csv'.format(date.today()), index=None)
pandas DataFrame .stack(dropna=False) but keeping existing combinations of levels
My data looks like this:

import numpy as np
import pandas as pd

# My Data
enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
n_students = [[100, 100, 110, 110, np.nan]]
df = pd.DataFrame(
    n_students,
    columns=pd.MultiIndex.from_arrays(
        [enroll_year, grad_year], names=['enroll_year', 'grad_year']))
print(df)
# enroll_year  2010  2011  2012  2013  2014
# grad_year    2014  2015  2016  2017  2018
# 0             100   100   110   110   NaN

What I am trying to do is to stack the data, with one column/index level for year of enrollment, one for year of graduation and one for the number of students, which should look like:

# enroll_year  grad_year      n
#        2010       2014  100.0
#           .          .      .
#           .          .      .
#           .          .      .
#        2014       2018    NaN

The data produced by .stack() is very close, but the missing record(s) is dropped:

df1 = df.stack(['enroll_year', 'grad_year'])
df1.index = df1.index.droplevel(0)
print(df1)
# enroll_year  grad_year
# 2010         2014         100.0
# 2011         2015         100.0
# 2012         2016         110.0
# 2013         2017         110.0
# dtype: float64

So .stack(dropna=False) is tried, but it expands the index levels to all combinations of enrollment and graduation years:

df2 = df.stack(['enroll_year', 'grad_year'], dropna=False)
df2.index = df2.index.droplevel(0)
print(df2)
# enroll_year  grad_year
# 2010         2014         100.0
#              2015           NaN
#              2016           NaN
#              2017           NaN
#              2018           NaN
# 2011         2014           NaN
#              2015         100.0
#              2016           NaN
#              2017           NaN
#              2018           NaN
# 2012         2014           NaN
#              2015           NaN
#              2016         110.0
#              2017           NaN
#              2018           NaN
# 2013         2014           NaN
#              2015           NaN
#              2016           NaN
#              2017         110.0
#              2018           NaN
# 2014         2014           NaN
#              2015           NaN
#              2016           NaN
#              2017           NaN
#              2018           NaN
# dtype: float64

And I need to subset df2 to get my desired data set.
existing_combn = list(zip(
    df.columns.levels[0][df.columns.labels[0]],
    df.columns.levels[1][df.columns.labels[1]]))

df3 = df2.loc[existing_combn]
print(df3)
# enroll_year  grad_year
# 2010         2014         100.0
# 2011         2015         100.0
# 2012         2016         110.0
# 2013         2017         110.0
# 2014         2018           NaN
# dtype: float64

Although it only adds a few extra lines to my code, I wonder if there are any better and neater approaches.
Use unstack with pd.DataFrame, then reset_index, drop the unnecessary column and rename column 0 to 'n':

pd.DataFrame(df.unstack()).reset_index().drop('level_2', axis=1).rename(columns={0: 'n'})

   enroll_year  grad_year      n
0         2010       2014  100.0
1         2011       2015  100.0
2         2012       2016  110.0
3         2013       2017  110.0
4         2014       2018    NaN

Or:

df.unstack().reset_index(level=2, drop=True)

enroll_year  grad_year
2010         2014         100.0
2011         2015         100.0
2012         2016         110.0
2013         2017         110.0
2014         2018           NaN
dtype: float64

Or:

df.unstack().reset_index(level=2, drop=True).reset_index().rename(columns={0: 'n'})

   enroll_year  grad_year      n
0         2010       2014  100.0
1         2011       2015  100.0
2         2012       2016  110.0
3         2013       2017  110.0
4         2014       2018    NaN

Explanation:

print(pd.DataFrame(df.unstack()))

                             0
enroll_year grad_year
2010        2014       0  100.0
2011        2015       0  100.0
2012        2016       0  110.0
2013        2017       0  110.0
2014        2018       0    NaN

print(pd.DataFrame(df.unstack()).reset_index().drop('level_2', axis=1))

   enroll_year  grad_year      0
0         2010       2014  100.0
1         2011       2015  100.0
2         2012       2016  110.0
3         2013       2017  110.0
4         2014       2018    NaN

print(pd.DataFrame(df.unstack()).reset_index().drop('level_2', axis=1).rename(columns={0: 'n'}))

   enroll_year  grad_year      n
0         2010       2014  100.0
1         2011       2015  100.0
2         2012       2016  110.0
3         2013       2017  110.0
4         2014       2018    NaN
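Since the example frame has exactly one row, here is another sketch (an alternative I am adding, not taken from the answers above): selecting that row gives a Series indexed by the existing (enroll_year, grad_year) pairs only, with the NaN kept and no cross product ever built:

```python
import numpy as np
import pandas as pd

enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
df = pd.DataFrame(
    [[100, 100, 110, 110, np.nan]],
    columns=pd.MultiIndex.from_arrays(
        [enroll_year, grad_year], names=['enroll_year', 'grad_year']))

# The single row is already a Series over the existing column pairs;
# name it 'n' and turn the MultiIndex levels into columns.
out = df.iloc[0].rename('n').reset_index()
```

This only works for the one-row case shown in the question; with multiple rows the unstack-based answers apply.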
pandas dataframe: add columns with a function by year and company
I have a pandas dataframe called 'sm', like below but with many more lines, and I want to calculate the sales growth rate by company and year and insert it into a new column named 'sg'.

        fyear        ch                         conm       sale     ipodate
0        1996    51.705                     AAR CORP    589.328         NaN
1        1997    17.222                     AAR CORP    782.123         NaN
2        1998     8.250                     AAR CORP    918.036         NaN
3        1999     1.241                     AAR CORP   1024.333         NaN
4        2000    13.809                     AAR CORP    874.255         NaN
5        2001    34.522                     AAR CORP    638.721         NaN
6        2002    29.154                     AAR CORP    606.337         NaN
7        2003    41.010                     AAR CORP    651.958         NaN
8        2004    40.508                     AAR CORP    747.848         NaN
9        2005   121.738                     AAR CORP    897.284         NaN
10       2006    83.317                     AAR CORP   1061.169         NaN
11       2007   109.391                     AAR CORP   1384.919         NaN
12       2008   112.505                     AAR CORP   1423.976         NaN
13       2009    79.370                     AAR CORP   1352.151         NaN
14       2010    57.433                     AAR CORP   1775.782         NaN
15       2011    67.720                     AAR CORP   2074.498         NaN
16       2012    75.300                     AAR CORP   2167.100         NaN
17       2013    89.200                     AAR CORP   2035.000         NaN
18       2014    54.700                     AAR CORP   1594.300         NaN
19       2015    31.200                     AAR CORP   1662.600         NaN
20       1997    64.000  AMERICAN AIRLINES GROUP INC  18570.000         NaN
21       1998    95.000  AMERICAN AIRLINES GROUP INC  19205.000         NaN
22       1999    85.000  AMERICAN AIRLINES GROUP INC  17730.000         NaN
23       2000    89.000  AMERICAN AIRLINES GROUP INC  19703.000         NaN
24       2001   120.000  AMERICAN AIRLINES GROUP INC  18963.000         NaN
115466   2014   290.500                 ALLEGION PLC   2118.300         NaN
115467   2015   199.700                 ALLEGION PLC   2068.100         NaN
115468   2016   312.400                 ALLEGION PLC   2238.000         NaN
115470   2013     2.063           AGILITY HEALTH INC     63.052         NaN
115471   2014     1.301           AGILITY HEALTH INC     62.105         NaN
115472   2015     1.307           AGILITY HEALTH INC     62.328         NaN
115473   2013   109.819     NORDIC AMERICAN OFFSHORE        NaN         NaN
115474   2014    46.398     NORDIC AMERICAN OFFSHORE     52.789         NaN
115475   2015     5.339     NORDIC AMERICAN OFFSHORE     36.372         NaN
115476   2016     2.953     NORDIC AMERICAN OFFSHORE     16.249         NaN
115477   2011     2.040               DORIAN LPG LTD     34.571  20140508.0
115478   2012     1.042               DORIAN LPG LTD     38.662  20140508.0
115479   2013   279.132               DORIAN LPG LTD     29.634  20140508.0
115480   2014   204.821               DORIAN LPG LTD    104.129  20140508.0
115481   2015    46.412               DORIAN LPG LTD    289.208  20140508.0
115482   2013   948.684              NOMAD FOODS LTD   2074.842         NaN
115483   2014   855.541              NOMAD FOODS LTD   1816.239         NaN
115484   2015   671.846              NOMAD FOODS LTD    971.013         NaN
115485   2016   347.688              NOMAD FOODS LTD   2034.109         NaN
115487   2014  2638.000           ATHENE HOLDING LTD   4100.000  20161209.0
115488   2015  2720.000           ATHENE HOLDING LTD   2616.000  20161209.0
115489   2016  2459.000           ATHENE HOLDING LTD   4107.000  20161209.0
115490   2013     3.956          MIDATECH PHARMA PLC      0.244         NaN
115491   2014    47.240          MIDATECH PHARMA PLC      0.245         NaN
115492   2015    23.852          MIDATECH PHARMA PLC      2.028         NaN
115493   2016    21.723          MIDATECH PHARMA PLC      8.541         NaN

I implemented something like this:

d = sm.loc[sm['conm'] == 'AAR CORP']
dt = d.loc[d.fyear == 1996, 'sale'].values[0]
dtp1 = d.loc[d.fyear == 1997, 'sale'].values[0]
sg = (dtp1 - dt) / dt * 100
d.loc[d.fyear == 1997, 'sg'] = sg

and it gives me the column at the end:

   fyear      ch      conm     sale  ipodate        sg
0   1996  51.705  AAR CORP  589.328      NaN       NaN
1   1997  17.222  AAR CORP  782.123      NaN  32.71438
2   1998   8.250  AAR CORP  918.036      NaN       NaN

I want the 'sg' column to be next to the 'sale' column, and I want to calculate this sales growth rate for every company for every year (1996-2015) and insert it into the row for the given year t. Right now I slice by company name into a small data frame and then calculate the sales growth rate, but since I have more than 9,000 unique company names, my current method seems inefficient. Can I do this without slicing by all the company names? Thank you in advance.
Take a look at pct_change. Compute it per company, so the first year of one company does not pick up the previous company's sales:

df['sg'] = df.groupby('conm')['sale'].pct_change() * 100
# sort df first
df.sort_values(by=['conm', 'fyear'], inplace=True)

# group by company, compute the rolling increase for each company,
# then insert it right after the 'sale' column
df.insert(df.columns.tolist().index('sale') + 1, 'sm',
          df.groupby(by=['conm'])['sale']
            .apply(lambda x: x.rolling(2).apply(lambda y: (y[1] - y[0]) / y[0] * 100)))

df
Out[296]:
    fyear       ch                         conm       sale         sm  ipodate
0    1996   51.705                     AAR CORP    589.328        NaN      NaN
1    1997   17.222                     AAR CORP    782.123  32.714380      NaN
2    1998    8.250                     AAR CORP    918.036  17.377446      NaN
3    1999    1.241                     AAR CORP   1024.333  11.578740      NaN
4    2000   13.809                     AAR CORP    874.255 -14.651290      NaN
5    2001   34.522                     AAR CORP    638.721 -26.941110      NaN
6    2002   29.154                     AAR CORP    606.337  -5.070132      NaN
7    2003   41.010                     AAR CORP    651.958   7.524034      NaN
8    2004   40.508                     AAR CORP    747.848  14.708003      NaN
9    2005  121.738                     AAR CORP    897.284  19.982135      NaN
10   2006   83.317                     AAR CORP   1061.169  18.264563      NaN
11   2007  109.391                     AAR CORP   1384.919  30.508807      NaN
12   2008  112.505                     AAR CORP   1423.976   2.820165      NaN
13   2009   79.370                     AAR CORP   1352.151  -5.043975      NaN
14   2010   57.433                     AAR CORP   1775.782  31.330155      NaN
15   2011   67.720                     AAR CORP   2074.498  16.821659      NaN
16   2012   75.300                     AAR CORP   2167.100   4.463827      NaN
17   2013   89.200                     AAR CORP   2035.000  -6.095704      NaN
18   2014   54.700                     AAR CORP   1594.300 -21.656020      NaN
19   2015   31.200                     AAR CORP   1662.600   4.284012      NaN
20   1997   64.000  AMERICAN AIRLINES GROUP INC  18570.000        NaN      NaN
21   1998   95.000  AMERICAN AIRLINES GROUP INC  19205.000   3.419494      NaN
22   1999   85.000  AMERICAN AIRLINES GROUP INC  17730.000  -7.680292      NaN
23   2000   89.000  AMERICAN AIRLINES GROUP INC  19703.000  11.128032      NaN
24   2001  120.000  AMERICAN AIRLINES GROUP INC  18963.000  -3.755773      NaN
Another possible solution (since it does not use apply(), it is potentially faster):

df['sg'] = (df.sort_values(['conm', 'fyear'])
              .groupby('conm')['sale']
              .diff()
              .shift(-1) / df['sale']).shift() * 100

This solution assumes that there is always a 1-year difference between consecutive years.
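The per-company growth rate can also be written with groupby plus pct_change, with insert placing 'sg' right after 'sale' as the question asks. A runnable sketch on a two-company excerpt (the 'OTHER CO' rows and numbers are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'fyear': [1996, 1997, 1998, 1997, 1998],
    'conm': ['AAR CORP'] * 3 + ['OTHER CO'] * 2,
    'sale': [589.328, 782.123, 918.036, 100.0, 110.0],
})

# Growth is computed within each company, so the first year of a
# company is NaN instead of comparing against the previous company.
df = df.sort_values(['conm', 'fyear'])
df.insert(df.columns.get_loc('sale') + 1, 'sg',
          df.groupby('conm')['sale'].pct_change() * 100)
```

For AAR CORP in 1997 this reproduces the 32.71438 figure from the question.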