pandas add columns conditions with groupby and on another column values - python

I have a pandas DataFrame called companysubset like the one below, but the actual data is much longer.
conm fyear dvpayout industry firmycount ipodate
46078 CAESARS ENTERTAINMENT CORP 2003 0.226813 Services 22 19891213.0
46079 CAESARS ENTERTAINMENT CORP 2004 0.226813 Services 22 19891213.0
46080 CAESARS ENTERTAINMENT CORP 2005 0.226813 Services 22 19891213.0
46091 CAESARS ENTERTAINMENT CORP 2016 0.226813 Services 22 19891213.0
114620 CAESARSTONE LTD 2010 0.487543 Manufacturing 10 20120322.0
114621 CAESARSTONE LTD 2011 0.487543 Manufacturing 10 20120322.0
114622 CAESARSTONE LTD 2012 0.487543 Manufacturing 10 20120322.0
114623 CAESARSTONE LTD 2013 0.487543 Manufacturing 10 20120322.0
114624 CAESARSTONE LTD 2014 0.487543 Manufacturing 10 20120322.0
114625 CAESARSTONE LTD 2015 0.487543 Manufacturing 10 20120322.0
114626 CAESARSTONE LTD 2016 0.487543 Manufacturing 10 20120322.0
132524 CAFEPRESS INC 2010 0.000000 Retail Trade 7 20120329.0
132525 CAFEPRESS INC 2011 0.000000 Retail Trade 7 20120329.0
132526 CAFEPRESS INC 2012 -0.000000 Retail Trade 7 20120329.0
132527 CAFEPRESS INC 2013 -0.000000 Retail Trade 7 20120329.0
132528 CAFEPRESS INC 2014 -0.000000 Retail Trade 7 20120329.0
132529 CAFEPRESS INC 2015 -0.000000 Retail Trade 7 20120329.0
132530 CAFEPRESS INC 2016 -0.000000 Retail Trade 7 20120329.0
120049 CAI INTERNATIONAL INC 2005 0.000000 Services 12 20070516.0
120050 CAI INTERNATIONAL INC 2006 0.000000 Services 12 20070516.0
3896 CALAMP CORP 1999 -0.000000 Manufacturing 23 NaN
3897 CALAMP CORP 2000 0.000000 Manufacturing 23 NaN
3898 CALAMP CORP 2001 0.000000 Manufacturing 23 NaN
3899 CALAMP CORP 2002 0.000000 Manufacturing 23 NaN
21120 CALATLANTIC GROUP INC 1995 -0.133648 Construction 22 NaN
21121 CALATLANTIC GROUP INC 1996 -0.133648 Construction 22 NaN
21122 CALATLANTIC GROUP INC 1997 -0.133648 Construction 22 NaN
21123 CALATLANTIC GROUP INC 1998 -0.133648 Construction 22 NaN
21124 CALATLANTIC GROUP INC 1999 -0.133648 Construction 22 NaN
21125 CALATLANTIC GROUP INC 2000 -0.133648 Construction 22 NaN
21126 CALATLANTIC GROUP INC 2001 -0.133648 Construction 22 NaN
21127 CALATLANTIC GROUP INC 2002 -0.133648 Construction 22 NaN
21128 CALATLANTIC GROUP INC 2003 -0.133648 Construction 22 NaN
1) I want to calculate the quartile of dvpayout by industry and add a column called dv indicating whether the value falls in Q1, Q2, Q3 or Q4.
I came up with this code, but it does not work.
pd.cut(companysubset['dvpayout'].mean(), bins=[0,25,75,100], labels=False)
2) I want to add a column called age when there is an ipodate. The value should be the largest fyear minus the year part of ipodate. (ex. 2016 - 1989 for CAESARS ENTERTAINMENT CORP)
The result DataFrame I want to see is like below.
conm fyear dvpayout industry firmycount ipodate dv age
46078 CAESARS ... 2003 0.226813 Services 22 19891213.0 Q2 27
46079 CAESARS ... 2004 0.226813 Services 22 19891213.0 Q2 27
46080 CAESARS ... 2005 0.226813 Services 22 19891213.0 Q2 27
46091 CAESARS ... 2016 0.226813 Services 22 19891213.0 Q2 27
114620 CAESARSTONE LTD 2010 0.487543 Manufacturing 10 20120322.0 Q3 4
114621 CAESARSTONE LTD 2011 0.487543 Manufacturing 10 20120322.0 Q3 4
114622 CAESARSTONE LTD 2012 0.487543 Manufacturing 10 20120322.0 Q3 4
114623 CAESARSTONE LTD 2013 0.487543 Manufacturing 10 20120322.0 Q3 4
114624 CAESARSTONE LTD 2014 0.487543 Manufacturing 10 20120322.0 Q3 4
114625 CAESARSTONE LTD 2015 0.487543 Manufacturing 10 20120322.0 Q3 4
114626 CAESARSTONE LTD 2016 0.487543 Manufacturing 10 20120322.0 Q3 4
132524 CAFEPRESS INC 2010 0.000000 Retail Trade 7 20120329.0 Q1 4
132525 CAFEPRESS INC 2011 0.000000 Retail Trade 7 20120329.0 Q1 4
132526 CAFEPRESS INC 2012 -0.000000 Retail Trade 7 20120329.0 Q1 4
132527 CAFEPRESS INC 2013 -0.000000 Retail Trade 7 20120329.0 Q1 4
132528 CAFEPRESS INC 2014 -0.000000 Retail Trade 7 20120329.0 Q1 4
132529 CAFEPRESS INC 2015 -0.000000 Retail Trade 7 20120329.0 Q1 4
132530 CAFEPRESS INC 2016 -0.000000 Retail Trade 7 20120329.0 Q1 4
120049 CAI INTERNATIONAL INC 2006 0.000000 Services 12 20070516.0 Q1 0
120050 CAI INTERNATIONAL INC 2007 0.000000 Services 12 20070516.0 Q1 0
3896 CALAMP CORP 1999 -0.000000 Manufacturing 23 NaN Q1 NaN
3897 CALAMP CORP 2000 0.000000 Manufacturing 23 NaN Q1 NaN
3898 CALAMP CORP 2001 0.000000 Manufacturing 23 NaN Q1 NaN
3899 CALAMP CORP 2002 0.000000 Manufacturing 23 NaN Q1 NaN
21120 CALATLANTIC GROUP INC 1995 -0.133648 Construction 22 NaN Q1 NaN
21121 CALATLANTIC GROUP INC 1996 -0.133648 Construction 22 NaN Q1 NaN
21122 CALATLANTIC GROUP INC 1997 -0.133648 Construction 22 NaN Q1 NaN
21123 CALATLANTIC GROUP INC 1998 -0.133648 Construction 22 NaN Q1 NaN
21124 CALATLANTIC GROUP INC 1999 -0.133648 Construction 22 NaN Q1 NaN
21125 CALATLANTIC GROUP INC 2000 -0.133648 Construction 22 NaN Q1 NaN
21126 CALATLANTIC GROUP INC 2001 -0.133648 Construction 22 NaN Q1 NaN
21127 CALATLANTIC GROUP INC 2002 -0.133648 Construction 22 NaN Q1 NaN
21128 CALATLANTIC GROUP INC 2003 -0.133648 Construction 22 NaN Q1 NaN
Thanks in advance!!!!

The age column can be generated with:
Code
df.set_index(['conm'], inplace=True)
df['age'] = df.groupby(level=0).apply(
    lambda x: max(x.fyear) - round(x.ipodate.iloc[0] / 10000 - 0.5))
Test Code:
import pandas as pd
from io import StringIO

df = pd.read_fwf(StringIO(
u"""
ID conm fyear ipodate
46078 CAESARS ENTERTAINMENT 2003 19891213.0
46079 CAESARS ENTERTAINMENT 2004 19891213.0
46080 CAESARS ENTERTAINMENT 2005 19891213.0
46091 CAESARS ENTERTAINMENT 2016 19891213.0
114620 CAESARSTONE LTD 2010 20120322.0
114621 CAESARSTONE LTD 2011 20120322.0
114622 CAESARSTONE LTD 2012 20120322.0
114623 CAESARSTONE LTD 2013 20120322.0
114624 CAESARSTONE LTD 2014 20120322.0
114625 CAESARSTONE LTD 2015 20120322.0
114626 CAESARSTONE LTD 2016 20120322.0
132524 CAFEPRESS INC 2010 20120329.0
132525 CAFEPRESS INC 2011 20120329.0
132526 CAFEPRESS INC 2012 20120329.0
132527 CAFEPRESS INC 2013 20120329.0
132528 CAFEPRESS INC 2014 20120329.0
132529 CAFEPRESS INC 2015 20120329.0
132530 CAFEPRESS INC 2016 20120329.0
120049 CAI INTERNATIONAL INC 2005 20070516.0
120050 CAI INTERNATIONAL INC 2006 20070516.0
3897 CALAMP CORP 2000 NaN
3898 CALAMP CORP 2001 NaN
3896 CALAMP CORP 1999 NaN
3899 CALAMP CORP 2002 NaN
21120 CALATLANTIC GROUP INC 1995 NaN
21121 CALATLANTIC GROUP INC 1996 NaN
21122 CALATLANTIC GROUP INC 1997 NaN
21123 CALATLANTIC GROUP INC 1998 NaN
21124 CALATLANTIC GROUP INC 1999 NaN
21125 CALATLANTIC GROUP INC 2000 NaN
21126 CALATLANTIC GROUP INC 2001 NaN
21127 CALATLANTIC GROUP INC 2002 NaN
21128 CALATLANTIC GROUP INC 2003 NaN"""),
header=1)
df.set_index(['conm'], inplace=True)
df['age'] = df.groupby(level=0).apply(
    lambda x: max(x.fyear) - round(x.ipodate.iloc[0] / 10000 - 0.5))
print(df)
Results:
ID fyear ipodate age
conm
CAESARS ENTERTAINMENT 46078 2003 19891213.0 27.0
CAESARS ENTERTAINMENT 46079 2004 19891213.0 27.0
CAESARS ENTERTAINMENT 46080 2005 19891213.0 27.0
CAESARS ENTERTAINMENT 46091 2016 19891213.0 27.0
CAESARSTONE LTD 114620 2010 20120322.0 4.0
CAESARSTONE LTD 114621 2011 20120322.0 4.0
CAESARSTONE LTD 114622 2012 20120322.0 4.0
CAESARSTONE LTD 114623 2013 20120322.0 4.0
CAESARSTONE LTD 114624 2014 20120322.0 4.0
CAESARSTONE LTD 114625 2015 20120322.0 4.0
CAESARSTONE LTD 114626 2016 20120322.0 4.0
CAFEPRESS INC 132524 2010 20120329.0 4.0
CAFEPRESS INC 132525 2011 20120329.0 4.0
CAFEPRESS INC 132526 2012 20120329.0 4.0
CAFEPRESS INC 132527 2013 20120329.0 4.0
CAFEPRESS INC 132528 2014 20120329.0 4.0
CAFEPRESS INC 132529 2015 20120329.0 4.0
CAFEPRESS INC 132530 2016 20120329.0 4.0
CAI INTERNATIONAL INC 120049 2005 20070516.0 -1.0
CAI INTERNATIONAL INC 120050 2006 20070516.0 -1.0
CALAMP CORP 3897 2000 NaN NaN
CALAMP CORP 3898 2001 NaN NaN
CALAMP CORP 3896 1999 NaN NaN
CALAMP CORP 3899 2002 NaN NaN
CALATLANTIC GROUP INC 21120 1995 NaN NaN
CALATLANTIC GROUP INC 21121 1996 NaN NaN
CALATLANTIC GROUP INC 21122 1997 NaN NaN
CALATLANTIC GROUP INC 21123 1998 NaN NaN
CALATLANTIC GROUP INC 21124 1999 NaN NaN
CALATLANTIC GROUP INC 21125 2000 NaN NaN
CALATLANTIC GROUP INC 21126 2001 NaN NaN
CALATLANTIC GROUP INC 21127 2002 NaN NaN
CALATLANTIC GROUP INC 21128 2003 NaN NaN
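The first part of the question (the dv quartile column) is not covered above. A minimal sketch with made-up numbers: qcut on within-industry ranks, where rank(method='first') breaks ties so the four bins stay well defined even when many dvpayout values are identical.

```python
import pandas as pd

companysubset = pd.DataFrame({
    'industry': ['Services'] * 4 + ['Manufacturing'] * 4,
    'dvpayout': [0.0, 0.1, 0.226813, 0.3, 0.05, 0.2, 0.487543, 0.9],
})

# Rank within each industry, then cut the ranks into four equal-sized bins;
# ranking first avoids qcut's duplicate-bin-edge errors on tied values.
companysubset['dv'] = companysubset.groupby('industry')['dvpayout'].transform(
    lambda s: pd.qcut(s.rank(method='first'), 4,
                      labels=['Q1', 'Q2', 'Q3', 'Q4']))
```

This needs at least two rows per industry; a single-row group cannot be split into quartiles.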


Pandas DataFrame: Dropping rows after meeting conditions in columns

I have a large panel data in a pandas DataFrame:
import pandas as pd
df = pd.read_csv('Qs_example_data.csv')
df.head()
ID Year DOB status YOD
223725 1991 1975.0 No 2021
223725 1992 1975.0 No 2021
223725 1993 1975.0 No 2021
223725 1994 1975.0 No 2021
223725 1995 1975.0 No 2021
I want to drop the rows based on the following condition:
If the value in YOD matches the value in Year, then all rows after that matching row for that ID are dropped; the same applies once a Yes is observed in the status column for that ID.
For example in the DataFrame, ID 68084329 has the value 2012 in both the Year and YOD columns on row 221930. All rows after 221930 for 68084329 should be dropped.
df.loc[df['ID'] == 68084329]
ID Year DOB status YOD
221910 68084329 1991 1942.0 No 2012
221911 68084329 1992 1942.0 No 2012
221912 68084329 1993 1942.0 No 2012
221913 68084329 1994 1942.0 No 2012
221914 68084329 1995 1942.0 No 2012
221915 68084329 1996 1942.0 No 2012
221916 68084329 1997 1942.0 No 2012
221917 68084329 1998 1942.0 No 2012
221918 68084329 1999 1942.0 No 2012
221919 68084329 2000 1942.0 No 2012
221920 68084329 2001 1942.0 No 2012
221921 68084329 2002 1942.0 No 2012
221922 68084329 2003 1942.0 No 2012
221923 68084329 2004 1942.0 No 2012
221924 68084329 2005 1942.0 No 2012
221925 68084329 2006 1942.0 No 2012
221926 68084329 2007 1942.0 No 2012
221927 68084329 2008 1942.0 No 2012
221928 68084329 2010 1942.0 No 2012
221929 68084329 2011 1942.0 No 2012
221930 68084329 2012 1942.0 Yes 2012
221931 68084329 2013 1942.0 No 2012
221932 68084329 2014 1942.0 No 2012
221933 68084329 2015 1942.0 No 2012
221934 68084329 2016 1942.0 No 2012
221935 68084329 2017 1942.0 No 2012
I have a lot of IDs that have rows which need to be dropped in accordance with the above condition. How do I do this?
The following code should also work:
result = df[0:0]
ids = []
for i in df.ID:
    if i not in ids:
        ids.append(i)
for k in ids:
    temp = df[df.ID == k]
    for j in range(len(temp)):
        result = pd.concat([result, temp.iloc[j:j + 1, :]])
        if temp.iloc[j, :]['status'] == 'Yes':
            break
print(result)
This should do. From your wording, it wasn't clear whether you need to "drop all the rows after you encounter a Yes for that ID", or "just the rows you encounter a Yes in". I assumed that you need to "drop all the rows after you encounter a Yes for that ID".
import pandas as pd

def __get_nos__(df):
    # keep each ID's rows up to (but not including) the first 'Yes'
    return df.iloc[0:(df['Status'] != 'Yes').values.argmin(), :]

df = pd.DataFrame()
df['ID'] = [12345678]*10 + [13579]*10
df['Year'] = list(range(2000, 2010))*2
df['DOB'] = list(range(2000, 2010))*2
df['YOD'] = list(range(2000, 2010))*2
df['Status'] = ['No']*5 + ['Yes']*5 + ['No']*7 + ['Yes']*3
""" df
ID Year DOB YOD Status
0 12345678 2000 2000 2000 No
1 12345678 2001 2001 2001 No
2 12345678 2002 2002 2002 No
3 12345678 2003 2003 2003 No
4 12345678 2004 2004 2004 No
5 12345678 2005 2005 2005 Yes
6 12345678 2006 2006 2006 Yes
7 12345678 2007 2007 2007 Yes
8 12345678 2008 2008 2008 Yes
9 12345678 2009 2009 2009 Yes
10 13579 2000 2000 2000 No
11 13579 2001 2001 2001 No
12 13579 2002 2002 2002 No
13 13579 2003 2003 2003 No
14 13579 2004 2004 2004 No
15 13579 2005 2005 2005 No
16 13579 2006 2006 2006 No
17 13579 2007 2007 2007 Yes
18 13579 2008 2008 2008 Yes
19 13579 2009 2009 2009 Yes
"""
df.groupby('ID').apply(lambda x: __get_nos__(x)).reset_index(drop=True)
""" Output
ID Year DOB YOD Status
0 13579 2000 2000 2000 No
1 13579 2001 2001 2001 No
2 13579 2002 2002 2002 No
3 13579 2003 2003 2003 No
4 13579 2004 2004 2004 No
5 13579 2005 2005 2005 No
6 13579 2006 2006 2006 No
7 12345678 2000 2000 2000 No
8 12345678 2001 2001 2001 No
9 12345678 2002 2002 2002 No
10 12345678 2003 2003 2003 No
11 12345678 2004 2004 2004 No
"""

Set "Year" column to individual columns to create a panel

I am trying to reshape the following dataframe such that it is in panel data form by moving the "Year" column such that each year is an individual column.
Out[34]:
Award Year 0
State
Alabama 2003 89
Alabama 2004 92
Alabama 2005 108
Alabama 2006 81
Alabama 2007 71
... ...
Wyoming 2011 4
Wyoming 2012 2
Wyoming 2013 1
Wyoming 2014 4
Wyoming 2015 3
[648 rows x 2 columns]
I want the years to each be individual columns, this is an example,
Out[48]:
State 2003 2004 2005 2006
0 NewYork 10 10 10 10
1 Alabama 15 15 15 15
2 Washington 20 20 20 20
I have read up on stack/unstack but I don't think I want a multilevel index as a result. I have been looking through the documentation (to_frame etc.) but I can't see what I am looking for.
If anyone can help that would be great!
Use set_index with append=True, select column '0', and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
pivot_table can also help:
df2 = pd.pivot_table(df,values='0', columns='AwardYear', index=['State'])
df2
Result:
AwardYear 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0

Why is that when I use pandas to scrape a table from a website it skips the middle columns and only prints the first 2 and last 2

I am currently working on a program that scrapes Yahoo Finance Earnings Calendar Page and stores the data in a file. I am able to scrape the data but I am confused as to why it only scrapes the first 2 and last 2 columns. I also tried to do the same with a table on Wikipedia for List of S&P 500 Companies and am running into the same problem. Any help is appreciated.
Yahoo Finance Code
import csv
import pandas as pd
earnings = pd.read_html('https://finance.yahoo.com/calendar/earnings?day=2019-11-19')[0]
fileName = "testFile"
with open(fileName + ".csv", mode='w') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([earnings])
print(earnings)
Wikipedia Code
import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url) # Returns list of all tables on page
sp500_table = tables[0] # Select table of interest
print(sp500_table)
~EDIT~
Here is the output I get from the Yahoo Finance Code
" Symbol Company ... Reported EPS Surprise(%)
0 WUBA 58.com Inc ... NaN NaN
1 ARMK Aramark ... NaN NaN
2 AFMD Affimed NV ... NaN NaN
3 NJR New Jersey Resources Corp ... NaN NaN
4 ECCB Eagle Point Credit Company Inc ... NaN NaN
5 TOUR Tuniu Corp ... NaN NaN
6 EIC Eagle Point Income Company Inc ... NaN NaN
7 KSS Kohls Corp ... NaN NaN
8 JKS JinkoSolar Holding Co Ltd ... NaN NaN
9 DL China Distance Education Holdings Ltd ... NaN NaN
10 TJX TJX Companies Inc ... NaN NaN
11 HD Home Depot Inc ... NaN NaN
12 PAGS PagSeguro Digital Ltd ... NaN NaN
13 ESE ESCO Technologies Inc ... NaN NaN
14 RADA Rada Electronic Industries Ltd ... NaN NaN
15 RADA Rada Electronic Industries Ltd ... NaN NaN
16 DAVA Endava PLC ... NaN NaN
17 FALC FalconStor Software Inc ... NaN NaN
18 GVP GSE Systems Inc ... NaN NaN
19 TDG TransDigm Group Inc ... NaN NaN
20 PPDF PPDAI Group Inc ... NaN NaN
21 GRBX Greenbox Pos ... NaN NaN
22 THMO Thermogenesis Holdings Inc ... NaN NaN
23 MMS Maximus Inc ... NaN NaN
24 NXTD NXT-ID Inc ... NaN NaN
25 URBN Urban Outfitters Inc ... NaN NaN
26 SINT SINTX Technologies Inc ... NaN NaN
27 ORNC Oranco Inc ... NaN NaN
28 LAIX LAIX Inc ... NaN NaN
29 MDT Medtronic PLC ... NaN NaN
[30 rows x 6 columns]"
Here is the output I get from Wikipedia Code
Symbol Security ... CIK Founded
0 MMM 3M Company ... 66740 1902
1 ABT Abbott Laboratories ... 1800 1888
2 ABBV AbbVie Inc. ... 1551152 2013 (1888)
3 ABMD ABIOMED Inc ... 815094 1981
4 ACN Accenture plc ... 1467373 1989
5 ATVI Activision Blizzard ... 718877 2008
6 ADBE Adobe Systems Inc ... 796343 1982
7 AMD Advanced Micro Devices Inc ... 2488 1969
8 AAP Advance Auto Parts ... 1158449 1932
9 AES AES Corp ... 874761 1981
10 AMG Affiliated Managers Group Inc ... 1004434 1993
11 AFL AFLAC Inc ... 4977 1955
12 A Agilent Technologies Inc ... 1090872 1999
13 APD Air Products & Chemicals Inc ... 2969 1940
14 AKAM Akamai Technologies Inc ... 1086222 1998
15 ALK Alaska Air Group Inc ... 766421 1985
16 ALB Albemarle Corp ... 915913 1994
17 ARE Alexandria Real Estate Equities ... 1035443 1994
18 ALXN Alexion Pharmaceuticals ... 899866 1992
19 ALGN Align Technology ... 1097149 1997
20 ALLE Allegion ... 1579241 1908
21 AGN Allergan, Plc ... 1578845 1983
22 ADS Alliance Data Systems ... 1101215 1996
23 LNT Alliant Energy Corp ... 352541 1917
24 ALL Allstate Corp ... 899051 1931
25 GOOGL Alphabet Inc Class A ... 1652044 1998
26 GOOG Alphabet Inc Class C ... 1652044 1998
27 MO Altria Group Inc ... 764180 1985
28 AMZN Amazon.com Inc. ... 1018724 1994
29 AMCR Amcor plc ... 1748790 NaN
.. ... ... ... ... ...
475 VIAB Viacom Inc. ... 1339947 NaN
476 V Visa Inc. ... 1403161 NaN
477 VNO Vornado Realty Trust ... 899689 NaN
478 VMC Vulcan Materials ... 1396009 NaN
479 WAB Wabtec Corporation ... 943452 NaN
480 WMT Walmart ... 104169 NaN
481 WBA Walgreens Boots Alliance ... 1618921 NaN
482 DIS The Walt Disney Company ... 1001039 NaN
483 WM Waste Management Inc. ... 823768 1968
484 WAT Waters Corporation ... 1000697 1958
485 WEC Wec Energy Group Inc ... 783325 NaN
486 WCG WellCare ... 1279363 NaN
487 WFC Wells Fargo ... 72971 NaN
488 WELL Welltower Inc. ... 766704 NaN
489 WDC Western Digital ... 106040 NaN
490 WU Western Union Co ... 1365135 1851
491 WRK WestRock ... 1636023 NaN
492 WY Weyerhaeuser ... 106535 NaN
493 WHR Whirlpool Corp. ... 106640 1911
494 WMB Williams Cos. ... 107263 NaN
495 WLTW Willis Towers Watson ... 1140536 NaN
496 WYNN Wynn Resorts Ltd ... 1174922 NaN
497 XEL Xcel Energy Inc ... 72903 1909
498 XRX Xerox ... 108772 1906
499 XLNX Xilinx ... 743988 NaN
500 XYL Xylem Inc. ... 1524472 NaN
501 YUM Yum! Brands Inc ... 1041061 NaN
502 ZBH Zimmer Biomet Holdings ... 1136869 NaN
503 ZION Zions Bancorp ... 109380 NaN
504 ZTS Zoetis ... 1555280 NaN
[505 rows x 9 columns]
As you can see in both examples, the printed table omits the columns in the middle and only displays the first 2 and last 2.
~EDIT#2~
Making this change to the code now displays all columns, but it does so in two separate tables instead. Any idea as to why it does this?
fileName = "yahooFinance_Pandas"
with pd.option_context('display.max_columns', None): # more options can be specified also
with open(fileName + ".csv", mode='w') as csv_file:
writer = csv.writer(csv_file)
writer.writerow([earnings])
OUTPUT
" Symbol Company Earnings Call Time \
0 WUBA 58.com Inc Before Market Open
1 ARMK Aramark Before Market Open
2 AFMD Affimed NV TAS
3 NJR New Jersey Resources Corp Before Market Open
4 ECCB Eagle Point Credit Company Inc Before Market Open
5 TOUR Tuniu Corp Before Market Open
6 EIC Eagle Point Income Company Inc Before Market Open
7 KSS Kohls Corp Before Market Open
8 JKS JinkoSolar Holding Co Ltd Before Market Open
9 DL China Distance Education Holdings Ltd After Market Close
10 TJX TJX Companies Inc Before Market Open
11 HD Home Depot Inc Before Market Open
12 PAGS PagSeguro Digital Ltd TAS
13 ESE ESCO Technologies Inc After Market Close
14 RADA Rada Electronic Industries Ltd TAS
15 RADA Rada Electronic Industries Ltd Before Market Open
16 DAVA Endava PLC TAS
17 FALC FalconStor Software Inc After Market Close
18 GVP GSE Systems Inc TAS
19 TDG TransDigm Group Inc Before Market Open
20 PPDF PPDAI Group Inc Before Market Open
21 GRBX Greenbox Pos Time Not Supplied
22 THMO Thermogenesis Holdings Inc After Market Close
23 MMS Maximus Inc TAS
24 NXTD NXT-ID Inc TAS
25 URBN Urban Outfitters Inc After Market Close
26 SINT SINTX Technologies Inc Time Not Supplied
27 ORNC Oranco Inc Time Not Supplied
28 LAIX LAIX Inc After Market Close
29 MDT Medtronic PLC TAS
EPS Estimate Reported EPS Surprise(%)
0 0.82 NaN NaN
1 0.69 NaN NaN
2 -0.17 NaN NaN
3 0.28 NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 0.86 NaN NaN
8 0.83 NaN NaN
9 0.33 NaN NaN
10 0.66 NaN NaN
11 2.52 NaN NaN
12 0.29 NaN NaN
13 1.06 NaN NaN
14 -0.02 NaN NaN
15 -0.02 NaN NaN
16 21.21 NaN NaN
17 NaN NaN NaN
18 0.03 NaN NaN
19 5.16 NaN NaN
20 0.26 NaN NaN
21 NaN NaN NaN
22 -0.12 NaN NaN
23 0.94 NaN NaN
24 NaN NaN NaN
25 0.57 NaN NaN
26 NaN NaN NaN
27 NaN NaN NaN
28 -0.32 NaN NaN
29 1.28 NaN NaN "
~EDIT#3~
Made this change as you requested, @Alex:
earnings.to_csv(r'C:\Users\akkir\Desktop\pythonSelenium\export_dataframe.csv', index = None)
OUTPUT
Symbol,Company,Earnings Call Time,EPS Estimate,Reported EPS,Surprise(%)
ATTO,Atento SA,TAS,0.09,0.03,-66.67
ALPN,Alpine Immune Sciences Inc,TAS,-0.68,-0.62,8.82
ALPN,Alpine Immune Sciences Inc,Time Not Supplied,-0.68,-0.62,8.82
HOLI,Hollysys Automation Technologies Ltd,TAS,0.48,0.49,2.08
IDSA,Industrial Services of America Inc,After Market Close,,,
AGRO,Adecoagro SA,TAS,-0.01,,
ATOS,Atossa Genetics Inc,TAS,-0.52,-0.36,30.77
AXAS,Abraxas Petroleum Corp,TAS,0.03,0.02,-33.33
ACIU,AC Immune SA,TAS,0.17,0.25,47.06
ARCO,Arcos Dorados Holdings Inc,TAS,0.08,0.13,62.5
WTER,Alkaline Water Company Inc,Time Not Supplied,-0.07,-0.07,
ALNA,Allena Pharmaceuticals Inc,Before Market Open,-0.49,-0.57,-16.33
AEYE,AudioEye Inc,TAS,-0.26,-0.27,-3.85
APLT,Applied Therapeutics Inc,Before Market Open,-0.49,-0.63,-28.57
ALT,Altimmune Inc,TAS,-0.19,-0.73,-284.21
ABEOW,Abeona Therapeutics Inc,TAS,,,
ACER,Acer Therapeutics Inc,After Market Close,-0.57,-0.52,8.77
SRNN,Southern Banc Company Inc,Time Not Supplied,,,
SPB,Spectrum Brands Holdings Inc,Before Market Open,1.11,1.13,1.8
BIOC,Biocept Inc,TAS,-0.27,-0.25,7.41
IDXG,Interpace Biosciences Inc,TAS,-0.19,-0.19,
GTBP,GT Biopharma Inc,After Market Close,,,
MTNB,Matinas BioPharma Holdings Inc,Time Not Supplied,-0.03,-0.03,
MTNB,Matinas BioPharma Holdings Inc,TAS,-0.03,-0.03,
XELB,Xcel Brands Inc,After Market Close,0.12,0.06,-50.0
BBI,Brickell Biotech Inc,After Market Close,,,
SNBP,Sun Biopharma Inc,Before Market Open,,,
BZH,Beazer Homes USA Inc,TAS,0.51,0.08,-84.31
SELB,Selecta Biosciences Inc,TAS,-0.33,-0.26,21.21
BEST,BEST Inc,Before Market Open,,0.01,
CBPO,China Biologic Products Holdings Inc,TAS,0.88,1.4,59.09
TPCS,TechPrecision Corp,TAS,,,
LK,Luckin Coffee Inc,Before Market Open,-0.37,-0.32,13.51
CYD,China Yuchai International Ltd,Before Market Open,0.45,0.17,-62.22
CCF,Chase Corp,After Market Close,,,
SMCI,Super Micro Computer Inc,After Market Close,,,
AUMN,Golden Minerals Co,TAS,,,
PGR,Progressive Corp,Before Market Open,1.3,1.33,2.31
PUMP,ProPetro Holding Corp,TAS,0.51,0.33,-35.29
CPLG,CorePoint Lodging Inc,TAS,-0.44,-0.22,50.0
CHNG,Change Healthcare Inc,After Market Close,0.27,0.27,
NOVC,Novation Companies Inc,Time Not Supplied,,,
WFCF,Where Food Comes From Inc,Before Market Open,,,
CYCCP,Cyclacel Pharmaceuticals Inc,After Market Close,,,
ISCO,International Stem Cell Corp,Before Market Open,,,
CPA,Copa Holdings SA,TAS,2.23,2.45,9.87
CSCO,Cisco Systems Inc,TAS,0.81,0.84,3.7
GMDA,Gamida Cell Ltd,TAS,-0.36,-0.3,16.67
CHRA,Charah Solutions Inc,TAS,-0.05,-0.11,-120.0
MNI,McClatchy Co,TAS,-1.01,-0.16,84.16
ENSV,Enservco Corp,TAS,-0.06,-0.1,-66.67
TK,Teekay Corp,TAS,,,
SANW,S&W Seed Co,TAS,-0.15,-0.15,
SANW,S&W Seed Co,Before Market Open,-0.15,-0.15,
CMCM,Cheetah Mobile Inc,TAS,0.14,0.49,250.0
CYRN,Cyren Ltd,TAS,-0.07,-0.06,14.29
CATS,Catasys Inc,TAS,-0.32,-0.52,-62.5
GLAD,Gladstone Capital Corp,TAS,0.21,0.21,
PING,Ping Identity Holding Corp,After Market Close,0.01,0.13,1200.0
CRWS,Crown Crafts Inc,Before Market Open,0.18,0.18,
CTRP,Ctrip.Com International Ltd,After Market Close,0.29,,
GFF,Griffon Corp,After Market Close,0.33,0.4,21.21
CLIR,Clearsign Technologies Corp,After Market Close,,,
DMAC,DiaMedica Therapeutics Inc,After Market Close,,,
DSSI,Diamond S Shipping Inc,Time Not Supplied,-0.12,-0.19,-58.33
DSSI,Diamond S Shipping Inc,TAS,-0.12,-0.19,-58.33
DYAI,Dyadic International Inc,After Market Close,,,
ONE,OneSmart International Education Group Ltd,Before Market Open,,,
EFOI,Energy Focus Inc,Before Market Open,-0.15,-0.08,46.67
EDAP,Edap Tms SA,TAS,0.04,0.03,-25.0
EYEN,Eyenovia Inc,Before Market Open,-0.34,-0.29,14.71
EQS,EQUUS Total Return Inc,After Market Close,,,
SENR,Strategic Environmental & Energy Resources Inc,Before Market Open,,,
EPSN,Epsilon Energy Ltd,TAS,,,
GRMM,Grom Social Enterprises Inc,Before Market Open,,,
ECOR,"electroCore, Inc.",TAS,-0.31,-0.36,-16.13
SD,SandRidge Energy Inc,TAS,,,
ENR,Energizer Holdings Inc,TAS,0.81,0.93,14.81
ELMD,Electromed Inc,TAS,0.01,0.12,1100.0
EVK,Ever-Glory International Group Inc,TAS,,,
FTEK,Fuel Tech Inc,After Market Close,-0.03,-0.05,-66.67
FVRR,Fiverr International Ltd,Before Market Open,-0.19,-0.12,36.84
SGRP,SPAR Group Inc,TAS,,,
NSEC,National Security Group Inc,Time Not Supplied,,,
SNDL,Sundial Growers Inc,TAS,-0.08,,
SNDL,Sundial Growers Inc,Before Market Open,-0.08,,
TCOM,Trip.com Group Ltd,TAS,,,
RAVE,Rave Restaurant Group Inc,TAS,,,
SLGG,Super League Gaming Inc,After Market Close,-0.36,-0.43,-19.44
HI,Hillenbrand Inc,After Market Close,0.73,0.76,4.11
HROW,Harrow Health Inc,TAS,-0.24,-0.29,-20.83
NVGS,Navigator Holdings Ltd,TAS,-0.07,-0.01,85.71
INFU,InfuSystem Holdings Inc,Before Market Open,,,
OSW,OneSpaWorld Holdings Ltd,Before Market Open,0.12,0.11,-8.33
VIPS,Vipshop Holdings Ltd,TAS,0.17,0.25,47.06
PRTH,Priority Technology Holdings Inc,After Market Close,-0.12,-0.08,33.33
TGC,Tengasco Inc,TAS,,,
PRSP,Perspecta Inc,After Market Close,0.51,0.54,5.88
REED,Reed's Inc,After Market Close,-0.11,-0.14,-27.27
WSTL,Westell Technologies Inc,After Market Close,,,
As far as I can tell this has nothing to do with the data and everything to do with the representation. Only the first and last couple of columns are printed so as to keep the output from being massive and difficult to read. You can even see at the end of your output that your DataFrame has 9 columns.
Take a look at pandas' display options if you want to print the entire thing. You could also use .info() to get some general information on your columns.
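For reference, a small sketch of lifting the truncation temporarily with option_context (display.max_columns set to None means unlimited):

```python
import pandas as pd

# A deliberately wide frame: 30 columns, 3 rows
df = pd.DataFrame({f'col{i}': range(3) for i in range(30)})

# By default repr() elides the middle columns of wide frames;
# inside the context manager every column is shown, and the
# defaults are restored automatically on exit.
with pd.option_context('display.max_columns', None, 'display.width', None):
    print(df)
```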
Thanks to @AlexanderCécile for the help regarding this issue.
For those interested in how he fixed my issue, the code is below.
import pandas as pd
from datetime import date

earnings = pd.read_html('https://finance.yahoo.com/calendar/earnings?day=2019-11-13')[0]
# to_csv writes every column regardless of display options, so no option tweaking is needed
earnings.to_csv(r'C:\Users\<user>\Desktop\earnings_{}.csv'.format(date.today()), index=None)

pandas DataFrame .stack(dropna=False) but keeping existing combinations of levels

My data looks like this
import numpy as np
import pandas as pd
# My Data
enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
n_students = [[100, 100, 110, 110, np.nan]]
df = pd.DataFrame(
    n_students,
    columns=pd.MultiIndex.from_arrays(
        [enroll_year, grad_year],
        names=['enroll_year', 'grad_year']))
print(df)
# enroll_year 2010 2011 2012 2013 2014
# grad_year 2014 2015 2016 2017 2018
# 0 100 100 110 110 NaN
What I am trying to do is to stack the data, one column/index level for year of enrollment, one for year of graduation and one for the numbers of students, which should look like
# enroll_year grad_year n
# 2010 2014 100.0
# . . .
# . . .
# . . .
# 2014 2018 NaN
The data produced by .stack() is very close, but the missing record(s) are dropped,
df1 = df.stack(['enroll_year', 'grad_year'])
df1.index = df1.index.droplevel(0)
print(df1)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# dtype: float64
So, .stack(dropna=False) is tried, but it will expand the index levels to all combinations of enrollment and graduation years
df2 = df.stack(['enroll_year', 'grad_year'], dropna=False)
df2.index = df2.index.droplevel(0)
print(df2)
# enroll_year grad_year
# 2010 2014 100.0
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2011 2014 NaN
# 2015 100.0
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2012 2014 NaN
# 2015 NaN
# 2016 110.0
# 2017 NaN
# 2018 NaN
# 2013 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 110.0
# 2018 NaN
# 2014 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# dtype: float64
And I need to subset df2 to get my desired data set.
existing_combn = list(zip(
    df.columns.levels[0][df.columns.labels[0]],
    df.columns.levels[1][df.columns.labels[1]]))
df3 = df2.loc[existing_combn]
print(df3)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# 2014 2018 NaN
# dtype: float64
Although it only adds a few more extra lines to my code, I wonder if there are any better and neater approaches.
Use unstack wrapped in pd.DataFrame, then reset_index, drop the unnecessary column and rename column 0:
pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Or:
df.unstack().reset_index(level=2, drop=True)
enroll_year grad_year
2010 2014 100.0
2011 2015 100.0
2012 2016 110.0
2013 2017 110.0
2014 2018 NaN
dtype: float64
Or:
df.unstack().reset_index(level=2, drop=True).reset_index().rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Explanation :
print(pd.DataFrame(df.unstack()))
0
enroll_year grad_year
2010 2014 0 100.0
2011 2015 0 100.0
2012 2016 0 110.0
2013 2017 0 110.0
2014 2018 0 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1))
enroll_year grad_year 0
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'}))
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
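Since the frame has only a single row, transposing is perhaps the shortest route (a sketch on the same data as in the question): .T turns the column MultiIndex directly into a row MultiIndex, keeping only the (enroll_year, grad_year) combinations that actually exist.

```python
import numpy as np
import pandas as pd

enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
df = pd.DataFrame(
    [[100, 100, 110, 110, np.nan]],
    columns=pd.MultiIndex.from_arrays(
        [enroll_year, grad_year], names=['enroll_year', 'grad_year']))

# Transpose: the column MultiIndex becomes the row MultiIndex,
# so no stacking and no spurious year combinations appear
out = df.T.reset_index().rename(columns={0: 'n'})
```

This shortcut relies on the single-row shape; with several rows you would be back to unstack.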

pandas data frame add columns with function by year and company

I have a pandas DataFrame called 'sm' like the one below with many rows, and I want to calculate the sales growth rate by company and year and insert it into a new column named 'sg'.
fyear ch conm sale ipodate
0 1996 51.705 AAR CORP 589.328 NaN
1 1997 17.222 AAR CORP 782.123 NaN
2 1998 8.250 AAR CORP 918.036 NaN
3 1999 1.241 AAR CORP 1024.333 NaN
4 2000 13.809 AAR CORP 874.255 NaN
5 2001 34.522 AAR CORP 638.721 NaN
6 2002 29.154 AAR CORP 606.337 NaN
7 2003 41.010 AAR CORP 651.958 NaN
8 2004 40.508 AAR CORP 747.848 NaN
9 2005 121.738 AAR CORP 897.284 NaN
10 2006 83.317 AAR CORP 1061.169 NaN
11 2007 109.391 AAR CORP 1384.919 NaN
12 2008 112.505 AAR CORP 1423.976 NaN
13 2009 79.370 AAR CORP 1352.151 NaN
14 2010 57.433 AAR CORP 1775.782 NaN
15 2011 67.720 AAR CORP 2074.498 NaN
16 2012 75.300 AAR CORP 2167.100 NaN
17 2013 89.200 AAR CORP 2035.000 NaN
18 2014 54.700 AAR CORP 1594.300 NaN
19 2015 31.200 AAR CORP 1662.600 NaN
20 1997 64.000 AMERICAN AIRLINES GROUP INC 18570.000 NaN
21 1998 95.000 AMERICAN AIRLINES GROUP INC 19205.000 NaN
22 1999 85.000 AMERICAN AIRLINES GROUP INC 17730.000 NaN
23 2000 89.000 AMERICAN AIRLINES GROUP INC 19703.000 NaN
24 2001 120.000 AMERICAN AIRLINES GROUP INC 18963.000 NaN
115466 2014 290.500 ALLEGION PLC 2118.300 NaN
115467 2015 199.700 ALLEGION PLC 2068.100 NaN
115468 2016 312.400 ALLEGION PLC 2238.000 NaN
115470 2013 2.063 AGILITY HEALTH INC 63.052 NaN
115471 2014 1.301 AGILITY HEALTH INC 62.105 NaN
115472 2015 1.307 AGILITY HEALTH INC 62.328 NaN
115473 2013 109.819 NORDIC AMERICAN OFFSHORE NaN NaN
115474 2014 46.398 NORDIC AMERICAN OFFSHORE 52.789 NaN
115475 2015 5.339 NORDIC AMERICAN OFFSHORE 36.372 NaN
115476 2016 2.953 NORDIC AMERICAN OFFSHORE 16.249 NaN
115477 2011 2.040 DORIAN LPG LTD 34.571 20140508.0
115478 2012 1.042 DORIAN LPG LTD 38.662 20140508.0
115479 2013 279.132 DORIAN LPG LTD 29.634 20140508.0
115480 2014 204.821 DORIAN LPG LTD 104.129 20140508.0
115481 2015 46.412 DORIAN LPG LTD 289.208 20140508.0
115482 2013 948.684 NOMAD FOODS LTD 2074.842 NaN
115483 2014 855.541 NOMAD FOODS LTD 1816.239 NaN
115484 2015 671.846 NOMAD FOODS LTD 971.013 NaN
115485 2016 347.688 NOMAD FOODS LTD 2034.109 NaN
115487 2014 2638.000 ATHENE HOLDING LTD 4100.000 20161209.0
115488 2015 2720.000 ATHENE HOLDING LTD 2616.000 20161209.0
115489 2016 2459.000 ATHENE HOLDING LTD 4107.000 20161209.0
115490 2013 3.956 MIDATECH PHARMA PLC 0.244 NaN
115491 2014 47.240 MIDATECH PHARMA PLC 0.245 NaN
115492 2015 23.852 MIDATECH PHARMA PLC 2.028 NaN
115493 2016 21.723 MIDATECH PHARMA PLC 8.541 NaN
I implemented code like this,
d = sm.loc[sm['conm'] == 'AAR CORP']
dt = d.loc[d.fyear == 1996, 'sale'].values[0]
dtp1 = d.loc[d.fyear == 1997, 'sale'].values[0]
sg = (dtp1 - dt) / dt * 100
d.loc[d.fyear == 1997, 'sg'] = sg
and it gives me the column at the end.
fyear ch conm sale ipodate sg
0 1996 51.705 AAR CORP 589.328 NaN NaN
1 1997 17.222 AAR CORP 782.123 NaN 32.71438
2 1998 8.250 AAR CORP 918.036 NaN NaN
I want the 'sg' column to be next to the 'sale' column, and I want to calculate this sales growth rate for every company and every year (1996-2015), inserting it into the row for year t. Right now I slice by company name into a small data frame and then calculate the growth rate, but since I have more than 9,000 unique company names, this method seems inefficient. Can I do this without slicing by company name? Thank you in advance.
Take a look at pct_change. Grouping by company first keeps the change from being computed across company boundaries:
df['sg'] = df.groupby('conm')['sale'].pct_change() * 100
# sort df first
df.sort_values(by=['conm', 'fyear'], inplace=True)
# group by company, get the rolling increase for each company and insert it into the dataframe
df.insert(df.columns.tolist().index('sale') + 1,
          'sm', df.groupby(by=['conm'])['sale']
                  .apply(lambda x: x.rolling(2).apply(lambda y: (y[1] - y[0]) / y[0] * 100)))
df
Out[296]:
fyear ch conm sale sm ipodate
0 1996 51.705 AAR CORP 589.328 NaN NaN
1 1997 17.222 AAR CORP 782.123 32.714380 NaN
2 1998 8.250 AAR CORP 918.036 17.377446 NaN
3 1999 1.241 AAR CORP 1024.333 11.578740 NaN
4 2000 13.809 AAR CORP 874.255 -14.651290 NaN
5 2001 34.522 AAR CORP 638.721 -26.941110 NaN
6 2002 29.154 AAR CORP 606.337 -5.070132 NaN
7 2003 41.010 AAR CORP 651.958 7.524034 NaN
8 2004 40.508 AAR CORP 747.848 14.708003 NaN
9 2005 121.738 AAR CORP 897.284 19.982135 NaN
10 2006 83.317 AAR CORP 1061.169 18.264563 NaN
11 2007 109.391 AAR CORP 1384.919 30.508807 NaN
12 2008 112.505 AAR CORP 1423.976 2.820165 NaN
13 2009 79.370 AAR CORP 1352.151 -5.043975 NaN
14 2010 57.433 AAR CORP 1775.782 31.330155 NaN
15 2011 67.720 AAR CORP 2074.498 16.821659 NaN
16 2012 75.300 AAR CORP 2167.100 4.463827 NaN
17 2013 89.200 AAR CORP 2035.000 -6.095704 NaN
18 2014 54.700 AAR CORP 1594.300 -21.656020 NaN
19 2015 31.200 AAR CORP 1662.600 4.284012 NaN
20 1997 64.000 AMERICAN AIRLINES GROUP INC 18570.000 NaN NaN
21 1998 95.000 AMERICAN AIRLINES GROUP INC 19205.000 3.419494 NaN
22 1999 85.000 AMERICAN AIRLINES GROUP INC 17730.000 -7.680292 NaN
23 2000 89.000 AMERICAN AIRLINES GROUP INC 19703.000 11.128032 NaN
24 2001 120.000 AMERICAN AIRLINES GROUP INC 18963.000 -3.755773 NaN
Another possible solution (since it does not use apply(), it is potentially faster):
df['sm'] = (df.sort_values(['conm', 'fyear'])
              .groupby('conm')['sale']
              .diff()
              .shift(-1) / df['sale']).shift() * 100
This solution assumes that there is always a 1-year difference between consecutive years.
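Putting the grouped pct_change approach together, a minimal runnable sketch (numbers taken from the first rows of the question; the first year of each company has no prior year, hence NaN):

```python
import pandas as pd

sm = pd.DataFrame({
    'conm': ['AAR CORP'] * 3 + ['AMERICAN AIRLINES GROUP INC'] * 2,
    'fyear': [1996, 1997, 1998, 1997, 1998],
    'sale': [589.328, 782.123, 918.036, 18570.0, 19205.0],
})

# Sort so consecutive rows within a company are consecutive years,
# then let pct_change run per company so growth never crosses companies
sm = sm.sort_values(['conm', 'fyear'])
sm['sg'] = sm.groupby('conm')['sale'].pct_change() * 100
```

No per-company slicing is needed; pandas handles all 9,000+ companies in one grouped pass.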
