Web scraping IndexError: "list index out of range" - what's going on? - Python

I'm trying to make an app that scrapes the stock prices of my top ten favorite space-related companies, but I'm having some trouble with my code, and I'm new to scraping. Once I get this to work, I want to put the data into a CSV file and make a bar graph from it. I would love some help and suggestions. Also, I'm doing this in Anaconda:
#import libraries
import bs4
from bs4 import BeautifulSoup
#grequests is a unique library that allows you to request many urls with ease
#to install grequests in Anaconda use: conda install -c conda-forge grequests
#if you know a better way to do this, please let me know
import grequests
#scraping my top ten favorite space companies, attempted to pick companies with pure play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]
unsent_request = (grequests.get(url) for url in urls)
results = grequests.map(unsent_request)

def parsePrice(r):
    soup = bs4.BeautifulSoup(r.text, "html")
    price = soup.find_all('div', {'class': 'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="52">4.1500'})[0].find('span').text
    return price

for r in results:
    parsePrice(r)
So what in my code is causing this error:
IndexError                                Traceback (most recent call last)
<ipython-input-6-9ac8cb94b6fb> in <module>
      5
      6 for r in results:
----> 7     parsePrice(r)

<ipython-input-6-9ac8cb94b6fb> in parsePrice(r)
      1 def parsePrice(r):
      2     soup = bs4.BeautifulSoup(r.text,"html")
----> 3     price=soup.find_all('div',{'class':'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="52">4.1500'})[0].find('span').text
      4     return price
      5

IndexError: list index out of range
What's going on?

The data on the page is within <table> tags. Use pandas' .read_html(), as it uses BeautifulSoup under the hood. That way you can grab much more of the data at once.
That data is also available through the API/XHR requests the page makes, but I won't get into that, as it's slightly more complex.
import pandas as pd
#scraping my top ten favorite space companies, attempted to pick companies with pure play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]
def parsePrice(r):
    df = pd.read_html(r)[0].T
    cols = list(df.iloc[0, :])
    temp_df = pd.DataFrame([list(df.iloc[1, :])], columns=cols)
    temp_df['url'] = r
    return temp_df

df = pd.DataFrame()
for r in urls:
    # note: DataFrame.append was removed in pandas 2.0; there, use
    # df = pd.concat([df, parsePrice(r)], ignore_index=True) instead
    df = df.append(parsePrice(r), sort=True).reset_index(drop=True)
df.to_csv('path/filename.csv', index=False)
Output:
print (df.to_string())
52 Week Range Ask Avg. Volume Bid Day's Range Open Previous Close Volume url
0 7.32 - 9.87 8.09 x 800 23415 8.06 x 800 8.01 - 8.11 8.10 8.01 6337 https://finance.yahoo.com/quote/GILT/
1 32.14 - 42.77 32.74 x 1100 41759 32.59 x 1000 32.28 - 32.75 32.32 32.28 14685 https://finance.yahoo.com/quote/LORL?p=LORL&.t...
2 5.55 - 27.29 6.64 x 800 5746553 6.63 x 2900 6.51 - 6.68 6.64 6.65 995245 https://finance.yahoo.com/quote/I?p=I&.tsrc=fi...
3 55.93 - 97.31 72.21 x 800 281600 72.16 x 1000 71.51 - 72.80 72.26 72.32 74758 https://finance.yahoo.com/quote/VSAT?p=VSAT&.t...
4 144.27 - 220.03 215.54 x 1000 1560562 215.37 x 800 214.87 - 217.45 215.85 214.86 203957 https://finance.yahoo.com/quote/RTN?p=RTN&.tsr...
5 100.48 - 149.81 145.03 x 800 2749725 144.96 x 800 144.41 - 145.56 145.49 144.52 489169 https://finance.yahoo.com/quote/UTX?ltr=1
6 189.35 - 351.53 343.34 x 800 280325 342.80 x 800 342.84 - 346.29 344.16 343.58 42326 https://finance.yahoo.com/quote/TDY?ltr=1
7 3.5800 - 9.7900 4.1400 x 1300 778343 4.1300 x 800 4.1200 - 4.2000 4.1700 4.1500 62335 https://finance.yahoo.com/quote/ORBC?ltr=1
8 6.90 - 12.09 7.37 x 900 2280333 7.38 x 800 7.24 - 7.48 7.30 7.22 539082 https://finance.yahoo.com/quote/SPCE?p=SPCE&.t...
9 292.47 - 446.01 348.73 x 800 4420225 348.79 x 800 345.70 - 350.42 350.22 348.84 1258813 https://finance.yahoo.com/quote/BA?p=BA&.tsrc=...
But IF you must go the route of BeautifulSoup, your find_all() is incorrect. First, the class is strictly the text between the double quotes after class=. You've included other attributes of the element, such as data-reactid, and the actual content/text that you are wanting to pull. Secondly, that class is on the <span> tag, not the div tag. If you pull the div tag, that's fine, but then you'd still need to go inside THAT element to get the text.
Give this a try:
import bs4
import requests
#scraping my top ten favorite space companies, attempted to pick companies with pure play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]
def parsePrice(r):
    resp = requests.get(r)
    soup = bs4.BeautifulSoup(resp.text, "html")
    price = soup.find_all('span', {'class': 'Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)'})[0].text
    return price

for r in urls:
    print(parsePrice(r))
Output:
8.06
32.76
6.60
72.22
215.54
145.14
343.28
4.1550
7.43
348.32
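One more note on the original traceback: find_all() returns an empty list when nothing matches, so indexing [0] on a miss is exactly what raises IndexError: list index out of range (for example, if Yahoo serves a page without the quote markup). Here is a small defensive sketch of the same parser; the guard is my own addition, not part of the answer above:

import bs4
import requests

def parsePrice(r):
    resp = requests.get(r)
    soup = bs4.BeautifulSoup(resp.text, "html")
    matches = soup.find_all('span', {'class': 'Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)'})
    # guard against an empty result instead of blindly indexing [0]
    if not matches:
        return None  # layout changed or the request was blocked
    return matches[0].text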

Related

Struggling to grab data from baseball reference

I'm trying to grab the batting-against tables for all pitchers found on this page.
I believe the problem lies with the data being behind a comment.
For the sake of the example, I'd like to find, say, Sandy Alcantara's home runs allowed.
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)
soup = bs(page.content, "html.parser")
for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    print(td)
This prints a lot of team data, but doesn't print the pitcher data.
How can I cleverly get it to print the pitcher data? Ideally, I'd have it in a list or something.
object[0]=Rk
object[1]=Name
object[4]=IP
object[13]=HR
The problem with extracting the table content is that the table itself is stored inside a comment string.
After you have fetched the web page and loaded it into BeautifulSoup, you can solve this scraping issue with the following steps:
gather the div tagged id='all_players_batting_pitching', which contains your table
extract the table from the comment using the decode_contents function, then reload that text into a new soup
extract each record of the table by looking for the tr tag, then each value by looking for the td tag, keeping only the values at the td indices you want (these are shifted down by one from your object[] mapping, because Rk is a th rather than a td)
load your values into a pandas DataFrame, ready to be used
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# fetching web page
url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)

# extracting table from html
soup = bs(page.content, "html.parser")
table = soup.find(id='all_players_batting_pitching')
tab_text = table.decode_contents().split('--')[1].strip()
tab_soup = bs(tab_text, "html.parser")

# extracting records from table
records = []
for i, row in enumerate(tab_soup.find_all('tr')):
    record = [ele.text.strip() for j, ele in enumerate(row.find_all('td')) if j in [0, 3, 12]]
    if record != []:
        records.append([row.a['href']] + [i] + record)

# loading records into a DataFrame (the last step above)
df = pd.DataFrame(records, columns=['href', 'Rk', 'Name', 'IP', 'HR'])
Output:
href Rk Name IP HR
0 /players/a/abbotco01.shtml 1 Cory Abbott 48.0 12
1 /players/a/abreual01.shtml 2 Albert Abreu 38.2 5
2 /players/a/abreual01.shtml 3 Albert Abreu 8.2 2
3 /players/a/abreual01.shtml 4 Albert Abreu 4.1 1
4 /players/a/abreual01.shtml 5 Albert Abreu 25.2 2
... ... ... ... ... ...
1063 /players/z/zastrro01.shtml 1106 Rob Zastryzny* 1.0 0
1064 /players/z/zastrro01.shtml 1107 Rob Zastryzny* 3.0 0
1065 /players/z/zerpaan01.shtml 1108 Angel Zerpa* 11.0 2
1066 /players/z/zeuchtj01.shtml 1109 T.J. Zeuch 10.2 5
1067 /players/z/zimmebr02.shtml 1110 Bruce Zimmermann* 73.2 21
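The split('--') trick works because the table is wrapped in an HTML comment (<!-- ... -->), but it will break if the markup ever contains another double dash. Here is a slightly sturdier sketch of the same extraction step, using bs4's Comment node type instead; this is my own variation, not part of the answer above:

import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# the wrapper div holds the commented-out table
wrapper = soup.find(id='all_players_batting_pitching')
# locate the comment node itself, then re-parse its text as html
comment = wrapper.find(string=lambda text: isinstance(text, Comment))
tab_soup = BeautifulSoup(comment, "html.parser")
# tab_soup can now be fed to the same tr/td loop as above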

How do I use Python and BeautifulSoup to scrape data from an html table?

If you look at this page, https://metals-api.com/currencies, there is an html table with 2 columns. I would like to extract all the rows from column 1 into a list/array. How do I go about this?
import requests
from bs4 import BeautifulSoup

URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

with open('outpu2t.txt', 'w', encoding='utf-8') as f:
    f.write(soup.text)
To clarify, I am not looking to run fetch-price commands against these tickers; I'm trying to compile a list of tickers so I can add them to a dropdown menu in my app.
If I understand the question, you can try the following example:
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for code in soup.select('.table tbody tr td:nth-child(1)'):
    data.append(code.text)

df = pd.DataFrame(data, columns=['code'])
#df.to_csv('code.csv', index=False)  # to store data
print(df)
Output:
code
0 XAU
1 XAG
2 XPT
3 XPD
4 XCU
.. ...
209 LINK
210 XLM
211 ADA
212 BCH
213 LTC
[214 rows x 1 columns]
I sit corrected. I initially just tried pd.read_html("https://metals-api.com/currencies"), which normally works, but apparently with a very slight workaround it can still work just fine.
import pandas as pd
import requests
URL = "https://metals-api.com/currencies"
page = requests.get(URL)
df = pd.read_html(page.content)[0]
print(df)
Output:
Code Name
0 XAU 1 Ounce of 24K Gold. Use Carat endpoint to dis...
1 XAG Silver
2 XPT Platinum
3 XPD Palladium
4 XCU Copper
.. ... ...
209 LINK Chainlink
210 XLM Stellar
211 ADA Cardano
212 BCH Bitcoin Cash
213 LTC Litecoin
[214 rows x 2 columns]
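Since the end goal is a dropdown list of tickers rather than a DataFrame, either answer above collapses to a plain Python list. A one-line follow-up, assuming the df from the read_html version (its column is named 'Code', as the output shows):

tickers = df['Code'].tolist()
print(tickers[:5])  # ['XAU', 'XAG', 'XPT', 'XPD', 'XCU']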

Python web scraping on tables with customizable dropdowns

I am a novice who is trying to put together my first webpage scraping function, and running into some HTTP problems that I don't fully understand. For reference, I am trying to scrape from FanGraphs ZIPS projection page.
Their page defaults to displaying a table with 30 players, but you can edit the webpage so that the table displays 10, 20, 30, 50, 100, 500 or 1000 results per page. Ideally I would select the option for 500 results per page, but when I change the dropdown menu, the URL remains unchanged, so I'm trying to write a script to get the HTML code and edit the attributes of the HTML page before requesting that variation of the webpage. (If you view the page source, it's on line 1247.)
Alternatively, you can cycle through additional pages with arrow button hyperlinks on the page, but they do not change the URL either.
I got around this problem for positional players since I can scrape each position separately and then concatenate the separate Pandas data frames, but I'm pretty oblivious about how HTTP requests work and was hoping for some help / guidance.
I tried something like:
import requests
from requests import Request, Session
from bs4 import BeautifulSoup

url = 'https://www.fangraphs.com/projections.aspx?pos=all&stats=pit&type=zips&team=0&lg=al&players=0'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
resultSet = soup.findAll(value=30)
for result in resultSet:
    result["value"] = "500"

s = Session()
req = requests.Request('POST', url, data={'ProjectionBoard1_dg1_ctl00_ctl03_ctl01_PageSizeComboBox_Input value': 500})
resp = s.send(prepped)  # NameError here: 'prepped' is never defined (req is never prepared)
print(resp.status_code)
My thought was to scrape the structure, edit the configuration to match my desired webpage, and then regurgitate the fixed structure to scrape the data. Then I realized that I had no idea what I was doing.
A few things:
Am I "not even wrong" here? Is this possible with requests or would I need something like Selenium? Am I improperly using POST?
Side note: If context helps, for positional players I did this (would be doing the same thing for pitchers):
base_url = 'https://www.fangraphs.com/projections.aspx?pos=&stats=bat&type=zips&team=0&lg=all&players=0'
positions = ['c', '1b', '2b', '3b', 'ss', 'lf', 'cf', 'rf']

def generate_positional_urls(urlRoot=base_url, pos=positions):
    index = base_url.index('pos=') + 4  # position right after 'pos='
    urlList = []
    numPositions = len(positions)
    for i in range(numPositions):
        position = pos[i]
        tempURL = urlRoot[:index] + position + urlRoot[index:]
        urlList.append(tempURL)
    return urlList
Which feeds into this:
def generate_df(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    masterTable = soup.findAll('table', {"class": "rgMasterTable"}, {"id": "ProjectionBoard1_dg1_ct100"})
    table_rows = masterTable[0].find_all('tr')
    data = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text for tr in td]
        data.append(row)
    headers = masterTable[0].find_all("thead")
    colElements = headers[0].find_all("tr")
    soupColResults = colElements[1].findChildren("a")
    colStrings = [element.text for element in soupColResults]
    df = pd.DataFrame(data)
    df.columns = colStrings
    df.drop(columns=[''], inplace=True)
    df = df.iloc[3:, :]
    return df
Then it's easy enough to loop through and concatenate data frames, but for this HTTP problem, I really don't know what I am doing.
Note: I am doing this to try to learn and practice Python. I can easily avoid this by using my web browser to click on the hyperlink to export the data to a csv, so don't exert too much effort figuring out unnecessary tedium.
First of all, you are trying to scrape an html table into a pandas dataframe. This can be done without BeautifulSoup using pd.read_html.
Second, always keep track of what is happening in the network tab of Chrome or Firefox. You can see what gets loaded when you click to change the number of results. You can copy the request as curl and then convert this to a python request, for example using this tool. You can now load the data and pass them to pandas directly. Note that I had to remove __VIEWSTATE and __EVENTVALIDATION for readability, but you can get them from the page source code or from the curl command as indicated above.
import requests
import pandas as pd

params = (
    ('pos', 'all'),
    ('stats', 'pit'),
    ('type', 'zips'),
    ('team', '0'),
    ('lg', 'al'),
    ('players', '0'),
)

data = {
    'RadScriptManager1_TSM': '',
    '__EVENTTARGET': 'ProjectionBoard1$dg1',
    '__EVENTARGUMENT': 'FireCommand:ProjectionBoard1$dg1$ctl00;PageSize;1000',
    '__VIEWSTATE': '',  # text removed for readability
    '__VIEWSTATEGENERATOR': 'C239D6F0',
    '__SCROLLPOSITIONX': '0',
    '__SCROLLPOSITIONY': '0',
    '__EVENTVALIDATION': '',  # text removed for readability
    'ProjectionBoard1_tsStats_ClientState': '{"selectedIndexes":["1"],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1_tsPosition_ClientState': '{"selectedIndexes":["0"],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1$rcbTeam': 'All Teams',
    'ProjectionBoard1_rcbTeam_ClientState': '',
    'ProjectionBoard1$rcbLeague': 'AL',
    'ProjectionBoard1_rcbLeague_ClientState': '',
    'ProjectionBoard1_tsProj_ClientState': '{"selectedIndexes":["0"],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1_tsUpdate_ClientState': '{"selectedIndexes":[],"logEntries":[],"scrollState":{}}',
    'ProjectionBoard1$dg1$ctl00$ctl02$ctl00$PageSizeComboBox': '1000',
    'ProjectionBoard1_dg1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState': '{"logEntries":[],"value":"1000","text":"1000","enabled":true,"checkedIndices":[],"checkedItemsTextOverflows":false}',
    'ProjectionBoard1$dg1$ctl00$ctl03$ctl01$PageSizeComboBox': '30',
    'ProjectionBoard1_dg1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState': '',
    'ProjectionBoard1_dg1_ClientState': ''
}

response = requests.post('https://www.fangraphs.com/projections.aspx', params=params, data=data)
df = pd.read_html(response.text)
df = pd.read_html(response.text)
read_html returns a list of all tables as dataframes; the table you want can be accessed as df[8]. Output of df[8].head():
  Name           Unnamed: 1  Team  W   L   ERA   GS  G   IP     H    ER  HR  SO   BB  WHIP  K/9    BB/9  FIP   WAR  ADP
0 Lucas Giolito  nan         CHW   16  7   3     30  30  180    129  60  21  248  56  1.03  12.4   2.8   2.98  5.8  18.3
1 Gerrit Cole    nan         NYY   16  8   3.21  32  32  193.3  155  69  27  260  49  1.06  12.1   2.28  3.18  5.7  7.4
2 Shane Bieber   nan         CLE   14  9   3.51  32  32  197.7  177  77  26  230  36  1.08  10.47  1.64  3.23  5.4  10
3 Jose Berrios   nan         MIN   13  10  3.89  31  31  180.3  167  78  24  187  56  1.24  9.33   2.79  3.98  3.5  82.9
4 Chris Sale     nan         BOS   9   5   3.35  20  20  121    103  45  14  147  28  1.08  10.93  2.08  3.23  3.4  297
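One caveat on df[8]: hard-coding the table index breaks silently if FanGraphs changes the page layout. A small defensive variation (my own suggestion, not part of the answer above) that picks the projection table by columns it must contain:

# continuing from the response above: pick the table by its columns
# rather than trusting index 8
tables = pd.read_html(response.text)
df = next(t for t in tables if 'ERA' in t.columns and 'WHIP' in t.columns)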

Web scraping issue: trying to get info into a CSV and then charts

Here's where I'm at with my code; it gives me fairly complete information. I am scraping the stock prices of my top 10 favorite space tech companies. I want to get the stock prices over the course of 10 hours, or I might just run the code ten different times. I can not use APIs; this is for a school project. I then want to combine all the data into one big chart using matplotlib that would show these stock prices, or into ten charts, one for each stock. I want to use this type of chart.
Any advice would be awesome. Here is my current code:
#import libraries
import pandas as pd
#scraping my top ten favorite space companies, attempted to pick companies with pure play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]
def parsePrice(r):
    df = pd.read_html(r)[0].T
    cols = list(df.iloc[0, :])
    temp_df = pd.DataFrame([list(df.iloc[1, :])], columns=cols)
    temp_df['url'] = r
    return temp_df

df = pd.DataFrame()
for r in urls:
    df = df.append(parsePrice(r), sort=True).reset_index(drop=True)
df.to_csv('C:/Users/n_gor/Desktop/webscape/Nicholas Final Projects/spacestocklisting.csv', index=False)
print(df.to_string())
CSV File output:
52 Week Range Ask Avg. Volume Bid Day's Range Open Previous Close Volume url
0 7.32 - 9.87 8.09 x 800 23415 8.06 x 800 8.01 - 8.11 8.10 8.01 6337 https://finance.yahoo.com/quote/GILT/
1 32.14 - 42.77 32.74 x 1100 41759 32.59 x 1000 32.28 - 32.75 32.32 32.28 14685 https://finance.yahoo.com/quote/LORL?p=LORL&.t...
2 5.55 - 27.29 6.64 x 800 5746553 6.63 x 2900 6.51 - 6.68 6.64 6.65 995245 https://finance.yahoo.com/quote/I?p=I&.tsrc=fi...
3 55.93 - 97.31 72.21 x 800 281600 72.16 x 1000 71.51 - 72.80 72.26 72.32 74758 https://finance.yahoo.com/quote/VSAT?p=VSAT&.t...
4 144.27 - 220.03 215.54 x 1000 1560562 215.37 x 800 214.87 - 217.45 215.85 214.86 203957 https://finance.yahoo.com/quote/RTN?p=RTN&.tsr...
5 100.48 - 149.81 145.03 x 800 2749725 144.96 x 800 144.41 - 145.56 145.49 144.52 489169 https://finance.yahoo.com/quote/UTX?ltr=1
6 189.35 - 351.53 343.34 x 800 280325 342.80 x 800 342.84 - 346.29 344.16 343.58 42326 https://finance.yahoo.com/quote/TDY?ltr=1
7 3.5800 - 9.7900 4.1400 x 1300 778343 4.1300 x 800 4.1200 - 4.2000 4.1700 4.1500 62335 https://finance.yahoo.com/quote/ORBC?ltr=1
8 6.90 - 12.09 7.37 x 900 2280333 7.38 x 800 7.24 - 7.48 7.30 7.22 539082 https://finance.yahoo.com/quote/SPCE?p=SPCE&.t...
9 292.47 - 446.01 348.73 x 800 4420225 348.79 x 800 345.70 - 350.42 350.22 348.84 1258813 https://finance.yahoo.com/quote/BA?p=BA&.tsrc=...
Can I add the stock names to this? Any advice on how to complete this project? I'm a bit lost.
Just need to parse the title header:
#import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

#scraping my top ten favorite space companies, attempted to pick companies with pure play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]

def parsePrice(r):
    response = requests.get(r)
    soup = BeautifulSoup(response.text, 'html.parser')
    titleHeader = soup.find('div', {'id': 'quote-header-info'})
    title = titleHeader.find('h1').text
    comp = title.split('-')[-1].strip()
    abr = title.split('-')[0].strip()
    print(title)
    df = pd.read_html(response.text)[0].T
    cols = list(df.iloc[0, :])
    temp_df = pd.DataFrame([list(df.iloc[1, :])], columns=cols)
    temp_df['url'] = r
    temp_df['company name'] = comp
    temp_df['stock name'] = abr
    return temp_df

df = pd.DataFrame()
for r in urls:
    df = df.append(parsePrice(r), sort=True).reset_index(drop=True)
df.to_csv('C:/Users/n_gor/Desktop/webscape/Nicholas Final Projects/spacestocklisting.csv', index=False)
print(df.to_string())
You can use pandas.DataFrame.insert if you have all the stock names in a list:

stock_names = ['GILT', 'LORL', 'I', 'VSAT', 'RTN', 'UTX', 'TDY', 'ORBC', 'SPCE', 'BA']
# insert at the beginning (column at index 0) of the dataFrame
df.insert(0, "column_heading", stock_names)

Or you can get all the stock names from the urls using regular expressions and add them to your df:

import re

stock_names = [re.findall('[A-Z]+', x)[0] for x in urls]
# insert at the beginning (column at index 0) of the dataFrame
df.insert(0, "column_heading", stock_names)
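For the charting half of the question, here is a minimal matplotlib bar-graph sketch, assuming the CSV written above; the 'stock name' column comes from the first answer, 'Previous Close' matches the scraped headers, and the file path is a placeholder:

import pandas as pd
import matplotlib.pyplot as plt

# load the scraped data back in (placeholder path)
df = pd.read_csv('spacestocklisting.csv')

# one bar per stock, using the previous close as the height
df['Previous Close'] = pd.to_numeric(df['Previous Close'], errors='coerce')
plt.bar(df['stock name'], df['Previous Close'])
plt.ylabel('Previous Close ($)')
plt.title('Space stock prices')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()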

How to extract a table from a website when the numbers do not show in the html?

I am trying to extract the tables on the website below for all industries and periods. When I download the html, however, the numbers displayed on the website are not found anywhere. How can I retrieve the entries of the table?
https://csimarket.com/Industry/industry_Efficiency.php?ind=102
My code below extracts the html. Upon inspection, the numbers in the table do not appear anywhere, so I can't extract them. The question, then, is: where are they, and how can they be accessed and extracted?
Please, note that I am new to requests and beautiful soup!
Many thanks!
import requests
my_target='https://csimarket.com/Industry/industry_Efficiency.php?ind=102'
r = requests.get(my_target)
data = r.text
You can use requests, but you need to use r.content instead of r.text:
import requests
my_target='https://csimarket.com/Industry/industry_Efficiency.php?ind=102'
r = requests.get(my_target)
data = r.content
You may also use the BeautifulSoup module to parse the html, like this:
import requests
#load BeautifulSoup
from bs4 import BeautifulSoup

my_target = 'https://csimarket.com/Industry/industry_Efficiency.php?ind=102'
r = requests.get(my_target)
#get raw html
data = r.content
#soup the content
soup = BeautifulSoup(data, 'html.parser')
#find table element
table_element = soup.find('table', {"class": "osnovna_tablica_bez_gifa"})
#print text version of table element
print(table_element.text)
Which gives you:
Chemicals - Plastics & Rubber Industry
3 Q
2 Q
1 Q
4 Q
3 Q
 
2018
2018
2018
2017
2017
Revenue/Employee (TTM) $
428,075
327,852
323,322
338,175
325,069
Sales/Employee Ranking
# 22
# 78
# 79
# 68
# 74
Net Income/Employee (TTM) $
37,510
18,571
20,953
27,151
18,810
Net Income/Employee
Ranking
# 16
# 72
# 69
# 58
# 64
Receivable Turnover Ratio (TTM)
7.53
5.17
5.07
5.17
5.11
Receivable Turnover Ranking
# 31
# 88
# 90
# 87
# 89
Inventory Turnover Ratio (TTM) Sales
8.1
5.56
5.65
6.13
6.45
Inventory Turnover (Sales)
Ranking
# 31
# 90
# 90
# 86
# 85
Inventory Turnover Ratio (TTM) COS
5.77
3.83
3.81
4.16
4.37
Inventory Turnover (COS)
Ranking
# 24
# 79
# 81
# 75
# 77
Asset Turnover Ratio (TTM)
0.92
0.47
0.52
0.6
0.69
Asset Turnover Ranking
# 31
# 72
# 68
# 63
# 49
You can also use the urllib3 package to fetch the page and then pull the numbers out of the span elements. Do:
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
url = "https://csimarket.com/Industry/industry_Efficiency.php?ind=102"
response = http.request('GET', url)
soup = BeautifulSoup(response.data, 'html.parser')
spans = soup.find_all("span")
for span in spans:
    print(span)
This gives:
<span class="">428,075</span>
<span class="">327,852</span>
<span class="">323,322</span>
...
<span class="siva3">31</span>
<span class="siva3"># 5</span>
<span class="siva3"># 31</span>
Much tidier would be to use pandas' read_html:
import pandas as pd
tables = pd.read_html('https://csimarket.com/Industry/industry_Efficiency.php?ind=102')
print(tables[6].fillna(''))
