Python - List out of range error in web scraping

I have been running this Python code and it gives me the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-9-6ff1d459c8bd> in <module>
6 soup = BeautifulSoup(data, 'html5lib')
7 df = pd.DataFrame(columns=["Name", "Sector", "Price", "Price/Earnings", "Dividend_Yield", "Earnings/Share", "52_Week_Low", "52_Week_High", "Market_Cap", "EBITDA"])
----> 8 for row in soup.find_all('tbody')[1].find_all('tr'):
9 col = row.find_all("td")
10 Name = col[0].text
IndexError: list index out of range
The code I have used to do the web scraping is:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.kaggle.com/priteshraj10/sp-500-companies"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
df = pd.DataFrame(columns=["Name", "Sector", "Price", "Price/Earnings", "Dividend_Yield", "Earnings/Share", "52_Week_Low", "52_Week_High", "Market_Cap", "EBITDA"])
for row in soup.find_all('tbody')[1].find_all('tr'):
    col = row.find_all("td")
    Name = col[0].text
    Sector = col[1].text
    Price = col[2].text
    Price_Earnings = col[3].text
    Dividend_Yield = col[4].text
    Earnings_Share = col[5].text
    Week_Low = col[6].text
    Week_High = col[7].text
    Market_Cap = col[8].text
    EBITDA = col[9].text
    df = df.append({"Name":Name,"Sector":Sector,"Price":Price,"Price_Earnings":Price_Earnings,"Dividend_Yield":Dividend_Yield,"Earnings_Share":Earnings_Share,"Week_Low":Week_Low,"Week_High":Week_High,"Market_Cap":Market_Cap,"EBITDA":EBITDA}, ignore_index=True)
Can you help me with this?

If you print the variable soup, you will see that the HTML returned does not contain the information you want, probably because the site blocks web scraping.

This line of code indexes into a list:
for row in soup.find_all('tbody')[1]
It expects the list to have at least two items (Python list indexing starts from 0), and it doesn't.
What you could do is print the list:
print(soup.find_all('tbody'))
That shows what you are trying to access at index 1, and why it is not there.
Additionally, if you want to check the length:
print(len(soup.find_all('tbody')))
It should be smaller than 2, hence the error.
Instead of printing, I'd recommend using a debugger to find out what's wrong with your code.
The issue seems to be that the website you are trying to scrape probably changed its HTML somehow.
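The length check can also be folded into the scraping code itself. This is a minimal sketch, assuming a hypothetical helper named rows_of_tbody and a small inline HTML string standing in for the real page (neither comes from the original post):

```python
from bs4 import BeautifulSoup


def rows_of_tbody(html, index):
    """Return the <tr> rows of the tbody at `index`, or [] if it is missing.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    bodies = soup.find_all("tbody")
    if index >= len(bodies):
        # Fewer <tbody> elements than expected: report it instead of crashing.
        print(f"expected at least {index + 1} tbody elements, found {len(bodies)}")
        return []
    return bodies[index].find_all("tr")


# Stand-in page with a single <tbody>, so index 1 does not exist.
sample_html = "<table><tbody><tr><td>AAPL</td></tr></tbody></table>"

print(len(rows_of_tbody(sample_html, 0)))
print(len(rows_of_tbody(sample_html, 1)))
```

With a guard like this, a page that comes back without the expected table produces a message rather than an IndexError.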

Related

Scraping yahoo finance in a for loop with beautifulsoup

I am learning to scrape websites with BeautifulSoup, and was trying to fetch data from Yahoo Finance. I am stuck wondering why it successfully fetches what I want when I am not in a for loop (searching for a specific ticker), but as soon as I try to use a csv file to search for more than one ticker, the .find() method returns an error instead of the tag I am looking for.
Here is the code when it runs well,
```
import requests
import csv
from bs4 import BeautifulSoup
# ------ FOR LOOP THAT MESSES THINGS UP ------
# with open('s&p500_tickers.csv', 'r') as tickers:
#     for ticker in tickers:
ticker = 'AAPL'  # ------ TEMPORARY TICKER TO TEST CODE ------
web = requests.get(f'https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}').text
soup = BeautifulSoup(web, 'lxml')
section = soup.find('section', class_='smartphone_Px(20px) Mb(30px)')
tbl = section.find('div', class_='M(0) Whs(n) BdEnd Bdc($seperatorColor) D(itb)')
headerRow = tbl.find("div", class_="D(tbr) C($primaryColor)")
# ------ CODE I USED TO VISUALIZE THE RESULT ------
breakdownHead = headerRow.text[0:9]
ttmHead = headerRow.text[9:12]
lastYear = headerRow.text[12:22]
twoYears = headerRow.text[22:32]
threeYears = headerRow.text[32:42]
fourYears = headerRow.text[42:52]
print(breakdownHead, ttmHead, lastYear, twoYears, threeYears, fourYears)
```
It returns this:
```
Breakdown ttm 2019-09-30 2018-09-30 2017-09-30 2016-09-30
Process finished with exit code 0
```
Here is the code that does not work
```
import requests
import csv
from bs4 import BeautifulSoup
with open('s&p500_tickers.csv', 'r') as tickers:
    for ticker in tickers:
        web = requests.get(f'https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}').text
        soup = BeautifulSoup(web, 'lxml')
        section = soup.find('section', class_='smartphone_Px(20px) Mb(30px)')
        tbl = section.find('div', class_='M(0) Whs(n) BdEnd Bdc($seperatorColor) D(itb)')
        headerRow = tbl.find("div", class_="D(tbr) C($primaryColor)")
        breakdownHead = headerRow.text[0:9]
        ttmHead = headerRow.text[9:12]
        lastYear = headerRow.text[12:22]
        twoYears = headerRow.text[22:32]
        threeYears = headerRow.text[32:42]
        fourYears = headerRow.text[42:52]
        print(breakdownHead, ttmHead, lastYear, twoYears, threeYears, fourYears)
```
I welcome any feedback on my code as I am always trying to get better.
Thank you very much
So I have resolved the problem.
I realized that the .writerow() method of the csv module adds a line terminator to the end of each row, so every ticker read back from the file ends with '\n' (e.g. 'MMM\n').
Somehow, the newline was keeping the .find() method from working in the for loop. (I still don't know why.)
Afterward, it worked for the first line, but since there were empty lines I had to get Python to skip them with an if statement.
I replaced the '\n' with a '' and it worked.
Here's what it looks like:
```
for ticker in tickers.readlines():
    ticker = ticker.replace('\n', '')
    if ticker == '':
        pass
    else:
        web = requests.get(f'https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}').text
        soup = BeautifulSoup(web, 'lxml')
        headerRow = soup.find("div", class_="D(tbr) C($primaryColor)")
```
If any of you see a better way to do it, I would be pleased to have some of your feedback.
I am new to programming and would really like to know what I am doing wrong!
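A likely explanation, offered as an assumption rather than something confirmed in the post: the trailing newline ends up inside the request URL, so the page that comes back is not the expected one and .find() returns None. Below is a minimal sketch of cleaning the tickers before building the URL. It uses io.StringIO as a stand-in for the CSV file and a hypothetical helper named clean_tickers, and it makes no network request.

```python
import io


def clean_tickers(lines):
    """Yield tickers with surrounding whitespace removed, skipping blank lines."""
    for line in lines:
        ticker = line.strip()  # removes '\n', '\r' and stray spaces
        if ticker:
            yield ticker


# Stand-in for open('s&p500_tickers.csv'); includes a blank line.
fake_file = io.StringIO("MMM\n\nAAPL\n")

for ticker in clean_tickers(fake_file):
    url = f"https://ca.finance.yahoo.com/quote/{ticker}/financials?p={ticker}"
    print(repr(ticker), url)
```

Using str.strip() instead of replace('\n', '') also covers '\r\n' line endings, and the generator keeps the skip logic out of the scraping loop.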

How to only scrape the first item in a row using Beautiful Soup

I am currently running the following python script:
import requests
from bs4 import BeautifulSoup
origin= ["USD","GBP","EUR"]
i=0
while i < len(origin):
    page = requests.get("https://www.x-rates.com/table/?from="+origin[i]+"&amount=1")
    soup = BeautifulSoup(page.content, "html.parser")
    tables = soup.findChildren('table')
    my_table = tables[0]
    rows = my_table.findChildren(['td'])
    i = i +1
    for rows in rows:
        cells = rows.findChildren('a')
        for cell in cells:
            value = cell.string
            print(value)
To scrape data from this HTML:
https://i.stack.imgur.com/DkX83.png
The problem I have is that I'm struggling to scrape only the first column without scraping the second one as well, because they are both under <a> tags and in the same table row as each other. The href is the only thing which differentiates between the two tags. I have tried filtering on it, but that doesn't seem to work and returns a blank value. Also, when I try to sort the data manually, the output is arranged vertically and not horizontally. I am new to coding, so any help would be appreciated :)
There is another way you might wanna try as well to achieve the same:
import requests
from bs4 import BeautifulSoup
keywords = ["USD","GBP","EUR"]
for keyword in keywords:
    page = requests.get("https://www.x-rates.com/table/?from={}&amount=1".format(keyword))
    soup = BeautifulSoup(page.content, "html.parser")
    for items in soup.select_one(".ratesTable tbody").find_all("tr"):
        data = [item.text for item in items.find_all("td")[1:2]]
        print(data)
It is easier to follow what happens when you print every item you get, starting from the top (in this case, from the table). The idea is to go one step at a time so you can follow along.
import requests
from bs4 import BeautifulSoup
origin= ["USD","GBP","EUR"]
i=0
while i < len(origin):
    page = requests.get("https://www.x-rates.com/table/?from="+origin[i]+"&amount=1")
    soup = BeautifulSoup(page.content, "html.parser")
    tables = soup.findChildren('table')
    my_table = tables[0]
    i = i +1
    rows = my_table.findChildren('tr')
    for row in rows:
        cells = row.findAll('td',class_='rtRates')
        if len(cells) > 0:
            first_item = cells[0].find('a')
            value = first_item.string
            print(value)
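Both answers come down to picking one cell per row. The sketch below shows the same idea on made-up HTML that only imitates the structure described in the question. The sample markup and the function name first_rates are assumptions, and the real page may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure described in the question:
# each row has a name cell followed by two rate cells with links.
sample_html = """
<table class="ratesTable">
  <tbody>
    <tr><td>Euro</td>
        <td class="rtRates"><a href="/graph/?from=USD&amp;to=EUR">0.91</a></td>
        <td class="rtRates"><a href="/graph/?from=EUR&amp;to=USD">1.10</a></td></tr>
    <tr><td>British Pound</td>
        <td class="rtRates"><a href="/graph/?from=USD&amp;to=GBP">0.79</a></td>
        <td class="rtRates"><a href="/graph/?from=GBP&amp;to=USD">1.27</a></td></tr>
  </tbody>
</table>
"""


def first_rates(html):
    """Return the text of the first rtRates cell in every table row."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for row in soup.select(".ratesTable tbody tr"):
        cell = row.find("td", class_="rtRates")  # find() stops at the first match
        if cell is not None:
            values.append(cell.get_text(strip=True))
    return values


print(first_rates(sample_html))
```

Because find() returns only the first match inside each row, the second rate column is never visited.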

Python 3.6 - BeautifulSoup4, parse table AttributeError: ResultSet object has no attribute 'findAll'

I am trying to parse a table containing all the cities in California using bs4, but I am getting the below error
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I've tried using find_all and findAll (as suggested in other posts on this forum), but it throws the same error. From what I understand, I can't do that as my program returns a ResultSet object, which is not iterable.
Also, I am not able to access the table class="wikitable plainrowheaders sortable jquery-tablesorter", as it contains multiple class names, and when I run the code without the line table_wiki_rows = table_wiki.findAll('tr'), it returns only the first line of the table 400+ times.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re
## Python 3.6
##BeautifulSoup4
def get_all_cities(html_soup):
    """Scrape Wikipedia page for cities in California (https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California)
    and returns a structured list of city names, county, population, area, and incorporation date"""
    cities_in_california = []
    table_wiki = html_soup.findAll('table', attrs={"class":"wikitable"})
    table_wiki_rows = table_wiki.findAll('tr')
    for rows in table_wiki:
        table_rows_header = html_soup.findAll('th')
        table_rows = html_soup.findAll('td')
        city_entry = {
            'City_name' : table_rows_header[0].text,
            'City_type' : table_rows[1].text,
            'County' : table_rows[2].text,
            'Population' : table_rows[3].text,
            'Area_sqr_miles' : table_rows[4].text,
            'Area_sqr_km' : table_rows[5].text,
            'Incorporation_Date' : table_rows[6].text
        }
        cities_in_california.append(city_entry)
    return cities_in_california
html = urlopen('https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California')
html_soup = BeautifulSoup(html, 'html.parser')
city_list = get_all_cities(html_soup)
df = pd.DataFrame(city_list )
df.head(7)
EDIT - Working Code for reference
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re
## Python 3.6
##BeautifulSoup4
def get_all_cities(html_soup):
    """Scrape Wikipedia page for cities in California (https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California)
    and returns a structured list of city names, county, population, area, and incorporation date"""
    cities_in_california = []
    table_wiki = html_soup.find('table', {'class': 'wikitable plainrowheaders sortable'})
    table_wiki_rows = table_wiki.findAll('tr')
    for rows in table_wiki_rows:
        table_rows_header = rows.findAll('th')
        table_rows = rows.findAll('td')
        if len(table_rows_header) > 0 and len(table_rows) > 0:
            city_entry = {
                'City_name' : table_rows_header[0].text,
                'City_type' : table_rows[0].text,
                'County' : table_rows[1].text,
                'Population' : table_rows[2].text,
                'Area_sqr_miles' : table_rows[3].text,
                'Area_sqr_km' : table_rows[4].text,
                'Incorporation_Date' : table_rows[5].text
            }
            cities_in_california.append(city_entry)
    return cities_in_california
html = urlopen('https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California')
html_soup = BeautifulSoup(html, 'html.parser')
city_list = get_all_cities(html_soup)
df = pd.DataFrame(city_list)
df.head(7)
html_soup.findAll('table', attrs={"class":"wikitable"}) will return a list of all tables with a class value of wikitable. You can't call .findAll on such a list.
The data resides in a table that has multiple class attributes:
class="wikitable plainrowheaders sortable"
So, use this instead of just "wikitable":
html_soup.find('table', {'class': 'wikitable plainrowheaders sortable'})
Using html_soup.find('table', attrs={"class":"wikitable"}) won't work, either, because there's another table before the main data table. Since that's the first one, using find will retrieve that and not the main table.
Edit:
The problem is with these two lines:
table_rows_header = html_soup.findAll('th')
table_rows = html_soup.findAll('td')
For each iteration of the loop, you're assigning all the th and td elements found inside the entire html_soup object. You should be using this instead:
table_rows_header = rows.findAll('th')
table_rows = rows.findAll('td')
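The two points above (select the right table, then search inside each row) can be tried without hitting Wikipedia. This is a minimal sketch on made-up HTML. The sample markup and the function name parse_cities are assumptions, and a CSS selector is swapped in for the exact class-string match so that the order of the class names does not matter.

```python
from bs4 import BeautifulSoup

# Made-up HTML: a small legend table first, then the data table.
sample_html = """
<table class="wikitable"><tr><td>legend</td></tr></table>
<table class="wikitable plainrowheaders sortable">
  <tr><th>Name</th><th>Type</th><th>County</th></tr>
  <tr><th>Adelanto</th><td>City</td><td>San Bernardino</td></tr>
  <tr><th>Agoura Hills</th><td>City</td><td>Los Angeles</td></tr>
</table>
"""


def parse_cities(html):
    soup = BeautifulSoup(html, "html.parser")
    # The CSS selector requires all three classes, in any order.
    table = soup.select_one("table.wikitable.plainrowheaders.sortable")
    cities = []
    for row in table.find_all("tr"):
        headers = row.find_all("th")  # searched within this row only
        cells = row.find_all("td")
        if headers and cells:         # skips the header row, which has no <td>
            cities.append({
                "City_name": headers[0].get_text(strip=True),
                "City_type": cells[0].get_text(strip=True),
                "County": cells[1].get_text(strip=True),
            })
    return cities


print(parse_cities(sample_html))
```

Searching from row rather than from the whole soup is what stops every entry from repeating the first line of the table.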

Python Embedded For range Loops

I would like the following commands to grab the date from each address in this range, but I can't seem to get it to run more than once. I am using Python 3. As you can see below, the URL for the site is appended with i so that it reads http://zinc.docking.org/substance/10 ; http://zinc.docking.org/substance/11 ... and so on. Here is the code:
import bs4 as bs
import urllib.request
site = "http://zinc.docking.org/substance/"
for i in range(10, 16):
site1 = str("%s%i" % (site, i))
sauce = urllib.request.urlopen(site1).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
table1 = soup.find("table", attrs={"class": "substance-properties"})
for row in table1.findAll('tr'):
row1 = row.findAll('td')
ate = row1[0].getText()
print(ate)
This is my output:
$python3 Date.py
November 11th, 2005
The script should, however, give me 3 dates. This code works, so I know that row[0] does in fact contain a value. I feel like there is some sort of simple formatting error, but I am not sure where to begin troubleshooting. When I format it "correctly", this is the code:
import bs4 as bs
import urllib.request
import pandas as pd
import csv
site = "http://zinc.docking.org/substance/"
for i in range(10, 16):
site1 = str("%s%i" % (site, i))
sauce = urllib.request.urlopen(site1).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
table1 = soup.find("table", attrs={"class": "substance-properties"})
table2 = soup.find("table", attrs={"class": "protomers"})
for row in table1.findAll('tr'):
row1 = row.findAll('td')
ate = row1[0].getText()
print(ate)
The error I get is as follows:
Traceback (most recent call last):
File "Stack.py", line 11, in <module>
ate = row1[1].getText()
IndexError: list index out of range
The first code works so I know that row[0] does in fact contain a value. Any ideas?
You might want to fix your indentation:
import bs4 as bs
import urllib.request
site = "http://zinc.docking.org/substance/"
for i in range(10, 16):
site1 = str("%s%i" % (site, i))
sauce = urllib.request.urlopen(site1).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
table1 = soup.find("table", attrs={"class": "substance-properties"})
for row in table1.findAll('tr'):
row1 = row.findAll('td')
Date = row1[0].getText()
print(Date)
Edit: Consider renaming your Date variable. It is not a reserved name in Python, but by convention variable names are lower case (for example, date).
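One possible cause of the IndexError, assuming some rows of the table hold only th header cells (the post does not show the page's HTML): row.findAll('td') returns an empty list for such a row, and indexing it fails. Here is a minimal sketch with made-up HTML that skips those rows. The function name first_cells is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the real page structure is an assumption.
sample_html = """
<table class="substance-properties">
  <tr><th>Added</th><th>Availability</th></tr>
  <tr><td>November 11th, 2005</td><td>For sale</td></tr>
</table>
"""


def first_cells(html):
    """Return the text of the first <td> of each row that has one."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "substance-properties"})
    if table is None:
        return []
    values = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            continue  # header-only row: nothing to index
        values.append(cells[0].get_text(strip=True))
    return values


print(first_cells(sample_html))
```

Inside the range loop, the same function could be called once per fetched page so that every page contributes its own values.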

IndexError: list index out of range in a function using BeautifulSoup

I am making a GUI-based project in Python with the tkinter module. It fetches basic data from online judges, e.g. SPOJ, using Beautiful Soup. I am a newbie in Python, so most of what I have written comes from basic tutorials on the internet. However, for one particular piece of code, I am totally stuck.
import sys
import urllib.request
from bs4 import BeautifulSoup
import re
userName = 'xilinx'
spojUrl = 'http://www.spoj.com/users/'+userName
with urllib.request.urlopen(spojUrl) as x:
    html = x.read()
    soup = BeautifulSoup(html, 'html.parser')
    # li - list of successful submissions
    li = soup.find_all('table', class_='table table-condensed')[0].find_all('td')
    listOfSolvedProblemCodes = []
    for submission in li:
        problemCode = submission.get_text()
        if problemCode:
            listOfSolvedProblemCodes.append(problemCode)
    print (userName+ ' has solved',len(listOfSolvedProblemCodes),'problems on Spoj.')
This part of code works fine when I run it with python submissions.py
After testing this part, I try to incorporate it into the larger code, where the problem arises. I am including here the relevant section of the code:
In frame.py:
def compStats ():
    if ch == "SPOJ":
        stats.show(ch, userName)
B2 = tkinter.Button(root, text="My Statistics", command=compStats)
B2.place(anchor = W, x = 30, y = 220, width=200)
In stats.py:
def show(ch, userName):
    if ch == 'SPOJ':
        spojUrl = 'http://www.spoj.com/users/'+userName
        with urllib.request.urlopen(spojUrl) as x:
            html = x.read()
            soup = BeautifulSoup(html, 'html.parser')
            li = soup.find_all('table', class_='table table-condensed')[0].find_all('td')
            listOfSolvedProblemCodes = []
            for submission in li:
                problemCode = submission.get_text()
                if problemCode:
                    listOfSolvedProblemCodes.append(problemCode)
        # then collect more information from soup and print it through labels in another window
        det = tkinter.Tk()
        det.title("Statistics")
        det.geometry("800x600")
But the problem of IndexError occurs in stats.py in the line:
li = soup.find_all('table', class_='table table-condensed')[0].find_all('td')
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Users\Aa\AppData\Local\Programs\Python\Python35-32\lib\tkinter\__init__.py", line 1550, in __call__
    return self.func(*args)
  File "frame.py", line 34, in compStats
    stats.show(ch, userName)
  File "C:\Users\Aa\AppData\Local\Programs\Python\Python35-32\stats.py", line 17, in show
    li = soup.find_all('table', class_='table table-condensed')[0].find_all('td')
IndexError: list index out of range
I am unable to understand why the code is unable to work here. Please help!
The first step in debugging this is to take the complex line that is throwing an error and make it simpler. You can then inspect the intermediate values to see if the assumptions you're making about the code are true. In this case, your assumption is that soup.find_all('table', ...) is actually finding something.
For example, change this:
li = soup.find_all('table', class_='table table-condensed')[0].find_all('td')
to this:
tables = soup.find_all('table', class_='table table-condensed')
li = tables[0].find_all('td')
Next, add a print statement to examine tables:
print("tables is", tables)
You will find that tables is likely empty, so when you try to do tables[0] you get an error since index 0 is out of range.
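The same check can stay in the code so that a missing table produces a readable message instead of an IndexError. This is a minimal sketch, assuming a hypothetical function named solved_problem_codes and inline HTML in place of the SPOJ response. Printing the page title is only one way to see what was actually fetched.

```python
from bs4 import BeautifulSoup


def solved_problem_codes(html):
    """Return the non-empty cell texts of the first matching table, or []."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table", class_="table table-condensed")
    if not tables:
        # Show what was actually fetched; the page may differ from the expected one.
        title = soup.title.string if soup.title else "(no title)"
        print("no matching table; page title:", title)
        return []
    codes = []
    for cell in tables[0].find_all("td"):
        text = cell.get_text(strip=True)
        if text:
            codes.append(text)
    return codes


good = '<table class="table table-condensed"><tr><td>TEST</td><td></td><td>PRIME1</td></tr></table>'
bad = "<html><head><title>Sign in</title></head><body></body></html>"

print(solved_problem_codes(good))
print(solved_problem_codes(bad))
```

If the function returns an empty list inside the GUI but not in the standalone script, comparing the userName value passed in each case would be a reasonable next step.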
