Python Embedded For range Loops

I would like the following code to grab the date from each address in this range, but I can't seem to get it to run more than once. I am using Python 3. As you can see below, the site URL is appended with i, so the requests read http://zinc.docking.org/substance/10, http://zinc.docking.org/substance/11, and so on. Here is the code:
import bs4 as bs
import urllib.request

site = "http://zinc.docking.org/substance/"
for i in range(10, 16):
    site1 = str("%s%i" % (site, i))
    sauce = urllib.request.urlopen(site1).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table1 = soup.find("table", attrs={"class": "substance-properties"})
for row in table1.findAll('tr'):
    row1 = row.findAll('td')
    ate = row1[0].getText()
print(ate)
This is my output:
$python3 Date.py
November 11th, 2005
The script should, however, give me a date for each page in the range. This code works, so I know that row1[0] does in fact contain a value. I feel like there is some sort of simple formatting error, but I am not sure where to begin troubleshooting. When I format it "correctly", this is the code:
import bs4 as bs
import urllib.request
import pandas as pd
import csv

site = "http://zinc.docking.org/substance/"
for i in range(10, 16):
    site1 = str("%s%i" % (site, i))
    sauce = urllib.request.urlopen(site1).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table1 = soup.find("table", attrs={"class": "substance-properties"})
    table2 = soup.find("table", attrs={"class": "protomers"})
    for row in table1.findAll('tr'):
        row1 = row.findAll('td')
        ate = row1[0].getText()
        print(ate)
The error I get is as follows:
Traceback (most recent call last):
  File "Stack.py", line 11, in <module>
    ate = row1[1].getText()
IndexError: list index out of range
The first version works, so I know that row1[0] does in fact contain a value. Any ideas?

You might want to fix your indentation:
import bs4 as bs
import urllib.request

site = "http://zinc.docking.org/substance/"
for i in range(10, 16):
    site1 = str("%s%i" % (site, i))
    sauce = urllib.request.urlopen(site1).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table1 = soup.find("table", attrs={"class": "substance-properties"})
    for row in table1.findAll('tr'):
        row1 = row.findAll('td')
        Date = row1[0].getText()
        print(Date)
Edit: you should also rename your Date variable. It is not actually a reserved word in Python, but by convention variable names are lowercase, and capitalized names are easily mistaken for classes.
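As for the IndexError in your second snippet: a likely cause is that some <tr> rows contain only <th> header cells, so findAll('td') returns an empty list for them and any index is out of range. A minimal defensive sketch (same URLs and table class as above, with a guard added):

import bs4 as bs
import urllib.request

site = "http://zinc.docking.org/substance/"
for i in range(10, 16):
    sauce = urllib.request.urlopen("%s%i" % (site, i)).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table1 = soup.find("table", attrs={"class": "substance-properties"})
    for row in table1.findAll('tr'):
        cells = row.findAll('td')
        if not cells:  # skip rows that only contain <th> cells
            continue
        print(cells[0].getText())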

Related

Python - List out of range error in web scraping

I have been running this Python code and it gives me an error saying:
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-6ff1d459c8bd> in <module>
      6 soup = BeautifulSoup(data, 'html5lib')
      7 df = pd.DataFrame(columns=["Name", "Sector", "Price", "Price/Earnings", "Dividend_Yield", "Earnings/Share", "52_Week_Low", "52_Week_High", "Market_Cap", "EBITDA"])
----> 8 for row in soup.find_all('tbody')[1].find_all('tr'):
      9     col = row.find_all("td")
     10     Name = col[0].text
IndexError: list index out of range
The code I have used to do the web scraping is:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.kaggle.com/priteshraj10/sp-500-companies"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
df = pd.DataFrame(columns=["Name", "Sector", "Price", "Price/Earnings", "Dividend_Yield", "Earnings/Share", "52_Week_Low", "52_Week_High", "Market_Cap", "EBITDA"])
for row in soup.find_all('tbody')[1].find_all('tr'):
    col = row.find_all("td")
    Name = col[0].text
    Sector = col[1].text
    Price = col[2].text
    Price_Earnings = col[3].text
    Dividend_Yield = col[4].text
    Earnings_Share = col[5].text
    Week_Low = col[6].text
    Week_High = col[7].text
    Market_Cap = col[8].text
    EBITDA = col[9].text
    df = df.append({"Name":Name,"Sector":Sector,"Price":Price,"Price_Earnings":Price_Earnings,"Dividend_Yield":Dividend_Yield,"Earnings_Share":Earnings_Share,"Week_Low":Week_Low,"Week_High":Week_High,"Market_Cap":Market_Cap,"EBITDA":EBITDA}, ignore_index=True)
Can you help me on this?
If you print the variable soup, you will see that the HTML returned does not contain the information you want, probably because the site has a block to avoid web scraping.
This line of code indexes into a list:
for row in soup.find_all('tbody')[1].find_all('tr'):
soup.find_all('tbody') returns a list, and [1] expects it to have at least two items (Python list indexing starts from 0), which it does not. What you could do is print that list:
print(soup.find_all('tbody'))
to see what you are trying to access at index 1, and why it is not there. Additionally, you can check its length:
print(len(soup.find_all('tbody')))
It will be smaller than 2, hence the error.
Instead of printing, I'd recommend using a debugger to find out what is going on in your code. The underlying issue seems to be that the website you are trying to scrape either blocks scrapers or changed its HTML code at some point.
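More generally, guarding the index before you use it makes this kind of failure easier to diagnose. A minimal sketch of that pattern (URL and parser taken from the question; the printed message is my own wording):

import requests
from bs4 import BeautifulSoup

url = "https://www.kaggle.com/priteshraj10/sp-500-companies"
soup = BeautifulSoup(requests.get(url).text, 'html5lib')

bodies = soup.find_all('tbody')
print(len(bodies))  # a length of 0 or 1 here explains the IndexError on bodies[1]
if len(bodies) > 1:
    for row in bodies[1].find_all('tr'):
        print([td.text for td in row.find_all('td')])
else:
    # the page is rendered by JavaScript or blocks scrapers,
    # so the table never appears in the fetched HTML
    print("Expected table not found in the fetched HTML")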

Getting a certain element out of the website table

I've been trying to get only one value from a table on a website. I've been following a tutorial but I am currently stuck. My goal is to extract the name of the country from the table and the number of total cases of that specific country and print it on the screen. For example:
China: 80,761 Total cases
I'm using Python 3.7.
This is my code so far:
import requests
from bs4 import BeautifulSoup
url='https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.findAll('table',{'id':'main_table_countries'})
If you have <table> tags, just go with pandas' .read_html(). It uses BeautifulSoup under the hood and returns a list of DataFrames (hence the [0] to take the first table), which you can then slice and dice as you please:
import pandas as pd
url='https://www.worldometers.info/coronavirus/'
df = pd.read_html(url)[0]
print (df.iloc[:,:2])
Doing it with BeautifulSoup is straightforward. First grab the <table> tag. Within the <table> tag, get all the <tr> tags (the rows). Then iterate through each row to get all the <td> tags (the data). The data you want are at index positions 0 and 1, so just print those out.
import requests
from bs4 import BeautifulSoup

url = 'https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table', {'id': 'main_table_countries'})
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    if data != []:
        print(data[0].text, data[1].text)
ADDITIONAL:
import pandas as pd
country = 'China'
url='https://www.worldometers.info/coronavirus/'
df = pd.read_html(url)[0]
print (df[df['Country,Other'] == country].iloc[:,:2])
OR
import requests
from bs4 import BeautifulSoup
import re

country = 'China'
url = 'https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table', {'id': 'main_table_countries'})
rows = table.find('a', text=re.compile(country))
for row in rows:
    data = row.parent.parent.parent.find_all('td')[1].text
    print(row, data)
You can get the target info this way:
for t in table[0].find_all('tr'):
    target = t.find_all('td')
    if len(target) > 0:
        print(target[0].text, target[1].text)
Output:
China 80,761
Italy 9,172
Iran 8,042
etc.

extract title from a link using BeautifulSoup

I am using BeautifulSoup to scrape a website but need help with this, as I am new to Python and BeautifulSoup.
How do I get VET from the following?
"[[VET]]"
This is my code so far:
import bs4 as bs
import urllib.request
import pandas as pd

# This is the home page of the website
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source, 'lxml')

# find the table and put all info into table
table = soup.find('table', {"id": "decliners_tbl"}).tbody
# find all rows in the table and put them into tableRows
tableRows = table.find_all('tr')
print("There is ", len(tableRows), "Rows in the Table")
print(tableRows)
columns = [tableRows[1].find_all('td')]
print(columns)
a = [tableRows[1].find_all("a")]
print(a)
So my output from print(a) is "[[<a class="mplink popup_link" href="https://marketchameleon.com/Overview/VET/">VET</a>]]", and I want to extract VET from it.
You can use a.text or a.get_text(). If you have multiple elements, you'd need a list comprehension that applies this call to each element.
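For example, since print(a) showed a nested list of tags, something along these lines should work; here is a minimal self-contained sketch using the anchor markup from the question:

import bs4 as bs

html = '<a class="mplink popup_link" href="https://marketchameleon.com/Overview/VET/">VET</a>'
soup = bs.BeautifulSoup(html, 'html.parser')

# a single tag: .text / .get_text() return the link text
print(soup.find('a').get_text())  # VET

# multiple tags: apply the call to each element with a list comprehension
print([link.get_text() for link in soup.find_all('a')])  # ['VET']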
Thank you for all the replies; I was able to work it out using the following code:
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source, 'html.parser')
table = soup.find("table", id="decliners_tbl")
for decliners in table.find_all("tbody"):
    rows = decliners.find_all("tr")
    for row in rows:
        ticker = row.find("a").text
        volume = row.findAll("td", class_="rightcell")[3].text
        print(ticker, volume)

How can I loop through all <th> tags within my script for web scraping?

As of now, I'm only getting ['1'] as the output of what's being printed with my current code below. I want to grab 1-54 on the Team Batting table in the Rk column on the website https://www.baseball-reference.com/teams/NYY/2019.shtml.
How would I go about modifying colNum so it can print the 1-54 in the Rk column? I'm pointing out the colNum line because I feel the issue lies there but I could be wrong.
import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser') # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text() # grabs table headers
th = week.find("th").get_text() # grabs Rk only.
tbody = week.find("tbody")
tr = tbody.find("tr")
thtwo = tr.find("th").get_text()
colNum = [thtwo for thtwo in thtwo]
print(colNum)
Your mistake was in the last few lines as you mentioned. If I understood right, you wanted a list of all the values in the "Rk" column. In order to get all the rows, you have to use the find_all() function. I tweaked your code a little bit in order to get the text of the first field in each row in the following lines:
import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser')  # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text()
th = week.find("th").get_text()
tbody = week.find("tbody")
tr = tbody.find_all("tr")
colnum = [row.find("th").get_text() for row in tr]
print(colnum)

Scraping multiple pages in one Beautiful Soup script -- getting same result

I'm trying to loop through a script that parses tables with Beautiful Soup in Python 2.7.
The first table parse works and produces the expected results. The second loop produces exactly the same results as the first loop.
Additional details:
- If I manually use the URL that the second loop parses, I get the intended page that I want to scrape (there is a little delay in refresh).
- I use this on other websites and the loop works as intended.
Here is the script:
import urllib2
import csv
from bs4 import BeautifulSoup  # latest version bs4

week = raw_input("Which week?")
week = str(week)
data = []
first = "http://fantasy.nfl.com/research/projections#researchProjections=researchProjections%2C%2Fresearch%2Fprojections%253Foffset%253D"
middle = "%2526position%253DO%2526sort%253DprojectedPts%2526statCategory%253DprojectedStats%2526statSeason%253D2015%2526statType%253DweekProjectedStats%2526statWeek%253D"
last = "%2Creplace"
page_num = 1
for page_num in range(1, 3):
    page_mult = (page_num - 1) * 25 + 1
    next = str(page_mult)
    url = first + next + middle + week + last
    print url  # I added this in order to check my output
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    table = soup.find('table', attrs={'class': 'tableType-player hasGroups'})
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])  # Get rid of empty values
    b = open('NFLtable.csv', 'w')
    a = csv.writer(b)
    a.writerows(data)
    b.close()
    page_num = page_num + 1
print data
On the actual page they are using AJAX to request additional results, with a JSON response with some HTML as one of the values.
I modified your code a bit, give it a try:
import urllib2
import urllib
import csv
from bs4 import BeautifulSoup  # latest version bs4
import json

week = raw_input("Which week?")
week = str(week)
data = []
url_format = "http://fantasy.nfl.com/research/projections?offset={offset}&position=O&sort=projectedPts&statCategory=projectedStats&statSeason=2015&statType=weekProjectedStats&statWeek={week}"
for page_num in range(1, 3):
    page_mult = (page_num - 1) * 25 + 1
    next = str(page_mult)
    url = url_format.format(week=week, offset=page_mult)
    print url  # I added this in order to check my output
    request = urllib2.Request(url, headers={'Ajax-Request': 'researchProjections'})
    raw_json = urllib2.urlopen(request).read()
    parsed_json = json.loads(raw_json)
    html = parsed_json['content']
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find('table', attrs={'class': 'tableType-player hasGroups'})
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])  # Get rid of empty values
print data
I tested with week=4.
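If you still want the CSV file the original script produced, write it once after the loop finishes rather than reopening the file on every page. A minimal sketch (NFLtable.csv is the filename from the question):

# after the loop above has filled `data`
with open('NFLtable.csv', 'wb') as f:  # 'wb' avoids blank lines on Windows in Python 2
    writer = csv.writer(f)
    writer.writerows(data)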
