Trouble grabbing data from a webpage located within comment - python

I've written a script in Python to get some data from a website. It seems I did it the right way. However, when I print the data I get an error: list index out of range. The data are within a comment, so in my script I tried to use Python's built-in comment processing. Could anybody point out where I'm going wrong?
Link to the website: website_link
Here is the script I've tried so far:
import requests
from bs4 import BeautifulSoup, Comment
res = requests.get("replace_with_the_above_link")
soup = BeautifulSoup(res.text, 'lxml')
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    sauce = BeautifulSoup(comment, 'lxml')
    items = sauce.select("#tco_detail_data")[0]
    data = ' '.join([' '.join(item.text.split()) for item in items.select("li")])
    print(data)
This is the traceback:
Traceback (most recent call last):
  File "C:\Users\Local\Programs\Python\Python35-32\new_line_one.py", line 8, in <module>
    items = sauce.select("#tco_detail_data")[0]
IndexError: list index out of range
Please click on the below link to see which portion of data I would like to grab: Expected_output_link

None of the comments contain HTML matching "#tco_detail_data", so select returns an empty list, which raises an IndexError when you try to take the first item.
However, you can find the data in a "ul#tco_detail_data" tag.
res = requests.get(link)
soup = BeautifulSoup(res.text, 'lxml')
data = soup.select_one("#tco_detail_data")
print(data)
If you want data in a list,
data = [list(item.stripped_strings) for item in data.select("ul")]
If you prefer a string,
data = '\n'.join([item.get_text(' ', strip=True) for item in data.select("ul")])
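For reference, when the data really is inside an HTML comment, the general pattern is to re-parse the comment's text as HTML. A minimal sketch against made-up markup (not the page from the question):

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup: a list hidden inside an HTML comment
html = "<div><!-- <ul id='prices'><li>10</li><li>20</li></ul> --></div>"
soup = BeautifulSoup(html, "html.parser")

for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    # The comment's text is plain markup, so parse it again
    inner = BeautifulSoup(comment, "html.parser")
    items = inner.select("#prices li")
    print([item.get_text() for item in items])  # ['10', '20']
```

The same IndexError as in the question would appear here if "#prices" matched nothing, which is why checking the result of select before indexing is worthwhile.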

Related

Python: Why printed it as blank from a list type by using for loop (edited: case solved)

I'm trying to extract data using web scraping with Python. The information is a table of the movie's release dates and nations. After I sent the request and used BeautifulSoup, it printed out as a blank []. I don't know how to fix it... Here is my code:
soup = BeautifulSoup(response)
element_dates = ".ipl-zebra-list ipl-zebra-list--fixed-first release-dates-table-test-only" # css selector (date release table)
select_datesTag = soup.select(element_dates)
result = [i.text for i in select_datesTag]
print(result)
>>>[]
Edit:
Thank you all for trying to help me. The previous result printed as blank, indicating that the information I tried to extract was not found.
The cause was the wrong label I picked for the "element_dates" CSS selector: instead of ".ipl-..." it should actually be ".release-date-item__date".
This is the link to the website I was working on, along with the fixed code:
import requests
from bs4 import BeautifulSoup
target_url = "https://www.imdb.com/title/tt4154796/releaseinfo"
target_params = {"ref_": "tt_ov_inf"}
response = requests.get(target_url, params = target_params)
response = response.text
soup = BeautifulSoup(response, "html.parser")
element_dates = ".release-date-item__date"
result = [i.text for i in soup.select(element_dates)]
print(result)  # successfully printed all the dates with the fixed selector
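A quick way to sanity-check a selector before running it against the live page is to try it on a static snippet. The class name below is the one from the fix above; the real IMDb markup may of course differ or change:

```python
from bs4 import BeautifulSoup

# Static snippet using the selector from the fix above
html = '<td class="release-date-item__date">22 April 2019</td>'
soup = BeautifulSoup(html, 'html.parser')

dates = soup.select('.release-date-item__date')
print([d.get_text() for d in dates])  # an empty list would mean the selector is wrong
```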

How to Grab Specific Text

I want to grab the price of bitcoin from this website: https://www.coindesk.com/price/bitcoin
but I am not sure how to do it; I'm pretty new to coding.
This is my code so far; I am not sure what I am doing wrong. Thanks in advance.
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.coindesk.com/price/bitcoin')
r_content = r.content
soup = BeautifulSoup(r_content, 'lxml')
p_value = soup.find('span', {'class': "currency-price", "data-value": True})['data-value']
print(p_value)
This is the result:
Traceback (most recent call last):
  File "C:/Users/aidan/PycharmProjects/scraping/Scraper.py", line 8, in <module>
    p_value = soup.find('span', {'class': "currency-price", "data-value": True})['data-value']
TypeError: 'NoneType' object is not subscriptable
The content is dynamically sourced from an API call returning JSON. You can request a list of currencies or a single currency. With requests, JavaScript doesn't run, so this content isn't added to the DOM, and the various DOM changes that produce the HTML you see in the browser don't occur.
import requests
r = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=BTC').json()
print(r)
price = r['data']['currency']['BTC']['quotes']['USD']['price']
print(price)
r = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=ADA,BCH,BSV,BTC,BTG,DASH,DCR,DOGE,EOS,ETC,ETH,IOTA,LSK,LTC,NEO,QTUM,TRX,XEM,XLM,XMR,XRP,ZEC').json()
print(r)
The problem here is that the soup.find() call is not returning a value (that is, there is no span with the attributes you defined on the page), so when you try to get data-value there is no dictionary to look it up in.
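A defensive pattern, sketched here on a static snippet, is to check the result of find() before subscripting it, since find() returns None when nothing matches:

```python
from bs4 import BeautifulSoup

# Static HTML with no matching span, mimicking what requests actually received
html = '<p>No price span rendered without JavaScript</p>'
soup = BeautifulSoup(html, 'html.parser')

span = soup.find('span', {'class': 'currency-price', 'data-value': True})
if span is not None:
    print(span['data-value'])
else:
    print('span not found')  # avoids the "'NoneType' object is not subscriptable" error
```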
The website doesn't hold the data in the HTML, so you can't scrape it that way, but it uses an endpoint that you could call directly:
data = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=BTC').json()
p_value = data['data']['currency']['BTC']['quotes']['USD']['price']
print(p_value)
# output: 11375.678380772
the price changes all the time, so your output may be different

Web scrape not pulling back title correctly

I am trying to pull back only the Title from a source code online. My code is able to currently pull all the correct lines, but I cannot figure out how to make it only pull back the title.
from bs4 import BeautifulSoup # BeautifulSoup is in bs4 package
import requests
URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')
for link in tb.find_all('tr'):
    name = link.find('td')
    print(name.get_text('title'))
I expect it to only say
Nexus
Pylon
Gateway
Assimilator
etc.
but I get the error:
Traceback (most recent call last):
  File "main.py", line 11, in <module>
    print(name.get_text().strip())
AttributeError: 'NoneType' object has no attribute 'get_text'
I don't understand what I am doing wrong, since from what I read it should only pull back the desired results.
Try the code below. Your first row has table headers instead of table data, so find('td') returns None for that row.
So add a condition to check whether the span inside the row was found, and only then get its title, as below.
from bs4 import BeautifulSoup # BeautifulSoup is in bs4 package
import requests
URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')
for link in tb.find_all('tr'):
    name = link.find('span')
    if name is not None:
        # Process only if the element is available
        print(name['title'])
I think you should use something like
for link in tb.find_all('tr'):
    for name in link.select('td[title]'):
        print(name['title'])
because, from what I can see, the string comes back empty: there is no title tag, so you were trying to get text from the title attribute of the td tag. Note that select() returns a list, so iterate over it and read the attribute directly.
bkyada's answer is perfect; if you want another solution, then in your for loop, instead of finding td, find the span and read its title attribute:
containers = link.find('span')
if containers is not None:
    print(containers['title'])
It is more efficient to simply use the class name to identify the elements with a title attribute, as they all have one in the first column.
from bs4 import BeautifulSoup # BeautifulSoup is in bs4 package
import requests
URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')
titles = [i['title'] for i in tb.select('.blizzard_icons_single')]
print(titles)
titles = {i['title'] for i in tb.select('.blizzard_icons_single')} #set of unique
print(titles)
As the title attribute is limited to that column, you could also have used a (slightly slower) attribute selector:
titles = {i['title'] for i in tb.select('[title]')} #set of unique

Python BS4 crawler indexerror

I am trying to create a simple crawler that pulls metadata from websites and saves the information into a CSV. So far I am stuck with the error:
IndexError: list index out of range
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
# Copy all of the content from the provided web page
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')
# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')
# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle,webpage)
findPatLink = re.findall(patFinderLink,webpage)
# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2,16)
# Print out the results to screen
for i in listIterator:
    print findPatTitle[i] # The title
    print findPatLink[i] # The link to the original article
    articlePage = urlopen(findPatLink[i]).read() # Grab all of the content from original article
    divBegin = articlePage.find('<div>') # Locate the div provided
    article = articlePage[divBegin:(divBegin+1000)] # Copy the first 1000 characters after the div
    # Pass the article to the Beautiful Soup Module
    soup = BeautifulSoup(article)
    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')
    # Print all of the paragraphs to screen
    for i in paragList:
        print i
        print '\n'
# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)
print soup2.findAll('title')
print soup2.findAll('link')
titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')
for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print '\n'
Any help would be greatly appreciated.
The error I get is
  File "C:\Users......", line 24, in <module>
    print findPatTitle[i] # the title
IndexError: list index out of range
Thank you.
It seems that you are not using all the power that bs4 can give you.
You are getting this error because the length of findPatTitle is just one, since HTML usually has only one title element per document.
A simple way to grab the title of an HTML document is using bs4 itself:
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)
# get the content of title
title = soup.title.text
You will probably get the same error if you try to iterate over your findPatLink in the current way, since it has length 6. It is not clear to me whether you want all the link elements or all the anchor elements, but sticking with the first idea, you can improve your code using bs4 again:
link_href_list = [link['href'] for link in soup.find_all("link")]
And finally, since you don't want some of the URLs, you can slice link_href_list as you want. An improved version of the last expression, which excludes the first and second results, could be:
link_href_list = [link['href'] for link in soup.find_all("link")[2:]]

Scraping multiple webpages and writing to a CSV file

I'm writing a program that will take seven pieces of data from a website and write them to a CSV file per company in the symbols.txt file, such as AAPL or NFLX. My problem seems to come from my confusion with indexing needed to make the script work. I am at a loss on how it fits. I thought that this format would work...
import urllib2
from BeautifulSoup import BeautifulSoup
import csv
import re
import urllib
# import modules
symbolfile = open("symbols.txt")
symbolslist = symbolfile.read()
newsymbolslist = symbolslist.split("\n")
i = 0
f = csv.writer(open("pe_ratio.csv","wb"))
# short cut to write
f.writerow(["Name","PE","Revenue % Quarterly","ROA% YOY","Operating Cashflow","Debt to Equity"])
#first write row statement
# define name_company as the following
while i < len(newsymbolslist):
    page = urllib2.urlopen("http://finance.yahoo.com/q/ks?s=" + newsymbolslist[i] + "%20Key%20Statistics").read()
    soup = BeautifulSoup(page)
    name_company = soup.findAll("div", {"class" : "title"})
    for name in name_company: #add multiple iterations?
        all_data = soup.findAll('td', "yfnc_tabledata1")
        stock_name = name.find('h2').string #find company's name in name_company with h2 tag
        f.writerow([stock_name, all_data[2].getText(), all_data[17].getText(), all_data[13].getText(), all_data[29].getText(), all_data[26].getText()]) #write down PE data
    i += 1
I get the following error below when I try to run the code as is:
Traceback (most recent call last):
  File "company_data_v1.py", line 28, in <module>
    f.writerow([stock_name, all_data[2].getText(), all_data[17].getText(), all_data[13].getText(), all_data[29].getText(), all_data[26].getText()]) #write down PE data
IndexError: list index out of range
Thanks for your help in advance.
name_company = soup.findAll("div", {"class" : "title"})
soup = BeautifulSoup(page) #this is the first time you define soup
You define soup on the line after you attempt to do soup.findAll. Python tells you exactly what the problem is: you haven't defined soup at the findAll line.
Flip the order of those lines.
I assume when you said "where to put the variables to make the script work" you were referring to this 'soup' variable? The one in your error message?
If so then I suggest declaring 'soup' before you try to use it in soup.findAll(). As you can see, you declared soup = BeautifulSoup(page) one line after soup.findAll(). It should go above it.
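In outline, the fix amounts to creating soup before calling findAll on it. A minimal sketch against a static stand-in page, using modern bs4 rather than the question's urllib2 setup:

```python
from bs4 import BeautifulSoup

# Static stand-in for the downloaded Yahoo page
page = '<div class="title"><h2>Apple Inc. (AAPL)</h2></div>'

# Define soup first...
soup = BeautifulSoup(page, 'html.parser')
# ...then use it
name_company = soup.findAll('div', {'class': 'title'})

for name in name_company:
    print(name.find('h2').string)  # Apple Inc. (AAPL)
```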