Scraping a website in Python - python

I have a problem scraping a website in Python. Specifically, I cannot scrape a live-scores website with the BeautifulSoup library. The problem in my code is that the HTML elements I find are not being added to my Python list.
import urllib3
from bs4 import BeautifulSoup
import requests
import pymysql
import timeit
data_list = []
url_p = requests.get('my url website')
soup = BeautifulSoup(url_p.text, 'html.parser')
vathmoi_table = soup.find("td", class_="label")
for table in soup.findAll("table"):
    print(table)
print(vathmoi_table)
for team_name in soup.findAll("td"):
    data_list_r = []
    simvolo = team_name.find("img")
    name = team_name.find("td", class_="label")
    vathmologia = team_name.find("td", class_="points")
    # find() returns None when nothing matches, so check every result before using it
    if name is not None and simvolo is not None and vathmologia is not None:
        data_list_r.append(simvolo.get_text().strip())  # was misspelled "symvolo" (NameError)
        data_list_r.append(name.get_text().strip())
        data_list_r.append(vathmologia.get_text().strip())
        data_list.append(data_list_r)
    for tr_parse in team_name.findAll("tr"):
        team = tr_parse.find("td", class_="team")
        if team is not None:
            print(team.get_text())
print(data_list)
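One possible way to restructure this, as a rough sketch only: loop over the table rows (tr) and read the cells of each row, instead of looping over every td and searching inside it. The real page was not shown, so the HTML below is an invented sample and its structure is an assumption; the class names label and points come from the question, and parse_standings is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the structure of the real page is an assumption.
SAMPLE_HTML = """
<table>
  <tr><td><img src="a.png"></td><td class="label"> Team A </td><td class="points">42</td></tr>
  <tr><td><img src="b.png"></td><td class="label">Team B</td><td class="points"> 37 </td></tr>
  <tr><td colspan="3">row without label or points cells</td></tr>
</table>
"""

def parse_standings(html):
    """Return one [logo_src, name, points] list per row that has both cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        name = tr.find("td", class_="label")
        points = tr.find("td", class_="points")
        if name is None or points is None:
            continue  # skip rows that are not team rows
        img = tr.find("img")
        # An <img> tag has no text, so keep its src attribute instead.
        logo = img.get("src", "") if img is not None else ""
        rows.append([logo, name.get_text(strip=True), points.get_text(strip=True)])
    return rows

data_list = parse_standings(SAMPLE_HTML)
print(data_list)
```

If the rows come back empty on the real site, the table may be filled in by JavaScript after the page loads, in which case the HTML that requests receives will not contain it.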

Related

Web Scraping with Beautiful Soup Python - Seesaw - The output has no length error

I am trying to get the comments from a website called Seesaw but the output has no length. What am I doing wrong?
import requests
from bs4 import BeautifulSoup
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
soup = BeautifulSoup(html_text, "lxml")
comments = soup.find_all("span", class_="ng-binding")
print(comments)
The list is empty because the HTML returned by the server contains no span element with class ng-binding; those elements are added later by JavaScript. You can confirm this by checking the raw response:
import requests
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
print(f'{"ng-binding" in html_text=}')
So the output is:
"ng-binding" in html_text=False
You can also check this with the "View Page Source" function in your browser. To get at the rendered content, you can try Selenium, which automates interaction with the site in a real browser.
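A related point that may help explain the result, offered as a small sketch rather than a full diagnosis: everything after the # in that URL is a fragment. Fragments are handled by the browser (here, presumably by the site's JavaScript router) and are not sent in the HTTP request, so the server only ever sees the base address. The standard library's urldefrag shows the split:

```python
from urllib.parse import urldefrag

url = "https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171"

# urldefrag separates the part that is requested from the server
# from the fragment that stays on the client side.
requested, fragment = urldefrag(url)
print(requested)
print(fragment)
```

The first printed value is what the server is asked for; the second is only meaningful to code running in the browser.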

How to scrape plaintext from multiple links off of one website?

from bs4 import BeautifulSoup
import bs4 as bs
import pandas as pd
import numpy as py
import json
import csv
import re
import urllib.request
sauce = urllib.request.urlopen("https://www.imdb.com/list/ls003073623/").read()
soup = bs.BeautifulSoup(sauce, 'html.parser')
soup.findAll('a', href=re.compile('^/title/'))
I am trying to scrape multiple links (about 500) off of a website, and I don't want to manually input each and every URL. How do I go about scraping this?
With BeautifulSoup
If I understand it right, you are trying to obtain a list containing a part of each matching link on a given page. The following is adapted from an example in BeautifulSoup's documentation, updated for Python 3 and bs4 and using the relative /title/ pattern from your question:
import re
import urllib.request
from bs4 import BeautifulSoup
html_page = urllib.request.urlopen("https://www.imdb.com/list/ls003073623/")
soup = BeautifulSoup(html_page, "html.parser")
ids = []
# Match relative title links such as /title/<id>/...
for link in soup.find_all('a', href=re.compile("^/title/")):
    # "/title/<id>/...".split("/") gives ['', 'title', '<id>', ...]
    ids.append(link.get('href').split("/")[2])
print(ids)
With Selenium
For reference, and since it doesn't seem like the question is limited to only BeautifulSoup, here's how we would do the same using Selenium, a very popular alternative.
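from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.imdb.com/list/ls003073623/")
ids = []
# Select only anchors whose href contains /title/
elems = driver.find_elements(By.XPATH, "//a[contains(@href, '/title/')]")
for elem in elems:
    # Selenium returns absolute URLs: "https://www.imdb.com/title/<id>/...".split("/")[4] is the <id>
    ids.append(elem.get_attribute("href").split("/")[4])
driver.quit()
print(ids)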
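Both versions above rely on the position of the ID after splitting on "/", which depends on whether the href is relative or absolute. A slightly more forgiving sketch is to pull the ID out with a regular expression and drop duplicates. The hrefs below are made-up samples, the tt-plus-digits ID format is an assumption about the site, and extract_title_ids is a hypothetical helper name:

```python
import re

# Made-up hrefs of the kind the '^/title/' pattern in the question targets.
sample_hrefs = [
    "/title/tt0000001/?ref_=list",
    "/title/tt0000001/",
    "https://www.imdb.com/title/tt0000002/",
    "/name/nm0000003/",
    "/list/ls003073623/",
]

# Assumption: title IDs look like "tt" followed by digits.
TITLE_ID = re.compile(r"/title/(tt\d+)")

def extract_title_ids(hrefs):
    """Return unique title IDs in first-seen order."""
    seen = {}
    for href in hrefs:
        match = TITLE_ID.search(href)
        if match:
            seen.setdefault(match.group(1), None)  # dicts keep insertion order
    return list(seen)

ids = extract_title_ids(sample_hrefs)
print(ids)
```

The same function can be fed the href values collected by either BeautifulSoup or Selenium.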

Python BeautifulSoup - trouble parsing table from webpage

I'd like to parse the table data from the following site (Pricing data) and create a dataframe with all of the table values (vCPU, Memory, Storage, Price). However, with the following code, I can't seem to find the table on the page. Can someone help me figure out how to parse out the values?
When I use pd.read_html, it raises an error saying that no tables were found.
import pandas as pd
from bs4 import BeautifulSoup
import requests
import csv
url = "https://aws.amazon.com/ec2/pricing/on-demand/"
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, 'html.parser')
data=[]
tables = soup.find_all('table')
df = pd.read_html(url)
If you're having trouble because of dynamic content, a good workaround is Selenium. It drives a real browser, so you don't have to worry about managing cookies and the other problems that come with dynamic web content. I was able to scrape the page with the following:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep
driver = webdriver.Firefox()
driver.get('https://aws.amazon.com/ec2/pricing/on-demand/')
sleep(3)
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
driver.close()
data=[]
tables = soup.find_all('table')
print(tables)
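The fixed sleep(3) above may be too short on a slow connection and wasteful on a fast one. One alternative, sketched here with only the standard library, is to poll until the content shows up or a timeout expires. wait_until is a hypothetical helper, and the fake page function only stands in for a real check such as lambda: "<table" in driver.page_source. Selenium also provides its own WebDriverWait class for this purpose.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)

# Stand-in for a page whose table only appears on the third check.
calls = {"n": 0}

def fake_page_has_table():
    calls["n"] += 1
    return "<table>" if calls["n"] >= 3 else ""

found = wait_until(fake_page_has_table, timeout=2.0, interval=0.01)
print(found, calls["n"])
```

With a real driver, the page source would be read and parsed only after wait_until returns.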

Failed to extract tables and data using beautifulsoup

I was trying to parse a Yahoo Finance webpage using BeautifulSoup. I am using Python 2.7 and bs4 4.3.2. My final objective is to extract, in Python, all the tabulated data from http://finance.yahoo.com/q/ae?s=PXT.TO. As a start, the following code cannot find any table at the URL. What am I missing?
from bs4 import BeautifulSoup
from urllib2 import urlopen
url = "http://finance.yahoo.com/q/ae?s=PXT.TO"
soup = BeautifulSoup(urlopen(url).read())
table = soup.findAll("table")
print table

Scraping Product Names using BeautifulSoup

I'm using BeautifulSoup (BS4) to build a scraper tool that will allow me to pull the product name from any TopShop.com product page, which sits between 'h1' tags. Can't figure out why the code I've written isn't working!
from urllib2 import urlopen
from bs4 import BeautifulSoup
import re
TopShop_URL = raw_input("Enter a TopShop Product URL")
ProductPage = urlopen(TopShop_URL).read()
soup = BeautifulSoup(ProductPage)
ProductNames = soup.find_all('h1')
print ProductNames
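I got this working using requests (http://docs.python-requests.org/en/latest/):
from bs4 import BeautifulSoup
import requests
# Pass the URL variable, not the literal string "TOPShop_URL"
TopShop_URL = raw_input("Enter a TopShop Product URL")  # use input() on Python 3
content = requests.get(TopShop_URL).content
soup = BeautifulSoup(content, "html.parser")
product_names = soup.findAll("h1")
print(product_names)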
Your code is correct, but the problem is that the div which includes the product name is dynamically generated via JavaScript.
To parse this element successfully, you should consider using Selenium or a similar tool, which will let you parse the page after the DOM has fully loaded.
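Before reaching for Selenium, it can be worth checking whether the tag you want is present in the raw HTML at all. The sketch below counts start tags using only the standard library. The page content is an invented example of a JavaScript-driven shell, not the actual TopShop markup, and TagCounter and count_tags are hypothetical names:

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count how often each start tag occurs in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts

# Invented example: a page whose visible content is built later by a script.
STATIC_SHELL = """
<html><head><title>Product</title></head>
<body><div id="app"></div><script src="bundle.js"></script></body></html>
"""

counts = count_tags(STATIC_SHELL)
print(counts["h1"], counts["script"])
```

A count of zero for h1 alongside one or more script tags suggests the heading is created in the browser, which is the situation where a tool like Selenium helps.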
