Web scraping a website that is not formatted properly - Python

I am working on scraping a web site; the link is "https://homeshopping.pk/search.php?q=samsung%20phones". I am having difficulty accessing one of the div classes. I think it is not formatted properly. I am asking this question to confirm whether the page really is malformed or I am doing something wrong.
from bs4 import BeautifulSoup as soup # HTML data structure
from urllib.request import urlopen as uReq # Web client
page_url = "https://homeshopping.pk/search.php?q=samsung%20phones"
uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
print(page_soup)
# finds each product from the store page
container1 = page_soup.find_all("div", {"class": "findify-container findify-search findify-widget-2"})
print(len(container1))
print(container1)

This is where the page loads the products from: https://api-v3.findify.io/v3/search?user[uid]=TW1bcavcZKWeb32z&user[sid]=6kn0FcKb4QjgMz60&user&t_client=1584424566753&key=cae15cfe-508b-41d1-a019-161c02ffd97c&q=samsung%20phones
Now, are those params fixed? I haven't the slightest idea. Can you parse this response? Absolutely: parse it with json.loads, not BeautifulSoup, since it is JSON rather than HTML.
import requests, json
source = requests.get('https://api-v3.findify.io/v3/search?user[uid]=TW1bcavcZKWeb32z&user[sid]=6kn0FcKb4QjgMz60&user&t_client=1584424566753&key=cae15cfe-508b-41d1-a019-161c02ffd97c&q=samsung%20phones')
j = json.loads(source.content.decode())
for item in j["items"]:
    print(item["title"])
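If you only want to check the parsing step without hitting the API, the same json.loads pattern works on any snippet with that shape. The sample below is a made-up illustration, not real Findify output:

```python
import json

# Hypothetical snippet mimicking the shape of the Findify response
sample = '{"items": [{"title": "Samsung Galaxy S10"}, {"title": "Samsung Galaxy A50"}]}'
j = json.loads(sample)
titles = [item["title"] for item in j["items"]]
print(titles)  # ['Samsung Galaxy S10', 'Samsung Galaxy A50']
```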

Related

HtmlAgilityPack and BeautifulSoup missing tags

I am scraping this site: https://finance.yahoo.com/quote/MSFT/press-releases.
In the browser, there are 20+ articles. However, when I pull the site's HTML down and load it into HTML agility pack, only the first three articles are appearing.
let client = new WebClient()
let uri = "https://finance.yahoo.com/quote/MSFT/press-releases"
let response = client.DownloadString(uri)
let doc = HtmlDocument()
doc.LoadHtml(response)
works:
let node = doc.DocumentNode.SelectSingleNode("//*[@id=\"summaryPressStream-0-Stream\"]/ul/li[1]")
node.InnerText
does not work:
let node = doc.DocumentNode.SelectSingleNode("//*[@id=\"summaryPressStream-0-Stream\"]/ul/li[10]")
node.InnerText
Is it because there are some janky li tags on the Yahoo site? Or is it a limitation of HtmlAgilityPack?
I also did the same script in Python using BeautifulSoup and have the same problem:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://finance.yahoo.com/quote/MSFT/press-releases?"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', href=True):
    print(link['href'])
Thanks
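A quick way to confirm the likely diagnosis (the missing items being injected later by JavaScript, rather than dropped by the parser) is to count the li nodes in the HTML the server actually sends. The fragment below is a made-up stand-in used only to illustrate the check; with the real page you would count the li elements in the downloaded response:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the server-sent HTML: only the first few articles
# are present; the rest are appended later by JavaScript in the browser.
raw_html = "<ul><li>Article 1</li><li>Article 2</li><li>Article 3</li></ul>"
soup = BeautifulSoup(raw_html, "html.parser")
print(len(soup.find_all("li")))  # 3, no matter how many items the browser shows
```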

Python urlopener not retrieving tables and lists

I am trying to make a simple webscraper where I take the information off a HTML page. It's simple but I have a problem I can't seem to solve:
When I download the HTML page myself and parse it with BeautifulSoup, it parses everything and gives me all the data. That works, but I shouldn't need to do it; I am trying to use the URL directly instead, which doesn't seem to work. Whenever I open the link with urlopen and parse the page with BeautifulSoup, it completely ignores/excludes some lists and tables from the HTML. Those tables appear when I look at the page online with "Inspect Element", and they appear when I download the HTML page myself, but they never appear when I use urlopen. I even tried encoding POST data and sending it as an argument of the function, but that doesn't work either.
import bs4
from urllib.request import urlopen as uReq
from urllib.parse import urlencode as uEnc
from bs4 import BeautifulSoup as soup
my_url = 'https://sp.com.sa/en/tracktrace/?tid=RB961555017SG'
#data = {'tid':'RB961555017SG'}
#sdata = uEnc(data)
#sdata = bytearray(sdata, 'utf-8')
uClient = uReq(my_url, timeout=2) # opening url or downloading the webpage
page_html = uClient.read() # saving html file in page_html
uClient.close() # closing the connection
page_soup = soup(page_html, "html.parser") # parsing the html file and saving
updates = page_soup.findAll("div",{"class":"col-sm-12"})
#updates = page_soup.findAll("ol", {})
print(updates)
These tables contain the information I need, is there anyway I can fix this?
urlopen (like requests) works a bit differently from a browser: for example, it does not actually run JavaScript.
In this case the table with the info is generated by a script rather than hardcoded in the HTML. You can see the actual source code the server sends by prefixing the URL with "view-source:".
view-source:https://sp.com.sa/en/tracktrace/?tid=RB961555017SG
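You can see the effect without a browser: BeautifulSoup only ever sees the server-sent markup, so a table built by a script simply isn't there to find. The fragment below is a simplified, made-up illustration of that situation:

```python
from bs4 import BeautifulSoup

# Simplified, made-up version of what urlopen receives: a container and a
# script tag, but no <table> yet -- the table is built later by JavaScript.
raw_html = '<div class="col-sm-12"><script>loadShipmentTable()</script></div>'
page_soup = BeautifulSoup(raw_html, "html.parser")
print(page_soup.find_all("table"))  # [] -- nothing for the parser to find
```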
So we'd want to run that script somehow. The easiest way is to use selenium, which drives your own browser. Then simply take the loaded HTML and run it through BS4. I noticed there is a more specific tag you can use than "col-sm-12". Hope it helps :)
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time
my_url = 'https://sp.com.sa/en/tracktrace/?tid=RB961555017SG'
chrome_path = "path of chrome driver that fits your current browser version"
driver = webdriver.Chrome(chrome_path)
driver.get(my_url)
time.sleep(5) #to make sure it's loaded
page_html = driver.page_source
driver.quit()
page_soup = soup(page_html, "html.parser") # parsing the html file and saving
updates = page_soup.findAll("table", {"class": "table-shipment table-striped"}) # this table is more specific to the data you want than "col-sm-12", so I'd rather use that
print(updates)

How to make beautiful soup grab only what is between a set of "[:" ":]" in a web page?

Good afternoon! How do I make BeautifulSoup grab only what is between multiple sets of "[:" and ":]"? So far I have the entire page in my soup, but sadly it has no tags.
I have tried a couple of things so far:
soup.findAll(text="[")
keys = soup.find("span", attrs = {"class": "objectBox objectBox-string"})
import bs4 as bs
import urllib.request
source = urllib.request.urlopen("https://login.microsoftonline.com/common/discovery/keys").read()
soup = bs.BeautifulSoup(source,'lxml')
# ---------------------------------------------
# prior script that I was playing with trying to tackle this issue
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# Set URL to scrape new certs from
newcerts = "https://login.microsoftonline.com/common/discovery/keys"
# Connect to the URL
response = requests.get(newcerts)
# Parse HTML and save to BeautifulSoup Object
soup = BeautifulSoup(response.text, "html.parser")
keys = soup.find("span", attrs = {"class": "objectBox objectBox-string"})
End goal is to retrieve the public PKI keys from Azure's website at https://login.microsoftonline.com/common/discovery/keys
Not sure if this is what you meant to grab. Try the script below:
import json
import requests
url = 'https://login.microsoftonline.com/common/discovery/keys'
res = requests.get(url)
jsonobject = json.loads(res.content)
for item in jsonobject['keys']:
    print(item['x5c'])
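If you then need the certificates themselves rather than the raw strings: each x5c entry in a JWKS document is a base64-encoded DER certificate, so wrapping it in the standard PEM armor gives something most TLS tooling accepts. A minimal sketch (the x5c value below is a placeholder, not a real certificate):

```python
import textwrap

# Placeholder x5c value -- a real one is a long base64 string from the JWKS
x5c = "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A" * 4

# PEM is just the base64 body wrapped at 64 characters between armor lines
body = "\n".join(textwrap.wrap(x5c, 64))
pem = "-----BEGIN CERTIFICATE-----\n" + body + "\n-----END CERTIFICATE-----\n"
print(pem)
```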

Problem with scraping data from website with BeautifulSoup

I am trying to take a movie rating from the website Letterboxd. I have used code like this on other websites and it has worked, but it is not getting the info I want off of this website.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://letterboxd.com/film/avengers-endgame/")
soup = BeautifulSoup(page.content, 'html.parser')
final = soup.find("section", attrs={"class": "section ratings-histogram-chart"})
print(final)
This prints nothing, but there is a tag in the website for this class and the info I want is under it.
The reason is that the website loads most of its content asynchronously, so you'll have to look at the HTTP requests it sends to the server to fill in the page content after the layout loads. You can find them in the "Network" tab of the browser's developer tools (F12).
For instance, one of the APIs they use to load the rating is this one:
https://letterboxd.com/csi/film/avengers-endgame/rating-histogram/
You can get the weighted average from another tag
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://letterboxd.com/film/avengers-endgame/')
soup = bs(r.content, 'lxml')
print(soup.select_one('[name="twitter:data2"]')['content'])
To get the text of the whole histogram:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://letterboxd.com/csi/film/avengers-endgame/rating-histogram/')
soup = bs(r.content, 'lxml')
ratings = [item['title'].replace('\xa0',' ') for item in soup.select('.tooltip')]
print(ratings)
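If you want the weighted average as a number rather than a string, a small regex over the meta tag's content works. The content string below is a made-up example of the assumed "X out of 5" format, not actual site output:

```python
import re

# Made-up content string in the assumed "<average> out of 5" format
content = "4.25 out of 5"
match = re.search(r"([\d.]+)", content)
rating = float(match.group(1)) if match else None
print(rating)  # 4.25
```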

Scraping a table appearing on click with python

I want to scrape information from this page.
Specifically, I want to scrape the table which appears when you click "View all" under the "TOP 10 HOLDINGS" (you have to scroll down on the page a bit).
I am new to web scraping and have tried using BeautifulSoup to do this. However, there seems to be an issue because of the "onclick" behaviour I need to take into account. In other words: the HTML code I scrape directly from the page doesn't include the table I want to obtain.
I am a bit confused about my next step: should I use something like selenium or can I deal with the issue in an easier/more efficient way?
Thanks.
My current code:
from bs4 import BeautifulSoup
import requests
Soup = BeautifulSoup
my_url = 'http://www.etf.com/SHE'
page = requests.get(my_url)
htmltxt = page.text
soup = Soup(htmltxt, "html.parser")
print(soup)
You can get a JSON response from the API: http://www.etf.com/view_all/holdings/SHE. The table you're looking for is in the 'view_all' key of the response.
import requests
from bs4 import BeautifulSoup as Soup
url = 'http://www.etf.com/SHE'
api = "http://www.etf.com/view_all/holdings/SHE"
headers = {'X-Requested-With':'XMLHttpRequest', 'Referer':url}
page = requests.get(api, headers=headers)
htmltxt = page.json()['view_all']
soup = Soup(htmltxt, "html.parser")
data = [[td.text for td in tr.find_all('td')] for tr in soup.find_all('tr')]
print('\n'.join(': '.join(row) for row in data))
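The row-extraction step at the end is independent of the request itself: the same nested comprehension turns any table markup into a list of rows. An offline sketch with a made-up fragment (the holdings below are invented, not real ETF data):

```python
from bs4 import BeautifulSoup as Soup

# Made-up stand-in for the HTML fragment returned under 'view_all'
htmltxt = ("<table>"
           "<tr><td>Holding A</td><td>5.0%</td></tr>"
           "<tr><td>Holding B</td><td>4.5%</td></tr>"
           "</table>")
soup = Soup(htmltxt, "html.parser")
# One inner list per <tr>, one string per <td>
data = [[td.text for td in tr.find_all('td')] for tr in soup.find_all('tr')]
print(data)  # [['Holding A', '5.0%'], ['Holding B', '4.5%']]
```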
