I am looking to identify the URLs that request external resources in HTML files.
I currently use the src attribute in the img and script tags, and the href attribute in the link tag (to identify CSS).
Are there other tags I should be examining to identify other resources?
For reference, my code in Python is currently:
html = read_in_file(file)
soup = BeautifulSoup(html, 'html.parser')
image_src = [x['src'] for x in soup.findAll('img')]
css_link = [x['href'] for x in soup.findAll('link')]
script_src = []  # script tags often don't have a 'src' attribute, hence the try/except
for x in soup.findAll('script'):
    try:
        script_src.append(x['src'])
    except KeyError:
        pass
I have updated my code to capture what seem like the most common resources in HTML. Obviously this doesn't look at resources requested from within CSS or JavaScript. If I am missing tags, please comment.
from bs4 import BeautifulSoup

def find_list_resources(tag, attribute, soup):
    resource_list = []
    for x in soup.findAll(tag):
        try:
            resource_list.append(x[attribute])
        except KeyError:
            pass
    return resource_list

html = read_in_file(file)
soup = BeautifulSoup(html, 'html.parser')

image_src = find_list_resources('img', 'src', soup)
script_src = find_list_resources('script', 'src', soup)
css_link = find_list_resources('link', 'href', soup)
video_src = find_list_resources('video', 'src', soup)
audio_src = find_list_resources('audio', 'src', soup)
iframe_src = find_list_resources('iframe', 'src', soup)
embed_src = find_list_resources('embed', 'src', soup)
object_data = find_list_resources('object', 'data', soup)
source_src = find_list_resources('source', 'src', soup)
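A few more tag/attribute pairs can also pull in external resources. As a non-exhaustive sketch reusing the same helper (note that srcset values are comma-separated candidate strings with width/density descriptors, not bare URLs):

img_srcset = find_list_resources('img', 'srcset', soup)        # responsive image candidates
source_srcset = find_list_resources('source', 'srcset', soup)  # <picture>/<source> candidates
video_poster = find_list_resources('video', 'poster', soup)    # video preview images
track_src = find_list_resources('track', 'src', soup)          # subtitle/caption files
input_src = find_list_resources('input', 'src', soup)          # <input type="image"> buttons
frame_src = find_list_resources('frame', 'src', soup)          # legacy frameset documents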
When I was trying to scrape data from Sephora and Ulta using BeautifulSoup, I could get the HTML content of the page. But when I then tried to parse it with lxml using XPath, I didn't get any output, whereas the same XPath worked in Selenium.
Using BeautifulSoup
for i in range(len(df)):
    response = requests.get(df['product_url'].iloc[i])
    my_url = df['product_url'].iloc[i]
    My_url = ureq(my_url)
    my_html = My_url.read()
    My_url.close()
    soup = BeautifulSoup(my_html, 'html.parser')
    dom = et.HTML(str(soup))
    # price
    try:
        price = dom.xpath('//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span/text()')
        df['price'].iloc[i] = price
    except:
        pass
Using Selenium
lst = []
urls = df['product_url']
for url in urls[:599]:
    time.sleep(1)
    driver.get(url)
    time.sleep(2)
    try:
        prize = driver.find_element('xpath', '//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span').text
    except:
        pass
    lst.append([prize])
    pz = None
    dt = None
Does anyone know why I can't get the content when parsing with lxml using the same XPath that works in Selenium? Thanks so much in advance.
Sample link for Ulta: https://www.ulta.com/p/coco-mademoiselle-eau-de-parfum-spray-pimprod2015831
Sample link for Sephora: https://www.sephora.com/product/coco-mademoiselle-P12495?skuId=513168&icid2=products
1. About the XPath
driver.find_element('xpath', '//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span').text
I'm a bit surprised that the selenium code works for your Sephora links - the link you provided redirects to a productnotcarried page, but at this link (for example), that XPath has no matches. You can use //p[@data-comp="Price "]//span/b instead.
Actually, even for Ulta, I prefer //*[@class="ProductHero__content"]//*[@class="ProductPricing"]/span just for human-readability, although it looks better if you use this path with CSS selectors:
prize = driver.find_element("css selector", '*.ProductHero__content *.ProductPricing>span').text
[Coding for both sites - Selenium]
To account for both sites, you could set up something like this reference dictionary:
xRef = {
    'www.ulta.com': '//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span',
    'www.sephora.com': '//p[@data-comp="Price "]//span/b'
}
# for url in urls[:599]:... ################ REST OF CODE #############
and then use it accordingly
# from urllib.parse import urlsplit
# lst, urls, xRef = ....
# for url in urls[:599]:
# sleep...driver.get...sleep...
try:
    uxrKey = urlsplit(url).netloc
    prize = driver.find_element('xpath', xRef[uxrKey]).text
except:
    # pass # you'll just be repeating whatever you got in the previous loop for prize
    # [also, if this happens in the first loop, an error will be raised at lst.append([prize])]
    prize = None  # 'MISSING' # '' #
################ REST OF CODE #############
2. Limitations of Scraping with bs4+requests
I don't know what et and ureq are, but the response from requests.get can be parsed without them; although [afaik] bs4 doesn't have any XPath support, CSS selectors can be used with .select.
price = soup.select('.ProductHero__content .ProductPricing>span') # for Ulta
price = soup.select('p[data-comp~="Price"] span>b') # for Sephora
Although that's enough for Sephora, there's another issue - the price on Ulta pages is loaded with JS, so the parent of the price span is empty.
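If you'd rather keep using XPath, the requests response can also be parsed directly with lxml, without et/ureq or Selenium. A minimal sketch using the selectors suggested above (requests may need extra headers depending on the site, and the Ulta caveat about JS-rendered prices still applies):

import requests
from lxml import html as lxml_html

# sample Sephora URL from the question
url = 'https://www.sephora.com/product/coco-mademoiselle-P12495?skuId=513168&icid2=products'
resp = requests.get(url)
dom = lxml_html.fromstring(resp.content)

# Sephora price (XPath suggested above)
price = dom.xpath('//p[@data-comp="Price "]//span/b/text()')
print(price)

# For an Ulta page the equivalent would be the XPath below, but as noted above the
# price span is filled in by JS, so a plain requests response may return nothing:
# price = dom.xpath('//*[@class="ProductHero__content"]//*[@class="ProductPricing"]/span/text()')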
3. [Suggested Solution] Extracting from JSON inside script Tags
For both sites, product data can be found inside script tags, so this function can be used to extract price from either site:
# import json

############ LONGER VERSION ##########
def getPrice_fromScript(scriptTag):
    try:
        s, sj = scriptTag.get_text(), json.loads(scriptTag.get_text())
        while s:
            sPair = s.split('"@type"', 1)[1].split(':', 1)[1].split(',', 1)
            t, s = sPair[0].strip(), sPair[1]
            try:
                if t == '"Product"': return sj['offers']['price']  # Ulta
                elif t == '"Organization"': return sj['offers'][0]['price']  # Sephora
                # elif.... # can add more options
                # else.... # can add a default
            except: continue
    except: return None
#######################################

############ SHORTER VERSION ##########
def getPrice_fromScript(scriptTag):
    try:
        sj = json.loads(scriptTag.get_text())
        try: return sj['offers']['price']  # Ulta
        except: pass
        try: return sj['offers'][0]['price']  # Sephora
        except: pass
        # try...except: pass # can try more options
    except: return None
#######################################
and you can use it with your BeautifulSoup code:
# from requests_html import HTMLSession # IF you use instead of requests
# def getPrice_fromScript....
for i in range(len(df)):
    response = requests.get(df['product_url'].iloc[i])  # takes too long [for me]
    # response = HTMLSession().get(df['product_url'].iloc[i]) # is faster [for me]

    ## error handling, just in case ##
    if response.status_code != 200:
        errorMsg = f'Failed to scrape [{response.status_code} {response.reason}] - '
        print(errorMsg, df['product_url'].iloc[i])
        continue  # skip to next loop/url

    soup = BeautifulSoup(response.content, 'html.parser')
    pList = [p.strip() for p in [
        getPrice_fromScript(s) for s in soup.select('script[type="application/ld+json"]')[:5]  # [1:2]
    ] if p and p.strip()]
    if pList: df['price'].iloc[i] = pList[0]
(The price should be in the second script tag with type="application/ld+json", but this is searching the first 5 just in case....)
Note: requests.get was being very slow when I was testing this code, especially for Sephora, so I ended up using HTMLSession().get instead.
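For reference, that swap is just the following (a minimal sketch; requests_html needs to be installed, and its response object behaves like a requests response, so the rest of the loop is unchanged):

from requests_html import HTMLSession

session = HTMLSession()  # create once and reuse it across the loop rather than per URL
response = session.get(df['product_url'].iloc[i])  # drop-in replacement for requests.get(...)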
I need help extracting the URLs of all the 2020/21 matches from this website (https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results) and scraping them.
I am sending a request to this link.
The section of the HTML that I want to retrieve is this part:
Here's the code that I am using:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse
website = 'https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results'
response = requests.get(website)
soup = BeautifulSoup(response.content,'html.parser')
match_result = soup.find_all('a', {'class': 'match-info-link-FIXTURES'})
soup.get('href')
url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))
url_joined = []
for link_2 in url_part_2:
    url_joined.append(urllib.parse.urljoin(url_part_1, link_2))
first_link = url_joined[0]
match_url = soup.find_all('div', {'class': 'link-container border-bottom'})
soup.get('href')
url_part_3 = 'https://www.espncricinfo.com/'
url_part_4 = []
for item in match_result:
    url_part_4.append(item.get('href'))
print(url_part_4)
You don't need the second find_all('a', {'class': 'match-info-link-FIXTURES'}) call or a second for item in match_result: loop, since you already have the tags with the hrefs.
You can get the href with item.get('href').
You can do:
url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))
The result will look something like:
['/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-final-1237181/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-sunrisers-hyderabad-qualifier-2-1237180/full-scorecard',
'/series/ipl-2020-21-1210595/royal-challengers-bangalore-vs-sunrisers-hyderabad-eliminator-1237178/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-qualifier-1-1237177/full-scorecard',
'/series/ipl-2020-21-1210595/sunrisers-hyderabad-vs-mumbai-indians-56th-match-1216495/full-scorecard',
...
]
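To turn those relative hrefs into full match URLs (the url_joined list the question builds), you can join them onto the site root with urljoin, as the question's own code already does:

import urllib.parse

url_part_1 = 'https://www.espncricinfo.com/'
# url_part_2 is the list of relative hrefs shown above
url_joined = [urllib.parse.urljoin(url_part_1, link_2) for link_2 in url_part_2]
print(url_joined[0])  # full URL of the first match's scorecard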
From the official docs:
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.
Try
soup.find_all("a", class_="match-info-link-FIXTURES")
I have a link that renders exactly the graph I want to scrape: https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1
I simply can't understand whether it is XML or an SVG graph, or how to scrape the data from it. I think I need to use bs4 and requests but I don't know how to go about it.
Could anyone help?
You will load HTML like this:
import requests
url = "https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
resp = requests.get(url)
data = resp.text
Then you will create a BeautifulSoup object from this HTML.
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, features="html.parser")
After this, it is usually quite subjective how to parse out what you want, and the candidate code can vary a lot. This is how I did it:
Using BeautifulSoup, I parsed all the "rect" elements and checked whether each rect has an "onmouseover" attribute.
rects = soup.svg.find_all("rect")
yx_points = []
for rect in rects:
    if rect.has_attr("onmouseover"):
        text = rect["onmouseover"]
        x_start_index = text.index("'") + 1
        y_finish_index = text[x_start_index:].index("'") + x_start_index
        yx = text[x_start_index:y_finish_index].split()
        print(text[x_start_index:y_finish_index])
        yx_points.append(yx)
In other words, I scraped the onmouseover= part of each rect and extracted pairs like 02.2015 155,1 from it.
Here, this is how yx_points looks like now:
[['12.2009', '100,0'], ['01.2010', '101,8'], ['02.2010', '103,7'], ...]
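The numbers use a comma as the decimal separator, so one extra step converts them to floats. A small sketch assuming yx_points is exactly the list above:

# turn ['02.2015', '155,1'] style pairs into (month, float) tuples
parsed_points = [(month, float(value.replace(',', '.'))) for month, value in yx_points]
print(parsed_points[:3])  # e.g. [('12.2009', 100.0), ('01.2010', 101.8), ('02.2010', 103.7)]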
from bs4 import BeautifulSoup
import requests
import re
#First get all the text from the url.
url="https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
response = requests.get(url)
html = response.text
#Find all the tags in which the data is stored.
soup = BeautifulSoup(html, 'lxml')
texts = soup.findAll("rect")
final = []
for each in texts:
    names = each.get('onmouseover')
    try:
        q = re.findall(r"'(.*?)'", names)
        final.append(q[0])
    except Exception as e:
        print(e)
#The details are appended to the final variable
I am somewhat new to Python and can't for the life of me figure out why the following code isn’t pulling the element I am trying to get.
Here is my code:
for player in all_players:
    player_first, player_last = player.split()
    player_first = player_first.lower()
    player_last = player_last.lower()
    first_name_letters = player_first[:2]
    last_name_letters = player_last[:5]

    player_url_code = '/{}/{}{}01'.format(last_name_letters[0], last_name_letters, first_name_letters)
    player_url = 'https://www.basketball-reference.com/players' + player_url_code + '.html'
    print(player_url)  # test

    req = urlopen(player_url)
    soup = bs.BeautifulSoup(req, 'lxml')
    wrapper = soup.find('div', id='all_advanced_pbp')
    table = wrapper.find('div', class_='table_outer_container')
    for td in table.find_all('td'):
        player_pbp_data.append(td.get_text())
Currently returning:
--> for td in table.find_all('td'):
player_pbp_data.append(td.get_text()) #if this works, would like to
AttributeError: 'NoneType' object has no attribute 'find_all'
Note: iterating through children of the wrapper object returns:
<div class="table_outer_container"> as part of the tree.
Thanks!
Make sure that table contains the data you expect.
For example https://www.basketball-reference.com/players/a/abdulka01.html doesn't seem to contain a div with id='all_advanced_pbp'
Try to explicitly pass the html instead:
bs.BeautifulSoup(the_html, 'html.parser')
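As a defensive variant of the question's loop body (just to avoid the AttributeError when a page lacks that div; it doesn't explain why the table div shows up among wrapper's children yet wrapper.find returns None, but it keeps the loop from crashing):

req = urlopen(player_url)
soup = bs.BeautifulSoup(req, 'lxml')

wrapper = soup.find('div', id='all_advanced_pbp')
if wrapper is None:
    print('no all_advanced_pbp div on', player_url)  # skip pages without that section
else:
    table = wrapper.find('div', class_='table_outer_container')
    if table is not None:
        for td in table.find_all('td'):
            player_pbp_data.append(td.get_text())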
I tried to extract data from the URL you gave but did not get the full DOM. I then tried accessing the page in a browser with and without JavaScript enabled; the website needs JavaScript to load some data, but pages like the player lists do not. The simple way to get dynamic data is to use Selenium.
This is my test code
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
player_pbp_data = []
def get_list(t="a"):
with requests.Session() as se:
url = "https://www.basketball-reference.com/players/{}/".format(t)
req = se.get(url)
soup = BeautifulSoup(req.text,"lxml")
with open("a.html","wb") as f:
f.write(req.text.encode())
table = soup.find("div",class_="table_wrapper setup_long long")
players = {player.a.text:"https://www.basketball-reference.com"+player.a["href"] for player in table.find_all("th",class_="left ")}
def get_each_player(player_url="https://www.basketball-reference.com/players/a/abdulta01.html"):
    with webdriver.Chrome() as ph:
        ph.get(player_url)
        text = ph.page_source
    '''
    with requests.Session() as se:
        text = se.get(player_url).text
    '''
    soup = BeautifulSoup(text, 'lxml')
    try:
        wrapper = soup.find('div', id='all_advanced_pbp')
        table = wrapper.find('div', class_='table_outer_container')
        for td in table.find_all('td'):
            player_pbp_data.append(td.get_text())
    except Exception as e:
        print("This page does not contain pbp")
get_each_player()
I'm learning to scrape text from the web. I've written the following function:
from bs4 import BeautifulSoup
import requests
def get_url(source_url):
    r = requests.get(source_url)
    data = r.text
    # extract HTML for parsing
    soup = BeautifulSoup(data, 'html.parser')
    # get H3 tags with class ...
    h3list = soup.findAll("h3", {"class": "entry-title td-module-title"})
    # create data structure to store links in
    ulist = []
    # pull links from each article heading
    for href in h3list:
        ulist.append(href.a['href'])
    return ulist
I am calling this from a separate file...
from print1 import get_url
ulist = get_url("http://www.startupsmart.com.au/")
print(ulist[3])
The problem is that the CSS selector I am using is quite specific to the site I am parsing, so the function is a bit 'brittle'. I want to pass the CSS selector as an argument to the function.
If I add a parameter to the function definition
def get_url(source_url, css_tag):
and try to pass "h3", { "class" : "entry-title td-module-title" },
it errors out with:
TypeError: get_url() takes exactly 1 argument (2 given)
I tried escaping all the quotes but it still doesn't work.
I'd really appreciate some help. I can't find a previous answer to this one.
Here's a version that works:
from bs4 import BeautifulSoup
import requests
def get_url(source_url, tag_name, attrs):
    r = requests.get(source_url)
    data = r.text
    # extract HTML for parsing
    soup = BeautifulSoup(data, 'html.parser')
    # get tags matching the given name and attributes
    h3list = soup.findAll(tag_name, attrs)
    # create data structure to store links in
    ulist = []
    # pull links from each article heading
    for href in h3list:
        ulist.append(href.a['href'])
    return ulist
ulist = get_url("http://www.startupsmart.com.au/", "h3", {"class": "entry-title td-module-title"})
print(ulist[3])
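If you would rather pass a single CSS selector string instead of a tag name and attribute dict, BeautifulSoup's .select method accepts one directly. A small variant sketch (the selector below is just the CSS form of the same class-based lookup):

from bs4 import BeautifulSoup
import requests

def get_url_by_selector(source_url, css_selector):
    r = requests.get(source_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # .select takes a full CSS selector string and returns the matching tags
    return [h.a['href'] for h in soup.select(css_selector)]

ulist = get_url_by_selector("http://www.startupsmart.com.au/", "h3.entry-title.td-module-title")
print(ulist[3])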