Folks,
Ive managed to get beautifulsoup to scrape a page with the following
html = response.read()
soup = BeautifulSoup(html)
links = soup.findAll('a')
There are several occurrences of
<A href="javascript:Set_Variables('foo1','bar1''')"onmouseover="javascript: return window.status=''">
<A href="javascript:Set_Variables('foo2','bar2''')"onmouseover="javascript: return window.status=''">
How can I iterate through this and get the foo/bar values?
Thanks
You can use regular expressions to extract variables from href attributes:
import re
from bs4 import BeautifulSoup
data = """
<div>
<table>
<A href="javascript:Set_Variables('foo1','bar1''')" onmouseover="javascript: return window.status=''">
<A href="javascript:Set_Variables('foo2','bar2''')" onmouseover="javascript: return window.status=''">
</table>
</div>
"""
soup = BeautifulSoup(data)
pattern = re.compile(r"javascript:Set_Variables\('(\w+)','(\w+)'")
for a in soup('a'):
match = pattern.search(a['href'])
if match:
print match.groups()
Prints:
('foo1', 'bar1')
('foo2', 'bar2')
Related
HTML Exemple
<html>
<div book="blue" return="abc">
<h4 class="link">www.example.com</h4>
<p class="author">RODRIGO</p>
</html>
Ex1:
url = urllib.request.urlopen(url)
page_soup = soup(url.read(), "html.parser")
res=page_soup.find_all(attrs={"class": ["author","link"]})
for each in res:
print(each)
Result1:
www.example.com
RODRIGO
Ex2:
url = urllib.request.urlopen(url)
page_soup = soup(url.read(), "html.parser")
res=page_soup.find_all(attrs={"book": ["blue"]})
for each in res:
print(each["return")
Result 2:
abc
!!!puzzle!!!
The question I have is how to return the 3 results in a single query?
Result 3
www.example.com
RODRIGO
abc
Example HTML seems to be broken - Assuming the div wrappes the other tags and it is may not the only book you can select all books:
for e in soup.find_all(attrs={"book": ["blue"]}):
print(' '.join(e.stripped_strings),e.get('return'))
Example
from bs4 import BeautifulSoup
html = '''
<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>
'''
soup = BeautifulSoup(html)
for e in soup.find_all(attrs={"book": ["blue"]}):
print(' '.join(e.stripped_strings),e.get('return'))
Output
www.rodrigo.com RODRIGO abc
A more structured example could be:
data = []
for e in soup.select('[book="blue"]'):
data.append({
'link':e.h4.text,
'author':e.select_one('.author').text,
'return':e.get('return')
})
data
Output:
[{'link': 'www.rodrigo.com', 'author': 'RODRIGO', 'return': 'abc'}]
For the case one attribute against many values a regex approach is suggested:
from bs4 import BeautifulSoup
import re
html = """<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>"""
soup = BeautifulSoup(html, 'lxml')
by_clss = soup.find_all(class_=re.compile(r'link|author'))
print(b_clss)
For more flexibility, a custom query function can be passed to find or find_all:
from bs4 import BeautifulSoup
html = """<html>
<div href="blue" return="abc"></div> <!-- div need a closing tag in a html-doc-->
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>"""
def query(tag):
if tag.has_attr('class'):
# tag['class'] is a list. Here assumed that has only one value
return set(tag['class']) <= {'link', 'author'}
if tag.has_attr('book'):
return tag['book'] in {'blue'}
return False
print(soup.find_all(query))
# [<div book="blue" return="abc"></div>, <h4 class="link">www.rodrigo.com</h4>, <p class="author">RODRIGO</p>]
Notice that your html-sample has no closing div-tag. In my second case I added it otherwise the soup... will not taste good.
EDIT
To retrieve elements which satisfies a simultaneous conditions on attributes the query could look like this:
def query_by_attrs(**tag_kwargs):
# tag_kwargs: {attr: [val1, val2], ...}
def wrapper(tag):
for attr, values in tag_kwargs.items():
if tag.has_attr(attr):
# check if tag has multi-valued attributes (class,...)
if not isinstance((tag_attr:=tag[attr]), list): # := for python >=3.8
tag_attr = (tag_attr,) # as tuple
return bool(set(tag_attr).intersection(values)) # false if empty set
return wrapper
q_data = {'class': ['link', 'author'], 'book': ['blue']}
results = soup.find_all(query_by_attrs(**q_data))
print(results)
Extract All link from WebSite
import requests
from bs4 import BeautifulSoup
url = 'https://mixkit.co/free-stock-music/hip-hop/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
print(link.get('href'))
I'm trying web scraping for the first time and I'm using BeautifulSoup to gather bits of information from a website. I'm trying to get all the elements which has one class but not another. For example:
from bs4 import BeautifulSoup
html = """
<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""
soup = BeautifulSoup(html)
In this example, I want to get all the elements with the something class. However, when finding all elements containing that class I also get the element containing the somethingelse class, and I do not want these.
The code I'm using to get it is:
results = soup.find_all("a", {"class": "something"})
Any help is appreciated! Thanks.
This will work fine:
from bs4 import BeautifulSoup
text = '''<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>'''
soup = BeautifulSoup(text, 'html.parser')
r1 = soup.find_all("a", {"class": "something"})
r2 = soup.find_all("a", {"class": "somethingelse"})
for item in r2:
if item in r1:
r1.remove(item)
print(r1)
Output
[<a class="something">Information I want</a>]
For extracting the text present in the tags, just add this lines:
for item in r1:
print(item.text)
Output
Information I want
For this task, you can find elements by lambda function, for example:
from bs4 import BeautifulSoup
html_doc = """<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")
a = soup.find(
lambda tag: tag.name == "a" and tag.get("class", []) == ["something"]
)
print(a)
Prints:
<a class="something">Information I want</a>
Or: specify "class" as a list:
a = soup.find("a", {"class": ["something"]})
print(a)
Prints:
<a class="something">Information I want</a>
EDIT:
For filtering type-icon type-X:
from bs4 import BeautifulSoup
html_doc = """
<a class="type-icon type-1">Information I want 1</a>
<a class="type-icon type-1 type-cell type-abbr">Information I don't want</a>
<a class="type-icon type-2">Information I want 2</a>
<a class="type-icon type-2 type-cell type-abbr">Information I don't want</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")
my_types = ["type-icon", "type-1", "type-2"]
def my_filter(tag):
if tag.name != "a":
return False
c = tag.get("class", [])
return "type-icon" in c and not set(c).difference(my_types)
a = soup.find_all(my_filter)
print(a)
Prints:
[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]
Or extract tags you don't want first:
soup = BeautifulSoup(html_doc, "html.parser")
# extract tags I don't want:
for t in soup.select(".type-cell.type-abbr"):
t.extract()
print(soup.select(".type-icon.type-1, .type-icon.type-2"))
Prints:
[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]
I am attempting to scrape HTML to create a dictionary that includes a pitchers name and his handed-ness. The data-tags are buried--so far I've only been able to collect the pitchers name from the data set. The HTML output (for each player) is as follows:
<div class="pitcher players">
<input name="import-data" type="hidden" value="%5B%7B%22slate_id%22%3A20190%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210893103%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20192%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210894893%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20193%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210895115%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%5D"/>
<a class="player-popup" data-url="https://rotogrinders.com/players/johnny-cueto-11193?site=draftkings" href="https://rotogrinders.com/players/johnny-cueto-11193">Johnny Cueto</a>
<span class="meta stats">
<span class="stats">
R
</span>
<span class="salary" data-role="salary" data-salary="$11.8K">
$11.8K
</span>
<span class="fpts" data-fpts="14.96" data-product="56" data-role="authorize" title="Projected Points">14.96</span>
I've tinkered and and coming up empty--I'm sure I'm overthinking this. Here is the code I have so far:
import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = [soup.find_all("div", {'class':'pitcher players'}]
What's the best way to loop through the results set for the more granular data tag information I need?
I need the text from the HTML beginning with , and handed-ness from the tag
Optimally, I would have a dictionary with the following:
{Johnny Cueto : R, Player 2 : L, ...}
import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = soup.find_all("div", {'class': 'pitcher players'})
dicti={}
for j in results:
dicti[j.a.text]=j.select(".stats")[1].text.strip("\n").strip()
just use select or find function of the founded element,and you will be able to iterate
I have a div whose id is "img-cont"
<div class="img-cont-box" id="img-cont" style='background-image: url("http://example.com/example.jpg");'>
I want to extract the url in background-image using beautiful soup.How can I do it?
You can you find_all or find for the first match.
import re
soup = BeautifulSoup(html_str)
result = soup.find('div',attrs={'id':'img-cont','style':True})
if result is not None:
url = re.findall('\("(http.*)"\)',result['style']) # return a list.
Try this:
import re
from bs4 import BeautifulSoup
html = '''\
<div class="img-cont-box" \
id="img-cont" \
style='background-image: url("http://example.com/example.jpg");'>\
'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', id='img-cont')
print(re.search(r'url\("(.+)"\)', div['style']).group(1))
How to extract data that is inside <p> paragraph tags and <li> which are under a named <div> class?
Use the functions find() and find_all():
import requests
from bs4 import BeautifulSoup
url = '...'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', {'class':'class-name'})
ps = div.find_all('p')
lis = div.find_all('li')
# print the content of all <p> tags
for p in ps:
print(p.text)
# print the content of all <li> tags
for li in lis:
print(li.text)