python BeautifulSoup finding certain things in a table

Folks,
I've managed to get BeautifulSoup to scrape a page with the following:
html = response.read()
soup = BeautifulSoup(html)
links = soup.findAll('a')
There are several occurrences of
<A href="javascript:Set_Variables('foo1','bar1''')"onmouseover="javascript: return window.status=''">
<A href="javascript:Set_Variables('foo2','bar2''')"onmouseover="javascript: return window.status=''">
How can I iterate through this and get the foo/bar values?
Thanks

You can use regular expressions to extract variables from href attributes:
import re
from bs4 import BeautifulSoup
data = """
<div>
<table>
<A href="javascript:Set_Variables('foo1','bar1''')" onmouseover="javascript: return window.status=''">
<A href="javascript:Set_Variables('foo2','bar2''')" onmouseover="javascript: return window.status=''">
</table>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"javascript:Set_Variables\('(\w+)','(\w+)'")
for a in soup('a'):
    match = pattern.search(a['href'])
    if match:
        print(match.groups())
Prints:
('foo1', 'bar1')
('foo2', 'bar2')
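If some anchors on the page have no href at all, a['href'] raises a KeyError. A minimal sketch, assuming markup like the above plus two extra anchors added here for illustration, passes the compiled pattern as the href filter so that only matching anchors are visited. The name pairs is illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Sample markup: the two anchors from the question plus two made-up ones
# (one without an href, one with an unrelated href).
data = """
<a name="top">no href here</a>
<A href="javascript:Set_Variables('foo1','bar1''')" onmouseover="javascript: return window.status=''">one</A>
<A href="javascript:Set_Variables('foo2','bar2''')" onmouseover="javascript: return window.status=''">two</A>
<a href="/plain/link">unrelated</a>
"""

soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"Set_Variables\('(\w+)','(\w+)'")

# A compiled pattern used as an attribute filter keeps only the anchors
# whose href matches it; anchors without an href are skipped.
pairs = []
for a in soup.find_all("a", href=pattern):
    pairs.append(pattern.search(a["href"]).groups())

print(pairs)
```

Collecting the tuples in a list rather than printing them makes it easy to turn them into a dict afterwards with dict(pairs).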

Related

Python - BeautifulSoup - How to return two different elements or more, with different attributes?

HTML example:
<html>
<div book="blue" return="abc">
<h4 class="link">www.example.com</h4>
<p class="author">RODRIGO</p>
</html>
Ex1:
url = urllib.request.urlopen(url)
page_soup = soup(url.read(), "html.parser")
res = page_soup.find_all(attrs={"class": ["author", "link"]})
for each in res:
    print(each)
Result1:
www.example.com
RODRIGO
Ex2:
url = urllib.request.urlopen(url)
page_soup = soup(url.read(), "html.parser")
res = page_soup.find_all(attrs={"book": ["blue"]})
for each in res:
    print(each["return"])
Result 2:
abc
!!!puzzle!!!
The question I have is how to return the 3 results in a single query?
Result 3
www.example.com
RODRIGO
abc

The example HTML seems to be broken (the div is never closed). Assuming the div wraps the other tags, and that it may not be the only book, you can select all matching books:
for e in soup.find_all(attrs={"book": ["blue"]}):
    print(' '.join(e.stripped_strings), e.get('return'))
Example
from bs4 import BeautifulSoup
html = '''
<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')
for e in soup.find_all(attrs={"book": ["blue"]}):
    print(' '.join(e.stripped_strings), e.get('return'))
Output
www.rodrigo.com RODRIGO abc
A more structured example could be:
data = []
for e in soup.select('[book="blue"]'):
    data.append({
        'link': e.h4.text,
        'author': e.select_one('.author').text,
        'return': e.get('return')
    })
print(data)
Output:
[{'link': 'www.rodrigo.com', 'author': 'RODRIGO', 'return': 'abc'}]

When one attribute should match any of several values, a regular expression can be used:
from bs4 import BeautifulSoup
import re
html = """<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>"""
soup = BeautifulSoup(html, 'lxml')
by_clss = soup.find_all(class_=re.compile(r'link|author'))
print(by_clss)
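As an alternative to the regex, a CSS selector group can express "class link or class author". This is a minimal sketch, assuming the same sample HTML with the div closed and using the built-in html.parser instead of lxml. Unlike a pattern such as link|author, which re.search would also find inside a longer class name like hyperlink, the selector matches whole class names only.

```python
from bs4 import BeautifulSoup

# Same sample as above, with the closing </div> added (an assumption).
html = """<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</div>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# A comma in a CSS selector means "either of these"; results come back
# in document order.
by_class = soup.select(".link, .author")
texts = [tag.get_text(strip=True) for tag in by_class]
print(texts)
```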
For more flexibility, a custom query function can be passed to find or find_all:
from bs4 import BeautifulSoup
html = """<html>
<div book="blue" return="abc"></div> <!-- a div needs a closing tag in an HTML document -->
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>"""
soup = BeautifulSoup(html, 'html.parser')
def query(tag):
    if tag.has_attr('class'):
        # tag['class'] is a list; this assumes it holds a single value
        return set(tag['class']) <= {'link', 'author'}
    if tag.has_attr('book'):
        return tag['book'] in {'blue'}
    return False
print(soup.find_all(query))
# [<div book="blue" return="abc"></div>, <h4 class="link">www.rodrigo.com</h4>, <p class="author">RODRIGO</p>]
Notice that your HTML sample has no closing div tag. In my second example I added it; otherwise the soup... will not taste good.
EDIT
To query several attributes in one call, each with its own list of accepted values, the query could look like this (the first queried attribute that is present on a tag decides the match):
def query_by_attrs(**tag_kwargs):
    # tag_kwargs: {attr: [val1, val2], ...}
    def wrapper(tag):
        for attr, values in tag_kwargs.items():
            if tag.has_attr(attr):
                # normalise single-valued attributes so they compare like multi-valued ones (class, ...)
                if not isinstance((tag_attr := tag[attr]), list):  # := needs Python >= 3.8
                    tag_attr = (tag_attr,)  # as tuple
                return bool(set(tag_attr).intersection(values))  # False if empty set
        return False  # none of the queried attributes is present
    return wrapper
q_data = {'class': ['link', 'author'], 'book': ['blue']}
results = soup.find_all(query_by_attrs(**q_data))
print(results)
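The wrapper above returns as soon as it meets the first queried attribute on a tag, so it behaves like an "or" across attributes. If every queried attribute must match on the same tag, a stricter variant could look like the sketch below. The name query_all_attrs and the sample markup are hypothetical, chosen only to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two divs share a class but differ in "book".
html = """
<div book="blue" class="featured" return="abc"></div>
<div book="red" class="featured" return="def"></div>
<h4 class="link">www.rodrigo.com</h4>
"""

def query_all_attrs(**wanted):
    # wanted: {attr: [accepted values, ...], ...}
    def matcher(tag):
        for attr, values in wanted.items():
            if not tag.has_attr(attr):
                return False  # a required attribute is missing
            found = tag[attr]
            if not isinstance(found, list):
                found = [found]  # single-valued attribute, e.g. book
            if not set(found) & set(values):
                return False  # attribute present, but no accepted value
        return True  # every queried attribute matched
    return matcher

soup = BeautifulSoup(html, "html.parser")
q_data = {"class": ["featured"], "book": ["blue"]}
results = soup.find_all(query_all_attrs(**q_data))
print(results)
```

Only the first div satisfies both conditions; the second fails on book and the h4 has neither queried class nor a book attribute.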

Extract all links from a website:
import requests
from bs4 import BeautifulSoup
url = 'https://mixkit.co/free-stock-music/hip-hop/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    href = link.get('href')
    urls.append(href)  # keep the links as well as printing them
    print(href)
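The hrefs collected this way are often relative, and anchors without an href yield None. A minimal offline sketch of one way to handle both cases follows. The base address and the sample markup are made up for illustration, and urljoin comes from the standard library.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://example.com/music/"  # hypothetical page address
html = """
<a href="/about">About</a>
<a href="track-1.html">Track 1</a>
<a href="https://other.example.org/x">External</a>
<a name="anchor-without-href">Skip me</a>
"""

soup = BeautifulSoup(html, "html.parser")

urls = []
# href=True keeps only anchors that actually carry an href attribute.
for link in soup.find_all("a", href=True):
    # urljoin resolves relative links against the page address and
    # leaves absolute links unchanged.
    urls.append(urljoin(base_url, link["href"]))

print(urls)
```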

Trying to find all <a> elements without a specific class

I'm trying web scraping for the first time and I'm using BeautifulSoup to gather bits of information from a website. I'm trying to get all the elements which have one class but not another. For example:
from bs4 import BeautifulSoup
html = """
<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""
soup = BeautifulSoup(html)
In this example, I want to get all the elements with the something class. However, when finding all elements containing that class I also get the element containing the somethingelse class, and I do not want these.
The code I'm using to get it is:
results = soup.find_all("a", {"class": "something"})
Any help is appreciated! Thanks.

This will work fine:
from bs4 import BeautifulSoup
text = '''<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>'''
soup = BeautifulSoup(text, 'html.parser')
r1 = soup.find_all("a", {"class": "something"})
r2 = soup.find_all("a", {"class": "somethingelse"})
for item in r2:
    if item in r1:
        r1.remove(item)
print(r1)
Output
[<a class="something">Information I want</a>]
To extract the text inside the tags, just add these lines:
for item in r1:
    print(item.text)
Output
Information I want
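The same filter can also be written as a single CSS selector, which avoids building two result lists and removing items from one of them. This is a sketch, assuming a Beautiful Soup version whose select() supports the :not() pseudo-class (4.7 and later, via soupsieve). The variable name wanted is illustrative.

```python
from bs4 import BeautifulSoup

text = '''<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>'''

soup = BeautifulSoup(text, "html.parser")

# "a.something:not(.somethingelse)" reads: <a> tags that have the class
# "something" and do not have the class "somethingelse".
wanted = soup.select("a.something:not(.somethingelse)")
print([a.get_text() for a in wanted])
```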

For this task, you can find elements with a lambda function, for example:
from bs4 import BeautifulSoup
html_doc = """<a class="something">Information I want</a>
<a class="something somethingelse">Information I don't want</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")
a = soup.find(
    lambda tag: tag.name == "a" and tag.get("class", []) == ["something"]
)
print(a)
Prints:
<a class="something">Information I want</a>
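Or specify "class" as a list. Note that the list form matches any tag carrying one of the listed classes, so it returns the wanted tag here only because find stops at the first match: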
a = soup.find("a", {"class": ["something"]})
print(a)
Prints:
<a class="something">Information I want</a>
EDIT:
For filtering type-icon type-X:
from bs4 import BeautifulSoup
html_doc = """
<a class="type-icon type-1">Information I want 1</a>
<a class="type-icon type-1 type-cell type-abbr">Information I don't want</a>
<a class="type-icon type-2">Information I want 2</a>
<a class="type-icon type-2 type-cell type-abbr">Information I don't want</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")
my_types = ["type-icon", "type-1", "type-2"]
def my_filter(tag):
    if tag.name != "a":
        return False
    c = tag.get("class", [])
    return "type-icon" in c and not set(c).difference(my_types)
a = soup.find_all(my_filter)
print(a)
Prints:
[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]
Or extract tags you don't want first:
soup = BeautifulSoup(html_doc, "html.parser")
# extract tags I don't want:
for t in soup.select(".type-cell.type-abbr"):
    t.extract()
print(soup.select(".type-icon.type-1, .type-icon.type-2"))
Prints:
[<a class="type-icon type-1">Information I want 1</a>, <a class="type-icon type-2">Information I want 2</a>]

Scraping multiple data tags from HTML using beautiful Soup

I am attempting to scrape HTML to create a dictionary that includes a pitcher's name and his handedness. The data tags are buried; so far I've only been able to collect the pitcher's name from the data set. The HTML output (for each player) is as follows:
<div class="pitcher players">
<input name="import-data" type="hidden" value="%5B%7B%22slate_id%22%3A20190%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210893103%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20192%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210894893%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20193%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210895115%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%5D"/>
<a class="player-popup" data-url="https://rotogrinders.com/players/johnny-cueto-11193?site=draftkings" href="https://rotogrinders.com/players/johnny-cueto-11193">Johnny Cueto</a>
<span class="meta stats">
<span class="stats">
R
</span>
<span class="salary" data-role="salary" data-salary="$11.8K">
$11.8K
</span>
<span class="fpts" data-fpts="14.96" data-product="56" data-role="authorize" title="Projected Points">14.96</span>
I've tinkered and am coming up empty. I'm sure I'm overthinking this. Here is the code I have so far:
import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = soup.find_all("div", {'class': 'pitcher players'})
What's the best way to loop through the results set for the more granular data tag information I need?
I need the text from the HTML beginning with <a class="player-popup", and the handedness from the <span class="stats"> tag.
Optimally, I would have a dictionary with the following:
{Johnny Cueto : R, Player 2 : L, ...}

import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = soup.find_all("div", {'class': 'pitcher players'})
dicti = {}
for j in results:
    # the second ".stats" match is the inner <span class="stats"> holding the handedness
    dicti[j.a.text] = j.select(".stats")[1].text.strip()
Just call select or find on each element that was found, and you can iterate over the results.
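Indexing into the list of .stats matches depends on the order of the spans. A slightly more explicit variant, sketched offline against a trimmed copy of the HTML from the question, selects the inner span as a direct child of the "meta" span and skips players where either piece is missing. The closing tags in the sample and the name handedness are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample from the question; closing tags are assumed.
html = """
<div class="pitcher players">
  <a class="player-popup" href="https://rotogrinders.com/players/johnny-cueto-11193">Johnny Cueto</a>
  <span class="meta stats">
    <span class="stats">
      R
    </span>
    <span class="salary" data-role="salary" data-salary="$11.8K">$11.8K</span>
  </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

handedness = {}
for div in soup.select("div.pitcher.players"):
    name_tag = div.select_one("a.player-popup")
    # ".meta > .stats" picks the span with class "stats" that is a direct
    # child of the span with class "meta".
    hand_tag = div.select_one(".meta > .stats")
    if name_tag is not None and hand_tag is not None:
        handedness[name_tag.get_text(strip=True)] = hand_tag.get_text(strip=True)

print(handedness)
```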

how to extract an attribute value of div using BeautifulSoup

I have a div whose id is "img-cont"
<div class="img-cont-box" id="img-cont" style='background-image: url("http://example.com/example.jpg");'>
I want to extract the URL in background-image using Beautiful Soup. How can I do it?

You can use find_all, or find for the first match.
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_str, 'html.parser')  # html_str holds your HTML
result = soup.find('div', attrs={'id': 'img-cont', 'style': True})
if result is not None:
    url = re.findall(r'\("(http.*)"\)', result['style'])  # returns a list

Try this:
import re
from bs4 import BeautifulSoup
html = '''\
<div class="img-cont-box" \
id="img-cont" \
style='background-image: url("http://example.com/example.jpg");'>\
'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', id='img-cont')
print(re.search(r'url\("(.+)"\)', div['style']).group(1))
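CSS allows the argument of url() to be wrapped in double quotes, single quotes, or no quotes at all, and the pattern above only covers the double-quoted form. A small sketch of a more tolerant helper follows. The function name background_url is made up for illustration, and it works on the style string alone, however that string was obtained.

```python
import re

def background_url(style):
    """Return the first url(...) argument in a style string, or None."""
    # Optional whitespace and optional quotes around the address; the lazy
    # group stops at the first closing quote or parenthesis.
    match = re.search(r"""url\(\s*['"]?(.*?)['"]?\s*\)""", style)
    return match.group(1) if match else None

print(background_url('background-image: url("http://example.com/example.jpg");'))
print(background_url("background: url(http://example.com/a.png) no-repeat"))
print(background_url("color: red"))
```

With Beautiful Soup, the style string would come from div['style'] as in the answers above.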

Web crawling using python beautifulsoup

How to extract data that is inside <p> paragraph tags and <li> which are under a named <div> class?

Use the functions find() and find_all():
import requests
from bs4 import BeautifulSoup
url = '...'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', {'class':'class-name'})
ps = div.find_all('p')
lis = div.find_all('li')
# print the content of all <p> tags
for p in ps:
    print(p.text)
# print the content of all <li> tags
for li in lis:
    print(li.text)
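Two details may matter in practice: find() returns None when no div matches, and two separate find_all() calls lose the relative order of paragraphs and list items. A minimal offline sketch addressing both follows. The sample markup is invented, and class-name stands in for the real class as in the answer above.

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration.
html = """
<div class="class-name">
  <p>First paragraph</p>
  <ul><li>Item one</li><li>Item two</li></ul>
  <p>Second paragraph</p>
</div>
<p>Outside the div</p>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="class-name")

texts = []
if div is not None:  # find() returns None when nothing matches
    # A list of tag names matches any of them, in document order.
    for tag in div.find_all(["p", "li"]):
        texts.append(tag.get_text(strip=True))

print(texts)
```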

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
