I started off by pulling the page with Selenium and I believe I passed the part of the page I needed to BeautifulSoup correctly using this code:
soup = BeautifulSoup(driver.find_element("xpath", '//*[@id="version_table"]/tbody').get_attribute('outerHTML'), 'html.parser')
Now I can parse using BeautifulSoup
query = soup.find_all("tr", class_=lambda x: x != "hidden*")
print (query)
My problem is that I need to dig deeper than just this one query. For example, I would like to nest this one inside of the first (so the first needs to be true, and then this one):
query2 = soup.find_all("tr", id = "version_new_*")
print (query2)
Logically speaking, this is what I'm trying to do (but I get SyntaxError: invalid syntax):
query = soup.find_all(("tr", class_=lambda x: x != "hidden*") and ("tr", id = "version_new_*"))
print (query)
How do I accomplish this?
As mentioned, without a full example it is hard to help or give a precise answer. However, you could use a CSS selector:
soup.select('tr[id^="version_new_"]:not(.hidden)')
Example
from bs4 import BeautifulSoup
html = '''
<tr id="version_new_1" class="hidden"></tr>
<tr id="version_new_2"></tr>
<tr id="version_new_3" class="hidden"></tr>
<tr id="version_new_4"></tr>
'''
soup = BeautifulSoup(html, 'html.parser')
soup.select('tr[id^="version_new_"]:not(.hidden)')
Output
Will be a ResultSet you could iterate to scrape what you need.
[<tr id="version_new_2"></tr>, <tr id="version_new_4"></tr>]
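Each element of the ResultSet is a Tag whose attributes can be read dict-style; for example, collecting the ids of the matching rows from the sample above:

```python
from bs4 import BeautifulSoup

html = '''
<tr id="version_new_1" class="hidden"></tr>
<tr id="version_new_2"></tr>
<tr id="version_new_3" class="hidden"></tr>
<tr id="version_new_4"></tr>
'''
# html.parser keeps the bare <tr> elements (lxml would drop tr tags
# that are not wrapped in a <table>)
soup = BeautifulSoup(html, 'html.parser')

ids = [row['id'] for row in soup.select('tr[id^="version_new_"]:not(.hidden)')]
print(ids)  # ['version_new_2', 'version_new_4']
```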
Regarding: query = soup.find_all(...) and print (query)
find_all returns an iterable ResultSet, so you can loop over it directly:
for query in soup.find_all(...):
    print(query)
You can pass find_all a lambda function (along with a regex) that is applied to every element, which allows more advanced conditions:
import re

query = soup.find_all(
    lambda tag:
        tag.name == 'tr' and
        'id' in tag.attrs and re.search(r'^version_new_', tag.attrs['id']) and
        ('class' not in tag.attrs or 'hidden' not in tag.attrs['class'])
)
print(query)
For every element in the html, we are checking...
If the tag is a table row (tr)
If the tag has an id and if that id matches the pattern
If the tag either has no class at all, or its class is not hidden
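A quick sanity check of this kind of lambda filter, run against a few made-up rows:

```python
import re
from bs4 import BeautifulSoup

html = '''
<tr id="version_new_1" class="hidden"></tr>
<tr id="version_new_2"></tr>
<tr id="other_row"></tr>
'''
soup = BeautifulSoup(html, 'html.parser')

# keep a <tr> whose id starts with "version_new_" and whose
# class (if it has one) is not "hidden"
rows = soup.find_all(
    lambda tag:
        tag.name == 'tr' and
        'id' in tag.attrs and re.search(r'^version_new_', tag.attrs['id']) and
        ('class' not in tag.attrs or 'hidden' not in tag.attrs['class'])
)
print([r['id'] for r in rows])  # ['version_new_2']
```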
Related
I am using BeautifulSoup to scrape a website. The retrieved resultset looks like this:
<td><span class="I_Want_This_Class_Name"></span><span class="other_name">Text Is Here</span></td>
From here, I want to retrieve the class name "I_Want_This_Class_Name". I can get the "Text Is Here" part no problem, but the class name itself is proving to be difficult.
Is there a way to do this using BeautifulSoup resultset or do I need to convert to a dictionary?
Thank you
from bs4 import BeautifulSoup
doc = '''<td><span class="I_Want_This_Class_Name"></span><span class="other_name">Text Is Here</span></td>
'''
soup = BeautifulSoup(doc, 'html.parser')
res = soup.find('td')
out = {}
for each in res:
    if each.has_attr('class'):
        out[each['class'][0]] = each.text
print(out)
output will be like:
{'I_Want_This_Class_Name': '', 'other_name': 'Text Is Here'}
If you are trying to get the class name for this one result, then I would use the select method on your soup object, calling the class key:
foo_class = soup.select('td>span.I_Want_This_Class_Name')[0]['class'][0]
Note here that the select method does return a list, hence the indexing before the key.
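If you only expect a single match, select_one returns the first matching Tag directly (or None if nothing matches), which avoids the list indexing:

```python
from bs4 import BeautifulSoup

doc = '<td><span class="I_Want_This_Class_Name"></span><span class="other_name">Text Is Here</span></td>'
soup = BeautifulSoup(doc, 'html.parser')

# select_one returns a single Tag instead of a list
span = soup.select_one('td > span.I_Want_This_Class_Name')
print(span['class'][0])  # I_Want_This_Class_Name
```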
I have a problem using XPath to get an inconsistent price list.
Example
<td><span="green">$33.99</span></td>
<td>Out of stock</td>
<td><span="green">$27.99</span></td>
<td><span="green">$35.00</span></td>
How can I get the price inside the span and the "Out of stock" text at the same time? Currently I only get $33.99, and any text that is not inside a span gets skipped, which ruins the ordering.
The failed attempt that I used (updated from @piratefache's solution, Scrapy):
product_prices_tds = response.xpath('//td/')
product_prices = []
for td in product_prices_tds:
    if td.xpath('//span'):
        product_prices = td.xpath('//span/text()').extract()
    else:
        product_prices = td.xpath('//text()').extract()

for n in range(len(product_names)):
    items['price'] = product_prices[n]
    yield items
It's not working because product_prices doesn't get the right text it get from all over the place. Not just inside span or outside as I intended to.
Update
For those who come later: I fixed my code thanks to @piratefache. Here's the corrected snippet for anyone who wants to use it.
product_prices_tds = response.xpath('//td')
product_prices = []
for td in product_prices_tds:
    if td.xpath('span'):
        product_prices.append(td.xpath('span//text()').extract())
    else:
        product_prices.append(td.xpath('text()').extract())

for n in range(len(product_names)):
    items['price'] = product_prices[n]
    yield items
See edit below with Scrapy
Based on your HTML, using the BeautifulSoup library you can get the information this way:
from bs4 import BeautifulSoup
page = """<td><span="green">$33.99</span></td>
<td>Out of stock</td>
<td><span="green">$27.99</span></td>
<td><span="green">$35.00</span></td>"""
soup = BeautifulSoup(page, features="lxml")
tds = soup.body.findAll('td')  # get all td elements
for td in tds:
    # if the td contains a span
    if td.find('span'):
        print(td.find('span').text)
    # if not, just print the inner text (here it's "Out of stock")
    else:
        print(td.text)
Output:
$33.99
Out of stock
$27.99
$35.00
With Scrapy:
import scrapy
page = """<td><span="green">$33.99</span></td>
<td>Out of stock</td>
<td><span="green">$27.99</span></td>
<td><span="green">$35.00</span></td>"""
response = scrapy.Selector(text=page, type="html")
tds = response.xpath('//td')
for td in tds:
    # if the td contains a span
    if td.xpath('span'):
        print(td.xpath('span//text()')[0].extract())
    # if not, just print the inner text (here it's "Out of stock")
    else:
        print(td.xpath('text()')[0].extract())
Output:
$33.99
Out of stock
$27.99
$35.00
XPath solution (from 2.0 upwards), same logic as @piratefache posted before:
for $td in //td
return
  if ($td[span])
  then $td/span/data()
  else $td/data()
Applied on
<root>
<td>
<span>$33.99</span>
</td>
<td>Out of stock</td>
<td>
<span>$27.99</span>
</td>
<td>
<span>$35.00</span>
</td>
</root>
returns
$33.99
Out of stock
$27.99
$35.00
BTW: <span="green"> is not valid XML. Probably an attribute like @color or similar is missing(?)
Here is the website that i'm parsing: http://uniapple.net/usaddress/address.php?address1=501+10th+ave&address2=&city=nyc&state=ny&zipcode=10036&country=US
I would like to be able to find the word that will be in line 39 between the td tags. That line tells me if the address is residential or commercial, which is what I need for my script.
Here's what I have, but I'm getting this error:
AttributeError: 'NoneType' object has no attribute 'find_next'
The code I'm using is:
from bs4 import BeautifulSoup
import urllib
page = "http://uniapple.net/usaddress/address.php?address1=501+10th+ave&address2=&city=nyc&state=ny&zipcode=10036&country=US"
z = urllib.urlopen(page).read()
thesoup = BeautifulSoup(z, "html.parser")
comres = (thesoup.find("th",text=" Residential or ").find_next("td").text)
print(str(comres))
The text argument would not work in this particular case; this is related to how the .string property of an element is calculated. Instead, I would use a search function, where you can call get_text() and check the complete "text" of an element, including the children nodes:
label = thesoup.find(lambda tag: tag and tag.name == "th" and \
"Residential" in tag.get_text())
comres = label.find_next("td").get_text()
print(str(comres))
Prints Commercial.
We can go a little bit further and make a reusable function to get a value by label:
soup = BeautifulSoup(z, "html.parser")
def get_value_by_label(soup, label):
    label = soup.find(lambda tag: tag and tag.name == "th" and label in tag.get_text())
    return label.find_next("td").get_text(strip=True)
print(get_value_by_label(soup, "Residential"))
print(get_value_by_label(soup, "City"))
Prints:
Commercial
NYC
All you are missing is a bit of housekeeping:
ths = thesoup.find_all("th")
for th in ths:
    if 'Residential or' in th.text:
        comres = th.find_next("td").text
        print(str(comres))
>> Commercial
You'll need to use a regular expression as your text field, like re.compile('Residential or'), rather than a string.
This was working for me. I had to iterate over the results provided, though if you only expect a single result per page you could swap find_all for find:
for r in thesoup.find_all(text=re.compile('Residential or')):
    print(r.find_next('td').text)
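Putting that together as a self-contained sketch (the table markup here is a simplified stand-in for the real page):

```python
import re
from bs4 import BeautifulSoup

# minimal stand-in for the page's markup
html = '<table><tr><th> Residential or </th><td>Commercial</td></tr></table>'
thesoup = BeautifulSoup(html, 'html.parser')

results = []
# a compiled regex matches the th's string content where a plain string would not
for r in thesoup.find_all(text=re.compile('Residential or')):
    results.append(r.find_next('td').text)
print(results)  # ['Commercial']
```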
Python newbie here. Python 2.7 with beautifulsoup 3.2.1.
I'm trying to scrape a table from a simple page. I can easily get it to print, but I can't get it to return to my view function.
The following works:
@app.route('/process')
def process():
    queryURL = 'http://example.com'
    br.open(queryURL)
    html = br.response().read()
    soup = BeautifulSoup(html)
    table = soup.find("table")
    print table
    return 'All good'
I can also return html successfully. But when I try to return table instead of return 'All good' I get the following error:
TypeError: ResultSet object is not an iterator
I also tried:
br.open(queryURL)
html = br.response().read()
soup = BeautifulSoup(html)
table = soup.find("table")
out = []
for row in table.findAll('tr'):
    colvals = [col.text for col in row.findAll('td')]
    out.append('\t'.join(colvals))
return table
With no success. Any suggestions?
You're trying to return an object without actually getting its text, so return table.text should be what you are looking for. Full modified code:
def process():
    queryURL = 'http://example.com'
    br.open(queryURL)
    html = br.response().read()
    soup = BeautifulSoup(html)
    table = soup.find("table")
    return table.text
EDIT:
Since I understand now that you want the HTML code that forms the site instead of the values, you can do something like this example I made:
import urllib

url = urllib.urlopen('http://www.xpn.org/events/concert-calendar')
htmldata = url.readlines()
url.close()

for tag in htmldata:
    if '<th' in tag:
        print tag
    if '<tr' in tag:
        print tag
    if '<thead' in tag:
        print tag
    if '<tbody' in tag:
        print tag
    if '<td' in tag:
        print tag
The reason you can't do this with BeautifulSoup (at least not to my knowledge) is that BeautifulSoup is more for parsing or printing the HTML in a nice-looking manner. You can just do what I did and have a for loop go through the HTML code, printing each line that contains one of the tags.
If you want to store the output in a list to use later, you would do something like:
htmlCodeList = []
for tag in htmldata:
    if '<th' in tag:
        htmlCodeList.append(tag)
    if '<tr' in tag:
        htmlCodeList.append(tag)
    if '<thead' in tag:
        htmlCodeList.append(tag)
    if '<tbody' in tag:
        htmlCodeList.append(tag)
    if '<td' in tag:
        htmlCodeList.append(tag)
This saves each matching HTML line as a new element of the list, so the first matching line would be index 0, the next index 1, etc.
After @Heinst pointed out that I was trying to return an Object and not a string, I also found a more elegant solution to convert the BeautifulSoup Object into a string and return it:
return str(table)
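The difference is easy to see on a small example: table.text gives just the text content, while str(table) keeps the markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<table><tr><td>cell</td></tr></table>', 'html.parser')
table = soup.find("table")

print(table.text)  # cell
print(str(table))  # <table><tr><td>cell</td></tr></table>
```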
I am trying to get BeautifulSoup to do the following.
I have HTML files which I wish to modify. I am interested in two tags in particular, one which I will call TagA is
<div class ="A">...</div>
and one which I will call TagB
<p class = "B">...</p>
Both tags occur independently throughout the HTML and may themselves contain other tags and be nested inside other tags.
I want to place a marker tag around every TagA whenever it is not immediately followed by TagB so that
<div class="A">...</div> becomes <marker><div class="A">...</div></marker>
But when TagA is followed immediately by TagB, I want the marker Tag to surround them both
so that
<div class="A">...</div><p class="B">...</p>
becomes
<marker><div class="A">...</div><p class="B">...</p></marker>
I can see how to select TagA and enclose it with the marker tag, but when it is followed by TagB I do not know if or how the BeautifulSoup 'selection' can be extended to include the next sibling.
Any help appreciated.
BeautifulSoup does have a "next sibling" feature: find all tags of class A and use a.next_sibling to check whether it is a B tag.
look at the docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways
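A minimal sketch of that check, using made-up markup that mirrors the question's class names:

```python
from bs4 import BeautifulSoup

html = '<div class="A">first</div><p class="B">follows</p><div class="A">alone</div>'
soup = BeautifulSoup(html, 'html.parser')

results = []
for tag_a in soup.find_all('div', class_='A'):
    sib = tag_a.next_sibling  # may be a Tag, a NavigableString, or None
    is_b = getattr(sib, 'name', None) == 'p' and 'B' in sib.get('class', [])
    results.append((tag_a.text, is_b))

print(results)  # [('first', True), ('alone', False)]
```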
I think I was going about this the wrong way by trying to extend the 'selection' from one tag to the following one. Instead I found that the following code, which inserts the outer 'Marker' tag and then moves the A and B tags inside it, does the trick.
I am pretty new to Python so would appreciate advice regarding improvements or snags with the following.
def isTagB(tag):
    # If tag is <p class="B"> return True;
    # if not - or if tag is just a string - return False
    try:
        return tag.name == 'p' and 'B' in tag.get('class', [])
    except AttributeError:
        return False
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<div class = "A"><p><i>more content</i></p></div><div class = "A"><p><i>hello content</i></p></div><p class="B">da <i>de</i> da </p><div class = "fred">not content</div>""", 'html.parser')
for TagA in soup.find_all("div", "A"):
    Marker = soup.new_tag('Marker')
    nexttag = TagA.next_sibling
    # skip over whitespace
    while str(nexttag).isspace():
        nexttag = nexttag.next_sibling
    if isTagB(nexttag):
        TagA.replace_with(Marker)  # put the marker where the A element was
        Marker.insert(1, TagA)
        Marker.insert(2, nexttag)
    else:
        TagA.replace_with(Marker)  # put the marker where the A element was
        Marker.insert(1, TagA)

print(soup)
import urllib
from BeautifulSoup import BeautifulSoup

html = urllib.urlopen("http://ursite.com")  # gives the html response
soup = BeautifulSoup(html)
all_div = soup.findAll("div", attrs={})  # use attrs as a dict for attribute filtering
# example: attrs={'class': "class", 'id': "1234"}
single_div = all_div[0]
# to find a p tag inside single_div
p_tag_obj = single_div.find("p")
You can use obj.findNext(), obj.findAllNext(), obj.findAllPrevious() and obj.findPrevious() to navigate between elements. To get an attribute, you can use obj.get("href"), obj.get("title"), etc.
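A small runnable sketch of the attribute access with get (shown here with bs4, where findNext is spelled find_next; the markup is made up):

```python
from bs4 import BeautifulSoup

html = '<div id="1234"><p>first</p><a href="/home" title="Home">link</a></div>'
soup = BeautifulSoup(html, 'html.parser')

p_tag = soup.find('div', attrs={'id': '1234'}).find('p')
a_tag = p_tag.find_next('a')  # next matching element after the <p>

print(a_tag.get('href'))   # /home
print(a_tag.get('title'))  # Home
print(a_tag.get('rel'))    # None - get returns None for a missing attribute
```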