Python beautifulsoup select CSS issue - python

Learning scrapy and I'm trying to use it to get some specific topics in a forum.
In the forum the infomation I need is stored like:
<tbody id="threadnumber">
<tr>
<th class="new">
<em>[topic]</em>
postname
</th>
<td class="by">
<a something to show the poster and time>**</a>
</td>
<td class="num">
<a something to show the numbers of read and replys>**</a>
</td>
<td class="by">
<a something to show the last replyer and time>**</a>
</td>
</tr>
</tbody>
<tbody id="threadnumber">#next thread
<tr>....
</tr>
</tbody>
Is there any method to get the postname in the second a tag for a specific topic whose unique topicid is stored in the first a tag. Should I use sibling?
For example I get
[NEWS]
news1
[NEWS]
news2
[NEWS]
news3
[PIC]
picture1
for input.
And I want to get an output only include "NEWS" topic like['news1','news2','news3']
Thanks for your help!

You can use BeautifulSoup to find all tags with class="post". Then for each tag, you search a <a> tag in a descendant from its parent, and test whether its text is the topic you are interested in. If true, you add the postname to a result list. Code could be:
def findposts(soup, topic):
'''Finds all postname associated to topic in a BeautifulSoup element'''
posts = [] # initialize an empty result list
# search postnames by class
for postname in soup.findAll('a', attrs = {'class': 'post'}):
# find associated topic in immediate parent
if postname.findParent().find('a').text == topic:
posts.append(postname.text) # Ok add to result list
return posts
With your example data, you could do:
soup = BeautifulSoup('data', 'html.parser')
print(findpost(soup, 'topic')
and the result would be as expected:
['postname']

Related

How to extract HTML table following a specific heading?

I am using BeautifulSoup to parse HTML files. I have a HTML file similar to this:
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key B</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
I want to extract the string "I WANT THIS STRING". The perfect solution would be to get the first table following the h3 heading called "THE GOOD STUFF". I have no idea how to do this with BeautifulSoup - I only know how to extract a table with a specific class, or a table nested within some particular tag, but not following a particular tag.
I think a fallback solution could make use of the string "Key C", assuming it's unique (it almost certainly is) and appears in only that one table, but I'd feel better with going for the specific h3 heading.
Following the logic of #Zroq's answer on another question, this code will give you the table following your defined header ("THE GOOD STUFF"). Please note I just put all your html in the variable called "html".
from bs4 import BeautifulSoup, NavigableString, Tag
soup=BeautifulSoup(html, "lxml")
for header in soup.find_all('h3', text=re.compile('THE GOOD STUFF')):
nextNode = header
while True:
nextNode = nextNode.nextSibling
if nextNode is None:
break
if isinstance(nextNode, Tag):
if nextNode.name == "h3":
break
print(nextNode)
Output:
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
Cheers!
The docs explain that if you don't want to use find_all, you can do this:
for sibling in soup.a.next_siblings:
print(repr(sibling))
I am sure there are many ways to this more efficiently, but here is what I can think about right now:
from bs4 import BeautifulSoup
import os
os.chdir('/Users/Downloads/')
html_data = open("/Users/Downloads/train.html",'r').read()
soup = BeautifulSoup(html_data, 'html.parser')
all_td = soup.find_all("td")
flag = 'no_print'
for td in all_td:
if flag == 'print':
print(td.text)
break
if td.text == 'Key C':
flag = 'print'
Output:
I WANT THIS STRING

bs4 parent attrs python

I'm just starting coding in Python and my friend asked me for application finding specific data on the web, representing it nicely.
I already found pretty web, where the data is contained, I can find basic info, but then the challenge is to get deeper.
While using BS4 in Python 3.4 I have reached exemplary code:
<tr class=" " somethingc1="" somethingc2="" somethingc3="" data-something="1" something="1something6" something_id="6something0">
<td class="text-center td_something">
<div>
Super String of Something
</div>
</td>
<td class="text-center">08/26 15:00</td>
<td class="text-center something_status">
<span class="something_status_something">Full</span>
</td>
</tr>
<tr class=" " somethingc1="" somethingc2="" somethingc3="" data-something="0" something="1something4" something_id="6something7">
<td class="text-center td_something">
<div>
Super String of Something
</div>
</td>
<td class="text-center">05/26 15:00</td>
<td class="text-center something_status">
<span class="something_status_something"></span>
</td>
</tr>
What I want to do now is finding the date string of but only if data-something="1" of parent and not if data-something="0"
I can scrap all dates by :
soup.find_all(lambda tag: tag.name == 'td' and tag.get('class') == ['text-center'] and not tag.has_attr('style'))
but it does not check parent. That is why I tried:
def KieMeWar(tag):
return tag.name == 'td' and tag.parent.name == 'tr' and tag.parent.attrs == {"data-something": "1"} #and tag.get('class') == ['text-center'] and not tag.has_attr('style')
soup.find_all(KieMeWar)
The result is an empty set. What is wrong or how to reach the target I am aiming for with easiest solution?
P.S. This is exemplary part of full code, that is why I use not Style, even though it does not appear here but does so later.
BeautifulSoup's findAll has the attrs kwarg, which is used to find tags with a given attribute
import bs4
soup = bs4.BeautifulSoup(html)
trs = soup.findAll('tr', attrs={'data-something':'1'})
That finds all tr tags with the attribute data-something="1". Afterwards, you can loop through the trs and grab the 2nd td tag to extract the date
for t in trs:
print(str(t.findAll('td')[1].text))
>>> 08/26 15:00

Can't scrape HTML table using BeautifulSoup

I'm trying to scrape data off a table on a web page using Python, BeautifulSoup, Requests, as well as Selenium to log into the site.
Here's the table I'm looking to get data for...
<div class="sastrupp-class">
<table>
<tbody>
<tr>
<td class="key">Thing I dont want 1</td>
<td class="value money">$1.23</td>
<td class="key">Thing I dont want 2</td>
<td class="value">99,999,999</td>
<td class="key">Target</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 3</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 4</td>
<td class="value percentage">1.23%</td>
<td class="key">Thing I dont want 5</td>
<td class="money value">$1.23</td>
</tr>
</tbody>
</table>
</div>
I can find the "sastrupp-class" fine, but I don't know how to look through it and get to the part of the table I want.
I figured I could just look for the class that I'm searching for like this...
output = soup.find('td', {'class':'key'})
print(output)
but that doesn't return anything.
Important to note:
< td>s inside the table have the same class name as the one that I want. If I can't separate them out, I'm ok with that although I'd rather just return the one I want.
2.There are other < div>s with class="sastrupp-class" on the site.
I'm obviously a beginner at this so let me know if I can help you help me.
Any help/pointers would be appreciated.
1) First of, to get your 'Target' you need find_all, not find. Then, considering you know exactly in which position your target will be (in the example you gave it is index=2) the solution could be reached like this:
from bs4 import BeautifulSoup
html = """(YOUR HTML)"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', {'class': 'sastrupp-class'})
all_keys = table.find_all('td', {'class': 'key'})
my_key = all_keys[2]
print my_key.text # prints 'Target'
2)
There are other < div>s with class="sastrupp-class" on the site
Again, you need to select the one you need using find_all and then selecting the correct index.
Example HTML:
<body>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Target</div>
</body>
To extract the target, you can just:
all_divs = soup.find_all('div', {'class':'sastrupp-class'})
target = all_divs[3] # assuming you know exactly which index to look for

python BeautifulSoup4 break for loop when tag found

I have a problem breaking a for loop when going trough a html with bs4.
I want to save a list separated with headings.
The HTML code can look something like below, however it contains more information between the desired tags:
<h2>List One</h2>
<td class="title">
<a title="Title One">This is Title One</a>
</td>
<td class="title">
<a title="Title Two">This is Title Two</a>
</td>
<h2>List Two</h2>
<td class="title">
<a title="Title Three">This is Title Three</a>
</td>
<td class="title">
<a title="Title Four">This is Title Four</a>
</td>
I would like to have the results printed like this:
List One
This is Title One
This is Title Two
List Two
This is Title Three
This is Title Four
I have come this far with my script:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('some webiste')
soup = BeautifulSoup(html, "lxml")
quote1 = soup.h2
print quote1.text
quote2 = quote1.find_next_sibling('h2')
print quote2.text
for quotes in soup.findAll('h2'):
if quotes.find(text=True) == quote2.text:
break
if quotes.find(text=True) == quote1.text:
for anchor in soup.findAll('td', {'class':'title'}):
print anchor.text
print quotes.text
I have tried to break the loop when "quote2" (List Two) is found. But the script gets all the td-content and ignoring the next h2-tags.
So how do I break the for loop with next h2-tag?
In my opinion the problem lies in your HTML syntax. According to https://validator.w3.org it's not legal to mix "td" and "h3" (or generally any header tag). Also, implementing list with tables is most likely not a good practice.
If you can manipulate your input files, the list you seem to need could be implemented with "ul" and "li" tags (first 'li' in 'ul' containing the header) or, if you need to use tables, just put your header inside of "td" tag, or even more cleanly with "th"s:
<table>
<tr>
<th>Your title</th>
</tr>
<tr>
<td>Your data</td>
</tr>
</table>
If the input is not under your control, your script could perform search and replace on the input text anyway, putting the headers into table cells or list items.

Add parent tags with beautiful soup

I have many pages of HTML with various sections containing these code snippets:
<div class="footnote" id="footnote-1">
<h3>Reference:</h3>
<table cellpadding="0" cellspacing="0" class="floater" style="margin-bottom:0;" width="100%">
<tr>
<td valign="top" width="20px">
1.
</td>
<td>
<p> blah </p>
</td>
</tr>
</table>
</div>
I can parse the HTML successfully and extract these relevant tags
tags = soup.find_all(attrs={"footnote"})
Now I need to add new parent tags about these such that the code snippet goes:
<div class="footnote-out"><CODE></div>
But I can't find a way of adding parent tags in bs4 such that they brace the identified tags. insert()/insert_before add in after the identified tags.
I started by trying string manupulation:
for tags in soup.find_all(attrs={"footnote"}):
tags = BeautifulSoup("""<div class="footnote-out">"""+str(tags)+("</div>"))
but I believe this isn't the best course.
Thanks for any help. Just started using bs/bs4 but can't seem to crack this.
How about this:
def wrap(to_wrap, wrap_in):
contents = to_wrap.replace_with(wrap_in)
wrap_in.append(contents)
Simple example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<body><a>Some text</a></body>")
wrap(soup.a, soup.new_tag("b"))
print soup.body
# <body><b><a>Some text</a></b></body>
Example with your document:
for footnote in soup.find_all("div", "footnote"):
new_tag = soup.new_tag("div")
new_tag['class'] = 'footnote-out'
wrap(footnote, new_tag)

Categories