beautifulsoup4 - how do I parse a specific class name? - python

On a site, the info I want to parse is stored in tags like:
<tr class="productBank tr-turn tr-link row body odd" data-target="tr67" data-key="finservice">
or
<tr class="productBank tr-turn tr-link row body even" data-target="tr420" data-key="runa-bank">
but the site also contains other tags like this:
<tr class="productBank tr-turn tr-link curr_old row body odd" data-target="tr324" data-key="sov-bank">
or
<tr class="productBank tr-turn tr-link curr_old row body even" data-target="tr64" data-key="morskoybank">
If I try this:
items = soup.find_all('tr', class_='productBank')
it returns all kinds of tags, but if I write out the full class name, I get an empty list.
How can I access only that specific type of class?

It is all about finding the differences. In this case, you can find all tr elements with the 'productBank' class, then filter out the elements with the 'curr_old' class. The following code achieves this:
[e for e in soup.find_all('tr', {'class': ['productBank']}) if 'curr_old' not in e['class']]
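As a self-contained sketch of the same idea, the snippet below runs the filter against a small sample table. The row attributes are copied from the question; the surrounding `<table>` wrapper and the empty row bodies are assumptions made for illustration. It also shows a CSS selector with `:not()`, which should give the same result with `select()` on Beautiful Soup 4.7 or later.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the rows in the question (the <table> wrapper is assumed).
html = """
<table>
<tr class="productBank tr-turn tr-link row body odd" data-target="tr67" data-key="finservice"></tr>
<tr class="productBank tr-turn tr-link curr_old row body odd" data-target="tr324" data-key="sov-bank"></tr>
<tr class="productBank tr-turn tr-link row body even" data-target="tr420" data-key="runa-bank"></tr>
<tr class="productBank tr-turn tr-link curr_old row body even" data-target="tr64" data-key="morskoybank"></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Approach from the answer: match on one class, then drop rows that also carry 'curr_old'.
wanted = [tr for tr in soup.find_all("tr", class_="productBank")
          if "curr_old" not in tr["class"]]

# Alternative: let a CSS selector express the exclusion.
wanted_css = soup.select("tr.productBank:not(.curr_old)")

keys = [tr["data-key"] for tr in wanted]
keys_css = [tr["data-key"] for tr in wanted_css]
print(keys)
```

Both lists should contain only the rows without `curr_old`; which form to use is mostly a matter of taste.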

Related

Webscraping - Beautifulsoup4 - Accessing indexed item in a find_all loop

How do I make it so that I can choose an item in the list in that for loop?
When I print it without brackets, I get the full list and every index seems to be the proper item that I need
for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname)
However, when I print it by specifying an index, whether it is inside the loop or outside, it returns "list index out of range"
for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname[2])
What's my problem here? How do I change my code so that I can scrape all the h3 names, yet at the same time be able to choose specific indexed h3 names when I want to?
Here's the entire code:
import requests
from bs4 import BeautifulSoup
source = requests.get("https://ca1lib.org/s/ginger") #gets the source of the site and returns it
soup = BeautifulSoup(source.text, 'html5lib')
for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname[2])
At first glance, assuming each h3 element contains several book names ("book1" \n "book2" \n "book3"), the problem could be that some h3 elements split into fewer than three items, so bookname[2] raises an IndexError on the shorter lists.
On the other hand, if each h3 element holds only one item (<h3>book1</h3>), the loop visits one h3 tag per iteration (first "<h3>book1</h3>", then "<h3>book2</h3>", and so on). In that case, build a list of all the h3.a.text values first, then index into that list.
Hope this helps!
I forgot to append. I figured it out.
Here's my final code:
import requests
from bs4 import BeautifulSoup
source = requests.get("https://ca1lib.org/s/ginger") #gets the source of the site and returns it
soup = BeautifulSoup(source.text, 'html.parser')
liste = []
for h3_tag in soup.find_all('h3', itemprop="name"):
    liste.append(h3_tag.a.text.split("\n"))
    #bookname = h3.a.text #string
    #bookname = bookname.split('\n') #becomes list
print(liste[5])
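For readers who want to try the "collect first, index later" pattern without hitting the live site, here is a minimal sketch. The HTML string and the book titles are made up, and `pick` is a hypothetical helper added only to show one way of avoiding "list index out of range".

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the search results page.
html = """
<h3 itemprop="name"><a href="/book/1">Ginger Book One</a></h3>
<h3 itemprop="name"><a href="/book/2">Ginger Book Two</a></h3>
<h3 itemprop="name"><a href="/book/3">Ginger Book Three</a></h3>
"""
soup = BeautifulSoup(html, "html.parser")

# Build the full list first; each entry is one book title.
booknames = [h3.a.get_text(strip=True)
             for h3 in soup.find_all("h3", itemprop="name")]

def pick(names, index):
    """Hypothetical helper: return names[index], or None if the index is out of range."""
    return names[index] if -len(names) <= index < len(names) else None

print(booknames)
print(pick(booknames, 2))
print(pick(booknames, 10))
```

Once the list exists outside the loop, any index can be checked against `len(booknames)` before it is used.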

Extracting text from find_next_sibling(), BeautifulSoup

I am trying to extract the description of the Chinese character from this website: http://www.hsk.academy/en/hsk_1
Example html:
<tr>
<td>
<span class="hanzi">爱</span>
<br/>ài</td>
<td>to love; affection; to be fond of; to like</td>
</tr>
I would like the last td tag's text be put into a list for each description of the character. However, currently I am given the whole tag including the tags themselves. I can't .text the find_next_sibling(): AttributeError: 'NoneType' object has no attribute 'text'.
This is my code:
for item in soup.find_all("td"):
    EnglishItem = item.find_next_sibling()
    if EnglishItem:
        if not any(EnglishItem in s for s in EnglishDescriptionList):
            EnglishDescriptionList.insert(count, EnglishItem)
            count += 1
print(EnglishDescriptionList)
Try this:
english_descriptions = []
table = soup.find('table', id='flat_list')
for e in table.select('.hanzi'):
    english_desc = e.parent.find_next_sibling().text
    if not any(english_desc in s for s in english_descriptions):
        english_descriptions.append(english_desc)
This selects (finds) all tags of class hanzi (within the table with id="flat_list") which will be the <span> tags. Then the parent of each <span> is accessed - this is the first <td> in each row. Finally the next sibling is accessed and this is the target tag that contains the English description.
You can do away with the count and just append items to the list with
english_descriptions.append()
Also, I don't think that you need to check whether the current english description is a substring of an existing one (is that what you're trying to do?). If not you can simplify to this list comprehension:
table = soup.find('table', id='flat_list')
english_descriptions = [e.parent.find_next_sibling().text for e in table.select('.hanzi')]
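The following sketch applies the same navigation to an inline copy of the example row so it can be run offline. The `id="flat_list"` table wrapper follows the answer above, and the second row is an addition for illustration only. It passes "td" to `find_next_sibling()` and checks for `None`, a slightly more defensive variant than the one-liner.

```python
from bs4 import BeautifulSoup

# First row is the example from the question; the second row is made up.
html = """
<table id="flat_list">
<tr>
<td>
<span class="hanzi">爱</span>
<br/>ài</td>
<td>to love; affection; to be fond of; to like</td>
</tr>
<tr>
<td>
<span class="hanzi">八</span>
<br/>bā</td>
<td>eight; 8</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="flat_list")

english_descriptions = []
for span in table.select(".hanzi"):
    # span.parent is the first <td>; its next <td> sibling holds the description.
    desc_cell = span.parent.find_next_sibling("td")
    if desc_cell is not None:  # skip rows that have no second cell
        english_descriptions.append(desc_cell.get_text(strip=True))

print(english_descriptions)
```

The `None` check is what prevents the AttributeError from the question when a cell has no following sibling.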

Get all the tags except the first tags using beautifulSoup

I have html structure like this:
<table><tr><p>Hello1</p></tr><tr><p>Shirt</p></tr></table>
<table><tr><p>Hello2</p></tr><tr><p>Jeans</p></tr><tr><p>Jacket</p></tr></table>
<table><tr><p>Hello3</p></tr><tr><p>Trouser</p></tr></table>
I want to get all the tr tags in all tables except the first tr tag in every table.
Output should be like:
Shirt
Jeans
Jacket
Trouser
My current code is:
soup = BeautifulSoup(data)
n = soup.findAll('table')
for tr in n:
    t = tr.findAll('tr')[1].findAll('span')
    for p in t:
        print(p.text)
One problem with your code above is that the [1] index only gets the second tr. What you want instead is a slice, [1:], which gets everything after the first. Also, to get the text, use find(text=True) instead of looking for a span. See below for the solution:
from bs4 import BeautifulSoup
n = BeautifulSoup(data, 'html.parser')
for table in n.findAll('table'):
    for tr in table.findAll('tr')[1:]:
        print(tr.find(text=True))
Note: the above prints objects on a newline, whereas your output suggested they should be on a separate line. That should be a trivial change.
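Here is a runnable sketch of the slicing approach using the sample markup from the question. It assumes Beautiful Soup 4 with the built-in `html.parser`, which keeps the `<p>` elements inside the `<tr>` tags as written; other parsers may rearrange this invalid table markup. It uses `get_text()` rather than `find(text=True)`, which is a substitution and not the answer's exact call.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
data = """
<table><tr><p>Hello1</p></tr><tr><p>Shirt</p></tr></table>
<table><tr><p>Hello2</p></tr><tr><p>Jeans</p></tr><tr><p>Jacket</p></tr></table>
<table><tr><p>Hello3</p></tr><tr><p>Trouser</p></tr></table>
"""
soup = BeautifulSoup(data, "html.parser")

items = []
for table in soup.find_all("table"):
    # [1:] skips the first row of each table and keeps the rest.
    for tr in table.find_all("tr")[1:]:
        items.append(tr.get_text(strip=True))

# One item per line, matching the layout of the desired output.
print("\n".join(items))
```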

beautifulsoup python how to loop through cells in a table and find <a>links</a>

soup = BeautifulSoup(''.join(html))
table = soup.find("table")
firstRow = table.contents[0]
for tr in firstRow:
    if 'Total' in tr:
        text = ''.join(tr.find(text=True))
        print(text)
Sometimes the table element contains a link with the text instead of plain text. In that case the above for loop loops through all cells and doesn't find the text 'Total', because it's in
<a title="err">Total</a>
instead.
How can I modify my loop to find the text in the link if there is a link?
Calling your iteration variable tr is misleading. You're iterating over a table row; the individual items are td or th elements, or just cells. Not a table row.
Looking at the Beautiful Soup documentation, it looks like you want the string property:
If a tag has only one child, and that child is a NavigableString, the child is made available as .string ... If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.
So:
for cell in firstRow:
    # .string can be None (e.g. a cell with several children), so guard first
    if cell.string and "Total" in cell.string:
        # ...
If that doesn't work for you (i.e., there's stuff in the cell you want besides the text in the string) then what you want to do is get all the text in the table cell before testing it for "Total":
for cell in firstRow:
    text = "".join(cell.find_all(text=True))
    if "Total" in text:
        print(text)
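To make the difference between the two approaches concrete, here is a small sketch on a made-up header row (the table contents are assumptions, apart from the `<a title="err">Total</a>` link from the question). It iterates over the cells with `find_all` instead of over the row's raw children, and uses `get_text()` to join all text in a cell.

```python
from bs4 import BeautifulSoup

# Made-up row: plain text, text inside a link, and text mixed with a nested tag.
html = """
<table>
<tr><th>Item</th><th><a title="err">Total</a></th><th>Sub <b>Total</b></th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
first_row = soup.find("table").find("tr")
cells = first_row.find_all(["td", "th"])

# .string follows a single child tag, but is None when a cell has several children.
strings = [cell.string for cell in cells]

# get_text() concatenates every piece of text in the cell, links included.
matches = [cell.get_text() for cell in cells if "Total" in cell.get_text()]

print(strings)
print(matches)
```

In this sample the linked cell is found either way, while the third cell is only found through its full text.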

How to get a nested element in beautiful soup

I am struggling with the syntax required to grab some hrefs in a td.
The table, tr and td elements don't have any classes or ids.
If I wanted to grab the anchor in this example, what would I need?
< tr >
< td > < a >...
Thanks
As per the docs, you first make a parse tree:
from bs4 import BeautifulSoup
html = "<html><body><tr><td><a href='foo'/></td></tr></body></html>"
soup = BeautifulSoup(html, 'html.parser')
and then you search in it, for example for <a> tags whose immediate parent is a <td>:
for ana in soup.findAll('a'):
    if ana.parent.name == 'td':
        print(ana["href"])
Something like this?
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
anchors = [td.find('a') for td in soup.findAll('td')]
That should find the first "a" inside each "td" in the html you provide. You can tweak td.find to be more specific or else use findAll if you have several links inside each td.
UPDATE: re Daniele's comment, if you want to make sure you don't have any None's in the list, then you could modify the list comprehension thus:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
anchors = [a for a in (td.find('a') for td in soup.findAll('td')) if a]
Which basically just adds a check to see if you have an actual element returned by td.find('a').
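With Beautiful Soup 4, a CSS selector is another way to express "anchors directly inside a td". The sketch below uses made-up markup and link targets; `td > a[href]` matches only `<a>` tags that have an `href` attribute and whose immediate parent is a `<td>`, so no `None` filtering is needed.

```python
from bs4 import BeautifulSoup

# Made-up table: one cell with a link, one without, one with two links.
html = """
<table>
<tr><td><a href="foo">first</a></td><td>no link here</td></tr>
<tr><td><a href="bar">second</a> <a href="baz">third</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns matches in document order.
hrefs = [a["href"] for a in soup.select("td > a[href]")]
print(hrefs)
```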
