I am trying to extract the description of the Chinese character from this website: http://www.hsk.academy/en/hsk_1
Example html:
<tr>
<td>
<span class="hanzi">爱</span>
<br/>ài</td>
<td>to love; affection; to be fond of; to like</td>
</tr>
I would like the text of the last td tag in each row to be put into a list, one entry per character description. However, I currently get the whole tag, including the markup itself. I can't call .text on the result of find_next_sibling(): AttributeError: 'NoneType' object has no attribute 'text'.
This is my code:
for item in soup.find_all("td"):
    EnglishItem = item.find_next_sibling()
    if EnglishItem:
        if not any(EnglishItem in s for s in EnglishDescriptionList):
            EnglishDescriptionList.insert(count, EnglishItem)
            count += 1
print EnglishDescriptionList
Try this:
english_descriptions = []
table = soup.find('table', id='flat_list')
for e in table.select('.hanzi'):
    english_desc = e.parent.find_next_sibling().text
    if not any(english_desc in s for s in english_descriptions):
        english_descriptions.append(english_desc)
This selects (finds) all tags of class hanzi (within the table with id="flat_list") which will be the <span> tags. Then the parent of each <span> is accessed - this is the first <td> in each row. Finally the next sibling is accessed and this is the target tag that contains the English description.
You can do away with the count and just append items to the list with
english_descriptions.append()
Also, I don't think you need to check whether the current English description is a substring of an existing one (is that what you're trying to do?). If not, you can simplify to this list comprehension:
table = soup.find('table', id='flat_list')
english_descriptions = [e.parent.find_next_sibling().text for e in table.select('.hanzi')]
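For reference, here is a minimal self-contained sketch of the approach above, run against the sample row from the question. The wrapping `<table id="flat_list">` is an assumption taken from the answer's code, not something shown in the question's HTML, and the `None` check is an added guard against rows that have no second cell.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the row shown in the question; the surrounding
# <table id="flat_list"> is an assumption based on the answer's code.
html = """
<table id="flat_list">
<tr>
<td>
<span class="hanzi">爱</span>
<br/>ài</td>
<td>to love; affection; to be fond of; to like</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="flat_list")

english_descriptions = []
for span in table.select(".hanzi"):
    # span.parent is the first <td>; its next <td> sibling holds the description.
    description_cell = span.parent.find_next_sibling("td")
    if description_cell is not None:  # avoids the 'NoneType' AttributeError
        english_descriptions.append(description_cell.get_text(strip=True))

print(english_descriptions)
```

Starting from the `.hanzi` span rather than from every `td` matters here: the last `td` in a row has no next sibling, which is where the original `'NoneType' object has no attribute 'text'` error comes from.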
Related
Some background here, please skip to the end if you're not interested in this.
I'm trying to scrape some data about singers on the Billboard Hot 100 and have run into the following problem while scraping information about what genres the singers belong to. I'm currently using the infobox table all singers have.
Eg. https://en.wikipedia.org/wiki/Lil_Nas_X
Inside the infobox, the genres cell is structured differently for different singers. Some singers, like Lil Nas X above, have the genres inside a list:
<th scope="row">Genres</th><td><div class="hlist hlist-separated">
<ul>
<li>Hip hop<sup id="cite_ref-Allmusic_genre_1-0" class="reference">[1]</sup></li>
<li>country rap<sup id="cite_ref-Allmusic_genre_1-1" class="reference">[1]</sup><sup id="cite_ref-2" class="reference">[2]</sup><sup id="cite_ref-3" class="reference">[3]</sup></li>
<li>pop<sup id="cite_ref-Allmusic_genre_1-2" class="reference"><a href="#cite_note-Allmusic_genre-1">
While some others don't.
<th scope="row">Genres</th>
<td>Alternative rock,
post-grunge
<sup id="cite_ref-Villains_AMG_1-0" class="reference">[1]</sup></td></tr>
In these cases I run into a problem. If I extract all the 'a' tags within 'td' I'm getting the 'a' tag within the cite note which is inside 'sup' too.
I want to be able to extract the 'a' tags inside 'td' but not inside the 'sup'. This is the code I am currently using for this portion.
cell_containing_genres = soup.find('th', string="Genres").find_next('td')  # This finds the next cell after the title Genres
if cell_containing_genres.find_all('li') == []:  # Extracting all the links from the cell above without list items
    genre_list = cell_containing_genres.find_all('a')
    for genre in genre_list:
        genres.append(genre.get('href'))
else:  # Extracting all the links from the cell above with list items
    genre_list = cell_containing_genres.find_all('li')
    for genre in genre_list:
        genres.append(genre.find_all('a')[0].get('href'))
Thanks a lot for your help.
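One possible way to do this, sketched under assumptions: collect every 'a' tag in the cell and drop those that have a 'sup' ancestor. The sample markup below is hypothetical (the genre links and the cite-note anchor are made up to mirror the structure described above), and `find_parent` is used as the filter.

```python
from bs4 import BeautifulSoup

# Hypothetical infobox fragment: two genre links plus a citation link in <sup>.
html = """
<table><tr>
<th scope="row">Genres</th>
<td><a href="/wiki/Alternative_rock">Alternative rock</a>,
<a href="/wiki/Post-grunge">post-grunge</a>
<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[1]</a></sup></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("th", string="Genres").find_next("td")

# Keep only links that are not nested (at any depth) inside a <sup>.
genres = [
    a.get("href")
    for a in cell.find_all("a")
    if a.find_parent("sup") is None
]

print(genres)
```

Because the filter looks at ancestors rather than at the cell layout, the same comprehension should also cover the cells that wrap genres in list items, so the 'li' branch may no longer be needed.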
On the site, tags like:
<tr class="productBank tr-turn tr-link row body odd" data-target="tr67" data-key="finservice">
or
<tr class="productBank tr-turn tr-link row body even" data-target="tr420" data-key="runa-bank">
hold the info I want to parse, but the site also has other tags like this:
<tr class="productBank tr-turn tr-link curr_old row body odd" data-target="tr324" data-key="sov-bank">
or
<tr class="productBank tr-turn tr-link curr_old row body even" data-target="tr64" data-key="morskoybank">
If I try this:
items = soup.find_all('tr', class_='productBank')
it returns both kinds of tags, but if I write out the full class name I get an empty list.
How can I select only that specific kind of tag?
It is all about finding the differences. In this case, you can find all tr elements with the 'productBank' class, then filter out the elements with the 'curr_old' class. The following code achieves this:
[e for e in soup.find_all('tr', {'class': ['productBank']}) if 'curr_old' not in e['class']]
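A runnable sketch of that filter, using rows copied from the question inside an assumed wrapping table. It also shows a CSS selector with `:not()` as an alternative; that selector form is not part of the answer above and is offered only as an equivalent under the same assumptions.

```python
from bs4 import BeautifulSoup

# Rows taken from the question; the wrapping <table> is an assumption.
html = """
<table>
<tr class="productBank tr-turn tr-link row body odd" data-target="tr67" data-key="finservice"></tr>
<tr class="productBank tr-turn tr-link curr_old row body odd" data-target="tr324" data-key="sov-bank"></tr>
<tr class="productBank tr-turn tr-link row body even" data-target="tr420" data-key="runa-bank"></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Approach from the answer: match on one class, then filter out 'curr_old'.
current = [
    tr for tr in soup.find_all("tr", class_="productBank")
    if "curr_old" not in tr.get("class", [])
]

# Alternative: let the CSS selector express the exclusion.
via_selector = soup.select("tr.productBank:not(.curr_old)")

keys = [tr["data-key"] for tr in current]
selector_keys = [tr["data-key"] for tr in via_selector]
print(keys)
```

Beautiful Soup treats `class` as a multi-valued attribute, so `tr["class"]` is a list of class names and the `in` test checks for a whole class rather than a substring.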
I'm working in python on a Selenium problem. I'm trying to gather each element with an h1 tag and following that tag, I want to get the closest h2 and paragraph text tags and place that data into an object.
My current code looks like:
cards = browser.find_elements_by_tag_name("h1")
ratings = browser.find_elements_by_tag_name('h3')
descriptions = browser.find_elements_by_tag_name('p')
print(len(cards))
print(len(ratings))
print(len(descriptions))
which is generating inconsistent numbers.
To get the <h1> tag elements and then the next sibling <h2> and <p> tag elements you can use the following solution:
cards = browser.find_elements_by_tag_name("h1")
ratings = browser.find_elements_by_xpath("//h1/following-sibling::h2")
descriptions = browser.find_elements_by_xpath("//h1/following-sibling::p")
print(len(cards))
print(len(ratings))
print(len(descriptions))
This is an extraction of an HTML file from http://www.flashscore.com/hockey/sweden/shl/results/
<td title="Click for match detail!" class="cell_sa score bold">4:3<br><span class="aet">(3:3)</span></td>
<td title="Click for match detail!" class="cell_sa score bold">2:5</td>
I would now like to extract the scores after regulation time.
This means whenever 'span class = "aet"' is present (after td class="cell_sa score bold") I need to get the text from span class = "aet". If span class = "aet" is not present I would like to extract the text from td class="cell_sa score bold".
In the above case the desired output (in a list) would be:
[3:3,2:5]
How could I go about that with XPath statements in Python?
You can reach the text nodes of the desired tags, under the conditions you defined, with this:
(/tr/td[count(./span[@class = 'aet']) > 0]/span[@class = 'aet'] | /tr/td[0 = count(./span[@class = 'aet'])])/text()
I assumed the <td> tags are grouped in a <tr> tag.
If you want to strictly select only <td>s having 'cell_sa', 'score' and 'bold', add [contains(@class, 'cell_sa')][contains(@class, 'score')][contains(@class, 'bold')] after each td, as below:
(/tr/td[contains(@class, 'cell_sa')][contains(@class, 'score')][contains(@class, 'bold')][count(./span[@class = 'aet']) > 0]/span[@class = 'aet'] | /tr/td[contains(@class, 'cell_sa')][contains(@class, 'score')][contains(@class, 'bold')][0 = count(./span[@class = 'aet'])])/text()
As you can see, I made the @class check order-independent and loose (just as in a CSS selector). You could instead implement this check as a simple string comparison, but that results in a fragile data consumer.
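To try the expression from Python, here is a minimal sketch using lxml (an assumption; the question does not say which XPath library is in use). The two cells are taken from the question, wrapped in an assumed `<table><tr>`, and the path starts with `//tr/td` instead of `/tr/td` so that it matches regardless of where the row sits in the parsed tree. Stripping the parentheses around the after-extra-time score is an extra step added to match the desired output.

```python
from lxml import html as lxml_html

# Cells copied from the question; the <table><tr> wrapper is an assumption.
snippet = """
<table><tr>
<td title="Click for match detail!" class="cell_sa score bold">4:3<br><span class="aet">(3:3)</span></td>
<td title="Click for match detail!" class="cell_sa score bold">2:5</td>
</tr></table>
"""

tree = lxml_html.fromstring(snippet)

# Same union as above: the span text where an 'aet' span exists,
# otherwise the text of the cell itself.
xpath = (
    "(//tr/td[contains(@class, 'score')][count(./span[@class = 'aet']) > 0]"
    "/span[@class = 'aet']"
    " | //tr/td[contains(@class, 'score')][0 = count(./span[@class = 'aet'])]"
    ")/text()"
)

# The 'aet' score is wrapped in parentheses in the page, so strip them.
scores = [text.strip("()") for text in tree.xpath(xpath)]
print(scores)
```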
soup = BeautifulSoup(''.join(html))
table = soup.find("table")
firstRow = table.contents[0]
for tr in firstRow:
    if 'Total' in tr:
        text = ''.join(tr.find(text=True))
        print(text)
Sometimes the table element contains a link with the text instead of plain text. In that case the above for loop loops through all cells and doesn't find the text 'Total', because it's in
<a title="err">Total</a>
instead.
How can I modify my loop to find the text in the link if there is a link?
Calling your iteration variable tr is misleading. You're iterating over a table row; the individual items are td or th elements, or just cells. Not a table row.
Looking at the Beautiful Soup documentation, it looks like you want the string property:
If a tag has only one child, and that child is a NavigableString, the child is made available as .string ... If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.
So:
for cell in firstRow:
    # .string is None when a cell has more than one child, so guard against it
    if cell.string and "Total" in cell.string:
        # ...
If that doesn't work for you (i.e., there's stuff in the cell you want besides the text in the string) then what you want to do is get all the text in the table cell before testing it for "Total":
for cell in firstRow:
    text = "".join(cell.find_all(text=True))
    if "Total" in text:
        print(text)
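As a self-contained sketch of the second variant, the example below uses `get_text()`, which gathers all text in a cell, including text nested inside a link. The table markup is hypothetical apart from the `<a title="err">Total</a>` fragment from the question, and `first_row` is located with `find("tr")` rather than `table.contents[0]` so that stray whitespace nodes are not iterated.

```python
from bs4 import BeautifulSoup

# Hypothetical header row: one plain-text cell and one cell wrapping a link.
html = """
<table>
<tr><td>Name</td><td><a title="err">Total</a></td></tr>
<tr><td>Widgets</td><td>12</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
first_row = soup.find("table").find("tr")

matches = []
for cell in first_row.find_all(["td", "th"]):
    # get_text() joins every text node under the cell, linked or not.
    text = cell.get_text(strip=True)
    if "Total" in text:
        matches.append(text)

print(matches)
```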