Parse JavaScript href with Python - python

Been having a lot of trouble with this... new to Python so sorry if I just don't know the proper search terms to find the info myself. I'm not even positive it's because of the JS but that's the best idea I've got.
Here's the section of HTML I'm parsing:
...
<div class="promotion">
<div class="address">
5203 Alhama Drive
</div>
</div>
...
...and the Python I'm using to do it (this version is the closest I've gotten to success):
homeFinderSoup = BeautifulSoup(open("homeFinderHTML.html"), "html5lib")
addressClass = homeFinderSoup.find_all('div', 'address')
for row in addressClass:
print row.get('href')
...which returns
None
None
None

# Create soup from the html. (Here I am assuming that you have already read the file into
# the variable "html" as a string).
soup = BeautifulSoup(html)
# Find all divs with class="address"
address_class = soup.find_all('div', {"class": "address"})
# Loop over the results
for row in address_class:
# Each result has one <a> tag, and we need to get the href property from it.
print row.find('a').get('href')

Related

Get content from certain tags with certain attributes using BS4

I need to get the content from the following tag with these attributes: <span class="h6 m-0">.
An example of the HTML I'll encounter would be <span class="h6 m-0">Hello world</span>, and it obviously needs to return Hello world.
My current code is as follows:
page = BeautifulSoup(text, 'html.parser')
names = [item["class"] for item in page.find_all('span')]
This works fine, and gets me all the spans in the page, but I don't know how to specify that I only want those with the specific class "h6 m-0" and grab the content inside. How will I go about doing this?
page = BeautifulSoup(text, 'html.parser')
names = page.find_all('span' , class_ = 'h6 m-0')
Without knowing your use case I don't know if this will work.
names = [item["class"] for item in page.find_all('span',class_="h6 m-0" )]
can you please be more specific about what problem you face
but this should work fine for you

Getting "None" when parsing out data on python,BS4

For a while have been trying to make a python program which can split data from websites. I came across the bs4 library for python and decided to use it for that job.
The problem is that I always get as a result None which is something that I cannot understand
I want to get only one word which is in a #href, located in a div class and for that, I wrote a function which is like that:
def run(self):
response = requests.get(self.url)
soup = BeautifulSoup(response.text, 'html.parser')
finalW = soup.find('a', attrs={'class': 'target'})
print(finalW)
With this code, I expect to get a word, but it just returns None.
It is highly possible, too, that I had made a mistake with the path to this directory, so I post an image with the thing I want to extract from the HTML:
When bs4 is not able to find the query, it returns None.
In your case the html is more or less like this.
...
<div class='target'>
neededlink
notneededlink
...
</div>
...
soup.find('a', attrs={'class': 'target'}) thus will not be able to math your query as there are not attrs in a.
If you are certain that your link is first in below query.
soup.find('div', {'class': 'target'}).find('a')['href']

Unable to Scrape Content that comes after a Comment Python BeautifulSoup

I am trying to scrape the tables from the following page:
https://www.baseball-reference.com/boxes/CHA/CHA193805220.shtml
When I reach the html for the batting tables I encounter a very long comment which contains the html for the table
<div id="all_WashingtonSenatorsbatting" class="table_wrapper table_controls">
<div class="section_heading">
<div class="section_heading_text">
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
.....
-->
<div class="table_outer_container mobile_table">
<div class="footer no_hide_long">
Where the last two div are what I am interested in scraping and everything in between the <!-- and the --> is a comment which happens to contain a copy of the table in the table_outer_container class below.
The problem is that when I read the page source into beautiful soup it does will not read anything after the comment within the table_wrapper class div which contains everything. The following code illustrates the problem:
batting = page_source.find('div', {'id':'all_WashingtonSenatorsbatting'})
divs = batting.find_all('div')
len(divs)
gives me
Out[1]: 3
When there are obviously 5 div children under the div id="all_WashingtonSenatorsbatting" element.
Even when I extract the comment using
from bs4 import Comment
for comments in soup.findAll(text=lambda text:isinstance(text, Comment)):
comments.extract()
The resulting soup still doesn't contain the last two div elements I want to scrape. I am trying to play with the code using regular expressions but so far no luck, any suggestions?
I found workable solution, By using the following code I extract the comment (which brings with it the last two div elements I wanted to scrape), process it again in BeautifulSoup and scrape the table
s = requests.get(url).content
soup = BeautifulSoup(s, "html.parser")
table = soup.find_all('div', {'class':'table_wrapper'})[0]
comment = t(text=lambda x: isinstance(x, Comment))[0]
newsoup = BeautifulSoup(comment, 'html.parser')
table = newsoup.find('table')
It took me a while to get to this and would be interested to see if anyone comes up with any other solutions or can offer an explanation of how this problem came to be.

Regex using Python

I am trying to catch from pattern that was downloaded from specific URL, specific values but without success.
Part of the pattern is:
"All My Loving"</td>\n<td style="text-align:center;">1963</td>\n<td><i>UK: With the Beatles<br />\nUS: Meet The Beatles!</i></td>\n<td>McCartney</td>\n<td>McCartney</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;"><span style="display:none" class="sortkey">7001450000000000000\xe2\x99\xa0</span>45</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Things Must Pass"</td>\n<td style="text-align:center;">1969</td>\n<td><i>Anthology 3</i></td>\n<td>Harrison</td>\n<td>Harrison</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Together Now"</td>\n<td style="text-align:center;">1967</td>\n<td><i>Yellow Submarine</i></td>\n<td>McCartney, with Lennon</td>\n<td>McCartney, with Lennon</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>"
I want to catch the Title and the 1st <td>McCartney</td> with specific values from the file and to print it out as a JSON file.
Can I run with FOR loop with regex ? How I can do it using python ?
Thanks,
If you want to parse HTML use an HTML parser (such as BeautifulSoup), not regex.
from bs4 import BeautifulSoup
html = '''All My Loving"</td>\n<td style="text-align:center;">1963</td>\n<td><i>UK: With the Beatles<br />\nUS: Meet The Beatles!</i></td>\n<td>McCartney</td>\n<td>McCartney</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;"><span style="display:none" class="sortkey">7001450000000000000\xe2\x99\xa0</span>45</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Things Must Pass"</td>\n<td style="text-align:center;">1969</td>\n<td><i>Anthology 3</i></td>\n<td>Harrison</td>\n<td>Harrison</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Together Now"</td>\n<td style="text-align:center;">1967</td>\n<td><i>Yellow Submarine</i></td>\n<td>McCartney, with Lennon</td>\n<td>McCartney, with Lennon</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>
'''
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a') # will only find the first <a> tag
print(a.attrs['title'])
tds = soup.find_all('td') # will find all <td> tags
for td in tds:
if 'McCartney' in td.text:
print(td)
# All My Loving
# <td>McCartney</td>
# <td>McCartney</td>
# <td>McCartney, with Lennon</td>
# <td>McCartney, with Lennon</td>

Python 3, beautiful soup, get next tag

I have the following html part which repeates itself several times with other href links:
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">
Now I want to get all the href links in this document that are directly after the div tag with the class "product-list-item".
Pretty new to beautifulsoup and nothing that I came up with worked.
Thanks for your ideas.
EDIT: Does not really have to be beautifulsoup; when it can be done with regex and the python html parser this is also ok.
EDIT2: What I tried (I'm pretty new to python, so what I did might be totaly stupid from an advanced viewpoint):
soup = bs4.BeautifulSoup(htmlsource)
x = soup.find_all("div")
for i in range(len(x)):
if x[i].get("class") and "product-list-item" in x[i].get("class"):
print(x[i].get("class"))
This will give me a list of all the "product-list-item" but then I tried something like
print(x[i].get("class").next_element)
Because I thought next_element or next_sibling should give me the next tag but it just leads to AttributeError: 'list' object has no attribute 'next_element'. So I tried with only the first list element:
print(x[i][0].get("class").next_element)
Which led to this error: return self.attrs[key] KeyError: 0.
Also tried with .find_all("href") and .get("href") but this all leads to the same errors.
EDIT3: Ok seems I found out how to solve it, now I did:
x = soup.find_all("div")
for i in range(len(x)):
if x[i].get("class") and "product-list-item" in x[i].get("class"):
print(x[i].next_element.next_element.get("href"))
This can also be shortened by using another attribute to the find_all function:
x = soup.find_all("div", "product-list-item")
for i in x:
print(i.next_element.next_element.get("href"))
greetings
I want to get all the href links in this document that are directly after the div tag with the class "product-list-item"
To find the first <a href> element in the <div>:
links = []
for div in soup.find_all('div', 'product-list-item'):
a = div.find('a', href=True) # find <a> anywhere in <div>
if a is not None:
links.append(a['href'])
It assumes that the link is inside <div>. Any elements in <div> before the first <a href> are ignored.
If you'd like; you can be more strict about it e.g., taking the link only if it is the first child in <div>:
a = div.contents[0] # take the very first child even if it is not a Tag
if a.name == 'a' and a.has_attr('href'):
links.append(a['href'])
Or if <a> is not inside <div>:
a = div.find_next('a', href=True) # find <a> that appears after <div>
if a is not None:
links.append(a['href'])
There are many ways to search and navigate in BeautifulSoup.
If you search with lxml.html, you could also use xpath and css expressions if you are familiar with them.

Categories