Getting "None" when parsing out data on python,BS4 - python

For a while I have been trying to write a Python program that can extract data from websites. I came across the bs4 library for Python and decided to use it for that job.
The problem is that I always get None as a result, which I cannot understand.
I want to get only one word, which is in an href located inside a div with a class, and for that I wrote a function like this:
def run(self):
    response = requests.get(self.url)
    soup = BeautifulSoup(response.text, 'html.parser')
    finalW = soup.find('a', attrs={'class': 'target'})
    print(finalW)
With this code, I expect to get a word, but it just prints None.
It is quite possible that I have made a mistake in the path to this element, so I am posting an image of the thing I want to extract from the HTML:

When bs4 is not able to find the query, it returns None.
In your case the html is more or less like this.
...
<div class='target'>
neededlink
notneededlink
...
</div>
...
soup.find('a', attrs={'class': 'target'}) therefore cannot match anything, because the class is on the div and the a tags have no class attribute.
If you are certain that your link is the first a inside that div, use the query below:
soup.find('div', {'class': 'target'}).find('a')['href']
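As a minimal sketch of the query above, assuming the two links in the snippet are a tags inside the div (the HTML string, the href values, and the helper name first_link_in_div are made up for illustration). It also guards against the None case, so a missing div does not raise an error:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML modeled on the snippet above; the <a> tags and href
# values are assumptions.
HTML = """
<div class="target">
  <a href="neededlink">first</a>
  <a href="notneededlink">second</a>
</div>
"""

def first_link_in_div(html, div_class):
    """Return the href of the first <a> inside the first matching <div>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"class": div_class})
    if div is None:          # no such div on the page
        return None
    link = div.find("a")
    if link is None:         # div exists but contains no link
        return None
    return link.get("href")  # .get() returns None if href is missing

print(first_link_in_div(HTML, "target"))
print(first_link_in_div(HTML, "missing"))
```

Checking each step for None is a design choice: chaining .find(...).find(...)['href'] is shorter, but it raises an exception as soon as one step finds nothing.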

Related

Website scraping with Python, how do I know where to reference in the html?

I'm a complete beginner who has only built basic Python projects. Right now I'm building a scraper in Python with bs4 to help me read success stories off of a website. These success stories are all in a table, so I thought I would find an html tag that said table and would encompass the entire table.
However, it is all just <div and <span class, and when I use soup.find("div") or ("span") it returns only the single word "div" or "span". This is what I have so far, and I know it isn't right or set up correctly but I'm too inexperienced to know why yet.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import requests
req = Request('https://www.calix.com/about-calix/success-stories.html', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, "lxml")
soup.find("div", {"id": "content-calix-en-site-prod-home-about-calix-success-stories-jcr-content"})
print('div')
I have watched several tutorials on how to use bs4 and I have successfully scraped basic websites, but all I can do for this one is get ALL of the html, not the chunks I need (just the success stories).
You are printing the literal string 'div'. Print the value returned by soup.find() instead; find() does not change soup, it returns the element it found.
You should have a look at the bs4 documentation.
soup.find("div", {"id": "content-calix-en-site-prod-home-about-calix-success-stories-jcr-content"})
Here you're calling soup.find() but you're not saving the results into a variable, so the results are lost.
print('div')
And here you're printing the literal string div. I don't think that's what you intended.
Try something like this:
div = soup.find("div", {"id": "..."})
print(div)
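A small self-contained sketch of the same fix, using an invented HTML string and a made-up id (stories) in place of the real page, which may be structured differently:

```python
from bs4 import BeautifulSoup

# Invented stand-in for the real page; the ids and contents are assumptions.
HTML = """
<div id="stories">
  <span class="title">Story one</span>
  <span class="title">Story two</span>
</div>
<div id="footer">unrelated</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Keep the result of find(); the call does not modify soup.
stories = soup.find("div", {"id": "stories"})

# Search inside the saved element to get only the chunks you need.
titles = [span.get_text(strip=True) for span in stories.find_all("span")]
print(titles)
```

Searching from the saved element rather than from soup is what narrows the output from the whole page to one section.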

How do I retrieve objects based on a partial custom attribute name in Scrapy?

I have the following element:
<div data-offer="MTs3O29sZG5hdnkuY29tOzQxMDYy" class="Offer__Card-sc-14rx0hy-0 iBdrTi"></div>
I need to find it with Scrapy, but I have two complications. The first is that the class can change, so it is not going to keep that value. That is pretty much off the table.
The second problem is that the attribute name can vary between data-offer, data-offer-promo, and data-offer-double.
Do you know how I can find these elements based on a partial attribute name?
For example, give me everything that has a custom attribute matching "data-offer*".
Anything that starts with it works too, but I mean the attribute name, not the value.
I tried this with no success
response.css('[div::attr^="data-offer"]')
You can find those elements using BeautifulSoup. This will find the first div element that has a "data-offer" attribute:
soup = BeautifulSoup(response.body, 'lxml')
results = soup.find("div", {"data-offer" : True})
You could also get a list with all the elements that have the same condition:
soup = BeautifulSoup(response.body, 'lxml')
results = soup.find_all("div", {"data-offer" : True})
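The queries above match only an attribute named exactly data-offer. For the partial-name case from the question, one possible approach is to pass a function to find_all() and test the attribute names yourself. This is a sketch on invented sample markup (only the first div comes from the question; has_offer_attribute is a hypothetical helper name):

```python
from bs4 import BeautifulSoup

# Sample markup; only the first line comes from the question, the rest is invented.
HTML = """
<div data-offer="MTs3O29sZG5hdnkuY29tOzQxMDYy" class="Offer__Card-sc-14rx0hy-0 iBdrTi"></div>
<div data-offer-promo="abc"></div>
<div data-offer-double="def"></div>
<div data-other="ghi"></div>
"""

def has_offer_attribute(tag):
    """True for <div> tags with any attribute whose name starts with 'data-offer'."""
    return tag.name == "div" and any(
        name.startswith("data-offer") for name in tag.attrs
    )

soup = BeautifulSoup(HTML, "html.parser")
offers = soup.find_all(has_offer_attribute)

for div in offers:
    # Show which attribute names matched on each element.
    print([name for name in div.attrs if name.startswith("data-offer")])
```

In a Scrapy callback you would presumably pass response.text in place of HTML. An XPath such as //div[@*[starts-with(name(), "data-offer")]] may achieve the same directly with response.xpath(), though that variant is not tested here.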

How do I use Bs4 to pull similar information but from different places in DOM hierarchy?

I'm trying to scrape information from a series of pages like these two:
https://www.nysenate.gov/legislation/bills/2019/s240
https://www.nysenate.gov/legislation/bills/2019/s8450
What I want to do is build a scraper that can pull down the text of "See Assembly Version of this Bill". In the two links listed above, the classes are the same but for one page it's the only iteration of that class, but for another it's the third.
I'm trying to make something like this work:
assembly_version = soup.select_one(".bill-amendment-detail content active > dd")
print(assembly_version)
But I keep getting None.
Any thoughts?
import requests
from bs4 import BeautifulSoup
url = "https://www.nysenate.gov/legislation/bills/2019/s11"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
assembly_version = soup.find(class_="c-block c-bill-section c-bill--details").find("a").text.strip()
print(assembly_version)
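One likely reason the select_one() call in the question returns None: in a CSS selector, a space means "descendant", so "content" and "active" are read as tag names rather than as classes. A sketch on invented markup (the real page structure may differ) shows the difference when the classes are chained with dots:

```python
from bs4 import BeautifulSoup

# Invented markup for illustration; not copied from the real pages.
HTML = """
<div class="bill-amendment-detail content">
  <dd>inactive amendment</dd>
</div>
<div class="bill-amendment-detail content active">
  <dd>See Assembly Version of this Bill</dd>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Spaces mean "descendant", so this looks for <content> and <active> tags.
wrong = soup.select_one(".bill-amendment-detail content active > dd")

# Chaining the classes with dots matches one element carrying all three.
right = soup.select_one(".bill-amendment-detail.content.active > dd")

print(wrong)
print(right.get_text(strip=True))
```

Because the chained selector targets the element by its classes, it does not matter whether that element is the first or the third of its kind on the page.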

Get content from certain tags with certain attributes using BS4

I need to get the content from the following tag with these attributes: <span class="h6 m-0">.
An example of the HTML I'll encounter would be <span class="h6 m-0">Hello world</span>, and it obviously needs to return Hello world.
My current code is as follows:
page = BeautifulSoup(text, 'html.parser')
names = [item["class"] for item in page.find_all('span')]
This works fine, and gets me all the spans in the page, but I don't know how to specify that I only want those with the specific class "h6 m-0" and grab the content inside. How will I go about doing this?
page = BeautifulSoup(text, 'html.parser')
names = page.find_all('span', class_='h6 m-0')
Without knowing your use case, I don't know if this will work.
names = [item.get_text() for item in page.find_all('span', class_='h6 m-0')]
Can you please be more specific about the problem you face? This should work fine for you, though: it collects the text inside each matching span.
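To make the class matching concrete, here is a sketch on sample text built around the span from the question (the other two spans are invented). It compares class_ with a space-separated string against a CSS selector:

```python
from bs4 import BeautifulSoup

# The first span is from the question; the other two are invented.
text = """
<span class="h6 m-0">Hello world</span>
<span class="h6">not this one</span>
<span class="m-0 h6">same classes, different order</span>
"""

page = BeautifulSoup(text, "html.parser")

# class_ with a space-separated string matches the exact attribute value only.
exact = [s.get_text(strip=True) for s in page.find_all("span", class_="h6 m-0")]

# A CSS selector matches any span carrying both classes, in any order.
both = [s.get_text(strip=True) for s in page.select("span.h6.m-0")]

print(exact)
print(both)
```

If the site may reorder or add classes, the select() form is probably the safer choice; if you want only spans whose class attribute is exactly "h6 m-0", class_ does that.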

Parse JavaScript href with Python

Been having a lot of trouble with this... new to Python so sorry if I just don't know the proper search terms to find the info myself. I'm not even positive it's because of the JS but that's the best idea I've got.
Here's the section of HTML I'm parsing:
...
<div class="promotion">
<div class="address">
5203 Alhama Drive
</div>
</div>
...
...and the Python I'm using to do it (this version is the closest I've gotten to success):
homeFinderSoup = BeautifulSoup(open("homeFinderHTML.html"), "html5lib")
addressClass = homeFinderSoup.find_all('div', 'address')
for row in addressClass:
    print(row.get('href'))
...which returns
None
None
None
# Create soup from the html. (Here I am assuming that you have already read the file into
# the variable "html" as a string).
soup = BeautifulSoup(html, "html5lib")
# Find all divs with class="address"
address_class = soup.find_all('div', {"class": "address"})
# Loop over the results
for row in address_class:
    # Each result has one <a> tag, and we need to get the href property from it.
    print(row.find('a').get('href'))
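The same idea as a self-contained sketch. The a tags and href values below are assumptions, since the snippet in the question does not show them, and the guard covers address blocks that contain no link at all:

```python
from bs4 import BeautifulSoup

# Invented markup: the <a> tags and href values are assumptions.
html = """
<div class="promotion">
  <div class="address">
    <a href="/homes/1">5203 Alhama Drive</a>
  </div>
  <div class="address">no link here</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
for row in soup.find_all("div", {"class": "address"}):
    link = row.find("a")           # the href lives on the <a>, not the <div>
    if link is not None:           # skip address blocks without a link
        hrefs.append(link.get("href"))

print(hrefs)
```

Calling row.get('href') on the div itself returns None because the div has no href attribute, which matches the output seen in the question.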
