BeautifulSoup children of div - python

I am trying to find the child divs of a specific div on a website using BeautifulSoup.
I took inspiration from this answer: Beautiful Soup find children for particular div
However, when I try to retrieve all the content of the divs with class='row' whose parent div has class="container search-results-wrapper endless_page_template" as seen below, my problem is that it only retrieves the content of the first div with class='row'.
I am using the following code:
boatContainer = page_soup.find_all('div', class_='container search-results-wrapper endless_page_template')
for row in boatContainer:
    all_boats = row.find_all('div', class_='row')
    for boat in all_boats:
        print(boat.text)
I apply this on this website.
What can I do so that my solution retrieves the data of the divs with class='row' that belong to the div with class='container search-results-wrapper endless_page_template'?

Use response.content instead of response.text.
You're also not requesting the correct URL in your code. https://www.sailogy.com/en/search/?search_where=ibiza&trip_date=2020-06-06&weeks_count=1&skipper=False&search_src=home only displays a single boat, hence your code is only returning one row.
Use https://www.sailogy.com/en/search/?search_where=ibiza&trip_date=2020-06-06&weeks_count=1&guests_count=&order_by=-rank&is_roundtrip=&coupon_code=&skipper=None instead in this case.
You'll probably find it useful to adjust the URL parameters to filter boats at some point!
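Putting both suggestions together, here is a minimal sketch of the full flow (class names are taken from the question; the page layout may have changed since):
import requests
from bs4 import BeautifulSoup

url = ('https://www.sailogy.com/en/search/?search_where=ibiza'
       '&trip_date=2020-06-06&weeks_count=1&guests_count='
       '&order_by=-rank&is_roundtrip=&coupon_code=&skipper=None')
# Parse the raw bytes (response.content) rather than response.text
page_soup = BeautifulSoup(requests.get(url).content, 'html.parser')
containers = page_soup.find_all(
    'div', class_='container search-results-wrapper endless_page_template')
for container in containers:
    for boat in container.find_all('div', class_='row'):
        print(boat.text)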


How to bring back 1st div child in python using bs4 soup.select within a dynamic table

In the HTML elements below, I have been unsuccessful in using BeautifulSoup's soup.select to obtain only the first child inside each <div class="wrap-25PNPwRV"> (i.e. -11.94M and 2.30M) in list format.
<div class="value-25PNPwRV">
  <div class="wrap-25PNPwRV">
    <div>−11.94M</div>
    <div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div>
  </div>
</div>
<div class="value-25PNPwRV additional-25PNPwRV">
  <div class="wrap-25PNPwRV">
    <div>2.30M</div>
    <div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div>
  </div>
</div>
The above is just two examples from the HTML I'm attempting to scrape; the source code lies within a dynamic, JavaScript-rendered table, and there are many more div elements on the page, and many more <div class="wrap-25PNPwRV"> elements inside that table.
I currently have the code below, which lets me scrape all the contents within <div class="wrap-25PNPwRV">:
data_list = [elem.get_text() for elem in soup.select("div.wrap-25PNPwRV")]
Output:
['-11.94M', '-119.94%', '2.30M', '-80.17%']
However, I would like to use soup.select to yield the desired output:
['-11.94M', '2.30M']
I tried following this guide https://www.crummy.com/software/BeautifulSoup/bs4/doc/ but have been unsuccessful in applying it to my code above.
Please note, if soup.select cannot do the above, I am happy to use an alternative, provided it generates the same list format/output.
You can use the :nth-of-type CSS selector:
data_list = [elem.get_text() for elem in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]
I'd suggest not using the .wrap-25PNPwRV class. It looks auto-generated and will almost certainly change in the future.
Instead, select the <div> element that has another element with class="change..." as a sibling. For example:
print([t.text.strip() for t in soup.select('div:has(+ [class^="change"])')])
Prints:
['−11.94M', '2.30M']
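For reference, this runs as-is against the markup from the question:
from bs4 import BeautifulSoup

html = '''
<div class="value-25PNPwRV">
  <div class="wrap-25PNPwRV">
    <div>−11.94M</div>
    <div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div>
  </div>
</div>
<div class="value-25PNPwRV additional-25PNPwRV">
  <div class="wrap-25PNPwRV">
    <div>2.30M</div>
    <div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# :has() needs the soupsieve package, which ships with bs4 4.7+
print([t.text.strip() for t in soup.select('div:has(+ [class^="change"])')])
# ['−11.94M', '2.30M']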

Scraping sub div under div class

I'm trying to scrape Amazon. The concept is: I search for a product in the search box, and from the results I count the rank of occurrence of the product on the search page using the product's unique ASIN. I have been able to scrape the main div, but I'm unable to scrape the sub-divs under the main div which contain the search results.
In the picture you can see the main div class, which has sub-divs containing the unique ASIN numbers. How can I iterate over the sub-divs? I have tried response.xpath('//div[@class="s-main-slot s-result-list s-search-results sg-row"]') and response.css('.s-main-slot,.s-result-list,.s-search-results,.sg-row').extract(), but both seem to have some missing data and I can't iterate over them. I'm fairly new to Scrapy; any help would be really appreciated, thanks.
With CSS, which I'm more familiar with, you can do it like this:
results = response.css('div.s-search-results > div[data-asin]::attr(data-asin)').getall()
for asin in results:
    print(asin)
Explanation
div.s-search-results targets the outer div. > div[data-asin] targets divs directly inside the outer div that have the data-asin attribute. ::attr(data-asin) reads the data-asin attribute. You can change that last part if you want to extract other information.
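If you want to try the selector outside a spider, here is a minimal sketch with scrapy.Selector and made-up markup (the real Amazon layout is more involved):
from scrapy import Selector

# Made-up sample mirroring the relevant structure of the results page
html = '''
<div class="s-main-slot s-result-list s-search-results sg-row">
  <div data-asin="B001ABC123">first result</div>
  <div data-asin="B002DEF456">second result</div>
  <div>an ad slot without an ASIN</div>
</div>
'''
sel = Selector(text=html)
asins = sel.css('div.s-search-results > div[data-asin]::attr(data-asin)').getall()
print(asins)  # ['B001ABC123', 'B002DEF456']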

How to get the href link from this a tag?

I successfully got the href links from the http://quotes.toscrape.com/ example by implementing:
response.css('div.quote > span > a::attr(href)').extract()
and it gives all the partial links inside the href of each a tag:
['/author/Albert-Einstein', '/author/J-K-Rowling', '/author/Albert-Einstein', '/author/Jane-Austen', '/author/Marilyn-Monroe', '/author/Albert-Einstein', '/author/Andre-Gide', '/author/Thomas-A-Edison', '/author/Eleanor-Roosevelt', '/author/Steve-Martin']
By the way, in the above example each a tag has this format:
<a href="/author/Albert-Einstein">(about)</a>
So I tried to do the same for this site: http://www.thegoodscentscompany.com/allproc-1.html
The problem here is that the style of the a tag is a bit different, like so:
<a onclick="...Window('http://www.thegoodscentscompany.com/data/rw1247381.html');..." href="#">formaldehyde</a>
As you can see, I can't get the link from the href using the method above. I want to get the link (http://www.thegoodscentscompany.com/data/rw1247381.html) from this a tag, but I could not make it work. How can I get this link?
Try this: response.css('a::attr(onclick)').re(r"Window\('(.*?)'\)")
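A self-contained sketch of that approach (the JavaScript function name in the sample markup is assumed; only the Window('...') part matters to the regex):
from scrapy import Selector

# Sample anchor imitating the site's onclick-based links (function name assumed)
html = """<a href="#" onclick="openMainWindow('http://www.thegoodscentscompany.com/data/rw1247381.html');return false;">formaldehyde</a>"""
sel = Selector(text=html)
links = sel.css('a::attr(onclick)').re(r"Window\('(.*?)'\)")
print(links)  # ['http://www.thegoodscentscompany.com/data/rw1247381.html']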

Beautiful Soup find first <a> whose title attribute equal a certain string

I'm working with Beautiful Soup and am trying to grab the first <a> tag on a page whose title attribute equals a certain string.
For example:
<a href="..." title="export">Export</a>
What I've been trying to do is grab the href of the first <a> that is found whose title is "export".
If I use soup.select("a[title='export']") then I end up finding all the tags that satisfy this requirement, not just the first.
If I use find("a", {"title": "export"}), with the conditions set such that the title should equal "export", then it grabs the actual items inside the tag, not the href.
If I write .get("href") after calling find(), I get None back.
I've been searching the documentation and Stack Overflow for an answer but have yet to find one. Does anyone know a solution to this? Thank you!
What I've been trying to do is grab the href of the first that is found whose title is "export".
You're almost there. All you need to do is, once you've obtained the tag, index it to get the href. Here's a slightly more bulletproof version:
try:
    url = soup.find('a', {'title': 'export'})['href']
    print(url)
except TypeError:
    pass
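For example, with some made-up markup:
from bs4 import BeautifulSoup

# Made-up markup; only the title attribute matters here
html = '''
<a href="/files/report.csv" title="export">Export CSV</a>
<a href="/files/report.xlsx" title="export">Export XLSX</a>
'''
soup = BeautifulSoup(html, 'html.parser')
try:
    url = soup.find('a', {'title': 'export'})['href']
    print(url)  # /files/report.csv -- find() stops at the first match
except TypeError:
    pass  # no matching tag: find() returned None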
Following the same topic: in the HTML file I would like to find just the patent number and title of the citations. I tried the code below, but it prints all the titles in the HTML file, and I specifically want the ones under the citations only.
import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry'
patent = requests.get(url).text
# print(patent)
soup = BeautifulSoup(patent, 'html.parser')
x = soup.select('tr[itemprop="backwardReferences"]')
y = soup.select('td[itemprop="title"]')  # searches the whole page, not just the citation rows
print(y)
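One way to narrow this down is to search for the titles inside each citation row rather than across the whole document. A sketch, assuming the itemprop names used in the page's microdata:
import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Scope the search to the citation rows instead of the whole page
for row in soup.select('tr[itemprop="backwardReferences"]'):
    number = row.select_one('[itemprop="publicationNumber"]')  # itemprop name assumed
    title = row.select_one('td[itemprop="title"]')
    if number and title:
        print(number.get_text(strip=True), title.get_text(strip=True))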

Using lxml and python how would I loop through all the divs within a div on a website?

For fun, I am trying to write a script in Python that goes through all the posts on the front page of a given subreddit. I have the following code:
from lxml import html
import requests

subredditURL = "https://www.reddit.com/r/" + "pics/"
subredditPage = requests.get(subredditURL)
subredditTree = html.fromstring(subredditPage.content)
subreddit_rows_xpath = subredditTree.xpath('//*[@id="siteTable"]')
for div in subreddit_rows_xpath:
    print(div)
Now, I thought the for loop would print out as many divs as there are posts on the page I am looking at, which for a typical subreddit front page would be 25 posts. The reason I thought this would work is that when I manually inspect the siteTable div, it seems to contain a series of 25 divs with XPaths of the following format:
//*[@id="thing_t3_63fuuy"]
where the id seems to be a random string, there is one of these divs for each post on the front page, and they contain the relevant information for each post I can explore.
Instead of printing out 25 divs, the code above returns:
<Element div at 0x110669f70>
implying only one div, not the 25 I expected. How am I going about this wrong?
Here is the link for the url I am exploring if that helps: https://www.reddit.com/r/pics/
The expression subredditTree.xpath('//*[@id="siteTable"]') returns a list with only 1 element, so iterating over it using:
for div in subreddit_rows_xpath:
    print(div)
only outputs 1 element, because that's all that exists. If you want to iterate over all of the div elements under subreddit_rows_xpath, you can use:
subreddit_table_divs = subredditTree.xpath('//*[@id="siteTable"]//div')
for div in subreddit_table_divs:
    print(div)
However, I am guessing you want more than just a bunch of lines that look like <Element div at 0x99999999999>. You probably want either the titles or the links to the posts.
To get the titles, you need to drill down two levels to the links:
subreddit_titles = subredditTree.xpath(
    '//*[@id="siteTable"]//div[@class="entry unvoted"]'
    '/p/a[@data-event-action="title"]/text()'
)
To get the links to the images, it is the same path; just grab the href attribute:
subreddit_links = subredditTree.xpath(
    '//*[@id="siteTable"]//div[@class="entry unvoted"]'
    '/p/a[@data-event-action="title"]/@href'
)
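As a quick usage check (assuming the two lists line up one-to-one), you could pair them up:
for title, link in zip(subreddit_titles, subreddit_links):
    print(title, '->', link)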
