I'm trying to scrape Amazon. The idea is that I search for a product in the search box and, from the results, count the rank at which the product occurs on the search page, using the product's unique ASIN. I have been able to scrape the main div, but I'm unable to scrape the sub-divs under the main div which contain the search results.
In the picture you can see the main div class, which has sub-divs containing the unique ASIN numbers. How can I iterate over the sub-divs? I have tried response.xpath('//div[@class="s-main-slot s-result-list s-search-results sg-row"]') and response.css('.s-main-slot,.s-result-list,.s-search-results,.sg-row').extract(), but both seem to have some missing data and I can't iterate over them. I'm fairly new to Scrapy, so any help would be really appreciated, thanks.
With CSS, which I'm more familiar with, you can do it like this:
results = response.css('div.s-search-results > div[data-asin]::attr(data-asin)').getall()
for asin in results:
    print(asin)
Explanation
div.s-search-results targets the outer div. > div[data-asin] targets the divs directly inside the outer div that have the "data-asin" attribute. ::attr(data-asin) reads the data-asin attribute. You can change that last part if you want to extract other information.
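Since the original goal is the rank at which a given product occurs, here is a minimal sketch building on the selector above (the target ASIN below is a made-up placeholder; substitute your product's):
results = response.css('div.s-search-results > div[data-asin]::attr(data-asin)').getall()
target_asin = 'B07XXXXXXX'  # hypothetical ASIN; replace with your product's
if target_asin in results:
    rank = results.index(target_asin) + 1  # 1-based position on the page
    print(f'{target_asin} appears at rank {rank}')
else:
    print(f'{target_asin} not found on this page')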
Related
I am practicing web scraping with Python.
And I have a problem because of the structure of the webpage.
There are two <a> tags in the same list class.
I want to extract just the link of the post, and I want to know how I can extract this link without touching the other one.
Now, I write the code like this:
def extract_job(post):
    title = post.find("span", {"class": "title"})
    company = post.find("span", {"class": "company"})
    location = post.find("span", {"class": "region"})
    link = post.find
What do I have to put after find function?
Instead of using the find method, you can do it with CSS selectors and the select method. In the inspector, right-click the element you want to find the selector for, click Copy selector, and use that selector in your code.
link = post.select('put the copied selector here')
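For example, a sketch assuming the post link is the anchor inside the title span (this selector is a guess; replace it with the one you copied). Note that select returns a list, while select_one returns the first match or None:
link_tag = post.select_one("span.title a")  # hypothetical selector
link = link_tag["href"] if link_tag else None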
This is a tricky problem on my side: I got stuck on the web-scraping part and was not able to proceed further.
https://i.stack.imgur.com/r4tN2.png
I need only the grid-cell answers, in a loop.
I tried using
grid_cell=driver.find_element_by_css_selector('#tags-browser > div:nth-child(2) > div.mt-auto.grid.jc-space-between.fs-caption.fc-black-300 > div:nth-child(1)')
Displaying the text of the tag:
grid_cell.text
will show "2061748 questions", but this is only for one element.
What if I wanted to have it in a loop where I need all the count for all the tags available in that page?
In this case, as per the image, I iterated a for loop over 'javascript' and 'java', but find_element_by_css_selector would give a specific count for either java or javascript, not for both.
And if I instead choose
tag_counts = body.find_all('div', class_='grid_cell')
then I would also get the other classes that appear below grid-cell in the attached picture, which should be excluded.
Please suggest some solution. Any help would be appreciated.
There are 2 ways of achieving this:
First option:
Remove the tags you don't want to scrape and then scrape the tags that you do want. For example:
tags = body.find_all('div', class_='grid_cell s-anchor')  # TODO: add full class name (to remove this tag)
for tag in tags:
    tag.extract()  # Remove tag from body
tags = body.find_all('div', class_='grid_cell')  # This will contain all the tags you want.
Second option:
Loop through the parent html tag and get the first tag using find(). For example:
containers = body.find_all('div', class_='mt-auto grid')  # Find parent tag
for container in containers:
    tag = container.find('div', class_='grid_cell')  # Get first tag in the container div
    print(tag.text.strip())
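If you'd rather stay in Selenium, as in your original attempt, the same idea works with find_elements and a CSS selector that takes the first child of each container. A sketch, with the container and class names assumed from your screenshot:
# Selenium sketch; '#tags-browser .mt-auto' is assumed from the screenshot
cells = driver.find_elements_by_css_selector('#tags-browser .mt-auto > div:first-child')
for cell in cells:
    print(cell.text)  # e.g. "2061748 questions"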
I am trying to find the children div for a specific div on a website using beautifulSoup.
I took inspiration from this answer: Beautiful Soup find children for particular div.
However, when I want to retrieve all the content of the divs with class='row' whose parent div has class="container search-results-wrapper endless_page_template", as seen below, my problem is that it only retrieves the content of the first div with class='row'.
I am using the following code :
boatContainer = page_soup.find_all('div', class_='container search-results-wrapper endless_page_template')
for row in boatContainer:
    all_boats = row.find_all('div', class_='row')
    for boat in all_boats:
        print(boat.text)
I apply this on this website.
What can I do so that my solution retrieves the data of all the divs with class='row' that belong to the div with class='container search-results-wrapper endless_page_template'?
Use response.content instead of response.text.
You're also not requesting the correct url in your code. https://www.sailogy.com/en/search/?search_where=ibiza&trip_date=2020-06-06&weeks_count=1&skipper=False&search_src=home only displays a single boat, hence your code is only returning one row.
Use https://www.sailogy.com/en/search/?search_where=ibiza&trip_date=2020-06-06&weeks_count=1&guests_count=&order_by=-rank&is_roundtrip=&coupon_code=&skipper=None instead in this case.
You'll probably find it useful to adjust the url parameters to filter boats at some point!
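Putting both fixes together, a minimal sketch of your own loop with the corrected URL and response.content:
import requests
from bs4 import BeautifulSoup

# Corrected search URL from above
url = ("https://www.sailogy.com/en/search/?search_where=ibiza"
       "&trip_date=2020-06-06&weeks_count=1&guests_count="
       "&order_by=-rank&is_roundtrip=&coupon_code=&skipper=None")
page = requests.get(url)
page_soup = BeautifulSoup(page.content, "html.parser")  # .content, not .text

boatContainer = page_soup.find_all('div', class_='container search-results-wrapper endless_page_template')
for row in boatContainer:
    for boat in row.find_all('div', class_='row'):
        print(boat.text.strip())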
For fun I am trying to write a script in Python that goes through all the posts on the front page of a given subreddit. I have the following code:
from lxml import html
import requests

subredditURL = "https://www.reddit.com/r/" + "pics/"
subredditPage = requests.get(subredditURL)
subredditTree = html.fromstring(subredditPage.content)
subreddit_rows_xpath = subredditTree.xpath('//*[@id="siteTable"]')
for div in subreddit_rows_xpath:
    print(div)
Now I thought the for loop would print out as many divs as there are posts on the page I am looking at. For a typical subreddit front page this would be 25 posts. The reason I thought this would work is that when I manually inspect the siteTable div, it seems to contain a series of 25 divs, with XPaths of the following format, within the siteTable div:
//*[@id="thing_t3_63fuuy"]
where the id seems to be a random string and there is one of these divs for each post on the front page, and they contain relevant information for the post I can explore.
Instead of printing out 25 divs the code above returns:
<Element div at 0x110669f70>
Implying only one div, not the 25 I expected. How am I going about this wrong?
Here is the link for the url I am exploring if that helps: https://www.reddit.com/r/pics/
The expression subredditTree.xpath('//*[@id="siteTable"]') returns a list with only 1 element. So iterating over it using:
for div in subreddit_rows_xpath:
    print(div)
only outputs 1 element, because that's all that exists. If you want to iterate over all of the div elements under subreddit_rows_xpath, you can use:
subreddit_table_divs = subredditTree.xpath('//*[@id="siteTable"]//div')
for div in subreddit_table_divs:
    print(div)
However, I am guessing you want more than just a bunch of lines that look like <Element div at 0x99999999999>. You probably want either the title or the link of the posts.
To get the titles, you need to drill down two levels to the links:
subreddit_titles = subredditTree.xpath(
    '//*[@id="siteTable"]//div[@class="entry unvoted"]'
    '/p/a[@data-event-action="title"]/text()'
)
To get the links to the images, it is the same path, just grab the href attribute.
subreddit_links = subredditTree.xpath(
    '//*[@id="siteTable"]//div[@class="entry unvoted"]'
    '/p/a[@data-event-action="title"]/@href'
)
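Since both XPaths walk the same posts in document order, the two lists line up, and a small usage sketch can pair them:
# Pair each title with its link (the lists are parallel)
for title, link in zip(subreddit_titles, subreddit_links):
    print(title, '->', link)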
So this is how the HTML I'm parsing looks. It is all within a table and gets repeated multiple times, and I just want the href attribute values that are inside the divs with the attribute class="Special_Div_Name". All these divs are inside table rows, and there are lots of rows.
<tr>
  <div class="Special_Div_Name">
    text
  </div>
</tr>
What I want is only the href attribute values that end in ".mp3" that are inside the div with the attribute class="Special_Div_Name".
So far I was able to come up with this code:
download = soup.find_all('a', href=re.compile('.mp3'))
for text in download:
    hrefText = text['href']
    print(hrefText)
This code currently prints every href attribute value on the page that ends in ".mp3", and it's very close to doing exactly what I want. It's just that I only want the ".mp3"s that are inside that div class.
This minor adjustment should get you what you want:
special_divs = soup.find_all('div', {'class': 'Special_Div_Name'})
for div in special_divs:
    download = div.find_all('a', href=re.compile(r'\.mp3$'))
    for link in download:
        hrefText = link['href']
        print(hrefText)
Since Beautiful Soup accepts most CSS selectors with the .select() method, I'd suggest using the attribute selector [href$=".mp3"] in order to select a elements with an href attribute ending with .mp3.
Then you can just prepend the selector .Special_Div_Name in order to only select anchor elements that are descendants:
for a in soup.select('div.Special_Div_Name a[href$=".mp3"]'):
    print(a['href'])
In a more general case, if you would just like to select a elements with an [href] attribute that are a descendant of a div element, then you would use the selector div a[href]:
for a in soup.select('div a[href]'):
    print(a)
If you don't use the code above, then based on your original code you would select all the elements with a class of Special_Div_Name, then iterate over those elements and select the descendant anchor elements:
for div in soup.select('.Special_Div_Name'):
    for a in div.find_all('a', href=re.compile(r'\.mp3$')):
        print(a['href'])
As a side note, re.compile('.mp3') should be re.compile(r'\.mp3$'), since . has special meaning in a regular expression. In addition, you will also want the anchor $ in order to match at the end of the string (rather than anywhere in the string).
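A quick standalone demonstration of the difference:
import re

# '.mp3' lets the dot match any character and matches anywhere in the string
print(bool(re.search('.mp3', 'song.amp3x')))     # True: 'a' followed by 'mp3'
# r'\.mp3$' requires a literal '.mp3' at the very end
print(bool(re.search(r'\.mp3$', 'song.amp3x')))  # False
print(bool(re.search(r'\.mp3$', 'song.mp3')))    # True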