So, to find a word on a certain page, I decided to do this:
bodies = browser.find_elements_by_xpath('//*[self::span or self::strong or self::b or self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6 or self::p]')
for i in bodies:
    bodytext = i.text
    if "description" in bodytext or "Description" in bodytext:
        print("found")
        print(i.text)
I believe the code itself is fine, but ultimately I get nothing. To be honest, I am not sure what is happening; it just seems like it doesn't work.
Here is an example of the kind of website it would be run on, with some of the page source:
<h2>product description</h2>
EDIT: It may be because the element is not in view, but I have tried to fix it with that in mind. Still, no luck.
Inspect the element and check whether it is inside an iframe.
If it is, you need to switch to that frame first:
driver.switch_to.frame(frame_id)
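A minimal sketch of that, assuming the content sits in the page's first iframe (adjust the locator to your page):

# Hypothetical: switch into the first iframe, search there, then switch back
frames = browser.find_elements_by_tag_name("iframe")
if frames:
    browser.switch_to.frame(frames[0])  # or pass the frame's id/name
    print(browser.find_element_by_xpath("//h2").text)
    browser.switch_to.default_content()  # back to the main document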
You could try waiting briefly to ensure the page has rendered and the expected element is visible:
time.sleep(1)
Alternatively, try using IDs or custom data attributes to target the elements instead of XPath. Something like find_element_by_css_selector('.my_class') may work better.
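An explicit wait is usually more reliable than a fixed sleep; a minimal sketch, where the .my_class selector is a placeholder for your real locator:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the element to appear before reading it
element = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".my_class"))
)
print(element.text)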
I am retrieving the href values of some links, while others return None.
I have the following snippet to retrieve the first 16 items on a page:
def loop_artikelen():
    artikelen = driver.find_elements(By.XPATH, "//*[@id='content']/div[2]/ul/li")
    artikelen_lijst = []
    for artikel in artikelen[0:16]:  # [0:15] would only take 15 items
        titel = artikel.find_element(By.CLASS_NAME, 'hz-Listing-title').text
        prijs = artikel.find_element(By.CLASS_NAME, 'hz-Listing-price').text
        link = artikel.find_element(By.CLASS_NAME, 'hz-Listing-coverLink').get_attribute('href')
        #if link == "None":
        #    link = artikel.find_element(By.XPATH, ".//a").get_attribute('href')
        artikel = titel, prijs, link
        artikelen_lijst.append(artikel)
The output looks like this when I print it out:
('Fiets gestolen dus voor een mooi prijsje is ie van jou', '€ 400,00', None)
('Amslod middenmoter fiets', '€ 1.500,00', None)
('Batavus damesfiets', '€ 90,00', 'https://www.marktplaats.nl/v/fietsen-en-brommers/fietsen-dames-damesfietsen/m1933195519-batavus-damesfiets')
('Time edge', '€ 700,00', 'https://www.marktplaats.nl/v/fietsen-en-brommers/fietsen-racefietsen/m1933185638-time-edge')
I tried adding a time.sleep(2) between the link and artikel lines, but it didn't work. As you can see, I also tried something else (commented out after the "#"), but that didn't work either.
Who can help me?
Thanks in advance
Link to site : https://www.marktplaats.nl/q/fiets/#offeredSince:Vandaag|sortBy:SORT_INDEX|sortOrder:DECREASING|
This seems like a diagnostic issue, and since you seem pretty comfortable with Selenium already, I'll give some general advice for these sorts of problems rather than trying to solve this specific one. Not being able to find a web element is a very common problem. Here are some things that could be wrong:
1. The elements are not being found because your query is wrong. A lot of websites have screwy HTML, coded that way for whatever reason, and sometimes something that looks like a list cannot be found with a single XPath query. I also highly recommend using CSS selectors instead of XPath: almost anything that can be found with XPath can be found with a CSS selector, and it generally yields better results.
2. The elements are not being found because they haven't been loaded yet, either because the page needs to be scrolled down or because the website simply hasn't finished loading. You can try increasing the sleep timer to something like 60 seconds to see if that's the problem, and/or manually scroll the page down during those 60 seconds.
I would try (2) first to see if it fixes your problem, since it is so easy to do and only takes a minute.
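If you want to automate the scrolling from (2), a common sketch looks like this (the 2-second pause is arbitrary and may need tuning):

import time

# Scroll down in steps so lazily loaded items get a chance to render
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give new content time to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height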
EDIT: Since you mention that some links return an href and others return None: if Selenium can't find an element it throws an exception, but if it can't find an attribute it returns None. So it can find the element, but not the href attribute; in other words, it is finding the wrong elements. Your problem is almost certainly that some of the elements are not links at all. I would recommend printing out all the elements you get to confirm that they are the ones you think they are. Also, use CSS selectors instead of XPath, because that will probably solve your problem.
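For instance, the row locator from your snippet could be expressed as a CSS selector; a sketch, untested against the live page:

# Assumed equivalent of //*[@id='content']/div[2]/ul/li as a CSS selector
artikelen = driver.find_elements(By.CSS_SELECTOR, "#content > div:nth-of-type(2) > ul > li")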
What you said is true: "Your problem is almost certainly that some of the elements are not links at all."
So I looked deeper into the selector and saw the following: the anchor where it retrieves None is different from the one where it retrieves a value, but the difference is very small:
<a class="hz-Link hz-Link--block hz-Listing-coverLink" href="/v/spelcomputers-en-games/games-nintendo-2ds-en-3ds/m1934291750-animal-crossing-new-leaf-2ds-3ds?correlationId=6ffb1c0a-3d23-4a00-ab3b-a16587b61dea">
Versus
<a class="hz-Link hz-Link--block hz-Listing-coverLink" tabindex="0" href="/v/spelcomputers-en-games/games-nintendo-2ds-en-3ds/m1934287204-mario-kart-7-nintendo-3ds?correlationId=6ffb1c0a-3d23-4a00-ab3b-a16587b61dea">
The second one contains tabindex="0", and the same is true for all the other anchors returning None. So I tried to retrieve the URL by tabindex, but I didn't quite get it right. I tried an if statement where, if the value is None, this line runs:
link = artikel.find_element(By.ID, '0').get_attribute('href')
This didn't quite get the job done.
So I am wondering: how can I retrieve the href when the element contains a tabindex="0" value?
And this is where I realised you were wrong: I am actually not that proficient in Selenium :D
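For reference, one way to select such anchors is a CSS attribute selector; this is only a sketch (meant to sit inside the loop from the snippet above), and whether it resolves the None values depends on when the page populates the attribute:

# Hypothetical: target the anchor by its tabindex attribute inside the current row
link = artikel.find_element(By.CSS_SELECTOR, 'a[tabindex="0"]').get_attribute('href')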
Basic concepts I know:
find_element finds a single element, and we can use .text or get_attribute('href') to make the element readable. Since find_elements returns a list, we can't use .text or get_attribute('href') on it directly, otherwise it shows no such attribute.
To scrape readable information from find_elements, we can use a for loop:
vegetables_search = driver.find_elements(By.CLASS_NAME, "product-brief-wrapper")
for i in vegetables_search:
    print(i.text)
Here is my problem: when I use find_element, it shows the same result every time. I searched the problem on the internet, and the answer said it's because find_element only ever returns a single result. Here is my code, which hopes to grab the different URLs:
links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
But I don't know how to combine the results in pandas. If I run this code, the links column prints the same URL into the CSV file...
vegetables_search = driver.find_elements(By.CLASS_NAME, "product-brief-wrapper")
Product_name = []
links = []
for search in vegetables_search:
    Product_name.append(search.find_element(By.TAG_NAME, "h4").text)
    links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))

# use pandas to export the information
df = pd.DataFrame({'Product': Product_name, 'Link': links})
df.to_csv('name.csv', index=False)
print(df)
Certainly, if I use a for loop specifically, it shows different links. (That means my XPath is correct, right?)
product_link = driver.find_elements(By.XPATH, "//a[@rel='noopener']")
for i in product_link:
    print(i.get_attribute('href'))
My questions:
Besides using a for loop, how can I make find_elements readable, just like find_element(By.attribute, 'content').text?
How can I take my code a step further? I cannot print out the different URLs.
Thanks so much. ORZ
This is the HTML inspected from the website:
This line:
links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
should be changed to:
links.append(search.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href') will always search for the first element in the DOM matching the .//a[@rel='noopener'] XPath locator, while you want to find the match inside another element.
To do so, you need to replace the WebDriver object driver with the WebElement object search that you want to search inside, as shown above.
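Putting it together, the corrected loop might look like this (same names as your snippet):

vegetables_search = driver.find_elements(By.CLASS_NAME, "product-brief-wrapper")
Product_name = []
links = []
for search in vegetables_search:
    Product_name.append(search.find_element(By.TAG_NAME, "h4").text)
    # search within the current card instead of the whole DOM
    links.append(search.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))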
Scraping links should be a simple feat, usually just a matter of grabbing the href value of the a tag.
I recently came across this website (https://sunteccity.com.sg/promotions) where the href value of each item's a tag cannot be found, but the redirection still works. I'm trying to figure out a way to grab the items and their corresponding links. My typical Python Selenium code looks something like this:
all_items = bot.find_elements_by_class_name('thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
However, I can't seem to retrieve any href or onclick attributes, and I'm wondering if this is even possible. I noticed that I couldn't right-click and open the link in a new tab either.
Is there any way to get around this and grab the links of all these items?
Edit: Is there any way to retrieve all the links of the items on the pages?
i.e.
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
...
Edit: Adding an image of one such anchor tag for better clarity:
Reverse-engineering the JavaScript that takes you to the promotion pages (seen in https://sunteccity.com.sg/_nuxt/d4b648f.js) gives you a way to get all the links, which are based on the HappeningID. You can verify this by running the following in the JS console, which gives you the first promotion:
window.__NUXT__.state.Promotion.promotions[0].HappeningID
Based on that, you can create a Python loop to get all the promotions:
items = driver.execute_script("return window.__NUXT__.state.Promotion;")
base = "https://sunteccity.com.sg/promotions/"
for item in items["promotions"]:
    happening_id = str(item["HappeningID"])
    print(base + happening_id)
That generated the following output:
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
https://sunteccity.com.sg/promotions/764
https://sunteccity.com.sg/promotions/766
https://sunteccity.com.sg/promotions/762
https://sunteccity.com.sg/promotions/767
https://sunteccity.com.sg/promotions/732
https://sunteccity.com.sg/promotions/733
https://sunteccity.com.sg/promotions/735
https://sunteccity.com.sg/promotions/736
https://sunteccity.com.sg/promotions/737
https://sunteccity.com.sg/promotions/738
https://sunteccity.com.sg/promotions/739
https://sunteccity.com.sg/promotions/740
https://sunteccity.com.sg/promotions/741
https://sunteccity.com.sg/promotions/742
https://sunteccity.com.sg/promotions/743
https://sunteccity.com.sg/promotions/744
https://sunteccity.com.sg/promotions/745
https://sunteccity.com.sg/promotions/746
https://sunteccity.com.sg/promotions/747
https://sunteccity.com.sg/promotions/748
https://sunteccity.com.sg/promotions/749
https://sunteccity.com.sg/promotions/750
https://sunteccity.com.sg/promotions/753
https://sunteccity.com.sg/promotions/755
https://sunteccity.com.sg/promotions/756
https://sunteccity.com.sg/promotions/757
https://sunteccity.com.sg/promotions/758
https://sunteccity.com.sg/promotions/759
https://sunteccity.com.sg/promotions/760
https://sunteccity.com.sg/promotions/761
https://sunteccity.com.sg/promotions/763
https://sunteccity.com.sg/promotions/765
https://sunteccity.com.sg/promotions/730
https://sunteccity.com.sg/promotions/734
https://sunteccity.com.sg/promotions/623
You are using the wrong locator; it brings back a lot of irrelevant elements.
Instead of find_elements_by_class_name('thumb-img'), please try find_elements_by_css_selector('.collections-page .thumb-img'), so your code will be:
all_items = bot.find_elements_by_css_selector('.collections-page .thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
You can also get the desired links directly with the .collections-page .thumb-img a locator, so your code could be:
links = bot.find_elements_by_css_selector('.collections-page .thumb-img a')
for link in links:
    print(link.get_attribute("href"))
I am trying to get the href with Selenium and Python.
This is my page:
Some class information changes depending on the element, so I am basically trying to get every href for <a id="job____ .....
links.append(job.find_element_by_xpath('//a[@aria-live="polite"]//span').get_attribute("href"))
I tried a couple of things but can't figure it out. How can I get all the hrefs from the screenshot above?
Try this, but watch your XPath:
"//a[@aria-live="polite"]//span"
will get a span, and I don't see any span with an href in your HTML. Maybe this XPath solves it:
//a[./span[@aria-live="polite"]]
links.append(job.find_element_by_xpath('//a[./span[@aria-live="polite"]]').get_attribute("href"))
But it won't get all the URLs. For that, use find_elements (which returns a list) and extend your URL list with a list comprehension:
links.extend([x.get_attribute("href") for x in job.find_elements_by_xpath('//a[./span[@aria-live="polite"]]')])
Edit 1: another XPath solution:
links.extend(["website_base_url"+x.get_attribute("href") for x in job.find_elements_by_xpath('//a[contains(#id, "job_")]')])
list_of_elements_with_href = wd.find_elements_by_xpath("//a[contains(@href,'')]")
for el_with_href in list_of_elements_with_href:
    links.append(el_with_href.get_attribute("href"))
or if you need to be more specific:
list_of_elements_with_href = wd.find_elements_by_xpath("//a[contains(@href,'') and contains(@id,'job_')]")
Based on your description and the attached image, I think you have the wrong XPath. Try the following code:
find_links = driver.find_elements_by_xpath("//a[starts-with(@id,'job_')]")
links = []
for link in find_links:
    links.append(link.get_attribute("href"))
Please note the find_elements_by_xpath (plural) instead of find_element.
I am unable to test this solution as you have not provided the website.
I want to scrape a website called fundsnetservices.com. Specifically, I want to grab the text below each program, which is about a paragraph's worth of text.
Using Chrome's Inspect tool, I was able to pull this...
'/html/body/div[3]/div/div/div[1]/div/p[2]/text()'
... as the XPath. However, every time I print the text out, it returns []. Why might this be?
import urllib.request
from lxml import etree

response = urllib.request.urlopen('http://www.fundsnetservices.com/searchresult/30/International-Grants-&-Funders/18.html')
tree = etree.HTML(response.read().decode('utf-16'))
text = tree.xpath('/html/body/div[3]/div/div/div[1]/div/p[2]/text()')
It seems your XPath returns whitespace-only text nodes. Correct your XPath to:
//p[@class="tdclass"]/text()[3]
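Applied to your snippet, that might look like this (a sketch; the tdclass class name and the text()[3] index come from the answer above and may need adjusting for the real page):

import urllib.request
from lxml import etree

# Fetch the page and select the third text node of each matching paragraph
response = urllib.request.urlopen('http://www.fundsnetservices.com/searchresult/30/International-Grants-&-Funders/18.html')
tree = etree.HTML(response.read().decode('utf-16'))
for text in tree.xpath('//p[@class="tdclass"]/text()[3]'):
    print(text.strip())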