I would like to get usernames from one page, but for some reason I just can't get it working..
After browsing the internet and other Stack Overflow posts, I think the problem is that there are white spaces in the @class value, so the match just doesn't work. I then found a solution that does it another way, but after the first class I would like to descend to a second class as well, and find_elements_by_xpath is the only way I know to do that.
Picture to inspect element of what I want to get
In this picture, 'text' is one of the usernames it should scrape.
My code:
usernames = driver.find_elements_by_xpath("//a[@class='kik-widget card card-just-text card-with-shadow']//h3[@class='kik-widget-text-muted']")
usernames2 = [x.text for x in usernames]
print(usernames2)
Any help much appreciated.
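Try using contains() in your case, so you don't have to match the whole class value including its spaces.
For example:
//h3[contains(@class, 'kik-widget-text-muted')]
contains() does a substring match on the attribute, so the other class names and the spaces between them no longer matter.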
Hope this will help you.
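One caveat: contains(@class, ...) is a plain substring test, so it would also match a class such as kik-widget-text-muted-2. Below is a minimal sketch of a stricter variant that matches whole class tokens. The helper names (has_class, by_classes, get_usernames) are made up for illustration, the class names come from the question, and the driver call assumes the Selenium 4 style find_elements(By.XPATH, ...).

```python
def has_class(name: str) -> str:
    """XPath predicate that matches one whole class token, not a substring."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


def by_classes(tag: str, *names: str) -> str:
    """XPath step for <tag> elements that carry every class in `names`."""
    predicates = "".join(f"[{has_class(n)}]" for n in names)
    return f"//{tag}{predicates}"


# Class names are taken from the question; the page structure is assumed.
USERNAME_XPATH = by_classes("a", "kik-widget", "card") + by_classes(
    "h3", "kik-widget-text-muted"
)


def get_usernames(driver):
    """Return the text of every matching <h3>; `driver` is a Selenium WebDriver."""
    # Imported lazily so the helpers above can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    return [el.text for el in driver.find_elements(By.XPATH, USERNAME_XPATH)]
```

If the class list on the page differs from the question's, only the arguments to by_classes need to change.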
I am receiving the href links on some links and others returned None value.
I have the following snippet to retrieve the first 16 items on a page:
def loop_artikelen():
    artikelen = driver.find_elements(By.XPATH, "//*[@id='content']/div[2]/ul/li")
    artikelen_lijst = []
    for artikel in artikelen[0:15]:
        titel = artikel.find_element(By.CLASS_NAME, 'hz-Listing-title').text
        prijs = artikel.find_element(By.CLASS_NAME, 'hz-Listing-price').text
        link = artikel.find_element(By.CLASS_NAME, 'hz-Listing-coverLink').get_attribute('href')
        #if link == "None":
        #    link = artikel.find_element(By.XPATH(".//a").get_attribute('href'))
        artikel = titel, prijs, link
        artikelen_lijst.append(artikel)
The output looks like this when I print it out:
('Fiets gestolen dus voor een mooi prijsje is ie van jou', '€ 400,00', None)
('Amslod middenmoter fiets', '€ 1.500,00', None)
('Batavus damesfiets', '€ 90,00', 'https://www.marktplaats.nl/v/fietsen-en-brommers/fietsen-dames-damesfietsen/m1933195519-batavus-damesfiets')
('Time edge', '€ 700,00', 'https://www.marktplaats.nl/v/fietsen-en-brommers/fietsen-racefietsen/m1933185638-time-edge')
I tried adding a time.sleep(2) between link and artikel, but it didn't work. As you can see in the commented-out lines (after "#"), I also tried something else, and that didn't work either.
Who can help me?
Thanks in advance
Link to site : https://www.marktplaats.nl/q/fiets/#offeredSince:Vandaag|sortBy:SORT_INDEX|sortOrder:DECREASING|
This seems like a diagnostic issue, and since it seems like you're pretty good with selenium already, I'll list some more general advice for helping with these sorts of problems rather than trying to solve this specific one. Not being able to find a web element is a very common problem. Here are some things that could be wrong:
1. The elements are not being found because your query is wrong. A lot of websites have messy HTML, for whatever reason. Sometimes something that looks like a list cannot be found with a single XPath query. Also, I highly recommend CSS selectors over XPath: most things you can locate with XPath can also be located with a CSS selector, and the selectors tend to be shorter and easier to get right.
2. The elements are not being found because they haven't been loaded. Either the page needs to be scrolled down, or it simply hasn't finished loading. You can try increasing the sleep timer to something like 1 minute to see if that's the problem, and/or manually scroll the page down during those 60 seconds.
I would try (2) first to see if that fixes your problem, since it is so easy to do, and only takes a minute.
EDIT: Since you mention "href links on some links and others returned None value": if Selenium can't find the element, it throws an exception. If it finds the element but the attribute is missing, get_attribute returns None. So the problem might be that it can find the element but can't find the href attribute (in other words, it is finding the wrong elements). Your problem is almost certainly that some of the elements are not links at all. I would recommend printing out all the elements you get to confirm that they are the ones you think they are. Also, try a CSS selector instead of XPath, because that will probably solve your problem.
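To act on the "print out all the elements" advice, something like the following could help. It is only a sketch: describe is a hypothetical helper, and it relies on nothing beyond the tag_name property and the get_attribute method that Selenium WebElements provide.

```python
def describe(elements, width=80):
    """Return one (tag, href, start of outer HTML) tuple per element for inspection."""
    report = []
    for el in elements:
        # outerHTML shows exactly which node was matched; it may be missing on fakes.
        html = el.get_attribute("outerHTML") or ""
        # href is None when the matched element has no such attribute.
        report.append((el.tag_name, el.get_attribute("href"), html[:width]))
    return report
```

A possible use, assuming the locator from the question: for row in describe(driver.find_elements(By.CLASS_NAME, 'hz-Listing-coverLink')): print(row). Any row whose tag is not "a", or whose href is None, points at an element that is not the link you expected.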
What you said is true: "Your problem is almost certainly that some of the elements are not links at all"
So I looked deeper into the selector and noticed the following: the element for which I get None is different from the one for which I get a value. But the difference is very small:
<a class="hz-Link hz-Link--block hz-Listing-coverLink" href="/v/spelcomputers-en-games/games-nintendo-2ds-en-3ds/m1934291750-animal-crossing-new-leaf-2ds-3ds?correlationId=6ffb1c0a-3d23-4a00-ab3b-a16587b61dea">
Versus
<a class="hz-Link hz-Link--block hz-Listing-coverLink" tabindex="0" href="/v/spelcomputers-en-games/games-nintendo-2ds-en-3ds/m1934287204-mario-kart-7-nintendo-3ds?correlationId=6ffb1c0a-3d23-4a00-ab3b-a16587b61dea">
The second one contains tabindex="0", and so do all the other elements that give a None value. So I tried to retrieve the URL by tabindex, but I didn't quite get this right. I tried an if statement that runs this line when the value is None:
link = artikel.find_element(By.ID, '0').get_attribute('href')
This didn't quite get the job done.
So I am wondering: how can I retrieve the element whose HTML contains tabindex="0"?
And this is where I realised you were wrong. I am actually not that proficient in Selenium :D
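A note on the By.ID attempt: By.ID with '0' looks for an element with id="0", not tabindex="0". Attributes other than id and class can be matched with a CSS attribute selector instead. The sketch below assumes the hz-Listing-coverLink class from the snippets above; cover_link_href is a hypothetical helper, and whether tabindex is really what causes the None values is not established here.

```python
# CSS attribute selector: anchors with the listing class AND tabindex="0".
COVER_LINK = "a.hz-Listing-coverLink"
COVER_LINK_TABINDEX_0 = COVER_LINK + "[tabindex='0']"


def cover_link_href(artikel, selector=COVER_LINK_TABINDEX_0):
    """Return the href of the first anchor under `artikel` matching `selector`, else None."""
    # Imported lazily so the selector constants can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    # find_elements returns an empty list instead of raising when nothing matches.
    anchors = artikel.find_elements(By.CSS_SELECTOR, selector)
    return anchors[0].get_attribute("href") if anchors else None
```

Passing selector=COVER_LINK instead would match the cover link regardless of tabindex, which makes it easy to compare both cases for the same listing.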
I'm trying to get string:
Liquidity (Including Fees)
from line
<div class="sc-bdVaJa KpMoH css-1ecm0so">Liquidity (Including Fees)</div>
I've tried the properties below, and none of them gave me the string that I want:
usdbaslik = driver.find_element_by_css_selector("[class='sc-bdVaJa KpMoH css-1ecm0so']")
print(usdbaslik.text,":---text")
print(usdbaslik.tag_name,":---tag_name")
print(usdbaslik.id,":---id")
print(usdbaslik.size,":---size")
print(usdbaslik.rect,":---rect")
print(usdbaslik.location,":---location")
print(usdbaslik.location_once_scrolled_into_view,":---location_once_scrolled_into_view")
print(usdbaslik.parent,":---parent")
print(usdbaslik.screenshot_as_png,":--screenshot_as_png")
print(usdbaslik.screenshot_as_base64,":--screenshot_as_base64")
print(usdbaslik.__class__,":--__class__")
What am I doing wrong?
Thanks in advance.
There is (at least) one other element with that class on the page, so it's not a unique selector. The closest thing I was able to find to a unique selector looking at the page would be
usdbaslik = driver.find_elements_by_xpath('//div[@class="sc-VigVT fKQdIL"]//div[@class="sc-bdVaJa KpMoH css-1ecm0so"]')[0]
Then you can get the text from the label with
print(usdbaslik.get_attribute('innerText'))
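Since the class names here are not unique, another option is to locate the label by its text rather than by class. This is a hedged sketch, not a tested fix for that page: exact_text_xpath and read_label are hypothetical names, and it assumes the label text appears in the page exactly as quoted in the question.

```python
def exact_text_xpath(tag: str, text: str) -> str:
    """XPath for <tag> elements whose whitespace-normalized text equals `text`."""
    # Assumes `text` contains no double quote; XPath 1.0 has no escape for it.
    return f'//{tag}[normalize-space(.)="{text}"]'


def read_label(driver, text="Liquidity (Including Fees)"):
    """Return the textContent of every <div> whose full text equals `text`."""
    # Imported lazily so exact_text_xpath() can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    matches = driver.find_elements(By.XPATH, exact_text_xpath("div", text))
    # textContent is available even when the element is not visible, unlike .text.
    # A wrapper <div> whose only text is the label will match as well.
    return [m.get_attribute("textContent") for m in matches]
```

If read_label returns an empty list, the text is probably not in the DOM yet (or sits inside a frame), which narrows down where to look next.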
I am making a crawling app with Selenium and Python, and I am stuck.
As the picture shows, I can select the text (underlined).
But what I need is the numbers next to the text.
In Chrome's F12 developer tools, the numbers (red circle) have a class name, but the class names are all the same.
As far as I know, there is no indicator I can use to select the numbers through Selenium.
So I tried to find a way to select an element through the HTML with Selenium, but I couldn't find any. Is there a way to do this?
If I am looking for something that does not exist, I am very sorry.
I only know Python and Selenium, so if I cannot handle this, please let me know.
---edit
I think my explanation was bad.
What I need is to find the text first, then collect the numbers (both of them).
But there is a lot of text; I only took a screenshot of a small part.
I can locate the texts by their specific ids (there are lots of them).
But how can I get the numbers that are next to the text?
That is my question. Sorry for the bad explanation.
And if BeautifulSoup can handle this, please let me know. Thanks for your help.
Special thanks to Christine; her code solved my problem.
You can use an XPath index to select the first td element. Given the screenshot, you can select the first td, containing 2,167, like this:
cell = driver.find_element_by_xpath("//tr[td/a[text()='TEXT']]/td[@class='txt-r'][1]")
print(cell.text)
You should replace TEXT with the characters you underlined in your screenshot -- I do not have this keyboard so I cannot type the text for you.
The above XPath will query on all table rows, pick the row with your desired text, then query on table cells with class txt-r within a row. Because the two td elements both have class txt-r, you only want to pick one of them, using an index indicated by [1]. The [1] will pick the first td, with text 2,167.
Full sample as requested by the user:
# first get all text on the page
all_text_elements = driver.find_elements_by_xpath("//a[contains(@class, 'link-resource')]")
# iterate text elements and print both numbers that are next to text
for text_element in all_text_elements:
    # get the text from web element
    text = text_element.text
    # find the first number next to it (2,167 from sample HTML)
    first_number = driver.find_element_by_xpath("//tr[td/a[text()='" + text + "']]/td[@class='txt-r'][1]")
    print(first_number.text)
    # find 2nd number (0 from sample HTML)
    second_number = driver.find_element_by_xpath("//tr[td/a[text()='" + text + "']]/td[@class='txt-r'][2]")
    print(second_number.text)
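The same idea can also be written row by row, which avoids building an XPath string from each text and works offline. The sketch below uses only the standard library on a made-up, well-formed stand-in for the table in the screenshot (SAMPLE and numbers_by_label are hypothetical). With Selenium, the per-row step would be row.find_elements(By.XPATH, "./td[@class='txt-r']") instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the table in the screenshot.
SAMPLE = """
<table>
  <tr><td><a class="link-resource">alpha</a></td><td class="txt-r">2,167</td><td class="txt-r">0</td></tr>
  <tr><td><a class="link-resource">beta</a></td><td class="txt-r">35</td><td class="txt-r">4</td></tr>
</table>
"""


def numbers_by_label(markup: str) -> dict:
    """Map each link text to the texts of the txt-r cells in the same row."""
    result = {}
    for row in ET.fromstring(markup).iter("tr"):
        link = row.find("td/a")
        if link is None or link.text is None:
            continue  # header or spacer row without a label
        # Relative lookup: only cells that belong to this row are considered.
        cells = row.findall("td[@class='txt-r']")
        result[link.text.strip()] = [(c.text or "").strip() for c in cells]
    return result


numbers = numbers_by_label(SAMPLE)
print(numbers["alpha"])
```

Keeping the lookup relative to the row means both numbers stay paired with their label even when two rows share the same text.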
I was wondering if there is a way for me to search for the first few characters of an id. E.g. using find(id=''), if an item's id was 'priceblock_ourprice', could I search just for items whose id starts with 'priceblock'?
I have searched for ways to do this, but my searches have been fruitless, and nothing I have tried has worked. Maybe something like this would work:
soup.find(id[0:9]="priceblock")
of course this didn't work but I was hoping someone has a fix, thanks in advance <3
You can use select_one with css selector, [id^='priceblock'] means id starts with priceblock:
soup.select_one("[id^='priceblock']")
So to find a word in a certain page I decided to do this:
bodies = browser.find_elements_by_xpath('//*[self::span or self::strong or self::b or self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6 or self::p]')
for i in bodies:
    bodytext = i.text
    if "description" in bodytext or "Description" in bodytext:
        print("found")
        print(i.text)
I believe the code itself is fine, but ultimately I get nothing. To be honest I am not sure what is happening, it just seems like it doesn't work.
Here is an example of a website it would be run on. Here is some of the page source:
<h2>product description</h2>
EDIT: It may be because the element is not in view, but I have tried to fix it with that in mind. Still, no luck.
Inspect the element and check whether it is inside an iframe. If it is, you need to switch to that frame first:
driver.switch_to.frame(frame_id)
You could try waiting a second or so to ensure the page has rendered and the expected element is visible with:
time.sleep(1)
Alternatively, try using IDs or custom data attributes to target the elements instead of XPath. Something like find_element_by_css_selector('.my_class') may work better.
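Putting the iframe suggestion together with the original search, a rough sketch could look like this. The names word_xpath and find_texts are hypothetical, the tag list is the one from the question, and the code assumes a Selenium 4 style driver. Only top-level iframes are visited, and the word is assumed to contain no quote characters.

```python
import string

TAGS = ("span", "strong", "b", "h1", "h2", "h3", "h4", "h5", "h6", "p")


def word_xpath(word: str, tags=TAGS) -> str:
    """XPath for elements of the given tags whose text contains `word`, ignoring case."""
    tag_test = " or ".join(f"self::{t}" for t in tags)
    # XPath 1.0 has no lower-case(); translate() maps A-Z onto a-z instead.
    lowered = f"translate(., '{string.ascii_uppercase}', '{string.ascii_lowercase}')"
    return f"//*[{tag_test}][contains({lowered}, '{word.lower()}')]"


def find_texts(driver, word: str) -> list:
    """Collect matching texts from the main document and from each top-level iframe."""
    # Imported lazily so word_xpath() can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    xpath = word_xpath(word)
    texts = [el.text for el in driver.find_elements(By.XPATH, xpath)]
    for frame in driver.find_elements(By.TAG_NAME, "iframe"):
        driver.switch_to.frame(frame)
        # Read the text while still inside the frame; the elements are not
        # usable once the driver has switched back.
        texts += [el.text for el in driver.find_elements(By.XPATH, xpath)]
        driver.switch_to.default_content()
    return texts
```

If find_texts(browser, "description") is still empty after the page has had time to load, the text is probably rendered in a nested frame or in a tag outside the list above.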