I have to click on each search result one by one from this URL:
Search Guidelines
I first extract the total number of results from the displayed text, so that I can set the upper limit for iteration:
upperlimit = driver.find_element_by_id("total_results")
number = int(upperlimit.text.split(' ')[0])
The loop is then defined as:
for i in range(1, number):
However, after going through the first 10 results on the first page, the list index goes out of range (probably because there are no more links to click). I need to click on "Next" to get the next 10 results, and so on until I'm done with all the search results. How can I go about doing that?
Any help would be appreciated!
The problem is that the value of the element with id total_results changes after the page is loaded: at first it contains 117, then it changes to 44.
Instead, here is a more robust approach. It processes page by page until there are no pages left:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Firefox()
url = 'http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true#/search/?searchText=bevacizumab&mode=&staticTitle=false&SEARCHTYPE_all2=true&SEARCHTYPE_all1=&SEARCHTYPE=GUIDANCE&TOPICLVL0_all2=true&TOPICLVL0_all1=&HIDEFILTER=TOPICLVL1&HIDEFILTER=TOPICLVL2&TREATMENTS_all2=true&TREATMENTS_all1=&GUIDANCETYPE_all2=true&GUIDANCETYPE_all1=&STATUS_all2=true&STATUS_all1=&HIDEFILTER=EGAPREFERENCE&HIDEFILTER=TOPICLVL3&DATEFILTER_ALL=ALL&DATEFILTER_PREV=ALL&custom_date_from=&custom_date_to=11-06-2014&PAGINATIONURL=%2FSearch.do%3FsearchText%40%40bevacizumab%26newsearch%40%40true%26page%40%40&SORTORDER=BESTMATCH'
driver.get(url)
page_number = 1
while True:
    try:
        link = driver.find_element_by_link_text(str(page_number))
    except NoSuchElementException:
        break

    link.click()
    print(driver.current_url)
    page_number += 1
Basically, the idea here is to keep getting the next page link until there is no such link (at which point a NoSuchElementException is thrown). Note that this works for any number of pages and results.
It prints:
http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page=1
http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page=2#showfilter
http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page=3#showfilter
http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page=4#showfilter
http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page=5#showfilter
There is no need even to programmatically press the Next button; if you look carefully, the URL just needs a page parameter when browsing other result pages:
url = "http://www.nice.org.uk/Search.do?searchText=bevacizumab&newsearch=true&page={}#showfilter"
for i in range(1, 5):  # pages 1-4; extend the range as needed
    driver.get(url.format(i))
    upperlimit = driver.find_element_by_id("total_results")
    number = int(upperlimit.text.split(' ')[0])
If you still want to programmatically press the Next button, you could use:
driver.find_element_by_class_name('next').click()
But I haven't tested that.
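If you do go that route, here is a minimal sketch (equally untested, and assuming the Next button keeps the class name next on every result page):
from selenium.common.exceptions import NoSuchElementException

while True:
    # scrape the results on the current page here
    try:
        next_button = driver.find_element_by_class_name('next')
    except NoSuchElementException:
        break  # no Next button left, so this was the last page
    next_button.click()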
This is my first time with Selenium, and the website I'm scraping (page) doesn't have a next page button. The pagination links don't change until you click the "..." link, which then shows the next set of 10 pagination links. How do I loop through the clicking?
I've seen a few answers online, but I couldn't adapt them to my code because the links only come in sets. This is the code:
from selenium.webdriver import Chrome
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
driver_path = r'Projects\Selenium Driver\chromedriver_win32'  # raw string so the backslashes stay literal
driver = Chrome(executable_path=driver_path)
driver.get('https://business.nh.gov/nsor/search.aspx')
drop_down = driver.find_element(By.ID, 'ctl00_cphMain_lstStates')
select = Select(drop_down)
select.select_by_visible_text('NEW HAMPSHIRE')
driver.find_element(By.ID, 'ctl00_cphMain_btnSubmit').click()
content = driver.find_elements(By.CSS_SELECTOR, 'table#ctl00_cphMain_gvwOffender a')
hrefs = []
for link_el in content:
    href = link_el.get_attribute('href')
    hrefs.append(href)
offenders_href = hrefs[:10]
pagination_links = driver.find_elements(By.CSS_SELECTOR, 'table#ctl00_cphMain_gvwOffender tbody tr td table tbody a')
With your current code, the next-page elements are already captured in the list content[10:]. And the "..." hyperlink on the last visible page number is actually the next page in the logical sequence. Using this fact, we can keep a current-page variable to track the page being visited, and use it to identify the right anchor element within content for the next page.
With do-while loop logic, and using your code to scrape the required elements, here is the primary code:
from time import sleep

offenders_href = list()
curr_page = 1
while True:
    # find all anchor tags within this table
    content = driver.find_elements(By.CSS_SELECTOR, 'table#ctl00_cphMain_gvwOffender a')
    hrefs = []
    for link_el in content:
        href = link_el.get_attribute('href')
        hrefs.append(href)
    offenders_href += hrefs[:10]
    curr_page += 1
    # find the anchor element for the next page
    for page_elem in content[10:]:
        if page_elem.get_attribute("href").endswith('$' + str(curr_page) + "')"):
            next_page = page_elem
            break
    else:
        # no anchor for the next page: last page reached, break out of the while
        break
    print(f'clicking {next_page.text}...')
    next_page.click()
    sleep(1)
I placed this code in a function launch_click_pages. Launching it with your URL, it is able to click through the pages (it kept going, but I stopped it at some point):
>>> launch_click_pages('https://business.nh.gov/nsor/search.aspx')
clicking 2...
clicking 3...
clicking 4...
clicking 5...
clicking 6...
clicking 7...
clicking 8...
clicking 9...
clicking 10...
clicking ......
clicking 12...
clicking 13...
clicking 14...
clicking 15...
^C
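For reference, a minimal sketch of what that launch_click_pages wrapper might look like (the function name comes from the text above; the driver setup and form handling are assumptions based on the question's code):
from time import sleep

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select


def launch_click_pages(url):
    # Hypothetical wrapper: open the search page, run the search from the
    # question's code, then click through every page with the loop above.
    driver = Chrome()  # assumes chromedriver is on the PATH
    driver.get(url)
    select = Select(driver.find_element(By.ID, 'ctl00_cphMain_lstStates'))
    select.select_by_visible_text('NEW HAMPSHIRE')
    driver.find_element(By.ID, 'ctl00_cphMain_btnSubmit').click()

    offenders_href = []
    curr_page = 1
    while True:
        content = driver.find_elements(By.CSS_SELECTOR, 'table#ctl00_cphMain_gvwOffender a')
        offenders_href += [a.get_attribute('href') for a in content[:10]]
        curr_page += 1
        for page_elem in content[10:]:
            if page_elem.get_attribute("href").endswith('$' + str(curr_page) + "')"):
                next_page = page_elem
                break
        else:
            break  # last page reached
        print(f'clicking {next_page.text}...')
        next_page.click()
        sleep(1)
    return offenders_href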
You can also try to execute the ASP.NET postback script directly, e.g. driver.execute_script("javascript:__doPostBack('ctl00$cphMain$gvwOffender','Page$5')"), and you will be redirected to the fifth page.
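A hedged sketch of that approach (the control and argument names come from the call above; the fixed sleep is a crude stand-in for a proper wait):
from time import sleep

from selenium.webdriver.common.by import By

# Jump straight to page 5 via the ASP.NET postback, then re-read the table.
driver.execute_script("__doPostBack('ctl00$cphMain$gvwOffender','Page$5')")
sleep(1)  # crude wait for the postback to finish re-rendering the grid
content = driver.find_elements(By.CSS_SELECTOR, 'table#ctl00_cphMain_gvwOffender a')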
I'm trying to grab the individual video lengths for all videos on one channel and store it in a list or something.
So first I tried Beautiful Soup with the requests library, doing findAll("div"), but I got nothing useful. None of the elements look at all like what Inspect Element shows on the YouTube channel page. Apparently that's because YouTube loads dynamically or something, so you have to use Selenium. I don't really know what that means, but anyway I tried Selenium and got this error:
Unable to locate element: {"method":"css selector","selector":"[id="video-title"]"}
from this code:
from selenium import webdriver
from selenium.webdriver.common.by import By
PATH = r"path\chromedriver.exe"  # placeholder path to chromedriver
driver = webdriver.Chrome(PATH)
driver.get(r"https://www.youtube.com/c/0214mex/videos?view=0&sort=dd&shelf_id=0")
print(driver.title)
search = driver.find_element(By.ID,"video-title")
print(search)
driver.quit()
I get the feeling I don't really understand how web scraping works. Usually if I wanted to grab elements from a webpage I'd just do the soup thing, findAll on div and then keep going down until I reached the a tag or whatever I needed. But I'm having no luck with doing that on YT channel pages.
Is there an easy way of doing this? I can clearly see the hierarchy when I do inspect element on the YouTube page. It goes:
body -> div id=content -> ytd-browse class... -> ytd-two-column-browse-results... -> div id=primary -> div id=contents -> div id=items -> div id=dismissible -> div id=details -> div id=meta -> h3 class... -> and inside an a tag there's all the information I need.
I'm probably naive for thinking that if I simply findAll on "div" it would just show me all the divs; I'd then go down to div id=meta, searchAll "h3", then search the "a" tags, and I'd have my info. But searching for "div" with findAll (in BeautifulSoup) returns none of those divs, and the ones it does come up with I can't even find in Inspect Element.
So yeah, I seem to be misunderstanding how findAll works. Can anyone provide a simple step-by-step way of getting the information I'm looking for? Is it impossible without using Selenium?
Problem explanation
YouTube is dynamic in nature, which basically means the more you scroll down, the more content it loads and shows.
Selenium has to work the same way: scroll down, then add the newly loaded titles to the list. The items typically take a few seconds to load, so an explicit wait will definitely help you get all the titles.
You need to maximize the window and add a short time.sleep(.5) for visibility and a bit of stability.
Also, since the page is dynamic, I have defined number_of_title_to_scrape = 100; you can put any sensible arbitrary number and the script should do the magic.
Solution
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
driver.get("https://www.youtube.com/c/0214mex/videos")
wait = WebDriverWait(driver, 20)

video_title = []
j = 0
number_of_title_to_scrape = 100
for _ in range(number_of_title_to_scrape):
    # re-query the title links; the list grows as more videos are loaded
    elements = driver.find_elements(By.XPATH, "//a[@id='video-title']")
    driver.execute_script("arguments[0].scrollIntoView(true);", elements[j])
    time.sleep(.5)
    title = wait.until(EC.visibility_of(elements[j]))
    print(title.text)
    video_title.append(title.text)
    j = j + 1
    if j == number_of_title_to_scrape:
        break
Imports:
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
Output:
【遊戯王】千年パズルが難易度最高レベルでマジで無理wwwww
【ルームツアー】3億円の豪邸の中ってどうなってるの?
コムドットと心霊スポット行ったらヤンキー達に道を塞がれた。
【事件】コムドットやまと。盗撮した人が変装したはじめしゃちょーでもキレ過ぎて分からない説wwwww
はじめしゃちょーで曲を作ってみた【2021年ver.】
3億円の豪邸を買いました!
【コアラのマーチ】新種のコアラ考えたら商品化されてしまいましたwwwww
ヨギボー100万円分買って部屋に全部置いてみた結果wwwww
自販機からコーラが止まらなくなりました
絶対にむせる焼きそばがあるらしいぞ!んなわけねえよ!
はじめしゃちょーがsumika「Lovers」歌ってみた【THE FIRST TAKE】
Mr.マリックのマジックを全部失敗させたら空気が...
ビビりの後輩を永遠にビックリさせてみたwwwwwwww
【泣くなはじめ】大家さん。今まで6年間ありがとうございました。
液体窒素を口の中に入れてみたwwwwww
ヒカルに1億円の掴み取りさせたら大変な事になってしまった。
ヒカルさんに縁切られました。電話してみます。
玄関に透明なアクリル板があったらぶつかる説
なんとかしてください。
【+1 FES 3rd STAGE】誰が最強?YouTuber瓦割り対決!
【実況付き】アナウンサーとニュースっぽく心霊スポット行ったら怖くない説wwwww
体重が軽過ぎる男がシーソーに乗った結果wwwww
糸電話を10年間待ち続けた結果…
愛車を人にあげる事になったので最後に改造しました。
【閲覧注意】ゴギブリを退治したいのに勢いだけで結局何もできないヤツ
【12Kg】巨大なアルミ玉の断面ってどうなってるの?切断します。
打つのに10年間かかる超スローボールを投げた結果…
人の家の前にジンオウガがいた結果wwwwwwww
【野球】高く打ち上げたボールを10年間待った結果…
シャンプー泡立ったままでイルカショー見に行ってみたwwwwwww
水に10年間潜り続けた男
ランニングマシン10年間走り続けた結果…
コロナ流行ってるけど親友の結婚式行ってきた。
10年間タピオカ吸い続けた結果...
【バスケ】スリーポイント10000回練習したぞ。
かくれんぼ10年間放置した結果...
【危険】24時間エスカレーター生活してみた。
1日ヒカキンさんの執事として働いてみたwwwww
人が食ってたパスタを液体窒素でカチカチにしといたwwwwwwwwww
1人だけ乗ってる観覧車が檻になってるドッキリwwwwwww
【検証】コカ・コーラが1番美味しいのはどんなシチュエーション?
はじめしゃちょーをアニメ風に描いてもらった結果wwwww
コーラの容器がブツブツになった結果wwwww #shorts
絶対にケガする重い扉を倒してみた結果wwwwww
ショートケーキの缶売ってたwwwww #shorts
ガチの事故物件で1日生活してみたら何か起こるの?
【初公開】はじめしゃちょーの1日密着動画。
コーラに油を混ぜてメントス入れたらすごい事になる?! #shorts
【拡散希望】河野大臣…コロナワクチンって本当に大丈夫なん…?
ヤバい服見つけたんだがwwwwwwwwww
ヒカキンさんにチャンネル登録者数抜かれました。
What if the classroom you were hiding in the locker was a women's changing room?
コーラがトゲトゲになった結果wwwwww #shorts
エヴァンゲリオンが家の前に立っていたら...?
夏の始まり。1人で豪華客船を貸し切ってみた。女と行きたかった。
【検証】大食いYouTuber VS オレと業務用調理器。どっちが早いの?
天気良いのにオレの家だけ雨降らせたらバレる?バレない?wwwww
3000円ガチャ発見!PS5当たるまで回したらヤバい金額にwwwwww
カラオケで入れた曲のラスサビ永遠ループドッキリwwwwwwww
【ラーメン】ペヤング超超超超超超大盛りペタマックスの新作出たwwwwwww食います
深夜に急に家に入ってくる配達員。
オレは社会不適合者なのか
巨大なクマさん買い過ぎちゃった!
GACKTさん。オレGACKTさんみたいになりたいっス。
100万円の世界最強のスピーカー買ったんやけど全てがヤバいwwwww
【奇妙】ヒカキンさんにしか見えない人が1日中めっちゃ倒れてたらどうする?
ヘリウムガス吸い過ぎたら一生声が戻らなくなるドッキリwwwwwwww
スマブラ世界最強の男 VS 何でもして良いはじめしゃちょー
山田孝之とはじめしゃちょーの質問コーナー!そして消えた200万円。
山田孝之さんにめちゃくちゃ怒られました。
ヒカキンじゃんけんで絶対チョキを出させる方法を発見wwwwwwww
6年ぶりに銅羅で起こされたら同じ反応するの?
バイト先の後輩だった女性と結婚しました。
フォーエイトはエイトフォーで捕まえられるの?
ジムが素敵な女の子だらけだったら限界超えてバーベルめっちゃ上がる説
同棲?
はじめしゃちょー。バイクを買う。
【実話】過去に女性化していた事を話します。
【近未来】自走するスーツケースを買いました。もう乗り物。
ジェットコースターのレールの上を歩いてみた。
バカな後輩ならパイの実がめっちゃ大きくなってても気づかねえよwwwwwwww
久しぶりに他人を怒鳴ったわ
【42万円】Amazonですごいモノが売ってたので買いました。そして人の家の前へ
ペヤングの最新作がヤバ過ぎて全部食べれませんでした。
人の家の前で日本刀を持ったクマさんがずっと待ってる動画
3Pシュートを10000回練習したらどれくらい上手くなるの?【〜5000回】
おい佐藤二朗。オレとやり合おうや。
ひとりぼっちの君へ。
これはオレのしたかった東京の生活じゃない。
巨大なクマのぬいぐるみを浮かせたい
バスケットボール100個で試合したらプロに勝てるんじゃね?
【閲覧注意】100デシベル以上でねるねるねるね作ったら日本1うるさい動画になったwwwwww
オレずっと筋トレ続けてたんスよ。
収録までさせて勝手にDr.STONEの世界にオレがいる話作ってみた
失禁マシーンってのがあるらしいぞwwwwwwwww
謎の部屋に閉じ込められました。おや?真ん中になにかあるぞ?
家に来てほしくないから看板作りました。
これが未来のサウナです。
【恐怖映像】オレの後輩がガチでクズすぎる
就活あるある【ゲストが豪華】
If you want a specific number of videos, go for a for loop as mentioned in the other answer. The code below will keep scrolling until you manually close the browser.
from selenium import webdriver

driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.implicitly_wait(20)
driver.get("https://www.youtube.com/c/0214mex/videos")

j = 0
try:
    while True:  # infinite loop, to keep scrolling
        # Find all the videos. Initially there are about 30; the list keeps
        # growing as we scroll down.
        videos = driver.find_elements_by_id("dismissible")
        # Scroll to the jth video; more videos get loaded and the list is
        # refreshed on the next iteration.
        driver.execute_script("arguments[0].scrollIntoView(true);", videos[j])
        video_name = videos[j].find_element_by_tag_name("h3").get_attribute("innerText")  # name of the jth video
        video_length = videos[j].find_element_by_xpath(".//span[@class='style-scope ytd-thumbnail-overlay-time-status-renderer']").text  # length of the jth video
        print("{}: {}-{}".format(j + 1, video_name, video_length))
        j += 1
except Exception as e:
    print(e)
    driver.quit()
Output: (Manually closed the browser)
1: 【遊戯王】千年パズルが難易度最高レベルでマジで無理wwwww-12:26
2: 【ルームツアー】3億円の豪邸の中ってどうなってるの?-8:20
3: コムドットと心霊スポット行ったらヤンキー達に道を塞がれた。-18:08
...
294: これがwww世界1巨大なwww人をダメにするソファwwwwww-8:06
295: 皆さまにお願いがあります。-4:18
Message: no such window: target window already closed
from unknown error: web view not found
I'm using Selenium to check if FB pages exist. When I enter the page title in the search bar it works fine, but after the second loop the name of the page gets appended to the previous search, and I can't find a way to clear it.
For example, it looks for xyz the first time, then it looks for xyzabc when I just want to look for abc this time.
How can I clear the search bar so I can enter the new input without the previous one?
Here is my code:
import time

from bs4 import BeautifulSoup

for page_target in df.page_name.values:
    time.sleep(3)
    inputElement = driver.find_element_by_name("q")
    inputElement.send_keys(page_target)
    inputElement.submit()
    time.sleep(5)
    html = driver.page_source
    text = BeautifulSoup(html, 'html.parser').get_text()
    title = text.find(page_target)  # str.find returns -1 when not found
    # if the page exists add 1 to the dict, otherwise -1
    if title >= 0:
        dic_holder[page_target] = 1
    else:
        dic_holder[page_target] = -1
    driver.find_element_by_name("q").clear()
    time.sleep(3)
You can use:
inputElement.clear()  # clear the previous search term
inputElement.send_keys("abc")  # enter the new search
Also, I guess you have a sticky search in your application, so I recommend calling clear() every time before you insert something into the search box.
A few ways to do it:
Use element.clear(). I see that you already tried it in your code; not sure why it didn't work, but I guess the element is not a text box or input element?
Use JavaScript: driver.execute_script('document.getElementsByName("q")[0].value=""')
Emulate Ctrl+A to select the existing text, so the next send_keys replaces it:
from selenium.webdriver.common.keys import Keys
elem.send_keys(Keys.CONTROL, 'a')
elem.send_keys("page 1")
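Put together with your loop, one iteration could look like this sketch (names reused from the question's code):
from selenium.webdriver.common.keys import Keys

inputElement = driver.find_element_by_name("q")
inputElement.send_keys(Keys.CONTROL, 'a')  # select any leftover text
inputElement.send_keys(page_target)        # typing replaces the selection
inputElement.submit()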
I have been wondering how I can tell when I am at the last page of an Amazon listing. I tried to get the last page number at the bottom of the screen, with nothing working, so I tried a different approach: checking whether the 'Next' button can still be clicked. Here is what I have so far; any ideas on why it won't go to the next page?
from selenium import webdriver


def next_page():  # renamed from next() so it doesn't shadow the built-in
    giveawayPage = 1
    while True:
        try:
            nextButton = driver.find_element_by_xpath('//*[@id="giveawayListingPagination"]/ul/li[7]')
        except:
            nextPageStatus = 'false'
            print('false')
        else:
            nextPageStatus = 'true'
            giveawayPage = giveawayPage + 1
            # rebuild the URL after incrementing, so it actually advances
            currentPageURL = 'https://www.amazon.com/ga/giveaways?pageId=' + str(giveawayPage)
            driver.get(currentPageURL)
        if nextPageStatus == 'false':
            break


if __name__ == '__main__':
    driver = webdriver.Chrome('./chromedriver')
    driver.get('https://www.amazon.com/ga/giveaways?pageId=1')
    next_page()
The reason this doesn't work is that, if you go to the last page of an Amazon Giveaway, the element you're selecting is still there; it's just not clickable. On most pages, the element looks like:
<li class="a-last">...</li>
On the last page, it looks instead like:
<li class="a-disabled a-last">...</li>
So rather than checking if the element exists, it might be better to check if the element has the class 'a-disabled'.
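A minimal sketch of that check (untested; the li.a-last selector is taken from the markup quoted above):
# Click Next until the <li> gains the a-disabled class, i.e. the last page.
next_li = driver.find_element_by_css_selector('li.a-last')
if 'a-disabled' in next_li.get_attribute('class'):
    print('last page reached')
else:
    next_li.click()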
The li for the Next button always exists, so there are several ways to check whether it is the last page (see the sketch after this list):
check whether the li has more than one class, i.e. not only a-last but also a-disabled:
//li[@class="a-last"]
check whether the <a> element still exists inside the li:
//*[@id="giveawayListingPagination"]/ul/li[7]/a
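For example, a sketch of that second check (untested; XPath from above):
from selenium.common.exceptions import NoSuchElementException

try:
    # On the last page the <li> no longer contains an <a>, so this lookup fails.
    next_link = driver.find_element_by_xpath('//*[@id="giveawayListingPagination"]/ul/li[7]/a')
    next_link.click()
except NoSuchElementException:
    print('last page reached')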
I'm new to Selenium and web scraping, and I'm trying to get information from the link: https://www.carmudi.com.ph/cars/civic/distance:50km/?sort=suggested
Here's a snippet of the code I'm using:
while max_pages > 0:
    results.extend(extract_content(driver.page_source))
    next_page = driver.find_element_by_xpath('//div[@class="next-page"]')
    driver.execute_script('arguments[0].click();', next_page)
    max_pages -= 1
When I try to print results, I always get (max_pages) copies of the same results from page 1. The "Next page" button is visible on the page, and when I try to find elements of the same class, it only shows 1 element. When I try getting the element by the exact XPath and performing the click action on it, it doesn't work either. I enclosed it in a try-except block, but there were no errors. Why might this be?
You are making this more complicated than it needs to be. There's no point in using JS clicks here... just use the normal Selenium clicks.
from selenium.common.exceptions import NoSuchElementException

while True:
    # do stuff on the page
    try:
        next_link = driver.find_element_by_css_selector("a[title='Next page']")
    except NoSuchElementException:
        break  # no Next link: last page reached
    next_link.click()
replace:
next_page = driver.find_element_by_xpath('//div[@class="next-page"]')
driver.execute_script('arguments[0].click();', next_page)
with:
driver.execute_script('next = document.querySelector(".next-page"); next.click();')
If you try next = document.querySelector(".next-page"); next.click(); in the browser console, you can see that it works.
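Whichever click you use, the original symptom (page 1's results repeated) suggests the results are read before the next page has rendered. A hedged sketch that waits for the old element to go stale before re-reading page_source (assuming the click replaces the DOM):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

next_page = driver.find_element_by_xpath('//div[@class="next-page"]')
driver.execute_script('arguments[0].click();', next_page)
# Wait until the clicked element is detached, i.e. the page re-rendered,
# before calling extract_content(driver.page_source) again.
WebDriverWait(driver, 10).until(EC.staleness_of(next_page))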