I've used BeautifulSoup to find a specific div class in the page's HTML. I want to check whether this div has a span of a certain class inside it. If it does, I want to keep the div in the page's code; if it doesn't, I want to delete it, maybe using Selenium.
For that I have two lists selecting the elements (div and span). I tried to check if one list is inside the other, and that kind of worked. But how can I delete the found element from the page's source code?
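For reference, the per-div check I mean looks roughly like this (a minimal sketch; the class names _99s5 and tp-logo are the ones from the code below, so treat them as placeholders for the real page):
from bs4 import BeautifulSoup

# Sketch only: for each target div, check whether it contains the span.
# driver.page_source assumes a Selenium driver is already on the page.
soup = BeautifulSoup(driver.page_source, "html.parser")
for div in soup.find_all("div", class_="_99s5"):
    if div.find("span", class_="tp-logo") is None:
        print("this div has no span, so it should be removed from the page")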
Edit
I've edited the code after a few conversations in the comments section. With help, I was able to implement code that removes elements by executing JavaScript.
The code runs with no errors, but nothing is being deleted from the page.
# Import required module
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
# Option to launch browser in incognito
options = Options()
options.add_argument("--incognito")
#options.add_argument("--headless")
# Using chrome driver
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
# Web page url request
driver.get('https://www.facebook.com/ads/library/?active_status=all&ad_type=all&country=BR&q=frete%20gr%C3%A1tis%20aproveite&sort_data[direction]=desc&sort_data[mode]=relevancy_monthly_grouped&search_type=keyword_unordered&media_type=all')
driver.maximize_window()
time.sleep(10)
driver.execute_script("""
for(let div of document.querySelectorAll('div._99s5')){
    let match = div.innerText.match(/(\d+) ads? use this creative and text/)
    let numAds = match ? parseInt(match[1]) : 0
    if(numAds < 10){
        div.querySelector(".tp-logo")?.remove()
    }
}
""")
Since you're deleting them in JavaScript anyway:
driver.execute_script("""
for(let div of document.querySelectorAll('div._99s5')){
    let match = div.innerText.match(/(\d+) ads? use this creative and text/)
    let numAds = match ? parseInt(match[1]) : 0
    if(numAds < 10){
        div.querySelector(".tp-logo")?.remove()
    }
}
""")
Note: The question and comments read a bit confusingly, so it would be great to improve them. Assuming you would like to decompose() some elements, the reason why, and what to do after this action, is not clear. So this answer will only point out an approach.
To decompose() the elements that do not contain "ads use this creative and text", just negate your selection and iterate the ResultSet:
for e in soup.select('div._99s5:has(:not(:-soup-contains("ads use this creative and text")))'):
    e.decompose()
Now these elements will no longer be included in your soup, and you can process it as needed.
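Putting it together, a minimal end-to-end sketch (assuming the HTML comes from Selenium's page_source; note this prunes the parsed copy, not the live page in the browser):
from bs4 import BeautifulSoup

# Parse the rendered page, drop the unwanted divs, and keep the result.
soup = BeautifulSoup(driver.page_source, "html.parser")
for e in soup.select('div._99s5:has(:not(:-soup-contains("ads use this creative and text")))'):
    e.decompose()
cleaned_html = str(soup)  # the pruned document, ready for whatever comes next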
Related
I'm trying to grab the individual video lengths for all videos on one channel and store it in a list or something.
So first I tried Beautiful Soup, using the requests library and doing findAll("div"), but I got nothing useful. None of the elements look anything like what Inspect Element shows on the YouTube channel page. Apparently that's because YouTube loads its content dynamically, so you have to use Selenium. I don't really know what that means, but anyway I tried Selenium and got this error:
Unable to locate element: {"method":"css selector","selector":"[id="video-title"]"}
from this code:
from selenium import webdriver
from selenium.webdriver.common.by import By
PATH = r"path\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get(r"https://www.youtube.com/c/0214mex/videos?view=0&sort=dd&shelf_id=0")
print(driver.title)
search = driver.find_element(By.ID,"video-title")
print(search)
driver.quit()
I get the feeling I don't really understand how web scraping works. Usually if I wanted to grab elements from a webpage I'd just do the soup thing: findAll on div, then keep going down until I reached the a tag or whatever I needed. But I'm having no luck doing that on YT channel pages.
Is there an easy way of doing this? I can clearly see the hierarchy when I do inspect element on the YouTube page. It goes:
body -> div id=content -> ytd-browse class... -> ytd-two-column-browse-results... -> div id=primary -> div id=contents -> div id =items -> div id = dismissible -> div id =details -> div id=meta -> h3 class... -> and inside an a tag there's all the information I need.
I'm probably naive for thinking that if I simply findAll on "div" it would just show me all the divs; I'd then go to the last div (id=meta), findAll on "h3", then search the "a" tags, and I'd have my info. But searching for "div" with findAll (in BeautifulSoup) turns up none of those divs, and the ones it does come up with I can't even find in the Inspect Element panel.
So yeah, I seem to be misunderstanding how the findAll thing works. Can anyone provide a simple step-by-step way of getting the information which I'm looking for? Is it impossible without using selenium?
Problem explanation
YouTube is dynamic in nature, which basically means that the more you scroll down, the more content it shows.
Selenium has to do the same thing: scroll down, then add the newly loaded titles to the list. The titles typically take a few seconds to load, so an explicit wait will definitely help you collect them all.
You need to maximize the window and add time.sleep(.5) for visibility and a bit of stability. The loop below grabs one title per iteration; you can use any sensible number of iterations.
Since the page is dynamic, I have defined number_of_title_to_scrape = 100; you can try your desired number as well.
Solution
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
driver.get("https://www.youtube.com/c/0214mex/videos")
wait = WebDriverWait(driver, 20)
video_title = []
len_of_videos = len(driver.find_elements(By.XPATH, "//a[@id='video-title']"))  # titles currently loaded
j = 0
number_of_title_to_scrape = 100
for i in range(number_of_title_to_scrape):
    # Re-query every iteration, since the list grows as the page lazy-loads.
    elements = driver.find_elements(By.XPATH, "//a[@id='video-title']")
    driver.execute_script("arguments[0].scrollIntoView(true);", elements[j])
    time.sleep(.5)
    title = wait.until(EC.visibility_of(elements[j]))
    print(title.text)
    video_title.append(title.text)
    j = j + 1
    if j == number_of_title_to_scrape:
        break
Imports:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Output:
【遊戯王】千年パズルが難易度最高レベルでマジで無理wwwww
【ルームツアー】3億円の豪邸の中ってどうなってるの?
コムドットと心霊スポット行ったらヤンキー達に道を塞がれた。
【事件】コムドットやまと。盗撮した人が変装したはじめしゃちょーでもキレ過ぎて分からない説wwwww
はじめしゃちょーで曲を作ってみた【2021年ver.】
3億円の豪邸を買いました!
【コアラのマーチ】新種のコアラ考えたら商品化されてしまいましたwwwww
ヨギボー100万円分買って部屋に全部置いてみた結果wwwww
自販機からコーラが止まらなくなりました
絶対にむせる焼きそばがあるらしいぞ!んなわけねえよ!
はじめしゃちょーがsumika「Lovers」歌ってみた【THE FIRST TAKE】
Mr.マリックのマジックを全部失敗させたら空気が...
ビビりの後輩を永遠にビックリさせてみたwwwwwwww
【泣くなはじめ】大家さん。今まで6年間ありがとうございました。
液体窒素を口の中に入れてみたwwwwww
ヒカルに1億円の掴み取りさせたら大変な事になってしまった。
ヒカルさんに縁切られました。電話してみます。
玄関に透明なアクリル板があったらぶつかる説
なんとかしてください。
【+1 FES 3rd STAGE】誰が最強?YouTuber瓦割り対決!
【実況付き】アナウンサーとニュースっぽく心霊スポット行ったら怖くない説wwwww
体重が軽過ぎる男がシーソーに乗った結果wwwww
糸電話を10年間待ち続けた結果…
愛車を人にあげる事になったので最後に改造しました。
【閲覧注意】ゴギブリを退治したいのに勢いだけで結局何もできないヤツ
【12Kg】巨大なアルミ玉の断面ってどうなってるの?切断します。
打つのに10年間かかる超スローボールを投げた結果…
人の家の前にジンオウガがいた結果wwwwwwww
【野球】高く打ち上げたボールを10年間待った結果…
シャンプー泡立ったままでイルカショー見に行ってみたwwwwwww
水に10年間潜り続けた男
ランニングマシン10年間走り続けた結果…
コロナ流行ってるけど親友の結婚式行ってきた。
10年間タピオカ吸い続けた結果...
【バスケ】スリーポイント10000回練習したぞ。
かくれんぼ10年間放置した結果...
【危険】24時間エスカレーター生活してみた。
1日ヒカキンさんの執事として働いてみたwwwww
人が食ってたパスタを液体窒素でカチカチにしといたwwwwwwwwww
1人だけ乗ってる観覧車が檻になってるドッキリwwwwwww
【検証】コカ・コーラが1番美味しいのはどんなシチュエーション?
はじめしゃちょーをアニメ風に描いてもらった結果wwwww
コーラの容器がブツブツになった結果wwwww #shorts
絶対にケガする重い扉を倒してみた結果wwwwww
ショートケーキの缶売ってたwwwww #shorts
ガチの事故物件で1日生活してみたら何か起こるの?
【初公開】はじめしゃちょーの1日密着動画。
コーラに油を混ぜてメントス入れたらすごい事になる?! #shorts
【拡散希望】河野大臣…コロナワクチンって本当に大丈夫なん…?
ヤバい服見つけたんだがwwwwwwwwww
ヒカキンさんにチャンネル登録者数抜かれました。
What if the classroom you were hiding in the locker was a women's changing room?
コーラがトゲトゲになった結果wwwwww #shorts
エヴァンゲリオンが家の前に立っていたら...?
夏の始まり。1人で豪華客船を貸し切ってみた。女と行きたかった。
【検証】大食いYouTuber VS オレと業務用調理器。どっちが早いの?
天気良いのにオレの家だけ雨降らせたらバレる?バレない?wwwww
3000円ガチャ発見!PS5当たるまで回したらヤバい金額にwwwwww
カラオケで入れた曲のラスサビ永遠ループドッキリwwwwwwww
【ラーメン】ペヤング超超超超超超大盛りペタマックスの新作出たwwwwwww食います
深夜に急に家に入ってくる配達員。
オレは社会不適合者なのか
巨大なクマさん買い過ぎちゃった!
GACKTさん。オレGACKTさんみたいになりたいっス。
100万円の世界最強のスピーカー買ったんやけど全てがヤバいwwwww
【奇妙】ヒカキンさんにしか見えない人が1日中めっちゃ倒れてたらどうする?
ヘリウムガス吸い過ぎたら一生声が戻らなくなるドッキリwwwwwwww
スマブラ世界最強の男 VS 何でもして良いはじめしゃちょー
山田孝之とはじめしゃちょーの質問コーナー!そして消えた200万円。
山田孝之さんにめちゃくちゃ怒られました。
ヒカキンじゃんけんで絶対チョキを出させる方法を発見wwwwwwww
6年ぶりに銅羅で起こされたら同じ反応するの?
バイト先の後輩だった女性と結婚しました。
フォーエイトはエイトフォーで捕まえられるの?
ジムが素敵な女の子だらけだったら限界超えてバーベルめっちゃ上がる説
同棲?
はじめしゃちょー。バイクを買う。
【実話】過去に女性化していた事を話します。
【近未来】自走するスーツケースを買いました。もう乗り物。
ジェットコースターのレールの上を歩いてみた。
バカな後輩ならパイの実がめっちゃ大きくなってても気づかねえよwwwwwwww
久しぶりに他人を怒鳴ったわ
【42万円】Amazonですごいモノが売ってたので買いました。そして人の家の前へ
ペヤングの最新作がヤバ過ぎて全部食べれませんでした。
人の家の前で日本刀を持ったクマさんがずっと待ってる動画
3Pシュートを10000回練習したらどれくらい上手くなるの?【〜5000回】
おい佐藤二朗。オレとやり合おうや。
ひとりぼっちの君へ。
これはオレのしたかった東京の生活じゃない。
巨大なクマのぬいぐるみを浮かせたい
バスケットボール100個で試合したらプロに勝てるんじゃね?
【閲覧注意】100デシベル以上でねるねるねるね作ったら日本1うるさい動画になったwwwwww
オレずっと筋トレ続けてたんスよ。
収録までさせて勝手にDr.STONEの世界にオレがいる話作ってみた
失禁マシーンってのがあるらしいぞwwwwwwwww
謎の部屋に閉じ込められました。おや?真ん中になにかあるぞ?
家に来てほしくないから看板作りました。
これが未来のサウナです。
【恐怖映像】オレの後輩がガチでクズすぎる
就活あるある【ゲストが豪華】
If you want a specific number of videos, go for a for loop as mentioned in the other answer. The code below will keep scrolling until the browser is closed manually.
from selenium import webdriver

driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.implicitly_wait(20)
driver.get("https://www.youtube.com/c/0214mex/videos")
j = 0
try:
    while True:  # Infinite loop, to keep scrolling.
        videos = driver.find_elements_by_id("dismissible")  # Find all the videos. Initially there are 30; the count keeps increasing as we scroll down.
        driver.execute_script("arguments[0].scrollIntoView(true);", videos[j])  # Scroll to each video; more videos load and the list is updated every time.
        video_name = videos[j].find_element_by_tag_name("h3").get_attribute("innerText")  # Get the name of the jth video.
        video_length = videos[j].find_element_by_xpath(".//span[@class='style-scope ytd-thumbnail-overlay-time-status-renderer']").text  # Get the length of the jth video.
        print("{}: {}-{}".format(j + 1, video_name, video_length))  # Print in a specific format.
        j += 1
except Exception as e:
    print(e)
    driver.quit()
Output: (Manually closed the browser)
1: 【遊戯王】千年パズルが難易度最高レベルでマジで無理wwwww-12:26
2: 【ルームツアー】3億円の豪邸の中ってどうなってるの?-8:20
3: コムドットと心霊スポット行ったらヤンキー達に道を塞がれた。-18:08
...
294: これがwww世界1巨大なwww人をダメにするソファwwwwww-8:06
295: 皆さまにお願いがあります。-4:18
Message: no such window: target window already closed
from unknown error: web view not found
The following code scrapes the names, company, and location of users on LinkedIn.
I want the link/href per user.
The code requires login credentials for LinkedIn; you can use a fake account if skeptical.
Or you can just look at the code/screenshot; anything helps.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
productlinks=[]
test1=[]
options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install())
url = "https://www.linkedin.com/uas/login?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fsearch%2Fresults%2Fpeople%2F%3FcurrentCompany%3D%255B%25221252860%2522%255D%26geoUrn%3D%255B%2522103644278%2522%255D%26keywords%3Dsales%26origin%3DFACETED_SEARCH%26page%3D2&fromSignIn=true&trk=cold_join_sign_in"
driver.get(url)
time.sleep(2)
username = driver.find_element_by_id('username')
username.send_keys('jazizi@lifesciencedynamics.com')
password = driver.find_element_by_id('password')
password.send_keys('Theboss3!')
password.submit()
element1 = driver.find_elements_by_class_name("name actor-name")
title=[t.text for t in element1]
print(title)
First, the worst thing you can do in web scraping is locate elements by class, because in web development class is used for almost any style decoration. Try XPath or an id instead.
The second thing I noticed in your code: you find elements by class name, and the parameter is the multi-class string name actor-name. I haven't read the code nor tried running it, so I don't understand how it works at this moment. But you should be aware of it, because in web development class="name actor-name" and class="actor-name name" are almost the same (I did say almost; this is the second time I am mentioning it), while in web scraping they are entirely different.
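To make that concrete: a compound class such as name actor-name cannot be passed to find_elements_by_class_name, but it can be expressed as a CSS selector or as an XPath that checks each class token. A sketch using the same old-style Selenium API as the question:
# Chain the two classes in a CSS selector; order does not matter here.
elements = driver.find_elements_by_css_selector(".name.actor-name")

# XPath equivalent, testing each class token individually:
elements = driver.find_elements_by_xpath(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' name ') "
    "and contains(concat(' ', normalize-space(@class), ' '), ' actor-name ')]"
)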
I think this will be better done using BeautifulSoup, but if you post more code about the page source it will be easier to help you.
With bs4 you can get the whole HTML structure of an element and then, maybe with a regex, get the href attribute.
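A rough sketch of that idea (no regex is actually needed, since bs4 exposes attributes directly; filter the resulting list down to profile URLs as required):
from bs4 import BeautifulSoup

# Parse the rendered page and collect every link's href attribute.
soup = BeautifulSoup(driver.page_source, "html.parser")
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)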
Note: I am dealing with this website in particular.
How can I use selenium with Python to get the reviews on this page to sort by 'Most recent'?
What I tried was:
driver.find_element_by_id('sort-order-dropdown').send_keys('Most recent')
This, taken from another post, didn't cause any error but didn't work.
Then I tried
from selenium.webdriver.support.ui import Select
select = Select(driver.find_element_by_id('sort-order-dropdown'))
select.select_by_value('recent')
select.select_by_visible_text('Most recent')
select.select_by_index(1)
I've got: Message: Element <select id="sort-order-dropdown" class="a-native-dropdown" name=""> is not clickable at point (66.18333435058594,843.7999877929688) because another element <span class="a-dropdown-prompt"> obscures it
This one:
element = driver.find_element_by_id('sort-order-dropdown')
element.click()
li = driver.find_elements_by_css_selector('#sort-order-dropdown > option:nth-child(2)')
li.click()
This also came from another post and caused the same error message.
And this one, also from another post, caused the same error as well:
Select(driver.find_element_by_id('sort-order-dropdown')).select_by_value('recent').click()
So, I'm curious to know if there is any way that I can select the reviews to sort from the most recent first.
Thank you
This worked for me using Java:
@Test
public void amazonTest() throws InterruptedException {
    String URL = "https://www.amazon.com/Harry-Potter-Slytherin-Wall-Banner/product-reviews/B01GVT5KR6/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews";
    String menuSelector = ".a-dropdown-prompt";
    String menuItemSelector = ".a-dropdown-common .a-dropdown-item";
    driver.get(URL);
    Thread.sleep(2000);
    WebElement menu = driver.findElement(By.cssSelector(menuSelector));
    menu.click();
    List<WebElement> menuItem = driver.findElements(By.cssSelector(menuItemSelector));
    menuItem.get(1).click();
}
You can reuse the element names and follow a similar path using Python.
The key points here are:
Click on the menu itself
Click on the second menu item
It is better practice not to hard-code the item number but to read the item names and select the correct one, so the code keeps working even if the menu changes; a sketch of this follows the Python version below.
EDIT
This is how the same can be done in Python.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
URL = "https://www.amazon.com/Harry-Potter-Slytherin-Wall-Banner/product-reviews/B01GVT5KR6/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews"
menuSelector = ".a-dropdown-prompt"
menuItemSelector = ".a-dropdown-common .a-dropdown-item"
driver = webdriver.Chrome()
driver.get(URL)
elem = driver.find_element_by_css_selector(menuSelector)
elem.click()
time.sleep(1)
elemItems = []
elemItems = driver.find_elements_by_css_selector(menuItemSelector)
elemItems[1].click()
time.sleep(5)
driver.close()
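And as for the earlier note about not hard-coding the item number, a sketch that picks the menu item by its visible text instead, so the code survives a reordering of the menu:
# Choose the dropdown item by its text rather than a fixed index.
elemItems = driver.find_elements_by_css_selector(menuItemSelector)
for item in elemItems:
    if item.text.strip() == "Most recent":
        item.click()
        break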
Just to keep in mind: CSS selectors are a better alternative to XPath, as they are faster, more robust, and easier to read and change.
This is the simplified version of what I did to get the reviews sorted from the most recent ones. As "Eugene S" said above, the key point is to click on the button itself and then select/click the desired item from the list. However, my Python code uses XPath instead of CSS selectors.
# click on "Top rated" button
driver.find_element_by_xpath('//*[#id="a-autoid-4-announce"]').click()
# this one select the "Most recent"
driver.find_element_by_xpath('//*[#id="sort-order-dropdown_1"]').click()
I am new to Selenium/Firefox. My goal is to go to my URL, fill in basic input, select a few items, let the browser change the content, and download a PDF from there. Ideally, I would love to do this repeatedly later by looping over a number of new items. As a first step, I managed to get the browser to work and change the content once. But I am stuck getting the content out, as find_elements_by_tag_name() seems to give me something funny (a list of WebElement objects) rather than the usual HTML tags that BeautifulSoup's .find_all() would return. I appreciate very much any help here.
Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
url ='http://www.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main.aspx'
browser = webdriver.Firefox(executable_path=r'C:\Program Files\Mozilla Firefox\geckodriver.exe')
browser.get(url)
StockElem = browser.find_element_by_id('ctl00_txt_stock_code')
StockElem.send_keys('00772')
StockElem.click()
select = Select(browser.find_element_by_id('ctl00_sel_tier_1'))
select.select_by_value('3')
select = Select(browser.find_element_by_id('ctl00_sel_tier_2'))
select.select_by_value('153')
select = Select(browser.find_element_by_id('ctl00_sel_DateOfReleaseFrom_d'))
select.select_by_value('01')
select = Select(browser.find_element_by_id('ctl00_sel_DateOfReleaseFrom_m'))
select.select_by_value('01')
select = Select(browser.find_element_by_id('ctl00_sel_DateOfReleaseFrom_y'))
select.select_by_value('2000')
# select the search button
browser.execute_script("document.forms[0].submit()")
element = browser.find_elements_by_tag_name("a")
print(element)
After clicking on the Search button, you have 5 links to download PDF files.
You should find those links by the CSS selector .news.
Then go through the list of links by index and click on each one to download it:
elements[0].click() clicks the first link.
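A minimal sketch of that, continuing from the question's code (the .news selector is the one mentioned above; the fixed sleep is a crude stand-in for a proper explicit wait):
import time

time.sleep(5)  # wait for the results page to render
elements = browser.find_elements_by_css_selector(".news")
print(len(elements), "download links found")
elements[0].click()  # clicking the first link downloads its PDF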
I am using selenium to navigate to a webpage and store the page source in a variable.
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://google.com")
html1 = driver.page_source
html1 now contains the page source of http://google.com.
My question is: how can I return HTML attributes such as id="id" or name="name"?
EDIT:
For example:
The webpage I navigated to with Selenium has a menu bar with 4 tabs. Each tab has an id attribute: id="tab1", id="tab2", and so on. I would like to return each id value, so I want tab1, tab2, and so on.
Edit#2:
Another example:
The homepage of my website (http://chrisarroyo.me) has several clickable links with ids. I would like to be able to return/print those ids to my console.
So I would like to return the id of the Learn More button and the ids of the links in the footer (facebookLnk, githubLnk, etc.).
If you are looking for a list of WebElements that have an ID use:
elements = driver.find_elements_by_xpath("//*[@id]")
You can then iterate over that list and use get_attribute("id") to pull out each element's specific ID.
For name, it's pretty much the same code: just change id to name and you're set.
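For completeness, the name variant as a short sketch:
# Match every element that has a name attribute, then read the values out.
named_elements = driver.find_elements_by_xpath("//*[@name]")
names = [el.get_attribute("name") for el in named_elements]
print(names)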
Thank you @stewartm, your comment helped.
This ended up giving me the results I was looking for:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://chrisarroyo.me")
id_elements = driver.find_elements_by_xpath("//*[@id]")
for eachElement in id_elements:
    individual_ids = eachElement.get_attribute("id")
    print(individual_ids)
After running the above, the output listed each of the ids on the specified webpage.
Output:
navbarNavAltMarkup
learnBtn
githubLnk
facebookLnk
linkedinLnk