I was trying to parse tweets (say, from https://twitter.com/Tesla), but I ran into a problem: once I download the source code using html = browser.page_source, it does not match what I see when inspecting the element (Ctrl+Shift+I). It shows some of the tweets, but not nearly all of them; moreover, when I save the code to a file and open it in Chrome, I get something incomprehensible. I have worked with Selenium before and have never run into such a problem. Is there perhaps some other function to get the source?
By the way, I know that Twitter provides an API, but they declined my request without giving any reasons even though I do not plan to do anything against their terms.
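The mismatch happens because Twitter renders tweets with JavaScript after the page loads and lazy-loads more as you scroll, so page_source only reflects what has been rendered so far. Below is a minimal sketch of waiting and scrolling before grabbing the source; the assumption that tweets render as article elements may need verifying against the live markup:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://twitter.com/Tesla')

# Wait until at least one tweet has rendered (the <article> selector is an
# assumption -- verify it against the current markup).
WebDriverWait(browser, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, 'article'))
)

# Scroll a few times so more tweets are lazy-loaded into the DOM.
for _ in range(5):
    browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # crude pause for the next batch to render

html = browser.page_source  # now contains the rendered tweets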
This is one of the worst practices in Selenium.
For multiple reasons, logging into sites like Gmail and Facebook using WebDriver is not recommended. Aside from being against the usage terms for these sites (where you risk having the account shut down), it is slow and unreliable.
The ideal practice is to use the APIs that email providers offer, or in the case of Facebook the developer tools service which exposes an API for creating test accounts, friends and so forth. Although using an API might seem like a bit of extra hard work, you will be paid back in speed, reliability, and stability. The API is also unlikely to change, whereas webpages and HTML locators change often and require you to update your test framework.
Logging in to third-party sites using WebDriver at any point of your test increases the risk of your test failing because it makes your test longer. A general rule of thumb is that longer tests are more fragile and unreliable.
Related
I have a Python script that runs Selenium and performs a search for me on YouTube. After my .send_keys() and .submit() commands, I attempt to get the current URL of the search page with print(driver.current_url), but it only gives me the original URL from my driver.get('https://www.youtube.com') command.
How can I get the full current url path of the search page once I'm there? For example https://www.youtube.com/results?search_query=election instead of https://www.youtube.com.
Thank you.
Since you have not shared the code you tried, I am guessing the issue is with your page load: after clicking submit, you are not giving the page any time to load before you read the URL. Add some wait time. The simplest (though not a good) way is to use:
time.sleep(5)
print(driver.current_url)
The above waits for 5 seconds.
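A more robust alternative is an explicit wait on the URL itself; Selenium's expected_conditions module provides url_changes and url_contains for exactly this. A minimal sketch:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.youtube.com')
old_url = driver.current_url

# ... locate the search box, .send_keys() and .submit() here ...

# Block until the browser has actually navigated away, up to 10 seconds.
WebDriverWait(driver, 10).until(EC.url_changes(old_url))
print(driver.current_url)  # e.g. https://www.youtube.com/results?search_query=election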
Why are you automating social media sites at all? As the answer above explains, logging into sites like Gmail and Facebook with WebDriver is against their usage terms, slow, and unreliable; prefer the official APIs where they exist.
WebDriver implementations that are W3C conformant also annotate the navigator object with a WebDriver property so that Denial of Service attacks can be mitigated.
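You can observe this flag from Python, assuming driver is an active Selenium session:
# Returns True under a W3C-conformant driver; in a normal browser session
# navigator.webdriver is undefined.
print(driver.execute_script('return navigator.webdriver'))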
You can also set an implicit wait, though note that implicitly_wait() only applies to find_element lookups, not to properties like current_url, so on its own it will not delay the URL read:
driver.implicitly_wait(5)
print(driver.current_url)
To get the current URL after clicking through videos from a search, driver.current_url is the way to go.
The reason you are getting the previous URL may be that the page has not loaded yet; you can check for page load by comparing the page title.
For example:
expected_title = "demo class"
actual_title = driver.title
assert expected_title == actual_title
If the assertion passes, you can then read the current URL with
driver.current_url
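The same idea works with an explicit wait instead of a bare assert; title_contains is a built-in expected condition (the title fragment below is just a placeholder for your own):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the results page title, then read the URL.
WebDriverWait(driver, 10).until(EC.title_contains('election'))
print(driver.current_url)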
I am making a project where I want to click a button, but the button is hidden. I am working in Python. I want to access the three-dot button that appears next to a comment when I hover over it.
This is considered one of the worst practices in Selenium; see the note above about automating third-party sites with WebDriver.
Please check the SeleniumHQ website for details.
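That said, hover-revealed controls are normally handled with ActionChains: hover over the comment first, then click the button once it appears. A minimal sketch; both CSS selectors below are hypothetical placeholders for whatever the real markup uses:
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

comment = driver.find_element(By.CSS_SELECTOR, '.comment')  # hypothetical selector
ActionChains(driver).move_to_element(comment).perform()     # hover to reveal the menu

# Wait for the three-dot button to become clickable, then click it.
menu_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '.comment .menu-button'))  # hypothetical
)
menu_button.click()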
I was trying to automate some email testing through Selenium and Python, and I came across the Temp-Mail website: https://temp-mail.org/en/. I was trying to grab the email address from it with the code:
driver.find_element_by_xpath('//*[@id="mail"]').text
but this comes up empty. I was wondering what method I should be using for this, as the HTML is:
<input id="mail" type="text" onclick="select(this);" data-original-title="Your Temporary Email Address" data-placement="bottom" data-value="Loading" class="emailbox-input opentip" readonly="">
I managed to fix it with
driver.find_element_by_xpath('//*[@id="mail"]').get_attribute('value')
and a while loop to make sure it wasn't grabbing too early.
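For reference, that polling loop can be expressed as an explicit wait with a custom condition; a sketch of the approach:
from selenium.webdriver.support.ui import WebDriverWait

def email_ready(d):
    # Return the address once the field holds something other than the
    # 'Loading' placeholder; falsy returns make WebDriverWait keep polling.
    value = d.find_element_by_xpath('//*[@id="mail"]').get_attribute('value')
    return value if value and '@' in value else False

email = WebDriverWait(driver, 15).until(email_ready)
print(email)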
So I wanted to scrape a website's data. I used Selenium in my Python script to scrape it, but I noticed that the Network section of Chrome's DevTools records the XmlHttpRequests, which reveal the JSON/XML endpoints a site loads its data from. So I was wondering: can I use that data directly in my Python script, since Selenium is quite heavyweight and needs more bandwidth? Do Selenium or other web-scraper tools have to act as a medium between my script and the browser? If not, please give some pointers on fetching the data for my Python script using only what Chrome shows.
Definitely! Check out the requests module.
From there you can access the page source, and using data from it you can access the different aspects separately. Here are the things to consider though:
Pros:
Faster, with less to download; for things like AJAX requests it is vastly more efficient.
Does not require a graphical browser the way Selenium does
More precise; get exactly what you need
The ability to set Headers/Cookies/etc before making requests
Images may be downloaded separately, with no obligation to download any of them.
Allows as many sessions as you want to be opened in parallel, each with different options (proxies, no cookies, consistent cookies, custom headers, blocked redirects, etc.) without affecting the others.
Cons:
Much harder to get into than Selenium; requires at least minimal knowledge of HTTP's GET and POST methods, and a library like re or BeautifulSoup to extract data.
For pages with JavaScript-generated data, extraction is always possible, but depending on how the JavaScript is implemented (or obfuscated) it can be extremely difficult to get the data you want.
Conclusion:
I suggest you definitely learn requests and use it for most cases; however, if the JavaScript gets too complicated, switch to Selenium for an easier solution. Look for some tutorials online, and then check the official documentation for an overview of what you've learned.
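As a sketch of the idea: once you spot the XHR in Chrome's Network tab, copy its URL and replay it with requests. The endpoint, parameters, and headers below are hypothetical placeholders; copy the real ones from the recorded request:
import requests

# Hypothetical JSON endpoint spotted in the Network tab.
url = 'https://example.com/api/items'
headers = {
    # Some sites check these; copy them from the recorded request if needed.
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',
}
params = {'page': 1}

resp = requests.get(url, headers=headers, params=params, timeout=10)
resp.raise_for_status()
data = resp.json()  # the same payload the page's JavaScript consumes
print(data)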
I realize that versions of this question have been asked and I spent several hours the other day trying a number of strategies.
What I would like to do is use Python to scrape all of the URLs from a Google search, which I can then use in a separate script to do text analysis of a large corpus (mainly news sites). This seems relatively straightforward, but none of the attempts I've tried have worked properly.
This is as close as I got:
from google import search
for url in search('site:cbc.ca "kinder morgan" and "trans mountain" and protest*', stop=100):
    print(url)
This returned about 300 URLs before I got kicked. An actual search using these parameters provides about 1000 results and I'd like all of them.
First: is this possible? Second: does anyone have any suggestions to do this? I basically just want a txt file of all the URLs that I can use in another script.
It seems that this package uses screen scraping to retrieve search results from Google, so it doesn't play well with Google's Terms of Service, which could be the reason you've been blocked.
The relevant clause in Google's Terms of Service:
Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
I haven't been able to find a definite number, but it seems their limit on the number of search queries per day is rather strict too: 100 search queries/day, per their JSON Custom Search API documentation.
Nonetheless, there's no harm trying out other alternatives to see if they work better:
BeautifulSoup
Scrapy
ParseHub - this one is not code, but it is a useful piece of software with good documentation, including a tutorial on how to scrape a list of URLs.
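If you do go the official route, the JSON Custom Search API can be called with plain requests. A minimal sketch, assuming you have created an API key and a custom search engine ID (cx); note the API returns at most 10 results per request (paginate with start), serves roughly 100 results per query at most, and caps free usage at 100 queries/day:
import requests

API_KEY = 'YOUR_API_KEY'  # from the Google Cloud console
CX = 'YOUR_ENGINE_ID'     # from the Custom Search control panel

urls = []
for start in range(1, 100, 10):  # the API serves at most ~100 results
    resp = requests.get(
        'https://www.googleapis.com/customsearch/v1',
        params={
            'key': API_KEY,
            'cx': CX,
            'q': 'site:cbc.ca "kinder morgan" "trans mountain" protest',
            'start': start,
        },
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get('items', [])
    if not items:
        break
    urls.extend(item['link'] for item in items)

# Write the collected URLs to a text file, one per line.
with open('urls.txt', 'w') as f:
    f.write('\n'.join(urls))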