I'm using Python's selenium to automate my university course registration, but some of the elements are on a page in the registration process that is currently unavailable to me (I get a time ticket error). That means I can't copy the elements' XPaths and paste them into the find_element method; I have to wait for registration to open and for that part of the website to become available.
My question is: can I open this missing page from the page source, or in some other way, before I have direct access to it?
Yes. With selenium you can download the complete page source once the page is available and use that local copy to identify the elements.
But it's not a good idea to rely on that alone, because parts of the website might be populated by API calls made after the page loads. As a result you might not get the whole picture.
My suggestion is to grab the source while the website is live.
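Something like the following is what I mean (a minimal sketch; the URL and the XPath are placeholders, not your university's actual site):

from selenium import webdriver
from lxml import html

REGISTRATION_URL = "https://example.edu/registration"   # placeholder for the real page

# Grab the rendered page source once the page is actually reachable
driver = webdriver.Chrome()
driver.get(REGISTRATION_URL)
with open("registration_page.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)
driver.quit()

# Later, work out your XPaths against the saved local copy
tree = html.fromstring(open("registration_page.html", encoding="utf-8").read())
candidates = tree.xpath("//button")   # try candidate XPaths here before using find_element
print(len(candidates))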
I would like to scrape a website that does not have an API and is an "infinite scroller". I have been using selenium for this, but now I need to scrape many more pages, and all at once. The problem is that selenium is very resource-intensive, since I am running a full (headless) Chrome browser in each instance, and it is also not stable at all (probably because of the limited resources, but still). I know that there is a way to look for the AJAX requests that the site uses and access them with the requests library, but I have two issues:
I can't seem to find the desired request
The ones that I try to use with the requests library require the user to be logged in, and I have no idea how to do that (maybe pass cookies and whatnot; I am not a web developer).
Let me take Twitter as an example, since it is exactly the kind of site I am describing (except that it has an API). You have to log in, and then the feed is loaded infinitely. So the goal is to "scroll" and take the content of each tweet. How can this be done? If you can, please provide a working example.
Thank you.
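To be clear, the "pass cookies" idea I have in mind is roughly this, though I don't know if it is right (a rough sketch; the URLs and the endpoint are made up, not Twitter's):

import requests
from selenium import webdriver

# Log in once with a real browser session
driver = webdriver.Chrome()
driver.get("https://example.com/login")
# ... fill in credentials with send_keys()/click(), or log in manually here ...

# Copy the authenticated cookies into a requests session
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])
driver.quit()

# Use the session to call whatever AJAX endpoint the feed uses (hypothetical URL)
resp = session.get("https://example.com/api/feed?page=2")
print(resp.status_code)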
So, maybe I'm being paranoid.
I'm scraping my Facebook timeline for a hobby project using PhantomJS. Basically, I wrote a program that finds all of my ads by querying the page for the text Sponsored with XPath inside of PhantomJS's page.evaluate block. The text was being displayed as the innerHTML of HTML a elements.
Things were working great for a few days and it was finding tons of ads.
Then it stopped returning any results.
When I logged into Facebook manually to inspect the elements again, I found that the word Sponsored was now appearing on the page in an ::after pseudo-element with the CSS property content: "sponsored". This means that an XPath query for the text no longer yields any results. No joke, Facebook seems to have changed the way they render this word after it was scraped for a couple of days.
Paranoid. I told you.
So, I offer this question to the community of JavaScript, web-scraping, and PhantomJS developers out there. What the heck is going on? Can Facebook know what my PhantomJS program is doing inside of the page.evaluate block?
If so, how? Would my phantom commands appear in a key logger program embedded in the page, for instance?
What are some of your theories?
It is perfectly possible to detect PhantomJS even if the user agent is spoofed.
There are plenty of little ways in which it differs from other browsers, among them:
Wrong order of headers
Lack of media plugins and latest JS capabilities
PhantomJS-specific methods, like window.callPhantom
PhantomJS name in the stack trace
and many others.
Please refer to this excellent article and presentation linked there for details: https://blog.shapesecurity.com/2015/01/22/detecting-phantomjs-based-visitors/
Maybe puppeteer would be a better fit for your needs as it is based on a real cutting-edge Chromium browser.
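If you want to see a few of these signals for yourself, a rough sketch from Python-driven selenium looks like this (the in-page checks are just the ones listed above; any browser/driver will do):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")
# Probe a couple of the fingerprinting signals mentioned above from inside the page
signals = driver.execute_script("""
    return {
        callPhantom: typeof window.callPhantom !== 'undefined',  // PhantomJS-specific hook
        _phantom: typeof window._phantom !== 'undefined',
        plugins: navigator.plugins.length,                        // often 0 in headless setups
        webdriver: navigator.webdriver === true                   // exposed by many automated browsers
    };
""")
print(signals)
driver.quit()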
I am new to Python and have just started on web scraping with it. I have to scrape data from this realtor site.
I need to scrape all the details of the real-estate agents according to their real-estate agency.
To do this in the web browser, I have to follow these steps:
Go to this site
Click on the agency offices button, enter the 4000 pin in the search box, and then submit.
Then we get a list of the agencies.
Go to the "our team" tab, where we get the agents.
Then we have to go to each agent's page and record their information.
Can anyone tell me how to approach this? What's the best way to build this type of scraper?
Do I have to use selenium for the interaction with the pages?
I have worked with requests, BeautifulSoup, and simple form submission using mechanize.
For a search-driven site like this, I would recommend either Selenium or Requests with sessions. The advantage of Selenium is that it will probably work, but it will be slow. For Selenium, you can just use the Selenium IDE (a Firefox add-on) to record what you do, then get the HTML from the webpage and use BeautifulSoup to parse the data.
If you want to scrape the data quickly and without using many resources, I usually use Requests with sessions. To scrape a website like this, open up a modern web browser (Firefox, Chrome) and use its network tools (usually located in the developer tools, or via right-click > Inspect Element). Once you are recording the network traffic, you can interact with the webpage to see the connections made to the server. In an example search, the site may use a suggestion endpoint, e.g.
https://suggest.example.com.au/smart-suggest?query=4000&n=7&regions=false
The response will then probably be JSON with the suggested results. Once you select a suggestion, you can just submit a request with those search parameters, e.g.
https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000
The URLs for the agents will then be in that HTML page; you then just need to send a separate request to each of those pages and extract the information using BeautifulSoup.
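Putting that together, a rough sketch of the Requests-with-sessions flow looks like this (the URLs are the illustrative ones above, and the CSS selector for agent links is an assumption you would replace after inspecting the real page):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})   # look like a normal browser

# 1. Hit the suggestion endpoint seen in the network tab (illustrative URL)
suggestions = session.get(
    "https://suggest.example.com.au/smart-suggest",
    params={"query": "4000", "n": "7", "regions": "false"},
).json()
print(suggestions)

# 2. Request the agent listing page for the chosen suggestion
listing = session.get("https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000")

# 3. Pull the agent links out of the HTML and visit each one
soup = BeautifulSoup(listing.text, "html.parser")
agent_links = [a["href"] for a in soup.select("a.agent-profile")]   # hypothetical selector
for link in agent_links:
    agent_page = session.get(link)
    # ... extract each agent's details from agent_page.text with BeautifulSoup ...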
You might want to give Node and jQuery a try. I used to use Python all the time, but it gets messy and hard to maintain after a while.
Using Node, you can turn the page HTML into a DOM object and then scrape all the data very easily using jQuery. I have done this for IMDb here: "Using jQuery & NodeJS to scrape the web" by @asimmittal: https://medium.com/@asimmittal/using-jquery-nodejs-to-scrape-the-web-9bb5d439413b
You can modify this to scrape Yelp.
I am trying to automate the process of downloading webpages with technical documentation which I need to update every year or so.
Here is an example page: http://prod.adv-bio.com/ProductDetail.aspx?ProdNo=1197
From this page, the desired end result would be to have all the HTML links saved as PDFs.
I am using wget to download the .pdf files
I can't use wget to download the html files, because the .html links on the page can only be accessed by clicking through from the previous page.
I tried using Selenium to open the links in Firefox and print them to PDFs, but the process is slow, frequently misses links, and my work proxy server forces me to re-authenticate every time I need to access a page for a different product.
I could open a chrome browser using chromedriver but could not handle the print dialog, even after trying pywinauto per an answer to a similar question here.
I tried taking screenshots of the html pages using Selenium, but could not find out how to get the whole webpage without capturing the entire screen.
I have been through a ton of links related to this topic but have yet to find a satisfying solution to this problem.
Is there a cleaner way to do this?
I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.
Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.
A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT
I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.
The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.
Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.
The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
See the corresponding javascript code on the original page:
<script>
populator = new Populator({
    parentId: "profileForm:vanguardFundTabBox:tab0",
    execOnLoad: true,
    populatorUrl: "/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
    inline: false,
    type: "once"
});
</script>
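Since that populatorUrl just returns the tab's HTML fragment, you can request it directly and parse it. Here is a rough sketch (using requests for brevity, and assuming the maturity/duration figures sit in ordinary table rows):

import requests
from bs4 import BeautifulSoup

url = ("https://personal.vanguard.com/us/JSP/Funds/VGITab/"
       "VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542")
fragment = requests.get(url).text
soup = BeautifulSoup(fragment, "html.parser")

# Print the rows that mention the figures asked about in the question
for row in soup.find_all("tr"):
    text = row.get_text(" ", strip=True)
    if "Average maturity" in text or "Average duration" in text:
        print(text)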
The reason is that the page performs AJAX calls after it loads. You will need to find those URLs and scrape their content as well.
As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.
The class gives you full access to the DOM tree, so you can do whatever you want with it.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx