This question already has answers here:
Programmatic Python Browser with JavaScript
(8 answers)
Closed 4 years ago.
I am new to this subject, so my question could prove stupid.. sorry in advance.
My challenge is to do web-scraping, say for this page: link (google)
I try to web-scrape it using Python,
My problem is that once I use Python requests.get, I don't seem to get the full content of the page. I guess it is because that page has many resources, and Python does not get them all. (more than that, once I scroll my mouse up - more data is reviled on Chrome. I can see from the source code that no more data is downloaded to be shown..)
How can I get the full content of a web page? what am I missing?
thanks
requests.get will get you the page web but only what the page decides to give a robot. If you want the full page web as you see it as a human you need to trick it by changing your headers. If you need to scroll or click on buttons in order to see the whole page web, which is what I think you'll need to do, I suggest you take a look at selenium.
Related
This question already exists:
How to add/edit data in request-payload available in google chrome dev tools [duplicate]
Closed 3 years ago.
I've been looking for this answer for quite long but still with no results. I'm working with selenium and I need to override one request which is generated after the submit button has been clicked. It contains data in json format under "Request payload" in chrome dev tools. I found something like seleniumwires which provides some functionality like request.overrides but I'm not sure it is working as I want. Can anyone give me some hint where to start or which tools are approporiate to do that ?
This question already has answers here:
Using python Requests with javascript pages
(6 answers)
Closed 3 years ago.
I am attempting to download many dot-bracket notations of RNA sequences from a url link with Python.
This is one of the links I am using: https://rnacentral.org/rna/URS00003F07BD/9606. To navigate to what I want, you have to click on the '2D structure' button, and only then does the thing I am looking for (right below the occurence of this tag)
<h4>Dot-bracket notation</h4>
appear in the Inspect Element tab.
When I use the get function from the requests package, the text and content fields do not contain that tag. Does anyone know how I can get the bracket notation item?
Here is my current code:
import requests
url = 'http://rnacentral.org/rna/URS00003F07BD/9606'
response = requests.get(url)
print(response.text)
Requests library does not render JS. You need to use a web browser-based solution like selenium. I have listed a pseudo-code below.
Use selenium to load the page.
then click the button 2D structure using selenium.
Wait for some time by adding a time.sleep().
And read the page source using selenium.
You should get what you want.
This question already has an answer here:
Anchor link to Firefox about:config?
(1 answer)
Closed 4 years ago.
I tried,
setup
element just shows up on the Firefox browser, but cannot open it (clicking won't work) or open it in a new tab.
Is there a way to create a link element to open up about:config page in Firefox using just HTML?
According to this it's imposible Anchor link to Firefox about:config?
right?
But, how about in Selenium with geckodriver?
I think this question answers your question: Anchor link to Firefox about:config?. Give it a try. Basically, you are not allowed to reference a local resource due to security issues (like the post says).
Sorry, it's not possible due to security concerns.
I'm not sure how useful this is, but if you can get the user to create a bookmark, it will work from a bookmark.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to download a file in python
I'm playing with Python for doing some crawling stuff. I do know there is urllib.urlopen("http://XXXX") That can help me to get the html for target website. However, The link to the original image in that webpage will usually make the image in the backup page unavailable. I am wondering is there a way that can also save the image in the local space, then we can read the full content on the website without internet connection. It's like back up the whole webpage, but I'm not sure is there any way to do that in Python. Also, if it can get rid of the advertisement stuff, it will be more awesome though. Thanks.
If you're looking to backup a single webpage, you're well on your way.
Since you mention crawling, if you want to backup an entire website, you'll need to do some real crawling and you'll need scrapy for that.
There are several ways of downloading files off the interwebs, just see these questions:
Python File Download
How to- download a file in python
Automate file download from http using python
Hope this helps
This question already has answers here:
Scrape a dynamic website [duplicate]
(8 answers)
Closed 6 months ago.
I need to scrape news announcements from this website, Link.
The announcements seem to be generated dynamically. They dont appear in the source. I usually use mechanize but I assume it wouldnt work. What can I do for this? I'm ok with python or perl.
If the content is generated dynamically, you can use Windmill or Seleninum to drive the browser and get the data once it's been rendered.
You can find an example here.
The polite option would be to ask the owners of the site if they have an API which allows you access to their news stories.
The less polite option would be to trace the HTTP transactions that take place while the page is loading and work out which one is the AJAX call which pulls in the data.
Looks like it's this one. But it looks like it might contain session data, so I don't know how long it will continue to work for.
There's also WWW::Scripter "For scripting web sites that have scripts" . Never used it.
In python you can use urllib and urllib2 to connect to a website and collect data. For example:
from urllib2 import urlopen
myUrl = "http://www.marketvectorsindices.com/#!News/List"
inStream = urlopen(myUrl)
instream.read(1024) # etc, in a while loop
# all your fun page parsing code (perhaps: import from xml.dom.minidom import parse)