Python Download Website HTML containing JS [duplicate] - python

This question already has answers here:
Using python Requests with javascript pages
(6 answers)
Closed 3 years ago.
I am attempting to download many dot-bracket notations of RNA sequences from URL links with Python.
This is one of the links I am using: https://rnacentral.org/rna/URS00003F07BD/9606. To navigate to what I want, you have to click the '2D structure' button, and only then does the element I am looking for (right below the occurrence of this tag)
<h4>Dot-bracket notation</h4>
appear in the Inspect Element tab.
When I use the get function from the requests package, neither the text nor the content field contains that tag. Does anyone know how I can get the dot-bracket notation?
Here is my current code:
import requests

url = 'http://rnacentral.org/rna/URS00003F07BD/9606'
response = requests.get(url)
print(response.text)  # the rendered <h4>Dot-bracket notation</h4> section is missing here

The Requests library does not render JavaScript. You need a browser-based solution such as Selenium. The pseudocode below outlines the steps; a minimal sketch follows it.
Use Selenium to load the page.
Then click the '2D structure' button using Selenium.
Wait for the content to render, e.g. with time.sleep() or an explicit wait.
Then read the page source through Selenium.
You should get what you want.
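A minimal sketch of those steps with Selenium 4; locating the tab by its '2D structure' link text and the exact h4 text are assumptions about the page's markup:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://rnacentral.org/rna/URS00003F07BD/9606')
# Click the 2D structure tab (the link-text locator is an assumption).
WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, '2D structure'))
).click()
# Explicit wait until the dot-bracket heading has rendered, instead of time.sleep().
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.XPATH, "//h4[text()='Dot-bracket notation']"))
)
html = driver.page_source  # now contains the rendered notation
driver.quit()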

Related

BeautifulSoup not returning children of element [duplicate]

This question already has answers here:
Web scraping program cannot find element which I can see in the browser
(2 answers)
Closed 2 years ago.
I'm new to web scraping and have been using BeautifulSoup to scrape numbers off a gambling website. I'm trying to get the text of a certain element, but find() returned None.
Here is my code:
import requests
import bs4

r = requests.get('https://roobet.com/crash')
soup = bs4.BeautifulSoup(r.text, 'lxml')
crash = soup.find('div', class_='CrashHistory_2YtPR')
print(crash)  # prints None
When I copied the content of my soup into a notepad and tried Ctrl+F to find the element, I could not find it.
The element I'm looking for is inside the <div id="root"> element, and when I looked closer at the copied soup I saw that there was nothing inside <div id="root">.
I don't understand what is happening. How can I get the element I'm looking for?
Right-click on the page and view source. This is one sure way of knowing what the DOM looks like when the page first loads. If you do this for https://roobet.com/crash you will notice that the <body> is almost empty apart from some <script> elements.
This is because the body of the page is loaded dynamically with JavaScript, most likely by a framework such as React.
This is the reason BeautifulSoup cannot find the element.
Your website seems to be dynamically loaded, meaning it uses JavaScript to build the page. You can test this by disabling JavaScript in your browser and reloading. To scrape this site, try using Selenium with ChromeDriver; other browsers work too, just look for their equivalent driver.
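A minimal sketch of that approach, assuming the class name from the question still appears once the page has rendered:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import bs4

driver = webdriver.Chrome()
driver.get('https://roobet.com/crash')
# Wait for the React app to populate <div id="root">.
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.CrashHistory_2YtPR'))
)
soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
crash = soup.find('div', class_='CrashHistory_2YtPR')
print(crash.get_text() if crash else None)
driver.quit()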

getting full content of web page (using Python-requests) [duplicate]

This question already has answers here:
Programmatic Python Browser with JavaScript
(8 answers)
Closed 4 years ago.
I am new to this subject, so my question may prove stupid; sorry in advance.
My challenge is to do web scraping, say for this page: link (Google).
I am trying to web-scrape it using Python.
My problem is that once I use Python's requests.get, I don't seem to get the full content of the page. I guess that is because the page has many resources and Python does not fetch them all. (More than that, once I scroll, more data is revealed in Chrome, yet I can see from the source code that no additional data is downloaded to produce it.)
How can I get the full content of a web page? What am I missing?
Thanks
requests.get will get you the web page, but only what the page decides to give a robot. If you want the full web page as you see it as a human, you need to trick it by changing your headers. If you need to scroll or click buttons to see the whole page, which is what I think you'll need to do here, I suggest you take a look at Selenium.
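A quick sketch of the headers trick with requests; the User-Agent value below is just an example of a browser-like string:
import requests

# Some sites serve trimmed-down pages to the default 'python-requests' User-Agent.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get('https://www.google.com/search?q=something', headers=headers)
print(len(response.text))  # compare against a request without the header
Content that only appears as you scroll is fetched by JavaScript after the page loads, so headers alone will not surface it; that is the case where Selenium helps.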

web scraping Beautiful soup [duplicate]

This question already has answers here:
Using python Requests with javascript pages
(6 answers)
Closed 4 years ago.
What I am trying to do is get the ingredients section from
https://www.walmart.com/ip/Nature-s-Recipe-Chicken-Wild-Salmon-Recipe-in-Broth-Dog-Food-2-75-oz/34199310
so what I did was
import requests
from bs4 import BeautifulSoup

x = requests.get("https://www.walmart.com/ip/Nature-s-Recipe-Chicken-Wild-Salmon-Recipe-in-Broth-Dog-Food-2-75-oz/34199310")
soup = BeautifulSoup(x.content, "html.parser")
print(soup.find_all("p", {"class": "Ingredients"})[0])
But it throws IndexError: list index out of range, i.e. no element was found, even though when I inspect the website the <p class="Ingredients"> element does exist.
Bad news: it looks like those elements are generated via JS. If you "view source" on that page, the elements aren't there, and that raw HTML is what requests is getting.
I would use something like Selenium to automate a browser and get the fully rendered HTML, then use BeautifulSoup to parse out the ingredients.
I personally find it very annoying when websites use JS to generate large amounts of content rather than to make the page more interactive, etc. But what are ya gonna do...
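A minimal sketch of that Selenium-plus-BeautifulSoup combination, assuming the Ingredients class is still present once the page has rendered:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.walmart.com/ip/Nature-s-Recipe-Chicken-Wild-Salmon-Recipe-in-Broth-Dog-Food-2-75-oz/34199310")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

paragraphs = soup.find_all("p", {"class": "Ingredients"})
if paragraphs:
    print(paragraphs[0].get_text(strip=True))
else:
    # An explicit wait before reading page_source may be needed on slow loads.
    print("Ingredients element not found")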

Always click on the first link on Google Search Using Selenium In Python [duplicate]

This question already has answers here:
How to click on the first result on google using selenium python
(4 answers)
Closed 4 years ago.
I would like to always click on the first link on Google using a Selenium robot in Python.
So, for example, if I write this:
driver.get("http://www.google.com")
elem = driver.find_element_by_name("q")
elem.send_keys("Tennis")
I would like to click on the first link that appears on Google.
Thank you for your help!
The accepted solution didn't work for me, so for anyone else who comes across this question: driver.find_element_by_tag_name("cite").click() on Python 3 worked for me. However, if you just want the link to the top search result, it would be faster to use the requests and BeautifulSoup libraries, as shown below.
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

url = 'http://www.google.com/search?q=something'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
# The first <cite> element carries the displayed URL of the top result.
print(soup.find('cite').text)
The XPath that works as of Aug 2019 is
(//h3)[1]/../../a
i.e. find the first h3 tag, step up two ancestor levels, and take the a element there.
driver.find_element(By.XPATH, '(//h3)[1]/../../a').click()
Google will surely change something in the future and then another approach will be needed.
Old answer
The result links all used to have an h3 as the parent element; you can use that:
driver.find_element(By.XPATH, '(//h3)[1]/a').click()
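Putting the pieces together, a sketch of the full search-and-click flow using the Aug 2019 XPath above (Google's markup changes often, so treat the locator as fragile):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.google.com")
box = driver.find_element(By.NAME, "q")
box.send_keys("Tennis")
box.send_keys(Keys.RETURN)  # submit the search
# Wait for the results page, then click the first result.
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '(//h3)[1]/../../a'))
).click()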

Scraping dynamic content in a website [duplicate]

This question already has answers here:
Scrape a dynamic website [duplicate]
(8 answers)
Closed 6 months ago.
I need to scrape news announcements from this website, Link.
The announcements seem to be generated dynamically; they don't appear in the source. I usually use mechanize, but I assume it wouldn't work here. What can I do about this? I'm fine with Python or Perl.
If the content is generated dynamically, you can use Windmill or Selenium to drive the browser and get the data once it has been rendered.
You can find an example here.
The polite option would be to ask the owners of the site if they have an API which allows you access to their news stories.
The less polite option would be to trace the HTTP transactions that take place while the page is loading and work out which one is the AJAX call that pulls in the data.
Looks like it's this one, but it seems to contain session data, so I don't know how long it will keep working.
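Once you have identified the AJAX call in the browser's network tab, you can usually replay it directly. A hypothetical sketch with requests; the URL below is a placeholder, not the site's real endpoint:
import requests

# Placeholder endpoint: substitute the AJAX URL found in the network tab.
ajax_url = "http://www.marketvectorsindices.com/example/news-endpoint"
response = requests.get(ajax_url)
print(response.json())  # assumes the endpoint returns JSON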
There's also WWW::Scripter, "for scripting web sites that have scripts". I've never used it.
In Python you can use urllib and urllib2 to connect to a website and collect data. For example:
from urllib2 import urlopen  # Python 2; on Python 3 use urllib.request.urlopen

myUrl = "http://www.marketvectorsindices.com/#!News/List"
inStream = urlopen(myUrl)
data = inStream.read(1024)  # etc., in a while loop
# all your fun page-parsing code (perhaps: from xml.dom.minidom import parse)
