How to scrape dynamic data generated by JS on a map - Python

I'm a new Python user and I want to scrape data from this website: https://www.telerad.be/Html5Viewer/index.html?viewer=telerad_fr
My problem is that the data is generated dynamically. I have read about a few possible approaches, but none is satisfactory. With Selenium I need a name or XPath to click on a button, but here there is none.
import requests
from lxml import html
page = requests.get('https://www.telerad.be/Html5Viewer/index.html?viewer=telerad_fr')
tree = html.fromstring(page.content)
cities = tree.xpath('//*[@id="map-container"]/div[6]/div[2]/div/div[2]/div/div/div[1]/div/p[1]/text()[2]')
print('Cities: ', cities)

There actually IS an XPath you can use to click on the buttons:
//*[@id='0_layer']/*[@fill]
Here, try this (Selenium):
dotList = driver.find_elements_by_xpath("//*[@id='0_layer']/*[@fill]")
for dot in dotList:
    dot.click()
    cities = driver.find_element_by_xpath("//div[@data-region-name='NavigationMapRegion']//p[1]")
    print("Cities: ", cities.text)
    closeBtn = driver.find_element_by_xpath("//*[@class='panel-header-button right close-16']")
    # The modal can intercept clicks on some dots, so close it once the info has been extracted.
    closeBtn.click()
This code clicks (or at least tries to, provided no StaleElementReferenceException occurs) all the orange dots on the map and prints the "Cities" content (based on your XPath).
If anyone finds an error in the code, please edit this answer; I wrote this in Notepad++.
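The stale-element risk mentioned above can be reduced by looking the dots up again on every pass instead of holding on to the first list. The sketch below is only one way to do that and has not been run against the real site. It assumes the three XPaths from the answer still match, that `driver` is a Selenium 4 WebDriver (where `find_element(by, value)` replaces the older `find_element_by_xpath`), and that a fixed pause is enough for the panel to appear. The function name `collect_dot_texts` and the `pause` parameter are made up for illustration. The string "xpath" is the value of `By.XPATH`, used here so the sketch needs no Selenium import.

```python
import time


def collect_dot_texts(driver, pause=1.0):
    """Click each map dot in turn and return the text shown in the info panel."""
    dot_xpath = "//*[@id='0_layer']/*[@fill]"
    info_xpath = "//div[@data-region-name='NavigationMapRegion']//p[1]"
    close_xpath = "//*[@class='panel-header-button right close-16']"

    texts = []
    total = len(driver.find_elements("xpath", dot_xpath))
    for index in range(total):
        # Re-locate the dots on every pass: references found earlier can go
        # stale once the page redraws the layer.
        dots = driver.find_elements("xpath", dot_xpath)
        if index >= len(dots):
            break  # the page now shows fewer dots than at the start
        dots[index].click()
        time.sleep(pause)  # crude wait; an explicit wait would be more robust
        texts.append(driver.find_element("xpath", info_xpath).text)
        # Close the panel so it cannot intercept the next click.
        driver.find_element("xpath", close_xpath).click()
    return texts
```

With a real browser session this would be called as `collect_dot_texts(driver)` after `driver.get(...)` has loaded the viewer.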

Related

Selenium cant find class to iterate and scrape text

As you can see in the first picture, my objective is to click and open every "Ver detalles" and get all the text within it (which is shown in the third picture).
Here is the HTML for this first screen:
And here is the screen that opens once you click "ver detalles"
And its HTML
So far I have put together some lines of code, but I know they are useless because I am looking up by XPath and not by class (and this will only return data for one element). However, whenever I look up by class and try to iterate, it doesn't find the class.
Please let me know if I made myself clear. Thanks in advance.
EDIT:
Thanks @cruisepandey, it helped me open "Ver detalles". Now I'm stuck trying to get the text out of it and click the X to close it and move on to the next "Ver detalles".
This is the code I have so far. I have tried looking up by class, tag, etc., but can't seem to find a way :(.
def order_data():
    list_of_ver_detalles = driver.find_elements(By.XPATH, "//span[contains(text(),'Ver detalle')]/..")
    sleep(3)
    for ver in list_of_ver_detalles:
        ver.click()
        print(driver.find_element_by_class_name("jss672").text)
        sleep(2)
        driver.find_element(By.XPATH, "/html[1]/body[1]/div[6]/div[3]/div[1]/div[1]/div[1]/img[1]").click
Here is the text I am trying to print
And here is the X I am trying to click
I want to clear up one of your doubts. You say your code is useless because you are looking up by XPath and not by class; it is not. To find more than one element with any locator (assuming the locator can match multiple elements in the DOM), all you have to do is use find_elements instead of find_element.
Store all of the "Ver detalle" links like this:
list_of_ver_links = driver.find_elements(By.XPATH, "//span[contains(text(),'Ver detalle')]/..")
for ver in list_of_ver_links:
    ver.click()
    # Now write the code to fetch order details here
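One detail in the question's loop may explain why the X is never clicked: its last line ends in `.click` with no parentheses. In Python that only looks up the method and discards it, so nothing happens. The browser-free sketch below shows the difference with a stand-in object. `FakeButton` is made up for illustration and is not part of Selenium.

```python
class FakeButton:
    """Stand-in for a WebElement so the example runs without a browser."""

    def __init__(self):
        self.clicked = False

    def click(self):
        self.clicked = True


button = FakeButton()

button.click  # only looks up the method; nothing is executed
clicked_without_call = button.clicked

button.click()  # the parentheses perform the call
clicked_with_call = button.clicked

print(clicked_without_call, clicked_with_call)
```

The same applies to a real WebElement: the close button in the question would need `.click()` for the click to be sent.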

Selenium Webdriver failing to click. Unsure why

I have the following code:
if united_states_hidden is not None:
    print("Country removed successfully")
    time.sleep(10)
    print("type(united_states_hidden) = ")
    print(type(united_states_hidden))
    print("united_states_hidden.text = " + united_states_hidden.text)
    print("united_states_hidden.id = " + united_states_hidden.id)
    print(united_states_hidden.is_displayed())
    print(united_states_hidden.is_enabled())
    united_states_hidden.click()
The outputs to the console are as follows:
Country removed successfully
type(united_states_hidden) =
<class 'selenium.webdriver.remote.webelement.WebElement'>
united_states_hidden.text = United States
united_states_hidden.id = ccea7858-6a0b-4aa8-afd5-72f75636fa44
True
True
As far as I am aware this should work as it is a clickable web element, however, no click is delivered to the element. Any help would be appreciated as I can't seem to find anything anywhere else. The element I am attempting to click is within a selector box.
It seems like a valid WebElement, given that you can print all of its information as you did in your example.
It's possible that the element you located is not the one that is meant to be clicked, so the click may be succeeding without actually triggering anything.
You could try using a Javascript click and see if that helps:
driver.execute_script("arguments[0].click();", united_states_hidden)
If this does not work for you, we may need to see the HTML on the page and the locator strategy you are using to find united_states_hidden so that we can proceed.

Getting the XPath from an HTML document

https://next.newsimpact.com/NewsWidget/Live
I am trying to write a Python script that will grab a value from an HTML table at the link above. That link is the site I am trying to grab from, and this is the code I have written. I think my XPath may be incorrect, because the script has been working fine on other elements, but the path I'm using is not returning or printing anything.
from lxml import html
import requests
page = requests.get('https://next.newsimpact.com/NewsWidget/Live')
tree = html.fromstring(page.content)
#This will create a list of buyers:
value = tree.xpath('//*[@id="table9521"]/tr[1]/td[4]/text()')
print('Value: ', value)
What is strange is that when I open the page's source view, I can't find the table I am trying to pull from.
Thank you for your help!
The required data is absent from the initial page source; it comes from an XHR request. You can get it as below:
import requests
response = requests.get('https://next.newsimpact.com/NewsWidget/GetNextEvents?offset=-120').json()
first_previous = response['Items'][0]['Previous'] # Current output - "2.632"
second_previous = response['Items'][1]['Previous'] # Currently - "0.2"
first_forecast = response['Items'][0]['Forecast'] # ""
second_forecast = response['Items'][1]['Forecast'] # "0.3"
You can parse the response as a plain Python dict and get all the required data.
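To pull one field from every entry rather than indexing items one at a time, a small helper is enough. The sketch below runs offline on a made-up sample shaped like the response described above; the values are copied from the comments in the answer. The real endpoint may return more keys or a different layout, and the helper name `column` is hypothetical.

```python
import json

# Made-up sample shaped like the response described above; not real data
# fetched from the endpoint.
sample_body = """
{"Items": [
    {"Previous": "2.632", "Forecast": ""},
    {"Previous": "0.2", "Forecast": "0.3"}
]}
"""


def column(payload, field):
    """Return one field from every entry in payload['Items'], in order."""
    return [item.get(field, "") for item in payload.get("Items", [])]


payload = json.loads(sample_body)
previous_values = column(payload, "Previous")
forecast_values = column(payload, "Forecast")
print(previous_values)
print(forecast_values)
```

With the live endpoint, `payload` would come from `requests.get(...).json()` instead of `json.loads(sample_body)`.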
Your problem is simple: requests doesn't handle JavaScript at all. The values are generated by JS!
If you really need to run this XPath, you need to use a module capable of understanding JS, like spynner.
You can test whether you need JS by first using curl or by disabling JS in your browser. In Firefox: enter about:config in the navigation bar, search for javascript.enabled, then double-click it to toggle between true and false.
In Chrome, open Chrome DevTools; the option is in there somewhere.
See https://github.com/makinacorpus/spynner
Another (possible) problem: use tree = html.fromstring(page.text), not tree = html.fromstring(page.content).
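The "is it in the static HTML?" check can also be done from Python. The sketch below uses only the standard library to collect every `id` attribute in a page's source and report whether the one from the XPath is present. The two HTML strings are made-up stand-ins for what the server sends and what the browser shows after JS has run, and the names `IdCollector` and `has_element_id` are hypothetical.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def has_element_id(html_text, element_id):
    """Return True if any tag in html_text carries the given id."""
    collector = IdCollector()
    collector.feed(html_text)
    return element_id in collector.ids


# Made-up stand-ins: the raw source versus the DOM after scripts have run.
static_html = '<html><body><div id="widget"></div><script src="app.js"></script></body></html>'
rendered_html = '<html><body><div id="widget"><table id="table9521"></table></div></body></html>'

print(has_element_id(static_html, "table9521"))
print(has_element_id(rendered_html, "table9521"))
```

Passing `requests.get(url).text` as `html_text` gives the answer for a real page. If the id is missing there but visible in the browser's inspector, the element is most likely created by JS.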

Selenium Visible, Non Visible Elements (Drop Down)

I am trying to select all elements of a dropdown.
The site I am testing on is: http://jenner.com/people
The dropdown(checkbox list) I am trying to access is the "locations" list.
I am using Python. I am getting the following error: Message: u'Element is not currently visible and so may not be interacted with'
The code I am using is:
from selenium import webdriver
url = "http://jenner.com/people"
driver = webdriver.Firefox()
driver.get(url)
page = driver.page_source
element = driver.find_element_by_xpath("//div[@class='filter offices']")
elements = element.find_elements_by_tag_name("input")
counter = 0
while counter <= len(elements) - 1:
    driver.get(url)
    element = driver.find_element_by_xpath("//div[@class='filter offices']")
    elements1 = element.find_elements_by_tag_name("input")
    elements1[counter].click()
    counter = counter + 1
I have tried a few variations, including clicking the initial element before clicking on the dropdown options, but that didn't work. Any ideas on how to make elements visible in Selenium? I have spent the last few hours searching for an answer online. I have seen a few posts about moving the mouse in Selenium, but haven't found a solution that works for me yet.
Thanks a lot.
The input checkboxes are not visible in the initial state; they become visible after clicking the "filter offices" option. The class name also changes from "filter offices" to "filter offices open", as you can see in Firebug. The code below works for me, but it is in Java. You should be able to translate it to Python, as it contains only really basic code.
driver.get("http://jenner.com/people");
driver.findElement(By.xpath("//div[@class='filter offices']/div")).click();
Thread.sleep(2000L);
WebElement element = driver.findElement(By.xpath("//div[@class='filter offices open']"));
Thread.sleep(2000L);
List<WebElement> elements = element.findElements(By.tagName("input"));
for (int i = 0; i <= elements.size() - 1; i++)
{
    elements.get(i).click();
    Thread.sleep(2000L);
    elements = element.findElements(By.tagName("input"));
}
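A rough Python translation of the Java above might look like the sketch below. It has not been run against the site. It assumes a Selenium 4 `driver`, where `find_element(by, value)` is the supported form, and it keeps the fixed two-second pauses from the Java version. The function name and the `pause` parameter are made up for illustration. The strings "xpath" and "tag name" are the values of `By.XPATH` and `By.TAG_NAME`, used so the sketch needs no Selenium import.

```python
import time


def click_office_checkboxes(driver, pause=2.0):
    """Open the offices filter, click each checkbox, return how many were clicked."""
    driver.get("http://jenner.com/people")
    # Clicking the inner div opens the filter and makes the inputs visible.
    driver.find_element("xpath", "//div[@class='filter offices']/div").click()
    time.sleep(pause)

    container = driver.find_element("xpath", "//div[@class='filter offices open']")
    count = len(container.find_elements("tag name", "input"))
    clicked = 0
    for index in range(count):
        # Look the checkboxes up again each time, as the Java loop does.
        boxes = container.find_elements("tag name", "input")
        if index >= len(boxes):
            break  # fewer checkboxes than at the start
        boxes[index].click()
        clicked += 1
        time.sleep(pause)
    return clicked
```

An explicit wait on the "filter offices open" element would probably be more reliable than the fixed pauses, but the sleeps mirror what the answer reports as working.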
I know this is an old question, but I came across it when looking for other information. I don't know if you were doing QA on the site to see if the proper cities were showing in the drop down, or if you were actually interacting with the site to get the list of people who should be at each location. (Side note: selecting a location then un-selecting it returns 0 results if you don't reset the filter - possibly not desired behavior.)
If you were trying to get a list of users at each location on this site, I would think it easier to not use Selenium. Here is a pretty simple solution to pull the people from the first city "Chicago." Of course, you could make a list of the cities that you are supposed to look for and sub them into the "data" variable by looping through the list.
import requests
from bs4 import BeautifulSoup
url = 'http://jenner.com/people/search'
data = 'utf8=%E2%9C%93&authenticity_token=%2BayQ8%2FyDPAtNNlHRn15Fi9w9OgXS12eNe8RZ8saTLmU%3D&search_scope=full_name' \
'&search%5Bfull_name%5D=&search%5Boffices%5D%5B%5D=Chicago'
r = requests.post(url, data=data)
soup = BeautifulSoup(r.content, 'html.parser')
people_results = soup.find_all('div', attrs={'class': 'name'})
for p in people_results:
    print(p.text)
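The loop over cities mentioned above could build the form body with `urllib.parse.urlencode` instead of a hand-written string. The sketch below only constructs the payloads and sends nothing. The field names are taken from the `data` string in the answer. The second city is a hypothetical entry, and the token defaults to an empty string here on the assumption that a fresh `authenticity_token` would have to be read from the search page first.

```python
from urllib.parse import urlencode


def build_search_payload(city, authenticity_token=""):
    """Return the URL-encoded form body for one office location."""
    fields = [
        ("utf8", "\u2713"),  # the check mark that appears as %E2%9C%93
        ("authenticity_token", authenticity_token),
        ("search_scope", "full_name"),
        ("search[full_name]", ""),
        ("search[offices][]", city),
    ]
    return urlencode(fields)


cities = ["Chicago", "New York"]  # "New York" is a hypothetical entry
payloads = {city: build_search_payload(city) for city in cities}

for city, body in payloads.items():
    print(city, "->", body)
```

Each body could then be passed as `data=` to `requests.post(url, ...)` in place of the fixed string.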

Screen scraping in LXML with python-- extract specific data

I've been trying to write a program for the last several hours that does what I thought would be an incredibly simple task:
Program asks for user input (let's say they type 'happiness')
Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
Program returns first quote from the website.
I've tried using Xpath with lxml, but have no experience and every single construction comes back with a blank array.
The actual meat of the quote appears to be contained in the class "sqq."
If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a textNode attribute "wholeText" or "textContent", but I don't know how to use that knowledge programmatically.
Any ideas?
import lxml.html
import urllib
site = 'http://thinkexist.com/search/searchquotation.asp'
userInput = raw_input('Search for: ').strip()
url = site + '?' + urllib.urlencode({'search':userInput})
root = lxml.html.parse(url).getroot()
quotes = root.xpath('//a[@class="sqq"]')
print quotes[0].text_content()
... and if you enter 'Shakespeare', it returns
In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
If it's not necessary for you to implement this via XPath, you may use the BeautifulSoup library like this (let the myXml variable contain the page HTML source):
soup = BeautifulSoup(myXml)
for a in soup.findAll('a', {'class': 'sqq'}):
    # this is your quote
    print(a.contents)
Anyway, read the BS documentation, it may be very useful for some scraping needs that don't require the power of XPath.
You could open the html source to find out the exact class you are looking for. For example, to grab the first StackOverflow username encountered on the page you could do:
#!/usr/bin/env python
from lxml import html
url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print(tree.findtext(path))
# -> Parseltongue
# OR to print text including the text in children
a = tree.find(path)
print(a.text_content())
# -> Parseltongue
