How to make Python requests.get wait a few seconds? - python

I wanted to get some experience with HTML crawling, so I wanted to see if I could grab some values off the following site: http://www.iex.nl/Aandeel-Koers/11890/Royal-Imtech/koers.aspx
This site shows the price of Imtech shares.
If you take a look at the site, you see there is one number shown in bold: this is the price of the share.
As you may have seen, this price changes, and that's okay; I only want the value at the moment I run my script.
But if you reload the page, you may notice how it first shows "laatste koers" and, after a delay of about a second, shows "realtime".
As you may have figured out by now, I'm interested in the "realtime" value.
Here is my question: how do I get this value? I've tried time.sleep(2) in different places, and I've tried a timeout on the request. Neither worked.
How can I fix this?
from lxml import html
import requests

pagina = 'http://www.iex.nl/Aandeel-Koers/11890/Royal-Imtech/koers.aspx'
page = requests.get(pagina)
tree = html.fromstring(page.text)
koers = tree.xpath('//span[@class="RealtimeLabel"]/text()')
prices = tree.xpath('//span[@id="ctl00_ctl00_Content_LeftContent_PriceDetails_lblLastPrice"]/text()')
print(koers[0], pagina.split("/")[5], prices[0])
I get output like this
Laatste koers Royal-Imtech 0,093
While I want output like this
Realtime Royal-Imtech 0,093

I would suggest using a wait that polls until the element changes. The block of code below should help you.
import time

def wait_while(condition, timeout, delta=1):
    """
    condition: function that returns True once the label text contains "Realtime"
    timeout: maximum waiting time in seconds
    delta: time in seconds between two checks
    """
    max_time = time.time() + timeout
    while max_time > time.time():
        if condition():
            return True
        time.sleep(delta)
    return False
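Note that the static HTML downloaded by requests never changes, no matter how long you sleep; the label is switched by JavaScript running in the browser, so a helper like this only makes sense against a live browser. A minimal usage sketch, assuming Selenium with a Chrome driver (the class and id are taken from your own XPaths):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://www.iex.nl/Aandeel-Koers/11890/Royal-Imtech/koers.aspx')

def is_realtime():
    # True once the label has switched from "Laatste koers" to "Realtime"
    label = driver.find_element(By.CLASS_NAME, 'RealtimeLabel')
    return 'Realtime' in label.text

if wait_while(is_realtime, timeout=10):
    price = driver.find_element(By.ID, 'ctl00_ctl00_Content_LeftContent_PriceDetails_lblLastPrice')
    print('Realtime', price.text)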

Related

Getting span text(value) every time it changes in selenium python

I am trying to print the value of a span every time it changes. Printing the value of the span once is quite easy:
popup = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="spot"]')))
print(popup.text)
This prints the value at that moment; the problem is that the value changes every 2 seconds. I tried using:
# wait for the first popup to appear
popup = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="spot"]')))
# print the text
print(popup.text)
# wait for the first popup to disappear
wait.until(EC.staleness_of(popup))
# wait for the second popup to appear
popup = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="spot"]')))
# print the text
print(popup.text)
# wait for the second popup to disappear
wait.until(EC.staleness_of(popup))
No matter how long my wait value is, 10 or 20 or even 30 seconds, the process always times out. I do not know much about coding, but I think this method does not work because the span as a whole does not change, only the span's value (text). One method I tried was looping the print(popup.text) command, and it partially worked: it printed the same value 489 times until it changed, then printed the next one 489 times, and so on. I have since tried this code:
popup = wait.until(EC.text_to_be_present_in_element_value((By.XPATH, '//*[@id="spot"]')))
print(popup.text)
but it returns:
TypeError: __init__() missing 1 required positional argument: 'text_'
Please help with what I need to add or what method I need to use to get the changing value.
(Screenshot of the HTML code inspection omitted.)
Please be aware: I'm not trying to print the text of the span once, I already know how to do that; I want to print it every time it changes.
Assuming that the element does disappear and reappear again:
You can just go back and forth between waiting for the element to be located and waiting for it to go stale, as in your second snippet.
Assuming that the element's content changes but the element doesn't disappear:
I don't know of any explicit wait for a change in an element's content, so as far as I am concerned you would need to detect the change yourself. You might want to add a fixed wait of slightly under 2 seconds to limit the number of unnecessary comparisons you make.
# Init a list to collect the values later on
values = []
# Wait for the element to be loaded in the first place
popup = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="spot"]')))
values.append(popup.text)
while True:
    # possibly wait a bit here
    new_value = driver.find_element(By.XPATH, '//*[@id="spot"]').text
    # only record the value if it differs from the last one you saw
    if values[-1] != new_value:
        values.append(new_value)
    # add an exit condition unless you actually want to run forever
Please be aware: this will only work if the value actually changes every single time, or if you don't need consecutive duplicates. If you need every value, you can leave out the comparison and simply append a value every ca. 2 seconds.
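As an aside, the TypeError from your last attempt simply means the expected condition was missing its second argument: those conditions take both a locator and the text to wait for. A sketch of the correct call shape, with a made-up target text; note it only helps if you already know the text you are waiting for, which is not your case (also, the *_value variant checks the element's "value" attribute, which a span doesn't have):

wait.until(EC.text_to_be_present_in_element((By.XPATH, '//*[@id="spot"]'), 'some expected text'))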
For your example:
The page on binary.com you provided uses a websocket to refresh the content. This is a protocol that allows the server to push data to the client as well as the other way around.
So it's a different approach from the HTTP request/response model you are used to (you send a request, the server replies: say you ask for a webpage, the server just sends it).
A websocket opens a connection and keeps it alive, so there is hardly a way to anticipate the change with an explicit wait. But: in your browser (assuming Chrome here) you can open the developer tools, go to the "Network" tab and filter for WS (websocket). You'll see a connection with v3?app_id=1 (you might need to refresh the page to get output in the Network tab).
Click on that connection and you'll see the messages your client sent and the ones it received. Naturally you only need the ones received, so filter for those.
(Screenshots of the correct settings omitted.)
Every message is in JSON format, and you can click on one to see its content. Under "tick" you'll see the ask and bid data.
In case that suffices, you can just leave the page open for as long as you need, then copy the output, save it to a file and read it with Python for analysis.
It seems you can also automate this with Selenium, as demonstrated here:
http://www.amitrawat.tech/post/capturing-websocket-messages-using-selenium/
Basically they do the same thing: they set a capability to record the log, then filter through it to get the data they need. Note that they use Java to do so, but it won't be hard to translate to Python.
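A rough sketch of what that could look like in Python, assuming Chrome with Selenium 4; the capability name and event structure come from Chrome's DevTools protocol, so treat the details as a starting point rather than a verified recipe:

import json
from selenium import webdriver

# Ask Chrome to record the performance log, which includes websocket traffic
options = webdriver.ChromeOptions()
options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
driver = webdriver.Chrome(options=options)

driver.get('https://www.binary.com/')  # the page from the question
# ... let the page stream ticks for a while ...

for entry in driver.get_log('performance'):
    message = json.loads(entry['message'])['message']
    if message.get('method') == 'Network.webSocketFrameReceived':
        payload = json.loads(message['params']['response']['payloadData'])
        if 'tick' in payload:
            print(payload['tick'])  # the ask and bid data mentioned above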

Save duplicates with Selenium in a .txt file

So, my goal was to write a script that scrapes users who used a specific hashtag on Instagram and writes their accounts into a .txt file, and it mostly works!
My problem is that even though some accounts posted several pictures, my script shows each name only once. Any idea how I might count them, or get my script to keep the duplicates?
I looked everywhere but can't find a solution.
This is my writing code:
def generate_initial_information_txt(initial_information):
    with open("initial_information", "w+") as initial_information_txt:
        for user in initial_information:
            initial_information_txt.write(user + "\n")
This is the part that finds the names:
for user in range(30):
    el = self.driver.find_element_by_xpath('/html/body/div[4]/div[2]/div/article/header/div[2]/div[1]/div[1]')
    el = el.find_element_by_tag_name('a')
    time.sleep(2)
    profile = el.get_attribute('href')
    open_recent_posts_set.add(profile)
    time.sleep(2)
    next_button = self.driver.find_element_by_xpath('/html/body/div[4]/div[1]/div/div/a[2]')
    next_button.click()
    time.sleep(2)
The URL would be
https://instagram.com/explore/tags/hansaviertel_ms
So I'm scraping the "Recent" posts, and e.g. the "Hansaforum" posted 5 of the first 6. With a range of 6 the script just writes a .txt file with two accounts, not the "Hansaforum" 5 times, and I'd like to get the number of occurrences in some way.
Thanks :)
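A note on why this happens: open_recent_posts_set is a set, and a set silently discards duplicates, so each profile can only ever appear once. A minimal sketch of counting occurrences instead, using a plain list plus collections.Counter (the names are illustrative, and the append line replaces the set.add inside your loop):

from collections import Counter

open_recent_posts = []  # a list keeps every occurrence

# inside your loop, instead of open_recent_posts_set.add(profile):
open_recent_posts.append(profile)

# after the loop, write each account together with how often it appeared
counts = Counter(open_recent_posts)
with open("initial_information", "w+") as f:
    for user, n in counts.items():
        f.write("{} {}\n".format(user, n))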

Getting data from html table with selenium (python): Submitting changes breaks loop

I want to scrape data from an HTML table for different combinations of drop-down values by looping over those combinations. After a combination is chosen, the changes need to be submitted. This, however, causes an error, since it refreshes the page.
This it what I've done so far:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

browser = webdriver.Firefox()  # assumed; the original post doesn't show how `browser` is created
browser.get('https://daten.ktbl.de/feldarbeit/entry.html')

# Selecting the constant values of some of the drop-downs:
fertilizer = Select(browser.find_element_by_name("hgId"))
fertilizer.select_by_value("2")
fertilizer = Select(browser.find_element_by_name("gId"))
fertilizer.select_by_value("193")
fertilizer = Select(browser.find_element_by_name("avId"))
fertilizer.select_by_value("383")
fertilizer = Select(browser.find_element_by_name("hofID"))
fertilizer.select_by_value("2")
# Looping over different combinations of plot size and amount of fertilizer:
size = Select(browser.find_element_by_name("flaecheID"))
for size_values in size.options:
    size.select_by_value(size_values.get_attribute("value"))
    time.sleep(1)
    amount = Select(browser.find_element_by_name("mengeID"))
    for amount_values in amount.options:
        amount.select_by_value(amount_values.get_attribute("value"))
        time.sleep(1)
        # Submitting the page after the two variable values are chosen:
        button = browser.find_element_by_xpath("//*[@type='submit']")
        button.click()
        time.sleep(5)
This leads to the error: selenium.common.exceptions.StaleElementReferenceException: Message: The element reference of <option> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed.
Obviously the issue is that I did indeed refresh the document.
After submitting the changes and letting the page load the results, I want to retrieve them with:
html_source = browser.page_source
df_list = pd.read_html(html_source, match = "Dieselbedarf")
(Shout-out to @bink1time, who answered this part of my question here.)
How can I update the page without breaking the loop?
I would very much appreciate some help here!
A StaleElementReferenceException often occurs after a page refresh because the element's internal reference (a UUID) in the DOM changes.
To avoid it, always search for an element right before interacting with it. In your particular case, you searched for size and amount, found them, and stored them in variables. But upon refresh their UUIDs changed, so the references you stored are no longer attached to the DOM. When you then try to interact with them, Selenium cannot find them in the DOM and throws this exception.
I modified your code to always re-search size and amount elements before the interaction:
# Looping over different combinations of plot size and amount of fertilizer:
size = Select(browser.find_element_by_name("flaecheID"))
for i in range(len(size.options)):
    # Re-search and save the select element
    size = Select(browser.find_element_by_name("flaecheID"))
    size.select_by_value(size.options[i].get_attribute("value"))
    time.sleep(1)
    amount = Select(browser.find_element_by_name("mengeID"))
    for j in range(len(amount.options)):
        # Re-search and save the select element
        amount = Select(browser.find_element_by_name("mengeID"))
        amount.select_by_value(amount.options[j].get_attribute("value"))
        time.sleep(1)
        # Submitting the page after the two variable values are chosen:
        button = browser.find_element_by_xpath("//*[@type='submit']")
        button.click()
        time.sleep(5)
Try this? It worked for me. I hope it helps.
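A variant of the same idea is to read all the option values up front, so the loops never hold on to option elements at all; plain strings cannot go stale. A sketch under the same assumptions as the code above:

# Collect the value strings first, then re-find each select right before use
size_values = [o.get_attribute("value")
               for o in Select(browser.find_element_by_name("flaecheID")).options]
for sv in size_values:
    Select(browser.find_element_by_name("flaecheID")).select_by_value(sv)
    time.sleep(1)
    amount_values = [o.get_attribute("value")
                     for o in Select(browser.find_element_by_name("mengeID")).options]
    for av in amount_values:
        Select(browser.find_element_by_name("mengeID")).select_by_value(av)
        time.sleep(1)
        browser.find_element_by_xpath("//*[@type='submit']").click()
        time.sleep(5)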

Scraping with lxml and python requests.

Okay, I am at it again, really trying to figure this stuff out with lxml and Python requests. The last time I asked a question I was using an XPath and had to figure out what to do in case the direct XPath source itself changed. I have edited my code to go after the class instead. I keep running into a problem where it pulls up the element's address in memory rather than the text that I want. Before anyone says there is a library for what I want to do: this is not about that, but rather about letting me understand this code. Here is what I have so far, but when I print it I get an error, and adding [0], as in prices[0].text, still gives me nothing. Any help would be cool.
from lxml import html
import requests
import time

while True:
    page = requests.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
    content = html.fromstring(page.content)
    # This will create a list of prices:
    prices = content.find_class('price')
    print(prices.text)
    time.sleep(.5)
find_class returns a list of elements, so prices.text won't give you what you want; you need the text_content of each element. Try my code below:
while True:
    page = requests.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
    content = html.fromstring(page.content)
    prices = content.find_class('price')
    # You need to access each element's 'text_content' method
    text = [p.text_content() for p in prices]
    for t in text:
        if t.strip():  # skip whitespace-only entries to prevent multiple blank lines
            print(t)
    time.sleep(0.5)

Python Requests/Selenium with BeautifulSoup not returning find_all every time

I am trying to web-scrape Airbnb. I had working code, but it seems they have updated everything on the page. It intermittently returns the correct output, and sometimes it fails: it raises the NoneType error somewhere between the 3rd and 17th page, at random. Is there a way for it to keep trying, or is my code incorrect?
for page in range(1, pages + 1):
    # get page urls
    page_url = url + '&page={0}'.format(page)
    print(page_url)
    # get page
    # browser.get(page_url)
    source = requests.get(page_url)
    soup = BeautifulSoup(source.text, 'html.parser')
    # get all listings on page
    div = soup.find('div', {'class': 'row listing-cards-row'})
    # loop through to get all info needed from cards
    for pic in div.find_all('div', {'class': 'listing-card-wrapper'}):
        print(...)
The last for loop is where my error starts to occur. This happens sometimes in my other functions too: they sometimes work and sometimes don't. I have already given the lxml parser a try as well.
After reviewing the soup a couple of times, I noticed that every couple of runs the tags in the source code would change. I threw in some exception handling, and that seems to have fixed my "None" issue.
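For completeness, a sketch of the kind of guard that is meant here, reusing the url and pages variables from the question; the retry count and sleep are arbitrary choices:

import time
import requests
from bs4 import BeautifulSoup

for page in range(1, pages + 1):
    page_url = url + '&page={0}'.format(page)
    div = None
    for attempt in range(3):  # arbitrary retry count
        source = requests.get(page_url)
        soup = BeautifulSoup(source.text, 'html.parser')
        div = soup.find('div', {'class': 'row listing-cards-row'})
        if div is not None:
            break  # got the expected markup
        time.sleep(2)  # give the site a moment before retrying
    if div is None:
        continue  # give up on this page and move on
    for pic in div.find_all('div', {'class': 'listing-card-wrapper'}):
        print(...)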
