I have never done website scraping before, and I'm not even sure if this is the way to go.
I want to collect data from the tables in the image, which update 5 times per second for every parameter. This data is served on a webserver (accessible by IP) generated automatically by a microchip. I want to collect this data and save it to a database quickly enough to keep up.
Am I correct to be looking into Beautiful Soup/Selenium? If not, what tools can I use to collect and store the data and make sure it is updated every second?
Any help much appreciated!
PS: I only know Python and SQL.
Open Inspect Element and see if it's a WS (WebSocket) connection or simple AJAX.
If it's WS, use https://pypi.org/project/websocket-client/
If it's AJAX, copy that request as cURL and convert it to Python code using https://curl.trillworks.com/
PS: you do not need to use Selenium at all.
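Whichever transport it turns out to be, the storage side is the same: parse each update and batch-insert it into SQLite. A minimal sketch, assuming the endpoint returns JSON (the payload shape and field names below are invented for illustration; substitute whatever Inspect Element actually shows):

```python
import json
import sqlite3
import time

def store_reading(conn, reading):
    """Insert one snapshot of {parameter: value} pairs into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (ts REAL, name TEXT, value REAL)"
    )
    ts = time.time()
    conn.executemany(
        "INSERT INTO readings (ts, name, value) VALUES (?, ?, ?)",
        [(ts, name, value) for name, value in reading.items()],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # use a file path for persistence
# Hypothetical payload, standing in for one AJAX/WS message:
payload = json.loads('{"temperature": 21.5, "pressure": 1013.2}')
store_reading(conn, payload)
rows = conn.execute("SELECT name, value FROM readings ORDER BY name").fetchall()
print(rows)  # [('pressure', 1013.2), ('temperature', 21.5)]
```

In a real collector you would call this once per fetched message inside your polling or WebSocket loop; at 5 updates per second, batching all parameters of one snapshot into a single `executemany` keeps SQLite comfortably ahead.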
I advise you to look into curl, whatever language you want to use.
What you want to do is possible with Selenium, though.
With Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

s = Service('chromedriver.exe')
driver = webdriver.Chrome(service=s)
url = ...
driver.get(url)
while True:
    elements = driver.find_elements(By.CLASS_NAME, "nameOfElement")
    # or driver.find_elements(By.TAG_NAME, "nameOfElement"), or By.ID, etc.
    content = elements[0].text
    time.sleep(1)
    # etc ...
I am trying to scrape this mobile link https://www.tokopedia.com/now/sumo-beras-putih-kemasan-merah-5-kg using a simple requests call. The page can only be opened in the Tokopedia app on a mobile phone.
It should return the price and product name, but I can't find them in the content of the response. Do I have to use Selenium to wait for it to load? Please help.
Currently the code is just a simple
resp = requests.get("https://www.tokopedia.com/now/sumo-beras-putih-kemasan-merah-5-kg", headers={'User-Agent': 'Mozilla/5.0'})
I tried searching for the price in the response text using the `in` operator, but it's not there. What should I do?
The reason you are unable to get all the data you expect is that this website uses JavaScript. What this means for you is that you need a scraping tool capable of rendering JavaScript.
What you are doing right now is fetching the raw HTML as your browser would receive it, but you are not executing any of the JavaScript on the page, which is why your data is incomplete.
For starters, I would recommend using Selenium for the job. It'll look something like this:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('chromedriver.exe'))
driver.get('https://www.tokopedia.com/now/sumo-beras-putih-kemasan-merah-5-kg')
print(driver.page_source)
To get started with Selenium and its installation, I recommend this resource
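Once driver.page_source contains the rendered HTML, you can search it like any string. A minimal sketch of pulling the name and price out of the rendered markup (the HTML and class names below are invented for illustration; inspect the real page_source to find the actual ones):

```python
import re

# Invented sample of what the rendered product markup might look like:
html = (
    '<h1 class="product-name">Sumo Beras Putih 5 kg</h1>'
    '<div class="price">Rp79.000</div>'
)

# Grab the text between the class attribute and the closing tag.
name = re.search(r'class="product-name">([^<]+)<', html).group(1)
price = re.search(r'class="price">([^<]+)<', html).group(1)
print(name, price)  # Sumo Beras Putih 5 kg Rp79.000
```

For anything beyond a quick one-off, an HTML parser (BeautifulSoup) is more robust than regex against real-world markup.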
This isn't really a specific question, sorry about that. I'm trying to create a script that takes real-time data from another site (from a table tag, to be exact), turns it into an array, and displays it somewhere. I've created a simple Python script:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
import time

driver = webdriver.Chrome('C:/drivers/chromedriver.exe')
driver.set_page_load_timeout(10)
driver.get("link to the site")
driver.find_element_by_id("username-real").send_keys("login")
driver.find_element_by_id("pass-real").send_keys("pwd")
driver.find_element_by_xpath('//input[@class="button-login"]').submit()
# here, potentially a loop that refreshes every second:
for elem in driver.find_elements_by_xpath('//*[@class="table-body"]'):
    pass  # do something
As you can see it's pretty simple: open the Chrome webdriver, log in to the website, and do something with the table. I haven't tried to properly extract the data yet because I don't like this method.
I was wondering if there's another way to do it without running the webdriver, some console-like application? I'm pretty lost as to what I should look into in order to create a script like that. Another programming language? Some kind of framework or method?
If you want to use Selenium you have to use a WebDriver. Think of it as the "connection" between your program and Google Chrome. If you can use Safari, Selenium works without any manually installed WebDriver, since Safari ships with its own.
If you want other tools, I can recommend BeautifulSoup. It's basically an HTML parser which looks at the HTML code of the webpage. With BS you don't have to install any drivers, and you can use it from Python.
Another method I can think of is downloading the HTML text of the webpage and searching through the file locally, but I wouldn't recommend it.
For webpages that need JavaScript, Selenium is really the way to go; I often use it for my own projects.
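As a sketch of the "parse the HTML locally" idea: here is a dependency-free table extractor using Python's built-in html.parser. BeautifulSoup would make this shorter, but this shows the principle without installing anything (the sample table below is invented for illustration):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> into rows, one list per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

html = ("<table><tr><th>param</th><th>value</th></tr>"
        "<tr><td>temp</td><td>21.5</td></tr></table>")
parser = TableExtractor()
parser.feed(html)
print(parser.rows)  # [['param', 'value'], ['temp', '21.5']]
```

Note this only works if the table is present in the downloaded HTML; if the site fills the table with JavaScript, you still need Selenium (or the underlying AJAX endpoint) to get the populated source first.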
I'm using Python Selenium to save data from a webpage into a spreadsheet using Firefox, but the page continually updates its data, causing errors relating to stale elements. How do I resolve this?
I've tried turning off JavaScript but that doesn't seem to do anything. Any suggestions would be great!
If you want to save the data on the page at a specific moment in time you can:
- get the current page HTML source using the WebDriver.page_source property
- write it into a file
- open the file from disk using the WebDriver.get() function
That's it; you can now work with a local copy of the page which will never change.
Example code:
import os

driver.get("http://seleniumhq.org")
with open("mypage.html", "w") as mypage:
    mypage.write(driver.page_source)
driver.get("file://" + os.path.join(os.getcwd(), "mypage.html"))
# do what you need with the page source
Another approach is to call WebDriver.find_element again wherever you need to interact with the element.
So instead of
myelement = driver.find_element_by_xpath("//your_selector")
# some other action
myelement.get_attribute("interestingAttribute")
perform the find each time you need to interact with the element:
driver.find_element_by_xpath("//your_selector").get_attribute("interestingAttribute")
or even better go for Explicit Wait of the element you need:
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "//your/selector"))).get_attribute("href")
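Even with an explicit wait, the element can go stale between the wait and the read on a fast-updating page, so a small retry wrapper helps. The retry logic itself is plain Python; here it is demonstrated with a stand-in exception instead of Selenium's StaleElementReferenceException:

```python
import time

def retry_on_stale(action, retries=3, delay=0.1, exceptions=(Exception,)):
    """Call action(); if it raises one of `exceptions`, wait and retry.

    With Selenium you would pass
    exceptions=(StaleElementReferenceException,) and an action like
    lambda: driver.find_element_by_xpath("//your_selector").get_attribute("href")
    so that each retry re-finds the element from scratch.
    """
    for attempt in range(retries):
        try:
            return action()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of retries: re-raise the last error
            time.sleep(delay)

# Demonstration with a stand-in for a stale element: fails twice, then works.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("stale element reference")
    return "href-value"

result = retry_on_stale(flaky, retries=5, delay=0)
print(result)  # href-value
```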
I have already written a script that opens Firefox with a URL, scrapes data, and closes. The page belongs to a gaming site that refreshes its contents via AJAX.
One way is to fetch those AJAX requests directly and get the data; the other is to refresh the page after a certain period of time within the open browser.
For the latter case, how should I do it? Should I call the method after a certain time period, or what?
You can use the time library to do it. For example:
import time
from selenium import webdriver

driver = webdriver.Firefox()
while <condition>:
    driver.get("http://www.url.org")
    # extract and save data
    time.sleep(5)  # wait 5 seconds (time.sleep takes seconds, not milliseconds)
driver.close()
You can implement a so-called smart wait:
- Identify the most frequently updated, useful-to-you web element on the page.
- Get data from it using JavaScript, since the DOM model will not update without a page refresh, e.g.:
driver.execute_script('return document.getElementById("demo").innerHTML')
- Wait for a certain time, get it again, and compare with the previous result. If it changed, refresh the page, fetch the data, etc.
Make sure to call find_element() again after waiting, because otherwise you might not get a fresh instance. Or use a page factory, which gets a fresh copy of the WebElement every time the instance is accessed.
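The smart-wait idea above can be sketched generically: poll a value and only act when it changes. Here the fetcher is a stub standing in for the execute_script call, so the sketch runs without a browser:

```python
import time

def wait_for_change(fetch, timeout=5.0, interval=0.01):
    """Poll fetch() until its value differs from the first reading.

    In Selenium, fetch would be something like
    lambda: driver.execute_script(
        'return document.getElementById("demo").innerHTML')
    """
    baseline = fetch()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = fetch()
        if current != baseline:
            return current  # the element changed: time to refresh/scrape
        time.sleep(interval)
    raise TimeoutError("element did not change within timeout")

# Stub: returns "old" for the first two calls, then "new".
readings = iter(["old", "old", "new", "new"])
changed = wait_for_change(lambda: next(readings), timeout=1.0, interval=0)
print(changed)  # new
```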
Try refreshing the page from time to time to get the updated result:
driver.refresh()
To refresh the page at regular intervals, refer to this link:
Running a python script for a user-specified amount of time?
Hope it helps :)
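A minimal sketch of running a refresh loop for a user-specified amount of time, using a monotonic-clock deadline. The action is stubbed here; with Selenium it would be driver.refresh() followed by your scraping:

```python
import time

def run_for(duration, action, interval=0.0):
    """Call action() repeatedly until `duration` seconds have elapsed."""
    deadline = time.monotonic() + duration
    count = 0
    while time.monotonic() < deadline:
        action()  # e.g. driver.refresh() then extract the table
        count += 1
        if interval:
            time.sleep(interval)
    return count

ticks = []
n = run_for(0.05, lambda: ticks.append(time.monotonic()))
print(n > 0)  # True
```

time.monotonic() is preferred over time.time() for deadlines because it is unaffected by system clock adjustments.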
I have a webpage:
http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#
and I need to extract the table from this webpage.
Problem encountered: I have been using BeautifulSoup and requests to get the URL content. The problem with these methods is that I get the web content before the table is generated,
so I get an empty table:
<table>
  <thead>
  </thead>
  <tbody>
  </tbody>
</table>
My approach: now I am trying to open the URL in the browser using
webbrowser.open_new_tab(url) and then get the content from the browser directly. This gives the server time to populate the table, and then I will be able to get the content from the page.
Problem: I am not sure how to fetch information from the web browser directly.
Right now I am using Mozilla on a Windows system.
The closest thing I found is this website Link, but it tells you which sites are open, not their content.
Is there any other way to let the table load with urllib2 or BeautifulSoup and requests? Or is there any way to get the loaded content directly from the webpage?
Thanks
To add to Santiclause's answer: if you want to scrape JavaScript-populated data, you need something to execute that JavaScript.
For that you can use the selenium package with a webdriver such as Firefox or PhantomJS (which is headless) to connect to the page, execute the scripts, and get the data.
An example for your case:
from selenium import webdriver

driver = webdriver.Firefox()  # you can replace this with other webdrivers
driver.get("http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#")
source = driver.page_source  # here is your populated data
driver.quit()  # don't forget to quit the driver!
Of course, if you can access the JSON directly as user Santiclause mentioned, you should do that. You can find it by checking the Network tab when inspecting the website, which takes some playing around.
The reason the table isn't being filled is that Python doesn't process the page it receives with urllib2, so there's no DOM, no JavaScript that runs, et cetera.
After reading through the source, it looks like the information you're looking for can be found at http://kff.org/datacenter.json?post_id=32781 in JSON format.
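Fetching a JSON endpoint like that and turning it into table rows is straightforward with the standard library. The exact shape of that endpoint's response isn't shown here, so the sample payload below is invented for illustration; the fetch-then-flatten pattern is the point:

```python
import json
from urllib.request import urlopen

def fetch_json(url):
    """Download and parse a JSON endpoint (e.g. the datacenter.json URL)."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def rows_from_payload(payload):
    """Flatten a {state: value} mapping into sorted (state, value) rows.

    The payload shape here is an assumption for illustration; inspect the
    real response to see its actual structure before adapting this.
    """
    return sorted(payload.items())

# Stand-in for fetch_json(url) so the sketch runs offline:
sample = json.loads('{"Texas": "Yes", "Vermont": "No"}')
print(rows_from_payload(sample))  # [('Texas', 'Yes'), ('Vermont', 'No')]
```

Hitting the JSON endpoint directly is both faster and more stable than scraping the rendered table, since there is no browser or DOM involved.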