I wrote a simple script that uses the schedule module to download a file from a web page once a week. Before downloading, it uses BeautifulSoup to check whether the file was updated; if so, it downloads the file with wget. Another script then uses the file to perform calculations.
The problem is that the file doesn't appear in the directory until I manually interrupt the script. So each time I must interrupt the script and rerun it so it is scheduled for the next week.
Is there any way to download and save the file "on the fly", without interrupting the script?
Here is the code:
import wget
import ssl
import schedule
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datefinder
from datetime import datetime

# disable certificate checks
ssl._create_default_https_context = ssl._create_unverified_context

# check whether the file was updated; if yes, download it, otherwise wait and retry
def download_file():
    if check_for_updates():
        print("downloading")
        url = 'https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv'
        wget.download(url)
        print("downloading complete")
    else:
        print("sleeping")
        time.sleep(60)
        download_file()  # recursive retry; a while loop would avoid growing the call stack

# check whether the website was updated today
def check_for_updates():
    url2 = 'https://fgisonline.ams.usda.gov/ExportGrainReport/default.aspx'
    html = urlopen(url2).read()
    soup = BeautifulSoup(html, "lxml")
    text_to_search = soup.body.ul.li.string
    matches = list(datefinder.find_dates(text_to_search[30:]))
    found_date = matches[0].date()
    today = datetime.today().date()
    return found_date == today

schedule.every().tuesday.at('09:44').do(download_file)

while True:
    schedule.run_pending()
    time.sleep(1)
You need to specify the output directory. I think that unless you do this, PyCharm saves the file to a temp directory somewhere and only copies it over when you stop the script.
Change to:
wget.download(url, out=output_directory)
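For instance, a minimal sketch of the fixed download step (output_directory is a hypothetical path you would choose):
import wget

output_directory = r'C:\data\grain_reports'  # hypothetical target folder
url = 'https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv'
filename = wget.download(url, out=output_directory)  # returns the path of the saved file
print(filename)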
Based on the following clue you should be able to solve your issue:
from bs4 import BeautifulSoup
import requests
import urllib3

urllib3.disable_warnings()

def main(url):
    r = requests.head(url, verify=False)
    print(r.headers['Last-Modified'])

main("https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv")
Output:
Mon, 28 Sep 2020 15:02:22 GMT
Now you can run your script daily via a cron job at whatever time you prefer, loop over the file's Last-Modified header until it equals today's date, and then download the file.
Note that I used a HEAD request, which is far faster for polling since it retrieves only the headers; once the date matches, you can switch to requests.get.
I'd also prefer to work under a single session.
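A minimal sketch of that polling loop, assuming the server always sends a Last-Modified header; the poll interval and output filename are illustrative:
import time
from datetime import datetime
from email.utils import parsedate_to_datetime

import requests
import urllib3

urllib3.disable_warnings()
url = "https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv"

with requests.Session() as session:
    while True:
        head = session.head(url, verify=False)
        modified = parsedate_to_datetime(head.headers["Last-Modified"]).date()
        if modified == datetime.utcnow().date():  # Last-Modified is GMT, so compare UTC dates
            with open("CY2020.csv", "wb") as f:  # illustrative output path
                f.write(session.get(url, verify=False).content)
            break
        time.sleep(600)  # poll every 10 minutes; adjust to taste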
Related
I have some Python code which scrapes a website and reports the live price of a specific crypto. When I use a while loop to keep printing the live price, it prints the same price over and over, even after the live price on the website has changed. I thought my code might be hitting the website too fast, so I added a delay using the time module, but even with a one-minute delay it keeps printing the same stale price. Manually ending and restarting the code seemed to make this bug go away, but I want this program to run 24/7 and email me when the price reaches a certain point. This is my code so far (BTW, I am a beginner):
import requests
import bs4
import time

run = True
while run == True:
    # time.sleep(60)
    res = requests.get("https://coinmarketcap.com/currencies/gitcoin/")
    soup_obj = bs4.BeautifulSoup(res.text, "lxml")
    item = soup_obj.select(".priceValue___11gHJ")[0]
    item = item.text
    print(item)
    exit()
This has a loop, but I have added an exit() call so that it ends and I can manually restart it. I just need a way for this code to automatically end itself and then restart repeatedly. I am also using the Community Edition of PyCharm (latest version).
You can write your program to call a subprocess instead of doing the web call itself. That subprocess can call requests, return whatever you want via stdout, and exit. There are multiple ways to do this: you could write separate scripts or use multiprocessing.Process, but in this example I've written a script that calls itself and uses a command-line parameter to know which role it is playing.
import sys

if len(sys.argv) == 1:
    # run poller as subprocess so it exits
    import time
    import subprocess as subp

    while True:
        result = subp.run([sys.executable, __file__, "called"], capture_output=True)
        # assuming program returns ascii float in single line
        item = result.stdout.decode("ascii").strip()
        print(item)
        time.sleep(60)
else:
    import requests
    import bs4

    res = requests.get("https://coinmarketcap.com/currencies/gitcoin/")
    soup_obj = bs4.BeautifulSoup(res.text, "lxml")
    item = soup_obj.select(".priceValue___11gHJ")[0]
    item = item.text
    sys.stdout.write(item)
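If you'd rather not have the script re-invoke itself, here is a minimal sketch of the multiprocessing.Process variant mentioned above; the URL and selector are copied from the question, and each poll runs in a fresh child process whose interpreter state is discarded on exit:
import time
from multiprocessing import Process

def fetch_price():
    # imports live inside the target so each child starts with a clean interpreter
    import requests
    import bs4
    res = requests.get("https://coinmarketcap.com/currencies/gitcoin/")
    soup = bs4.BeautifulSoup(res.text, "lxml")
    print(soup.select(".priceValue___11gHJ")[0].text)

if __name__ == "__main__":
    while True:
        p = Process(target=fetch_price)
        p.start()
        p.join()  # wait for the child to finish and exit
        time.sleep(60)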
I started learning web scraping with Python. Currently, I would like to download a video of the Japanese Diet. (https://www.shugiintv.go.jp/jp/index.php?ex=VL&deli_id=40124&media_type=)
The video seems to have a mechanism to call chunklist.m3u8 from playlist.m3u8 and then call the ts files described in chunklist.m3u8 in order.
I want to download the contents of the playlist.m3u8 URL first, then fetch chunklist.m3u8, download the ts files in order, and concatenate them.
However, when I tried to download playlist.m3u8, it didn't produce the text I expected.
Here is a sample playlist.m3u8 URL:
http://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/playlist.m3u8
code:
import requests
url = "http://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/playlist.m3u8"
res = requests.get(url)
print(res.text)
expected text:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-STREAM-INF:BANDWIDTH=564000,NAME="500k",RESOLUTION=640x360
chunklist_w60346572_b564000_t64NTAwaw==.m3u8
actual text:
<html><head><title>Wowza Streaming Engine 4 Perpetual Bundle Unlimited Edition 4.7.7 build20181108145350</title></head><body>Wowza Streaming Engine 4 Perpetual Bundle Unlimited Edition 4.7.7 build20181108145350</body></html>
I think there is a problem with the colon in the URL, but I don't have a clear solution. I would like to know how to avoid URL issues and successfully download the text in playlist.m3u8. Thanks.
Version:
Python 3.7.9
requests 2.25.1
Something is wrong with your URL:
>>> url = "http://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/playlist.m3u8 "  # note the trailing space
>>> res = requests.get(url)
>>> res.request.url
'https://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/playlist.m3u8%20'
See the "%20" at the end? Your URL string ends with a space.
I am not really sure how it got in there, but copy-pasting this should work:
url = 'https://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/playlist.m3u8'
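From there, a minimal sketch of the playlist -> chunklist -> segments pipeline the question describes, assuming the chunklist and the .ts segments live in the same directory as the playlist (the output filename is illustrative; many players will play the raw concatenation):
import requests
from urllib.parse import urljoin

playlist_url = 'https://hlsvod.shugiintv.go.jp/vod/_definst_/amlst:2011/2011-1207-0900-12/playlist.m3u8'
playlist = requests.get(playlist_url).text

# the first non-comment line of the playlist is the chunklist URI
chunklist_name = next(line for line in playlist.splitlines() if line and not line.startswith("#"))
chunklist = requests.get(urljoin(playlist_url, chunklist_name)).text

with open("video.ts", "wb") as out:
    for line in chunklist.splitlines():
        if line and not line.startswith("#"):  # every non-comment line is a segment URI
            out.write(requests.get(urljoin(playlist_url, line)).content)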
I'm creating an application that downloads PDFs from a website and saves them to disk. I understand the requests module is capable of the download itself, but not of handling the logic around it (file size, progress, time remaining, etc.).
I've built the program using Selenium so far, and would eventually like to incorporate it into a Tkinter GUI app.
What would be the best way to handle the downloading and tracking, and eventually create a progress bar?
This is my code so far:
from selenium import webdriver
from time import sleep
import requests
import secrets  # local secrets.py holding username/password

class manual_grabber():
    """ A class creating a manual downloader for the Roger Technology website """

    def __init__(self):
        """ Initialize attributes of manual grabber """
        self.driver = webdriver.Chrome('\\Users\\Joel\\Desktop\\Python\\manual_grabber\\chromedriver.exe')

    def login(self):
        """ Function controlling the login logic """
        self.driver.get('https://rogertechnology.it/en/b2b')
        sleep(1)
        # Locate elements and enter login details
        user_in = self.driver.find_element_by_xpath('/html/body/div[2]/form/input[6]')
        user_in.send_keys(secrets.username)
        pass_in = self.driver.find_element_by_xpath('/html/body/div[2]/form/input[7]')
        pass_in.send_keys(secrets.password)
        enter_button = self.driver.find_element_by_xpath('/html/body/div[2]/form/div/input')
        enter_button.click()
        # Click Self Service Area button
        self_service_button = self.driver.find_element_by_xpath('//*[@id="bs-example-navbar-collapse-1"]/ul/li[1]/a')
        self_service_button.click()

    def download_file(self):
        """Access file tree and navigate to PDF's and download"""
        # Wait for all elements to load
        sleep(3)
        # Find and switch to iFrame
        frame = self.driver.find_element_by_xpath('//*[@id="siteOutFrame"]/iframe')
        self.driver.switch_to.frame(frame)
        # Find and click tech manuals button
        tech_manuals_button = self.driver.find_element_by_xpath('//*[@id="fileTree_1"]/ul/li/ul/li[6]/a')
        tech_manuals_button.click()

bot = manual_grabber()
bot.login()
bot.download_file()
So in summary, I'd like to make this code download the PDFs on the website, store each in a specific directory (named after its parent folder in the jQuery File Tree), and keep track of the progress (file size, time remaining, etc.).
(A screenshot of the DOM was attached here.)
I hope this is enough information. If any more is required, please let me know.
I would recommend using tqdm and the requests module for this.
Here is sample code that handles the job of downloading while updating a progress bar:
from tqdm import tqdm
import requests

url = "http://www.ovh.net/files/10Mb.dat"  # big file test

# Streaming, so we can iterate over the response.
response = requests.get(url, stream=True)
total_size_in_bytes = int(response.headers.get('content-length', 0))
block_size = 1024  # 1 Kibibyte
progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)

with open('test.dat', 'wb') as file:
    for data in response.iter_content(block_size):
        progress_bar.update(len(data))  # change this to your widget in tkinter
        file.write(data)
progress_bar.close()

if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
    print("ERROR, something went wrong")
Here, block_size is the chunk size read on each iteration (not the file size), and the time remaining can be estimated from the current download rate and the number of bytes still to be fetched. Here is an alternative: How to measure download speed and progress using requests?
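If you want raw numbers for a Tkinter label instead of tqdm's bar, here is a rough sketch of computing speed and time remaining by hand (same test URL as above; the variable names are illustrative):
import time
import requests

url = "http://www.ovh.net/files/10Mb.dat"
response = requests.get(url, stream=True)
total = int(response.headers.get('content-length', 0))
downloaded = 0
start = time.time()
with open('test.dat', 'wb') as f:
    for chunk in response.iter_content(1024):
        f.write(chunk)
        downloaded += len(chunk)
        speed = downloaded / max(time.time() - start, 1e-9)  # bytes per second
        eta = (total - downloaded) / speed if speed else 0   # seconds remaining
        # update your Tkinter progress bar / label here instead of printing
print("done: %d bytes in %.1fs" % (downloaded, time.time() - start))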
Struggling with what I am sure is a very straightforward problem. I have a scheduled task set up which launches a batch file, which in turn runs a Python script. All is well; however, I cannot seem to close the Python shell once the script is finished. The result is lots of open windows.
If this is a Python issue, I have read the best way to close is to do the following:
import selenium
import json
import time
import datetime
import sys
from selenium import webdriver
from datetime import timedelta

# yesterday's date, formatted for the report form
today = datetime.datetime.today()
yesterday = today - timedelta(days=1)
yesterday = yesterday.strftime("%d.%m.%Y")

# log in
browser = webdriver.Chrome(executable_path = 'c:/xampp/htdocs/portal/functions/timon/chromedriver.exe')
browser.get('http://adventures.timon.is')
time.sleep(2)
browser.find_element_by_id('tbxNumerstarfsmanns').clear()
browser.find_element_by_id('tbxNumerstarfsmanns').send_keys('user')
browser.find_element_by_id('tbxUserLykilord').clear()
browser.find_element_by_id('tbxUserLykilord').send_keys('pass')
time.sleep(2)
browser.find_element_by_css_selector('input[type=\"submit\"]').click()

# navigate to the punch-in report
browser.find_element_by_css_selector("a[href*=reports]").click()
browser.find_element_by_link_text("Salary administrators").click()
browser.find_element_by_link_text("Punch-in report").click()
time.sleep(2)

# set the date range to yesterday and submit
browser.find_element_by_id('id_fromdate').clear()
browser.find_element_by_id('id_fromdate').send_keys(yesterday)
browser.find_element_by_id('id_todate').clear()
browser.find_element_by_id('id_todate').send_keys(yesterday)
time.sleep(2)
browser.find_element_by_css_selector("input[type=submit]").click()
time.sleep(2)

# scrape the results table and dump it as JSON
results = browser.find_elements_by_css_selector("table#resultstable td")
columns = [val.text for val in results]
data = json.dumps(columns)
text_file = open("c:/xampp/htdocs/portal/functions/timon/info.txt", "w")
text_file.write(data)
text_file.close()

# close the browser and exit
browser.close()
sys.exit()
However, this does not work.
The batch file looks like this:
start "extractTimon" "C:\xampp\Python36-32\python.exe" C:\xampp\htdocs\portal\functions\timon\extractTimon.py
If anyone could point me in the right direction, I'd really appreciate it.
I'm trying to manipulate a dynamic JSON from this site:
http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do
It has three elements: imagem (a base64 image), labelValorCaptcha (just a message), and uuidCaptcha (a value to pass as a parameter to play a sound at the link below):
http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha=sajcaptcha_e7b072e1fce5493cbdc46c9e4738ab8a
When I open the first site in a browser and append the uuidCaptcha to the second link after the equals sign ("...uuidCaptcha="), the sound plays normally. I wrote some simple code to grab these elements.
import urllib, json
url = "http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do"
response = urllib.urlopen(url)
data = json.loads(response.read())
urlSound = "http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha="
print urlSound + data['uuidCaptcha']
But I don't know what's happening: the captured uuidCaptcha value doesn't work, and an error page opens instead.
Does anyone know why?
Thanks!
It works for me.
$ cat a.py
#!/usr/bin/env python
# encoding: utf-8
import urllib, json
url = "http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do"
response = urllib.urlopen(url)
data = json.loads(response.read())
urlSound = "http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha="
print urlSound + data['uuidCaptcha']
$ python a.py
http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha=sajcaptcha_efc8d4bc3bdb428eab8370c4e04ab42c
As @Charlie Harding said, the best way is to download the page and read the JSON values from it, because this JSON is dynamic and the uuidCaptcha only exists for the request that generated it.
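To illustrate that last point, a small Python 3 sketch that keeps the JSON request and the sound request in one session, so the uuidCaptcha is consumed by the same client that generated it (the status-code check at the end is illustrative):
import requests

with requests.Session() as s:
    data = s.get("http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do").json()
    sound_url = ("http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do"
                 "?timestamp=1455996420264&uuidCaptcha=" + data["uuidCaptcha"])
    audio = s.get(sound_url)
    print(audio.status_code, audio.headers.get("Content-Type"))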