I have some Python code that scrapes a website and reports the live price of a specific cryptocurrency. When I use a while loop to keep printing the live price, it prints the same price over and over even after the price on the website has changed. I thought maybe my code was hitting the website too fast, so I added a delay using the time module, but even with a 1-minute delay it keeps printing the same stale price. Manually ending and restarting the code seemed to make this bug go away, but I want this program to run 24/7 and email me when the price reaches a certain point. This is my code so far (BTW, I am a beginner):
import requests
import bs4
import time

run = True
while run == True:
    # time.sleep(60)
    res = requests.get("https://coinmarketcap.com/currencies/gitcoin/")
    soup_obj = bs4.BeautifulSoup(res.text, "lxml")
    item = soup_obj.select(".priceValue___11gHJ")[0]
    item = item.text
    print(item)
    exit()
This has a loop, but I have added an exit() call so that it ends and I can manually restart it. I just need a way for this code to automatically end itself and then restart repeatedly. I am using the Community Edition of PyCharm (latest version).
You can write your program to call a subprocess instead of doing the web call itself. That subprocess can call requests, return whatever you want via stdout, and exit. There are multiple ways to do this: you could write separate scripts or use multiprocessing.Process (a sketch of that variant follows the example below), but in this example I've written a script that calls itself and uses command-line parameters to know which role it is playing.
import sys

if len(sys.argv) == 1:
    # run poller as subprocess so it exits
    import time
    import subprocess as subp

    while True:
        result = subp.run([sys.executable, __file__, "called"], capture_output=True)
        # assuming program returns ascii float in single line
        item = result.stdout.decode("ascii").strip()
        print(item)
        time.sleep(60)
else:
    import requests
    import bs4

    res = requests.get("https://coinmarketcap.com/currencies/gitcoin/")
    soup_obj = bs4.BeautifulSoup(res.text, "lxml")
    item = soup_obj.select(".priceValue___11gHJ")[0]
    item = item.text
    sys.stdout.write(item)
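For completeness, here is a minimal sketch of the multiprocessing.Process variant mentioned above. It reuses the same URL and CSS selector as your original code and assumes that running each fetch in a fresh child process is enough to avoid the stale result you are seeing; treat it as an illustration rather than a drop-in solution.
import time
from multiprocessing import Process, Queue

def fetch_price(queue):
    # import inside the child so each run starts from a clean interpreter state
    import requests
    import bs4

    res = requests.get("https://coinmarketcap.com/currencies/gitcoin/")
    soup_obj = bs4.BeautifulSoup(res.text, "lxml")
    queue.put(soup_obj.select(".priceValue___11gHJ")[0].text)

if __name__ == "__main__":
    while True:
        q = Queue()
        p = Process(target=fetch_price, args=(q,))
        p.start()
        p.join()
        print(q.get())
        time.sleep(60)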
I wrote a simple script which schedules the download of a file from a web page once every week using the schedule module. Before downloading, it checks whether the file was updated using BeautifulSoup. If so, it downloads the file using wget. Another script then uses the file to perform calculations.
The problem is that the file won't appear in the directory until I manually interrupt the script. So each time I must interrupt the script and rerun it so it is scheduled for the next week.
Is there any way to download and save the file "on the fly" without interrupting the script?
The code is:
import wget
import ssl
import schedule
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datefinder
from datetime import datetime

# disable certificate checks
ssl._create_default_https_context = ssl._create_unverified_context

# checking if file was updated, if yes, download file, if not waiting until updated
def download_file():
    if check_for_updates():
        print("downloading")
        url = 'https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv'
        wget.download(url)
        print("downloading complete")
    else:
        print("sleeping")
        time.sleep(60)
        download_file()

# Checking if website was updated
def check_for_updates():
    url2 = 'https://fgisonline.ams.usda.gov/ExportGrainReport/default.aspx'
    html = urlopen(url2).read()
    soup = BeautifulSoup(html, "lxml")
    text_to_search = soup.body.ul.li.string
    matches = list(datefinder.find_dates(text_to_search[30:]))
    found_date = matches[0].date()
    today = datetime.today().date()
    return found_date == today

schedule.every().tuesday.at('09:44').do(download_file)

while True:
    schedule.run_pending()
    time.sleep(1)
You need to specify the output directory. I think that unless you do this, PyCharm saves the file to a temp directory somewhere and only copies it over when you stop the script.
Change to:
wget.download(url, out=output_directory)
Based on the following clue you should be able to solve your issue:
from bs4 import BeautifulSoup
import requests
import urllib3

urllib3.disable_warnings()

def main(url):
    r = requests.head(url, verify=False)
    print(r.headers['Last-Modified'])

main("https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv")
Output:
Mon, 28 Sep 2020 15:02:22 GMT
Now you can run your script via a cron job daily at the time you prefer, loop over the file's Last-Modified header until it equals today's date, and then download the file.
Note that I used a HEAD request, which is much faster for this kind of polling; once the header shows an update you can use requests.get to fetch the file.
I also prefer to work within a single session.
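Purely as an illustrative sketch of that idea (the wait_for_update helper and output filename are my own, and it assumes the Last-Modified header is always present), the polling loop could look something like this:
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import requests
import urllib3

urllib3.disable_warnings()
URL = "https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv"

def wait_for_update(session):
    # poll the header until the file's Last-Modified date equals today's date
    while True:
        r = session.head(URL, verify=False)
        last_modified = parsedate_to_datetime(r.headers["Last-Modified"])
        if last_modified.date() == datetime.now(timezone.utc).date():
            return
        time.sleep(60)

with requests.Session() as session:
    wait_for_update(session)
    r = session.get(URL, verify=False)
    with open("CY2020.csv", "wb") as f:
        f.write(r.content)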
I built this function to tell me whether a website has changed. I'm not sure it works: I have tried it on a few websites that have not changed and it gave me the wrong output. Where is the issue, and is there an issue at all?
This is the code:
I put the code into a function so that the user can input any site.
userurl = input("Please enter a valid url")

def checksite(userurl):
    change = False
    import time
    import urllib.request
    import io
    u = urllib.request.urlopen(userurl)
    webContent1 = u.read()
    time.sleep(60)
    u = urllib.request.urlopen(userurl)
    webContent2 = u.read()
    if webContent1 == webContent2:
        print("Everything is normal")
    elif webContent1 != webContent2:
        print("Warning, there has been a change to the webite!")
        change = True
    return change

checksite(userurl)
Try making a small HTML Hello World page. Given that many websites have dynamic content that changes each time you access it (and might not necessarily be visible), that could lead to your "incorrect" results.
I have tested your code and it works perfectly fine against a local Python web server.
I started one with
python -m http.server
and placed an index.html with some content in the same directory before starting the server.
Here is your code:
import time
import urllib.request
import io

userurl = 'http://localhost:8000/index.html'

def checksite(userurl):
    change = False
    u = urllib.request.urlopen(userurl)
    webContent1 = u.read()
    print(webContent1)
    time.sleep(15)
    u = urllib.request.urlopen(userurl)
    webContent2 = u.read()
    print(webContent2)
    if webContent1 == webContent2:
        print("Everything is normal")
    elif webContent1 != webContent2:
        print("Warning, there has been a change to the webite!")
        change = True
    return change

checksite(userurl)
and the output:
b'<html>\n\t<title> Hello </title>\n\t<body>\n\t\tTesting, Webcontent1 \n\t</body>\n\t</html>\n\n'
b'<html>\n\t<title> Hello </title>\n\t<body>\n\t\tTesting, Webcontent2\n\t</body>\n\t</html>\n\n'
Warning, there has been a change to the webite!
[Finished in 17.5s]
Your code is perfectly fine.
To know whether a website or a page has changed, you need a backup of it somewhere; in your code you are essentially comparing the site to itself. I recommend using the requests library in addition to BS4 and parsing the page line by line, comparing it to the backup you have.
While the code finds that the site on the web still matches your backup, a flag stays true; as soon as something has changed, it breaks out of the loop and shows the line where the site changed.
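Here is a simplified sketch of that idea, assuming you keep a previously saved copy of the page on disk (the compare_to_backup name, backup_path argument, and example URL are illustrative). It compares raw lines rather than BS4-parsed tags and reports the first line that differs:
import requests

def compare_to_backup(url, backup_path):
    live_lines = requests.get(url).text.splitlines()
    with open(backup_path, encoding="utf-8") as f:
        backup_lines = f.read().splitlines()

    # walk both copies in parallel and stop at the first mismatch
    for lineno, (old, new) in enumerate(zip(backup_lines, live_lines), start=1):
        if old != new:
            print("Line %d changed: %r -> %r" % (lineno, old, new))
            return False
    # also treat added or removed trailing lines as a change
    return len(backup_lines) == len(live_lines)

# example usage
unchanged = compare_to_backup("http://localhost:8000/index.html", "backup.html")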
Fairly new to Python, I learn by doing, so I thought I'd give this project a shot: a script which finds the Google Analytics request for a given website, parses the request payload, and does something with it.
Here are the requirements:
Ask the user for 2 URLs (for comparing the payloads from 2 different HAR captures)
Use Selenium to open the two URLs, and use browsermob-proxy/PhantomJS to capture all HAR entries
Store the HAR as a list
From the list of all HAR entries, find the Google Analytics request, including the payload
If a Google Analytics tag is found, then do things... like parse the payload, compare the payloads, etc.
Issue: sometimes, for a website that I know has Google Analytics (e.g. nytimes.com), the HAR I get is incomplete, i.e. my program says "GA not found", but only because the complete HAR was not captured, so when the regex ran there was no matching entry. This issue is intermittent and does not happen every time. Any ideas?
I'm thinking that due to some dependency or latency the script moved on before the complete HAR was captured. I tried the "wait for traffic to stop" call but maybe I didn't use it correctly.
Also, as a bonus, I would appreciate any help on making this script run faster; it's fairly slow. As I mentioned, I'm new to Python, so go easy :)
This is what I've got thus far.
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime

def cleanup():
    s.stop()
    driver.quit()

proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()

proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any']  # so that I can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)

urlLists = []
collectTags = []
gaCollect = 0
varList = []

for x in range(0, 2):  # I want to ask the user for 2 inputs
    url = raw_input("Enter a website to find GA on: ")
    time.sleep(2.0)
    urlLists.append(url)
    if not url:
        print "You need to type something in...here"
        sys.exit()

# gets the two user urls and stores them in a list
for urlList in urlLists:
    print urlList, 'start 2nd loop'  # printing for debug purposes, no need for this
    if not urlList:
        print 'Your Url list is empty'
        sys.exit()
    proxy.new_har()
    driver.get(urlList)
    # proxy.wait_for_traffic_to_stop(15, 30)  # <-- tried this but it did not do anything
    for ent in proxy.har['log']['entries']:
        gaCollect = (ent['request']['url'])
        print gaCollect
        if re.search(r'google-analytics.com/r\b', gaCollect):
            print 'Found GA'
            collectTags.append(gaCollect)
            time.sleep(2.0)
            break
        else:
            print 'No GA Found - Ending Prog.'
            cleanup()
            sys.exit()

cleanup()
This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove sys.exit() -- this causes your programme to stop after the first iteration through the ent list, so if what you want is not the first thing, it won't be found
2 - call new_har with the captureContent option enabled to get the payload of requests:
proxy.new_har(options={'captureHeaders':True, 'captureContent': True})
See if that helps.
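A rough sketch of how the inner loop might look with both changes applied, keeping the rest of your script (imports, proxy and driver setup) as-is; the found_ga flag is just an illustrative name:
for urlList in urlLists:
    # captureContent is needed to get the request payloads
    proxy.new_har(options={'captureHeaders': True, 'captureContent': True})
    driver.get(urlList)

    found_ga = False
    for ent in proxy.har['log']['entries']:
        request_url = ent['request']['url']
        if re.search(r'google-analytics.com/r\b', request_url):
            collectTags.append(request_url)
            found_ga = True
            break

    # note: no sys.exit() here, so the script moves on to the next URL either way
    if not found_ga:
        print('No GA found for %s' % urlList)

cleanup()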
I am trying to test this demo program from Lynda using Python 3. I am using PyCharm as my IDE. I already added and installed the requests package, but when I run the program it finishes cleanly with the message "Process finished with exit code 0" and does not show any output from the print statements. Where am I going wrong?
import urllib.request  # instead of urllib2 like in Python 2.7
import json

def printResults(data):
    # Use the json module to load the string data into a dictionary
    theJSON = json.loads(data)

    # now we can access the contents of the JSON like any other Python object
    if "title" in theJSON["metadata"]:
        print(theJSON["metadata"]["title"])

    # output the number of events, plus the magnitude and each event name
    count = theJSON["metadata"]["count"]
    print(str(count) + " events recorded")

    # for each event, print the place where it occurred
    for i in theJSON["features"]:
        print(i["properties"]["place"])

    # print the events that only have a magnitude greater than 4
    for i in theJSON["features"]:
        if i["properties"]["mag"] >= 4.0:
            print("%2.1f" % i["properties"]["mag"], i["properties"]["place"])

    # print only the events where at least 1 person reported feeling something
    print("Events that were felt:")
    for i in theJSON["features"]:
        feltReports = i["properties"]["felt"]
        if feltReports != None:
            if feltReports > 0:
                print("%2.1f" % i["properties"]["mag"], i["properties"]["place"], " reported " + str(feltReports) + " times")

# Open the URL and read the data
urlData = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.geojson"
webUrl = urllib.request.urlopen(urlData)
print(webUrl.getcode())
if webUrl.getcode() == 200:
    data = webUrl.read()
    data = data.decode("utf-8")  # in Python 3.x we need to explicitly decode the response to a string

    # print out our customized results
    printResults(data)
else:
    print("Received an error from server, cannot retrieve results " + str(webUrl.getcode()))
Not sure if you left this out on purpose, but this script isn't actually executing any code beyond the imports and function definition. Assuming you didn't leave it out on purpose, you would need the following at the end of your file.
if __name__ == '__main__':
    data = ""  # your data
    printResults(data)
The check on __name__ equaling "__main__" is just so your code only executes when the file is explicitly run. To always run your printResults(data) function when the file is accessed (say, if it's imported into another module), you could just call it at the bottom of your file like so:
data = "" # your data
printResults(data)
I had to restart the IDE after installing the module. I just realized this and tried it now with "Run as Admin"; strangely it seems to work now. I'm not sure whether it was a temporary error, though, since even without the restart it was able to detect the module and its methods.
Your comments about having to restart your IDE make me think that PyCharm might not automatically detect newly installed Python packages. This SO answer seems to offer a solution.
I am new to Python and I am trying to run the following code using requests:
import requests
import wiringpi2
import time

wiringpi2.wiringPiSetupGpio()
wiringpi2.pinMode(17, 1)
wiringpi2.digitalWrite(17, 1)

while 1:
    relaystatus = requests.get('http://stevesolarhome.com/WaterControl.txt')
    if relaystatus == "1":
        wiringpi2.digitalWrite(17, 1)
    elif relaystatus == "0":
        wiringpi2.digitalWrite(17, 0)
    time.sleep(2)
The GPIO pins do not react to the file being changed. The file only ever contains the number 1 or 0. I know the URL works and the request returns the correct number from the text file. I also know the GPIO pins work, but this script does not. I assume the value being read is not in the correct format to be used in the 'if' line.
requests.get(url) returns a Response object. To get the underlying content, use its text attribute (a fuller sketch follows the snippet below):
while 1:
    request = requests.get('http://stevesolarhome.com/WaterControl.txt')
    if request.text == "1":
        ... do stuff ...
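Putting that together with your original loop, a minimal sketch might look like this; the .strip() call is an extra precaution I'm assuming may be needed in case the file ends with a newline:
import requests
import wiringpi2
import time

wiringpi2.wiringPiSetupGpio()
wiringpi2.pinMode(17, 1)
wiringpi2.digitalWrite(17, 1)

while True:
    response = requests.get('http://stevesolarhome.com/WaterControl.txt')
    status = response.text.strip()  # strip whitespace/newline around the 1 or 0
    if status == "1":
        wiringpi2.digitalWrite(17, 1)
    elif status == "0":
        wiringpi2.digitalWrite(17, 0)
    time.sleep(2)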