header If-Modified-Since does not give 304 code - python

I am using the code below to save an HTML file with a timestamp in its name:
import contextlib
import datetime
import urllib2
import lxml.html
import os
import os.path
timestamp=''
filename=''
for dirs, subdirs, files in os.walk("/home/test/Desktop/"):
    for f in files:
        if "_timestampedfile.html" in f.lower():
            timestamp=f.split('_')[0]
            filename=f
            break
if timestamp is '':
    timestamp=datetime.datetime.now()
with contextlib.closing(urllib2.urlopen(urllib2.Request(
        "http://www.google.com",
        headers={"If-Modified-Since": timestamp}))) as u:
    if u.getcode() != 304:
        myfile="/home/test/Desktop/"+str(datetime.datetime.now())+"_timestampedfile.html"
        file(myfile, "w").write(urllib2.urlopen("http://www.google.com").read())
        if os.path.isfile("/home/test/Desktop/"+filename):
            os.remove("/home/test/Desktop/"+filename)
        html = lxml.html.parse(myfile)
    else:
        html = lxml.html.parse("/home/test/Desktop/"+timestamp+"_timestampedfile.html")
    links=html.xpath("//a/#href")
    print u.getcode()
Every time I run this code, I get status code 200 even though I send the If-Modified-Since header. What am I doing wrong? My goal is to save and reuse an HTML file, and to overwrite it only if the page has been modified since it was last fetched.

The problem is that If-Modified-Since is supposed to be a formatted date string:
If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
but you're passing in a datetime object.
Try something like this:
timestamp = time.time()
...
time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(timestamp))
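As a minimal sketch in Python 3 (the question itself uses Python 2's urllib2), the standard library can also produce this date format directly with email.utils.formatdate. The helper names http_date, conditional_request and fetch_if_modified are illustrative only, not part of any library. One thing to watch for: urllib reports a 304 by raising HTTPError rather than returning a response, so the sketch catches it instead of checking getcode().

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def http_date(epoch_seconds):
    """Format seconds since the epoch as an HTTP date string in GMT."""
    return formatdate(epoch_seconds, usegmt=True)


def conditional_request(url, epoch_seconds):
    """Build (but do not send) a GET request carrying If-Modified-Since."""
    return urllib.request.Request(
        url, headers={"If-Modified-Since": http_date(epoch_seconds)})


def fetch_if_modified(url, epoch_seconds):
    """Return the body, or None if the server answers 304 Not Modified."""
    try:
        with urllib.request.urlopen(conditional_request(url, epoch_seconds)) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib raises for non-2xx statuses, including 304
            return None
        raise


print(http_date(783459811))  # Sat, 29 Oct 1994 19:43:31 GMT
```

Only http_date is exercised here; fetch_if_modified needs network access and a server that honors the header.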
The second reason your code isn't working as you expect:
http://www.google.com/ does not seem to honor If-modified-since. That's allowed per the RFC, and they may have various reasons for choosing that behavior.
c) If the variant has not been modified since a valid If-Modified-Since date, the server SHOULD return a 304 (Not Modified) response.
If you try http://www.stackoverflow.com/, for example, you'll see a 304. (I just tried it.)

Related

Regular file download with Python scheduler and wget

I wrote a simple script that uses the schedule module to download a file from a web page once a week. Before downloading, it checks whether the file was updated, using BeautifulSoup. If so, it downloads the file using wget. Another script then uses the file to perform calculations.
The problem is that the file doesn't appear in the directory until I manually interrupt the script. So each time I must interrupt the script and rerun it so that it is scheduled for the next week.
Is there a way to download and save the file "on the fly" without interrupting the script?
The code is:
import wget
import ssl
import schedule
import time
from urllib.request import urlopen  # needed by check_for_updates()
from bs4 import BeautifulSoup
import datefinder
from datetime import datetime
# disable certificate checks
ssl._create_default_https_context = ssl._create_unverified_context
# check whether the file was updated; if yes, download it, if not, wait and retry
def download_file():
    if check_for_updates():
        print("downloading")
        url = 'https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv'
        wget.download(url)
        print("downloading complete")
    else:
        print("sleeping")
        time.sleep(60)
        download_file()
# check whether the website was updated today
def check_for_updates():
    url2 = 'https://fgisonline.ams.usda.gov/ExportGrainReport/default.aspx'
    html = urlopen(url2).read()
    soup = BeautifulSoup(html, "lxml")
    text_to_search = soup.body.ul.li.string
    matches = list(datefinder.find_dates(text_to_search[30:]))
    found_date = matches[0].date()
    today = datetime.today().date()
    return found_date == today
schedule.every().tuesday.at('09:44').do(download_file)
while True:
    schedule.run_pending()
    time.sleep(1)
You need to specify the output directory. I think that if you don't, PyCharm saves the file in a temp directory somewhere and copies it over when you stop the script.
Change to:
wget.download(url, out=output_directory)
Based on the following clue you should be able to solve your issue:
from bs4 import BeautifulSoup
import requests
import urllib3
urllib3.disable_warnings()
def main(url):
    r = requests.head(url, verify=False)
    print(r.headers['Last-Modified'])
main("https://fgisonline.ams.usda.gov/ExportGrainReport/CY2020.csv")
Output:
Mon, 28 Sep 2020 15:02:22 GMT
Now you can run your script daily via a cron job at whatever time you prefer, checking the file's Last-Modified header until its date equals today's date, and then download the file.
Note that I used a HEAD request, which fetches only the headers and is therefore much faster for this check; you can then use requests.get for the download.
I also prefer to do both requests within the same session.
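The date comparison described above could be sketched as follows, assuming the server sends Last-Modified in the standard HTTP date format. The helper name modified_on is hypothetical; parsedate_to_datetime comes from the standard library's email.utils.

```python
from datetime import date
from email.utils import parsedate_to_datetime


def modified_on(last_modified, day):
    """Return True if a Last-Modified header value falls on the given date (UTC)."""
    return parsedate_to_datetime(last_modified).date() == day


header = "Mon, 28 Sep 2020 15:02:22 GMT"  # the header value from the output shown earlier
print(modified_on(header, date(2020, 9, 28)))  # True
print(modified_on(header, date(2020, 9, 29)))  # False
```

In a real run, the second argument would be today's date in UTC, and the header would come from r.headers['Last-Modified'].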

Incorrect conversion from epoch time when scraping web

import praw,time
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
username=""
password=""
r = praw.Reddit(user_agent='')
r.login(username,password,disable_warning=True)
posts=r.search('china disaster', subreddit=None, sort=None, syntax=None, period=None,limit=7)
title=[];created=[]
for index,post in enumerate(posts):
    date=time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(post.created))
    title.append(post.title);created.append(post.created)
    print date,title[index]
    break #added so it prints one post as an example
Error:
I get incorrect times.
<time title="Fri Jan 23 01:22:20 2015 UTC" datetime="2015-01-22T17:22:20-08:00" class="">5 months ago</time>
I don't understand the issue. I think I am making a mistake in the time-zone conversion, but the Reddit posts mention UTC, so I don't see where the error comes from.
I don't follow exactly what is "incorrect" about the times you get.
There are two attributes about the created time "created" and "created_utc". Maybe you want to try the second one instead.
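As a rough sketch in Python 3, and assuming created_utc holds seconds since the epoch in UTC (as its name suggests), the conversion would look like the code below. The value 1421976140 is not taken from the API; it is computed from the "Fri Jan 23 01:22:20 2015 UTC" timestamp in the question's HTML, and format_utc is a hypothetical helper.

```python
from datetime import datetime, timezone


def format_utc(epoch_seconds):
    """Render epoch seconds as a UTC timestamp string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Epoch value corresponding to the UTC time shown in the question's <time> tag.
print(format_utc(1421976140))  # 2015-01-23 01:22:20
```

If the value printed from post.created differs from this by a whole number of hours, that would point to a time-zone offset in that attribute rather than a bug in the formatting call.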

Does django mess up with python's datetime?

I have a file whose creation time I check using ctime. Here is the snippet of code (not complete, just the important part):
import time
import pytz
import os
from datetime import datetime
myfile = SOMEWHERE
myfile_ctime = os.path.getctime(myfile)
d = datetime.strptime(time.ctime(myfile_ctime), '%a %b %d %H:%M:%S %Y')
# d here is Tue Mar 25 00:33:40 2014 for example
ny = pytz.timezone("America/New_York")
d_ny = ny.localize(d)
mytz = pytz.timezone(MY_TZ_WHATEVER)
myd = d_ny.astimezone(mytz)
final_date = myd.strftime('%Y-%m-%d %H:%M:%S')
print(final_date + "some string")
# is now 2014-03-25 01:33:40some string, correctly with the timezone.
When this is run as a simple python script, everything is ok. But when I run the same code inside a function in a templatetags/myfile.py that renders to a template in a Django App, when trying to get the date from time.ctime(myfile_ctime), then I get Tue Mar 25 04:33:40 instead of Tue Mar 25 00:33:40 from the snippet above (the code is the same in the standalone script and in Django - and I concatenate the date with another string).
My question is: I'm using just Python standard libraries, with the same snippet of code in both places, reading the same file in the same environment. Why the difference? Do settings in settings.py interfere with the standard libraries? Does just being in a Django environment change how the standard libraries work? Why does everything work as it should when called standalone?
(I'm behind apache, don't know if this is relevant)
Make sure of the Time Zone settings in settings.py, for more info about Django Time Zone Settings, check this page: https://docs.djangoproject.com/en/1.6/ref/settings/#time-zone
In ./django/conf/__init__.py:126:, TZ environment variable is set based on settings.py.
os.environ['TZ'] = self.TIME_ZONE
My TIME_ZONE is UTC.
That's why a standalone script result is different from a snippet inside Django: when running standalone, this environment variable TZ isn't set.
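The effect of the TZ variable on time.ctime can be reproduced outside Django with a small sketch like the one below. It is Unix-only, since time.tzset() is not available on Windows, and ctime_with_tz is a hypothetical helper written for illustration. "EST5" is a POSIX-style zone string meaning five hours behind UTC.

```python
import os
import time


def ctime_with_tz(timestamp, tz_name):
    """Return time.ctime(timestamp) as rendered with TZ set to tz_name."""
    previous = os.environ.get("TZ")
    os.environ["TZ"] = tz_name
    time.tzset()  # make the C library re-read the TZ variable
    try:
        return time.ctime(timestamp)
    finally:
        # Restore the original environment so other code is unaffected.
        if previous is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = previous
        time.tzset()


print(ctime_with_tz(0, "UTC"))   # Thu Jan  1 00:00:00 1970
print(ctime_with_tz(0, "EST5"))  # Wed Dec 31 19:00:00 1969
```

The same timestamp renders differently depending only on TZ, which matches the shift seen between the standalone script and the Django process.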
Now, when creating a datetime object from a myfile_ctime, I just need to add tzinfo from my server (/etc/sysconfig/clock). My code now looks like this:
import time
import pytz
import os
from datetime import datetime
myfile = SOMEWHERE
myfile_ctime = os.path.getctime(myfile)
ny = pytz.timezone("America/New_York")
d = datetime.fromtimestamp(myfile_ctime, tz=ny)
mytz = pytz.timezone(MY_TZ_WHATEVER)
myd = d.astimezone(mytz)
final_date = myd.strftime('%Y-%m-%d %H:%M:%S')
I hope this is useful to someone. As always, read the source. :)

Trying to Parse JSON date to POST to another System (Python)

I am trying to write a script to GET project data from Insightly and post to 10000ft. Essentially, I want to take any newly created project in one system and create that same instance in another system. Both have the concept of a 'Project'
I am extremely new at this, but I only need to GET certain project parameters from Insightly to pass into the other system (PROJECT_NAME, LINKS:ORGANIZATION_ID, DATE_CREATED_UTC, to name a few).
I plan to add logic to only POST projects with a DATE_CREATED_UTC > yesterday, but I am clueless on how to setup the script to grab the JSON strings and create python variables (JSON datestring to datetime). Here is my current code. I am simply just printing out some of the variables I require to get comfortable with the code.
import urllib, urllib2, json, requests, pprint, dateutil
from dateutil import parser
import base64
#Set the 'Project' URL
insightly_url = 'https://api.insight.ly/v2.1/projects'
insightly_key =
api_auth = base64.b64encode(insightly_key)
headers = {
    'GET': insightly_url,
    'Authorization': 'Basic ' + api_auth
}
req = urllib2.Request(insightly_url, None, headers)
response = urllib2.urlopen(req).read()
data = json.loads(response)
for project in data:
    project_date = project['DATE_CREATED_UTC']
    project_name = project['PROJECT_NAME']
    print project_name + " " + project_date
Any help would be appreciated
Edits:
I have updated the previous code with the following:
for project in data:
    project_date = datetime.datetime.strptime(project['DATE_CREATED_UTC'], '%Y-%m-%d %H:%M:%S').date()
    if project_date > (datetime.date.today() - datetime.timedelta(days=1)):
        print project_date
    else:
        print 'No New Project'
This returns every project that was created after yesterday, but now I need to isolate these projects and post them to the other system
Here is an example of returning a datetime object from a parsed string. We will use the datetime.strptime method to accomplish this. Here is a list of the format codes you can use to create a format string.
>>> from datetime import datetime
>>> date_string = '2014-03-04 22:30:55'
>>> format = '%Y-%m-%d %H:%M:%S'
>>> datetime.strptime(date_string, format)
datetime.datetime(2014, 3, 4, 22, 30, 55)
As you can see, the datetime.strptime method returns a datetime object.
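Building on that, here is a minimal Python 3 sketch of isolating the recently created projects into a list that could then be posted elsewhere. The sample records and the helper name created_after are made up for illustration; only the field names and the date format come from the question.

```python
from datetime import date, datetime, timedelta

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def created_after(projects, cutoff):
    """Return the projects whose DATE_CREATED_UTC date is later than cutoff."""
    return [
        p for p in projects
        if datetime.strptime(p["DATE_CREATED_UTC"], DATE_FORMAT).date() > cutoff
    ]


# Hypothetical sample records; real data would come from json.loads(response).
sample = [
    {"PROJECT_NAME": "Old project", "DATE_CREATED_UTC": "2014-03-01 09:00:00"},
    {"PROJECT_NAME": "New project", "DATE_CREATED_UTC": "2014-03-04 22:30:55"},
]
cutoff = date(2014, 3, 4) - timedelta(days=1)
for project in created_after(sample, cutoff):
    print(project["PROJECT_NAME"])  # New project
```

In the real script the cutoff would be date.today() - timedelta(days=1), and the returned list would be what gets sent to the other system.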

How do you add datetime to a logfile name?

When I create my logfile, I want the name to contain the datetime.
In Python you can get the current datetime as:
>>> from datetime import datetime
>>> datetime.now()
datetime.datetime(2012, 2, 3, 21, 35, 9, 559000)
The str version is
>>> str(datetime.now())
'2012-02-03 21:35:22.247000'
Not a very nice str to append to the logfile name! I would like my logfile to be something like:
mylogfile_21_35_03_02_2012.log
Is there something Python can do to make this easy? I am creating the log file as:
fh = logging.FileHandler("mylogfile" + datetimecomp + ".log")
You need datetime.strftime(), which lets you format the timestamp using all of the directives of C's strftime(). In your specific case:
>>> datetime.now().strftime('mylogfile_%H_%M_%d_%m_%Y.log')
'mylogfile_08_48_04_02_2012.log'
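Put together with the FileHandler from the question, this might look like the sketch below. The helper name logfile_name is hypothetical, and delay=True is an optional choice that postpones creating the file until the first record is written.

```python
import logging
from datetime import datetime


def logfile_name(prefix, when):
    """Build a name like mylogfile_HH_MM_DD_MM_YYYY.log for the given datetime."""
    return prefix + when.strftime("_%H_%M_%d_%m_%Y") + ".log"


# A fixed datetime makes the result predictable for the example.
print(logfile_name("mylogfile", datetime(2012, 2, 3, 21, 35)))
# mylogfile_21_35_03_02_2012.log

# In real use, pass the current time; the file is only created on first write.
fh = logging.FileHandler(logfile_name("mylogfile", datetime.now()), delay=True)
```

Keeping the prefix outside the strftime call avoids surprises if the prefix itself ever contains a percent sign.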
You could also use a TimedRotatingFileHandler that will handle the date and the rollover every day (or whenever you want) for you.
from logging.handlers import TimedRotatingFileHandler
fh = TimedRotatingFileHandler('mylogfile', when='midnight')
By default, the format depends on the rollover interval:
The system will save old log files by appending extensions to the filename. The extensions are date-and-time based, using the strftime format %Y-%m-%d_%H-%M-%S or a leading portion thereof, depending on the rollover interval.
But you can modify that as shown here, by doing something like:
from logging.handlers import TimedRotatingFileHandler
fh = TimedRotatingFileHandler('mylogfile', when='midnight')
fh.suffix = '%Y_%m_%d.log'
Yes. Have a look at the datetime API, in particular strftime.
from datetime import datetime
print datetime.now().strftime("%d_%m_%Y")
Another Solution using format():
#generates a date for a generic filename
import datetime
date_raw = datetime.datetime.now()
date_processed = "{}-{}-{}_{}-{}-{}".format(date_raw.year, date_raw.month,
                                            date_raw.day, date_raw.hour, date_raw.minute, date_raw.second)
#example value: date_processed = 2020-1-7_17-17-48
I used this in my own project
edit: as I found out about f(ormatted)-strings, this would be another solution:
date_processed = f"{date_raw.year}-{date_raw.month}-{date_raw.day}_{date_raw.hour}-{date_raw.minute}-{date_raw.second}"
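One caveat with interpolating the individual attributes is that the fields are not zero-padded (note the example value 2020-1-7 above), so such filenames do not sort chronologically. A possible variant, sketched here with a hypothetical helper named stamp, puts strftime directives in the f-string's format spec so that each field is padded.

```python
import datetime


def stamp(moment):
    """Return a zero-padded, sortable timestamp for use in filenames."""
    return f"{moment:%Y-%m-%d_%H-%M-%S}"


print(stamp(datetime.datetime(2020, 1, 7, 17, 17, 48)))  # 2020-01-07_17-17-48
```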
We can use datetime.now() to get the current timestamp. Here is the code I use to create a log file with a timestamp in its name:
import logging
from datetime import datetime
LOG_FILENAME = datetime.now().strftime('D:/log/logfile_%H_%M_%S_%d_%m_%Y.log')
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
logging.basicConfig(filename=LOG_FILENAME,level=logging.DEBUG)
logging.info('Forecasting Job Started...')
logging.debug('abc method started...')
import logging
from time import strftime
fh = logging.FileHandler(strftime("mylogfile_%H_%M_%d_%m_%Y.log"))
To print hour, minutes, day, month and year, use the following statement
from datetime import datetime
print datetime.now().strftime("%H_%M_%d_%m_%Y")
