I currently have a Python script set up that uses feedparser to read and parse a feed. However, I recently ran into a problem with the date parsing. The feed I am reading contains <modified>2010-05-05T24:17:54Z</modified>, which comes out in Python as a datetime object: 2010-05-06 00:17:54. Notice the discrepancy: the feed entry was modified on the 5th of May, while Python reads it as the 6th.
So the question is why this is happening. Is the Atom feed (that is, whoever created the feed) wrong to give the time as 24:17:54, or is my Python script wrong in the way it treats it?
And can I solve this?
There are some interesting special cases in the RFC (https://www.rfc-editor.org/rfc/rfc3339); however, they mostly concern the seconds field, where 23:59:60 is allowed in addition to 23:59:59 to make room for leap seconds. The hour field is defined there as 00-23, so strictly speaking 24:17:54 is not legal. My guess is that the parser is still doing the "right thing" with it. In all honesty, date/time handling gets really messy due to things like DST and local time zones. If the feed says 24:17:54, rolling it over into the next day might be the most sensible reading after all.
I think "today at 24:17" is being intelligently parsed as "tomorrow at 00:17". I'd say your parser is handling the producer's bug gracefully.
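To see why a lenient parser can end up on the 6th, here is a minimal sketch. It is not feedparser's actual code; parse_lenient is a hypothetical helper that assumes the timestamp always has the exact form YYYY-MM-DDTHH:MM:SSZ. It adds the time of day to midnight as a duration, so an out-of-range hour rolls over into the next day, and it reports separately whether the fields were within the ranges RFC 3339 allows.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper for illustration; this is not feedparser's actual code.
_TIMESTAMP = re.compile(r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z")


def parse_lenient(timestamp):
    """Return (datetime, strictly_valid) for a 'YYYY-MM-DDTHH:MM:SSZ' string."""
    match = _TIMESTAMP.fullmatch(timestamp)
    if match is None:
        raise ValueError("unsupported timestamp: %r" % timestamp)
    year, month, day, hour, minute, second = (int(g) for g in match.groups())
    # RFC 3339 restricts the hour to 00-23. Seconds may be 60 for a leap second;
    # this sketch does not check that a leap second really occurred.
    strictly_valid = hour <= 23 and minute <= 59 and second <= 60
    # Adding the time of day as a duration lets out-of-range values roll over.
    midnight = datetime(year, month, day, tzinfo=timezone.utc)
    moment = midnight + timedelta(hours=hour, minutes=minute, seconds=second)
    return moment, strictly_valid
```

With this, parse_lenient("2010-05-05T24:17:54Z") yields 2010-05-06 00:17:54 UTC together with a False flag, so a script could log or reject such entries instead of silently accepting the rollover.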
Related
In my current project I receive data from a set of long-range sensors, which send it as a series of bytes. Because there are multiple types of sensors, the byte structures and the data they contain differ, hence the need to make the decoding more dynamic so that I don't have to hard-code every single setup in the future (which is not practical).
The server will be using Django, which I believe is irrelevant to the issue at hand, but I mention it in case it offers something that can be used.
The bytes data I am receiving looks like this:
b'B\x10Vu\x87%\x00x\r\x0f\x04\x01\x00\x00\x00\x00\x1e\x00\x00\x00\x00ad;l'
And my current process looks like this:
Take the first bytes to get the deviceID (deviceID = val[0:6].hex())
Look up the format to be used in struct.unpack() (here: >BBHBBhBHhHL, after removing the first bytes for the ID).
Now, the issue is the next step. Many of the values need different forms of pre-processing. For example, some values need to be run through a join (e.g. ".".join(str(values[2]))), others need a simple mathematical change (-113 + 2 * values[4]), and others need a simple logic check (values[7] == 0x80) that returns a boolean.
My question is, what's the best way to code those methods? I would really like to avoid hard-coding them, but it almost seems like the best idea. Another idea I saw was to store the functionality as a string and execute it, such as seen here, but I've been reading that it's a very bad idea and that it also slows down execution. The last idea I had was to hard-code only some general functions and use something similar to here, but this doesn't solve the issue of having to hard-code every new sensor type, which is not realistic in a live installation. Are there any better methods to achieve the same thing?
I have also looked at here, with the idea that some functionality could somehow be expressed as an equation, but I don't see that as a possibility for every case, especially when any string manipulation is needed.
Additionally, is there a possibility of using some maths to apply some basic string manipulation? I can hard-code one string manipulation maybe, but to be honest this whole thing has been bugging me...
Finally, if I do go with storing the functions as strings and then executing them, is there a way to add some "security" to avoid malicious exploitation? Such a method is awfully insecure, to say the least.
However, after almost a week of searching I have so far been unable to find a better solution than storing functions as strings and running eval on them, despite not liking that option. If anyone finds a better option before then, I would be extremely grateful for any tips or ideas.
Addendum: Minimal code that can be used to showcase and test different methods:
import struct

def decode(data):
    val = bytearray(data)
    # The first 6 bytes identify the device
    deviceID = val[0:6].hex()
    del val[0:6]
    print(deviceID)
    # The remaining bytes follow the per-device struct format
    values = list(struct.unpack('>BBHBBhBHhHL', val))
    print(values)
    # Now what?

decode(b'B\x10Vu\x87%\x00x\r\x0f\x04\x01\x00\x00\x00\x00\x1e\x00\x00\x00\x00ad;l')
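One possible middle ground between hard-coding every sensor and running eval on stored strings is a small whitelist of parameterised operations, with the per-sensor description kept as plain data (which could then live in a database or a JSON file). The sketch below is only an illustration under that assumption: the names TRANSFORMS, SENSOR_SPECS and decode_with_spec and the field names "version", "signal" and "alarm" are made up; only the format string, the indices and the three example operations come from the question.

```python
import struct

# Whitelisted operations; a spec can only refer to these by name,
# so no stored text is ever executed as code.
TRANSFORMS = {
    "raw": lambda v: v,
    "dotted": lambda v: ".".join(str(v)),
    "linear": lambda v, scale=1, offset=0: offset + scale * v,
    "equals": lambda v, expected=0: v == expected,
}

# Per-device description as plain data (hypothetical field names).
SENSOR_SPECS = {
    "421056758725": {
        "format": ">BBHBBhBHhHL",
        "fields": [
            {"name": "version", "index": 2, "op": "dotted"},
            {"name": "signal", "index": 4, "op": "linear",
             "args": {"scale": 2, "offset": -113}},
            {"name": "alarm", "index": 7, "op": "equals",
             "args": {"expected": 0x80}},
        ],
    },
}


def decode_with_spec(payload, specs=SENSOR_SPECS):
    """Decode a payload using the spec registered for its device ID."""
    device_id = payload[:6].hex()
    spec = specs[device_id]  # KeyError means an unknown sensor type
    values = struct.unpack(spec["format"], payload[6:])
    result = {"device_id": device_id}
    for field in spec["fields"]:
        op = TRANSFORMS[field["op"]]  # KeyError means a non-whitelisted op
        result[field["name"]] = op(values[field["index"]], **field.get("args", {}))
    return result
```

Adding a new sensor type then means adding a spec entry rather than new code; a new function is only needed when a genuinely new kind of operation shows up.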
Working from the command line I wrote a function called go(). When called it receives input asking the user for a directory address in the format drive:\directory. No need for extra slashes or quotes or r literal qualifiers or what have you. Once you've provided a directory, it lists all the non-hidden files and directories under it.
I want to update the function now with a statement that stores this location in a variable, so that I can start browsing my hierarchy without specifying the full address every time.
Unfortunately I don't remember what statements I put in the function in the first place to make it work as it does. I know it's simple and I could just look it up and rebuild it from scratch with not too much effort, but that isn't the point.
As someone who is trying to learn the language, I try to stay at the command line as much as possible, only visiting the browser when I need to learn something NEW. Having to refer to obscure findings attached to vaguely related questions to rediscover how to do things I've already done is very cumbersome.
So my question is, can I see the contents of functions I have written, and how?
Unfortunately, no. Python does not have this level of introspection for functions typed at the interactive prompt. The best you can do is look at the compiled bytecode.
The inspect module details what information is available at runtime: https://docs.python.org/3.5/library/inspect.html
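As a rough sketch of what is available at runtime: dis can always disassemble a function's bytecode, while inspect.getsource only succeeds when Python can find the original text, which is typically the case for functions loaded from a file. The function go below is a stand-in for the one in the question, and show_source is a hypothetical helper.

```python
import dis
import inspect


def go():
    # Stand-in for the function described in the question.
    return "listing"


def show_source(func):
    """Return the source text of func, or None if Python cannot locate it."""
    try:
        return inspect.getsource(func)
    except (OSError, TypeError):
        # Raised when the source text is not available, for example
        # for code that was not loaded from a file.
        return None


# Human-readable disassembly of the compiled function.
bytecode_text = dis.Bytecode(go).dis()
print(bytecode_text)
print(show_source(go))
```

Keeping functions you want to revisit in a .py file and importing them avoids the problem, since the source can then be read back with inspect.getsource.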
I've read a bunch of posts on how flaky parsing time can be. I believe I have come up with a reliable way of converting an ISO8601-formatted timestamp here:
https://gist.github.com/3702066
The most important part being the astimezone(LOCALZONE) call when the date is parsed. This allowed time.mktime() to do the right thing and appears to handle daylight savings properly.
Are there obvious gotchas I've missed?
Your code does seem to work even for times that fall just before or just after daylight savings time transitions, but I am afraid it might still fail on those rare occasions when a location's timezone offset actually changes. I don't have an example to test with though.
So even if it does work (or almost always works), I think it's crazy to convert a UTC time string to a UTC timestamp in a manner that involves or passes through local time in any way. The local time zone should be irrelevant. It's an unwanted dependency. I'm not saying that you're crazy. You're just trying to work with the APIs you are given, and the C library's time APIs are badly designed.
Luckily, Python provides an alternative to mktime() that is what the C library should have provided: calendar.timegm(). With this function, I can rewrite your function like this:
parsed = parse_date(timestamp)
timetuple = parsed.timetuple()
return calendar.timegm(timetuple)
Because local time is not involved, this also removes the dependency on pytz and the nagging doubt that an obscure artifact of somebody's local timezone will cause an unwanted effect.
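For a self-contained illustration of the same idea, here is a minimal sketch that does not depend on the gist's parse_date. It assumes the timestamp is always UTC in the form YYYY-MM-DDTHH:MM:SSZ (the function name iso8601_utc_to_epoch is made up), and it uses utctimetuple() so that the fields handed to calendar.timegm() are UTC fields.

```python
import calendar
from datetime import datetime, timezone


def iso8601_utc_to_epoch(timestamp):
    """Convert 'YYYY-MM-DDTHH:MM:SSZ' to seconds since the Unix epoch."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # timegm() interprets the tuple as UTC, so local time never enters the picture.
    return calendar.timegm(parsed.utctimetuple())


print(iso8601_utc_to_epoch("2000-01-01T00:00:00Z"))  # 946684800
```

The result is the same whatever the machine's local time zone is set to.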
Say you have some metadata for a custom file format that your Python app reads. Something like a CSV with variables that can change as the file is manipulated:
var1,data1
var2,data2
var3,data3
So if the user can manipulate this metadata, do you have to worry about someone crafting a malformed metadata file that allows arbitrary code execution? The only thing I can imagine is if you made the poor choice of letting var1 be a shell command that you execute with os.system(data1) somewhere in your own code. Also, if this were C you would have to worry about buffers being overflowed, but I don't think you have to worry about that with Python. If you're reading in that data as a string, is it possible to somehow escape the string, e.g. "\n os.system('rm -r /')? This SQL-like example totally won't work, but is something similar possible?
If you are doing what you say there (plain text, just reading and parsing a simple format), you will be safe. As you indicate, Python is generally safe from the more mundane memory corruption errors that C developers can create if they are not careful. The SQL injection scenario you note is not a concern when simply reading in files in Python.
However, if you are concerned about security, which it seems you are (interjection: good for you! A good programmer should be lazy and paranoid), here are some things to consider:
Validate all input. Make sure that each piece of data you read is of the expected size, type, range, etc. Error early, and don't propagate tainted variables elsewhere in your code.
Do you know the expected names of the vars, or at least their format? Make sure to validate that each one is the kind of thing you expect before you use it. If it should be just letters, confirm that with a regex or similar.
Do you know the expected range or format of the data? If you're expecting a number, make sure it's a number before you use it. If it's supposed to be a short string, verify the length; you get the idea.
What if you get characters or bytes you don't expect? What if someone throws Unicode at you?
If any of these are paths, make sure you canonicalize and know that the path points to an acceptable location before you read or write.
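As a minimal sketch of the validation points above, assuming variable names should look like identifiers and values are short strings (NAME_PATTERN, MAX_VALUE_LENGTH and parse_metadata are illustrative names and limits, not part of any library):

```python
import csv
import io
import re

# Assumed policy for this sketch: names are short identifiers, values are short strings.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,31}")
MAX_VALUE_LENGTH = 256


def parse_metadata(text):
    """Parse 'name,value' lines, rejecting anything outside the expected shape."""
    metadata = {}
    for row_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError("line %d: expected 2 fields, got %d" % (row_number, len(row)))
        name, value = row
        if not NAME_PATTERN.fullmatch(name):
            raise ValueError("line %d: unexpected variable name %r" % (row_number, name))
        if len(value) > MAX_VALUE_LENGTH:
            raise ValueError("line %d: value too long" % row_number)
        metadata[name] = value  # stored as data only, never executed
    return metadata
```

Failing early with ValueError keeps malformed input from propagating; the values stay inert strings unless your own code later passes them to something that executes them.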
Some specific things not to do:
os.system(attackerControlledString)
eval(attackerControlledString)
__import__(attackerControlledString)
pickle/unpickle attacker controlled content (here's why)
Also, rather than rolling your own config file format, consider ConfigParser (configparser in Python 3) or something like JSON. A well-understood format (and its libraries) gives you a leg up on proper validation.
OWASP would be my normal go-to for providing a "further reading" link, but their Input Validation page needs help. In lieu of that, this looks like a reasonably pragmatic read: "Secure Programmer: Validating Input". A slightly dated but more Python-specific one is "Dealing with User Input in Python".
Depends entirely on the way the file is processed, but generally this should be safe. In Python, you have to put in some effort if you want to treat text as code and execute it.
I want a Python program to be able to determine the current date and time in NYC. Is that practical? datetime.datetime.now() can tell me the local time, and datetime.datetime.utcnow() can tell me UTC (GMT). However, just looking at the difference will not help me, as DST changes.
I try things like "dt=datetime.now() " and "dt.timetuple()"
I get tm_isdst=-1 even if I change the computer date.
I change my computer clock from a January date to a July date. I still get tm_isdst=-1
Why not use pytz? I want the users to not have to go through the step of downloading an extra library.
I suspect some sort of problems in your use of the datetime, time, etc. modules, but without knowing more, not much help can be provided.
The following suggestion has some definite drawbacks, and I really recommend putting more effort into solving the problems with datetime, etc. However, if you're sure to have a web connection and need to get something done fast, you could query USNO time with something like:
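import urllib.request

# Fetch the USNO time page; this relies on the server and page layout described here.
with urllib.request.urlopen("http://tycho.usno.navy.mil/cgi-bin/timer.pl") as f:
    time_page = f.read().decode("ascii", errors="replace").splitlines()

ny_time = None
for line in time_page:
    if "Eastern Time" in line:
        ny_time = line[4:24]  # the date and time columns of that line
        break
print(ny_time)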
The output looks like:
Jan. 19, 05:18:04 PM
This makes use of the fact that NYC is in the Eastern Time zone. Also, it assumes the USNO server is available to your user. Furthermore, it has assumptions about the format of the content returned. I don't know if/how frequently that format changes. Also, if this is going to be used a lot, please find another server, as you don't want to sink the USNO server! (Pun not originally intended, but recognized and kept. :-).
If you are not in the same timezone as NYC, it's in practice impossible without knowing the timezone and when DST changes. You can hardcode it for NYC, of course, but it is way easier to just install pytz or dateutil, and then you aren't limited to NYC.
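As a rough sketch of what hardcoding the rules for New York involves, the code below assumes the US rule in force since 2007 (DST from the second Sunday in March at 02:00 local time to the first Sunday in November at 02:00 local time). It will be wrong for earlier years and will break if the law changes, which is the usual argument for a time zone library. The function names are made up for this example. On Python 3.9 and later, the standard-library zoneinfo module is another option that needs no third-party package, provided time zone data is available on the system.

```python
from datetime import datetime, timedelta, timezone


def _nth_sunday(year, month, n):
    """Midnight UTC on the n-th Sunday of the given month."""
    first = datetime(year, month, 1, tzinfo=timezone.utc)
    days_until_sunday = (6 - first.weekday()) % 7  # Monday is 0, Sunday is 6
    return first + timedelta(days=days_until_sunday + 7 * (n - 1))


def utc_to_new_york(utc_dt):
    """Convert an aware UTC datetime to New York time (post-2007 US rules only)."""
    year = utc_dt.year
    dst_start = _nth_sunday(year, 3, 2) + timedelta(hours=7)  # 02:00 EST is 07:00 UTC
    dst_end = _nth_sunday(year, 11, 1) + timedelta(hours=6)   # 02:00 EDT is 06:00 UTC
    offset_hours = -4 if dst_start <= utc_dt < dst_end else -5
    return utc_dt.astimezone(timezone(timedelta(hours=offset_hours)))


def new_york_now():
    """Current New York time, derived from the system's UTC clock."""
    return utc_to_new_york(datetime.now(timezone.utc))


print(new_york_now())
```

Because the conversion starts from UTC, the result does not depend on the local time zone of the machine running the script, only on its clock being correct.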