Caching system for dynamically created files? - python

I have a web server that is dynamically creating various reports in several formats (pdf and doc files). The files require a fair amount of CPU to generate, and it is fairly common to have situations where two people are creating the same report with the same input.
Inputs:
raw data input as a string (equations, numbers, and lists of words), arbitrary length; almost 99% will be less than about 200 words
the version of the report creation tool
When a user attempts to generate a report, I would like to check to see if a file already exists with the given input, and if so return a link to the file. If the file doesn't already exist, then I would like to generate it as needed.
What solutions are already out there? I've cached simple HTTP requests before, but the keys were extremely simple (usually database IDs).
If I have to do this myself, what is the best way? The input can be several hundred words, and I was wondering how I should go about transforming the strings into keys to send to the cache.
# entire input, uses too much memory, one-to-one mapping
cache['one two three four five six seven eight nine ten eleven...']

# short keys
cache['one two']  # => 5 results, then I must narrow these down even more
Is this something that should be done in a database, or is it better done within the web app code (Python in my case)?
Thank you, everyone.

This is what Apache is for.
Create a directory that will have the reports.
Configure Apache to serve files from that directory.
If the report exists, redirect to a URL that Apache will serve.
Otherwise, the report doesn't exist, so create it. Then redirect to a URL that Apache will serve.
There's no "hashing". You have a key ("a string (equations, numbers, and lists of words), arbitrary length, almost 99% will be less than about 200 words") and a value, which is a file. Don't waste time on a hash. You just have a long key.
You can compress this key somewhat by making a "slug" out of it: remove punctuation, replace spaces with _, that kind of thing.
You should create an internal surrogate key which is a simple integer.
You're simply translating a long key to a "report" which either exists as a file or will be created as a file.
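A minimal sketch of that flow, assuming the report directory is served by Apache under /reports/ and that generate_report stands in for your existing CPU-heavy generator (both names are placeholders); for very long inputs you would truncate the slug or fall back to the surrogate-key lookup mentioned above:
import os
import re

REPORT_DIR = "/var/www/reports"   # directory Apache is configured to serve (assumed path)
TOOL_VERSION = "1.2"              # version of the report creation tool (assumed)

def slug_for(raw_text):
    # Collapse the long key into a filesystem-friendly slug:
    # strip punctuation, replace runs of whitespace with "_".
    cleaned = re.sub(r"[^\w\s]", "", raw_text)
    return re.sub(r"\s+", "_", cleaned.strip())

def report_url(raw_text):
    filename = "%s_v%s.pdf" % (slug_for(raw_text), TOOL_VERSION)
    path = os.path.join(REPORT_DIR, filename)
    if not os.path.exists(path):
        generate_report(raw_text, path)   # placeholder for the expensive generation step
    return "/reports/" + filename         # URL that Apache serves directly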

The usual thing is to use a reverse proxy cache like Squid or Varnish in front of the app.

Related

Extracting only a few columns from a FITS file that is freely available to download online, using Python

I'm working on a model of the universe, using data available on the Sloan Digital Sky Survey site. The problem is that some files are more than 4 GB (more than 50 GB in total), and I know those files contain a lot of data columns, but I only want the data from a few of them. I had heard about web scraping, so I searched for how to do it, but that didn't help: all the tutorials explain how to download the whole file using Python. Is there any way to extract only a few columns from such a file, so that I get just the data I need and don't have to download the whole large file for a small fraction of its contents?
Sorry, my question is just words and no code, because I'm not that experienced in Python. I searched online and learned how to do basic web scraping, but it didn't solve my problem.
It would be even more helpful if you could suggest other ways to reduce the amount of data I have to download.
Here is the URL to download FITS files: https://data.sdss.org/sas/dr12/boss/lss/
I only want to extract the columns that have coordinates (ra, dec), distance, velocity, and redshift from the files.
Also, is there a way to do the same thing with CSV files or a general way to do it with any file?
I'm afraid what you're asking is generally not possible, at least not without significant effort and software support on both the client and server side.
First of all, FITS tables are stored in binary in row-oriented order, which means that if you wanted to stream a portion of a FITS table you could read it one row at a time, but to read individual columns you would need a partial read of each row, for every single row in the table. Some web servers support "range requests", meaning you can request just a few byte ranges from a file instead of the whole file. The server has to have this enabled, and not all servers do. If FITS tables were stored column-oriented this would be feasible: you could download just the header of the file to determine the byte ranges of the columns you want, and then download just those ranges.
Unfortunately, since FITS tables are row-oriented, loading, say, 3 columns from a table with a million rows would involve 3 million range requests, which would likely involve enough overhead that you wouldn't gain anything from it (and I'm honestly not sure what limits web servers place on how many ranges you can request in a single request, but I suspect most won't allow something so extreme).
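For illustration only, this is roughly what a byte-range request looks like with the requests library; it assumes the server has range support enabled, and the example file name is made up:
import requests

# Hypothetical uncompressed file; the actual DR12 files at the link are .fits.gz.
url = "https://data.sdss.org/sas/dr12/boss/lss/example_table.fits"
# Ask for just the first FITS header block (FITS headers come in 2880-byte blocks).
resp = requests.get(url, headers={"Range": "bytes=0-2879"})
if resp.status_code == 206:   # 206 Partial Content: the server honoured the Range header
    header_block = resp.content
else:                         # a plain 200 means the server ignored it and sent everything
    print("Server does not support range requests")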
There are other astronomy data formats (e.g. I think CASA Tables) that can store tables in a column-oriented format, and so are more feasible for this kind of use case.
Further, even if the HTTP limitations could be overcome, you would need software support for loading the file in this manner. This has been discussed to a limited extent here but for the reasons discussed above it would mostly be useful for a limited set of cases, such as loading one HDU at a time (not so helpful in your case if the entire table is in one HDU) or possibly some other specialized cases such as sections of tile-compressed images.
As mentioned elsewhere, Dask supports loading binary arrays from various cloud-based filesystems, but when it comes to streaming data from arbitrary HTTP servers it runs into similar limitations.
Worse still, I looked at the link you provided and all the files there are gzip-compressed, which makes things especially difficult, since you can't know which byte ranges to request without decompressing the files first.
As an aside, since you asked, you will have the same problem with CSV, only worse since CSV fields are not typically in fixed-width format, so there is no way to know how to extract individual columns without downloading the whole file.
For FITS, maybe it would be helpful to develop a web service capable of serving arbitrary extracts from larger FITS files. I don't know whether such a thing already exists, but I don't think it does in any very general sense. So a) it would have to be developed, and b) you would have to ask whoever is hosting the files you want to access to run such a service.
Your best bet is to just download the whole file, extract the data you need from it, and delete the original file assuming you no longer need it. It's possible the information you need is also already accessible through some online database.
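For the download-then-extract route, a short sketch with astropy; the file name and the exact column names (RA, DEC, Z) are assumptions, so check the table's actual headers first:
from astropy.io import fits

# astropy can open gzip-compressed FITS files directly.
with fits.open("galaxy_DR12v5_CMASS_North.fits.gz") as hdul:
    table = hdul[1].data        # the table usually lives in the first extension HDU
    print(hdul[1].columns)      # list the available columns before relying on names
    ra = table["RA"]            # assumed column names
    dec = table["DEC"]
    redshift = table["Z"]
# Save just the columns you need, then delete the original file to reclaim space.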

Iteratively Re-Checking a Huge list

I have a list of about 100,000 URLs saved on my computer (and that 100,000 can very quickly multiply into several million). For every URL, I fetch that webpage and collect all the additional URLs on it, but only if each additional link is not already in my large list. The issue is reloading that huge list into memory over and over so that I always have an accurate list. The amount of memory used will probably very soon become far too much, and, even more importantly, the time it takes between reloads keeps getting longer, which is severely holding up the progress of the project.
My list is saved in several different formats. One format has all the links in a single text file, where I call open(filetext).readlines() to turn it straight into a list. Another format, which seems more helpful, is a folder tree of all the links, which I turn into a list using os.walk(path).
I'm really unsure of any other way to do this recurring membership check more efficiently, without the ridiculous memory use and loading time. I tried using a queue as well, but being able to see the text output of these links was such a benefit that queueing became unnecessarily complicated. Where else can I even start?
The main issue is not loading the list into memory; that should be done only once, at the beginning, before scraping the webpages. The issue is finding out whether an element is already in the list. The in operation will be too slow on a large list.
You should look into several things, among them sets and pandas. The first will probably be the optimal solution.
Now, since you thought of using a folder tree with the URLs as folder names, I can think of one way which could be faster. Instead of creating the list with os.walk(path), check whether the folder is already present; if it is not, it means you have not seen that URL yet. This is basically a fake graph database. To do the check, you can use os.path.isdir(). If you want a true graph DB, you could look into OrientDB, for instance.
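A rough sketch of that folder-as-marker idea (the slug scheme and the seen_urls root directory are arbitrary choices for illustration):
import os

ROOT = "seen_urls"   # marker directory; the name is arbitrary

def already_seen(url):
    # Crude slug for illustration: turn the URL into a filesystem-safe folder name.
    slug = url.replace("://", "_").replace("/", "_")
    path = os.path.join(ROOT, slug)
    if os.path.isdir(path):
        return True
    os.makedirs(path)   # create the marker folder so the URL counts as seen from now on
    return False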
Have you considered mapping a table of IP addresses to URLs? Granted, this would only work if you are looking at unique domains rather than thousands of pages on the same domain. The advantage is that you would be dealing with a compact numeric address (at most 12 digits for a dotted IPv4 address). The downside is the need for additional tabulated data structures and additional processing to map the data.
Memory shouldn't be an issue. If each URL takes 1 KiB to store in memory (which is very generous), 1 million URLs will be 1 GiB. You say you have 8 GiB of RAM.
You can keep the known URLs in memory in a set and check for containment using in or not in. Python's set uses hashing, so a containment check is O(1), which is generally much quicker than the linear search of a list.
You can recursively scrape the pages:
def main():
    with open('urls.txt') as urls:
        known_urls = set(line.strip() for line in urls)  # strip newlines so lookups match
    for url in list(known_urls):
        scrape(url, known_urls)

def scrape(url, known_urls):
    new_urls = _parse_page_urls(url)  # placeholder: fetch the page and extract its links
    for new_url in new_urls:
        if new_url not in known_urls:
            known_urls.add(new_url)
            scrape(new_url, known_urls)

Can I use MD5 or SHA1 hashes for filenames?

Let's consider a site where users can upload files. Can I use MD5 or SHA1 hashes of their contents as filenames, to avoid collisions? If not, what should I use?
You can use almost anything as a filename, minus reserved characters. Those particular choices tell you nothing about the file itself, aside from its hash value. Provided they aren't uploading identical files, that should prevent file naming collisions. If you don't care about that, have at it.
Usually people upload files in order for someone to pull them back down. So you'd need to have a descriptor of some kind; otherwise users would need to open a mass of files to get the one they want. Perhaps a better option would be to let the user select a name (up to a character limit) and then append the datetime code. Then, in order to have a collision, you'd need to have 2 users select the exact same name at the exact same time. Include seconds in the datetime code, and the chances of collision approach (but never equal) zero.
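A tiny sketch of that name-plus-timestamp idea (the 50-character cap on the user's chosen name and the UTC timestamp format are arbitrary choices):
from datetime import datetime

def upload_name(user_choice, extension=".pdf"):
    # A collision requires two users picking the same name in the same second.
    stamp = datetime.utcnow().strftime("%Y%m%dT%H%M%S")
    return "%s_%s%s" % (user_choice[:50], stamp, extension)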
Despite the earlier SHA-1 collision attack, the probability of a SHA-1 hash collision is still so low that it can be considered safe to use as a filename in most cases.
The other common approach is to use a GUID/UUID for every file. The only question left is how you want to handle two identical files uploaded by two users. The easiest way is to treat them as two separate files, so that neither is affected by the other.
Though sometimes you might be concerned about storage space. For example, if the uploaded files are really big, you might want to store two identical files as one to save space. Depending on the user experience of your system, you might then need to handle some situations afterwards, such as one of the two users removing the file. However, these are not difficult to handle and just depend on the rest of your system.
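Sketches of both naming schemes, assuming the uploaded bytes are already in hand (the .pdf extension is just an example):
import hashlib
import uuid

def hashed_name(data, extension=".pdf"):
    # Content-addressed: identical uploads map to the same 40-character name.
    return hashlib.sha1(data).hexdigest() + extension

def random_name(extension=".pdf"):
    # Random: every upload gets its own name, even for identical contents.
    return uuid.uuid4().hex + extension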

Link encryption with django and python

I have a download application and I want to encrypt the links for file downloads, so that the user doesn't know the ID of the file. Furthermore, I'd like to include the date/time in the link, and check when serving the file whether the link is still valid.
There's a similar question asked here, but I'm running into problems with the character encodings, since I'd like to have URLs like /file/encrypted_string/ pointing to the download views, so it would be best if the encrypted result only contained letters and numbers. I'd prefer not to use a hash, because I do not want to store a hash <-> file mapping somewhere. I don't know if there's an encryption scheme out there that fulfills my needs...
Sounds like it would be easy, especially if you don't mind using the same encryption key forever. Just build a delimited string (/ or : works as well as anything) from the file name, the date/time, and anything else you want to include, then encrypt it and base64 it! Remember to use urlsafe_b64encode, not the regular b64encode, which would produce broken URLs. It'll be a long string, but so what?
I've done this a few times, with a slight variation: add a few random characters as the last piece of the key and include them at the beginning or end of the string; it's more secure than always reusing the same key, without the headaches of a database mapping. As long as your key is complex enough, the exposed bits won't be enough to let crackers generate requests at will.
Of course, if the file doesn't exist, don't let them see the decoded result...
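The answers above predate it, but these days the cryptography package's Fernet class does essentially this in one step: its tokens are URL-safe base64 and carry a timestamp, so link expiry falls out of the ttl check. A rough sketch (the file ID and the one-hour TTL are arbitrary):
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()   # keep this secret and reuse it (e.g. from settings)
f = Fernet(key)

token = f.encrypt(b"file-id:42").decode()   # URL-safe string, usable as /file/<token>/

try:
    payload = f.decrypt(token.encode(), ttl=3600)   # rejects tokens older than one hour
except InvalidToken:
    payload = None   # expired or tampered-with link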
By far the easiest way to handle this is to generate a random string for each file, and store a mapping between the key strings and the actual file name or file id. No complex encryption required.
Edit:
You will need to store the date anyway to implement expiring links. So you can store the expiration date along with the key, and periodically cull expired links from the table.
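A minimal sketch of that mapping approach using the standard library's secrets module (Python 3.6+); the in-memory dict stands in for whatever table you actually keep the mapping in:
import secrets
from datetime import datetime, timedelta

links = {}   # stand-in for a database table: token -> (file_id, expires_at)

def create_link(file_id, lifetime_hours=24):
    token = secrets.token_urlsafe(16)   # random, URL-safe string
    links[token] = (file_id, datetime.utcnow() + timedelta(hours=lifetime_hours))
    return "/file/%s/" % token

def resolve_link(token):
    file_id, expires_at = links.get(token, (None, None))
    if file_id is None or datetime.utcnow() > expires_at:
        return None   # unknown or expired link
    return file_id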
If your problem is just one of encryption and decryption of short strings, Python's Crypto module makes it a breeze.
You can encode any character into the URL; with Django, you may use its urlencode filter.
However, generating a random string and saving the mapping is more secure.

I need a class that creates a dictionary file that lives on disk

I want to create a very, very large dictionary, and I'd like to store it on disk so as not to kill my memory. Basically, my needs are a cross between cPickle and the dict class: a class that Python treats like a dictionary, but that happens to live on disk.
My first thought was to create some sort of wrapper around a simple MySQL table, but I have to store types in the entries of the structure that MySQL can't even hope to support out of the box.
The simplest way is the shelve module, which works almost exactly like a dictionary:
import shelve

myshelf = shelve.open("filename")  # Might turn into filename.db
myshelf["A"] = "First letter of alphabet"
print(myshelf["A"])
# ...
myshelf.close()  # You should do this explicitly when you're finished
Note the caveats in the module documentation about changing mutable values (lists, dicts, etc.) stored on a shelf (you can, but it takes a bit more fiddling). It uses (c)pickle and dbm under the hood, so it will cheerfully store anything you can pickle.
I don't know how well it performs relative to other solutions, but it doesn't require any custom code or third party libraries.
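The mutable-value caveat in practice, sketched briefly with the myshelf object from the example above (while it is still open); writeback=True is the documented alternative to reassigning:
myshelf["langs"] = ["python"]
temp = myshelf["langs"]       # this is a copy of the stored value
temp.append("c")              # mutating the copy does not touch the shelf
myshelf["langs"] = temp       # write it back explicitly
# Alternatively, open the shelf with shelve.open("filename", writeback=True).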
Look at dbm in particular, and more generally at the entire Data Persistence chapter in the manual. Most key/value-store databases (gdbm, bdb, metakit, etc.) have a dict-like API which would probably serve your needs (and they are fully embeddable, so there's no need to manage an external database process).
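For reference, the dbm interface (shown here as it looks in Python 3; in Python 2 the equivalent module was anydbm) is only a little lower-level than shelve, storing bytes rather than arbitrary pickled objects:
import dbm

db = dbm.open("cache", "c")   # "c" creates the file if it doesn't exist
db[b"hello"] = b"world"       # keys and values are stored as bytes
print(db[b"hello"])
db.close()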
File IO is expensive in terms of CPU cycles. So my first thoughts would be in favor of a database.
However, you could also split your "English dictionary" across multiple files so that (say) each file holds words that start with a specific letter of the alphabet (therefore, you'll have 26 files).
Now, when you say "I want to create a very very large dictionary", do you mean a Python dict or an English dictionary with words and their definitions, stored in a dict (with words as keys and definitions as values)? The second can easily be implemented with cPickle, as you pointed out.
Again, if memory is your main concern, then you'll need to reconsider the number of files you want to use, because if you're pickling a dict into each file, then you want the dicts not to get too big.
Perhaps a usable solution for you would be to do this (I am going to assume that all the English words are sorted):
Get all the words in the English language into one file.
Count how many such words there are and split them into as many files as you see fit, depending on how large the files get.
Now, these smaller files contain the words and their meanings.
This is how this solution is useful:
Say that your problem is to look up the definition of a particular word. At runtime, you can read the first word in each file and determine whether the word you are looking for is in the previous file you read (you will need a loop counter to check whether you are at the last file). Once you have determined which file the word you are looking for is in, you can open that file and load its contents into a dict.
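A sketch of that lookup, assuming each smaller file was pickled as a plain dict and you keep a small in-memory index of each file's first word (the chunked layout here is made up for illustration):
import bisect
import pickle

def lookup(word, first_words, chunk_paths):
    # first_words: sorted list of the first word stored in each chunk file
    # chunk_paths: matching list of paths to the pickled chunk dicts
    i = bisect.bisect_right(first_words, word) - 1   # last chunk whose first word <= word
    if i < 0:
        return None
    with open(chunk_paths[i], "rb") as f:
        chunk = pickle.load(f)   # a dict of word -> definition
    return chunk.get(word)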
It's a little difficult to offer a solution without knowing more details about the problem at hand.
