I have a list of about 100,000 URLs saved on my computer (and that 100,000 can very quickly multiply into several million). For every URL, I fetch the page and collect all additional URLs on it, but only if each additional link is not already in my large list. The issue is that I reload that huge list into memory on every iteration so I consistently have an accurate list. The amount of memory used will probably very soon become far too much, and even more importantly, the time between reloads keeps getting longer, which is severely holding up the progress of the project.
My list is saved in several different formats. One format has all links contained in a single text file, where I call open(filetext).readlines() to turn it straight into a list. Another format, which seems more helpful, is a folder tree holding all the links, which I turn into a list using os.walk(path).
I'm really unsure of any other way to do this recurring conditional check more efficiently, without the ridiculous memory use and loading time. I tried using a queue as well, but it was such a benefit to be able to see the text output of these links that queueing became unnecessarily complicated. Where else can I even start?
The main issue is not loading the list into memory. That should be done only once at the beginning, before scraping the webpages. The issue is finding whether an element is already in the list: the in operation takes too long on a large list.
You should look into several things, among which sets and pandas. The first one will probably be the optimal solution.
Now, since you thought of using a folder tree with the URLs as folder names, I can think of one way which could be faster. Instead of creating the list with os.walk(path), check whether the folder is already present. If it is not, it means you have not seen that URL yet. This is basically a fake graph database. To do so, you could use the function os.path.isdir(). If you want a true graph DB, you could look into OrientDB, for instance.
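A minimal sketch of that folder-based check, assuming the URLs are first sanitized into valid folder names (base_dir, url_to_dirname, seen_before and mark_seen are made-up names, not from the question):
import os

base_dir = "/some_directory/url_tree"   # hypothetical root of the folder tree

def url_to_dirname(url):
    # naive sanitization so a URL can double as a folder name
    return url.replace("://", "_").replace("/", "_")

def seen_before(url):
    # a single isdir() check per URL, no list loaded into memory
    return os.path.isdir(os.path.join(base_dir, url_to_dirname(url)))

def mark_seen(url):
    os.makedirs(os.path.join(base_dir, url_to_dirname(url)), exist_ok=True)
Each membership test then touches a single directory entry instead of the whole list, although an in-memory set, as mentioned above, will generally still be faster.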
Have you considered mapping a table of IP addresses to URLs? Granted, this would only work if you are seeking unique domains rather than thousands of pages on the same domain. The advantage is that you would be dealing with an address of at most 12 digits (a 32-bit IPv4 address) instead of a long string. The downside is the need for additional tabulated data structures and additional processing to map the data.
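A rough sketch of that idea, assuming full URLs and unique domains, using the standard library's socket.gethostbyname (the helper name and the set are made up here):
import socket
from urllib.parse import urlparse

seen_ips = set()

def domain_already_seen(url):
    # resolve the domain to its IPv4 address and use that compact value as the key
    host = urlparse(url).netloc
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return False          # unresolvable host: treat as unseen
    if ip in seen_ips:
        return True
    seen_ips.add(ip)
    return False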
Memory shouldn't be an issue. If each url takes 1KiB to store in memory (very generous), 1 million urls will be 1GiB. You say you have 8GiB of RAM.
You can keep known urls in memory in a set and check for containment using in or not in. Python's set uses hashing, so searching for containment is O(1) which is a lot quicker in general than the linear search of a list.
You can recursively scrape the pages:
def main():
    with open('urls.txt') as urls:
        # strip trailing newlines so membership tests match cleanly
        known_urls = set(url.strip() for url in urls)
    for url in list(known_urls):
        scrape(url, known_urls)

def scrape(url, known_urls):
    # _parse_page_urls() stands for your own page-fetching / link-extraction code
    new_urls = _parse_page_urls(url)
    for new_url in new_urls:
        if new_url not in known_urls:
            known_urls.add(new_url)
            scrape(new_url, known_urls)
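One caveat worth adding: with millions of links the recursion can hit Python's recursion limit, so an iterative variant of the same idea with an explicit stack (collections.deque) may be safer; a minimal sketch, still relying on the same placeholder _parse_page_urls:
from collections import deque

def scrape_all(known_urls):
    to_visit = deque(known_urls)
    while to_visit:
        url = to_visit.pop()
        for new_url in _parse_page_urls(url):
            if new_url not in known_urls:
                known_urls.add(new_url)
                to_visit.append(new_url)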
Related
I am looking for a kind of database which can search inside separate files (e.g. pdf, xls, doc) that I get from different suppliers. My idea is something like this:
For example, I need to search for a part number and check various data about it. The file containing the part number should then be opened with the part number highlighted. If there are multiple hits, the database should display a list of the various files containing the searched part number. The list entries should act as links that open the corresponding file with the part number selected.
Does this already exist, or how do I approach it?
Today, it's all assembled into a single PDF file of more than 1000 pages, and it's a time-consuming and laborious process to maintain.
I've only used VBA in connection with Excel, so maybe it's too complicated for me. But could a programmer do it without spending 1000 hours on it?
Please help me :-)
Either Access or Excel could do this. I noticed the Python tag; I'm sure Python could handle this as well, although it seems like a database solution would be best. It sounds like a one-to-many scenario. See the link below for some ideas of how this technique works.
https://www.tutorialspoint.com/ms_access/ms_access_one_to_many_relationship.htm
Also, below is a link with a whole bunch of MS Access templates. Take a look at that and hopefully that will give you some ideas of how to get started.
https://www.microsoftaccessexpert.com/Microsoft-Access-Templates.aspx
I agree, keeping this in a PDF with 1000 pages is NOT the way to go!!
I am currently working on a data scraping project which requires me to load and save my data in every loop iteration. You might wonder why I would do that. Well, previously I scraped without loading and saving my data between iterations, and I lost all my data whenever the script crashed before the last iteration (which happened every time, due to a timeout, a weird URL or anything else you can imagine).
All in all, the current method works, but after some 20k iterations my storage files grow to ~90 MB, the script becomes slower and slower, and I am forced to start a new data file. The code below shows the basic functioning of my script.
import numpy as np

# List with URLs to scrape (220k URLs)
URLS = np.load("/some_directory/URLS.npy").tolist()

# Already checked URLs
Checked_URLS = np.load("/some_directory/checked_URLS.npy").tolist()

# Perform scraping and add each new URL to the checked list
for i in URLS:
    if i not in Checked_URLS:
        Checked_URLS.append(i)
        np.save("/some_directory/checked_URLS.npy", Checked_URLS)

        NEW_DATA = Bunch_of_scraping_code  # placeholder for the actual scraping

        # Load the numpy list with data collected from previous URLs,
        # append the new scraping data and save the list again
        FOUND_DATA = np.load("/some_directory/FOUND_DATA.npy", allow_pickle=True).tolist()
        FOUND_DATA.append(NEW_DATA)
        np.save("/some_directory/FOUND_DATA.npy", FOUND_DATA)
I am sure there must be a more pythonic way that does not require loading the entire lists into Python on every iteration. Or perhaps another way that does not require writing at all? I write my lists to .npy because, as far as I am aware, this is the most efficient way to read and write large list files.
I have tried saving my data directly with pandas, but this made everything much worse and slower. All help will be appreciated!
I think a much more efficient approach would be to wrap the code that might crash in try/except and log the problematic URLs for further investigation, instead of repeating the same load/save operations over and over again:
FOUND_DATA = np.load("/some_directory/FOUND_DATA.npy", allow_pickle=True).tolist()  # load once, not every iteration
for url in URLS:
    try:
        NEW_DATA = Bunch_of_scraping_code  # the scraping that might crash
        FOUND_DATA.append(NEW_DATA)
    except Exception as ex:
        # if the code crashes for any reason, the following message is printed and the loop continues
        print(f"Failed to process url: {url}: {ex}")
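A slightly fuller sketch along the same lines, saving only once after the loop and keeping the failed URLs for later inspection (Bunch_of_scraping_code still stands for the asker's own scraping logic):
failed_urls = []
for url in URLS:
    try:
        FOUND_DATA.append(Bunch_of_scraping_code)
    except Exception as ex:
        failed_urls.append(url)
        print(f"Failed to process url: {url}: {ex}")

# persist everything once, after the whole loop has finished
np.save("/some_directory/FOUND_DATA.npy", FOUND_DATA)
with open("/some_directory/failed_urls.txt", "w") as f:
    f.write("\n".join(failed_urls))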
I hope the question is not too unspecific: I have a huge database-like list (~900,000 entries) which I want to use for processing text files. (More details below.) Since this list will be edited and used with other programs as well, I would prefer to keep it in a separate file and include it in the Python code, either directly or by dumping it to some format that Python can use. I was wondering if you can advise on what would be the quickest and most efficient way. I have looked at several options, but may not have seen what is best:
Include the list as a python dictionary in the form
my_list = { "key": "value" }
directly into my python code.
Dump the list to an sqlite database and use the sqlite3 module.
Have the list as a yml file and use the yaml module.
Any ideas how these approaches would scale if I process a text file and want to do replacements on something like 30,000 lines?
For those interested: this is for linguistic processing, in particular ancient Greek. The list is an exhaustive list of Greek forms and the head words that they are derived from. For every word form in a text file, I want to add the dictionary head word.
Option 1 is much faster than using either YAML or SQL, as @b4hand and @DeepSpace indicated. What you should do, though, is not include the list in the rest of the Python code you are developing, as you indicated, but make a separate .py file with just that dictionary definition.
That way the list in that file is easier to write from a program (or extend by a program). And, on first import, a .pyc will be created, which speeds up re-reading on further runs of your program. This is actually very similar in performance
to using the pickle module to pickle the dictionary to a file and read it back from there, while keeping the dictionary in an easily human-readable and editable form.
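A minimal sketch of that layout (the module name greek_forms.py and the sample entries are made up for illustration):
# greek_forms.py: maintained separately; contains nothing but the dictionary
FORMS = {
    "logou": "logos",
    "anthropoi": "anthropos",
    # ... roughly 900,000 entries in the real file
}

# main script: importing the module compiles greek_forms.pyc once,
# so later runs load the dictionary quickly
from greek_forms import FORMS

def annotate(line):
    # append the head word to every known form in the line
    return " ".join(
        word + "[" + FORMS[word] + "]" if word in FORMS else word
        for word in line.split()
    )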
Less than one million entries is not huge and should fit in memory easily. Thus, your best bet is option 1.
If you are looking for speed, option 1 should be the fastest, because the other two will need to repeatedly access the disk, which will be the bottleneck.
I would use a caching mechanism to hold this data, or maybe a data structure store like Redis. Loading all of this into memory might become too expensive.
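For what it's worth, a minimal sketch of the Redis variant, assuming a local Redis server and the redis Python package (the function names are arbitrary):
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_forms(forms_dict):
    # one round trip to store every form -> head word pair
    r.mset(forms_dict)

def head_word(form):
    # returns None if the form is unknown
    return r.get(form)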
I have a web server that is dynamically creating various reports in several formats (pdf and doc files). The files require a fair amount of CPU to generate, and it is fairly common to have situations where two people are creating the same report with the same input.
Inputs:
raw data input as a string (equations, numbers, and lists of words), arbitrary length, almost 99% will be less than about 200 words
the version of the report creation tool
When a user attempts to generate a report, I would like to check to see if a file already exists with the given input, and if so return a link to the file. If the file doesn't already exist, then I would like to generate it as needed.
What solutions are already out there? I've cached simple HTTP requests before, but the keys were extremely simple (usually database IDs).
If I have to do this myself, what is the best way? The input can be several hundred words, and I was wondering how I should go about transforming the strings into keys sent to the cache.
# entire input: uses too much memory, one-to-one mapping
cache['one two three four five six seven eight nine ten eleven...']
# short keys
cache['one two']  => 5 results, then I must narrow these down even more
Is this something that should be done in a database, or is it better done within the web app code (Python in my case)?
Thank you, everyone.
This is what Apache is for.
Create a directory that will have the reports.
Configure Apache to serve files from that directory.
If the report exists, redirect to a URL that Apache will serve.
Otherwise, the report doesn't exist, so create it. Then redirect to a URL that Apache will serve.
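A rough sketch of that check, independent of the web framework (REPORT_DIR and generate_report are placeholders, not from the answer):
import os

REPORT_DIR = "/var/www/reports"       # the directory Apache is configured to serve
REPORT_URL_PREFIX = "/reports/"       # the URL path mapped to that directory

def report_url(slug, inputs):
    filename = slug + ".pdf"
    path = os.path.join(REPORT_DIR, filename)
    if not os.path.exists(path):
        # pay the CPU cost only when this report has never been built before
        generate_report(inputs, path)  # placeholder for the expensive generation step
    return REPORT_URL_PREFIX + filename   # redirect the client here; Apache serves the file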
There's no "hashing". You have a key ("a string (equations, numbers, and lists of words), arbitrary length, almost 99% will be less than about 200 words") and a value, which is a file. Don't waste time on a hash. You just have a long key.
You can compress this key somewhat by making a "slug" out of it: remove punctuation, replace spaces with _, that kind of thing.
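For instance, a simple slug function along those lines (only a sketch; the exact cleanup rules are up to you):
import re

def slugify(raw_input):
    # lower-case, drop punctuation, collapse whitespace runs into single underscores
    cleaned = re.sub(r"[^\w\s]", "", raw_input.lower())
    return re.sub(r"\s+", "_", cleaned).strip("_")

# slugify("Beam load: 2 * x^2 + 7, steel I-beam")  ->  "beam_load_2_x2_7_steel_ibeam"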
You should create an internal surrogate key which is a simple integer.
You're simply translating a long key to a "report" which either exists as a file or will be created as a file.
The usual thing is to use a reverse proxy like Squid or Varnish.
I have a directory full of XML files (on the order of 10^3 to 10^4 of them) from which I need to extract the contents of several fields.
I've tested different xml parsers, and since I don't need to validate the contents (expensive) I was thinking of simply using xml.parsers.expat (the fastest one) to go through the files, one by one to extract the data.
Is there a more efficient way? (simple text matching doesn't work)
Do I need to issue a new ParserCreate() for each new file (or string) or can I reuse the same one for every file?
Any caveats?
Thanks!
Usually, I would suggest using ElementTree's iterparse, or, for extra speed, its counterpart from lxml. Also try to use multiprocessing (the processing module, which became part of the standard library in 2.6) to parallelize.
The important thing about iterparse is that you get the element (sub-)structures as they are parsed.
import xml.etree.cElementTree as ET
xml_it = ET.iterparse("some.xml")
event, elem = next(xml_it)   # the iterator yields (event, element) pairs
event will always be the string "end" in this case, but you can also initialize the parser to tell you about new elements as they are started. You don't have any guarantee that all child elements will have been parsed at that point, but the attributes are already there, if that is all you are interested in.
Another point is that you can stop reading elements from the iterator early, i.e. before the whole document has been processed.
If the files are large (are they?), there is a common idiom to keep memory usage constant just as in a streaming parser.
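That idiom usually looks something like this (a sketch assuming you only care about, say, <record> elements; the tag name and the handle() function are made up):
import xml.etree.cElementTree as ET

for event, elem in ET.iterparse("some.xml", events=("end",)):
    if elem.tag == "record":
        handle(elem)     # placeholder for your field extraction
        elem.clear()     # discard the finished subtree so memory stays roughly constant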
The quickest way would be to match strings (with, e.g., regular expressions) instead of parsing XML - depending on your XMLs this could actually work.
But the most important thing is this: instead of thinking through several options, just implement them and time them on a small set. This will take roughly the same amount of time, and will give you real numbers to drive you forward.
EDIT:
Are the files on a local drive or network drive? Network I/O will kill you here.
The problem parallelizes trivially - you can split the work among several computers (or several processes on a multicore computer).
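A sketch of the multi-process split on a single machine, using the standard multiprocessing module (extract_fields is a placeholder for whatever parsing approach you settle on):
import glob
from multiprocessing import Pool

def extract_fields(path):
    # placeholder: parse one file and return the fields you need
    ...

if __name__ == "__main__":
    files = glob.glob("/path/to/xml/*.xml")
    with Pool() as pool:                   # one worker process per CPU core by default
        results = pool.map(extract_fields, files)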
If you know that the XML files are always generated by the same algorithm, it might be more efficient to skip XML parsing altogether. E.g. if you know that the data is in lines 3, 4, and 5, you might read through the file line by line and then use regular expressions.
Of course, that approach would fail if the files are not machine-generated, or originate from different generators, or if the generator changes over time. However, I'm optimistic that it would be more efficient.
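For example, if the field of interest always sits on its own line as <partNumber>...</partNumber> (a made-up tag), a line-oriented regex pass could look like this:
import re

PART_RE = re.compile(r"<partNumber>(.*?)</partNumber>")

def part_numbers(path):
    found = []
    with open(path) as f:
        for line in f:                      # no XML parsing, just text matching
            m = PART_RE.search(line)
            if m:
                found.append(m.group(1))
    return found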
Whether or not you recycle the parser objects is largely irrelevant. Many more objects will get created during parsing, so a single parser object doesn't really count for much.
One thing you didn't indicate is whether or not you're reading the XML into a DOM of some kind. I'm guessing that you're probably not, but on the off chance you are, don't. Use xml.sax instead. Using SAX instead of DOM will get you a significant performance boost.
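If you do go the xml.sax route, a minimal handler looks roughly like this (the partNumber field name is again hypothetical):
import xml.sax

class FieldHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_field = False
        self.values = []

    def startElement(self, name, attrs):
        self.in_field = (name == "partNumber")

    def characters(self, content):
        if self.in_field:
            self.values.append(content)

    def endElement(self, name):
        if name == "partNumber":
            self.in_field = False

handler = FieldHandler()
xml.sax.parse("some.xml", handler)
print(handler.values)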