I am scraping data from multiple websites. To do that I have written multiple web scrapers using Selenium and PhantomJS. Those scrapers return values.
My question is: is there a way I can feed those values to a single Python program that will sort through that data in real time?
What I want to do is not save that data to analyze it later; I want to send it to a program that will analyze it in real time.
What I have tried: I have no idea where to even start.
Perhaps a named pipe would be suitable:
mkfifo whatever (you can also do this from within your Python script with os.mkfifo)
You can write to whatever like a normal file (the write will block until something reads it) and read from whatever in a different process (the read will block if no data is available).
Example:
# writer.py
with open('whatever', 'w') as h:
    h.write('some data')  # Blocks until reader.py reads the data
# reader.py
with open('whatever', 'r') as h:
    print(h.read())  # Blocks until writer.py writes to the named pipe
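As a minimal sketch of creating the pipe from Python itself rather than from the shell (the path 'whatever' is just the example name used above, and os.mkfifo is only available on Unix-like systems):
# make_fifo.py -- sketch: create the named pipe before the writer/reader open it
import os

path = 'whatever'
if not os.path.exists(path):
    os.mkfifo(path)  # creates the FIFO; writer.py and reader.py can now open it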
You can try writing the data you want to share to a file and have the other script read and interpret it. Have the other script run in a loop, checking whether a new file has appeared or the existing file has changed.
Simply use files for data exchange and a trivial locking mechanism.
Each writer or reader (there seems to be only one reader) gets a unique number.
If a writer or reader wants to access the file, it renames it to its original name plus its number, writes or reads, and then renames it back.
The others wait until the file is available again under its own name and then access it by locking it in the same way.
Of course there are shared memory, memory-mapped files, semaphores and the like, but this mechanism has worked flawlessly for me for over 30 years, on any OS and over any network, because it is trivially simple.
It is in fact a poor man's mutex.
To find out whether a file has changed, look at its modification timestamp.
But the locking is necessary too, otherwise you will end up in a mess.
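A minimal sketch of that rename-based lock, assuming a shared file named data.txt and a per-process number (both names are examples):
# rename_lock.py -- sketch of the rename-based locking described above
import os
import time

SHARED = 'data.txt'   # assumed shared file name
MY_ID = '42'          # assumed unique number for this process

def locked_read():
    locked_name = SHARED + '.' + MY_ID
    while True:
        try:
            os.rename(SHARED, locked_name)  # "acquire": only one renamer can win
            break
        except OSError:
            time.sleep(0.1)                 # someone else holds the file; retry
    try:
        with open(locked_name) as f:
            return f.read()
    finally:
        os.rename(locked_name, SHARED)      # "release": restore the original name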
Related
I would like to check every minute whether a file like "RESULTS.ODB" has been generated, and if this file is bigger than 1.5 gigabytes, start another subprocess to get the data from it. How can I make sure that the file isn't still in the process of being written and that everything is included?
I hope you know what I mean. Any ideas how to handle that?
Thank you very much. :)
If you have no control over the writing process, then you are at some point bound to fail somewhere.
If you do have control over the writer, a simple way to "lock" files is to create a symlink. If your symlink creation fails, a write is already in progress. If it succeeds, you have just acquired the "lock".
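A rough sketch of that symlink trick; the names results.odb and results.odb.lock are hypothetical, and the FileExistsError from os.symlink is what signals that the lock is already held:
# symlink_lock.py -- sketch; file names are examples only
import os

def try_lock(lockname='results.odb.lock'):
    try:
        os.symlink('results.odb', lockname)  # atomic: fails if the link already exists
        return True                          # we hold the "lock"
    except FileExistsError:
        return False                         # someone else is writing

def unlock(lockname='results.odb.lock'):
    os.remove(lockname)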
But if you do not have any control over the writing and creation of the file, there will be trouble. You can try the approach outlined here: Ensuring that my program is not doing a concurrent file write
That approach reads the file's timestamps and "guesses" from them whether writing has completed. It is more reliable than checking the file size, as you could end up with a file over your size threshold while writing is still in progress.
In this case the problem would be the writer starting to write before you have read the file in its entirety; your reader would then fail when the file it was reading disappeared halfway through.
If you are on a Unix platform, have no control over the writer, and absolutely need to do this, I would do something like this:
1. Check if the file exists and, if it does, whether its "last written" timestamp is "old enough" to assume the writer is done.
2. Rename the file to a different name.
3. Check that the renamed file still matches your criteria.
4. Get the data from the renamed file.
Nevertheless, this will eventually fail and you will lose an update, as there is no way to make it atomic. Renaming removes the problem of the file being overwritten before you have read it, but if the writer decides to start writing between steps 1 and 2, you will not only receive an incomplete file, you might also break the writer if it does not like the file disappearing halfway through.
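A rough sketch of steps 1 to 4 under those assumptions; the working name, age threshold and size check are examples, not part of the question:
# grab_results.py -- sketch of the check / rename / read steps above
import os
import time

SRC = 'RESULTS.ODB'             # file name from the question
DST = 'RESULTS.ODB.processing'  # hypothetical working name
MIN_AGE = 180                   # seconds the file must have been left untouched
MIN_SIZE = 1.5 * 1024**3        # the 1.5 GB threshold from the question

def try_grab():
    # Step 1: the file exists and is "old enough"
    if not os.path.exists(SRC):
        return None
    if time.time() - os.path.getmtime(SRC) < MIN_AGE:
        return None
    # Step 2: rename it out of the writer's way
    os.rename(SRC, DST)
    # Step 3: re-check the criteria on the renamed file
    if os.path.getsize(DST) < MIN_SIZE:
        return None
    # Step 4: hand the renamed file to the next stage
    return DST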
I would rather try to find a way to chain the actions together, either by having your writer trigger the read process or by adding a locking mechanism. Writing 1.5 GB of data is not instantaneous, and eventually the unexpected will happen.
Or, if you definitely cannot do anything like that, could you ensure, for example, that your writer writes at most once every N minutes? If you could be sure it never writes twice within a 5 minute window, your reader could wait until the file is 3 minutes old, then rename it and read the renamed copy. You could also check whether you can prevent the writer from overwriting. If you can, then you can safely process the file in your reader once it is "old enough" and has not changed within whatever grace period you decide to give it; when you have read it, delete the file, allowing the next update to appear.
Without knowing more about your environment and the processes involved, this is the best I can come up with. There is no universal solution to this problem; it needs a workaround tailored to your particular environment.
So I have a Python script (let's call it file_1.py) that overwrites the content of a text file with new content, and it works just fine. I have another Python script (file_2.py) that reads the file and performs actions on the data in it. With file_2.py I've been trying to detect when the text file is edited by file_1.py and then do some stuff with the new data as soon as it's added. I looked into the subprocess module but I couldn't really figure out how to use it across different files. Here's what I have so far:
file_1.py:
with open('text_file.txt', 'w') as f:
    f.write(assemble(''.join(data)))  # you can ignore what assemble does; this part already works
file_2.py:
while True:
    f = open('text_file.txt', 'r')
    data = f.read()
    function(data)
    f.close()
I thought that since I close and reopen the file every loop, the data in the file would be updated. However, it appears I was wrong, as the data remains the same even though the file was updated. So how can I go about doing this?
Are you always overwriting the data in the first file with the same data?
I mean, instead of appending or actually changing the data over time?
I see it working here when I change
with open('text_file.txt','wt') as f:
to
with open('text_file.txt','at') as f:
and I append some data. 'w' will overwrite, and if the data doesn't change you will see the same data over and over.
Edit:
Another possibility (as discussed in the comments to the OP's self-answer) is the need to call f.flush() after writing to the file. Although the buffers are flushed automatically when a file is closed (or a with block is left), that write to disk can take a moment, and if the file is read again before then, the update will not be there yet. To remove that uncertainty, call flush() after updating, which forces the write.
If your reading code sleeps long enough between readings (that is, the reads are slow enough), the manual flush might not be needed. But if in doubt, or to keep it simple and be sure, just use flush().
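A minimal sketch of the writer with an explicit flush; the os.fsync call is an extra, optional step that asks the operating system to commit the data to disk as well:
# file_1.py (sketch) -- flush the write so file_2.py sees it immediately
import os

new_data = 'some data'  # stands in for whatever assemble() produces
with open('text_file.txt', 'w') as f:
    f.write(new_data)
    f.flush()              # push Python's buffer down to the OS
    os.fsync(f.fileno())   # optional: ask the OS to write it to disk too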
Okay, so it looks like I've solved my problem. According to this website:
Python automatically flushes the files when closing them. But you may want to flush the data before closing any file.
Since the "automatic flushing" wasn't working, I tried to manually flush the I/O using file.flush(), and it worked. I call that function every time right after I write to the file in file_1.py.
EDIT: It seems that when time.sleep() is called between readings of the file, it interferes and you have to flush the buffer manually.
I'm a Python beginner facing the following: I have a script that periodically reads a settings file and does something according to those settings. I have another script, triggered by some UI, that writes the settings file with user-input values. I use the ConfigParser module to both read and write the file.
I am wondering whether this scenario can lead to an inconsistent state (for example, the other script beginning to write in the middle of the reader reading the settings file). I am unaware of any mechanism behind the scenes that automatically protects against such situations.
If such inconsistencies are possible, what could I use to synchronize both scripts and maintain the integrity of the operations?
There may be a race condition when the reader reads while the writer writes to the file, so the reader may see the file while it is incomplete.
You can protect against this race by locking the file while reading and writing (see Linux flock() or the Python lockfile module), so that the reader never observes an incomplete file.
Or, better, you can first write into a temporary file and, when done, rename it to the final name atomically. That way the reader and writer never block:
import os

def write_config(config, filename):
    tmp_filename = filename + "~"
    with open(tmp_filename, 'wb') as file:  # on Python 3, ConfigParser.write needs text mode ('w')
        config.write(file)
    os.rename(tmp_filename, filename)
When the writer uses the above method no changes are required to the reader.
When you write the config file, write it to a temporary file first. When it's done, rename it to the correct name. The rename operation (os.rename) is normally implemented as an atomic operation on Unix systems and Linux, and I think on Windows too, so there is no risk of the other process trying to read the config before the write has finished.
There are at least two ways to address this issue (assuming you are on a Unix-ish system):
If you want to write, write to a temporary file first, then do something unix can do atomically, especially rename the temporary file into place.
Lock the file during any operation, e.g. with the help of this filelock module.
Personally, I like the first option because it utilizes the OS, although some systems have had problems with the atomicity: On how rename is broken in Mac OS X. Another limitation: the rename system call cannot rename files across devices.
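A minimal sketch of option 2 using the third-party filelock package (pip install filelock); the settings file and lock file names are just examples:
# settings_lock.py -- sketch: guard reads and writes with filelock
from filelock import FileLock

lock = FileLock('settings.ini.lock')  # a separate lock file next to the settings

def write_settings(text):
    with lock:                         # blocks until the lock is free
        with open('settings.ini', 'w') as f:
            f.write(text)

def read_settings():
    with lock:
        with open('settings.ini') as f:
            return f.read()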
Apologies if this kind of thing has been answered elsewhere. I am using Python to run a Windows executable file using subprocess.Popen(). The executable file produces a .txt file and some other output files as part of its operation. I then need to run another executable file using subprocess.Popen() that uses the output from the original .exe file.
The problem is, it is the .exe file and not Python that is controlling the creation of the output files, and so I have no control over knowing how long it takes the first text file to write to disk before I can use it as an input to the second .exe file.
Obviously I cannot run the second executable file before the first text file finishes writing to disk.
subprocess.wait() does not appear to be helpful because the first executable terminates before the text file has finished writing to disk. I also don't want to use some kind of function that waits an arbitrary period of time (say a few seconds) then proceeds with the execution of the second .exe file. This would be inefficient in that it may wait longer than necessary, and thus waste time. On the other hand it may not wait long enough if the output text file is very large.
So I guess I need some kind of listener that waits for the text file to finish being written before it moves on to execute the second subprocess.Popen() call. Is this possible?
Any help would be appreciated.
UPDATE (see Neil's suggestions, below)
The problem with os.path.getmtime() is that the modification time is updated more than once during the write, so very large text files (say ~500 MB) require a relatively long wait between os.path.getmtime() calls. I use time.sleep() to do this. I guess this solution is workable, but it is not the most efficient use of time.
On the other hand, I am having bigger problems with trying to open the file for write access. I use the following loop:
while True:
    try:
        f = open(file, 'w')
    except:
        # For lack of something else to put in here
        # (I don't want to print anything)
        os.path.getmtime(file)
    else:
        break
This approach seems to work in that Python essentially pauses while the Windows executable is writing the file, but afterwards, when I go to use the text file in the next part of the code, I find that the contents that were just written have been wiped.
I know they were written because I can see the file size increasing in Windows Explorer while the executable is doing its stuff, so I can only assume that the final call to open(file, 'w') (once the executable has done its job) somehow causes the file to be wiped.
Obviously I am doing something wrong. Any ideas?
There are probably many ways to do what you want. One that springs to mind is that you could poll the modification time with os.path.getmtime() and see when it changes. If the modification date is after you called the executable, but is still a couple of seconds old, you could assume the writer is done.
Alternatively, you could try opening the file for write access (just without actually writing anything). If that fails, it means someone else is writing to it.
This all sounds fragile, but I assume your hands are somewhat tied here, too.
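A rough sketch of the mtime-polling idea; the quiet period is an arbitrary example, and note that probing with open(file, 'w') truncates the file (which would explain the wiped contents described above), so the probe below uses append mode instead:
# wait_for_file.py -- sketch: wait until the output file has stopped changing
import os
import time

def wait_until_stable(path, quiet_seconds=5, poll=1):
    """Return once path exists and its mtime has not changed for quiet_seconds."""
    while not os.path.exists(path):
        time.sleep(poll)
    while time.time() - os.path.getmtime(path) < quiet_seconds:
        time.sleep(poll)

def writer_finished(path):
    """Probe for write access without destroying the data ('a' does not truncate)."""
    try:
        with open(path, 'a'):
            return True
    except OSError:
        return False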
One suggestion that comes to mind is that the text file being written might have a recognizable end-of-file marker. I created a text file that looks like this:
BEGIN
DATA
DATA
DATA
END
Given this file, I could then tell whether "END" had been written to the end of the file by seeking relative to the end, like this:
>>> import os
>>> fp = open('test.txt', 'rb')  # binary mode, so we can seek relative to the end
>>> fp.seek(-4, os.SEEK_END)
21
>>> fp.read()
b'END\n'
I am writing a script that will be polling a directory looking for new files.
In this scenario, is it necessary to do some sort of error checking to make sure the files have been completely written before accessing them?
I don't want to work with a file before it has been written completely to disk, but because the info I want from the file is near the beginning, it seems possible that I could pull the data I need without realizing the file isn't done being written.
Is that something I should worry about, or will the file be locked because the OS is writing it to the hard drive?
This is on a Linux system.
Typically on Linux, unless you're using locking of some kind, two processes can quite happily have the same file open at once, even for writing. There are three ways of avoiding problems with this:
Locking
By having the writer apply a lock to the file, it is possible to prevent the reader from reading the file partially. However, most locks are advisory, so it is still entirely possible to see partial results anyway. (Mandatory locks exist, but are strongly discouraged on the grounds that they're far too fragile.) It's relatively difficult to write correct locking code, and it is normal to delegate such tasks to a specialist library (i.e., to a database engine!). In particular, you don't want to use locking on networked filesystems; it's a source of colossal trouble when it works, and it can often go thoroughly wrong.
Convention
A file can instead be created in the same directory with another name that you don't automatically look for on the reading side (e.g., .foobar.txt.tmp) and then renamed atomically to the right name (e.g., foobar.txt) once the writing is done. This can work quite well, so long as you take care to deal with the possibility of previous runs failing to correctly write the file. If there should only ever be one writer at a time, this is fairly simple to implement.
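A small sketch of that convention from the writer's side, assuming the reader only ever looks for *.txt files, so the dot-prefixed temporary name stays invisible to it (all names are examples):
# convention_writer.py -- sketch: write under a temporary name, then rename into place
import os

def publish(data, final_name='foobar.txt'):
    tmp_name = '.' + final_name + '.tmp'  # hidden name the reader does not match
    with open(tmp_name, 'w') as f:
        f.write(data)
    os.rename(tmp_name, final_name)       # atomic on the same filesystem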
Not Worrying About It
The most common type of file that is frequently written to is a log file. Log files can easily be written in such a way that information is only ever appended, so any reader can safely look at the beginning of the file without having to worry about anything changing under its feet. This works very well in practice.
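A sketch of reading such an append-only file while it grows: remember how far you have read and only consume what has been appended since the last pass (the log name and poll interval are examples).
# tail_log.py -- sketch: safely follow an append-only log file
import time

def follow(path='app.log', poll=1.0):
    offset = 0
    while True:
        with open(path, 'r') as f:
            f.seek(offset)
            chunk = f.read()      # everything appended since the last pass
            offset = f.tell()
        if chunk:
            print(chunk, end='')  # replace with real processing
        time.sleep(poll)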
There's nothing special about Python in any of this. All programs running on Linux have the same issues.
On Unix, unless the writing application goes out of its way, the file won't be locked and you'll be able to read from it.
The reader will, of course, have to be prepared to deal with an incomplete file (bearing in mind that there may be I/O buffering happening on the writer's side).
If that's a non-starter, you'll have to think of some scheme to synchronize the writer and the reader, for example:
explicitly lock the file;
write the data to a temporary location and only move it into its final place when the file is complete (the move operation can be done atomically, provided both the source and the destination reside on the same file system).
If you have some control over the writing program, have it write the file somewhere else (like the /tmp directory) and then when it's done move it to the directory being watched.
If you don't have control of the program doing the writing (and by 'control' I mean the ability to edit its source code), you probably won't be able to make it do file locking either, so that's probably out. In that case you'll likely need to know something about the file format to tell when the writer is done. For instance, if the writer always writes "DONE" as the last four characters in the file, you could open the file, seek to the end, and read the last four characters.
Yes it will.
I prefer the "file naming convention" and renaming solution described by Donal.