I have two separate scripts, one written in Python and one in Ruby, which run on a schedule to achieve a single goal. Ruby isn't my language of choice, but it's all I can use for this task.
The Python script runs every 30 seconds, talks to a few scientific instruments, gathers certain data, and writes the data to a text file (one per instrument).
The Ruby script then reads these files every 20 seconds and displays the information on a Dashing dashboard.
The trouble I have is that sometimes the file is being written to by Python at the same time as Ruby is trying to read it. You can see obvious problems here...
Despite putting several checks in my Ruby code, such as:

    if File.exist?(my_file) && File.readable?(my_file) && !File.zero?(my_file)
I still get these clashes every now and then.
Is there a better way in Ruby to avoid reading files that are open or in the middle of being written?
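One idea I've come across (not sure if it's the right fix) is to make the Python side write atomically, so Ruby can never observe a half-written file: write to a temporary file in the same directory, then rename it over the real one. A minimal sketch of what I mean, with a hypothetical write_instrument_file helper:

    import os
    import tempfile

    def write_instrument_file(path, data):
        # Write to a temp file in the same directory, then rename it over
        # the target. On POSIX filesystems os.replace() is atomic, so a
        # reader sees either the old contents or the new, never a mix.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "w") as f:
                f.write(data)
            os.replace(tmp_path, path)
        except Exception:
            os.unlink(tmp_path)
            raise

Would something along those lines remove the need for the checks on the Ruby side, or is there a cleaner approach within Ruby itself?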
This is just a question to understand whether this approach could create problems or failures with a large file.
I have more than 10 users who want to read and write the same large file with a .py script, not at exactly the same time but nearly so, as a background process. Each user has his own line, where a huge amount of relation information about one other user is stored as a really long string. For example:
    [['user1','user2','1'],['user6','user50','2'],['and so on']]
    ['user1','user2','this;is;the;really;long;string;..(i am 18k letters long)...']
    ['user6','user50','this;is;the;really;long;string;..(i am 16k letters long)...']
    ...and so on
Now user 1 just wants to read his line 1, and user 6 wants to remove his own line 2.
So now my questions:
I can't find any problems if all users just read the file. But if user 6 wants to delete his own line information, rewrite line 0 with the new information, and rewrite the other lines to new positions, how would the other 10+ users read the file while user 6 is still writing the new version? They don't take as long to open the file, and if they don't wait for user 6 to finish his job, they will read wrong information from the file.
Would it be enough to write in the .py script:
    f = open(fileNameArr, "rw")
    ....
    f.close()
to solve that problem? Or maybe "rwb+"? What would be needed to do that?
Should I insert some temporary timeout function in the .py script, like the set_time_limit(300); I have to insert in PHP for long calculations and outputs?
A big thanks for any input that helps me understand this.
You should look up Unix file management - Unix doesn't give you a great out-of-the-box solution to this problem.
The short version is that any number of processes can read the same file at once, but under most sets of permissions, any process with write access can overwrite it. Unlike, say, Windows, where the OS can prevent multiple programs from editing the same file at once, on Unix a later write simply overwrites earlier ones - if two users start from the same base file and make different changes, then whoever calls .write() last wins. Yes, this does cause concurrency issues.
The answer above mentions some countermeasures - namely, enforcing file-locking at a software level in your program, which is essentially what I suggested in a comment - but to my knowledge there's no generalized solution to this issue.
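For example, here is a minimal sketch of advisory locking with Python's standard fcntl module (POSIX only, and it only helps if every reader and writer takes the lock; the filename is made up):

    import fcntl

    # Writer: hold an exclusive lock while rewriting the whole file.
    with open("relations.txt", "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until all other locks are released
        lines = f.readlines()
        # ... drop / rewrite lines here ...
        f.seek(0)
        f.truncate()
        f.writelines(lines)
        fcntl.flock(f, fcntl.LOCK_UN)

    # Reader: a shared lock lets many readers in at once, but keeps
    # them all out while a writer holds the exclusive lock.
    with open("relations.txt") as f:
        fcntl.flock(f, fcntl.LOCK_SH)
        data = f.read()
        fcntl.flock(f, fcntl.LOCK_UN)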
Google Docs and the rest of Drive support collaborative editing; though the code is obviously not public, it appears to use Operational Transformation as its main approach. Essentially, no user can directly modify the file; instead of typical file I/O commands, each user sends the server its desired modifications, and the server sorts out the concurrency issues.
Maybe you should rethink the way you've designed this system? Why is all of this information stored within a single file, with each line dedicated to a specific user? Why not have multiple, smaller files, one for each user, which would cut down on the concurrency issues with reading/writing? Why not use a database to store this information instead, and let the database handle the concurrency issues? Most databases can handle arbitrarily large strings, and though some aren't easily scalable to the 30GB you mention in your question, others definitely are.
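To make the database suggestion concrete: SQLite ships with Python's standard library and already serializes concurrent writers for you. A rough sketch, assuming a hypothetical relations table:

    import sqlite3

    conn = sqlite3.connect("relations.db", timeout=30)  # wait up to 30s on a locked DB
    conn.execute("""CREATE TABLE IF NOT EXISTS relations
                    (user_a TEXT, user_b TEXT, info TEXT)""")

    # user 6 removes his row; SQLite does the locking for you.
    with conn:
        conn.execute("DELETE FROM relations WHERE user_a = ?", ("user6",))

    # user 1 reads his row, unaffected by concurrent rewrites.
    row = conn.execute("SELECT info FROM relations WHERE user_a = ?",
                       ("user1",)).fetchone()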
I am currently running evaluations with multiple parameter configurations in a medium-sized project.
I set certain parameters and change some code parts and run the main file with python.
Since the execution takes several hours, after starting it I make changes to some files (commenting out some lines and changing parameters) and start it again in a new tmux session.
While doing this, I observed behaviour where the first execution used configuration options from the second execution, so it seems like Python was not done parsing the code files, or maybe lazy-loads them.
Therefore I wonder how Python loads modules / code files, and whether changing them after I have started the execution will have an impact on that execution.
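To make it concrete, here is a minimal repro of what I suspect is happening, assuming a hypothetical config.py (containing e.g. LEARNING_RATE = 0.1) next to the script:

    import time

    # Modules imported here are read and compiled once, at this point.
    # Editing config.py afterwards does not change this running process.
    import config
    print(config.LEARNING_RATE)

    time.sleep(60)   # edit config.py in another terminal during this pause

    # But an import (or reload) that only executes later happens at call
    # time, so it sees whatever is on disk at that moment.
    import importlib
    importlib.reload(config)
    print(config.LEARNING_RATE)   # may now print the edited value

Is the deferred-import case the explanation for what I observed?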
Part of my job is to print several hundred labels per day with a Dymo labelwriter. We use macOS for everything, and installing Windows is out of the question as it would break tons of workflow. We have a template file that I can modify using Python's xml.etree module. How would I print the labels to our labelwriter using solely Python? I've read up on using os.system to send the data straight to the printer, but I'm not sure how that would work with an XML file and a labelwriter, which seems incredibly proprietary. I don't mind using other languages or APIs as long as they're easy to explain to the non-technical people in the office. Also open to using AppleScript.
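In case it helps frame answers, the direction I've been considering is to render each filled-in template to a PDF and hand it to CUPS (which macOS uses for printing) via the lp command line tool. A rough sketch; the queue name is an assumption (lpstat -p lists the real ones):

    import subprocess

    def print_label(pdf_path, printer="DYMO_LabelWriter_450"):
        # lp is the standard CUPS job-submission tool on macOS;
        # -d selects the print queue. The queue name above is a guess.
        subprocess.run(["lp", "-d", printer, pdf_path], check=True)

    print_label("label_0001.pdf")

Is that a sane route, or is there a better way to drive the labelwriter?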
I have a Python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files anew every time, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but I only see the error message after minutes, because the reading takes so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
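Sketching the idea from that link: keep the interpreter (and the loaded data) alive, and reload only the module you are editing. The module names here are made up:

    # In an interactive session (python -i, or IPython):
    import importlib
    import loader, analysis            # hypothetical modules

    data = loader.read_big_files()     # the slow part, done only once

    # After editing analysis.py, re-run just these two lines:
    importlib.reload(analysis)
    analysis.process(data)             # reuses the already-loaded data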
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up on any big amount of data? If it shows up only on particular files, then once again it is most probably related to some feature of those files, and it will also show up on a smaller file with the same feature. If the main reason is just the sheer amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck of reading the file? Is it just hard-drive performance, or do you do some heavy processing of the read data before actually reaching the part that causes problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then modify your script to load the processed data instead of redoing the processing every time.
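For instance, a minimal sketch of that caching idea with pickle; the function and file names are made up:

    import os
    import pickle

    CACHE = "preprocessed.pkl"

    def expensive_read_and_process():
        # Placeholder for the slow reading/processing step.
        with open("big_input.txt") as f:
            return f.readlines()

    def load_data():
        # Reuse the expensive result if it was computed on an earlier run.
        if os.path.exists(CACHE):
            with open(CACHE, "rb") as f:
                return pickle.load(f)
        data = expensive_read_and_process()
        with open(CACHE, "wb") as f:
            pickle.dump(data, f)
        return data

Delete the cache file whenever the preprocessing itself changes.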
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
I'm working on a side project where we want to process images in a Hadoop MapReduce program (for eventual deployment to Amazon's Elastic MapReduce). The input to the process will be a list of all the files, each with a little extra data attached (the lat/long position of the bottom-left corner - these are aerial photos).
The actual processing needs to take place in Python code so we can leverage the Python Image Library. All the Python streaming examples I can find use stdin and process text input. Can I send image data to Python through stdin? If so, how?
I wrote a Mapper class in Java that takes the list of files and saves the names, the extra data, and the binary contents to a sequence file. I was thinking maybe I need to write a custom Java mapper that takes in the sequence file and pipes it to Python. Is that the right approach? If so, what should the Java that pipes the images out and the Python that reads them in look like?
In case it's not obvious, I'm not terribly familiar with Java OR Python, so it's also possible I'm just biting off way more than I can chew with this as my introduction to both languages...
There are a few possible approaches that I can see:
Use both the extra data and the file contents as input to your Python program. The tricky part here will be the encoding. I frankly have no idea how streaming works with raw binary content, and I'm assuming that the basic answer is "not well." The main issue is that the stdin/stdout communication between processes is very text-based, relying on delimiting input with tabs and newlines and the like. You would need to worry about the encoding of the image data, and probably have some sort of pre-processing step, or a custom InputFormat, so that you could represent the image as text.
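To illustrate the encoding concern: a common workaround (a sketch, untested by me with streaming) is to base64-encode the binary so it survives the tab- and newline-delimited protocol:

    import base64
    import sys

    # Producer side: one record per line, name<TAB>base64(image bytes).
    # Base64 output never contains tabs or newlines, so it streams safely.
    def emit(name, image_bytes):
        encoded = base64.b64encode(image_bytes).decode("ascii")
        sys.stdout.write(name + "\t" + encoded + "\n")

    # Consumer side: decode the line back to raw bytes for PIL.
    def parse(line):
        name, encoded = line.rstrip("\n").split("\t", 1)
        return name, base64.b64decode(encoded)

The cost is roughly a 33% size increase on the wire.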
Use only the extra data and the file location as input to your Python program. Then the program can independently read the actual image data from the file. The hiccup here is making sure that the file is available to the Python script. Remember this is a distributed environment, so the files would have to be in HDFS or somewhere similar, and I don't know if there are good libraries for reading files from HDFS in Python.
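Even without a native library, the Python script could shell out to the Hadoop command line to fetch a file; a sketch with a made-up path:

    import subprocess

    def read_from_hdfs(hdfs_path):
        # `hadoop fs -cat` writes the file's raw bytes to stdout.
        return subprocess.check_output(["hadoop", "fs", "-cat", hdfs_path])

    image_bytes = read_from_hdfs("/user/photos/tile_0001.jpg")

Spawning a JVM per file is slow, though, so treat this as a proof of concept rather than a production plan.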
Do the Java-Python interaction yourself. Write a Java mapper that uses the Runtime class to start the Python process itself. This way you get full control over exactly how the two worlds communicate, but obviously it's more code and a bit more involved.
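If you go that route, the two sides need an agreed framing; length-prefixed records are a simple choice. A sketch of the Python side only (the Java mapper would write the matching format, e.g. via DataOutputStream.writeInt, which is big-endian like ">I" here):

    import struct
    import sys

    def read_records(stream=sys.stdin.buffer):
        # Each record: 4-byte big-endian name length, the name (UTF-8),
        # 4-byte big-endian payload length, then the raw image bytes.
        while True:
            header = stream.read(4)
            if not header:
                return                        # clean end of stream
            (name_len,) = struct.unpack(">I", header)
            name = stream.read(name_len).decode("utf-8")
            (data_len,) = struct.unpack(">I", stream.read(4))
            yield name, stream.read(data_len)

    for name, image_bytes in read_records():
        pass  # hand image_bytes to PIL here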