I have a Python script which loads binary data from any targeted file and stores it inside itself, in a list. The problem is that the bigger the stored file is, the longer the script takes to open the next time. Let's say I store a 700 MB movie in my script file and then open the script the next day with that 700 MB of data inside it. It takes an eternity to open!
Here is a simplified layout of how the script file looks.
Line 1: "The 700 MB movie is stored here inside a list."
Everything below: "All functions that the end-user uses."
Before the interpreter reaches the functions that the user is waiting to call, it has to parse the 700 MB of data on line 1 first! This is a problem, because who wants to wait an hour just to open a script?
So, would it help if I changed the layout of the file like this?
First lines: "All functions that the end-user uses."
Everything below: "The 700 MB movie is stored here inside a list."
Would that help? Or would the interpreter have to plow through all 700 MB before the functions were called anyway?
The Python compiler works in a way that makes what you are looking to do very hard, to say the least.
First, every time you change the script (by embedding the file, for example), it will trigger a new compilation cycle before execution (turning the .py file into a .pyc one).
Second, every time you import the module, that large block of data will be loaded into memory (either at import time or when you first access the data).
This is not just slow, it's also unsafe and error prone.
I'm guessing that what you intend to do is distribute one single file with the data in it.
You might be able to do that using this little trick:
Making an executable Python package (basically a zip file).
Building the zip file is very easy using the zipfile module.
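As a rough sketch of that trick: bundle the code and the raw data into a single executable zip archive, so the interpreter never has to parse the data as Python source. The names main.py, movie.bin and app.pyz below are placeholders, not from the question.

import zipfile

# Build an archive that Python can run directly with: python app.pyz
with zipfile.ZipFile("app.pyz", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("main.py", arcname="__main__.py")   # entry point executed by the interpreter
    zf.write("movie.bin", arcname="movie.bin")   # raw data ships alongside the code

At run time, __main__.py can read movie.bin back out of the archive with the same zipfile module, so loading the data is a plain file read rather than compilation of a 700 MB source line.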
This is just a question to understand whether this approach could create problems or failures with the large file.
I have more than 10 users who want to read and write the same large file with a .py script, not at exactly the same time but almost, as a background process. Each user has his own line, where a huge amount of relation information to one other user is stored as a really long string. For example:
[['user1','user2','1'],['user6','user50','2'],['and so on']]
['user1','user2','this;is;the;really;long;string;..(i am 18k letters long)...']
['user6','user50','this;is;the;really;long;string;..(i am 16k letters long)...']
...and so on
Now user 1 just wants to read his line 1, and user 6 wants to remove his own line 2.
So now my questions:
I can't see any problems if all users just read the file. But if user 6 wants to delete his own line information, rewrite line 0 with the new information and shift the other lines to new positions, how would the other 10+ users read the file while user 6 is still writing the new version? They don't take as long to open the file, and if they don't wait for user 6 to finish his job, they would read the wrong information from the file.
Would it be enough to write this in the .py script:
f = open(fileNameArr, "rw")
....
f.close()
to solve that problem? Or maybe "rwb+", or what else would be needed?
Should I insert some temporary timeout in the .py script, the way I have to use set_time_limit(300); in PHP for long calculations and outputs?
A big thanks in advance for any input that helps me understand this.
You should look up Unix file management - Unix doesn't give you a great out-of-the-box solution to this problem.
The short version is that any number of processes can read the same file at once, but under most sets of permissions, any process can overwrite the file. Unlike on, say, Windows, where the OS prevents multiple programs from editing the same file at once, on Unix any write will overwrite all previous writes - if two users start from the same base file and make different changes, then whoever calls .write() most recently will win. Yes, this does cause concurrency issues.
The answer above mentions some countermeasures - namely, enforcing file-locking at a software level in your program, which is essentially what I suggested in a comment - but to my knowledge there's no generalized solution to this issue.
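As a minimal sketch of that software-level locking on a Unix system (the lock is advisory, so every cooperating process must use it; the path and the transform callback are placeholders, not from the question):

import fcntl

def rewrite_file(path, transform):
    # Hold an exclusive advisory lock while rewriting the whole file, so other
    # processes that also call flock() wait instead of reading a half-written file.
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            lines = f.readlines()
            lines = transform(lines)   # e.g. drop or replace one user's line
            f.seek(0)
            f.truncate()
            f.writelines(lines)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

Readers would take a shared lock (fcntl.LOCK_SH) around their reads in the same way.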
Google Docs and the rest of Drive offer collaborative file editing that, although the code is obviously not public, seems to use Operational Transformation as its main approach. Essentially, no user can directly modify the file; instead of using typical file I/O commands, each user sends the server its desired modifications, and the server sorts out the concurrency issues.
Maybe you should rethink the way you've designed this system? Why is all of this information stored within a single file, with each line dedicated to a specific user? Why not have multiple, smaller files, one for each user, which would cut down on the concurrency issues with reading/writing? Why not use a database to store this information instead, and let the database handle the concurrency issues? Most databases can handle arbitrarily large strings, and though some aren't easily scalable to the 30GB you mention in your question, others definitely are.
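For the database route, a minimal sketch with the standard-library sqlite3 module (the table and column names are illustrative, not from the question); SQLite serializes writers for you, so the script no longer has to manage the concurrency itself:

import sqlite3

conn = sqlite3.connect("relations.db")
conn.execute("CREATE TABLE IF NOT EXISTS relations (user_a TEXT, user_b TEXT, payload TEXT)")

# user 6 replaces his own row without touching anyone else's data
with conn:
    conn.execute("DELETE FROM relations WHERE user_a = ?", ("user6",))
    conn.execute("INSERT INTO relations (user_a, user_b, payload) VALUES (?, ?, ?)",
                 ("user6", "user50", "this;is;the;really;long;string;..."))

# user 1 just reads his own row
row = conn.execute("SELECT payload FROM relations WHERE user_a = ?", ("user1",)).fetchone()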
I am using Python 3.6.3.
A problem remains in my script, which is fully operational.
The main modules are pandas and xlsxwriter (with easygui for the GUI).
From a single master file (Excel 2010), this script can generate dozens of Excel files (with xlsxwriter), each of which can contain hundreds of columns of data (depending on the parameters in the master file).
Indentation, logic and results are OK.
But the last Excel file is not committed to disk, and I have to restart Python to get it.
For example, if one run produces 100 files, only 99 will be written on disk. The last one is calculated, but not visible.
If Python is not restarted, this file is written to disk at the beginning of a next run of the script.
I suspect a flush problem and have tried some solutions, but the problem remains.
Are there any tricks to force the buffer to flush? I am not allowed to modify the environment variables on my professional computer.
Thank you for your help and your time :)
Thank you Mark Tolonen
You were right: the file was not closed properly, and it was because I made a mistake.
My code was a bit difficult to summarize, so I could not post a condensed version of my script.
First, the continue keyword (for the main loop) was incorrectly indented, and I moved it to the right place.
But just before this keyword, I was closing the xlsxwriter file with workbook.close (there is only one such call in the script, for the main loop).
This was not reported as an error at run time.
Each xlsxwriter file was committed to disk except the last one, as mentioned in my question above.
I then reviewed the documentation at https://xlsxwriter.readthedocs.io/workbook.html and noticed that the parentheses were missing where the workbook is closed.
After adding the missing parentheses, workbook.close(), everything is fine now :)
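For readers who hit the same symptom, a minimal illustration of the mistake (the file name is a placeholder):

import xlsxwriter

workbook = xlsxwriter.Workbook("report.xlsx")
workbook.add_worksheet().write(0, 0, "hello")

workbook.close    # bug: only references the method, does nothing, and raises no error
workbook.close()  # fix: actually calls it, so the file is written and closed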
I would like to share this information, because others may have met the same problem.
Thank you also to progmatico for the information on flush behaviour.
Greetings from Paris, France :)
I have been looking at some Python modules and PowerShell capabilities to try to import some data that a database recently kicked out in the form of folders and text files.
File Structure:
Top Level Folder > FOLDER (device hostname) > Text file (also contains the hostname of the device) (with the data I need in a single cell in Excel)
The end result I am trying to accomplish is to have the first column contain the FOLDER (device name) and the second column contain the text of the text file within that folder.
I found some Python modules, but they all focus on pulling directly from a text document... I want the script or PowerShell function to iterate through each folder and pull out both the folder name and the text.
This is definitely doable in PowerShell. If I understand your question correctly, you're going to want to use Get-ChildItem and Get-Content, with -Recurse if necessary. As far as the export goes, you're going to want to use Out-File, which can be a hassle when exporting directly to xlsx. If you had some code to work with I could help better, but until then this should get you started in the right direction. I would read up on the Get-* commands, because PowerShell is very simple to write but powerful.
Well, since you're asking in a general sense: you can do this project simply in either scripting language. If it were me, and I had to do this job once, I'd probably just bang out a PoSH script to do it. If the script had to be run repeatedly, and potentially grow more complex functions, I'd probably switch to Python (but that's just my personal preference, as PoSH is pretty powerful in a Windows environment). Really this seems like a 15-line recursive function in either language.
You can look up "recursive function in powershell" and the same for Python and get plenty of code examples, as file tree walks (FTW :) ) are one of the most solved problems on sites like this. You'd just replace whatever the other person is doing in their walk with a read of the file and a write. You probably also want to output in CSV format, because it's easier and imports into Excel just fine.
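If you go the Python route, here is a minimal sketch of that walk under the layout described in the question; the top-level path and the output file name are placeholders:

import csv
import os

top_level = r"C:\exports\top_level"   # placeholder path to the Top Level Folder

with open("devices.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["device", "contents"])
    for entry in os.scandir(top_level):          # one subfolder per device
        if not entry.is_dir():
            continue
        for name in os.listdir(entry.path):
            if name.endswith(".txt"):
                with open(os.path.join(entry.path, name), encoding="utf-8") as f:
                    writer.writerow([entry.name, f.read()])

The resulting CSV opens in Excel with the device name in the first column and the file's text in the second.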
I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but only see the error message minutes later, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up on any large amount of data? If it shows up only on particular files, then once again it is most probably related to some feature of these files and will also show up on a smaller file with the same feature. If the main reason is just the large amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck of reading the file? Is it just a hard drive performance issue, or do you do some heavy processing of the read data in your script before actually coming to the part that generates problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then modify your script to load this processed data instead of redoing the processing each time (see the sketch below).
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
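As an illustration of that caching idea, a minimal sketch with pickle; the cache file name and parse_big_file are placeholders standing in for your expensive reading step:

import os
import pickle

CACHE = "parsed_input.pickle"

def parse_big_file(path):
    # stand-in for the slow reading/processing you only want to do once
    with open(path) as f:
        return f.readlines()

def load_data(path):
    # Reuse the cached result while it is newer than the source file;
    # otherwise parse once and cache it for the next debugging run.
    if os.path.exists(CACHE) and os.path.getmtime(CACHE) >= os.path.getmtime(path):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    data = parse_big_file(path)
    with open(CACHE, "wb") as f:
        pickle.dump(data, f)
    return data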
This question already has answers here:
How do I watch a file for changes?
(28 answers)
I am relatively new to python, but I am trying to create an automated process where my code will listen for new file entries in a directory. For example, someone can manually copy a zip file into a particular folder, and I want my code to recognize the file once it has completely been copied into the folder. The code can then do some manipulations, but that is irrelevant. I currently have my code just checking for a new file every 5 seconds, but this seems inefficient to me. Can someone suggest something that is more asynchronous?
Checking for new files every few seconds is actually not that bad an approach; in fact, it is the only really portable way to monitor for filesystem changes. If you're running on Linux, the answers in the question linked by #sdolan will help you check for new files more efficiently, but they won't help you with the other part of your question.
Detecting that the file has been copied completely is much harder than it first appears. Your best bet is, when a new file is detected, to wait until it hasn't been touched for a while before processing it. The length of the wait period is best determined experimentally. It's a balancing act: make the interval too short and you risk operating on incomplete files; make it too long and the user will notice a delay between the copy finishing and your code processing it.
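A minimal sketch of that approach; the directory name, quiet period and poll interval are placeholders to tune for your setup:

import os
import time

WATCH_DIR = "incoming"    # placeholder directory to watch
QUIET_SECONDS = 10        # how long a file must stay untouched before we trust it
POLL_INTERVAL = 5

seen = set()
while True:
    now = time.time()
    for name in os.listdir(WATCH_DIR):
        path = os.path.join(WATCH_DIR, name)
        if path in seen or not os.path.isfile(path):
            continue
        if now - os.path.getmtime(path) >= QUIET_SECONDS:
            seen.add(path)
            print("ready to process:", path)   # replace with the real handling
    time.sleep(POLL_INTERVAL)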