I am using the output streams from the io module and writing to files. I want to be able to detect when I have written 1G of data to a file and then start writing to a second file. I can't seem to figure out how to determine how much data I have written to the file.
Is there something easy built in to io? Or might I have to count the bytes before each write manually?
If you are using this file for logging purposes, I suggest using the RotatingFileHandler in the logging module, like this:
import logging
import logging.handlers
file_name = 'test.log'
test_logger = logging.getLogger('Test')
handler = logging.handlers.RotatingFileHandler(file_name, maxBytes=10**9)
test_logger.addHandler(handler)
N.B.: you can also use this method even if you don't use the file for logging, if you like doing hacks :)
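For example, here is a minimal sketch of that hack (my own names and parameters, not part of the original answer): route arbitrary text through a rotating handler by logging each chunk as a bare message, and let the handler do the ~1GB rollover for you.
import logging
import logging.handlers

writer = logging.getLogger('DataWriter')
writer.setLevel(logging.INFO)
# roll over at roughly 1GB, keeping up to 5 old files (data.log.1, data.log.2, ...)
handler = logging.handlers.RotatingFileHandler('data.log', maxBytes=10**9, backupCount=5)
handler.setFormatter(logging.Formatter('%(message)s'))  # emit the raw message only
writer.addHandler(handler)

for chunk in ('some data', 'more data'):
    writer.info(chunk)  # each call may trigger a rollover once maxBytes is reached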
See the Python documentation for File Objects, specifically tell().
Example:
>>> f=open('test.txt','w')
>>> f.write(10*'a')
>>> f.tell()
10L
>>> f.write(100*'a')
>>> f.tell()
110L
See the tell() method on the stream object.
One fairly straightforward approach is to subclass the built-in file class and have it keep track of the amount of output written to the file. Below is some sample code showing how that might be done; it appears to mostly work.
I say mostly because the size of the files produced was sometimes slightly over the maximum while testing it, but that's because the test file was opened in "text" mode, and on Windows this means that all the '\n' linefeed characters get converted into '\r\n' (carriage-return, linefeed) pairs, which throws the size accumulator off. Also, as currently written, the bufsize argument that the standard file() and open() functions accept is not supported, so the system's default size and mode will always be used.
Depending on exactly what you're doing, the size issue may not be a big problem -- however, for large maximum sizes it might be off significantly. If anyone has a good platform-independent fix for this, by all means let us know.
import os.path

verbose = False

class LtdSizeFile(file):
    ''' A file subclass which limits size of file written to approximately "maxsize" bytes '''
    def __init__(self, filename, mode='wt', maxsize=None):
        self.root, self.ext = os.path.splitext(filename)
        self.num = 1
        self.size = 0
        if maxsize is not None and maxsize < 1:
            raise ValueError('"maxsize" argument should be a positive number')
        self.maxsize = maxsize
        file.__init__(self, self._getfilename(), mode)
        if verbose: print 'file "%s" opened' % self._getfilename()

    def close(self):
        file.close(self)
        self.size = 0
        if verbose: print 'file "%s" closed' % self._getfilename()

    def write(self, text):
        lentext = len(text)
        if self.maxsize is None or self.size + lentext <= self.maxsize:
            file.write(self, text)
            self.size += lentext
        else:
            # limit reached: close the current file and continue in the next one
            self.close()
            self.num += 1
            file.__init__(self, self._getfilename(), self.mode)
            if verbose: print 'file "%s" opened' % self._getfilename()
            file.write(self, text)
            self.size += lentext

    def writelines(self, lines):
        for line in lines:
            self.write(line)

    def _getfilename(self):
        return '{0}{1}{2}'.format(self.root, self.num if self.num > 1 else '', self.ext)

if __name__ == '__main__':
    import random
    import string

    def randomword():
        letters = []
        for i in range(random.randrange(2, 7)):
            letters.append(random.choice(string.lowercase))
        return ''.join(letters)

    def randomsentence():
        words = []
        for i in range(random.randrange(2, 10)):
            words.append(randomword())
        words[0] = words[0].capitalize()
        words[-1] = ''.join([words[-1], '.\n'])
        return ' '.join(words)

    lsfile = LtdSizeFile('LtdSizeTest.txt', 'wt', 100)
    for i in range(100):
        sentence = randomsentence()
        if verbose: print ' writing: {!r}'.format(sentence)
        lsfile.write(sentence)
    lsfile.close()
I noticed an ambiguity in your question. Do you want the file to be (a) over, (b) under, or (c) exactly 1GiB in size before switching?
It's easy to tell if you've gone over. tell() is sufficient for that kind of thing; just check if f.tell() > 1024*1024*1024: and you'll know.
Checking whether you're under 1GiB but will go over 1GiB on your next write is a similar technique: if len(data_to_write) + f.tell() > 1024*1024*1024: will suffice.
The trickiest thing to do is to get the file to exactly 1GiB. You will need to tell() the length of the file, and then partition your data appropriately in order to hit the mark precisely.
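For instance, here is a rough sketch of that last case (my own helper, not the poster's code; it assumes binary mode so len(data) equals the number of bytes written, and a caller-supplied open_next() that returns the next file object):
LIMIT = 1024 * 1024 * 1024  # 1 GiB

def write_exact(f, data, open_next):
    """Write bytes, cutting the data at the LIMIT boundary and rolling over."""
    while data:
        room = LIMIT - f.tell()
        if len(data) <= room:
            f.write(data)
            break
        f.write(data[:room])   # fill the current file to exactly LIMIT bytes
        f.close()
        f = open_next()        # caller supplies the next file object
        data = data[room:]     # the remainder continues in the new file
    return f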
Regardless of exactly which semantics you want, tell() is always going to be at least as slow as doing the counting yourself, and possibly slower. This doesn't mean that it's the wrong thing to do; if you're writing the file from a thread, then you almost certainly will want to tell() rather than hope that you've correctly preempted other threads writing to the same file. (And do your locks, etc., but that's another question.)
By the way, I noticed a definite direction in your last couple questions. Are you aware of #twisted and #python IRC channels on Freenode (irc.freenode.net)? You will get timelier, more useful answers.
~ C.
I recommend counting. There's no internal language counter that I'm aware of. Somebody else mentioned using tell(), but an internal counter will take roughly the same amount of work and eliminate the constant OS calls.
# pseudocode
if written + size_of_new > ONE_GIGABYTE:
    rotate_file()
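Fleshing that pseudocode out a little (a sketch with my own naming, not the poster's code; it assumes binary writes so len(data) is the byte count):
class CountingWriter(object):
    '''Writes to numbered files, rotating before the running total would pass the limit.'''
    def __init__(self, basename, limit=10**9):
        self.basename = basename
        self.limit = limit
        self.num = 0
        self.written = 0
        self.f = open(self._name(), 'wb')

    def _name(self):
        return '%s.%d' % (self.basename, self.num)

    def write(self, data):
        if self.written + len(data) > self.limit:
            self.f.close()
            self.num += 1
            self.written = 0
            self.f = open(self._name(), 'wb')
        self.f.write(data)
        self.written += len(data)

    def close(self):
        self.f.close()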
I have the following code to print to the system default printer:
import subprocess

def printFile(file):
    print("printing file...")
    with open(file, "rb") as source:
        printer = subprocess.Popen('/usr/bin/lpr', stdin=subprocess.PIPE)
        printer.stdin.write(source.read())
This function works quite well if I use it on its own. But if I use it in a loop construct like this:
while True:
printFile(file)
(...)
the printing jobs won't run (although the loop continues without errors)...
I tried to build in a time delay, but it didn't help...
[Edit]: further investigation showed me that the printing function (when called from the loop) puts the print jobs on hold...?
In modern Python 3, it is advised to use subprocess.run() in most cases instead of using subprocess.Popen directly. And I would leave it to lpr to read the file, rather than passing the contents through standard input:
import subprocess

def printFile(file):
    print("printing file...")
    cp = subprocess.run(['/usr/bin/lpr', file])
    return cp.returncode
Using subprocess.run allows you to ascertain that the lpr process finished correctly. And this way you don't have to read and write the complete file. You can even remove the file once lpr is finished.
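For instance (a possible usage sketch with a hypothetical path, not from the answer itself):
import os

rc = printFile('/tmp/report.txt')  # hypothetical file path
if rc == 0:
    os.remove('/tmp/report.txt')   # lpr has already spooled the file, so it can be deleted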
Using Popen directly has some disadvantages here:
Using Popen.stdin might produce a deadlock if it overfills the OS pipe buffers (according to the Python docs); a safer pattern is sketched after this list.
Since you don't wait() for the Popen process to finish, you don't know if it finished without errors.
Depending on how lpr is set up, it might have rate controls. That is, it might stop printing if it gets a lot of print requests in a short span of time.
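If you do want to keep feeding the file through standard input, a safer Popen pattern (a sketch of mine, not the answer's code) is to use communicate(), which avoids the pipe-buffer deadlock and waits for lpr to finish:
import subprocess

def print_via_stdin(path):
    with open(path, 'rb') as source:
        proc = subprocess.Popen(['/usr/bin/lpr'], stdin=subprocess.PIPE)
        proc.communicate(source.read())  # writes the data, closes stdin, and waits
    return proc.returncode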
Edit: I just thought of something. Most lpr implementations allow you to print more than one file at a time. So you could also do:
def printFile(files):
    """
    Print file(s).

    Arguments:
        files: string or sequence of strings.
    """
    if isinstance(files, str):
        files = [files]
    # if you want to be super strict...
    if not isinstance(files, (list, tuple)):
        raise ValueError('files must be a sequence type')
    else:
        if not all(isinstance(f, str) for f in files):
            raise ValueError('files must be a sequence of strings')
    cp = subprocess.run(['/usr/bin/lpr'] + files)
    return cp.returncode
That would print a single file or a whole bunch of them in one go...
I have a class that I have written that will open a text document and search it line by line for the keywords that are input from a GUI I created in a different file. It works great; the only problem is that the text document I am searching is long (over 60,000 entries). I was looking at ways to make the search faster and have been playing around with multithreading, but have not had any success yet. Basically, the main program calls the search function, which takes the line and breaks it into individual words. Then a loop checks each of the words against the keywords from the user. If a keyword is in that word, it counts as true and adds a 1 to a list. At the end, if there are as many true statements as keywords, it adds that line to a set that is returned at the end of main.
What I would like to do is incorporate multithreading into this so that it will run much faster but still return the results at the end of the main function. Any advice or direction on how to accomplish this would be very helpful. I have tried to read a bunch of examples and watched a bunch of YouTube videos, but it didn't seem to transfer over when I tried. Thank you for your help and your time.
import pdb
from threading import Thread

class codeBook:
    def __init__(self):
        pass

    def main(self, search):
        count = 0
        results = set()
        with open('CodeBook.txt') as current_CodeBook:
            lines = current_CodeBook.readlines()
            for line in lines:
                line = line.strip()
                new_search = self.change_search(line, search)
                line = new_search[0]
                search = new_search[1]
                #if search in line:
                if self.speed_search(line, search) == True:
                    results.add(line)
                else:
                    pass
                count = count + 1
        results = sorted(list(results))
        return results

    def change_search(self, current_line, search):
        current_line = current_line.lower()
        search = search.lower()
        return current_line, search

    def speed_search(self, line, keywords):
        split_line = line.split()
        split_keywords = keywords.split()
        numberOfTrue = list()
        for i in range(0, len(split_keywords)):
            if split_keywords[i] in line:
                numberOfTrue.append(1)
        if len(split_keywords) == len(numberOfTrue):
            return True
        else:
            return False
You can split the file into several parts and create a new thread that reads and processes a specific part. You can keep a data structure global to all threads and add lines that match the search query from all the threads to it. This structure should either be thread-safe or you need to use some kind of synchronization (like a lock) to work with it.
Note: the CPython interpreter has a global interpreter lock (GIL), so if you're using it and your application is CPU-heavy (which seems to be the case here), you might not get any benefit from multithreading whatsoever.
You can use the multiprocessing module instead. It comes with means of interprocess communication. A Queue looks like the right structure for your problem (each process could add matching lines to the queue). After that, you just need to get all lines from the queue and do what you did with the results in your code.
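Here is a rough sketch of that multiprocessing/Queue approach (my own structure and names, assuming a plain-text CodeBook.txt and a space-separated keyword string; each worker scans a slice of the lines and puts matching lines on a shared queue):
from multiprocessing import Process, Queue

SENTINEL = None

def worker(lines, keywords, out_queue):
    for line in lines:
        lowered = line.lower()
        if all(kw in lowered for kw in keywords):
            out_queue.put(line)
    out_queue.put(SENTINEL)  # tell the main process this worker is done

def parallel_search(filename, keyword_string, nworkers=4):
    keywords = keyword_string.lower().split()
    with open(filename) as f:
        lines = [line.strip() for line in f]
    chunk = (len(lines) + nworkers - 1) // nworkers
    out_queue = Queue()
    procs = [Process(target=worker,
                     args=(lines[i * chunk:(i + 1) * chunk], keywords, out_queue))
             for i in range(nworkers)]
    for p in procs:
        p.start()
    results, finished = set(), 0
    while finished < nworkers:   # drain the queue before joining
        item = out_queue.get()
        if item is SENTINEL:
            finished += 1
        else:
            results.add(item)
    for p in procs:
        p.join()
    return sorted(results)
(Call parallel_search() from under an if __name__ == '__main__': guard so it also works with the spawn start method.)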
While threading and/or multiprocessing can be beneficial and speed up execution, I would like to direct your attention to the possibility of optimizing your current algorithm, running in a single thread, before doing that.
Looking at your implementation I believe a lot of work is done several times for no reason. To the best of my understanding the following function will perform the same operation as your codeBook.main but with less overhead:
def search_keywords(keyword_string, filename='CodeBook.txt'):
    results = set()
    keywords = set()
    for keyword in keyword_string.lower().split():
        keywords.add(keyword)
    with open(filename) as code_book:
        for line in code_book:
            words = line.strip().lower()
            kws_present = True
            for keyword in keywords:
                kws_present = keyword in words
                if not kws_present:
                    break
            if kws_present:
                results.add(line)
    return sorted(list(results))
Try this function, as is or slightly modified for your needs, and see if that gives you a sufficient speed-up. Only when that is not enough should you look into more complex solutions, as introducing more threads/processes will invariably increase the complexity of your program.
This question already has answers here:
Disable output buffering
(16 answers)
I am looking for a way to redirect output from standard output to a file without a delay.
Writing to a file seems OK using the following code:
import time
import sys, os

def test():
    j = 1
    while j < 10:
        time.sleep(1)
        print("Python is good .Iteration ", j)
        j += 1

if __name__ == "__main__":
    myFile = open("logFile.log", "w", 0)
    sys.stdout = myFile
    test()
However, this only writes to the file on completion of the code, i.e. after the 9th iteration. I want to know if we can write data to the file before the whole code completes and see the output in the file, maybe by doing a tail -f logFile.log.
Thanks in advance
A simple solution is to add the -u option to the python command to force unbuffered stdin, stdout and stderr:
python -u myscript.py
This is because nothing is flushing the output buffer.
Try adding this to your code once in a while:
sys.stdout.flush()
It's not perfect but should work.
Also, it's early in the morning and there might be a better solution than this, but I came up with this idea just now:
class log():
    def __init__(self, file):
        self.file = open(file, 'w', 0)
    def write(self, what):
        self.file.write(what)
        self.file.flush()
    def __getattr__(self, attr):
        return getattr(self.file, attr)

sys.stdout = log('logFile.log')
Haha, and yeah, that's the marked solution in the dupe, so I'll point the scores to that post :P Beaten to it by 30 sec :)
For every iteration, you must add this:
sys.stdout.flush()
This flushes the output buffer, pushing the buffered output out to the file immediately so the changes show up as they are written.
However, I don't see what's wrong with it appending all the data at the end, as you still get the same result and you won't be able to access that file externally while that program is using it anyway.
The output is buffered since it's more efficient to write larger chunks of data to a file.
You can either flush() the buffer explicitly or use sys.stderr as output. A third approach, which might be a better solution for a larger project, is to use the logging module included in Python. It allows you to emit messages with different log levels and to route those log levels differently, including flushing directly to a file.
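A minimal logging-based sketch (my own logger and file names): logging's FileHandler flushes each record as it is emitted, so tail -f sees the lines as they are produced.
import logging
import time

logger = logging.getLogger('progress')
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler('logFile.log'))

for j in range(1, 10):
    time.sleep(1)
    logger.info('Python is good. Iteration %d', j)  # written and flushed immediately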
I have a "not so" large file (~2.2GB) which I am trying to read and process...
import os
from collections import defaultdict

graph = defaultdict(dict)
error = open("error.txt", "w")

print "Reading file"
with open("final_edge_list.txt", "r") as f:
    for line in f:
        try:
            line = line.rstrip(os.linesep)
            tokens = line.split("\t")
            if len(tokens) == 3:
                src = long(tokens[0])
                destination = long(tokens[1])
                weight = float(tokens[2])
                #tup1 = (destination,weight)
                #tup2 = (src,weight)
                graph[src][destination] = weight
                graph[destination][src] = weight
            else:
                print "error ", line
                error.write(line + "\n")
        except Exception, e:
            string = str(Exception) + " " + str(e) + "==> " + line + "\n"
            error.write(string)
            continue
Am I doing something wrong?
It's been about an hour since the code started reading the file (and it's still reading).
The tracked memory usage is already 20GB.
Why is it taking so much time and memory?
To get a rough idea of where the memory is going, you can use the gc.get_objects function. Wrap your above code in a make_graph() function (this is best practice anyway), and then wrap the call to this function with a KeyboardInterrupt exception handler which prints out the gc data to a file.
def main():
    try:
        make_graph()
    except KeyboardInterrupt:
        write_gc()

def write_gc():
    from os.path import exists
    fname = 'gc.log.%i'
    i = 0
    while exists(fname % i):
        i += 1
    fname = fname % i
    with open(fname, 'w') as f:
        from pprint import pformat
        from gc import get_objects
        f.write(pformat(get_objects()))

if __name__ == '__main__':
    main()
Now whenever you ctrl+c your program, you'll get a new gc.log. Given a few samples you should be able to see the memory issue.
There are a few things you can do:
Run your code on a subset of the data. Measure the time required. Extrapolate to the full size of your data. That will give you an estimate of how long it will run.
counter = 0
with open("final_edge_list.txt", "r") as f:
    for line in f:
        counter += 1
        if counter == 200000:
            break
        try:
            ...
On 1M lines it runs ~8 sec on my machine, so for a 2.2GB file with about 100M lines it should run in roughly 15 min. Though once you exceed your available memory, that estimate no longer holds.
Your graph seems to be symmetric:
graph[src][destination] = weight
graph[destination][src] = weight
In your graph-processing code, use the symmetry of the graph to cut memory usage roughly in half (store each edge only once).
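One way to exploit that symmetry (a sketch, not the poster's code) is to store each undirected edge only once under a canonical (low, high) key:
def add_edge(graph, src, dst, weight):
    key = (src, dst) if src <= dst else (dst, src)  # canonical order
    graph[key] = weight

def get_weight(graph, src, dst):
    key = (src, dst) if src <= dst else (dst, src)
    return graph.get(key)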
Run profilers on your code using a subset of the data and see what happens there. The simplest would be to run:
python -m cProfile --sort cumulative youprogram.py
There is a good article on speed and memory profilers: http://www.huyng.com/posts/python-performance-analysis/
Python's numeric types use quite a lot of memory compared to other programming languages. On my setup it appears to be 24 bytes for each number:
>>> import sys
>>> sys.getsizeof(int())
24
>>> sys.getsizeof(float())
24
Given that you have hundreds of millions of lines in that 2.2 GB input file, the reported memory consumption should not come as a surprise.
To add another thing, some versions of the Python interpreter (including CPython 2.6) are known for keeping so-called free lists for allocation performance, especially for objects of type int and float. Once allocated, this memory will not be returned to the operating system until your process terminates. Also have a look at this question I posted when I first discovered this issue:
Python: garbage collection fails?
Suggestions to work around this include:
use a subprocess to do the memory hungry computation, e.g., based on the multiprocessing module
use a library that implements the functionality in C, e.g., numpy, pandas
use another interpreter, e.g., PyPy
You don't need graph to be a defaultdict(dict); use a plain dict instead: graph[src, destination] = weight and graph[destination, src] = weight will do, or even only one of them.
To reduce memory usage, try storing the resulting dataset in a scipy.sparse matrix; it consumes less memory and can be compressed.
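A sketch of the scipy.sparse idea (assuming scipy/numpy are available, with toy arrays standing in for the parsed triples):
import numpy as np
from scipy.sparse import coo_matrix

# toy data standing in for the parsed (src, destination, weight) triples
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 0])
weights = np.array([0.5, 1.2, 0.7])

n = int(max(rows.max(), cols.max())) + 1
# add each edge in both directions so the matrix stays symmetric
m = coo_matrix((np.concatenate([weights, weights]),
                (np.concatenate([rows, cols]), np.concatenate([cols, rows]))),
               shape=(n, n)).tocsr()
print(m[0, 1], m[1, 0])  # both are 0.5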
What do you plan to do with your nodes list afterwards?
I really don't know why my code is not saving the readings from the ADC and the GPS receiver to the file I open in the first line of the code. It saves only one record from both the ADC and the GPS receiver.
This is my code:
import MDM

f = open("cord+adc.txt", 'w')

def getADC():
    res = MDM.send('AT#ADC?\r', 0)
    res = MDM.receive(100)
    if(res.find('OK') != -1):
        return res
    else:
        return ""

def AcquiredPosition():
    res = MDM.send('AT$GPSACP\r', 0)
    res = MDM.receive(30)
    if(res.find('OK') != -1):
        tmp = res.split("\r\n")
        res = tmp[1]
        tmp = res.split(" ")
        return tmp[1]
    else:
        return ""

while (1):
    cordlist = []
    adclist = []
    p = AcquiredPosition()
    res = MDM.receive(60)
    cordlist.append(p)
    cordlist.append("\r\n")
    f.writelines(cordlist)
    q = getADC()
    res = MDM.receive(60)
    adclist.append(q)
    adclist.append("\r\n")
    f.writelines(adclist)
and this is the content of the file called "cord+adc.txt":
174506.000,2612.7354N,05027.5971E,1.0,23.1,3,192.69,0.18,0.09,191109,07
#ADC: 0
If there is another way to write my code, please advise me, or just point me to the error in the above code.
Thanks for any suggestions.
You have two problems here. First, you are not closing your file. The bigger problem, though, is that your while loop will run forever (or until something else goes wrong in your program): there is no terminating condition. You loop while 1 but never explicitly break out of the loop. I assume that when the function AcquiredPosition() returns an empty string you want the loop to terminate, so I added the code if not p: break after the call to that function; if it returns an empty string, the loop terminates and the file is closed thanks to the with statement. You should restructure your while loop like below:
with open("cord+adc.txt", 'w') as f:
while (1):
cordlist = []
adclist = []
p = AcquiredPosition()
if not p:
break
res = MDM.receive(60)
cordlist.append(p)
cordlist.append("\r\n")
f.writelines(cordlist)
q = getADC()
res = MDM.receive(60)
adclist.append(q)
adclist.append("\r\n")
f.writelines(adclist)
Because you never explicitly flush() or close() your file, there's no guarantee at all about what will wind up in it. You should probably flush() it after each packet, and you must explicitly close() it when you wish your program to exit.
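For instance, a sketch of the flush-per-packet idea (my own restructuring, reusing the question's AcquiredPosition() and getADC() helpers and leaving out the MDM.receive() calls):
f = open("cord+adc.txt", 'w')
try:
    while True:
        f.writelines([AcquiredPosition(), "\r\n"])
        f.writelines([getADC(), "\r\n"])
        f.flush()  # push this packet out to disk right away
finally:
    f.close()      # always close the file on the way out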
If your modem connection is a socket,
make sure your socket is functioning by calling getADC() and AcquiredPosition() directly from the interactive interpreter. Just put the while(1) loop in a function (main() is the common practice), then import the module from the interactive prompt.
Your example is missing the initialization of the socket object, MDM. Make sure it is correctly set up to the appropriate address, with code like:
import socket
MDM = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
MDM.connect((HOST, PORT))
If MDM doesn't refer to a TCP socket, you can still try calling the mentioned methods interactively.
I don't see you closing the file anywhere. Add this as the last line of your code:
f.close()
That should contribute to fixing your problem. I don't know much about sockets, etc., so I can't help you there.
When you write a line into a file, it is actually buffered in memory first (this is the C way of handling files). When the maximum size for the buffer is hit, or you close the file, the buffer is emptied into the specified file.
From the explanation so far I think you have a scary picture of file manipulation. Now, the best way to solve any and all problems is to flush the buffer's content to the file (meaning that after the flush() function is executed and the buffer is empty, you have all the content safely saved in your file). Of course it would be a good thing to close the file as well, but in an infinite loop that's hardly possible (you could hardcode an event maybe, send it to the actual function, and when the infinite loop stops - closing the program - close the file as well; just a suggestion of course, the flush() call should do the trick).