I wrote a routine that browses through a bunch of files and adds database entries (if not already present) according to the data found in each line of those files.
for root, dirs, files in walk_dir(path):
    for file_name in files:
        file_path = join_path(root, file_name)
        with transaction.commit_on_success():
            with open(file_path, "r") as f:
                for i, line in enumerate(f):
                    handle_line(line)
def handle_line(line):
    phrase, translated_phrase, context = get_params(line)

    try:
        ph = Phrase.objects.get(name=phrase)
    except Phrase.DoesNotExist:
        ph = Phrase(name=phrase)
        ph.save()

    try:
        tr = Translation.objects.get(phrase=ph, name=translated_phrase)
    except Translation.DoesNotExist:
        tr = Translation(phrase=ph, name=translated_phrase)
        tr.save()

    try:
        tm = TMTranslation.objects.get(translation=tr, context=context)
    except TMTranslation.DoesNotExist:
        tm = TMTranslation(translation=tr, context=context)
        tm.save()
There might be a lot of data to process (maybe 1000 files with several thousand lines each). But still, I don't see why I keep running into memory problems. Shouldn't memory be freed at least after each file (i.e. after each transaction)?
But what I am experiencing is that this process slowly eats up all my memory and then starts using swap space as well. So what am I doing wrong?
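Incidentally, the lookup-then-create pattern in handle_line is what Django's get_or_create() does in a single call; a minimal sketch of the same function written that way (same models and get_params helper assumed, and this doesn't change the memory behaviour by itself):

def handle_line(line):
    phrase, translated_phrase, context = get_params(line)

    # get_or_create returns (instance, created_flag); only the instance is needed here
    ph, _ = Phrase.objects.get_or_create(name=phrase)
    tr, _ = Translation.objects.get_or_create(phrase=ph, name=translated_phrase)
    TMTranslation.objects.get_or_create(translation=tr, context=context)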
Related
I am trying to create some temporary files and perform some operations on them inside a loop. Then I access the information in all of the temporary files and do some operations with that information. For simplicity, here is code that reproduces my issue:
import tempfile

tmp_files = []
for i in range(40):
    tmp = tempfile.NamedTemporaryFile(suffix=".txt")
    with open(tmp.name, "w") as f:
        f.write(str(i))
    tmp_files.append(tmp.name)

string = ""
for tmp_file in tmp_files:
    with open(tmp_file, "r") as f:
        data = f.read()
        string += data

print(string)
ERROR:
    with open(tmp_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpynh0kbnw.txt'
When I look at the /tmp directory (with a time.sleep(2) in the loop) I see that the files are deleted and only one is preserved, hence the error.
Of course I could keep all the files with the flag tempfile.NamedTemporaryFile(suffix=".txt", delete=False). But that is not the idea: I would like the temporary files to exist only for the running time of the script. I could also delete the files with os.remove. But my question is more about why this happens, because I expected the files to stay around until the end of the run, since I don't close them during execution (or do I?).
Thanks a lot in advance.
tdelaney has already answered your actual question.
I would just like to offer you an alternative to NamedTemporaryFile: why not create a temporary folder which is removed (with all the files in it) at the end of the script?
Instead of using a NamedTemporaryFile, you could use tempfile.TemporaryDirectory. The directory will be deleted when closed.
The example below uses the with statement, which closes the file handles automatically when the block ends (see John Gordon's comment).
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_folder:
    tmp_files = []
    for i in range(40):
        tmp_file = os.path.join(temp_folder, f"{i}.txt")
        with open(tmp_file, "w") as f:
            f.write(str(i))
        tmp_files.append(tmp_file)

    string = ""
    for tmp_file in tmp_files:
        with open(tmp_file, "r") as f:
            data = f.read()
            string += data

    print(string)
By default, a NamedTemporaryFile deletes its file when closed. It's a bit subtle, but tmp = tempfile.NamedTemporaryFile(suffix=".txt") in the loop causes the previous file object to be garbage-collected and closed, and therefore deleted, when tmp is reassigned. One option is to use the delete=False parameter. Or just keep the file open and seek to the beginning after the write.
NamedTemporaryFile is already a file object - you can write to it directly without reopening it. Just make sure the mode is "w+" (write plus, and text, not binary mode). Put the code in a try/finally block to make sure the files are really deleted at the end.
import tempfile

tmp_files = []
try:
    for i in range(40):
        tmp = tempfile.NamedTemporaryFile(suffix=".txt", mode="w+")
        tmp.write(str(i))
        tmp.seek(0)
        tmp_files.append(tmp)

    string = ""
    for tmp_file in tmp_files:
        data = tmp_file.read()
        string += data
finally:
    for tmp_file in tmp_files:
        tmp_file.close()

print(string)
I'm trying to read some lines of a file (as done here) and having a problem where, even though I can count the lines of the file, the readline() method appears to return nothing.
Code snippet below (I have included additional code showing how I initially read these files into the system (via FTP download with ftplib) in case that is relevant to the problem, since I don't really know what could be causing this weirdness). The with open(localname)... line near the bottom is where I start trying to read the lines.
# connect to ftp location
ftp = ftplib.FTP(MY_IP)
ftp.login(CREDS["source"]["ftp_creds"]["username"], CREDS["source"]["ftp_creds"]["password"])
print(ftp.pwd())
print(ftp.dir())
ftp.cwd(CONF["reporting_configs"]["from"]["share_folder_path"])
print(ftp.pwd())
print(ftp.dir())

# download all files from ftp location
print(f"Retrieving files from {CONF['reporting_configs']['from']['share_folder_path']}...")
files = ftp.nlst()
files = [f for f in files if f not in ['.', '..']]
pp.pprint(files)

for f in files:
    # timestamp = int(time.time())
    # basename = os.path.split(f)[0]
    localname = os.path.join(CONF["stages_base_dir"], CONF["stages"]["imported"], f"{f}")
    fd = open(localname, 'wb')
    ftp.retrbinary(f"RETR {f}", callback=fd.write)
    fd.close()

    SAMPLE_LINES = 5
    with open(localname) as newfile:
        print(f"Sampling {newfile.name}")
        for i, _ in enumerate(newfile):
            pass
        lines = i + 1
        print(lines)
        for i in range(min(SAMPLE_LINES, lines)):
            l = newfile.readline()
            print(l)
The output looks like...
<after some other output>
.
.
.
Retrieving files from /test_source...
['TEST001.csv', 'TEST002.csv']
Sampling /path/to/stages/imported/TEST001.csv
5
Sampling /path/to/stages/imported/TEST002.csv
5
Notice that it is able to recognize that there are more than 0 lines in each file, but printing the result of readline() shows nothing, yet viewing the text file on my system I can see that it is not empty.
Anyone know what could be going on here? Anything more I can do to debug?
After testing on my machine: if you add newfile.seek(0) to reset the file pointer back to the start of the file, it works.
SAMPLE_LINES = 5
with open("test.txt") as newfile:
    print(f"Sampling {newfile.name}")
    for i, _ in enumerate(newfile):
        pass
    lines = i + 1
    print(lines)

    newfile.seek(0)

    for i in range(min(SAMPLE_LINES, lines)):
        l = newfile.readline()
        print(l)
The only reason I can think of that this fixes it is that the for loop over enumerate(newfile) is itself reading the file line by line, and is therefore moving the file pointer to the end, so the subsequent readline() calls return empty strings.
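A quick way to see this for yourself (a small illustration added here, not part of the original code) is to print the file position with tell() before and after the counting loop:

with open("test.txt") as newfile:
    print(newfile.tell())       # 0: the pointer starts at the beginning
    for i, _ in enumerate(newfile):
        pass
    print(newfile.tell())       # now at the end of the file, so readline() returns ''
    newfile.seek(0)
    print(newfile.readline())   # the first line is readable again after seeking back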
I have the following situation:
- different users (all on Windows OS) run a Python script that can either read from or write to a pickle file located in a shared folder.
- the "system" is designed in such a way that only one user at a time will be writing to the file (therefore no race condition of multiple processes trying to WRITE to the file at the same time).
- the basic code to write would be this:
with open(path + r'\final_db.p', 'wb') as f:
    pickle.dump((x, y), f)
- while the code to read would be:
with open(path + r'\final_db.p', 'rb') as f:
    x, y = pickle.load(f)
- x is a list of 5K or more elements, where each element is a class instance containing many attributes and functions; y is a date.
QUESTION:
Am I correct in assuming that there is a race condition when a reading and a writing process overlap? And that the reading one can end up with a corrupted file?
PROPOSED SOLUTIONS:
1. A possible solution I thought of is using filelock:
code to write:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)

with lock:
    with open(file_path, 'wb') as f:
        pickle.dump((x, y), f)
code to read:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)

with lock:
    with open(file_path, 'rb') as f:
        x, y = pickle.load(f)
This solution should work (??), but if a process crashes, the file remains locked until the file_path + '.lock' file is removed.
2. Another solution could be to use portalocker:
code to write:
with open(path + r'\final_db.p', 'wb') as f:
    portalocker.lock(f, portalocker.LOCK_EX)
    pickle.dump((x, y), f)
code to read:
segnale = True
while segnale:
    try:
        with open(path + r'\final_db.p', 'rb') as f:
            x, y = pickle.load(f)
        segnale = False
    except:
        pass
The reading process, if another process started writing before it, will keep looping until the file is unlocked (the except catches the PermissionError).
If the writing process started after the reading process, the read should loop again if the file is corrupt.
What I am not sure about is whether the reading process could end up reading a partially written file.
Any advice? Better solutions?
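One pattern that is often used to make sure a reader never sees a half-written file (a sketch of an alternative on my side, not something required by either locking library) is to write the pickle to a temporary file in the same folder and then swap it into place with os.replace, which is atomic on the same volume, so a reader always sees either the complete old file or the complete new one:

import os
import pickle
import tempfile

def atomic_dump(obj, target_path):
    # write to a temporary file in the same directory, then atomically replace the target
    folder = os.path.dirname(target_path)
    fd, tmp_path = tempfile.mkstemp(dir=folder)
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(obj, f)
        os.replace(tmp_path, target_path)
    except BaseException:
        os.remove(tmp_path)
        raise

atomic_dump((x, y), path + r'\final_db.p')

Note that on Windows the replace can still fail if another process happens to have the destination open at that exact moment, so a reader-side retry (or one of the locks above) is still worth keeping.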
I have a (large) set of XML files that I want to search for a set of strings that must all be present within them - I am trying to use the following Python code to do this:
import collections

thestrings = []
with open('Strings.txt') as f:
    for line in f:
        text = line.strip()
        thestrings.append(text)

print('Searching for:')
print(thestrings)
print('Results:')

try:
    from os import scandir
except ImportError:
    from scandir import scandir

def scantree(path):
    """Recursively yield DirEntry objects for given directory."""
    for entry in scandir(path):
        if entry.is_dir(follow_symlinks=False) and (not entry.name.startswith('.')):
            yield from scantree(entry.path)
        else:
            yield entry

if __name__ == '__main__':
    for entry in scantree('//path/to/folder'):
        if ('.xml' in entry.name) and ('.zip' not in entry.name):
            with open(entry.path) as f:
                data = f.readline()
                if (thestrings[0] in data):
                    print('')
                    print('****** Schema found in: ', entry.name)
                    print('')
                    data = f.read()
                    if (thestrings[1] in data) and (thestrings[2] in data) and (thestrings[3] in data):
                        print('Hit at:', entry.path)
    print("Done!")
Where Strings.txt is a file containing the strings I am interested in finding, and the first line is the schema URI.
This seems to run OK at first, but after some seconds it gives me:
FileNotFoundError: [WinError 3] The system cannot find the path specified: //some/path
Which is confusing me, since the path is being built at runtime.
Note, if I instrument the code as follows:
with open(entry.path) as f:
    data = f.readline()
    if (thestrings[0] in data):
To become:
with open(entry.path) as f:
    print(entry.name)
    data = f.readline()
    if (thestrings[0] in data):
Then I see a number of potential files being found before the error occurs.
I realised that my script is finding some very long UNC path names, too long for Windows it seems, so I am now also checking the path length before attempting to open the file, as follows:
if name.endswith('.xml'):
    fullpath = os.path.join(root, name)
    if (len(fullpath) > 255):  # Too long for Windows!
        print('File-extension-based candidate: ', fullpath)
    else:
        if os.path.isfile(fullpath):
            with open(fullpath) as f:
                data = f.readline()
                if (thestrings[0] in data):
                    print('Schema-based candidate: ', fullpath)
Note, I also decided to check whether the file really is a file, and I altered my code to use os.walk, as suggested above, along with simplifying the check for the .xml file extension by using .endswith().
Everything now seems to work OK...
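As an aside (this is an assumption on my part, not something from the code above), instead of skipping the long candidates entirely, Windows will usually accept paths beyond the 260-character limit when they are given in extended-length form, i.e. prefixed with \\?\ (or \\?\UNC\server\share\... for UNC paths). A small sketch of a hypothetical helper, reusing fullpath from the snippet above:

import os

def extended_path(path):
    # convert an absolute path to Windows extended-length form so open() accepts long paths
    path = os.path.abspath(path)
    if path.startswith('\\\\'):     # UNC path of the form \\server\share\...
        return '\\\\?\\UNC\\' + path[2:]
    return '\\\\?\\' + path

if len(fullpath) > 255:
    fullpath = extended_path(fullpath)  # then open(fullpath) as usual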
I have a series of .csv files with some data, and I want a Python script to open them all, do some preprocessing, and upload the processed data to my postgres database.
I have it mostly complete, but my upload step isn't working. I'm sure it's something simple that I'm missing, but I just can't find it. I'd appreciate any help you can provide.
Here's the code:
import psycopg2
import sys
from os import listdir
from os.path import isfile, join
import csv
import re
import io

try:
    con = db_connect("dbname = '[redacted]' user = '[redacted]' password = '[redacted]' host = '[redacted]'")
except:
    print("Can't connect to database.")
    sys.exit(1)

cur = con.cursor()
upload_file = io.StringIO()

file_list = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in file_list:
    id_match = re.search(r'.*-(\d+)\.csv', file)
    if id_match:
        id = id_match.group(1)
        file_name = format(id_match.group())
        with open(mypath + file_name) as fh:
            id_reader = csv.reader(fh)
            next(id_reader, None)  # Skip the header row
            for row in id_reader:
                # [stuff goes here to get desired values from file]
                if upload_file.getvalue() != '': upload_file.write('\n')
                upload_file.write('{0}\t{1}\t{2}'.format(id, [val1], [val2]))

print(upload_file.getvalue())  # prints output that looks like I expect it to
                               # with thousands of rows that seem to have the right values in the right fields

cur.copy_from(upload_file, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))
con.commit()

if con:
    con.close()
This runs without error, but a select query in psql still shows no records in the table. What am I missing?
Edit: I ended up giving up and writing the data to a temporary file, and then uploading that file. This worked without any trouble... I'd obviously rather not have the temporary file, though, so I'm happy to have suggestions if someone sees the problem.
When you write to an io.StringIO (or any other file) object, the file pointer remains at the position of the last character written. So, when you do
f = io.StringIO()
f.write('1\t2\t3\n')
s = f.readline()
the file pointer stays at the end of the file and s contains an empty string.
To read (not getvalue) the contents, you must reposition the file pointer to the beginning, e.g. using seek(0):
upload_file.seek(0)
cur.copy_from(upload_file, '[my_table]', columns = ('id', 'col_1', 'col_2'))
This allows copy_from to read from the beginning and import all the lines in your upload_file.
Don't forget that you read and keep all the files in memory, which might work for a single small import, but may become a problem when doing large imports or multiple imports in parallel.
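If memory does become a concern, one option (a sketch of an alternative, not part of the answer above) is to build and copy a separate buffer per input file, reusing file_list, cur and con from the question's code, so that only one file's rows are held in memory at a time:

import io

for file in file_list:
    buf = io.StringIO()
    # ... write this file's processed rows into buf, one tab-separated line per row ...
    buf.seek(0)
    cur.copy_from(buf, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))

con.commit()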