Update strings in a text file at a specific location - python

I would like to find a better solution to achieve the following three steps:
read strings at a given row
update strings
write the updated strings back
Below are my code which works but I am wondering is there any better (simple) solutions?
new='99999'
f=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP','r+')
lines=f.readlines()
#the row number we want to update is given, so just load the content
x = lines[95]
print(x)
f.close()
#replace
f1=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP')
con = f1.read()
print con
con1 = con.replace(x[2:8],new) #only certain columns in this row needs to be updated
print con1
f1.close()
#write
f2 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'w')
f2.write(con1)
f2.close()
Thanks!
UPDATE: get an idea from jtmoulia this time it becomes easier
def replace_line(file_name, line_num, col_s, col_e, text):
lines = open(file_name, 'r').readlines()
temp=lines[line_num]
temp = temp.replace(temp[col_s:col_e],text)
lines[line_num]=temp
out = open(file_name, 'w')
out.writelines(lines)
out.close()

The problem with textual data, even when tabulated, is that the byte offsets are not predictable. For example, when representing numbers with strings you have one byte per digit, whereas when using binary (e.g. two's complement) you always need four or eight bytes either for small and large integers.
Nevertheless, if your text format is strict enough you can get along by replacing bytes without changing the size of the file, you can try using the standard mmap module. With it, you'll be able to treat a file as a mutable byte string and modify parts of it inplace and letting the kernel do the file saving for you.
Otherwise, whatever of the other answers are much better suited for the problem.

Well, to begin with you don't need to keep reopening and reading from the file every time. The r+ mode allows you to read and write to the given file.
Perhaps something like
with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+') as f:
lines = f.readlines()
#... Perform whatever replacement you'd like on lines
f.seek(0)
f.writelines(lines)
Also, Editing specific line in text file in python

When I had to do something similar (for a Webmin customization), I did it entirely in PERL because that's what the Webmin framework used, and I found it quite easy. I assume (but don't know for sure) there are equivalent things in Python. First read the entire file into memory all at once (the PERL way to do this is probably called "slurp"). (This idea of holding the entire file in memory rather than just one line used to make little sense {or even be impossible}. But these days RAM is so large it's the only way to go.) Then use the split operator to divide the file into lines and put each line in a different element of a giant array. You can then use the desired line number as an index into the array (remember array indices usually start with 0). Finally, use "regular expression" processing to change the text of the line. Then change another line, and another, and another (or make another change to the same line). When you're all done, use join to put all the lines in the array back together into one giant string. Then write the whole modified file out.
While I don't have the complete code handy, here's an approximate fragment of some of the PERL code so you can see what I mean:
our #filelines = ();
our $lineno = 43;
our $oldstring = 'foobar';
our $newstring = 'fee fie fo fum';
$filelines[$lineno-1] =~ s/$oldstring/$newstring/ig;
# "ig" modifiers for case-insensitivity and possible multiple occurences in the line
# use different modifiers at the end of the s/// construct as needed

FILENAME = 'C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP'
lines = list(open(FILENAME))
lines[95][2:8] = '99999'
open(FILENAME, 'w').write(''.join(lines))

Related

How to prevent loss/miscount of data using File I/O with buffer/size hint Python

I am new to Python and writing several methods to process large log files (bigger than 5GB). THrough the research i did, I saw a lot of people using "with open" and specificying a size hint/buffer like so:
with open(filename, 'rb', buffering=102400) as f:
time_data_count = 0
logbinset = set()
#def f(n):print('{:0b}'.format(n)) #check what non iteratable function means
search_pattern = regex.compile(b'\d+\((.)+\)\s+\d+\((.)+\)')
for line in f:
if search_pattern.search(line):
x = search_pattern.search(line)
#print(x.group(1)+" "+ x.group(2))
print((x.group(1)).decode())
print((x.group(2)).decode())
Another method (this one always returns none for some reason. Could use some help finding out why:
with open(filename, 'rb') as f:
#text = []
while True:
memcap = f.read(102400)
if not memcap:
break
text = re.search(b'\d+\(.+\)\s+\d+\(.+\)',memcap)
if text is None:
print("none")
print(text.group())
In these method, I am trying to extract regex patterns from a 6GB log file. My question is, I am worried using buffers to chop the file into chunks could result in situations where the line containing the pattern is chopped in half which would result in some data being missing.
How do I make sure line intergrity is kept? How do I make sure it only breaks up my file at the end of a line? How do I make sure I don't lose data in between chunks? Or does the "with open" and read(102400) method ensure lines are not split in half when breaking the file into chunks.
First of all dont use 'rb' use just 'r' wich is used for text. 'rb' is for binary data.
the read method reads as many characters as you specify, so you woul end up with chopped lines. Use readline.
The first variant is the correct one: set a buffer size when opening the file to get fewer read operations without loosing matches that span read blocks.
If you are concerned with runtime it would be a good idea to just search once in each line and not one search to determine if there is a match and then doing the exact same search again to get at the values:
regex = re.compile(rb"\d+\((.)+\)\s+\d+\((.)+\)")
with open(filename, "rb", buffering=102400) as lines:
for line in lines:
match = regex.search(line)
if match:
print((match.group(1)).decode())
print((match.group(2)).decode())
The for loop and the filtering can be moved into functions that are implemented in C (in CPython):
regex = re.compile(rb"\d+\((.)+\)\s+\d+\((.)+\)")
with open(filename, "rb", buffering=102400) as lines:
for match in filter(bool, map(regex.search, lines)):
print((match.group(1)).decode())
print((match.group(2)).decode())
On a 64 bit Python you could also try the mmap module to map the file into memory and apply the regular expression to the whole content.

How to read a very large integer (1000+ digits) that is saved in **multiple lines** in a text file using Python?

Suppose I have a very large integer (say, of 1000+ digits) that I've saved into a text file named 'large_number.txt'. But the problem is the integer has been divided into multiple lines in the file, i.e. like the following:
47451445736001306439091167216856844588711603153276
70386486105843025439939619828917593665686757934951
62176457141856560629502157223196586755079324193331
64906352462741904929101432445813822663347944758178
92575867718337217661963751590579239728245598838407
58203565325359399008402633568948830189458628227828
80181199384826282014278194139940567587151170094390
35398664372827112653829987240784473053190104293586
86515506006295864861532075273371959191420517255829
71693888707715466499115593487603532921714970056938
54370070576826684624621495650076471787294438377604
Now, I want to read this number from the file and use it as a regular integer in my program. I tried the following but I can't.
My Try (Python):
with open('large_number.txt') as f:
data = f.read().splitlines()
Is there any way to do this properly in Python 3.6 ? Or what best can be done in this situation?
Just replace the newlines with nothing, then parse:
with open('large_number.txt') as f:
data = int(f.read().replace('\n', ''))
If you might have arbitrary (ASCII) whitespace and you want to discard all of it, switch to:
import string
killwhitespace = str.maketrans(dict.fromkeys(string.whitespace))
with open('large_number.txt') as f:
data = int(f.read().translate(killwhitespace))
Either way that's significantly more efficient than processing line-by-line in this case (because you need all the lines to parse, any line-by-line solution would be ugly), both in memory and runtime.
You can use this code:
with open('large_number.txt', 'r') as myfile:
data = myfile.read().replace('\n', '')
number = int(data)
You can use str.rstrip to remove the trailing newline characters and use str.join to join the lines into one string:
with open('large_number.txt') as file:
data = int(''.join(line.rstrip() for line in file))

How do I write a simple, Python parsing script?

Most of what I do involves writing simple parsing scripts that reads search terms from one file and searches, line by line, another file. Once the search term is found, the line and sometimes the following line are written to another output file. The code I use is rudimentary and likely crude.
#!/usr/bin/env python
data = open("data.txt", "r")
search_terms = ids.read().splitlines()
data.close()
db = open("db.txt", "r")
output = open("output.txt", "w")
for term in search_terms:
for line in db:
if line.find(term) > -1:
next_line = db.next()
output.write(">" + head + "\n" + next_line)
print("Found %s" % term)
There are a few problems here. First, I don't think it's the most efficient and fastest to search line by line, but I'm not exactly sure about that. Second, I often run into issues with cursor placement and the cursor doesn't reset to the beginning of the file when the search term is found. Third, while I am usually confident that all of the terms can be found in the db, there are rare times when I can't be sure, so I would like to write to another file whenever it iterates through the entire db and can't find the term. I've tried adding a snippet that counts the number of lines of the db so if the find() function gets to the last line and the term isn't found, then it outputs to another "not found" file, but I haven't been able to get my elif and else loops right.
Overall, I'd just like any hints or corrections that could make this sort of script more efficient and robust.
Thanks.
Unless it's a really big file, why not iterate line by line? If the input file's size is some significant portion of your machine's available resources (memory), then you might want to look into buffered input and other, more low-level abstractions of what the computer is doing. But if you're talking about a few hundred MB or less on a relatively modern machine, let the computer do the computing ;)
Off the bat you might want to get into the habit of using the built-in context manager with. For instance, in your snippet, you don't have a call to output.close().
with open('data.txt', 'r') as f_in:
search_terms = f_in.read().splitlines()
Now search_terms is a handle to a list that has each line from data.txt as a string (but with the newline characters removed). And data.txt is closed thanks to with.
In fact, I would do that with the db.txt file, also.
with open('db.txt', 'r') as f_in:
lines = f_in.read().splitlines()
Context managers are cool.
As a side note, you could open your destination file now, and do your parsing and results-tracking with it open the whole time, but I like leaving as many files closed as possible for as long as possible.
I would suggest setting the biggest object on the outside of your loop, which I'm guessing is db.txt contents. The outermost loop only usually only gets iterated once, so might as well put the biggest thing there.
results = []
for i, line in enumerate(lines):
for term in search_terms:
if term in line:
# Use something not likely to appear in your line as a separator
# for these "second lines". I used three pipe characters, but
# you could just as easily use something even more random
results.append('{}|||{}'.format(line, lines[i+1]))
if results:
with open('output.txt', 'w') as f_out:
for result in results:
# Don't forget to replace your custom field separator
f_out.write('> {}\n'.format(result.replace('|||', '\n')))
else:
with open('no_results.txt', 'w') as f_out:
# This will write an empty file to disk
pass
The nice thing about this approach is each line in db.txt is checked once for each search_term in search_terms. However, the downside is that any line will be recorded for each search term it contains, ie., if it has three search terms in it, that line will appear in your output.txt three times.
And all the files are magically closed.
Context managers are cool.
Good luck!
search_terms keeps whole data.txt in memory. That it's not good in general but in this case it's not quite bad.
Looking line-by-line is not sufficient but if the case is simple and files are not too big it's not a big deal. If you want more efficiency you should sort data.txt file and put this to some tree-like structure. It depends on data which is inside.
You have to use seek to move pointer back after using next.
Propably the easiest way here is to generate two lists of lines and search using in like:
`db = open('db.txt').readlines()
db_words = [x.split() for x in db]
data = open('data.txt').readlines()
print('Lines in db {}'.format(len(db)))
for item in db:
for words in db_words:
if item in words:
print("Found {}".format(item))`
Your key issue is that you may be looping in the wrong order -- in your code as posted, you'll always exhaust the db looking for the first term, so after the first pass of the outer for loop db will be at end, no more lines to read, no other term will ever be found.
Other improvements include using the with statement to guarantee file closure, and a set to track which search terms were not found. (There are also typos in your posted code, such as opening a file as data but then reading it as ids).
So, for example, something like:
with open("data.txt", "r") as data:
search_terms = data.read().splitlines()
missing_terms = set(search_terms)
with open("db.txt", "r") as db, open("output.txt", "w") as output:
for line in db:
for term in search_terms:
if term in line:
missing_terms.discard(term)
next_line = db.next()
output.write(">" + head + "\n" + next_line)
print("Found {}".format(term))
break
if missing_terms:
diagnose_not_found(missing_terms)
where the diagnose_not_found function does whatever you need to do to warn the user about missing terms.
There are assumptions embedded here, such as the fact that you don't care if some other search term is present in a line where you've found a previous one, or the very next one; they might take substantial work to fix if not applicable and it will require that you edit your Q with a very complete and unambiguous list of specifications.
If your db is actually small enough to comfortably fit in memory, slurping it all in as a list of lines once and for all would allow easier accommodation for more demanding specs (as in that case you can easily go back and forth, while iterating on a file means you can only go forward one line at a time), so if your specs are indeed more demanding please also clarify if this crucial condition hold, or rather you need this script to process potentially humungous db files (say gigabyte-plus sizes, so as to not "comfortably fit in memory", depending on your platform of course).

python join "large" file

In python I have read in a file into a list using file.readlines() , later on after some logic, I would like to put it back together in a string using fileString = ''.join(file), for some reason, even without a print function, it prints the fileString out to the console up to a certain point, then it just stops. It does not run the rest of the program which is not useful for me.
Why does join do this, how do I perhaps pre-allocate how much memory I would like my list/string to use so that it does not stop. Or some other solution too.
Thank you
File is your file pointer in memory. When you attempt to join on it, you don't actually have a string to work with.
How about this?
with open(file, 'rb') as myfile:
strings = myfile.readlines()
# do your stuff to strings
filestring = ''.join(strings)
Note that strings is a list of lines like this:
['my line\n', 'my other line!\n']
And as such, a large file will require quite a bit of memory. You may be better served by building a mini filter.
You should also consider what you are going to do with the resulting string. If you just want to write the contents back to a file, there is no need to join the parts first, you can use file.writelines(strings) directly.

How can I change a huge file into csv in python

I'm a beginner in python. I have a huge text file (hundreds of GB) and I want to convert the file into csv file. In my text file, I know the row delimiter is a string "<><><><><><><>". If a line contains that string, I want to replace it with ". Is there a way to do it without having to read the old file and rewriting a new file.
Normally I thought I need to do something like this:
fin = open("input", "r")
fout = open("outpout", "w")
line = f.readline
while line != "":
if line.contains("<><><><><><><>"):
fout.writeline("\"")
else:
fout.writeline(line)
line = f.readline
but copying hundreds of GB is wasteful. Also I don't know if open will eat lots of memory (does it treat file handler as a stream?)
Any help is greatly appreciated.
Note: an example of the file would be
file.txt
<><><><><><><>
abcdefeghsduai
asdjliwa
1231214 ""
<><><><><><><>
would be one row and one column in csv.
#richard-levasseur
I agree, sed seems like the right way to go. Here's a rough cut at what the OP describes:
sed -i -e's/<><><><><><><>/"/g' foo.txt
This will do the replacement in-place in the existing foo.txt. For that reason, I recommend having the original file under some sort of version control; any of the DVCS should fit the bill.
Yes, open() treats the file as a stream, as does readline(). It'll only read the next line. If you call read(), however, it'll read everything into memory.
Your example code looks ok at first glance. Almost every solution will require you to copy the file elsewhere. Its not exactly easy to modify the contents of a file inplace without a 1:1 replacement.
It may be faster to use some standard unix utilities (awk and sed most likely), but I lack the unix and bash-fu necessary to provide a full solution.
It's only wasteful if you don't have disk to spare. That is, fix it when it's a problem. Your solution looks ok as a first attempt.
It's not wasteful of memory because a file handler is a stream.
Reading lines is simply done using a file iterator:
for line in fin:
if line.contains("<><><><><><><>"):
fout.writeline("\"")
Also consider the CSV writer object to write CSV files, e.g:
import csv
writer = csv.writer(open("some.csv", "wb"))
writer.writerows(someiterable)
With python you will have to create a new file for safety sake, it will cause alot less headaches than trying to write in place.
The below listed reads your input 1 line at a time and buffers the columns (from what I understood of your test input file was 1 row) and then once the end of row delimiter is hit it will write that buffer to disk, flushing manually every 1000 lines of the original file. This will save some IO as well instead of writing every segment, 1000 writes of 32 bytes each will be faster than 4000 writes of 8 bytes.
fin = open(input_fn, "rb")
fout = open(output_fn, "wb")
row_delim = "<><><><><><><>"
write_buffer = []
for i, line in enumerate(fin):
if not i % 1000:
fout.flush()
if row_delim in line and i:
fout.write('"%s"\r\n'%'","'.join(write_buffer))
write_buffer = []
else:
write_buffer.append(line.strip())
Hope that helps.
EDIT: Forgot to mention, while using .readline() is not a bad thing don't use .readlines() which will go and read the entire content of the file into a list containing each line which is incredibly inefficient. Using the built in iterator that comes with a file object is the best memory usage and speed.
#Constatin suggests that if you would be satisfied with replacing '<><><><><><><>\n' by '" \n'
then the replacement string is the same length, and in that case you can craft a solution to in-place editing with mmap. You will need python 2.6. It's vital that the file is opened in the right mode!
import mmap, os
CHUNK = 2**20
oldStr = ''
newStr = '" '
strLen = len(oldStr)
assert strLen==len(newStr)
f = open("myfilename", "r+")
size = os.fstat(f.fileno()).st_size
for offset in range(0,size,CHUNK):
map = mmap.mmap(f.fileno(),
length=min(CHUNK+strLen,size-offset), # not beyond EOF
offset=offset)
index = 0 # start at beginning
while 1:
index = map.find(oldStr,index) # find next match
if index == -1: # no more matches in this map
break
map[index:index+strLen] = newStr
f.close()
This code is not debugged! It works for me on a 3 MB test case, but it may not work on a large ( > 2GB) file - the mmap module still seems a bit immature, so I wouldn't rely on it too much.
Looking at the bigger picture, from what you've posted it isn't clear that your file will end up as valid CSV. Also be aware that the tool you're planning to use to actually process the CSV may be flexible enough to deal with the file as it stands.
If you're delimiting fields with double quotes, it looks like you need to escape the double quotes you have occurring in your elements (for example 1231214 "" will need to be \n1231214 \"\").
Something like
fin = open("input", "r")
fout = open("output", "w")
for line in fin:
if line.contains("<><><><><><><>"):
fout.writeline("\"")
else:
fout.writeline(line.replace('"',r'\"')
fin.close()
fout.close()
[For the problem exactly as stated] There's no way that this can be done without copying the data, in python or any other language. If your processing always replaced substrings with new substrings of equal length, maybe you could do it in-place. But whenever you replace <><><><><><><> with " you are changing the position of all subsequent characters in the file. Copying from one place to another is the only way to handle this.
EDIT:
Note that the use of sed won't actually save any copying...sed doesn't really edit in-place either. From the GNU sed manual:
-i[SUFFIX]
--in-place[=SUFFIX]
This option specifies that files are to be edited in-place. GNU sed does this by creating a temporary file and sending output to this file rather than to the standard output.
(emphasis mine.)

Categories