How do I write a simple, Python parsing script?

Most of what I do involves writing simple parsing scripts that read search terms from one file and search, line by line, through another file. Once a search term is found, the line and sometimes the following line are written to an output file. The code I use is rudimentary and likely crude.
#!/usr/bin/env python
data = open("data.txt", "r")
search_terms = ids.read().splitlines()
data.close()
db = open("db.txt", "r")
output = open("output.txt", "w")
for term in search_terms:
    for line in db:
        if line.find(term) > -1:
            next_line = db.next()
            output.write(">" + head + "\n" + next_line)
            print("Found %s" % term)
There are a few problems here. First, I don't think it's the most efficient and fastest to search line by line, but I'm not exactly sure about that. Second, I often run into issues with cursor placement and the cursor doesn't reset to the beginning of the file when the search term is found. Third, while I am usually confident that all of the terms can be found in the db, there are rare times when I can't be sure, so I would like to write to another file whenever it iterates through the entire db and can't find the term. I've tried adding a snippet that counts the number of lines of the db so if the find() function gets to the last line and the term isn't found, then it outputs to another "not found" file, but I haven't been able to get my elif and else loops right.
Overall, I'd just like any hints or corrections that could make this sort of script more efficient and robust.
Thanks.

Unless it's a really big file, why not iterate line by line? If the input file's size is some significant portion of your machine's available resources (memory), then you might want to look into buffered input and other, more low-level abstractions of what the computer is doing. But if you're talking about a few hundred MB or less on a relatively modern machine, let the computer do the computing ;)
Off the bat you might want to get into the habit of using the built-in context manager with. For instance, in your snippet, you don't have a call to output.close().
with open('data.txt', 'r') as f_in:
    search_terms = f_in.read().splitlines()
Now search_terms is a handle to a list that has each line from data.txt as a string (but with the newline characters removed). And data.txt is closed thanks to with.
In fact, I would do that with the db.txt file, also.
with open('db.txt', 'r') as f_in:
    lines = f_in.read().splitlines()
Context managers are cool.
As a side note, you could open your destination file now, and do your parsing and results-tracking with it open the whole time, but I like leaving as many files closed as possible for as long as possible.
I would suggest putting the biggest object in the outer loop, which I'm guessing is the contents of db.txt. The outermost loop is only iterated once, so you might as well put the biggest thing there.
results = []
for i, line in enumerate(lines):
    for term in search_terms:
        if term in line:
            # Use something not likely to appear in your line as a separator
            # for these "second lines". I used three pipe characters, but
            # you could just as easily use something even more random
            # Guard against a match on the very last line (no following line)
            next_line = lines[i + 1] if i + 1 < len(lines) else ''
            results.append('{}|||{}'.format(line, next_line))
if results:
    with open('output.txt', 'w') as f_out:
        for result in results:
            # Don't forget to replace your custom field separator
            f_out.write('> {}\n'.format(result.replace('|||', '\n')))
else:
    with open('no_results.txt', 'w') as f_out:
        # This will write an empty file to disk
        pass
The nice thing about this approach is that each line in db.txt is checked once for each search_term in search_terms. The downside is that a line will be recorded once for each search term it contains, i.e., if it has three search terms in it, that line will appear in your output.txt three times.
And all the files are magically closed.
Context managers are cool.
Good luck!

search_terms keeps the whole of data.txt in memory. That's not good in general, but in this case it's not too bad.
Searching line by line is not efficient, but if the case is simple and the files are not too big it's not a big deal. If you want more efficiency, you could sort the data.txt file and put it into some tree-like structure. It depends on the data inside.
You have to use seek to move the file pointer back after using next.
Probably the easiest way here is to generate two lists of lines and search using in, like:
db = open('db.txt').readlines()
db_words = [x.split() for x in db]
# splitlines() drops the trailing newlines so terms can match whole words
data = open('data.txt').read().splitlines()
print('Lines in db {}'.format(len(db)))
for item in data:
    for words in db_words:
        if item in words:
            print("Found {}".format(item))

Your key issue is that you may be looping in the wrong order -- in your code as posted, you'll always exhaust the db looking for the first term, so after the first pass of the outer for loop db will be at end, no more lines to read, no other term will ever be found.
Other improvements include using the with statement to guarantee file closure, and a set to track which search terms were not found. (There are also typos in your posted code, such as opening a file as data but then reading it as ids).
So, for example, something like:
with open("data.txt", "r") as data:
    search_terms = data.read().splitlines()
missing_terms = set(search_terms)
with open("db.txt", "r") as db, open("output.txt", "w") as output:
    for line in db:
        for term in search_terms:
            if term in line:
                missing_terms.discard(term)
                # next(db, "") avoids StopIteration when the match is on the last line
                next_line = next(db, "")
                # `head` is undefined in the question; the matching line is assumed here
                output.write(">" + line.rstrip("\n") + "\n" + next_line)
                print("Found {}".format(term))
                break
if missing_terms:
    diagnose_not_found(missing_terms)
where the diagnose_not_found function does whatever you need to do to warn the user about missing terms.
There are assumptions embedded here, such as the fact that you don't care if some other search term is present in a line where you've found a previous one, or the very next one; they might take substantial work to fix if not applicable and it will require that you edit your Q with a very complete and unambiguous list of specifications.
If your db is actually small enough to fit comfortably in memory, slurping it all in as a list of lines once and for all would make it easier to accommodate more demanding specs: in that case you can easily go back and forth, while iterating on a file means you can only go forward one line at a time. So if your specs are indeed more demanding, please also clarify whether this crucial condition holds, or whether you need this script to process potentially humungous db files (say gigabyte-plus sizes, so as to not "comfortably fit in memory", depending on your platform of course).
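For the in-memory case, the ideas in this answer (one pass over the db, a set of terms still missing, the line after each match) could be sketched roughly as follows. The function name search_lines and the sample data are hypothetical, and this version records every matching term per line rather than stopping at the first, which is an assumption about the specs.

```python
def search_lines(lines, search_terms):
    """Return (matches, missing) for an in-memory list of db lines.

    matches is a list of (term, line, next_line) tuples; missing is the
    set of terms that never appeared in any line.
    """
    missing = set(search_terms)
    matches = []
    for i, line in enumerate(lines):
        # The line after this one, or '' when this is the last line
        next_line = lines[i + 1] if i + 1 < len(lines) else ''
        for term in search_terms:
            if term in line:
                missing.discard(term)
                matches.append((term, line, next_line))
    return matches, missing


# Hypothetical sample data standing in for db.txt and data.txt
db_lines = ['>seq1 alpha', 'ACGT', '>seq2 beta', 'TTGA']
matches, missing = search_lines(db_lines, ['alpha', 'gamma'])
print(matches)
print(sorted(missing))
```

Whatever ends up in missing can then be written to a separate "not found" file once the loop is done, which avoids counting lines to detect the end of the db.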

Related

Remove double file contents

I have previously written a file in python and I wrote the same contents twice while trying to run the script for the second time.
Here is my file content:
Story1: A short story is a piece of prose fiction that typically can be read in one sitting and focuses on a self-contained incident or series of linked incidents, with the intent of evoking a "single effect" or mood, however there are many exceptions to this. A dictionary definition is "an invented prose narrative shorter than a novel usually dealing with a few characters and aiming at unity of effect and often concentrating on the creation of mood rather than plot. Story1: A short story is a piece of prose fiction that typically can be read in one sitting and focuses on a self-contained incident or series of linked incidents, with the intent of evoking a "single effect" or mood, however there are many exceptions to this. A dictionary definition is "an invented prose narrative shorter than a novel usually dealing with a few characters and aiming at unity of effect and often concentrating on the creation of mood rather than plot.
I am using python Set operator like this, but this won't work for my case:
uniqlines = set(open('file.txt').readlines())
bar = open('file', 'w').writelines(set(uniqlines))
In my case, there are no newline characters, so everything is read as a single line. I want to be able to delete the contents after Story1: is encountered the second time.
How do I accomplish it?

Update: Since you don't have line breaks to split up the file, you're likely better off just slurping the file, splitting appropriately, and writing a new file. Simple solution would be:
import os, tempfile
with open('file.txt') as f,\
     tempfile.NamedTemporaryFile('w', dir='.', delete=False) as tf:
    # You've got a space only before second copy, so it's a useful partition point
    firstcopy, _, _ = f.read().partition(' Story1: ')
    # Write first copy
    tf.write(firstcopy)
# Exiting with block closes temporary file so data is there
# Atomically replace original file with rewritten temporary file
os.replace(tf.name, 'file.txt')
Technically, this isn't completely safe against actual power loss, since data might not be written to disk before the replace metadata update occurs. If you're paranoid, tweak it to explicitly block until the data is synced by adding the following two lines just before dedenting out of the with block (after the write):
tf.flush() # Flushes Python level buffers to OS
os.fsync(tf.fileno()) # Flush OS kernel buffer out to disk, block until done
Old answer for case where copies begin on separate lines:
Find where the second copy begins, and truncate the file:
seen_story1 = False
with open('file.txt', 'r+') as f:
    while True:
        pos = f.tell()  # Record position before next line
        line = f.readline()
        if not line:
            break  # Hit EOF
        if line.startswith('Story1:'):
            if seen_story1:
                # Seen it already, we're in duplicate territory
                f.seek(pos)  # Go back to end of last line
                f.truncate()  # Truncate file
                break  # We're done
            else:
                seen_story1 = True  # Seeing it for the first time
Since all you're doing is removing duplicate information from the end of the file, this is safe and effective; truncate should be atomic on most OSes, so the trailing data is freed all at once, with no risk of partial write corruption or the like.

You could use the find method.
# set the word you want to look for
myword = "Story1"
# read the file into a variable called text
with open('file.txt', 'r+') as fin:
    text = fin.read()
# find your word for the first time. This method returns the lowest index of the substring if it is found.
# That's why we add the length of the word we are looking for.
index_first_time_found = text.find(myword) + len(myword)
# We search again, but now we start looking from the index of our previous result.
index_second_time_found = text.find(myword, index_first_time_found)
# We cut off everything from the second occurrence onward (find returns -1 if there is none).
new_text = text[:index_second_time_found] if index_second_time_found != -1 else text
print(new_text)
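The same find-based idea can be wrapped in a small function that also copes with a marker that appears only once or not at all. This is only a sketch: keep_first_copy is a hypothetical name, and stripping trailing whitespace from the result is an assumption about what the cleaned file should look like.

```python
def keep_first_copy(text, marker):
    """Return text cut just before the second occurrence of marker."""
    first = text.find(marker)
    if first == -1:
        return text  # marker never appears; nothing to cut
    second = text.find(marker, first + len(marker))
    if second == -1:
        return text  # only one copy present
    # Drop the separator space left before the second copy
    return text[:second].rstrip()


sample = 'Story1: once upon a time. Story1: once upon a time.'
print(keep_first_copy(sample, 'Story1:'))
```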

Entry of a list item in a file

Good afternoon. I have a nested list of IP and MAC addresses, of arbitrary length:
A = [['10.0.0.1','00:4C:3S:**:**:**', 0], ['10.0.0.2', '00:5C:4S:**:**:**', 0], [....], [....]]
I want to check if this MAC is in the oui file:
E043DB (base 16) Shenzhen
2405f5 (base 16) Integrated
3CD92B (base 16) Hewlett Packard
...
If the MAC from the list is in the file, I want to write the name of the manufacturer as the third item of that list entry. With the code below, only the first element is checked and the remaining ones are not. How can I fix this?
f = open('oui.txt', 'r')
for values in A:
    for line in f.readlines():
        if values[1][0:8].replace(':','') in line:
            values[2]=(line.split('(base 16)')[1].strip())
f.close()
print (A)
And I get this result:
A = [['10.0.0.1','00:4C:3S:**:**:**', 'Firm Name'], ['10.0.0.2', '00:5C:4S:**:**:**', 0], [....], [....]]

The Problem
Consider the "shape" of your code:
f = open('a file')
for values in [ 'some list' ]:
    for line in f.readlines():
Your two loops are doing this:
Start with first value in list
Read all lines remaining in file object f
Move to next value in list
Read all lines remaining in file object f
Except that the first time you told it to "read all lines remaining" it would do so.
So, unless you have some way to put more lines into f (which can happen with async files like stdin!) you are going to get one "good" pass through the file, and then every subsequent pass the file object will point to the end of the file, so you'll get nothing.
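A small demonstration of that behaviour, using io.StringIO as a stand-in for a real open file (an assumption made only so the snippet runs without touching disk):

```python
import io

# io.StringIO stands in for an open text file here
f = io.StringIO('line one\nline two\n')
first_pass = f.readlines()   # consumes everything up to end of file
second_pass = f.readlines()  # nothing left, so this is an empty list
f.seek(0)                    # rewind to the start
third_pass = f.readlines()   # the lines are available again
print(len(first_pass), len(second_pass), len(third_pass))
```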
A Solution
When you are dealing with a file, you want to only process it one time. File I/O is expensive compared to other operations. So you can choose to either (a) read the entire file into memory, and do whatever you want since it's not a file any more; or (b) scan through it only one time.
If you choose to scan through it only once, the easy solution is just to invert the two for loops. Instead of doing this:
for item in list:
    for line in file:
Do this instead:
for line in file:
    for item in list:
And presto! You are now only reading the file one time.
Other Considerations
If I look at your code, and your examples, it seems like you are trying for an exact match on a particular key. You trim down the MAC addresses in your list to check them against the manufacturer ids.
This suggests to me that you might well have many, many more list values (source MAC addresses) than you have manufacturers. So perhaps you should consider reading the contents of the file into memory, rather than processing it one line at a time.
Once you have the file in memory, consider building a proper dictionary. You have a key (MAC prefix) and a value (manufacturer). So build something like:
mac_to_mfg = {}
for line in f:
    mac = line.split('(base 16)')[0].strip()
    mfg = line.split('(base 16)')[1].strip()
    mac_to_mfg[mac] = mfg
Then you can make one pass through the source addresses and use the dict's O(1) lookup to your advantage:
for src in A:
    prefix = src[1][:8].replace(':', '')
    if prefix in mac_to_mfg:
        # etc...
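Put together, the dictionary approach might look like the sketch below. The helper names build_oui_map and annotate, the sample MAC addresses, and the upper-casing of prefixes (the sample oui lines mix upper and lower case) are all assumptions, not part of the original question.

```python
def build_oui_map(lines):
    """Map upper-cased 6-hex-digit prefixes to manufacturer names."""
    mac_to_mfg = {}
    for line in lines:
        if '(base 16)' not in line:
            continue  # skip lines that are not prefix entries
        prefix, _, name = line.partition('(base 16)')
        mac_to_mfg[prefix.strip().upper()] = name.strip()
    return mac_to_mfg


def annotate(addresses, mac_to_mfg):
    """Fill the third slot of each [ip, mac, 0] entry when the prefix is known."""
    for entry in addresses:
        prefix = entry[1][:8].replace(':', '').upper()
        if prefix in mac_to_mfg:
            entry[2] = mac_to_mfg[prefix]
    return addresses


# Hypothetical sample data; in practice the lines would come from oui.txt
oui_lines = ['E043DB (base 16) Shenzhen',
             '2405f5 (base 16) Integrated',
             '3CD92B (base 16) Hewlett Packard']
A = [['10.0.0.1', '3C:D9:2B:00:00:01', 0],
     ['10.0.0.2', '24:05:F5:00:00:02', 0],
     ['10.0.0.3', 'AA:BB:CC:00:00:03', 0]]
annotate(A, build_oui_map(oui_lines))
print(A)
```

Entries whose prefix is not in the map keep their original 0, so unknown manufacturers remain easy to spot.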

The problem is that you have the order of the loops reversed. Usually this isn't that big of a problem, but when working with objects that are consumed (like the IO file object), the contents are no longer produced once the object has been iterated over.
You'll need to iterate over the lines first, and then within each line iterate through A to check the values:
with open('oui.txt', 'r') as f:
    for line in f.readlines():
        for values in A:
            if values[1][0:8].replace(':','') in line:
                values[2]=(line.split('(base 16)')[1].strip())
print (A)
Notice I changed your file opening to use the with context manager instead; once your code exits the with block, it will automatically close() the file for you. It is recommended over manually opening the file, as you might forget to close() it afterwards.

How do i replace a specific value in a file in python

I'm trying to replace the zeros with a value. So far this is my code, but what do I do next?
g = open("January.txt", "r+")
for i in range(3):
    dat_month = g.readline()
Month: January
Item: Lawn
Total square metres purchased:
0
monthly value = 0

You could do that, but it is not the usual approach, and it is certainly not the correct approach for text files.
The correct way to do it is to write another file, with the information you want updated in place, and then rename the new file to the old one. That is the only sane way of doing this with text files, since the size in bytes of the fields is variable.
As for the impression that you are "writing 200 bytes to the disk" instead of a single byte when changing your value, don't let that fool you: at the operating system level, all file access has to be done in blocks, which are usually a couple of kilobytes long (in special cases, and on tuned filesystems, it could be a couple hundred bytes). In any case, you will never, in a user-space program, much less in a high-level language like Python, trigger a disk write of less than a few hundred bytes.
Now, for the code:
import os
my_number = 45678  # example value; use the number you want to place in the line you want to rewrite
with open("January.txt", "r") as in_file, open("newfile.txt", "w") as out_file:
    for line in in_file:
        if line.strip() == "0":
            out_file.write(str(my_number) + "\n")
        else:
            out_file.write(line)
os.unlink("January.txt")
os.rename("newfile.txt", "January.txt")
So that is the general idea. Of course, you should not write code with all values hardcoded in that way (i.e. with the values to be checked and written fixed in the program code, as the filenames are).
As for the with statement: it is a special construct of the language which is very appropriate for opening files and manipulating them in a block, as in this case, but it is not strictly needed.
Programming aside, the concept you have to keep in mind is this:
when you use an application that lets you edit a text file, a spreadsheet, or an image, you, as the user, may have the impression that after you are done and have saved your work, the updates are committed to the same file. In the vast, vast majority of use cases, that is not what happens: the application internally uses a pattern like the one I presented above, in which a completely new file is written to disk and the old one is deleted or renamed. The few exceptions could be simple database applications, which could replace fixed-width fields inside the file itself on updates. Modern-day databases certainly do not do that, resorting instead to appending the most recent, updated information to the end of the file. PDF files are another kind that were not designed to be replaced entirely on each update while being created, but in that case too the updated information is written at the end of the file, even if the update is to take place on a page at the beginning of the rendered document.
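A more general version of this write-then-rename pattern might look like the sketch below. The names rewrite_file and zero_to are hypothetical; using tempfile.NamedTemporaryFile in the target's directory and os.replace is one reasonable choice, not the only one.

```python
import os
import tempfile


def rewrite_file(path, transform):
    """Write transform(line) for every line to a temp file, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target so os.replace stays on one filesystem
    with open(path, 'r') as src, tempfile.NamedTemporaryFile(
            'w', dir=directory, delete=False) as tmp:
        for line in src:
            tmp.write(transform(line))
    os.replace(tmp.name, path)  # overwrites the original in one step


def zero_to(value):
    """Build a transform that replaces lines consisting only of '0'."""
    def transform(line):
        return str(value) + '\n' if line.strip() == '0' else line
    return transform
```

With these definitions, a call such as rewrite_file('January.txt', zero_to(45678)) would replace the lone 0 line and leave a line like "monthly value = 0" untouched.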

dat_month = dat_month.replace("0", "45678")
To write to a file you do:
with open("Outfile.txt", "wt") as outfile:
    outfile.write(dat_month)

Try this:
import fileinput
import itertools
import sys
with fileinput.input('January.txt', inplace=True) as file:
    beginning = tuple(itertools.islice(file, 3))
    sys.stdout.writelines(beginning)
    sys.stdout.write(next(file).replace('0', 'a value'))
    sys.stdout.write(next(file).replace('0', 'a value'))
    sys.stdout.writelines(file)

Update strings in a text file at a specific location

I would like to find a better solution to achieve the following three steps:
read strings at a given row
update strings
write the updated strings back
Below is my code, which works, but I am wondering whether there is a better (simpler) solution.
new='99999'
f=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP','r+')
lines=f.readlines()
#the row number we want to update is given, so just load the content
x = lines[95]
print(x)
f.close()
#replace
f1=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP')
con = f1.read()
print con
con1 = con.replace(x[2:8],new) #only certain columns in this row needs to be updated
print con1
f1.close()
#write
f2 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'w')
f2.write(con1)
f2.close()
Thanks!
UPDATE: I got an idea from jtmoulia, and this time it is easier:
def replace_line(file_name, line_num, col_s, col_e, text):
    lines = open(file_name, 'r').readlines()
    temp=lines[line_num]
    temp = temp.replace(temp[col_s:col_e],text)
    lines[line_num]=temp
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()

The problem with textual data, even when tabulated, is that the byte offsets are not predictable. For example, when representing numbers as strings you need one byte per digit, whereas in binary (e.g. two's complement) you always need four or eight bytes, for both small and large integers.
Nevertheless, if your text format is strict enough that you can get by with replacing bytes without changing the size of the file, you can try the standard mmap module. With it, you can treat a file as a mutable byte string and modify parts of it in place, letting the kernel do the file saving for you.
Otherwise, any of the other answers is much better suited to the problem.
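As a rough illustration of the mmap route, the sketch below overwrites a fixed-width field in place. The function name overwrite_bytes is hypothetical, and it assumes the caller already knows the byte offset and supplies a replacement of exactly the same width as the field.

```python
import mmap


def overwrite_bytes(path, offset, new_bytes):
    """Overwrite len(new_bytes) bytes at offset without changing file size."""
    with open(path, 'r+b') as f:
        with mmap.mmap(f.fileno(), 0) as mm:  # length 0 maps the whole file
            end = offset + len(new_bytes)
            if end > len(mm):
                raise ValueError('replacement would run past end of file')
            # Slice assignment requires both sides to have the same length
            mm[offset:end] = new_bytes
            mm.flush()
```

Because the file size never changes, this only suits strictly fixed-width layouts; a replacement shorter than the field would need padding first.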

Well, to begin with you don't need to keep reopening and reading from the file every time. The r+ mode allows you to read and write to the given file.
Perhaps something like
with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+') as f:
    lines = f.readlines()
    #... Perform whatever replacement you'd like on lines
    f.seek(0)
    f.writelines(lines)
    f.truncate()  # drop leftover bytes if the new content is shorter
Also see: Editing specific line in text file in python

When I had to do something similar (for a Webmin customization), I did it entirely in PERL because that's what the Webmin framework used, and I found it quite easy. I assume (but don't know for sure) there are equivalent things in Python. First read the entire file into memory all at once (the PERL way to do this is probably called "slurp"). (This idea of holding the entire file in memory rather than just one line used to make little sense {or even be impossible}. But these days RAM is so large it's the only way to go.) Then use the split operator to divide the file into lines and put each line in a different element of a giant array. You can then use the desired line number as an index into the array (remember array indices usually start with 0). Finally, use "regular expression" processing to change the text of the line. Then change another line, and another, and another (or make another change to the same line). When you're all done, use join to put all the lines in the array back together into one giant string. Then write the whole modified file out.
While I don't have the complete code handy, here's an approximate fragment of some of the PERL code so you can see what I mean:
our @filelines = ();
our $lineno = 43;
our $oldstring = 'foobar';
our $newstring = 'fee fie fo fum';
$filelines[$lineno-1] =~ s/$oldstring/$newstring/ig;
# "ig" modifiers for case-insensitivity and possible multiple occurences in the line
# use different modifiers at the end of the s/// construct as needed

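FILENAME = 'C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP'
lines = list(open(FILENAME))
# Strings are immutable, so rebuild the line around the replaced columns
lines[95] = lines[95][:2] + '99999' + lines[95][8:]
open(FILENAME, 'w').write(''.join(lines))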
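The column replacement on its own can be sketched as a small function working on a list of lines. The name replace_columns and the sample lines are hypothetical. Slicing by position only touches the requested columns, whereas str.replace would change every occurrence of the old substring.

```python
def replace_columns(lines, line_num, col_start, col_end, text):
    """Return a copy of lines with lines[line_num][col_start:col_end] replaced."""
    updated = list(lines)
    old = updated[line_num]
    # Keep everything before and after the column range unchanged
    updated[line_num] = old[:col_start] + text + old[col_end:]
    return updated


sample = ['AAAAAAAAAA\n', 'xx123456yy\n']
print(replace_columns(sample, 1, 2, 8, '99999'))
```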

How can I change a huge file into csv in python

I'm a beginner in Python. I have a huge text file (hundreds of GB) and I want to convert it into a CSV file. In my text file, I know the row delimiter is the string "<><><><><><><>". If a line contains that string, I want to replace it with ". Is there a way to do this without having to read the old file and write a new file?
Normally I thought I need to do something like this:
fin = open("input", "r")
fout = open("outpout", "w")
line = f.readline
while line != "":
    if line.contains("<><><><><><><>"):
        fout.writeline("\"")
    else:
        fout.writeline(line)
    line = f.readline
but copying hundreds of GB is wasteful. Also I don't know if open will eat lots of memory (does it treat file handler as a stream?)
Any help is greatly appreciated.
Note: an example of the file would be
file.txt
<><><><><><><>
abcdefeghsduai
asdjliwa
1231214 ""
<><><><><><><>
would be one row and one column in csv.

@richard-levasseur
I agree, sed seems like the right way to go. Here's a rough cut at what the OP describes:
sed -i -e's/<><><><><><><>/"/g' foo.txt
This will do the replacement in-place in the existing foo.txt. For that reason, I recommend having the original file under some sort of version control; any of the DVCS should fit the bill.

Yes, open() treats the file as a stream, as does readline(). It'll only read the next line. If you call read(), however, it'll read everything into memory.
Your example code looks OK at first glance. Almost every solution will require you to copy the file elsewhere. It's not exactly easy to modify the contents of a file in place without a 1:1 replacement.
It may be faster to use some standard unix utilities (awk and sed most likely), but I lack the unix and bash-fu necessary to provide a full solution.

It's only wasteful if you don't have disk to spare. That is, fix it when it's a problem. Your solution looks ok as a first attempt.
It's not wasteful of memory because a file handler is a stream.

Reading lines is simply done using a file iterator:
for line in fin:
    if "<><><><><><><>" in line:
        fout.write("\"")
Also consider the CSV writer object to write CSV files, e.g.:
import csv
# newline="" lets the csv module control line endings (Python 3)
writer = csv.writer(open("some.csv", "w", newline=""))
writer.writerows(someiterable)
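Combining the file iterator with csv.writer, a streaming conversion could be sketched as below. The generator name iter_rows is hypothetical, io.StringIO stands in for the real input and output files, and joining each block into a single field follows the question's note that the sample is "one row and one column".

```python
import csv
import io


def iter_rows(lines, row_delim='<><><><><><><>'):
    """Yield a one-field row for each block of lines between delimiter lines."""
    block = []
    for line in lines:
        if row_delim in line:
            if block:
                yield ['\n'.join(block)]
            block = []
        else:
            block.append(line.rstrip('\n'))
    if block:  # final block not closed by a trailing delimiter
        yield ['\n'.join(block)]


# io.StringIO stands in for the huge input file and the output file
sample = io.StringIO(
    '<><><><><><><>\nabcdefeghsduai\nasdjliwa\n1231214 ""\n<><><><><><><>\n')
out = io.StringIO()
writer = csv.writer(out)  # default dialect quotes the field and doubles embedded quotes
writer.writerows(iter_rows(sample))
print(out.getvalue())
```

Only one block is held in memory at a time, and the csv module takes care of quoting embedded newlines and double quotes. With real files, the output would be opened with newline="" as in the snippet above.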

With Python you will have to create a new file for safety's sake; it will cause a lot fewer headaches than trying to write in place.
The code below reads your input one line at a time and buffers the columns (from what I understood, your test input file was one row). Once the end-of-row delimiter is hit, it writes that buffer to disk, flushing manually every 1000 lines of the original file. This also saves some I/O compared with writing every segment: 1000 writes of 32 bytes each will be faster than 4000 writes of 8 bytes.
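fin = open(input_fn, "r")
# newline="" keeps the explicit \r\n below exactly as written (Python 3)
fout = open(output_fn, "w", newline="")
row_delim = "<><><><><><><>"
write_buffer = []
for i, line in enumerate(fin):
    if not i % 1000:
        fout.flush()
    if row_delim in line:
        # Skip the write for the opening delimiter, when nothing is buffered yet
        if write_buffer:
            fout.write('"%s"\r\n'%'","'.join(write_buffer))
        write_buffer = []
    else:
        write_buffer.append(line.strip())
fin.close()
fout.close()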
Hope that helps.
EDIT: Forgot to mention, while using .readline() is not a bad thing don't use .readlines() which will go and read the entire content of the file into a list containing each line which is incredibly inefficient. Using the built in iterator that comes with a file object is the best memory usage and speed.

@Constatin suggests that if you would be satisfied with replacing '<><><><><><><>\n' by '"' followed by 13 spaces and '\n',
then the replacement string is the same length, and in that case you can craft an in-place editing solution with mmap. You will need Python 2.6 or later. It's vital that the file is opened in the right mode!
import mmap, os
CHUNK = 2**20
oldStr = b'<><><><><><><>'
newStr = b'"' + b' ' * 13  # padded with spaces to the same length as oldStr
strLen = len(oldStr)
assert strLen==len(newStr)
f = open("myfilename", "r+b")
size = os.fstat(f.fileno()).st_size
for offset in range(0,size,CHUNK):
    map = mmap.mmap(f.fileno(),
                    length=min(CHUNK+strLen,size-offset), # not beyond EOF
                    offset=offset)
    index = 0 # start at beginning
    while 1:
        index = map.find(oldStr,index) # find next match
        if index == -1: # no more matches in this map
            break
        map[index:index+strLen] = newStr
    map.close()
f.close()
This code is not debugged! It works for me on a 3 MB test case, but it may not work on a large ( > 2GB) file - the mmap module still seems a bit immature, so I wouldn't rely on it too much.
Looking at the bigger picture, from what you've posted it isn't clear that your file will end up as valid CSV. Also be aware that the tool you're planning to use to actually process the CSV may be flexible enough to deal with the file as it stands.

If you're delimiting fields with double quotes, it looks like you need to escape the double quotes you have occurring in your elements (for example 1231214 "" will need to be \n1231214 \"\").
Something like
fin = open("input", "r")
fout = open("output", "w")
for line in fin:
    if "<><><><><><><>" in line:
        fout.write("\"")
    else:
        fout.write(line.replace('"',r'\"'))
fin.close()
fout.close()

[For the problem exactly as stated] There's no way that this can be done without copying the data, in python or any other language. If your processing always replaced substrings with new substrings of equal length, maybe you could do it in-place. But whenever you replace <><><><><><><> with " you are changing the position of all subsequent characters in the file. Copying from one place to another is the only way to handle this.
EDIT:
Note that the use of sed won't actually save any copying...sed doesn't really edit in-place either. From the GNU sed manual:
-i[SUFFIX]
--in-place[=SUFFIX]
This option specifies that files are to be edited in-place. GNU sed does this by creating a temporary file and sending output to this file rather than to the standard output.
(emphasis mine.)
