Matching every word in a txt file - python

I'm working on a Project Euler problem (for fun).
It comes with a 46 KB txt file containing a single line with a list of over 5000 names, in a format like this:
"MARIA","SUSAN","ANGELA","JACK"...
My plan is to write a method that extracts every name and appends it to a Python list. Is a regular expression the best tool for this problem?
I looked up the Python re docs, but am having a hard time figuring out the right regex.

If the format of the file is as you say it is, i.e.
It's a single line
The format is like this: "MARIA","SUSAN","ANGELA","JACK"
Then this should work:
>>> import csv
>>> lines = csv.reader(open('words.txt', 'r'), delimiter=',')
>>> words = next(lines)
>>> words
['MARIA', 'SUSAN', 'ANGELA', 'JACK']

That looks like a format that the csv module would be helpful with. Then you wouldn't have to write any regex.
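As a rough illustration of the csv suggestion, here is a minimal Python 3 sketch. The `sample` string is a stand-in for the file contents (an assumption, since the real file is not shown), and `io.StringIO` plays the role of the open file.

```python
import csv
import io

# Stand-in for the single line stored in the names file (assumed content).
sample = '"MARIA","SUSAN","ANGELA","JACK"'

# csv.reader accepts any iterable of lines; StringIO mimics an open text file.
reader = csv.reader(io.StringIO(sample))

# The file has one line, so the first row holds every name.
names = next(reader)
print(names)
```

With a real file, `io.StringIO(sample)` would be replaced by the object returned from `open(...)`.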

If you can do it more simply, then do it more simply. There's no need for the csv module here, and 5000 names or 46 KB is not enough to worry about.
names = []
with open("names.txt", "r") as f:
    # In case there is more than one line...
    for line in f:
        names.extend(x.strip().replace('"', '') for x in line.split(","))
print(names)
# should print ['name1', ... , ...]

A regexp will get the job done, but would be inefficient. Using csv would work, but it might not handle 5000 cells in a single line very well. At the very least it has to load the whole file in and maintain the entire list of names in memory (which might not be a problem for you because that's a very small amount of data). If you want an iterator for relatively large files (much larger than 5000 names), a state machine will do the trick:
def parse_chunks(iter, quote='"', delim=',', escape='\\'):
    in_quote = False
    in_escaped = False
    buffer = ''
    for chunk in iter:
        for byte in chunk:
            if in_escaped:
                # Done with the escape char, add it to the buffer
                buffer += byte
                in_escaped = False
            elif byte == escape:
                # The next character will be added literally and not parsed
                in_escaped = True
            elif in_quote:
                if byte == quote:
                    in_quote = False
                else:
                    buffer += byte
            elif byte == quote:
                in_quote = True
            elif byte in (' ', '\n', '\t', '\r'):
                # Ignore whitespace outside of quotes
                pass
            elif byte == delim:
                # Done with this block of text
                yield buffer
                buffer = ''
            else:
                buffer += byte
    if in_quote:
        raise ValueError('Found unbalanced quote char %r' % quote)
    elif in_escaped:
        raise ValueError('Found unbalanced escape char %r' % escape)
    # Yield the last bit in the buffer
    yield buffer
data = r"""
"MARIA","SUSAN",
"ANG
ELA","JACK",,TED,"JOE\""
"""
print(list(parse_chunks(data)))
# ['MARIA', 'SUSAN', 'ANG\nELA', 'JACK', '', 'TED', 'JOE"']
# Use a fixed buffer size if you know the file has only one long line or
# don't care about line parsing
buffer_size = 4096
with open('myfile.txt', 'r', buffer_size) as file:
    for name in parse_chunks(file):
        print(name)
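For comparison, the regexp route mentioned at the start of this answer might look roughly like the sketch below. It assumes the names never contain an escaped quote, and `sample` is a stand-in for the file contents rather than the real data.

```python
import re

# Stand-in for the single line stored in the names file (assumed content).
sample = '"MARIA","SUSAN","ANGELA","JACK"'

# Capture everything between each pair of double quotes.
# Assumption: no name contains an escaped or embedded quote character.
names = re.findall(r'"([^"]*)"', sample)
print(names)
```

Unlike the state machine above, this does not handle backslash escapes or unquoted fields.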


Problem reading valid last line of a file [duplicate]

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.
with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)
Jump to the second last byte directly to prevent a trailing newline character from causing an empty line to be returned*.
The current offset is pushed ahead by one every time a byte is read so the stepping backwards is done two bytes at a time, past the recently read byte and the byte to read next.
The whence parameter passed to f.seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
* This is the behavior one would expect, since most applications, including print and echo, append a newline to every line written; it has no effect on a last line that lacks a trailing newline character.
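To make the three whence values concrete, here is a small sketch using an in-memory binary stream. The sample bytes are made up for illustration; a file opened in "rb" mode should behave the same way.

```python
import io
import os

# In-memory stand-in for a file opened in binary mode (sample data is made up).
f = io.BytesIO(b"first\nsecond\nlast\n")

f.seek(6, os.SEEK_SET)    # 6 bytes from the beginning: start of the second line
a = f.read(6)             # reads b"second", position is now 12

f.seek(-6, os.SEEK_CUR)   # 6 bytes back from the current position
b = f.read(6)             # reads b"second" again

f.seek(-5, os.SEEK_END)   # 5 bytes before the end of the stream
c = f.read()              # reads b"last\n"

print(a, b, c)
```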
Efficiency
1-2 million lines each and I have to do this for several hundred files.
I timed this method and compared it against the top answer.
10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.
Millions of lines would increase the difference a lot more.
Exact code used for timing:
with open(file, "rb") as f:
    first = f.readline()  # Read and store the first line.
    for last in f: pass   # Read all lines, keep final value.
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all as you're probably better off using f.readlines()[-1] with files opened in text mode.
#!/bin/python3
from os import SEEK_END
def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.
    :param f: File to read last line from.
    :type f: file-like object
    :param sep: Segment separator (delimiter).
    :type sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o-bs-step)     # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep:  # - Until reaching 'sep': Read sep-sized block
            o = f.seek(o-step)    #   and then seek to the block to read next.
    except (OSError,ValueError):  # - Beginning of file reached.
        f.seek(0)
    return f.read()
def test_readlast():
    from io import BytesIO, StringIO
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    # Make sure you use the correct delimiters.
    seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
    assert "\n".encode('utf8' ) == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    # Edge cases.
    edges = (
        # Text  , Match
        (""     , ""  ),  # Empty file, empty string.
        ("X"    , "X" ),  # No delimiter, full content.
        ("\n"   , "\n"),
        ("\n\n" , "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc,sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match
if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline() , end="")
            print(readlast(f,"\n"), end="")
docs for io module
with open(fname, 'rb') as fh:
    first = next(fh).decode()
    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()
The value 1024 here is arbitrary: it stands for an estimate of the line length, and I chose it only as an example. If you have an estimate of the average line length, you could use twice that value.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:
for line in fh:
    pass
last = line
You don't need to bother with the binary flag; you could just use open(fname).
ETA: Since you have many files to work on, you could take a sample of a couple of dozen files using random.sample and run this code on them to determine the length of the last line, using an a priori large value for the position shift (say 1 MB). This will help you estimate the value for the full run.
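The sampling idea in the ETA could be sketched as follows. The function names, the default sample size of 24 and the 1 MB probe are assumptions made for illustration, not part of the original answer; `paths` is expected to be a list of file paths.

```python
import os
import random

def last_line_length(path, probe=1024 * 1024):
    """Length in bytes of the last line, looking at most `probe` bytes back."""
    with open(path, "rb") as fh:
        size = fh.seek(0, os.SEEK_END)    # seek returns the new position
        fh.seek(max(0, size - probe))     # never seek before the start
        lines = fh.readlines()
    return len(lines[-1]) if lines else 0

def estimate_shift(paths, sample_size=24, probe=1024 * 1024):
    """Longest last line found in a random sample of the given files."""
    sample = random.sample(paths, min(sample_size, len(paths)))
    return max((last_line_length(p, probe) for p in sample), default=0)
```

The returned value, with some safety margin, could then replace the 1024 used in the snippet above.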
Here's a modified version of SilentGhost's answer that will do what you want.
with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines)>1:
            last = lines[-1]
            break
        offs *= 2
print(first)
print(last)
No need for an upper bound for line length here.
Can you use unix commands? I think head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] to get the last, but that may take too much memory.
This is my solution, which is also compatible with Python 3. It handles edge cases too, but lacks UTF-16 support:
import os
def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """
    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath
    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1
        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)
            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)
            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""
                for pos in range(start_pos, -1, -1):
                    f.seek(pos)
                    char = f.read(1)
                    if char == b"\n":
                        break
        return f.readline()
It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read it line by line; all the lines are stored in a list. Now you can use list indexing to get the first and last lines of the file.
a = open('file.txt', 'rb')
lines = a.readlines()
a.close()
if lines:
    first_line = lines[0]
    last_line = lines[-1]
w = open('file.txt', 'r')
print('first line is : ', w.readline())
for line in w:
    x = line
print('last line is : ', x)
w.close()
The for loop runs through the lines and x gets the last line on the final iteration.
with open("myfile.txt") as f:
    lines = f.readlines()
    first_row = lines[0]
    print(first_row)
    last_row = lines[-1]
    print(last_row)
Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.
def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()       # Read the first line.
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            try:
                f.seek(-2, 1)      # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()        # Read last line.
    return last
Nobody mentioned using reversed:
f = open(file, "r")
r = reversed(f.readlines())
last_line_of_file = next(r)
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount back from SEEK_END, find the second-to-last line ending, and then readline() the last line.
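A minimal sketch of that idea is shown below. It uses the file object's own seek() rather than os.lseek, and both the function name and the `max_line_len` parameter are hypothetical; it assumes no line is longer than `max_line_len` bytes.

```python
import os

def first_and_last(path, max_line_len=1024):
    """Return (first_line, last_line) as bytes.

    Assumes no line is longer than max_line_len bytes (hypothetical bound).
    """
    with open(path, "rb") as f:
        first = f.readline()
        size = f.seek(0, os.SEEK_END)
        # Read a window large enough to hold the whole last line,
        # without seeking before the start of the file.
        f.seek(max(0, size - 2 * max_line_len))
        tail = f.read()
    lines = tail.splitlines(True)   # keep the line endings
    last = lines[-1] if lines else first
    return first, last
```

If the bound is too small, the returned last line may be truncated at the front, so the estimate should err on the large side.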
# Wrapped in a function (the name is illustrative) so that the return statements are valid.
def last_line(filename):
    with open(filename, "rb") as f:  # Needs to be in binary mode for the seek from the end to work
        first = f.readline()
        if f.read(1) == b'':       # Nothing after the first line: it is also the last.
            return first
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            f.seek(-2, 1)          # ...jump back the read byte plus one more.
        last = f.readline()        # Read last line.
    return last
This is a modified version of the answers above that handles the case where the file has only one line.

How to store first N strings from a txt file in Python?

I'm trying to figure out how to get the first N strings from a txt file, and store them into an array. Right now, I have code that gets every string from a txt file, separated by a space delimiter, and stores it into an array. However, I want to be able to only grab the first N number of strings from it, not every single string. Here is my code (and I'm doing it from a command prompt):
import sys
f = open(sys.argv[1], "r")
contents = f.read().split(' ')
f.close()
I'm sure that the only line I need to fix is:
contents = f.read().split(' ')
I'm just not sure how to limit it here to N number of strings.
If the file is really big, but not too big--that is, big enough that you don't want to read the whole file (especially in text mode or as a list of lines), but not so big that you can't page it into memory (which means under 2GB on a 32-bit OS, but a lot more on 64-bit), you can do this:
import itertools
import mmap
import re
import sys
n = 5
# Notice that we're opening in binary mode. We're going to do a
# bytes-based regex search. This is only valid if (a) the encoding
# is ASCII-compatible, and (b) the spaces are ASCII whitespace, not
# other Unicode whitespace.
with open(sys.argv[1], 'rb') as f:
    # map the whole file into memory (length 0 means the whole file)--this
    # won't actually read more than a page or so beyond the last space
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # match and decode all space-separated words, but do it lazily...
    matches = re.finditer(br'(.*?)\s', m)
    bytestrings = (match.group(1) for match in matches)
    strings = (b.decode() for b in bytestrings)
    # ... so we can stop after 5 of them ...
    nstrings = itertools.islice(strings, n)
    # ... and turn that into a list of the first 5
    contents = list(nstrings)
Obviously you can combine steps together, even cramming the whole thing into a giant one-liner if you want. (An idiomatic version would be somewhere between that extreme and this one.)
If you're fine with reading the whole file (assuming it's not memory prohibitive to do so) you can just do this:
strings_wanted = 5
strings = open('myfile').read().split()[:strings_wanted]
That works like this:
>>> s = 'this is a test string with more than five words.'
>>> s.split()[:5]
['this', 'is', 'a', 'test', 'string']
If you actually want to stop reading exactly as soon as you've reached the nth word, you pretty much have to read a byte at a time. But that's going to be slow, and complicated. Plus, it's still not really going to stop reading after the nth word, unless you're reading in binary mode and decoding manually, and you disable buffering.
As long as the text file has line breaks (as opposed to being one giant 80MB line), and it's acceptable to read a few bytes past the nth word, a very simple solution will still be pretty efficient: just read and split line by line:
import sys
n = 5  # number of words wanted
f = open(sys.argv[1], "r")
contents = []
for line in f:
    contents += line.split()
    if len(contents) >= n:
        del contents[n:]
        break
f.close()
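The same line-by-line idea can also be written lazily with a generator, which avoids building and trimming the list by hand. This is only a sketch: the helper names are made up, and `io.StringIO` stands in for the open file.

```python
import io
import itertools

def iter_words(lines):
    """Yield whitespace-separated words one at a time, line by line."""
    for line in lines:
        for word in line.split():
            yield word

def first_n_words(lines, n):
    """Collect the first n words, stopping as soon as n are found."""
    return list(itertools.islice(iter_words(lines), n))

# StringIO stands in for an open text file (sample data is made up).
sample = io.StringIO("one two three\nfour five six\nseven\n")
contents = first_n_words(sample, 5)
print(contents)
```

As with the loop above, reading stops at the line containing the nth word, not at the word itself.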
What about just:
output = input[:3]
output will contain the first three strings in input.

How to remove extra space from end of the line before newline in python?

I'm quite new to python. I have a program which reads an input file with different characters and then writes all unique characters from that file into an output file with a single space between each of them. The problem is that after the last character there is one extra space (before the newline). How can I remove it?
My code:
import sys
inputName = sys.argv[1]
outputName = sys.argv[2]
infile = open(inputName,"r",encoding="utf-8")
outfile = open(outputName,"w",encoding="utf-8")
result = []
for line in infile:
    for c in line:
        if c not in result:
            result.append(c)
            outfile.write(c.strip())
            if(c == ' '):
                pass
            else:
                outfile.write(' ')
outfile.write('\n')
With the line outfile.write(' '), you write a space after each character (unless the character is a space). So you'll have to avoid writing the last space. Now, you can't tell whether any given character is the last one until you're done reading, so it's not like you can just put in an if statement to test that, but there are a few ways to get around that:
Write the space before the character c instead of after it. That way the space you have to skip is the one before the first character, and that you definitely can identify with an if statement and a boolean variable. If you do this, make sure to check that you get the right result if the first or second c is itself a space.
Alternatively, you can avoid writing anything until the very end. Just save up all the characters you see - you already do this in the list result - and write them all in one go. You can use
' '.join(strings)
to join together a list of strings (in this case, your characters) with spaces between them, and this will automatically omit a trailing space.
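The second suggestion could be sketched like this. It assumes whitespace characters themselves should not appear in the output (the original code is ambiguous on that point), and the function name is made up for illustration.

```python
def unique_chars_line(lines):
    """Return the unique non-whitespace characters, in order of first
    appearance, separated by single spaces and with no trailing space."""
    seen = []
    for line in lines:
        for c in line:
            # Assumption: whitespace characters are not part of the output.
            if c.strip() and c not in seen:
                seen.append(c)
    return ' '.join(seen)

# Any iterable of lines works here, including an open text file.
print(unique_chars_line(["abca\n", "b d\n"]))
```

The result can then be written in one go, followed by a single newline.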
Why are you adding that if block on the end?
Your program is adding the extra space on the end.
import sys
inputName = sys.argv[1]
outputName = sys.argv[2]
infile = open(inputName,"r",encoding="utf-8")
outfile = open(outputName,"w",encoding="utf-8")
result = []
for line in infile:
    charno = 0
    for c in line:
        if c not in result:
            result.append(c)
            outfile.write(c.strip())
            charno += 1
            if (c == ' '):
                pass
            elif charno >= len(line):
                pass
            else:
                outfile.write(' ')
outfile.write('\n')

Python - Parse file into outputs based on magic number/length

I'm a complete beginner to coding - only started 3 weeks ago, and really only have codecademy's Python course under my belt - so simple explanations would be really appreciated!
I'm trying to write a python script that reads a file as a HEX string, and then parses the file into individual output files based on finding a "magic number" within the HEX string.
EG: if my HEX string were "0011AABB00BBAACC00223344", I might want to parse this string into new output files based on the magic number "00", and telling python that each output should be 8 characters long. The output for the example string above should be 3 files that contain the HEX values:
"0011AABB"
"00BBAACC"
"00223344"
Here's what I have so far (assuming in this case that the string above is contained within the "hextests" file):
import os
import binascii
filename = "hextests"
# read file as a binary string
with open(filename, 'rb') as f:
    content = f.read()
# convert binary string to hex string
hexString = binascii.hexlify(content)
# define magic number as "00"
magic_N = "00"
# attempting to create a new substring called newFile that is equal to each instance magic_N repeats throughout the file for a length of 8 characters
for chars in hexString:
    newFile = ""
    if chars == magic_N:
        newFile += chars.len(9)
    # attempting to create a series of new output files for each instance of newFile - while incrementing the output file name
    if newFile != "":
        i = 0
        while os.path.exists("output_file%s.xyz" % i):
            i += 1
        fh = with open("output_file%s.xyz" % i, "wb"):
            newFile
I'm sure I have a lot of errors to work through on this - and it's likely more complicated than I think .... but my main question has to do with the proper way to define the chars and newFile variables. I'm pretty sure python sees chars as only single characters in the string, so it's failing because I'm attempting to search for a magic_N that is longer than 1 character. Am I correct that that is part of the issue?
Also, if you understand the main goal of this script, any other thoughts about things I should be doing differently?
Thanks so much for the help!
You can try something like this:
filename = "hextests"
# read file as a binary string
with open(filename, "rb") as f:
    content = f.read()
# You don't need this part if you want
# to parse the hex string as it is given in the file
# convert binary string to hex string
# hexString = binascii.hexlify(content)
# Remove the newline at the end of the string
hexString = content.strip()
# define magic number as "00" (bytes, because the file was opened in binary mode)
magic_N = b"00"
i = 0
j = 0
while i < len(hexString) - 1:
    index = hexString.find(magic_N, i)
    # This is the part which was incorrect in your code.
    with open("output_file_%s.xyz" % j, "wb") as output:
        output.write(hexString[i:i+8])
    i += 8
    j += 1
Note that you need to explicitly call the write method to write the data to the output file.
Here it is assumed that the chunks of data are exactly 8 hex symbols long and they always start with 00. So it's not a flexible solution but it gives you an idea on how to tackle the problem.
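A slightly more flexible variant would actually use the position returned by find, so that chunks need not sit back to back. The sketch below works on an ordinary string and returns the chunks instead of writing files; the function name and defaults are assumptions. It drops an incomplete trailing chunk, and it does not check that a match falls on a byte boundary.

```python
def split_on_magic(hex_string, magic="00", length=8):
    """Return every `length`-character chunk that starts with `magic`."""
    chunks = []
    i = hex_string.find(magic)
    # Stop when no further magic number exists or the chunk would be incomplete.
    while i != -1 and i + length <= len(hex_string):
        chunks.append(hex_string[i:i + length])
        # Continue searching after the chunk just taken.
        i = hex_string.find(magic, i + length)
    return chunks

print(split_on_magic("0011AABB00BBAACC00223344"))
```

Each returned chunk could then be written to its own output file, as in the loop above.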

How to loop over every 2 characters in a file in python

I'm trying to loop over every 2 characters in a file, do some tasks on them and write the resulting characters into another file.
So I tried to open the file and read the first two characters. Then I set the pointer to the 3rd character in the file, but it gives me the following error:
'bytes' object has no attribute 'seek'
This is my code:
the_file = open('E:\\test.txt',"rb").read()
result = open('E:\\result.txt',"w+")
n = 0
s = 2
m = len(the_file)
while n < m :
    chars = the_file.seek(n)
    chars.read(s)
    #do something with chars
    result.write(chars)
    n =+ 1
    m =+ 2
I should mention that test.txt contains only integers (numbers).
The content of test.txt is a series of binary digits (0's and 1's) like this:
01001010101000001000100010001100010110100110001001011100011010000001010001001
Although it's not the point here, I just want to replace every 2 characters with something else and write the result into result.txt.
Use the file object with seek, not its contents.
Use an if statement to break out of the loop, as you do not have the length.
Use n += not n =+.
Finally, seek forward by 2 and read 2.
Hopefully this will get you close to what you want.
Note: I changed the file names for the example
the_file = open('test.txt', "rb")
result = open('result.txt', "wb")
n = 0
s = 2
while True:
    the_file.seek(n)
    chars = the_file.read(2)
    if not chars:
        break
    #do something with chars
    print(chars)
    result.write(chars)
    n += 2
the_file.close()
result.close()
Note that because you are reading the file sequentially in chunks here, i.e. read(2) rather than read(), the seek is superfluous.
The seek() would only be required if you wished to alter the position pointer within the file, for example if you wanted to start reading at the 100th byte (seek(99)).
The above could be written as:
the_file = open('test.txt', "rb")
result = open('result.txt', "wb")
while True:
    chars = the_file.read(2)
    if not chars:
        break
    #do something with chars
    print(chars)
    result.write(chars)
the_file.close()
result.close()
You were trying to use .seek() method on a string, because you thought it was a File object, but the .read() method of files turns it into a string.
Here's a general approach I might take to what you were going for:
# open the file and load its contents as a string file_contents
with open('E:\\test.txt', "r") as f:
    file_contents = f.read()
# do the stuff you were doing
n = 0
s = 2
m = len(file_contents)
# initialize a result string
result = ""
# iterate over the file_contents, incrementing by 2, adding to results
for i in range(0, m, 2):
    result += file_contents[i]
# write to result.txt
with open('E:\\result.txt', 'w') as f:
    f.write(result)
Edit: It seems like there was a change to the question. If you want to change every second character, you'll need to make some adjustments.
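One possible adjustment is to slice two characters at a time and look each pair up in a replacement table. The table below is purely hypothetical, since the question does not say what each pair should become; unknown or incomplete pairs are passed through unchanged.

```python
def replace_pairs(text, table):
    """Replace each 2-character chunk of `text` using `table`.

    Chunks missing from the table (including a trailing single
    character) are kept as they are.
    """
    out = []
    for i in range(0, len(text), 2):
        pair = text[i:i + 2]
        out.append(table.get(pair, pair))
    return ''.join(out)

# Hypothetical mapping, for illustration only.
table = {"00": "A", "01": "B", "10": "C", "11": "D"}
result = replace_pairs("0100101", table)
print(result)
```

The returned string can then be written to result.txt in a single write call.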
