Python MemoryError when reading a file of 99999999 strings

Windows 10 Pro 64-bit, with the 64-bit version of Python installed.
The file is 1.80 GB.
How do I fix this error and print all the strings?
def count():
    reg = open('link_genrator.txt', 'r')
    s = reg.readline().split()
    print(s)

reg.read().split('\n') will give a list of all lines.

Why don't you just do s = reg.read(65536).splitlines()? This will give you a hint about the structure of the content, and you can then play with the size you read in each chunk.
Once you know a bit more, you can loop over those chunks and sum up the number of lines.
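As a rough sketch of that idea (Python 3; the function name count_lines_chunked is mine, not from the question), counting lines chunk by chunk keeps memory use bounded no matter how big the file is:

```python
def count_lines_chunked(path, chunk_size=65536):
    """Count newline characters by reading fixed-size chunks,
    so memory use stays bounded regardless of file size."""
    count = 0
    with open(path, 'rb') as f:  # binary mode: count raw b'\n' bytes
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            count += chunk.count(b'\n')
    return count
```

Note that this counts newline characters, so a final line without a trailing \n is not counted; adjust by one if that matters for your file.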

After looking at the answers and trying to understand what the initial question could be, I arrive at a more complete answer than my previous one.
Looking at the question and the code in the sample function, I now assume the following:
it seems he wants to separate the contents of a file into words and print them
from the function name I suppose he would like to count all these words
the whole file is quite big, and thus Python stops with a memory error
Handling such large files obviously asks for a different treatment than the usual one. For example, I do not see any use in printing all the separated words of such a file to the console. Of course it might make sense to count these words or search for patterns in them.
To show as an example how one might treat such big files, I wrote the following code. It is meant as a starting point for further refinements and changes according to your own requirements.
MAXSTR = 65536
MAXLP = 999999999
WORDSEP = ';'

lineCnt = 0
wordCnt = 0
lpCnt = 0

fn = 'link_genrator.txt'
fin = open(fn, 'r')
try:
    while lpCnt < MAXLP:
        pos = fin.tell()
        s = fin.read(MAXSTR)
        lines = s.splitlines(True)
        if len(lines) == 0:
            break
        # count words of each line
        k = 0
        for l in lines:
            lineWords = l.split(WORDSEP)  # semicolon separates each word
            k += len(lineWords)           # sum up words of each line
        wordCnt += k - 1  # last word most probably not complete: subtract one
        # count lines
        lineCnt += len(lines) - 1
        # correction when the chunk ends with \n
        if lines[len(lines)-1][-1] == '\n':
            lineCnt += 1
            wordCnt += 1
        lpCnt += 1
        print('{0} {4} - {5} act Pos: {1}, act lines: {2}, act words: {3}'.format(lpCnt, pos, lineCnt, wordCnt, lines[0][0:10], lines[len(lines)-1][-10:]))
finally:
    fin.close()
lineCnt += 1
print('Total line count: {}'.format(lineCnt))
That code works for files up to 2 GB (tested with 2.1 GB). The two constants at the beginning let you play with the size of the read-in chunks and limit the amount of text processed. During testing you can then process just a subset of the whole data, which goes much faster.
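For comparison, if only the totals matter (rather than chunk-by-chunk progress), plain line iteration is usually sufficient, because Python reads the file lazily with internal buffering. A minimal sketch, assuming the same semicolon word separator as above (the function name is mine):

```python
def count_lines_and_words(path, word_sep=';'):
    """Iterate line by line; the file object buffers internally,
    so memory use stays small even for multi-gigabyte files."""
    line_count = 0
    word_count = 0
    with open(path, 'r') as f:
        for line in f:
            line_count += 1
            word_count += len(line.rstrip('\n').split(word_sep))
    return line_count, word_count
```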

Related

How to extract specific line from large number (4.5 M) of files and debug properly?

I have a question regarding data manipulation and extraction.
I have a large number of files (about 4.5 million) from which I want to extract the third row (line) of each and save it to a new file. However, there is a small discrepancy of about 5 lines between the number of files and the number of lines extracted.
I have tried debugging to see where the error occurs. For debugging purposes I can think of two possible problems:
(1) I am counting the number of lines incorrectly (I have tried two algorithms for row count and they seem to match)
(2) It reads an empty string, which I have also tried to debug in the code. What other possibilities are there that I could look at to debug?
Algorithm for calculating file length 1
def file_len(filename):
    with open(filename) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
Algorithm for calculating file length 2
def file_len2(filename):
    i = sum(1 for line in open(filename))
    return i
Algorithm for extracting line no. 3
def extract_line(filename):
    f = open(filename, 'r')
    for i, line in enumerate(f):
        if i == 2:  # Line number 3
            a = line
            if not a.strip():
                print("Error!")
    f.close()
    return a
There were no error messages.
I expect the number of input files to match the number of lines in the output file, but there is a small discrepancy of about 5 lines out of 4.5 million lines between the two.
Suggestion: if a already exists (e.g. it is set globally), a file with fewer than three lines will not fail — you will silently get a stale value instead of an error.
(I would put this in comments but I don’t have enough rep)
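A minimal reproduction of that failure mode, using an in-memory list of lines instead of a real file so it is easy to run (the function name is mine): with fewer than three lines, a is never bound, so the function raises UnboundLocalError; if the same logic ran at module level with a left over from a previous file, it would silently reuse stale data instead.

```python
def extract_line_unsafe(lines):
    """Hypothetical stand-in mirroring the question's extract_line logic."""
    for i, line in enumerate(lines):
        if i == 2:  # line number 3
            a = line
    return a  # fails when the third line never existed
```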
Your general idea is correct, but things can be made a bit simpler.
I also suppose that the discrepancy is due to files with an empty third line, or with fewer than 3 lines.
def extract_line(filename):
    with open(filename) as f:
        for line_no, line_text in enumerate(f):
            if line_no == 2:
                return line_text.strip()  # Stop searching, we found the third line
    # Here the file f is closed because the `with` statement's scope ended.
    # None is implicitly returned here.

def process_files(source_of_filenames):
    processed = 0  # Count the files where we found the third line.
    for filename in source_of_filenames:
        third_line = extract_line(filename)
        if third_line:
            processed += 1  # Account for the success.
            # Write the third line; given as an illustration.
            with open(filename + ".3rd-line", "w") as f:
                f.write(third_line)
        else:
            print("File %s has a problem with third line" % filename)
    return processed

def main():  # I don't know the source of your file names.
    filenames = ...  # Produce a list or a generator here.
    processed = process_files(filenames)
    print("Processed %d files successfully" % processed)
Hope this helps.

Slice variable from specified letter to specified letter in line that varies in length

New to the site so I apologize if I format this incorrectly.
So I'm searching a file for lines containing
Server[x] ip.ip.ip.ip response=235ms accepted....
where x can be any number greater than or equal to 0, then storing that information in a variable named line.
I'm then printing this content to a tkinter GUI and its way too much information for the window.
To resolve this I thought I would slice the information down with a return line[15:30] in the function but the info that I want off these lines does not always fall between 15 and 30.
To resolve this I tried to make a loop with
return line[cnt1:cnt2]
checking cnt1 and cnt2 in a loop until cnt1 meets "S" and cnt2 meets the "a" from "accepted".
The problem is that I'm new to Python and I can't get the loop to work.
def serverlist(count):
    try:
        with open("file.txt", "r") as f:
            searchlines = f.readlines()
        if 'f' in locals():
            for i, line in enumerate(reversed(searchlines)):
                cnt = 90
                if "Server["+str(count)+"]" in line:
                    if line[cnt] == "t":
                        cnt += 1
                    return line[29:cnt]
    except WindowsError as fileerror:
        print(fileerror)
I did a reversed() on the line reading because the lines I am looking for repeat over and over every couple of minutes in the text file.
Originally I wanted to scan from the bottom and stop when it got to server[0] but this loop wasn't working for me either.
I gave up and started just running serverlist(count) and specifying the server number I was looking for instead of just running serverlist().
Hopefully when I understand the problem with my original loop I can fix this.
End goal here:
file.txt has multiple lines with
<timestamp/date> Server[x] ip.ip.ip.ip response=<time> accepted <unneeded garbage>
I want to cut just the Server[x] and the response time out of that line and show it somewhere else using a variable.
The line can range from Server[0] to Server[999] and the same response times are checked every few minutes so I need to avoid duplicates and only get the latest entries at the bottom of the log.
I'm sorry this is lengthy and confusing.
EDIT:
Here is what I keep thinking should work but it doesn't:
def serverlist():
    ips = []
    cnt = 0
    with open("file.txt", "r") as f:
        for line in reversed(f.readlines()):
            while cnt >= 0:
                if "Server["+str(cnt)+"]" in line:
                    ips.append(line.split())  # split on spaces
                    cnt += 1
    return ips
My test log file has server[4] through server[0]. I would think the above would read from the bottom of the file, print the server[4] line, then the server[3] line, etc., and stop when it hits 0. In theory this would keep it from reading every line in the file (so it runs faster) and would give me only the latest data. BUT when I run this with while cnt >= 0 it gets stuck in a loop and runs forever. If I run it with any other value like 1 or 2, it returns a blank list []. I assume I am misunderstanding how this works.
Here is my first approach:
def serverlist(count):
    with open("file.txt", "r") as f:
        for line in f.readlines():
            if "Server[" + str(count) + "]" in line:
                return line.split()[1]  # split on spaces
    return False

print serverlist(30)
# ip.ip.ip.ip
print serverlist(";-)")
# False
You can change the index in line.split()[1] to get a specific space-separated field of the line.
Edit: Sure, just remove the if condition to get all IPs:
def serverlist():
    ips = []
    with open("file.txt", "r") as f:
        for line in f.readlines():
            if line.strip().startswith("Server["):
                ips.append(line.split()[1])  # split on spaces
    return ips
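For the stated end goal (pull out Server[x] and its response time, keeping only the latest entry per server), a regular expression may be simpler than index arithmetic. A sketch assuming the line format shown in the question (the names PATTERN and latest_responses are mine):

```python
import re

# Matches e.g. "... Server[42] 10.0.0.1 response=235ms accepted ..."
PATTERN = re.compile(r'(Server\[\d+\]).*?response=(\S+)')

def latest_responses(lines):
    """Return {server: response_time}; later lines overwrite
    earlier ones, so only the latest entry per server survives."""
    latest = {}
    for line in lines:
        m = PATTERN.search(line)
        if m:
            latest[m.group(1)] = m.group(2)
    return latest
```

Because later entries overwrite earlier ones, you can iterate the file in normal order and skip reversed() entirely.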

Counting the lines of a long file not working correctly in Python

I've been trying to count the lines of a very long file (more than 635000 lines).
I've tried with:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
and also:
num_lines = sum(1 for line in open(fname))
Both work perfectly for files with fewer lines. I checked with a 5-line file and it works fine; the output is 5.
But with a long file, which has exactly 635474 lines, the output of both methods posted above is 635466.
I know that the file has 635474 lines, not 635466 lines because I'm creating strings inside of the file and the last two lines are:
alm_asdf_alarm635473=.NOT USED
alm_asdf_alarm635474=.NOT USED
And also because if I open the file with Notepad++ the last line is counted as 635474.
What's the logic behind this? Why is it counting fewer lines than there really are?
Thanks in advance.
If all your lines have the same structure, you could try a program like this:
import re
num = re.compile('[^0-9]*([0-9]+)')
delta = 1  # initial delta
with open(...) as fd:
    for i, line in enumerate(fd, delta):
        m = num.match(line)
        if i != int(m.group(1)):
            print i, "th line for number ", int(m.group(1))
            break
It should be enough to find the first line where there is a difference (delta is there for the case where the first line is internally numbered 1 and not 0). Then you can more easily see in Notepad++ where the problem really comes from.
Note : if only some lines have this structure, you could use that variation :
m = num.match(line)
if (m is not None) and (i != int(m.group(1))):
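A Python 3 rendering of the same idea, returning the first mismatch instead of printing it (the function name first_mismatch is mine):

```python
import re

NUM = re.compile(r'[^0-9]*([0-9]+)')

def first_mismatch(lines, delta=1):
    """Return (line_position, embedded_number) for the first line whose
    embedded number disagrees with its position, or None if all match."""
    for i, line in enumerate(lines, delta):
        m = NUM.match(line)
        if m is not None and i != int(m.group(1)):
            return i, int(m.group(1))
    return None
```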

making lists from data in a file in python

I'm really new at Python and needed help making a list from data in a file. The list contains numbers on separate lines (separated by "\n"; this is something I don't want to change to CSV). The amount of numbers saved can change at any time, because the data is saved to the file as follows:
Program 1:
# creates a new file for writing
numbersFile = open('numbers.txt', 'w')
# determines how many times the loop will iterate
totalNumbers = input("How many numbers would you like to save in the file? ")
# loop to get numbers
count = 0
while count < totalNumbers:
number = input("Enter a number: ")
# writes number to file
numbersFile.write(str(number) + "\n")
count = count + 1
This is the second program that uses that data. This is the part that is messy and that I'm unsure of:
Program 2:
maxNumbers = input("How many numbers are in the file? ")
numFile = open('numbers.txt', 'r')
total = 0
count = 0
while count < maxNumbers:
    total = total + numbers[count]
    count = count + 1
I want to use the data gathered from program 1 to get a total in program 2. I wanted to put it in a list because the amount of numbers can vary. This is for an introduction to computer programming class, so I need a SIMPLE fix. Thank you to all who help.
Your first program is fine, although you should use raw_input() instead of input() (which also makes it unnecessary to call str() on the result).
Your second program has a small problem: You're not actually reading anything from the file. Fortunately, that's easy in Python. You can iterate over the lines in a file using
for line in numFile:
    # line now contains the current line, including a trailing \n, if present
so you don't need to ask for the total of numbers in your file at all.
If you want to add the numbers, don't forget to convert the string line to an int first:
total += int(line) # shorthand for total = total + int(line)
There remains one problem (thanks #tobias_k!): The last line of the file will be empty, and int("") raises an error, so you could check that first:
for line in numFile:
    if line.strip():
        total += int(line)
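Putting the pieces together, program 2 then reduces to a short loop. A sketch, assuming the numbers.txt layout written by program 1 (the function wrapper is mine, added so it is easy to reuse and test):

```python
def sum_numbers(path):
    """Sum one integer per line, skipping blank lines."""
    total = 0
    with open(path, 'r') as num_file:
        for line in num_file:
            line = line.strip()  # drop the trailing newline
            if line:             # skip empty lines at the end
                total += int(line)
    return total
```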

Two simple questions about python

I have 2 simple questions about python:
1. How to get the number of lines of a file in Python?
2. How to seek a file object to the last line easily?
Lines are just data delimited by the newline char '\n'.
1) Since lines are variable length, you have to read the entire file to know where the newline chars are, so you can count how many lines there are:
count = 0
for line in open('myfile'):
    count += 1
print count, line  # it will be the last line
2) reading a chunk from the end of the file is the fastest method to find the last newline char.
import os

def seek_newline_backwards(file_obj, eol_char='\n', buffer_size=200):
    if not file_obj.tell():
        return  # already at the beginning of the file
    # All lines end with \n, including the last one, so assume we are just
    # after an end-of-line char
    file_obj.seek(-1, os.SEEK_CUR)
    while file_obj.tell():
        amount = min(buffer_size, file_obj.tell())
        file_obj.seek(-amount, os.SEEK_CUR)
        data = file_obj.read(amount)
        eol_pos = data.rfind(eol_char)
        if eol_pos != -1:
            file_obj.seek(eol_pos - len(data) + 1, os.SEEK_CUR)
            break
        file_obj.seek(-len(data), os.SEEK_CUR)
You can use that like this:
f = open('some_file.txt')
f.seek(0, os.SEEK_END)
seek_newline_backwards(f)
print f.tell(), repr(f.readline())
Let's not forget
f = open("myfile.txt")
lines = f.readlines()
numlines = len(lines)
lastline = lines[-1]
NOTE: this reads the whole file into memory as a list. Keep that in mind in case the file is very large.
The easiest way is simply to read the file into memory. eg:
f = open('filename.txt')
lines = f.readlines()
num_lines = len(lines)
last_line = lines[-1]
However for big files, this may use up a lot of memory, as the whole file is loaded into RAM. An alternative is to iterate through the file line by line. eg:
f = open('filename.txt')
num_lines = sum(1 for line in f)
This is more efficient, since it doesn't load the entire file into memory but only looks at one line at a time. If you want the last line as well, you can keep track of the lines as you iterate and get both answers like this:
f = open('filename.txt')
num_lines = 0
last_line = None
for line in f:
    num_lines += 1
    last_line = line
print "There were %d lines. The last was: %s" % (num_lines, last_line)
One final possible improvement, if you need only the last line, is to start at the end of the file and seek backwards until you find a newline character. Here's a question which has some code doing this. If you need the line count as well, though, there's no alternative except to iterate through all the lines in the file.
For small files that fit in memory, how about using str.count() to get the number of lines of a file:
line_count = open("myfile.txt").read().count('\n')
I'd like to add to the other solutions that some of them (those that look for \n) will not work with files using OS 9-style line endings (\r only), and that files may contain an extra blank line at the end because many text editors append one for some curious reason, so you might or might not want to add a check for it.
The only way to count lines [that I know of] is to read all lines, like this:
count = 0
for line in open("file.txt"):
    count = count + 1
After the loop, count will have the number of lines read.
For the first question there are already a few good answers; I'll suggest #Brian's as the best (most Pythonic, line-ending proof, and memory efficient):
f = open('filename.txt')
num_lines = sum(1 for line in f)
For the second one, I like #nosklo's, but modified to be more general it should be:
import os
f = open('myfile')
f.seek(0, os.SEEK_END)
to = f.tell()
found = -1
while found == -1 and to > 0:
    fro = max(0, to - 1024)
    f.seek(fro)
    chunk = f.read(to - fro)
    found = chunk.rfind("\n")
    to -= 1024
if found != -1:
    found += fro
It searches in chunks of 1 KB from the end of the file until it finds a newline character or reaches the start of the file. At the end of the code, found is the index of the last newline character.
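The same backward chunk search can be packaged as a self-contained Python 3 function that also reads the last line once the newline is found (the name last_line and the binary-mode handling are my additions):

```python
import os

def last_line(path, chunk=1024):
    """Scan backwards in fixed-size chunks for the final newline,
    then read from just past it to the end of the file."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        to = f.tell()
        found = -1
        while found == -1 and to > 0:
            fro = max(0, to - chunk)
            f.seek(fro)
            data = f.read(to - fro)
            found = data.rfind(b'\n')
            if found != -1:
                found += fro  # convert chunk offset to file offset
            to = fro
        f.seek(found + 1)  # found == -1 means no newline: seek(0)
        return f.read().decode()
```

If the file ends with a trailing newline, the "last line" found this way is empty, just as with the snippet above.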
Answer to the first question (beware of poor performance on large files when using this method):
f = open("myfile.txt").readlines()
print len(f) - 1
Answer to the second question:
f = open("myfile.txt").read()
print f.rfind("\n")
P.S. Yes, I do understand that this only suits small files and simple programs. I think I will not delete this answer, however useless for real use cases it may seem.
Answer 1:
x = open("file.txt")
opens the file, so x is associated with file.txt
y = x.readlines()
returns all the lines as a list
length = len(y)
assigns the length of the list to length
Or in one line:
length = len(open("file.txt").readlines())
Answer 2:
last = y[-1]
returns the last element of the list
Approach:
Open the file in read mode and assign it to a file object named "file".
Assign 0 to the counter variable.
Read the content of the file using the read function and assign it to a variable named "Content".
Create a list of the content, where the elements are split wherever an "\n" is encountered.
Traverse the list using a for loop, incrementing the counter variable for each non-empty element.
Finally, the value now present in the variable Counter is displayed, which is the required result of this program.
Python program to count the number of lines in a text file
# Opening a file
file = open("filename", "file mode")  # file mode like r, w, a...
Counter = 0

# Reading from file
Content = file.read()
CoList = Content.split("\n")
for i in CoList:
    if i:
        Counter += 1

print("This is the number of lines in the file")
print(Counter)
The above code will print the number of lines present in a file. Replace filename with the name of your file (with its extension), and the file mode with read mode, 'r'.
