The following is my code; it's giving me a MemoryError:
with open('E:\\Book\\1900.txt', 'r', encoding='utf-8') as readFile:
    for line in readFile:
        sepFile = readFile.read().lower()
        words_1900 = re.findall('\w+', sepFile)
Output:
Traceback (most recent call last):
File "C:\Python34\50CommonWords.py", line 13, in <module>
sepFile = readFile.read().lower()
MemoryError
Instead of reading the entire file into memory, read it line by line and use collections.Counter() to incrementally keep track of the words and their counts across the whole file. At the end, use the Counter.most_common() method to get the 50 most common words. Example:
import collections
import re

cnt = collections.Counter()
with open('E:\\Book\\1900.txt', 'r', encoding='utf-8') as readFile:
    for line in readFile:
        cnt.update(re.findall(r'\w+', line.lower()))

print("50 most common are")
print([word for word, count in cnt.most_common(50)])  # take only the words, not the counts
This approach can still run into a MemoryError if the file contains a very large number of distinct words, since the Counter itself has to fit in memory.
Also, Counter.most_common() returns a list of tuples, where the first element of each tuple is the word and the second is its count.
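For instance, a quick interactive illustration of that structure (with made-up words):
>>> cnt = collections.Counter("the cat and the hat and the bat".split())
>>> cnt.most_common(2)
[('the', 3), ('and', 2)]
>>> [word for word, count in cnt.most_common(2)]
['the', 'and']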
The following code returns a list (<class 'list'>) in Python, but everything I do to access that list fails:
indexing the list fails,
enumerating the list fails.
For example, if I just print(s) I get
['0.5211', '3.1324']
but if I access the indices
Traceback (most recent call last):
File "parse-epoch.py", line 11, in <module>
print("losses.add({}, {})".format(s[0], s[1]))
IndexError: list index out of range
Why can't I access the elements of the list?
import re

with open('epoch.txt', 'r') as f:
    content = f.readlines()

content = [x.strip() for x in content]

for line in content:
    s = re.findall("\d+\.\d+", line)
    #print(s)
    print("losses.add({}, {})".format(s[0], s[1]))
You should check what print(s) outputs again. Your issue is likely a line where s does not contain two values: if re.findall finds fewer than two matches on that line, then s[1] (or even s[0]) does not exist, so indexing it fails.
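For example, one way to make the loop tolerant of such lines is to skip any line that doesn't yield two matches (a sketch based on the question's code; adjust the handling to your needs):

import re

with open('epoch.txt', 'r') as f:
    for line in f:
        s = re.findall(r"\d+\.\d+", line)
        if len(s) < 2:
            # this line doesn't contain two decimal numbers; skip it
            continue
        print("losses.add({}, {})".format(s[0], s[1]))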
I'm trying to start reading a file from line 3, but I can't.
I've tried to use readlines() plus the index number of the line, as seen below:
x = 2
f = open('urls.txt', "r+").readlines()[x]
line = next(f)
print(line)
but I get this result:
Traceback (most recent call last):
File "test.py", line 441, in <module>
line = next(f)
TypeError: 'str' object is not an iterator
I would like to be able to set any starting line, and from there, each time I use next() it should move on to the next line.
IMPORTANT: since all my existing code already uses next(f), the solution needs to keep working with it.
Try this (uses itertools.islice):
from itertools import islice

f = open('urls.txt', 'r+')
start_at = 3
file_iterator = islice(f, start_at - 1, None)

# to demonstrate
while True:
    try:
        print(next(file_iterator), end='')
    except StopIteration:
        print('End of file!')
        break
f.close()
urls.txt:
1
2
3
4
5
Output:
3
4
5
End of file!
This solution is better than readlines because it doesn't load the entire file into memory; lines are only read as needed. It also skips the leading lines inside islice (implemented in C) rather than in a Python-level loop, making it faster than #MadPhysicist's answer.
Also, consider using the with syntax to guarantee the file gets closed:
with open('urls.txt', 'r+') as f:
    # do whatever
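Combining both ideas, a sketch of the whole pattern with the question's urls.txt might look like:

from itertools import islice

start_at = 3
with open('urls.txt', 'r') as f:
    file_iterator = islice(f, start_at - 1, None)  # skip the first start_at - 1 lines
    for line in file_iterator:
        print(line, end='')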
The readlines method returns a list of strings for the lines. So when you take readlines()[2] you're getting the third line, as a string. Calling next on that string then makes no sense, so you get an error.
The easiest way to do this is to slice the list: readlines()[x:] gives a list of everything from line x onwards. Then you can use that list however you like.
If you have your heart set on an iterator, you can turn a list (or pretty much anything) into an iterator with the iter builtin function. Then you can next it to your heart's content.
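For example, a minimal sketch of that approach (using the question's x and urls.txt):

x = 2
with open('urls.txt') as f:
    it = iter(f.readlines()[x:])  # slice off the first x lines, then make an iterator

print(next(it))  # third line of the file
print(next(it))  # fourth line, and so on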
The following code will allow you to use an iterator to print the first line:
In [1]: path = '<path to text file>'
In [2]: f = open(path, "r+")
In [3]: line = next(f)
In [4]: print(line)
This code will allow you to print the lines starting from the xth line:
In [1]: path = '<path to text file>'
In [2]: x = 2
In [3]: f = iter(open(path, "r+").readlines()[x:])
In [4]: line = next(f)
In [5]: print(line)
Edit: Edited the solution based on #Tomothy32's observation.
The line you printed returns a string:
open('urls.txt', "r+").readlines()[x]
open returns a file object. Its readlines method returns a list of strings. Indexing with [x] returns the third line in the file as a single string.
The first problem is that you open the file without closing it. The second is that your index doesn't specify a range of lines until the end. Here's an incremental improvement:
with open('urls.txt', 'r+') as f:
    lines = f.readlines()[x:]
Now lines is a list of all the lines you want. But you first read the whole file into memory, then discarded the first two lines. Also, a list is an iterable, not an iterator, so to use next on it effectively, you'd need to take an extra step:
lines = iter(lines)
If you want to harness the fact that the file is already a rather efficient iterator, apply next to it as many times as you need to discard unwanted lines:
with open('urls.txt', 'r+') as f:
    for _ in range(x):
        next(f)
    # now use the file
    print(next(f))
After the for loop, any read operation you do on the file will start from the third line, whether it be next(f), f.readline(), etc.
There are a few other ways to strip the first lines. In all cases, including the example above, next(f) can be replaced with f.readline():
for n, _ in enumerate(f):
    if n == x - 1:
        break
or
for _ in zip(f, range(x)): pass
After you run either of these loops, the file has been advanced past the first x lines, so next(f) picks up right where the range-based loop above would.
Just call next(f) as many times as you need to. (There's no need to overcomplicate this with itertools, nor to slurp the entire file with readlines.)
lines_to_skip = 3
with open('urls.txt') as f:
    for _ in range(lines_to_skip):
        next(f)
    for line in f:
        print(line.strip())
Output:
% cat urls.txt
url1
url2
url3
url4
url5
% python3 test.py
url4
url5
I'm trying to store contents of a file into a dictionary and I want to return a value when I call its key. Each line of the file has two items (acronyms and corresponding phrases) that are separated by commas, and there are 585 lines. I want to store the acronyms on the left of the comma to the key, and the phrases on the right of the comma to the value. Here's what I have:
def read_file(filename):
    infile = open(filename, 'r')
    for line in infile:
        line = line.strip()  # remove newline character at end of each line
        phrase = line.split(',')
        newDict = {'phrase[0]':'phrase[1]'}
    infile.close()
And here's what I get when I try to look up the values:
>>> read_file('acronyms.csv')
>>> acronyms=read_file('acronyms.csv')
>>> acronyms['ABT']
Traceback (most recent call last):
File "<pyshell#65>", line 1, in <module>
acronyms['ABT']
TypeError: 'NoneType' object is not subscriptable
>>>
If I add return newDict to the end of the body of the function, it obviously just returns {'phrase[0]':'phrase[1]'} when I call read_file('acronyms.csv'). I've also tried {phrase[0]:phrase[1]} (no single quotation marks) but that returns the same error. Thanks for any help.
def read_acronym_meanings(path: str):
    with open(path) as f:
        acronyms = dict(l.strip().split(',') for l in f)
    return acronyms
First off, you are creating a new dictionary at every iteration of the loop. Instead, create one dictionary before the loop and add an entry each time you process a line. Second, 'phrase[0]' with the quotes is a literal string, not a reference to the phrase variable you just created.
Also, try using the with keyword so that you don't have to explicitly close the file later.
def read(filename):
    newDict = {}
    with open(filename, 'r') as infile:
        for line in infile:
            line = line.strip()  # remove newline character at end of each line
            phrase = line.split(',')
            newDict[phrase[0]] = phrase[1]
    return newDict
def read_file(filename):
    infile = open(filename, 'r')
    newDict = {}
    for line in infile:
        line = line.strip()  # remove newline character at end of each line
        phrase = line.split(',', 1)  # split a maximum of one time
        newDict.update({phrase[0]: phrase[1]})
    infile.close()
    return newDict
Your original creates a new dictionary every iteration of the loop.
My code:
fd = open('C:\Python27\\alu.txt', 'r')
D = dict(line.split("\n") for line in fd)
It shows the following error
Traceback (most recent call last):
File "C:\Users\ram\Desktop\rest_enz3.py", line 8, in <module>
D = dict(line.split("\n") for line in fd)
ValueError: dictionary update sequence element #69 has length 1; 2 is required
The only newline you'll ever find in line is the one at the very end (if it's there at all), so line.split("\n") gives you either ["content", ""] or, for a final line with no trailing newline, just ["content"]; it never splits a line into a real key/value pair. Perhaps you meant to use a different delimiter. If your file looks like...
lorem:ipsum
dolor:sit
Then you should do
D = dict(line.strip().split(":") for line in fd)
As Kevin above points out, line.split("\n") is a bit odd, but maybe the file is just a list of dictionary keys?
Regardless, the error you get implies that the line.split("\n") returns just a single element (in other words, the line is missing the trailing newline). For example:
"Key1\n".split("\n") returns ["Key1", ""]
while
"Key1".split("\n") returns ["Key1"]
dict([["key1", ""],["key2", ""]])
is fine, while
dict([["key1, ""],["key2"]])
returns the error you quote
It may be as simple as editing the file in question and adding a new line at the end of the file.
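If you'd rather handle this in code than edit the file, one possible sketch is to pad each split so every line yields a key/value pair even when the value is missing (whether an empty value is acceptable here is an assumption):

with open('C:\\Python27\\alu.txt', 'r') as fd:
    # pad single-element results with an empty string so dict() always gets pairs
    D = dict((line.split("\n") + [''])[:2] for line in fd)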
File example: alu.txt
GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGA
TCACGAGGTCAGGAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAA
AAATACAAAAATTAGCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGC
TGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCC
Reading the file:
with open('C:\\Python27\\alu.txt', 'r') as fp:
    dna = {'Key %i' % i: j.strip() for i, j in enumerate(fp.readlines(), 1)}

for key, value in dna.iteritems():
    print key, ':', value
Key 1 : GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGA
Key 2 : TCACGAGGTCAGGAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAA
Key 3 : AAATACAAAAATTAGCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGC
If you're using Python 3, change the iteration to:
for key, value in dna.items():
    print(key, ':', value)
I have code to count words in a file. It works for small files (less than about 500 MB). I have to keep the entire file in memory before counting, otherwise the counts come out wrong. This code reads the file into RAM and processes it there. If I read line by line (readline()), the counts are wrong.
import io
from collections import Counter

with io.open('Prabhodhanam.txt', 'r', encoding='utf8') as infh:
    words = infh.read().split()

with open('file.txt', 'wb') as f:
    for word, count in Counter(words).most_common(10000000):
        f.write(u'{} {}\n'.format(word, count).encode('utf8'))
When the file is big, it produces a MemoryError.
When I use readline(), Counter() counts only the words in that line instead of the whole file.
How can I count the words without storing the entire file in memory?
Can you please check this code? I don't know whether this helps or not.
def filePro(filename):
    f = open(filename, 'r')
    wordcount = 0
    for lines in f:
        f1 = lines.split()
        wordcount = wordcount + len(f1)
    f.close()
    print 'word count:', str(wordcount)

filePro(raw_input("enter file name:"))
You don't have to hold the entire file in memory. You can count the words line by line, as long as you keep a single Counter across all lines rather than building a new one per line (so a one-shot expression over the whole file won't work here).
import collections

counter = collections.Counter()
with open('Prabhodhanam.txt', 'r', encoding='utf8') as infh:
    for line in infh:
        counter.update(line.strip().split())

with open('file.txt', 'wb') as f:
    for word, count in counter.most_common(10000000):
        f.write(u'{} {}\n'.format(word, count).encode('utf8'))