I have code to count the words in a file. It works for small files (less than about 500 MB), but I have to keep the entire file in memory before counting, otherwise the counts come out wrong. The code reads the whole file into RAM and then processes it. If I read line by line with readline(), the counts are wrong.
import io
from collections import Counter

with io.open('Prabhodhanam.txt', 'r', encoding='utf8') as infh:
    words = infh.read().split()

with open('file.txt', 'wb') as f:
    for word, count in Counter(words).most_common(10000000):
        f.write(u'{} {}\n'.format(word, count).encode('utf8'))
When the file is big it produces a MemoryError.
When I use readline(), Counter() counts the words of a single line instead of the whole file.
How can I count the words without storing the entire file in memory?
Can you please also check this code? I don't know whether it helps or not.
def filePro(filename):
    f = open(filename, 'r')
    wordcount = 0
    for line in f:
        wordcount = wordcount + len(line.split())
    f.close()
    print 'word count:', wordcount

filePro(raw_input("enter file name: "))
You don't have to have the entire file in memory. You can count the words line by line (but of course you mustn't reset the counter after each line, so build one Counter up front and update it for every line).
import collections

counter = collections.Counter()
with open('Prabhodhanam.txt', 'r', encoding='utf8') as infh:
    for line in infh:
        counter.update(line.strip().split())

with open('file.txt', 'wb') as f:
    for word, count in counter.most_common(10000000):
        f.write(u'{} {}\n'.format(word, count).encode('utf8'))
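If you prefer a single expression, a generator can feed the same stream of words to Counter one line at a time; a minimal sketch, assuming the same file name as above:

import collections

# Only one line of the file is in memory at a time; the generator
# yields words as the Counter consumes them.
with open('Prabhodhanam.txt', 'r', encoding='utf8') as infh:
    counter = collections.Counter(
        word for line in infh for word in line.strip().split()
    )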
I have a folder with .txt files in it. My code finds the line count and character count of each of these files and saves the output for all files into a single CSV file (LineCount.csv) in a different directory. For some reason the CSV file repeats the line and character counts of the last file on every row, even though the print statement produces the correct results for each file.
import glob
import os
import csv

os.chdir('c:/Users/dasa17/Desktop/sample/Upload')

for file in glob.glob("*.txt"):
    chars = lines = 0
    with open(file, 'r') as f:
        for line in f:
            lines += 1
            chars += len(line)
    a = file
    b = lines
    c = chars
    print(a, b, c)

d = open('c:/Users/dasa17/Desktop/sample/Output/LineCount.csv', 'w')
writer = csv.writer(d, lineterminator='\n')
for a in os.listdir('c:/Users/dasa17/Desktop/sample/Upload'):
    writer.writerow((a, b, c))
d.close()
Please check your indentation.
You are looping through each file using for file in glob.glob("*.txt"):
This stores the last file's results in a, b, and c, but never writes them anywhere.
You then loop through each item using for a in os.listdir('c:/Users/dasa17/Desktop/sample/Upload'):, so a takes each filename in turn, while b and c still hold the last values from the initial loop, and those are what get written on every row.
I've not run this, but reordering as follows may solve the issue:
import glob
import os
import csv

os.chdir('c:/Users/dasa17/Desktop/sample/Upload')

d = open('c:/Users/dasa17/Desktop/sample/Output/LineCount.csv', 'w')
writer = csv.writer(d, lineterminator='\n')

for file in glob.glob("*.txt"):
    chars = lines = 0
    with open(file, 'r') as f:
        for line in f:
            lines += 1
            chars += len(line)
    a = file
    b = lines
    c = chars
    print(a, b, c)
    writer.writerow((a, b, c))

d.close()
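As an optional tidy-up (my sketch, not part of the fix itself), the CSV handle can be managed by a with statement too, so the file is closed even if an exception occurs:

import csv
import glob
import os

os.chdir('c:/Users/dasa17/Desktop/sample/Upload')

# The with-block guarantees the CSV file is closed on exit.
with open('c:/Users/dasa17/Desktop/sample/Output/LineCount.csv', 'w') as d:
    writer = csv.writer(d, lineterminator='\n')
    for file in glob.glob("*.txt"):
        chars = lines = 0
        with open(file, 'r') as f:
            for line in f:
                lines += 1
                chars += len(line)
        writer.writerow((file, lines, chars))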
I am trying to write a program that counts the most frequently used word from one file, where that word must not appear in a second file. So basically I read data from test.txt and count its most frequently used word, but that word should not be found in test2.txt.
Below are sample data files, test.txt and test2.txt
test.txt:
The Project is for testing. doing some testing to find what's going on. the the the.
test2.txt:
a
about
above
across
after
afterwards
again
against
the
Below is my script, which parses files test.txt and test2.txt. It finds the most frequently used words from test.txt, excluding words found in test2.txt.
I thought I was doing everything right, but when I execute the script it gives "the" as the most frequent word. The result should actually be "testing", since "the" is found in test2.txt but "testing" is not.
from collections import Counter
import re

dgWords = re.findall(r'\w+', open('test.txt').read().lower())
f = open('test2.txt', 'rb')
sWords = [line.strip() for line in f]
print(len(dgWords))
for sWord in sWords:
    print(sWord)
    print(dgWords)
    while sWord in dgWords:
        dgWords.remove(sWord)
print(len(dgWords))
mostFrequentWord = Counter(dgWords).most_common(1)
print(mostFrequentWord)
Here's how I'd go about it - using sets
import re
from collections import Counter

all_words = re.findall(r'\w+', open('test.txt').read().lower())
f = open('test2.txt', 'r')  # 'r', not 'rb': compare strings with strings
stop_words = [line.strip() for line in f]
set_all = set(all_words)
set_stop = set(stop_words)
all_only = set_all - set_stop
print(Counter(filter(lambda w: w in all_only, all_words)).most_common(1))
This should also be slightly faster, since the Counter only processes the words that are in all_only.
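An equivalent sketch (my variation, reusing the same variables as above): count everything once, then remove the stop words from the Counter before asking for the top entry:

from collections import Counter

counts = Counter(all_words)   # count every word once
for w in set_stop:
    counts.pop(w, None)       # drop stop words; the default avoids KeyError

print(counts.most_common(1))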
I simply changed the following line of your original code
f = open('test2.txt', 'rb')
to
f = open('test2.txt', 'r')
and it worked. Simply read your text as strings instead of bytes; otherwise the bytes from test2.txt never compare equal to the strings produced by the regex. Tested on Python 3.4, Eclipse PyDev, Win7 x64.
OFFTOPIC:
It's more Pythonic to open files using with statements. In this case, write
with open('test2.txt', 'r') as f:
and indent the file-processing statements accordingly. That keeps you from forgetting to close the file stream.
import re
from collections import Counter

with open('test.txt') as testfile, open('test2.txt') as stopfile:
    stopwords = set(line.strip() for line in stopfile)
    words = Counter(re.findall(r'\w+', testfile.read().lower()))

for word in stopwords:
    if word in words:
        words.pop(word)

print("the most frequent word is", words.most_common(1))
The first file looks something like this:
writing
writing
writing
writing
eating
eating
eating
doing
doing
doing
...
The second file looks this way:
writing write wrote written
eating eat ate
doing do does done
...
So basically, I need to add the words from the second file (sequentially, one word at a time) to the lines of the first file and save the result in a third file, which would look like this:
writing writing
writing write
writing wrote
writing written
eating eating
eating eat
eating ate
doing doing
doing do
doing does
doing done
...
I tried this code but it does not do the job:
infile = open("first.txt", 'r') # open file for reading
infile2 = open("second.txt", 'r') # open file for reading
outfile = open("third.txt","w") # open file for writing
line = infile.readline()
line2 = infile2.readline() # Invokes readline() method on file
while line:
outfile.write(line.strip(' ')+line2.strip("\n")+'\n')
line = infile.readline()
line2 = infile2.readline()
infile.close()
outfile.close()
infile2.close()
Why do you even need the first file?
infile2 = open('second.txt', 'r')
outfile = open('third.txt', 'w')

for line in infile2:
    words = line.split()
    outfile.write('\n'.join('%s %s' % (words[0], w) for w in words) + '\n')

outfile.close()
infile2.close()
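Note that this works because, in the sample data, the first word of each line in second.txt is the same base word that first.txt repeats, so the first file carries no extra information: the line writing write wrote written already produces all four writing pairs on its own.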
To put your two files together, I would read both completely, split them in different ways to get the words, and then zip the two lists together.
Load the first file. The first file has one word per line, so read each line, strip the newline, and store the word in a list:
words_first = []
with open('first.txt') as f:
    for line in f:
        words_first.append(line.strip())  # strip the trailing newline
Load the second file. The second file has multiple words per line across multiple lines, so read each line, split it into words, and extend the list:
words_second = []
with open('second.txt') as f:
    for line in f:
        words_second.extend(line.split())  # split() also discards the newline
Store into the new file. Now you have two lists of words, so use zip to pair them up and write them to the file:
with open('third.txt', 'w') as f:
    for first, second in zip(words_first, words_second):
        f.write("{0} {1}\n".format(first, second))
This version utilizes split() with no argument (which splits on all whitespace, newlines as well as spaces), so you can split each complete file and get a list of all its words:
def get_words(file_path):
    with open(file_path) as f:
        return f.read().split()

with open('third.txt', 'w') as f:
    for first, second in zip(get_words("first.txt"), get_words("second.txt")):
        f.write("{0} {1}\n".format(first, second))
I have a file with one word per line, and a set of words. I want to collect the words from the set that are not already in the file into a set called 'out' and append them to the file. Here is part of my code:
def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    fin = open(self.finalFile, "r")
    out = set()
    for line in self.lines_seen:  # lines_seen is a set of words
        if line not in fin:
            out.add(line)
        else:
            print line
    fin.close()
    fout = open(self.finalFile, "a+")
    for line in out:
        fout.write(line)
But it only matches a few of the genuinely equal words. I ran it with the same dictionary of words, and it adds repeated words to the file on each run. What am I doing wrong? What is happening? I tried the '==' and 'is' comparators and got the same result.
Edit 1: I am working with huge files (finalFile) that can't be fully loaded into RAM, so I think I should read the file line by line.
Edit 2: Found a big problem with the file pointer:
def createNextU(self):
    print "adding words to final file"
    if not os.path.exists(self.finalFile):
        open(self.finalFile, 'a').close()
    out = set()
    with open(self.finalFile, "r") as fin:
        for word in self.lines_seen:
            fin.seek(0, 0)  # with this line speed drops to 40 lines/second; without it, it doesn't work
            if word in fin:
                self.totalmatches = self.totalmatches + 1
            else:
                out.add(word)
            self.totalLines = self.totalLines + 1
    fout = open(self.finalFile, "a+")
    for line in out:
        fout.write(line)
If I put the lines_seen loop before opening the file, I open the file once for each line in lines_seen, but that only speeds up to 30k lines/second. With set() I get 200k lines/second at worst, so I think I will load the file in parts and compare them using sets. Any better solution?
Edit 3: Done!
fin is a file handle, so you can't test membership against it directly with if line not in fin. The content needs to be read first.
with open(self.finalFile, "r") as fh:
fin = fh.read().splitlines() # fin is now a list of words from finalFile
for line in self.lines_seen: #lines_seen is a set with words
if line not in fin:
out.add(line)
else:
print line
# remove fin.close()
EDIT:
Since lines_seen is a set, try creating a new set with the words from finalFile and then diffing the sets:

file_set = set()
with open(self.finalFile, "r") as fh:
    for f_line in fh:
        file_set.add(f_line.strip())

# This will give you all the words in finalFile that are not in lines_seen.
print file_set.difference(self.lines_seen)
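If you want the opposite direction, i.e. the words from lines_seen that are missing from the file (which is what out was meant to hold), flip the difference; a one-line sketch using the same variables:

out = self.lines_seen.difference(file_set)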
Your comparison is likely not working because the lines read from the file have a newline at the end, so you are comparing 'word\n' to 'word'. Using rstrip() helps remove the trailing newlines:
>>> foo = 'hello\n'
>>> foo
'hello\n'
>>> foo.rstrip()
'hello'
I would also iterate over the file, rather than over the variable containing the words you would like to check against. If I've understood your code, you would like to write anything that is in self.lines_seen to self.finalFile, if it is not already in it. Using if line not in fin as you have will not work as you expect, because each membership test reads the file forward. For example, if your file contains:
lineone
linetwo
linethree
and the set lines_seen, being unordered, happens to yield 'linethree' before 'linetwo', then the following will match 'linethree' but not 'linetwo', because the file object has already read past it:
with open(self.finalFile,"r" as fin:
for line in self.lines_seen:
if line not in fin:
print line
Instead, consider using a counter:
from collections import Counter

linecount = Counter()

# using 'with' means you don't have to worry about closing the file once the block ends
with open(self.finalFile, "r") as fin:
    for line in fin:
        line = line.rstrip()  # remove the right-most whitespace/newline
        linecount[line] += 1

for word in self.lines_seen:
    if word not in linecount:
        out.add(word)
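Since the counts themselves are never used here, a plain set of the file's words would do the same membership job; a minimal sketch under the same assumptions (self.finalFile and self.lines_seen as above):

seen_in_file = set()
with open(self.finalFile, "r") as fin:
    for line in fin:
        seen_in_file.add(line.rstrip())

# words seen in this run but not yet present in the file
out = self.lines_seen - seen_in_file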
I am trying to have Python read the last three lines of a .txt file and add each line as an element of a list.
So for instance:
**list.txt**
line1
line2
line3
**python_program.py**
(read list.txt, insert items into line_list)
line_list = ['line1', 'line2', 'line3']
However, I am a bit confused about this process.
Any help would be greatly appreciated!
What if you are dealing with a very big file? Reading all the lines into memory would be quite wasteful. An alternative approach:
from collections import deque

d = deque([], maxlen=3)
with open("file.txt") as f:
    for l in f:
        d.append(l)
This keeps only the last three lines read in memory at any given time (once full, the deque discards its oldest element on each append).
As @user2357112 points out, this works as well, and is more succinct:
from collections import deque

with open("file.txt") as f:
    d = deque(f, maxlen=3)
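To get the plain list the question asks for, convert the deque afterwards, e.g. line_list = list(d).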
with open('list.txt') as f:
    lines = f.readlines()
line_list = lines[-3:]
Try these:
#!/usr/local/cpython-3.3/bin/python

import pprint

def get_last_3_variant_1(file_):
    # This is simple, but it reads the entire file into memory
    lines = file_.readlines()
    return lines[-3:]

def get_last_3_variant_2(file_):
    # This is more complex, but it keeps only three lines in memory at any given time
    three_lines = []
    for index, line in zip(range(3), file_):
        three_lines.append(line)
    for line in file_:
        three_lines.append(line)
        del three_lines[0]
    return three_lines

get_last_3 = get_last_3_variant_2

def main():
    # /etc/services is a long file
    # /etc/adjtime is exactly 3 lines long on my system
    # /etc/libao.conf is exactly 2 lines long on my system
    for filename in ['/etc/services', '/etc/adjtime', '/etc/libao.conf']:
        with open(filename, 'r') as file_:
            result = get_last_3(file_)
            pprint.pprint(result)

main()