getting context for a word - python

I am dealing with an extremely large text file (around 3.77 GB), and am trying to extract all the sentences a specific word occurs in and write them out to a text file.
So the large text file is just many lines of text:
line 1 text ....
line 2 text ....
I have also extracted the unique word list from the text file, and want to extract all the sentences each word occurs in and write out the context associated with the word. Ideally, the output file will take the format of
word1 \t sentence 1\n sentence 2\n sentence N\n
word2 \t sentence 1\n sentence 2\n sentence M\n
The current code I have is something like this:
fout = open('word_context_3000_4000(4).txt', 'a')
for x in unique_word[3000:4000]:
    fout.write('\n' + x + '\t')
    fin = open('corpus2.txt')
    for line in fin:
        if x in line.strip().split():
            fout.write(line)
        else:
            pass
fout.close()
Since the unique word list is big, I process it chunk by chunk. But, somehow, the code fails to get the context for all the words, and only returns the context for the first few hundred words in the unique word list.
Has anyone worked on a similar problem before? I am using Python, btw.
Thanks a lot.

First problem: you never close fin, so every word leaks an open file handle; after a few hundred iterations you hit the OS limit on open files, which is why only the first few hundred words get their context.
Maybe you should try something like this:
fout = open('word_context_3000_4000(4).txt', 'a')
fin = open('corpus2.txt')
for x in unique_word[3000:4000]:
    fout.write('\n' + x + '\t')
    fin.seek(0)  # go back to the beginning of the file
    for line in fin:
        if x in line.strip().split():
            fout.write(line)
fout.close()
fin.close()
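Note that even with that fix, this rescans the whole 3.77 GB corpus once per word. If the matching lines for one chunk of words fit in memory, a single pass that maps each word to its lines should be far faster; a rough sketch, reusing the question's variable and file names (the exact output formatting is an assumption):
from collections import defaultdict

wanted = set(unique_word[3000:4000])
context = defaultdict(list)

# One pass over the corpus: the file is read line by line, so memory use
# is bounded by the matching lines, not by the 3.77 GB file size.
with open('corpus2.txt') as fin:
    for line in fin:
        for word in set(line.strip().split()) & wanted:
            context[word].append(line.strip())

with open('word_context_3000_4000(4).txt', 'a') as fout:
    for x in unique_word[3000:4000]:
        # word, a tab, then its sentences separated by newlines
        fout.write('\n' + x + '\t' + '\n'.join(context[x]))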


How to get rid of the last whitespace when printing with end=" "?

Task:
Create a solution that accepts an input identifying the name of a text file, for example, "WordTextFile1.txt". Each text file contains three rows with one word per row. Using the open() function and write() and read() methods, interact with the input text file to write a new sentence string composed of the three existing words to the end of the file contents on a new line. Output the new file contents.
The solution output should be in the format
cat
chases
dog
cat chases dog
the "WordTextFile1.txt" has only 3 words each in a different row
cat
chases
dog
This is what I have, which works, except the last line with the sentence has an extra whitespace, which is breaking my program. What can I do to get rid of the whitespace and fix my code? Help!
file = input()
with open(file, "r+") as f:
    list_words = f.readlines()
    for word in list_words:
        print(word.strip())
    for word in list_words:
        print(word.strip(), end=" ")
This is the current output:
student
reads
book
student reads book(extra whitespace)
You are properly removing the trailing whitespace with word.strip(), but end=" " just adds the trailing whitespace back. Change it to:
file = input()
with open(file, "r+") as f:
    list_words = f.readlines()

# I don't see any reason for having this for loop
# for word in list_words:
#     print(word.strip())
print(' '.join(word.strip() for word in list_words))  # this should work
Edit: Removed the list as it was not required. Thanks to @PranavHosangadi
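Note the task also asks you to write the joined sentence to the end of the file on a new line and then output the new contents; a minimal sketch of that, assuming the input file does not end with a trailing newline:
file = input()
with open(file, "r+") as f:
    list_words = [word.strip() for word in f.readlines()]
    sentence = " ".join(list_words)
    # after readlines() the position is at the end of the file, so this
    # appends; assumes the file has no trailing newline
    f.write("\n" + sentence)

# Output the new file contents.
for word in list_words:
    print(word)
print(sentence)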

Split a .txt at each period instead of by line?

I am attempting to split a .txt file by sentence into a list, but my coding efforts can only split by line.
Example of .txt contents:
This is line 1 of txt file,
it is now on line 2. Here is the
second sentence between line 2 and 3.
Code
listed = []
with open("example.txt", "r") as text:
    Line = text.readline()
    while Line != "":
        Line1 = Line.split(".")
        for sentence in Line1:
            listed.append(sentence)
        Line = text.readline()
print(listed)
This would print something like: ['This is line 1 of txt file,\n', 'it is now on line 2', ' Here is the\n', 'second sentence between line 2 and 3\n']
If the entire document were on one line, this would work correctly, except for cases like "Mr." and "Mrs." and such. However, that's a future worry. Does anyone out there know how to use split in the above scenario?
Assuming every sentence ends with a dot (.), you may just:
read the whole file: fic.read()
remove the newline chars: replace('\n', '')
split on the dot
apply strip to each sentence to remove leading and trailing spaces
keep the sentences
with open("data.txt", "r") as fic:
content = fic.read().replace('\n', '')
sentences = list(map(str.strip, content.split(".")))
A more detailed version:
with open("data.txt", "r") as fic:
content = fic.read()
content = content.replace('\n', '')
sentences = content.split(".")
sentences = list(map(str.strip, sentences))
# same as
sentences = [s.strip() for s in sentences]
split on a string will split on whatever you ask it to, without regard to line breaks; just do read to pull the whole file instead of readlines. The issue becomes whether that's too much text to handle in a single read; if so, you'll need to be more clever. You'll probably want to filter out the actual line breaks to get the effect of one string per sentence.
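For the "Mr."/"Mrs." cases the question defers, one rough sketch is a regex split that skips periods preceded by those two abbreviations (a heuristic only; real sentence segmentation needs more than this):
import re

with open("example.txt", "r") as text:
    content = text.read().replace("\n", " ")

# Split on periods not preceded by "Mr" or "Mrs"; re lookbehinds must be
# fixed-width, hence two separate assertions.
sentences = [s.strip()
             for s in re.split(r"(?<!Mr)(?<!Mrs)\.", content)
             if s.strip()]
print(sentences)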

How to compare contents of two large text files in Python?

Datasets: two large text files for train and test, in which all words are tokenized. A part of the data looks like the following: " the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place . "
Question: How can I replace every word in the test data not seen in training with the word "unk" in Python?
So far, I have made the dictionary with the following code to count the frequency of each word in the file:
# open text file and assign it to variable with the name "readfile"
readfile = open('C:/Users/amtol/Desktop/NLP/Homework_1/brown-train.txt', 'r')
writefile = open('C:/Users/amtol/Desktop/NLP/Homework_1/brown-trainReplaced.txt', 'w')
# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in readfile:
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1
# replace all words occurring in the training data once with the token <unk>
for key in list(d.keys()):
    line = d[key]
    if (line == 1):
        line = "<unk>"
        writefile.write(str(d))
    else:
        writefile.write(str(d))
# close the file that we have created and wrote the new data into
writefile.close()
Honestly, the above code doesn't work with writefile.write(str(d)), which I want to use to write the result to the new text file; but with print(key, ":", line) it works and shows the frequency of each word, though only in the console, which doesn't create the new file. If you also know the reason for this, please let me know.
First off, your task is to replace the words in the test file that are not seen in the train file. Your code never mentions the test file. You have to
Read the train file, gather what words are there. This is mostly okay; but you need to .strip() your line or the last word in each line will end with a newline. Also, it would make more sense to use set instead of dict if you don't need to know the count (and you don't, you just want to know if it's there or not). Sets are cool because you don't have to care if an element is in already or not; you just toss it in. If you absolutely need to know the count, using collections.Counter is easier than doing it yourself.
Read the test file, and write to the replacement file, as you are replacing the words in each line. Something like:
with open("test", "rt") as reader:
with open("replacement", "wt") as writer:
for line in reader:
writer.write(replaced_line(line.strip()) + "\n")
Make sense, which your last block does not :P Instead of seeing whether a word from test file is seen or not, and replacing the unseen ones, you are iterating on the words you have seen in the train file, and writing <unk> if you've seen them exactly once. This does something, but not anything close to what it should.
Instead, split the line you got from the test file and iterate on its words; if the word is in the seen set (word in seen, literally) then replace its contents; and finally add it to the output sentence. You can do it in a loop, but here's a comprehension that does it:
new_line = ' '.join(word if word in seen else '<unk>'
                    for word in line.split(' '))
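Putting those pieces together, a minimal sketch (the test file name brown-test.txt is a placeholder, since the question never gives it):
# Pass 1: collect the training vocabulary into a set.
seen = set()
with open('brown-train.txt', 'r') as train:
    for line in train:
        seen.update(line.strip().split(" "))

# Pass 2: rewrite the test file, replacing unseen words with <unk>.
with open('brown-test.txt', 'r') as reader, \
     open('brown-testReplaced.txt', 'w') as writer:
    for line in reader:
        new_line = ' '.join(word if word in seen else '<unk>'
                            for word in line.strip().split(' '))
        writer.write(new_line + '\n')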

Print output to text file (.txt) using Python

I want to print my output to a text file, but the results are different from when I print in the terminal. My code:
...
words = ["makan", "Rina"]
sentences = text.split(".")
for itemIndex in range(len(sentences)):
    for word in words:
        if word in sentences[itemIndex]:
            print('"' + sentences[itemIndex] + '."')
            break
The output looks like this:
"Semalam saya makan nasi padang."
" Saya makan bersama Rina."
" Rina pesan ayam goreng."
If I instead print to a text file:
words = ["makan","Rina"]
sentences = text.split(".")
for itemIndex in range(len(sentences)):
for word in words:
if word in sentences[itemIndex]:
with open("corpus.txt",'w+') as f:
f.write(sentences[itemIndex])
f.close()
The output is just:
Rina pesan ayam goreng
Why? How can I print output to a text file the same way I print it in the terminal?
You are reopening the file on each iteration of the loop, so when you write to it you overwrite what is already there. As it stands, you end up with only the last line in the file. You need to open the file outside of all the loops, and open it in append mode, denoted by 'a'. Remember to close the file using f.close() when you are done with it.
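A minimal sketch of that approach (text is assumed to hold the input string, as in the question):
words = ["makan", "Rina"]
sentences = text.split(".")

f = open("corpus.txt", "a")  # "a" appends instead of overwriting
for itemIndex in range(len(sentences)):
    for word in words:
        if word in sentences[itemIndex]:
            f.write(sentences[itemIndex] + "\n")
            break  # avoid writing the same sentence twice
f.close()  # close once, after all the loops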
You have to reorder the lines of your code, by moving opening/closing the file outside of the loop:
with open("corpus.txt",'w+') as f:
words = ["makan","Rina"]
sentences = text.split(".")
for itemIndex in range(len(sentences)):
for word in words:
if word in sentences[itemIndex]:
f.write(sentences[itemIndex])
Also, print adds a newline character after the output; if you want your sentences to be written on different lines in the file, you may want to add f.write('\n') after every sentence.
Because you are using with open inside of the loop, and you're using 'w+' mode, your program overwrites the file each time, so you will only end up with the last line written to the file. Try it with 'a' instead, or move the with open outside of the loop.
You don't need to call close on a file handle that you have opened using the with syntax. The closing of the file is handled for you.
I would open the file just once, before the for loops (the for loops should be within the with statement), instead of opening it multiple times. You are overwriting the file each time you open it to write a new line.
Your code should be:
words = ["makan","Rina"]
sentences = text.split(".")
with open("corpus.txt",'w+') as f:
for itemIndex in range(len(sentences)):
for word in words:
if word in sentences[itemIndex]:
f.write(sentences[itemIndex] + '\n')

Read in file and print only the last words of each line if these words are 8 characters or longer and don't contain @, # or : symbols

So far I can write the code to filter out words that are less than 8 characters long, and also the words that contain the @, # or : symbols. However, I can't figure out how to get just the last words. My code looks like this so far.
f = open("file.txt").read()
for words in f.split():
if len(words) >= 8 and not "#" in words and not "#" in words and not ":" in words:
print(words)
Edit - sorry, I'm pretty new to this, so I've probably done something wrong above as well. The file is quite long, so I'll give the first line and the expected output. The first line is:
"I wish they would show out takes of Dick Cheney #GOPdebates Candidates went after #HillaryClinton 32 times in the #GOPdebate-but remained"
The expected output is "remained"; however, my code outputs both "Candidates" and "remained".
for line in open(filename):
    if some_test(line):
        do_rad_thing(line)
I think this is what you want... you have the some_test part and the do_rad_thing part.
I think this works: you can open the file with readlines and pass the delimiter in split(), then get the last one using [-1].
f = open("file.txt").realines()
for line in f:
last_word = line.split()[-1]
This should accomplish what you are trying to do.
Split the words of each line into an array using .split() and then access the last value using [-1]. I also put all the illegal characters into an array and did a check to see whether any of the chars in the illegal_chars array are in last_word.
illegal_chars = ["@", "#", ":"]
with open("file.txt") as f:
    for line in f:
        words = line.split()
        if not words:  # skip blank lines
            continue
        last_word = words[-1]
        if len(last_word) >= 8 and not any(c in last_word for c in illegal_chars):
            print(last_word)
