How can I get two txt files by finding common occurrences? - python

I need to find out which English words were used in an Italian chat and count how many times each one was used.
However, the output also includes words that never appear in the example chat (for example 'baby-blue-eyes': 0).
english_words = {}
with open("dizionarioen.txt") as f:
    for line in f:
        for word in line.strip().split():
            english_words[word] = 0
with open("_chat.txt") as f:
    for line in f:
        for word in line.strip().split():
            if word in english_words:
                english_words[word] += 1
print(english_words)

You can simply iterate over your result and remove all elements that have value 0:
english_words = {}
with open("dizionarioen.txt") as f:
    for line in f:
        for word in line.strip().split():
            english_words[word] = 0
with open("_chat.txt") as f:
    for line in f:
        for word in line.strip().split():
            if word in english_words:
                english_words[word] += 1
# keep only the words that were actually used (non-zero count)
result = {key: value for key, value in english_words.items() if value}
print(result)
Here is another solution that counts the words with collections.Counter:
from collections import Counter
with open("dizionarioen.txt") as f:
    all_words = set(word for line in f for word in line.split())
with open("_chat.txt") as f:
    # only chat words that are in the dictionary are counted, so no zero entries appear
    result = Counter(word for line in f for word in line.split() if word in all_words)
print(result)

If you want to remove the words with no occurrences after counting, just delete those entries:
for w in list(english_words.keys()):
    if english_words[w] == 0:
        del english_words[w]
Then, your dictionary only contains words that occurred. Was that the question?
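One caveat with all of the snippets above: line.split() keeps punctuation attached and the comparison is case-sensitive, so "Hello," in the chat would not match "hello" in the dictionary. A minimal sketch of a normalized count, assuming that lowercasing and stripping surrounding punctuation is acceptable for your data (the helper name count_known_words and the sample lines are made up for illustration):

```python
from collections import Counter
import string

def count_known_words(dictionary_lines, chat_lines):
    """Count chat tokens found in the dictionary, ignoring case and edge punctuation."""
    known = {word.lower() for line in dictionary_lines for word in line.split()}
    counts = Counter()
    for line in chat_lines:
        for token in line.split():
            word = token.strip(string.punctuation).lower()
            if word in known:
                counts[word] += 1
    return counts

# Hypothetical sample data standing in for dizionarioen.txt and _chat.txt
dictionary_lines = ["hello", "weekend", "baby-blue-eyes"]
chat_lines = ["Ciao, hello! Buon weekend.", "Hello di nuovo"]
result = count_known_words(dictionary_lines, chat_lines)
print(result)
```

Open file objects iterate line by line, so with real files you could pass the two handles from open(...) directly instead of the sample lists.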

Related

How to loop through two list and append key,val pair?

I'm trying to loop through a text file and create a dict where dict[line_index] holds the words of that line, i.e. the key is the line number and the value is all the words in that line. The goal is to build a "matrix" so that the user can later specify x, y coordinates (line, word_index_position) and retrieve the word at those coordinates, if there is one (I'm not sure how this will work with a dict, since it's not ordered). Below is the loop that creates the dict.
try:
    f = open("file.txt", "r")
except Exception as e:
    print("Enter a valid file name")
uppslag = dict()
num_lines = 0
for line in f.readlines():
    num_lines += 1
    print(line)
    for word in line.split():
        print(num_lines)
        print(word)
        uppslag[num_lines] = word
f.close()
uppslag
The loop runs as expected, but uppslag[num_lines] = word seems to store only the last word of each line. Any guidance would be highly appreciated.
Many thanks,
Instead of overwriting the word:
for word in line.split():
    print(num_lines)
    print(word)
    uppslag[num_lines] = word
you may be better off saving the whole line:
uppslag[num_lines] = line.split()
This way you'll be able to find the 3rd word in the 4th line as follows (list indices start at 0, so the 3rd word has index 2):
uppslag[4][2]
uppslag[num_lines] = word is overwriting the dictionary entry for key num_lines every time it is called. You can use a list to hold all the words:
for line in f:
    num_lines += 1
    print(line)
    uppslag[num_lines] = [] # initialize dictionary entry with empty list
    for word in line.split():
        print(num_lines, word)
        uppslag[num_lines].append(word) # add new word to list
You can write the same code in a more compact form, since line.split() already returns a list:
for line_number, line in enumerate(f, start=1): # start=1 keeps the 1-based line numbers used above
    uppslag[line_number] = line.split()
If there is a word on every line (i.e. the line index will be continuous) you can use a list instead of a dictionary, and reduce your code to a one-line list comprehension:
uppslag = [line.split() for line in f]
There is no need for a dictionary, or .readlines().
with open("file.txt") as words_file:
    words = [line.split() for line in words_file]
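To show how the (line, word) lookup from the question could sit on top of such a list of lists, here is a small sketch. The helper get_word, its 1-based coordinates and the sample lines are assumptions made for illustration, not part of the original code:

```python
def get_word(matrix, line_number, word_index):
    """Return the word at 1-based (line_number, word_index), or None if out of range."""
    if 1 <= line_number <= len(matrix):
        words = matrix[line_number - 1]
        if 1 <= word_index <= len(words):
            return words[word_index - 1]
    return None

# Hypothetical sample lines standing in for the contents of file.txt
sample_lines = ["the quick brown fox", "", "jumps over"]
matrix = [line.split() for line in sample_lines]

print(get_word(matrix, 1, 3))   # brown
print(get_word(matrix, 2, 1))   # None (empty line)
print(get_word(matrix, 3, 5))   # None (line 3 has no fifth word)
```

Because a list keeps its order, empty lines simply become empty lists and the line numbers stay aligned with the file.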

How to compare word frequencies from two text files?

How can I compare word frequencies from two text files in Python? For example, if a word appears in both file1 and file2, it should be listed only once, with the two frequencies kept side by side rather than added together: {'The': 3,5}. Here 3 is the frequency in file1 and 5 is the frequency in file2. If a word exists in only one of the files, the count for the other file should be 0. Please help.
Here is what I have done so far:
import operator
f1=open('file1.txt','r') #file 1
f2=open('file2.txt','r') #file 2
wordlist=[]
wordlist2=[]
for line in f1:
    for word in line.split():
        wordlist.append(word)
for line in f2:
    for word in line.split():
        wordlist2.append(word)
worddictionary = {}
for word in wordlist:
    if word in worddictionary:
        worddictionary[word] += 1
    else:
        worddictionary[word] = 1
worddictionary2 = {}
for word in wordlist2:
    if word in worddictionary2:
        worddictionary2[word] += 1
    else:
        worddictionary2[word] = 1
print(worddictionary)
print(worddictionary2)
Edit: Here's the more general way you would do this for any list of files (explanation in comments):
f1=open('file1.txt','r') #file 1
f2=open('file2.txt','r') #file 2
file_list = [f1, f2] # This would hold all your open files
num_files = len(file_list)
frequencies = {} # We'll just make one dictionary to hold the frequencies
for i, f in enumerate(file_list): # Loop over the files, keeping an index i
    for line in f: # Get the lines of that file
        for word in line.split(): # Get the words of that file
            if word not in frequencies:
                frequencies[word] = [0 for _ in range(num_files)] # make a list of 0's for any word you haven't seen yet -- one 0 for each file
            frequencies[word][i] += 1 # Increment the frequency count for that word and file
print(frequencies)
Keeping with the code you wrote, here's how you could create a combined dictionary:
import operator
f1=open('file1.txt','r') #file 1
f2=open('file2.txt','r') #file 2
wordlist=[]
wordlist2=[]
for line in f1:
    for word in line.split():
        wordlist.append(word)
for line in f2:
    for word in line.split():
        wordlist2.append(word)
worddictionary = {}
for word in wordlist:
    if word in worddictionary:
        worddictionary[word] += 1
    else:
        worddictionary[word] = 1
worddictionary2 = {}
for word in wordlist2:
    if word in worddictionary2:
        worddictionary2[word] += 1
    else:
        worddictionary2[word] = 1
# Create a combined dictionary
combined_dictionary = {}
all_word_set = set(worddictionary.keys()) | set(worddictionary2.keys())
for word in all_word_set:
    combined_dictionary[word] = [0,0]
    if word in worddictionary:
        combined_dictionary[word][0] = worddictionary[word]
    if word in worddictionary2:
        combined_dictionary[word][1] = worddictionary2[word]
print(worddictionary)
print(worddictionary2)
print(combined_dictionary)
Edit: I misunderstood the problem, the code now works for your question.
f1 = open('file1.txt','r') #file 1
f2 = open('file2.txt','r') #file 2
wordList = {}
for line in f1.readlines(): #for each line in lines (file.readlines() returns a list)
    for word in line.split(): #for each word in each line
        if(not word in wordList): #if the word is not already in our dictionary
            wordList[word] = 0 #Add the word to the dictionary
for line in f2.readlines(): #for each line in lines (file.readlines() returns a list)
    for word in line.split(): #for each word in each line
        if(word in wordList): #if the word is already in our dictionary
            wordList[word] = wordList[word]+1 #add one to its value
f1.close() #close files
f2.close()
f1 = open('file1.txt','r') #Have to re-open because we are at the end of the file.
#there might be an easier way of doing this
for line in f1.readlines(): #Removing keys whose values are 0
    for word in line.split(): #for each word in each line
        try:
            if(wordList[word] == 0): #if its value is 0
                del wordList[word] #remove it from the dictionary
            else:
                wordList[word] = wordList[word]+1 #if its value is not 0, add one to it for each occurrence in file1
        except KeyError:
            pass #the word was already removed from wordList
f1.close()
print(wordList)
First, every word from the first file is added with a count of 0; then each occurrence of that word in the second file adds one to its value.
After that, each word is checked and removed if its value is still 0.
This can't be done while iterating over the dictionary itself, because deleting entries changes its size during iteration.
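The size-change problem mentioned above can be reproduced in a few lines. The sketch below uses made-up counts, shows the RuntimeError that CPython raises when entries are deleted during iteration, and then prunes safely by looping over a snapshot of the keys (prune_in_place is a hypothetical helper name):

```python
def prune_in_place(counts):
    """Delete zero-count entries while iterating over a snapshot of the keys."""
    for word in list(counts):      # list(...) copies the keys before any deletion
        if counts[word] == 0:
            del counts[word]

# Made-up counts for illustration
counts = {"the": 3, "spider": 0, "spout": 2, "rain": 0}

failed = False
try:
    for word in counts:            # iterating over the dict itself
        if counts[word] == 0:
            del counts[word]       # changes the dict size mid-iteration
except RuntimeError as error:
    failed = True
    print("RuntimeError:", error)

counts = {"the": 3, "spider": 0, "spout": 2, "rain": 0}
prune_in_place(counts)
print(counts)
```

A dict comprehension such as {w: n for w, n in counts.items() if n} avoids the issue as well, at the cost of building a new dictionary.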
This is how you would implement it for multiple files (more complex):
fileList = ["file1.txt", "file2.txt"]
openList = []
for i in range(len(fileList)):
    openList.append(open(fileList[i], 'r'))
fileWords = []
for i, file in enumerate(openList): #for each file
    fileWords.append({}) #add a dictionary to our list
    for line in file: #for each line in each file
        for word in line.split(): #for each word in each line
            if(word in fileWords[i]): #if the word is already in our dictionary
                fileWords[i][word] += 1 #add one to it
            else:
                fileWords[i][word] = 1 #add it to our dictionary with value 1
for i in openList:
    i.close()
for i, wL in enumerate(fileWords):
    print(f"File: {fileList[i]}")
    for l in wL.items():
        print(l)
    #print(f"File {i}\n{wL}")
You might find the following demonstration program to be a good starting point for getting the word frequencies of your files:
#! /usr/bin/env python3
import collections
import pathlib
import pprint
import re
import sys
def main():
    # sys.argv[0] is the path of this script, so the demo counts the words in its own source
    freq = get_freq(sys.argv[0])
    pprint.pprint(freq)
def get_freq(path):
    if isinstance(path, str):
        path = pathlib.Path(path)
    return collections.Counter(
        match.group() for match in re.finditer(r'\b\w+\b', path.open().read())
    )
if __name__ == '__main__':
    main()
In particular, you will want to use the get_freq function to get a Counter object that tells you what the word frequencies are. Your program can call the get_freq function multiple times with different file names, and you should find the Counter objects to be very similar to the dictionaries you were previously using.
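As a rough illustration of how two such Counter objects could be merged into the side-by-side form the question asks for, here is a sketch that works on strings rather than files so it can be run as-is. The function compare_frequencies and the sample texts are assumptions for illustration:

```python
from collections import Counter

def compare_frequencies(text1, text2):
    """Map each word to [count in text1, count in text2], with 0 for an absent word."""
    freq1 = Counter(text1.split())
    freq2 = Counter(text2.split())
    # A Counter returns 0 for a missing key, so no membership checks are needed.
    return {word: [freq1[word], freq2[word]] for word in freq1.keys() | freq2.keys()}

# Hypothetical sample texts standing in for file1.txt and file2.txt
combined = compare_frequencies("The cat saw The dog", "The dog and The other The")
print(combined["The"])   # [2, 3]
print(combined["cat"])   # [1, 0]
print(combined["and"])   # [0, 1]
```

The union of the two key views gives every word exactly once, which is what keeps a shared word from being listed twice.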

Check if they are same

I want to read from a text file and print the first three words that share the same first three letters. I can get the first three letters of each word, but I cannot check whether they are the same.
Here is my code:
def main():
    f = open("words.txt", "r+")
    # The loop that prints the initial letters
    for word in f.read().split():
        # the part that takes the first 3 letters of the word
        initials = [j[:3] for j in word.split()]
        print(initials)
words.txt
when, where, loop, stack, wheel, wheeler
output
You can use a mapping from the first 3 letters to the list of words. collections.defaultdict could save you a few keystrokes here:
from collections import defaultdict
def get_words():
    d = defaultdict(list)
    with open('words.txt') as f:
        for line in f:
            for word in line.split(', '):
                prefix = word[:3]
                d[prefix].append(word)
                if len(d[prefix]) == 3:
                    return d[prefix]
    return []
print(get_words()) # ['when', 'where', 'wheel']
This code snippet groups the words by their first 3 letters:
def main():
    # a dict where the first 3 letters are the keys and the
    # values are lists of words
    my_dict = {}
    with open("words.txt", "r") as f:
        for line in f:
            for word in line.strip().split():
                s = word[:3]
                if s not in my_dict:
                    # add 3 letters as the key
                    my_dict[s] = []
                my_dict[s].append(word)
                if len(my_dict[s]) == 3:
                    print(my_dict[s])
                    return
    # this will only print if there are no 3 words with the same start letters
    print(my_dict)
This stops the processing (I used a return statement) if you get to 3 words with the same 3 letters.
You can use a dictionary here, with the first 3 characters as the key. Example:
d={}
f = open("words.txt", "r+")
key_with_three_element=''
for word in f.read().replace(',', ' ').split(): # treat the commas in words.txt as separators
    if word[:3] in d:
        d[word[:3]].append(word)
    else:
        d[word[:3]]=[word]
    if(len(d[word[:3]])==3):
        key_with_three_element=word[:3]
        break
print(d[key_with_three_element])
Output:
['when', 'where', 'wheel']
def main():
    record = {} # maps each 3-letter prefix to the words seen so far
    f = open("words.txt", "r+")
    for word in f.read().split():
        record[word[:3]] = record.get(word[:3], [])+[word]
        if len(record[word[:3]]) == 3:
            print (record[word[:3]])
            break
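The answers above differ mainly in how they split the comma-separated words. As one possible variation, the sketch below tokenizes with a regular expression so that commas and spaces are both ignored, and it works on a string so it can be run without words.txt. The function name first_prefix_group and its parameters are made up for illustration:

```python
import re
from collections import defaultdict

def first_prefix_group(text, prefix_len=3, group_size=3):
    """Return the first group_size words sharing a prefix of prefix_len letters, or []."""
    groups = defaultdict(list)
    for word in re.findall(r"\w+", text):   # \w+ skips commas and whitespace
        bucket = groups[word[:prefix_len]]
        bucket.append(word)
        if len(bucket) == group_size:
            return bucket
    return []

sample = "when, where, loop, stack, wheel, wheeler"
print(first_prefix_group(sample))   # ['when', 'where', 'wheel']
```

With a real file, the text argument could be the result of reading words.txt in full.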

Sorting and counting words from a text file

I'm new to programming and stuck on my current program. I have to read in a story from a file, sort the words, and count the number of occurrences of each word. The program counts the words, but it doesn't sort them, remove the punctuation, or remove duplicate words. I can't work out why it's not working. Any advice would be helpful.
ifile = open("Story.txt",'r')
fileout = open("WordsKAI.txt",'w')
lines = ifile.readlines()
wordlist = []
countlist = []
for line in lines:
    wordlist.append(line)
    line = line.split()
    # line.lower()
    for word in line:
        word = word.strip(". , ! ? : ")
        # word = list(word)
        wordlist.sort()
        sorted(wordlist)
        countlist.append(word)
        print(word,countlist.count(word))
The main problem in your code is this line:
wordlist.append(line)
You are appending the whole line to wordlist, which I doubt is what you want. Because of this, what gets added has not been .strip()ed before it goes into wordlist.
What you have to do is add each word only after you have strip()ed it, and only after checking that it is not already in the list (no duplicates):
ifile = open("Story.txt",'r')
lines = ifile.readlines()
wordlist = []
countlist = []
for line in lines:
    # Get all the words in the current line
    words = line.split()
    for word in words:
        # Perform whatever manipulation to the word here
        # Remove any punctuation from the word
        word = word.strip(".,!?:;'\"")
        # Make the word lowercase
        word = word.lower()
        # Add the word into wordlist only if it is not in wordlist
        if word not in wordlist:
            wordlist.append(word)
        # Add the word to countlist so that it can be counted later
        countlist.append(word)
# Sort the wordlist
wordlist.sort()
# Print the wordlist
for word in wordlist:
    print(word, countlist.count(word))
Another way you could do this is with a dictionary, storing the word as the key and the number of occurrences as the value:
ifile = open("Story.txt", "r")
lines = ifile.readlines()
word_dict = {}
for line in lines:
    # Get all the words in the current line
    words = line.split()
    for word in words:
        # Perform whatever manipulation to the word here
        # Remove any punctuation from the word
        word = word.strip(".,!?:;'\"")
        # Make the word lowercase
        word = word.lower()
        # Add the word to word_dict
        word_dict[word] = word_dict.get(word, 0) + 1
# Create a wordlist to display the words sorted
word_list = list(word_dict.keys())
word_list.sort()
for word in word_list:
    print(word, word_dict[word])
You have to provide a key function to the sorting methods.
Try this
r = sorted(wordlist, key=str.lower)
punctuation = ".,!?: "
counts = {}
with open("Story.txt",'r') as infile:
    for line in infile:
        for word in line.split():
            for p in punctuation:
                word = word.strip(p)
            if word not in counts:
                counts[word] = 0
            counts[word] += 1
with open("WordsKAI.txt",'w') as outfile:
    for word in sorted(counts): # if you want to sort by counts instead, use sorted(counts, key=counts.get)
        outfile.write("{}: {}\n".format(word, counts[word]))
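To make the two sort orders mentioned in the comment above concrete, here is a small sketch on a hand-written counts dictionary (the numbers are made up). It also shows one way to break ties alphabetically when sorting by count, which sorted(counts, key=counts.get) alone does not do:

```python
# Made-up counts standing in for the dictionary built from Story.txt
counts = {"the": 3, "cat": 2, "sat": 2, "dog": 1}

# Alphabetical order: iterating a dict yields its keys
by_word = sorted(counts)

# Most frequent first; words with equal counts fall back to alphabetical order
by_count = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(by_word)    # ['cat', 'dog', 'sat', 'the']
print(by_count)   # [('the', 3), ('cat', 2), ('sat', 2), ('dog', 1)]
```

Negating the count in the key sorts counts in descending order while leaving the words ascending.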

I have a txt file. How can I take dictionary key values and print the line of text they appear in?

I have a txt file. I have written code that finds the unique words and the number of times each word appears in that file. I now need to figure out how to print the lines that those words appear in as well. How can I go about doing this?
Here is a sample output:
Analyze what file: itsy_bitsy_spider.txt
Concordance for file itsy_bitsy_spider.txt
itsy : Total Count: 2
Line:1: The ITSY Bitsy spider crawled up the water spout
Line:4: and the ITSY Bitsy spider went up the spout again
#this function will get just the unique words without the stop words.
def openFiles(openFile):
    for i in openFile:
        i = i.strip()
        linelist.append(i)
        b = i.lower()
        thislist = b.split()
        for a in thislist:
            if a in stopwords:
                continue
            else:
                wordlist.append(a)
    #print wordlist
#this dictionary is used to count the number of times each stop
countdict = {}
def countWords(this_list):
    for word in this_list:
        depunct = word.strip(punctuation)
        if depunct in countdict:
            countdict[depunct] += 1
        else:
            countdict[depunct] = 1
from collections import defaultdict
target = 'itsy'
word_summary = defaultdict(list)
with open('itsy.txt', 'r') as f:
    lines = f.readlines()
for idx, line in enumerate(lines):
    words = [w.strip().lower() for w in line.split()]
    for word in words:
        word_summary[word].append(idx)
unique_words = len(word_summary.keys())
target_occurence = len(word_summary[target])
line_nums = sorted(set(word_summary[target])) # sorted so the line numbers print in order
print("There are %s unique words." % unique_words)
print("There are %s occurrences of '%s'" % (target_occurence, target))
print("'%s' is found on lines %s" % (target, ', '.join([str(i+1) for i in line_nums])))
If you parse the input text file line by line, you can maintain another dictionary that maps each word to the list of lines it appears in, i.e. for each word in a line, you add an entry. It might look something like the following. Bear in mind that I'm not very familiar with Python, so there may be syntactic shortcuts I've missed.
For example:
countdict = {}
linedict = {}
for line in text_file:
    for word in line.split(): # split the line into words; iterating over line directly would yield characters
        depunct = word.strip(punctuation)
        if depunct in countdict:
            countdict[depunct] += 1
        else:
            countdict[depunct] = 1
        # add entry for word in the line dict if not there already
        if depunct not in linedict:
            linedict[depunct] = []
        # now add the word -> line entry
        linedict[depunct].append(line)
One modification you will probably need to make is to prevent duplicates being added to the linedict if a word appears twice in the line.
The above code assumes that you only want to read the text file once.
openFile = open("test.txt", "r")
words = {}
for line in openFile.readlines():
    for word in line.strip().lower().split():
        # one entry per word: its count and the set of lines it appears in
        wordDict = words.setdefault(word, { 'count': 0, 'line': set() })
        wordDict['count'] += 1
        wordDict['line'].add(line)
openFile.close()
print(words)
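None of the snippets above prints the result in the layout of the sample output from the question. The following sketch is one way that could be done, assuming 1-based line numbers and case-insensitive matching. The helper names are made up, and the two middle lines of the rhyme are filled in for illustration, since the question only shows lines 1 and 4:

```python
import string
from collections import defaultdict

def build_concordance(lines, stopwords=()):
    """Map each lowercased word to a list of (line_number, line_text) pairs."""
    concordance = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        for token in text.split():
            word = token.strip(string.punctuation).lower()
            if word and word not in stopwords:
                concordance[word].append((number, text))
    return concordance

def format_entry(word, hits):
    """Render one word roughly in the layout of the sample output."""
    out = [f"{word} : Total Count: {len(hits)}"]
    seen = set()
    for number, text in hits:
        if number not in seen:   # show a line once even if the word repeats in it
            seen.add(number)
            out.append(f"Line:{number}: {text}")
    return "\n".join(out)

# Lines 1 and 4 come from the sample output; lines 2 and 3 are assumed
rhyme = [
    "The ITSY Bitsy spider crawled up the water spout",
    "Down came the rain and washed the spider out",
    "Out came the sun and dried up all the rain",
    "and the ITSY Bitsy spider went up the spout again",
]
concordance = build_concordance(rhyme, stopwords={"the", "and"})
print(format_entry("itsy", concordance["itsy"]))
```

Storing the line number together with the text avoids a second pass over the file when printing.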
