I originally posted this question here but was told to post it on Code Review; however, they told me it needed to be posted here instead. I will try to explain my problem better so hopefully there is no confusion. I am trying to write a word-concordance program that will do the following:
1) Read the stop_words.txt file into a dictionary (use the same type of dictionary that you’re timing) containing only stop words, called stopWordDict. (WARNING: Strip the newline(‘\n’) character from the end of the stop word before adding it to stopWordDict)
2) Process the WarAndPeace.txt file one line at a time to build the word-concordance dictionary (called wordConcordanceDict) containing “main” words for the keys with a list of their associated line numbers as their values.
3) Traverse the wordConcordanceDict alphabetically by key to generate a text file containing the concordance words printed out in alphabetical order along with their corresponding line numbers.
I tested my program on a small file with a short list of stop words and it worked correctly (an example is provided below). The outcome was what I expected: a list of the main words with their line numbers, not including words from the stop_words_small.txt file. The only difference between the small file I tested and the main file I am actually trying to process is that the main file is much longer and contains punctuation. The problem I am running into is that when I run my program with the main file, I get way more results than expected, because the punctuation is not being removed from the file.
For example, below is a section of the output where my code counted the word Dmitri as four separate words because of the different capitalization and the punctuation that follows the word. If my code removed the punctuation correctly, the word Dmitri would be counted as one word followed by all the locations where it was found. My output is also separating upper- and lower-case words, so my code is not lowercasing the file either.
What my code currently displays:
Dmitri : [2528, 3674, 3687, 3694, 4641, 41131]
Dmitri! : [16671, 16672]
Dmitri, : [2530, 3676, 3685, 13160, 16247]
dmitri : [2000]
What my code should display:
dmitri : [2000, 2528, 2530, 3674, 3676, 3685, 3687, 3694, 4641, 13160, 16671, 16672, 41131]
Words are defined to be sequences of letters delimited by any non-letter. There should also be no distinction made between upper- and lower-case letters, but my program splits those up as well. Blank lines, however, are to be counted in the line numbering.
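To make that definition concrete, here is a small example of the rule I mean (the pattern [a-z]+ is just my illustration of "sequences of letters delimited by any non-letter"):

import re

line = 'Dmitri, said: "dmitri!"'
# lowercase first, then take maximal runs of letters; every
# non-letter (punctuation, digits, whitespace) acts as a delimiter
words = re.findall(r'[a-z]+', line.lower())
print(words)  # ['dmitri', 'said', 'dmitri']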
Below is my code and I would appreciate it if anyone could take a look at it and give me any feedback on what I am doing wrong. Thank you in advance.
import re

def main():
    stopFile = open("stop_words.txt", "r")
    stopWordDict = dict()
    for line in stopFile:
        stopWordDict[line.lower().strip("\n")] = []

    hwFile = open("WarAndPeace.txt", "r")
    wordConcordanceDict = dict()
    lineNum = 1
    for line in hwFile:
        wordList = re.split(" |\n|\.|\"|\)|\(", line)
        for word in wordList:
            word.strip(' ')
            if (len(word) != 0) and word.lower() not in stopWordDict:
                if word in wordConcordanceDict:
                    wordConcordanceDict[word].append(lineNum)
                else:
                    wordConcordanceDict[word] = [lineNum]
        lineNum = lineNum + 1

    for word in sorted(wordConcordanceDict):
        print(word, " : ", wordConcordanceDict[word])

if __name__ == "__main__":
    main()
Just as another example and reference, here is the small file I tested with the small list of stop words, which worked perfectly.
stop_words_small.txt file
a, about, be, by, can, do, i, in, is, it, of, on, the, this, to, was
small_file.txt
This is a sample data (text) file to
be processed by your word-concordance program.
The real data file is much bigger.
correct output
bigger: 4
concordance: 2
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
word: 2
your: 2
You can do it like this:
import re
from collections import defaultdict

wordConcordanceDict = defaultdict(list)

with open('stop_words_small.txt') as sw:
    words = (line.strip() for line in sw)
    stop_words = set(words)

with open('small_file.txt') as f:
    for line_number, line in enumerate(f, 1):
        words = (re.sub(r'[^\w\s]', '', word).lower() for word in line.split())
        good_words = (word for word in words if word not in stop_words)
        for word in good_words:
            wordConcordanceDict[word].append(line_number)

for word in sorted(wordConcordanceDict):
    print('{}: {}'.format(word, ' '.join(map(str, wordConcordanceDict[word]))))
Output:
bigger: 4
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
wordconcordance: 2
your: 2

I will add explanations tomorrow; it's getting late here ;). Meanwhile, you can ask in the comments if some part of the code isn't clear to you.
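In the meantime, a quick illustration of the line that does the heavy lifting: re.sub(r'[^\w\s]', '', word) deletes every character that is neither a word character nor whitespace, which is what collapses your four Dmitri variants into one key (sample words of my own, for illustration):

import re

for word in ['Dmitri!', 'Dmitri,', 'dmitri', '(Dmitri)']:
    print(re.sub(r'[^\w\s]', '', word).lower())
# all four lines print: dmitri

Note also that in your original code, word.strip(' ') has no effect on its own: strings are immutable, so strip returns a new string that is discarded unless you reassign it (word = word.strip(' ')).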
So, from a text file with this content:
Lemonade juice whiskey beer soda vodka
In Python, reading that same .txt file, I would like to output the word pairs in the following order:
juice-lemonade
whiskey-juice
beer-whiskey
soda-beer
vodka-soda
I managed to output something like that by using a list instead of opening the file in Python, but with some major .txt file, that is not really a handy solution.
Also, the bonus task would be to output the probability for each of those pairs. Any kind of hint would be highly appreciated.
To read large files efficiently, you should read them line-by-line, or (if you have really long lines, which is what the snippet below assumes) token-by-token.
A clean way to do this while keeping an open handle on a file is by using generators that yield a word at a time.
You can have another generator that combines 2 words at a time and yields pairs.
from typing import Iterator

def memory_efficient_word_generator(text_file: str) -> Iterator[str]:
    word = ''
    with open(text_file) as text:
        while True:
            character = text.read(1)
            if not character:
                if word:  # flush the final word if the file doesn't end in whitespace
                    yield word.lower()
                return
            if character.isspace():
                yield word.lower()
                word = ''
            else:
                word += character

def pair_generator(text_file: str) -> Iterator[str]:
    previous_word = ''
    for word in memory_efficient_word_generator(text_file):
        if previous_word and word:
            yield f'{previous_word}-{word}'
        previous_word = word or previous_word

for pair in pair_generator('filename.txt'):
    print(pair)
Assuming filename.txt contains:
Lemonade juice whiskey beer soda vodka
cola tequila lemonade juice
You should see something like:
lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice
Of course, there's a lot more you should handle depending on your desired behaviour (for example, handling non-alphabetic characters in your input).
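For instance, here is a variation of the word generator that keeps only alphabetic characters and treats everything else as a delimiter (a minimal sketch of my own, not the only way to do it):

from typing import Iterator

def alphabetic_word_generator(text_file: str) -> Iterator[str]:
    # same idea as memory_efficient_word_generator, but any
    # non-alphabetic character ends the current word
    word = ''
    with open(text_file) as text:
        while True:
            character = text.read(1)
            if not character:
                if word:  # flush the final word at end of file
                    yield word.lower()
                return
            if character.isalpha():
                word += character
            elif word:
                yield word.lower()
                word = ''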
Thank you very much for the feedback.
That's pretty much it, I just added encoding = 'utf-8' here:
with open(text_file, encoding='utf-8') as text:
since it raised a 'charmap' codec error for me.
And just one more thing: I also wanted to output the number of elements (words) in the text file by using:
file = open("filename.txt", "rt", encoding="utf8")
data = file.read()
words = data.split()
print('Number of words :', len(words))
which I did. Now I'm trying to do the same with those word-pairs that you sent; basically, each of those pairs would be one element, for example:
lemonade-juice ---> one element
So if we would to count all of these from a text file:
lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice
we would get an output of 9 elements, or
Number of word-pairs: 9
I was thinking now of trying to do that with the len function and calling text_file.
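Something like this is what I have in mind (not sure it's right: counting the pairs as the generator produces them, and, for the bonus task, dividing each pair's count by the total):

from collections import Counter

pair_counts = Counter(pair_generator('filename.txt'))
total_pairs = sum(pair_counts.values())  # counts duplicates, so 9 for the sample above

print('Number of word-pairs:', total_pairs)
for pair, count in pair_counts.items():
    print(pair, count / total_pairs)  # probability of each pair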
Correct me if I'm looking in the wrong direction.
Once again, thank you for your time.
I have a piece of code that is supposed to tell me how many times a word occurs in a CSV file. Note: the file is pretty large (two years of text messages).
This is my code:
key_word1 = 'Exmple_word1'
key_word2 = 'Example_word2'
counter = 0

with open('PATH_TO_FILE.csv', encoding='UTF-8') as a:
    for line in a:
        if (key_word1 or key_word2) in line:
            counter = counter + 1

print(counter)
There are two words because I did not know how to make the search case-insensitive.
To test it, I used the Find function in Word on the whole file (using only one of the words, since I was able to do a case-insensitive search there) and I got more than double what my code calculated.
At first I used the value_counts() function, BUT I received different values for the same word (searching for Exmple_word1 gave 32 times, then 56 times, then 2 times, and so on). I was stuck there for a while, but it got me thinking: I use two keyboards on my phone which I change regularly - could it be that the same words are actually different, and that would explain why I am getting these results?
Also, I checked pretty much all the sources on this matter and found various approaches that did not actually do what I want them to do (the value_counts() method, for example).
If that is the case, how can I fix this?
There are some mistakes in your code:
1. key_word1 or key_word2 - or is "lazy" (it short-circuits), meaning that if the left part, key_word1, evaluates to truthy, it won't even look at key_word2. This causes the code to check only whether key_word1 appears in the line.
An example to emphasize:
w1 = 'word1'
w2 = 'word2'
s = 'bla word2'
(w1 or w2) in s
>> False
(w2 or w1) in s
>> True
2. Reading a CSV file: I recommend using the csv package (just import it), something like:

import csv

with open('PATH_TO_FILE.csv') as f:
    for line in csv.reader(f):
        # do your logic here

3. Case sensitivity - don't work too hard; you can simply lowercase each line you read, so you don't have to keep two variants of the word.
I guess the solution you are looking for should look something like:

import csv

word_to_search = 'donald'
counter = 0  # the counter must be initialised before the loop

with open('PATH_TO_FILE.csv', encoding='UTF-8') as f:
    for line in csv.reader(f):
        if any(word_to_search in l for l in map(str.lower, line)):
            counter += 1
Running it on this input:
bla,some other bla,donald rocks
make,who,great
again, donald is here, hura
will result in:
counter=2
Hi, so I have two text files. I have to read the first text file, count the frequency of each word, remove duplicates, and create a list of lists with each word and its count in the file.
My second text file contains keywords. I need to count the frequency of these keywords in the first text file and return the result without using any imports, dict, or zip.
I am stuck on how to go about this second part. I have the file open and have removed punctuation etc., but I have no clue how to find the frequency.
I played around with the idea of .find() but no luck as of yet.
Any suggestions would be appreciated. This is my code at the moment; it seems to find the frequency of the keywords in the keyword file, but not in the first text file.
def calculateFrequenciesTest(aString):
    listKeywords = aString
    listSize = len(listKeywords)
    keywordCountList = []
    while listSize > 0:
        targetWord = listKeywords[0]
        count = 0
        for i in range(0, listSize):
            if targetWord == listKeywords[i]:
                count = count + 1
        wordAndCount = []
        wordAndCount.append(targetWord)
        wordAndCount.append(count)
        keywordCountList.append(wordAndCount)
        for i in range(0, count):
            listKeywords.remove(targetWord)
        listSize = len(listKeywords)
    sortedFrequencyList = readKeywords(keywordCountList)
    return keywordCountList
EDIT - Currently I'm toying with the idea of reopening my first file, but this time without turning it into a list. I think my errors are somehow coming from counting the frequency of my list of lists. These are the types of results I am getting:
[[['the', 66], 1], [['of', 32], 1], [['and', 27], 1], [['a', 23], 1], [['i', 23], 1]]
You can try something like:
I am taking a list of words as an example.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}

for word in word_list:
    if word not in frequency_list:
        frequency_list[word] = 1
    else:
        frequency_list[word] += 1

print(frequency_list)
RESULT: {'test': 1, 'world': 1, 'hello': 2}
Since you have put a constraint on dicts, I have made use of two lists to do the same task. I am not sure how efficient it is, but it serves the purpose.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = []
frequency_word = []

for word in word_list:
    if word not in frequency_word:
        frequency_word.append(word)
        frequency_list.append(1)
    else:
        ind = frequency_word.index(word)
        frequency_list[ind] += 1

print(frequency_word)
print(frequency_list)
RESULT : ['hello', 'world', 'test']
[2, 1, 1]
You can change it to whatever you like or refactor it as you wish.
I agree with @bereal that you should use Counter for this. I see that you have said that you don't want "imports, dict, or zips", so feel free to disregard this answer. Yet one of the major advantages of Python is its great standard library, and every time you have list available, you'll also have dict, collections.Counter and re.
From your code I get the impression that you want to use the same style that you would have used with C or Java. I suggest trying to be a little more Pythonic. Code written this way may look unfamiliar and can take time getting used to. Yet you'll learn much more.
Clarifying what you're trying to achieve would help. Are you learning Python? Are you solving this specific problem? Why can't you use any imports, dict, or zip?
So here's a proposal utilizing built-in functionality (no third party), for what it's worth (tested with Python 2):
#!/usr/bin/python
import re           # string matching
import collections  # collections.Counter basically solves your problem

def loadwords(s):
    """Find the words in a long string.
    Words are separated by whitespace. Typical signs are ignored.
    """
    return (s
            .replace(".", " ")
            .replace(",", " ")
            .replace("!", " ")
            .replace("?", " ")
            .lower()).split()

def loadwords_re(s):
    """Find the words in a long string.
    Words are separated by whitespace. Only letters and ' are allowed in words.
    """
    return (re.sub(r"[^a-z']", " ", s.lower())
            .split())

# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")

# Sets are really fast for answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))

# Count every word in sourcefile_words
wordcount_all = collections.Counter(sourcefile_words)

# Look up word counts like this (Counter is a dictionary)
count_this = wordcount_all["this"]  # returns 2
count_a = wordcount_all["a"]        # returns 3

# Only count the words that are in the keywords set
wordcount_keywords = collections.Counter(word
                                         for word in sourcefile_words
                                         if word in keywords)
count_and = wordcount_keywords["and"]             # returns 2
all_counted_keywords = wordcount_keywords.keys()  # returns ['a', 'and', 'the', 'of']
Here is a solution with no imports. It uses nested linear searches, which are acceptable for a small number of searches over a small input array, but will become unwieldy and slow with larger inputs.
Still, the input here is quite large, and it handles it in reasonable time. I suspect that if your keywords file were larger (mine has only 3 words), the slowdown would start to show.
Here we take an input file, iterate over the lines, remove punctuation, then split by spaces and flatten all the words into a single list. The list has dupes, so to remove them we sort the list so the dupes come together, then iterate over it, creating a new list containing the string and a count. We can do this by incrementing the count as long as the same word appears in the list, and moving to a new entry when a new word is seen.
Now you have your word frequency list and you can search it for the required keyword and retrieve the count.
The input text file is here and the keyword file can be cobbled together with just a few words in a file, one per line.
This is Python 3 code; comments indicate, where applicable, how to modify it for Python 2.
# use string.punctuation if you are somehow allowed
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
words = []
with open('hamlet.txt') as f:
    for line in f:
        if line:
            line = line.translate(translator)
            # py 2 alternative
            #line = line.translate(None, string.punctuation)
            words.extend(line.strip().split())

# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()

thisword = ''
counts = []

# for each word in the list add to the count as long as the
# word does not change
for w in words:
    if w != thisword:
        counts.append([w, 1])
        thisword = w
    else:
        counts[-1][1] += 1

for c in counts:
    print('%s (%d)' % (c[0], c[1]))

# function to prevent need to break out of nested loop
def findword(clist, word):
    for c in clist:
        if c[0] == word:
            return c[1]
    return 0

# open keywords file and search for each word in the
# frequency list.
with open('keywords.txt') as f2:
    for line in f2:
        if line:
            word = line.strip()
            thiscount = findword(counts, word)
            print('keyword %s appears %d times in source' % (word, thiscount))
If you were so inclined you could modify findword to use a binary search, but it's still not going to be anywhere near a dict. collections.Counter is the right solution when there are no restrictions. It's quicker and less code.
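For what it's worth, a minimal sketch of that binary-search variant (my own addition, still import-free; it relies on counts being sorted by word, which it is, since words was sorted before counting):

def findword_binary(clist, word):
    # binary search over a list of [word, count] pairs sorted by word
    lo, hi = 0, len(clist)
    while lo < hi:
        mid = (lo + hi) // 2
        if clist[mid][0] < word:
            lo = mid + 1
        elif clist[mid][0] > word:
            hi = mid
        else:
            return clist[mid][1]
    return 0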
I am calling on the collective wisdom of Stack Overflow because I am at my wits' end trying to figure out how to do this, and I'm a newbie self-taught coder.
I have a txt file of Letters to the Editor that I need to split into their own individual files.
The files are all formatted in relatively the same way with:
For once, before offering such generous but the unasked for advice, put yourselves in...
Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...
Why is it that The Times does not urge totalitarian Arab states and terrorist...
PAUL STONEHILL Los Angeles
There you go again. Your editorial again makes groundless criticisms of the Israeli...
On Dec. 7 you called proportional representation “bizarre,” despite its use in the...
Proportional representation distorts Israeli politics? Huh? If Israel changes the...
MATTHEW SHUGART Laguna Beach
Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...
Although the mayor did not support Proposition U (the slow-growth initiative) his...
If West Los Angeles is any indication of the no-growth policy, where do we go from here?
MARJORIE L. SCHWARTZ Los Angeles
I thought the best way to go about it would be to use a regex to identify the lines that start with a name in all capital letters, since that's the only way to really tell where one letter ends and another begins.
I have tried quite a few different approaches, but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word (for example, the answers posted here: how to split single txt file into multiple txt files by Python, and here: Python read through file until match, read until next pattern). It all seems to fall apart when I adjust it to accept my regex of all-capital words.
The closest I've managed to get is the code below. It creates the right number of files, but after the second file is created it all goes wrong. The third file is empty, and in all the rest the text is out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc., or missing entirely.
import re

thefile = raw_input('Filename to split: ')
name_occur = []
full_file = []
pattern = re.compile("^[A-Z]{4,}")

with open(thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line)

totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)", "", thefile)

while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
            if pattern.search(line):
                doIHaveToCopyTheLine = True
                ignoreLines = True
    letters += 1
    f1.close()
I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.
I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went for: the first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. It will then write each letter to a new file with the same name as the original file (note: it assumes your file has an extension like .txt or such) but with an incremented integer appended. Try it out and see how it works for you.
import string

def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)
    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1
        else:
            current_letter.append(line)
I tested it on your sample input and it worked fine.
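A minimal usage example (letters.txt is just an illustrative name for your input file):

split_letters('letters.txt')
# produces letters1.txt, letters2.txt, ... with one letter in each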
While the other answer is suitable, you may still be curious about using a regex to split up a file.
import re

smallfile = None
buf = ""

with open('input_file.txt', 'rt') as f:
    for line in f:
        buf += str(line)
        if re.search(r'^([A-Z\s\.]+\b)', line) is not None:
            if smallfile:
                smallfile.close()
            match = re.findall(r'^([A-Z\s\.]+\b)', line)
            smallfile_name = '{}.txt'.format(match[0])
            smallfile = open(smallfile_name, 'w')
            smallfile.write(buf)
            buf = ""

if smallfile:
    smallfile.close()
If you run on Linux, use csplit.
Otherwise, check out these two threads:
How can I split a text file into multiple text files using python?
How to match "anything up until this sequence of characters" in a regular expression?
Here is my code
import re

with open('newfiles.txt') as f:
    k = f.read()

p = re.compile(r'[\w\:\-\.\,\']+|[^[\w\:\-\.\'\,]\s]')
originaltext = p.findall(k)

uniquelist = []
for word in originaltext:
    if word not in uniquelist:
        uniquelist.append(word)

indexes = ' '.join(str(uniquelist.index(word)+1) for word in originaltext)
n = p.findall(indexes)

file = open("newfiletwo.txt", "w")
file.write(' '.join(str(e) for e in n))
file.close()

file = open("newfilethree.txt", "w")
file.write(' '.join(uniquelist))
file.close()

with open('newfiletwo.txt') as f:
    indexess = f.read()

with open('newfilethree.txt') as f:
    differentwords = f.read()

differentwords = p.findall(differentwords)
indexess = [uniquelist.index(word) for word in originaltext]

for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
    i = differentwords.index(word)
    indexess.append(i)

s = ""  # the reconstructed sentence
for i in indexess:
    s = s + differentwords[i] + " "
print(s)
The program basically takes an external text file and returns the index of each word's position (if a word repeats, the first position is taken), then saves the positions to an external file. While doing this, I split up the text file, including splitting punctuation, and saved the different words and punctuation that occur in the file to an external file too. Now for the hard part: using both of these external files - the indexes and the separated words - I am trying to recreate the original text file, including the punctuation. But the error shown in the title occurs:
Traceback (most recent call last):
File "E:\Python\Index.py", line 31, in <module>
s = s + differentwords[i] + " "
IndexError: list index out of range
Not trying to sound rude, but I am a sort of beginner; please try to change as little as possible, in a simple way, as I created this myself. You probably know a far shorter way to do this, but this is the level of simplicity I can handle, as the length of the code shows. I have tried shortening the original text file, but that proved no use. Does anyone know why the error occurs and how to fix it? I am not looking for efficiency right now - maybe after another couple of months of learning - but the simplest (I don't mind long) answer will be the best. Sorry if I have repeated myself a lot :-)
'newfiles' - A bunch of sentences with punctuation
UPDATE
The code no longer shows the error, but it prints the original sentence twice. The error went away after I removed the +1 on line 23. Does anyone know why the output repeats twice, though?
The problem is how you decide what is a word and what is not. For instance, is a comma part of a word? In your case it is not declared as such, while it is also not a separator, so you end up with the comma (or the dot, and so on) as a separate word. I have no access to your input, so I can only provide a sample:
p = re.compile(r'[\w\:\-\.\,]+|[^[\w\:\-\.\,]\s]')
There is one catch in this case: 'Word', 'word', 'Word.' and 'word,' are all separate words, since the dot and the comma are parts of the word. You can't have your cake and eat it too. To fix that, you would need to store information about whether there is whitespace before the separator.
UPDATE:
Oh, yes - the double output. The files that are stored in the middle are OK, so something went wrong after that. Look at these two lines:
i = differentwords.index(word)
indexess.append(i)
They need to be inside the preceding if statement.
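That is, a minimal sketch of the corrected loop (note that differentwords is loaded from newfilethree.txt and already contains every unique word, so the if-body never runs, indexess keeps only the values from the list comprehension, and the sentence prints once):

for word in originaltext:
    if word not in differentwords:
        # only a genuinely new word extends both lists
        differentwords.append(word)
        i = differentwords.index(word)
        indexess.append(i)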