Consider the following .txt file, myfile.txt:
Box-No.: DK10-95794
Total Discounts USD 1,360.80
Totat: usp 529.20
As you can see, in the above text file there are two errors, totat and usp (they should be total and usd).
Now, I am using a Python package built on SymSpell, called SymSpellPy. It can check a word and determine whether it's spelled incorrectly.
This is my Python script:
import os
import re

from symspellpy import SymSpell, Verbosity

# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 2
prefix_length = 7
# create object
sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
# load dictionary
dictionary_path = os.path.join(
    os.path.dirname(__file__), "Dictionaries/eng.dictionary.txt")
term_index = 0   # column of the term in the dictionary text file
count_index = 1  # column of the term frequency in the dictionary text file
sym_spell.load_dictionary(dictionary_path, term_index, count_index)

with open("myfile.txt", "r") as file:
    for line in file:
        # check the line word by word
        for word in re.findall(r'\w+', line):
            input_term = word
            # max edit distance per lookup
            max_edit_distance_lookup = 2
            suggestion_verbosity = Verbosity.CLOSEST  # TOP, CLOSEST, ALL
            suggestions = sym_spell.lookup(input_term, suggestion_verbosity,
                                           max_edit_distance_lookup)
            # display the input term and its closest suggestion
            for suggestion in suggestions:
                word = word.replace(input_term, suggestion.term)
                print("{}, {}".format(input_term, word))
Running the above script on my text file gives me this output:
Total, Total
USD, USD
Totat, Total
As you can see, it correctly catches the last word totat => total.
My question is: how can I find misspelled words and correct them in a txt file?
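One way is to rebuild each line with the corrected words and write it out to a new file. Here is a minimal sketch building on the script above; the output filename, the isalpha() guard for tokens like box numbers and prices, and falling back to the original word when there is no suggestion are my assumptions:

import os
import re

from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(2, 7)
dictionary_path = os.path.join(
    os.path.dirname(__file__), "Dictionaries/eng.dictionary.txt")
sym_spell.load_dictionary(dictionary_path, 0, 1)

def correct_word(word):
    # Leave tokens containing digits (box numbers, prices) alone.
    if not word.isalpha():
        return word
    suggestions = sym_spell.lookup(word, Verbosity.CLOSEST, 2)
    # Keep the original word if SymSpell has no suggestion for it.
    return suggestions[0].term if suggestions else word

with open("myfile.txt") as src, open("myfile_corrected.txt", "w") as dst:
    for line in src:
        # Rewrite each word token in place, leaving punctuation and spacing intact.
        dst.write(re.sub(r'\w+', lambda m: correct_word(m.group(0)), line))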
I have a txt file of data that is recorded daily. The program runs every day, records the data it receives from the user, and assigns a number to each entry,
like this:
#1
data number 1
data
data
data
-------------
#2
data number 2
text
text
-------------
#3
data number 3
-------------
My problem is in numbering the data. For example, when I run the program to record an entry in the txt file, the program should find the number of the last recorded entry, add one to it, and record my data under that number.
But I can't work out how to find the last entry number.
I tried this:
Find "#" in the text, list all the numbers that follow a hashtag, and take the biggest, which should be the number of the last recorded entry:
record_list = []
text_file = open(r'test.txt', 'r')
lines = text_file.read().splitlines()
for Number in lines:
    hashtag = Number[Number.find('#')]
    if hashtag == '#':
        hashtag = Number[Number.find('#')+1]
        hashtag = int(hashtag)
        record_list.append(hashtag)
last_number = max(record_list)
But when I use hashtag = Number[Number.find('#')], even on lines where there is no hashtag, it returns the first or last character of that line as the hashtag.
And if the text file is empty, it gives the following error:
hashtag = Number[Number.find('#')]
~~~~~~^^^^^^^^^^^^^^^^^^
IndexError: string index out of range
How can I find the number of the last entry and use it when saving the next one?
Consider:
>>> s = "hello world"
>>> s[s.find('#')]
'd'
>>> s.find('#')
-1
If # is not in the line, -1 is returned, which, when used as an index, returns the last character.
We can use regular expressions and a list comprehension as one approach to solving this: iterate over the lines, selecting only those that match the pattern of a numbered line, capture the number part, and convert it to an int. We select the last one, which should be the highest number. (The := assignment expression below requires Python 3.8+.)
import re

with open('test.txt', 'r') as text_file:
    next_number = [
        int(m.group(1))
        for x in text_file.read().splitlines()
        if (m := re.match(r'^\s*#(\d+)\s*$', x))
    ][-1] + 1
Or we can pass a generator expression to max to ensure we get the highest number.
with open('test.txt', 'r') as text_file:
    next_number = max(
        int(m.group(1))
        for x in text_file.read().splitlines()
        if (m := re.match(r'^\s*#(\d+)\s*$', x))
    ) + 1
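Note that both versions still fail on an empty file: the list version raises IndexError and max() raises ValueError on an empty sequence, which is the same symptom the asker hit. max() accepts a default for exactly this case (starting at #1 when there are no entries yet is my assumption):

import re

with open('test.txt', 'r') as text_file:
    next_number = max(
        (int(m.group(1))
         for x in text_file.read().splitlines()
         if (m := re.match(r'^\s*#(\d+)\s*$', x))),
        default=0,  # empty file: the next entry becomes #1
    ) + 1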
I have a folder with a number of txt files.
I want to count the number of occurrences of a set of words in a certain part of each txt file and export the results to a new Excel file.
Specifically, I want to look for the occurrences of the words only in the part of the text that begins after the words "Company A" and ends at the words "Company B".
For example:
I want to look for the words "Corporation" and "Board" only in the part of the following text between "Company A" and "Company B":
...the Board of Company A oversees the management of risks inherent in the operation of the Corporation businesses and the implementation of its strategic plan. The Board reviews the risks associated with the Corporation strategic plan at an annual strategic planning session and periodically throughout the year as part of its consideration of the strategic direction of Company B. In addition, the Board addresses the primary risks associated with...
I have managed to count the occurrences of the set of words, but across the whole txt file rather than just the part from "Company A" up to "Company B".
import os
import sys
import glob

for filename in glob.iglob('file path' + '**/*', recursive=True):
    def countWords(filename, list_words):
        try:
            reading = open(filename, "r+", encoding="utf-8")
            check = reading.readlines()
            reading.close()
            for each in list_words:
                lower = each.lower()
                count = 0
                for string in check:
                    word_check = string.split()
                    for word in word_check:
                        lowerword = word.lower()
                        line = lowerword.strip("!##$%^&*()_+?><:.,-'\\ ")
                        if lower == line:
                            count += 1
                print(lower, ":", count)
        except FileNotFoundError:
            print("This file doesn't exist.")
            for zero in list_words:
                if zero != "":
                    print(zero, ":", "0")
                else:
                    pass
    print('----')
    print(os.path.basename(filename))
    countWords(filename, ["Corporation", "Board"])
The final output for the example text should be like this:
txtfile1
Corporation: 2
Board: 1
And the above process should be replicated for all the txt files in the folder and exported as an Excel file.
Thanks for the consideration and I apologize in advance for the length of the question.
You might try a regexp, assuming you want the whole string if you see repetitions of "company a" before you see "company b":
re.findall('company a.*?company b', 'company a did some things in agreement with company b')
That will provide a list of all the text strings starting with company a and ending with company b.
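To tie that back to the original goal, here is a minimal sketch under some assumptions of mine: the 'file path' placeholder from the question, a single Company A...Company B section per file, and pandas (with openpyxl) available for the Excel export:

import glob
import os
import re

import pandas as pd  # assumed available, used only for the Excel export

words_to_count = ["corporation", "board"]
rows = []

for filename in glob.iglob('file path' + '**/*.txt', recursive=True):
    with open(filename, encoding="utf-8") as f:
        text = f.read()
    # Keep only the part between "Company A" and "Company B".
    m = re.search(r'Company A(.*?)Company B', text, flags=re.DOTALL)
    tokens = re.findall(r'\w+', m.group(1).lower()) if m else []
    row = {'file': os.path.basename(filename)}
    for w in words_to_count:
        row[w] = tokens.count(w)
    rows.append(row)

pd.DataFrame(rows).to_excel('word_counts.xlsx', index=False)

On the example text this yields corporation: 2 and board: 1, matching the expected output.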
I'm a novice Python user. I'm trying to create a program that reads a text file and searches it for certain words that are grouped (predefined by reading from a csv). For example, if I wanted to create my own definition for "positive" containing the words "excited", "happy", and "optimistic", the csv would contain those terms. I know the below is messy: the txt file I am reading from contains 7 occurrences of the three "positive" tester words I read from the csv, yet the result prints out as 25. I think it's returning a character count, not a word count. Code:
import csv
import string
import re
from collections import Counter

remove = dict.fromkeys(map(ord, '\n' + string.punctuation))

# Read the .txt file to analyze.
with open("test.txt", "r") as f:
    textanalysis = f.read()
    textresult = textanalysis.lower().translate(remove).split()

# Read the CSV list of terms.
with open("positivetest.csv", "r") as senti_file:
    reader = csv.reader(senti_file)
    positivelist = list(reader)

# Convert term list into flat chain.
from itertools import chain
newposlist = list(chain.from_iterable(positivelist))

# Convert chain list into string.
posstring = ' '.join(str(e) for e in newposlist)
posstring2 = posstring.split(' ')
posstring3 = ', '.join('"{}"'.format(word) for word in posstring2)

# Count number of words as defined in list category
def positive(str):
    counts = dict()
    for word in posstring3:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    total = sum(counts.values())
    return total

# Print result; will write to CSV eventually
print("Positive: ", positive(textresult))
I'm a beginner as well, but I stumbled upon a process that might help. After you read in the file, split the text at every space, tab, and newline. In your case, I would keep all the words lowercase and include punctuation in your split call. Save this as an array and then parse it with some sort of loop to get the number of instances of each 'positive', or other, word.
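A minimal sketch of that process (the filenames come from the question; lowercasing both sides before comparing is the key point, and it also sidesteps the bug above, where looping over the posstring3 string walks characters rather than words):

import csv
import re

# Tokenize the text file into lowercase words, dropping punctuation.
with open("test.txt", "r") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

# Flatten the CSV of terms into a set for fast membership tests.
with open("positivetest.csv", "r") as senti_file:
    positive_terms = {cell.strip().lower()
                      for row in csv.reader(senti_file)
                      for cell in row if cell.strip()}

print("Positive:", sum(1 for w in words if w in positive_terms))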
Look at this, specifically the "train" function:
https://github.com/G3Kappa/Adjustable-Markov-Chains/blob/master/markovchain.py
Also, this link (ignore the JSON stuff at the beginning); the article talks about sentiment analysis:
https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-
The same applies to this link:
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Good luck!
I looked at your code and ran some of my own data through it as a sample.
I have two ideas for you, based on what I think you may want.
First assumption: you want a basic sentiment count?
Getting to 'textresult' is great. Then you did the same with the 'positive lexicon', to [positivelist], which I thought would be the perfect action? But then you converted [positivelist] into, essentially, one big sentence.
Would you not just:
1. Pass a 'stop_words' list through [textresult]
2. Merge the two lists [textresult (less stop words) and positivelist] on common words, as in an 'inner join'
3. Then basically do your term frequency
4. Aggregate the score, which is much easier then (see the sketch after this list)
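A sketch of steps 1-4 (it assumes textresult and newposlist from the question's code are in scope; the stop-word list here is only a stand-in):

from collections import Counter

stop_words = {"the", "a", "an", "and", "of", "to", "in"}  # stand-in list
positive_terms = set(newposlist)

# Step 1: drop stop words from the token list.
filtered = [w for w in textresult if w not in stop_words]

# Steps 2-3: the 'inner join' -- keep only tokens that also appear in the
# lexicon -- then take the term frequency of what survives.
term_freq = Counter(w for w in filtered if w in positive_terms)

# Step 4: aggregate the score.
print(term_freq)
print(sum(term_freq.values()))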
Second assumption: you are focusing on "excited", "happy", and "optimistic", and you are trying to isolate text themes into those 3 categories?
1. Again, stop at [textresult]
2. Download the 'nrc' and/or 'syuzhet' emotional valence dictionaries; they break down emotive words into 8 emotional groups, so you can subset just the 3 emotive groups you want
3. Process it like you did to get [positivelist]
4. Do another join
Sorry, this is a bit hashed up, but if I was anywhere near what you were thinking, let me know and we can make contact.
Second apology: I'm also a novice Python user, and I am adapting what I use in R to Python in the above (it's not subtle either :) ).
Does anyone know how to replace a word in a text file?
Here's one line from my stock file:
bread 0.99 12135479 300 200 400
I want to be able to replace the 4th word (in this instance '300') in 'productline', when I write it out, with the new number created by the nstock part of this code:
for line in details:                        # for every line in the file
    if digits in line:                      # if the barcode is in the line
        productline = line                  # store the line as 'productline'
        itemsplit = productline.split(' ')  # separate it into individual words
        price = float(itemsplit[1])         # the price is the 2nd part of the line
        current = int(itemsplit[3])         # the current stock level is the 4th part of the line
        quantity = int(input("How much of the product do you wish to purchase?\n"))
        if quantity < current:
            total = price * quantity        # work out the total price
            print("Your total spent on this product is:\n" + "£" + str(total) + "\n")  # tell the user how much they have spent in total
            with open("updatedstock.txt", "w") as f:
                f.writelines(productline)   # write the line with the product in
            nstock = int(current - quantity)  # new stock level = current level minus quantity
My code does not replace the 4th word (which is the current stock level) with the new stock level (nstock)
Actually you can use regular expressions for that purpose.
import re

string1 = 'bread 0.99 12135479 300 200 400'
pattern = '300'
to_replace_with = "youyou"
string2 = re.sub(pattern, to_replace_with, string1)
You will have the output below:
'bread 0.99 12135479 youyou 200 400'
Hope this was what you were looking for ;)
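One caveat about the pattern approach: re.sub will replace '300' wherever it appears in the line, not only in the stock field. Since the stock level is always the 4th space-separated field, replacing it by index is safer; here is a small sketch with illustrative values (nstock would come from the question's current - quantity calculation):

productline = 'bread 0.99 12135479 300 200 400'
nstock = 260  # e.g. current stock 300 minus a purchase of 40

fields = productline.split(' ')
fields[3] = str(nstock)        # the 4th field holds the current stock level
productline = ' '.join(fields)
print(productline)             # bread 0.99 12135479 260 200 400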
Decided to delete and ask again; it was just easier! Please do not vote down, as I have taken on board what people have been saying.
I have two nested dictionaries:-
wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}
The first dictionary links words to file numbers and the number of times they appear in each file. The second contains searches, linking each word to the number of times it appears in the current search.
I want to extract certain values so that, for each search, I can calculate the scalar product between the number of times words appear in a file and the number of times they appear in the search, divided by their magnitudes, i.e. (word 1 appearances in search * word 1 appearances in file) + (word 2 appearances in search * word 2 appearances in file), etc. Then I can see which file is most similar to the current search, and return a dictionary mapping searches to lists of file numbers, most similar first, least similar last.
Expected output is a dictionary:
{1:[4,3,1,2],2:[1,2,4,3]}
etc.
The key is the search number, the value is a list of files most relevant first.
(These may not actually be right.)
This is what I have:-
def retrieve():
    results = {}
    for word in search:
        numberOfAppearances = wordFrequency.get(word).values()
        for appearances in numberOfAppearances:
            results[fileNumber] = numberOfAppearances.dot()
    return sorted(results.iteritems(), key=lambda (fileNumber, appearances): appearances, reverse=True)
Sorry, no, it just says wdir= and then the directory the .py file is in.
Edit
The entire Retrieve.py file:
from collections import Counter

def retrieve():
    wordFrequency = {'bit': {1: 3, 2: 4, 3: 19, 4: 0}, 'red': {1: 0, 2: 0, 3: 15, 4: 0}, 'dog': {1: 3, 2: 0, 3: 4, 4: 5}}
    search = {1: {'bit': 1}, 2: {'red': 1, 'dog': 1}, 3: {'bit': 2, 'red': 3}}
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
    return results
I am using the Spyder GUI/IDE for Anaconda Python 2.7. I just press the green play button and the output is:
wdir='/Users/danny/Desktop'
Edit 2
Regarding the magnitude: for example, for search number 3 and file 1 it would be:
sqrt (2^2 + 3^2 + 0^2) * sqrt (3^2 + 0^2 + 3^2)
Here is a start:
from collections import Counter

def retrieve():
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file
        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
    return results

print retrieve()
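That ranks files by the raw dot product only. To fold in the magnitudes from Edit 2 and turn the score into a cosine similarity, here is one sketch; it uses .items() and print() so it runs under both Python 2.7 and 3, and the zero-magnitude guard is my own addition:

import math
from collections import Counter

wordFrequency = {'bit': {1: 3, 2: 4, 3: 19, 4: 0},
                 'red': {1: 0, 2: 0, 3: 15, 4: 0},
                 'dog': {1: 3, 2: 0, 3: 4, 4: 5}}
search = {1: {'bit': 1}, 2: {'red': 1, 'dog': 1}, 3: {'bit': 2, 'red': 3}}

def retrieve():
    results = {}
    for search_number, words in search.items():
        dot = Counter()
        for word, num_appearances in words.items():
            for file_id, appear_in_file in wordFrequency.get(word, {}).items():
                dot[file_id] += num_appearances * appear_in_file
        # Magnitudes: the search vector over its own words, the file vector
        # over every word in wordFrequency (as in the Edit 2 example).
        search_mag = math.sqrt(sum(n * n for n in words.values()))
        scores = {}
        for file_id in dot:
            file_mag = math.sqrt(sum(freq.get(file_id, 0) ** 2
                                     for freq in wordFrequency.values()))
            scores[file_id] = dot[file_id] / (search_mag * file_mag or 1)
        # Most similar file first.
        results[search_number] = sorted(scores, key=scores.get, reverse=True)
    return results

print(retrieve())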