I'm a novice Python user. I'm trying to create a program that reads a text file and searches it for certain words that belong to groups I predefine by reading from a CSV. For example, if I wanted to create my own definition for "positive" containing the words "excited", "happy", and "optimistic", the CSV would contain those terms. I know the code below is messy. The txt file I am reading from contains 7 occurrences of the three "positive" tester words I read from the CSV, yet the result prints as 25. I think it's returning a character count, not a word count. Code:
import csv
import string
import re
from collections import Counter
remove = dict.fromkeys(map(ord, '\n' + string.punctuation))
# Read the .txt file to analyze.
with open("test.txt", "r") as f:
    textanalysis = f.read()
    textresult = textanalysis.lower().translate(remove).split()
# Read the CSV list of terms.
with open("positivetest.csv", "r") as senti_file:
    reader = csv.reader(senti_file)
    positivelist = list(reader)
# Convert term list into flat chain.
from itertools import chain
newposlist = list(chain.from_iterable(positivelist))
# Convert chain list into string.
posstring = ' '.join(str(e) for e in newposlist)
posstring2 = posstring.split(' ')
posstring3 = ', '.join('"{}"'.format(word) for word in posstring2)
# Count number of words as defined in list category
def positive(str):
    counts = dict()
    for word in posstring3:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    total = sum(counts.values())
    return total
# Print result; will write to CSV eventually
print ("Positive: ", positive(textresult))
I'm a beginner as well, but I stumbled upon a process that might help. After you read in the file, split the text at every space, tab, and newline. In your case, I would lowercase all the words and split off punctuation as well. Save this as a list and then loop over it to count the number of instances of each 'positive' (or other) word.
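Something like this minimal sketch of that idea might work, assuming the positive words are already loaded into a set (they are hard-coded here for illustration, and the file name is just a placeholder):
import re

positive_words = {"excited", "happy", "optimistic"}  # normally read from your CSV

with open("test.txt", "r") as f:
    text = f.read().lower()

# Split on anything that is not a letter, digit or apostrophe, which also drops punctuation.
tokens = re.findall(r"[a-z0-9']+", text)

# Count how many tokens fall into the "positive" group.
positive_count = sum(1 for token in tokens if token in positive_words)
print("Positive:", positive_count)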
Look at this, specifically the "train" function:
https://github.com/G3Kappa/Adjustable-Markov-Chains/blob/master/markovchain.py
Also, this link, ignore the JSON stuff at the beginning, the article talks about sentiment analysis:
https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-
Same applies with this link:
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Good luck!
I looked at your code and worked through some of my own as a sample.
I have two ideas for you, based on what I think you may want.
First assumption: you want a basic sentiment count?
Getting to [textresult] is great. You then did the same with the positive lexicon to get [positivelist], which I thought would be the perfect step. But then you converted [positivelist] into what is essentially one big sentence.
Would you not just:
1. Pass a 'stop_words' list through [textresult]
2. Merge the two dataframes [textresult (less stop words) and positivelist] on common words, as in an 'inner join'
3. Then basically do your term frequency
4. It is much easier to aggregate the score then (a sketch follows this list)
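A rough sketch of that flow, reusing textresult and newposlist from your code and assuming a hypothetical stop-word set:
from collections import Counter

stop_words = {"the", "a", "and", "of", "to"}  # hypothetical stop-word list

# 1. Remove stop words from the tokenised text.
filtered = [w for w in textresult if w not in stop_words]

# 2. Keep only tokens that also appear in the positive lexicon (the "inner join").
positive_terms = set(newposlist)
matches = [w for w in filtered if w in positive_terms]

# 3. Term frequency of the matched words.
term_freq = Counter(matches)

# 4. Aggregating the score is then just a sum.
print("Positive:", sum(term_freq.values()))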
Second assumption: you are focusing on "excited", "happy", and "optimistic",
and you are trying to isolate text themes into those 3 categories?
1. Again, stop at [textresult]
2. Download the 'nrc' and/or 'syuzhet' emotional valence dictionaries. They break emotive words down into 8 emotional groups, so if you only want 3 of the 8 groups you can take a subset
3. Process that subset like you did to get [positivelist]
4. Do another join (see the sketch below)
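Sketching that second route, assuming the lexicon has been exported to a CSV of word,emotion pairs (the file name, column layout and emotion names here are placeholders):
import csv

wanted = {"joy", "anticipation", "trust"}  # the 3 of the 8 groups you care about
lexicon = {}
with open("nrc_lexicon.csv", "r") as f:
    for word, emotion in csv.reader(f):
        if emotion in wanted:
            lexicon.setdefault(emotion, set()).add(word)

# Join against the tokenised text and count hits per emotion group.
for emotion, terms in lexicon.items():
    hits = sum(1 for w in textresult if w in terms)
    print(emotion, hits)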
Sorry, this is a bit hashed up, but if I was anywhere near what you were thinking let me know and we can make contact.
Second apology: I'm also a novice Python user; I am adapting what I use in R to Python in the above (it's not subtle either :) )
Here is my code
import re
with open('newfiles.txt') as f:
    k = f.read()
p = re.compile(r'[\w\:\-\.\,\']+|[^[\w\:\-\.\'\,]\s]')
originaltext = p.findall(k)
uniquelist = []
for word in originaltext:
    if word not in uniquelist:
        uniquelist.append(word)
indexes = ' '.join(str(uniquelist.index(word)+1) for word in originaltext)
n = p.findall(indexes)
file = open("newfiletwo.txt","w")
file.write (' '.join(str(e) for e in n))
file.close()
file = open("newfilethree.txt","w")
file.write(' '.join(uniquelist))
file.close()
with open('newfiletwo.txt') as f:
    indexess = f.read()
with open('newfilethree.txt') as f:
    differentwords = f.read()
differentwords = p.findall(differentwords)
indexess = [uniquelist.index(word) for word in originaltext]
for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
    i = differentwords.index(word)
    indexess.append(i)
s = "" # the reconstructed sentence
for i in indexess:
    s = s + differentwords[i] + " "
print(s)
The program basically takes an external text file, works out the index position of each word (if a word repeats, the first position is used) and saves those positions to an external file. While doing this, I split up the text file (including splitting off punctuation) and save the distinct words and punctuation that occur in the file to another external file. Now for the hard part: using both of these external files - the indexes and the distinct separated words - I am trying to recreate the original text file, including the punctuation. But this error occurs:
Traceback (most recent call last):
File "E:\Python\Index.py", line 31, in <module>
s = s + differentwords[i] + " "
IndexError: list index out of range
Not trying to sound rude, but I am something of a beginner, so please try to change as little as possible and keep it simple, since I created this myself. You may well know a far shorter way to do this, but this is the level of simplicity I can handle, as the length of the code shows. I have tried shortening the original text file, but that was no use. Does anyone know why the error occurs and how to fix it? I am not looking for efficiency right now - maybe after another couple of months of learning - but the simplest answer (I don't mind long) will be the best. Sorry if I have repeated myself a lot :-)
'newfiles' - A bunch of sentences with punctuation
UPDATE
The code no longer shows the error, but it prints the original sentence twice. The error went away after removing the +1 on line 23. Does anyone know why the output repeats twice, though?
The problem is how you decide what is a word and what is not. For instance, is a comma part of a word? In your case it is not listed as such, but it is also not treated as a separator, so you end up with the comma, the dot, and so on as separate words. I have no access to your input, so I can only provide a sample:
p = re.compile(r'[\w\:\-\.\,]+|[^[\w\:\-\.\,]\s]')
There is one catch - with this pattern, 'Word', 'word', 'Word.' and 'word,' are all separate words, since the dot and the comma are now parts of the word. You can't eat your cake and have it. To fix that, you would need to store whether there was white space before each separation.
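A minimal sketch of that, assuming it is enough to remember whether each token was preceded by white space (the sample text and pattern are only illustrative):
import re

text = "Hello, world. Hello again!"  # stand-in for the file contents
p = re.compile(r"[\w':\-.,]+|[^\w\s]")

# Record each token together with a flag saying whether white space preceded it,
# so the original spacing can be reproduced when the text is rebuilt.
tokens = []
for m in p.finditer(text):
    preceded_by_space = m.start() > 0 and text[m.start() - 1].isspace()
    tokens.append((m.group(), preceded_by_space))

# Rebuild the text using the stored spacing information.
rebuilt = ''.join((' ' if space else '') + word for word, space in tokens)
print(rebuilt == text)  # True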
UPDATE:
Oh, yes. Double output. The files that are stored in the middle are OK, so something goes wrong after that. Look at these two lines:
i = differentwords.index(word)
indexess.append(i)
They need to be inside preceding if statement.
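Applied to the loop from the question, the suggested change is just a matter of indentation:
for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
        i = differentwords.index(word)
        indexess.append(i)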
I have written a compression program and tested it on 10 KB text files, which took no less than 3 minutes. However, I've tested it with a 1 MB file, which is the assessment assigned by my teacher, and it takes longer than half an hour. Compared to my classmates', mine is unusually slow. It might be my computer or my code, but I have no idea. Does anyone know any tips or shortcuts to make my code faster? My compression code is below; if there are any quicker ways of doing the loops, etc., please send me an answer (:
(By the way, my code DOES work, so I'm not asking for corrections, just enhancements or tips, thanks!)
import re #used to enable functions(loops, etc.) to find patterns in text file
import os #used for anything referring to directories(files)
from collections import Counter #used to keep track on how many times values are added
size1 = os.path.getsize('file.txt') #find the size(in bytes) of your file, INCLUDING SPACES
print('The size of your file is ', size1,)
words = re.findall('\w+', open('file.txt').read())
wordcounts = Counter(words) #turns all words into array, even capitals
common100 = [x for x, it in Counter(words).most_common(100)] #identifies the 100 most common words
keyword = []
kcount = []
z = dict(wordcounts)
for key, value in z.items():
    keyword.append(key) #adds each keyword to the array called keywords
    kcount.append(value)
characters =['$','#','#','!','%','^','&','*','(',')','~','-','/','{','[', ']', '+','=','}','|', '?','cb',
'dc','fd','gf','hg','kj','mk','nm','pn','qp','rq','sr','ts','vt','wv','xw','yx','zy','bc',
'cd','df','fg','gh','jk','km','mn','np','pq','qr','rs','st','tv','vw','wx','xy','yz','cbc',
'dcd','fdf','gfg','hgh','kjk','mkm','nmn','pnp','qpq','rqr','srs','tst','vtv','wvw','xwx',
'yxy','zyz','ccb','ddc','ffd','ggf','hhg','kkj','mmk','nnm','ppn','qqp','rrq','ssr','tts','vvt',
'wwv','xxw','yyx''zzy','cbb','dcc','fdd','gff','hgg','kjj','mkk','nmm','pnn','qpp','rqq','srr',
'tss','vtt','wvv','xww','yxx','zyy','bcb','cdc','dfd','fgf','ghg','jkj','kmk','mnm','npn','pqp',
'qrq','rsr','sts','tvt','vwv','wxw','xyx','yzy','QRQ','RSR','STS','TVT','VWV','WXW','XYX','YZY',
'DC','FD','GF','HG','KJ','MK','NM','PN','QP','RQ','SR','TS','VT','WV','XW','YX','ZY','BC',
'CD','DF','FG','GH','JK','KM','MN','NP','PQ','QR','RS','ST','TV','VW','WX','XY','YZ','CBC',
'DCD','FDF','GFG','HGH','KJK','MKM','NMN','PNP','QPQ','RQR','SRS','TST','VTV','WVW','XWX',
'YXY','ZYZ','CCB','DDC','FFD','GGF','HHG','KKJ','MMK','NNM','PPN','QQP','RRQ','SSR','TTS','VVT',
'WWV','XXW','YYX''ZZY','CBB','DCC','FDD','GFF','HGG','KJJ','MKK','NMM','PNN','QPP','RQQ','SRR',
'TSS','VTT','WVV','XWW','YXX','ZYY','BCB','CDC','DFD','FGF','GHG','JKJ','KMK','MNM','NPN','PQP',] #characters which I can use
symbols_words = []
char = 0
for i in common100:
    symbols_words.append(characters[char]) #assigns one placeholder symbol to each common word
    char = char + 1
print("Compression has now started")
f = 0
g = 0
no = 0
while no < 100:
    for i in common100:
        for w in words:
            if i == w and len(i)>1: #if the values in common100 are ACTUALLY in words
                place = words.index(i) #find exactly where the most common words are in the text
                symbols = symbols_words[common100.index(i)] #assigns one character with one common word
                words[place] = symbols # replaces the word with the symbol
                g = g + 1
    no = no + 1
string = words
stringMade = ' '.join(map(str, string))#makes the list into a string so you can put it into a text file
file = open("compression.txt", "w")
file.write(stringMade)#imports everything in the variable 'words' into the new file
file.close()
size2 = os.path.getsize('compression.txt')
no1 = int(size1)
no2 = int(size2)
print('Compression has finished.')
print('Your original file size has been compressed by', 100 - ((100/no1) * no2 ) ,'percent.'
'The size of your file now is ', size2)
Using something like
word_substitutes = dict(zip(common100, characters))
will give you a dict that maps common words to their corresponding symbol.
Then you can simply iterate over the words:
# Iterate over all the words
# Use enumerate because we're going to modify the word in-place in the words list
for word_idx, word in enumerate(words):
    # If the current word is in the `word_substitutes` dict, then we know its in the
    # 'common' words, and can be replaced by the symbol
    if word in word_substitutes:
        # Replaces the word in-place
        replacement_symbol = word_substitutes[word]
        words[word_idx] = replacement_symbol
This will give much better performance, because the dictionary lookup used for the common-word-to-symbol mapping is effectively constant time (it is hash-based) rather than linear. So the overall complexity will be roughly O(N), rather than the O(N^3)-like behaviour you get from the two nested loops with the .index() call inside them.
The first thing I see that is bad for performance is:
for i in common100:
    for w in words:
        if i == w and len(i)>1:
            ...
What you are doing is seeing if the word w is in your list of common100 words. However, this check can be done in O(1) time by using a set and not looping through all of your top 100 words for each word.
common_words = set(common100)
for w in words:
    if w in common_words:
        ...
Generally you would do the following:
Measure how much time each "part" of your program needs. You could use a profiler (e.g. cProfile in the standard library) or simply sprinkle some times.append(time.time()) calls into your code and compute the differences; a small timing sketch follows below. Then you know which part of your code is slow.
See if you can improve the algorithm of the slow part. gnicholas' answer shows one possibility to speed things up. The while no < 100 loop seems suspicious; maybe that can be improved. This step needs an understanding of the algorithms you use. Be careful to select the best data structures for your use case.
If you can't use a better algorithm (because you are already calculating things the best way), you need to speed up the computations themselves. Numerical work benefits from numpy; with cython you can basically compile Python code to C, and numba uses LLVM to compile.
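A minimal timing sketch for the first point, with purely illustrative stage names:
import time

timings = {}

start = time.time()
# ... read the file and build `words` here ...
timings["read + tokenise"] = time.time() - start

start = time.time()
# ... run the replacement loop here ...
timings["replace common words"] = time.time() - start

for stage, seconds in timings.items():
    print(stage, round(seconds, 3), "s")
Alternatively, running the whole script under the profiler (python -m cProfile -s cumtime yourscript.py) gives per-function timings without editing the code.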
Say my CSV file is like:
love, like, 200
love, like, 50
say, claim, 30
where the numbers stand for the counts of those words co-occurring together in different contexts.
I want to combine the counts of the similar words. So I want to output something like:
love, like, 250
say, claim, 30
I've been looking around but it seems that I'm stuck with this simple issue.
Without seeing an exact CSV it's hard to know what's appropriate. The code below assumes the last token is a count, and it matches on everything before the last comma.
# You'd need to replace the below with the appropriate code to open your file
file = """love, like, 200
love, like, 50
love, 20
say, claim, 30"""
file = file.split("\n")
words = {}
for line in file:
    word, count = line.rsplit(",", 1) # Note this uses String.rsplit() NOT String.split()
    words[word] = words.get(word, 0) + int(count)
for word in words:
print word,": ",words[word]
And outputs this:
say, claim : 30
love : 20
love, like : 250
Depending on what exactly your application is, I think I would actually recommend using a Counter here. Counter is a class from Python's collections module that lets you keep track of how many of each thing you have. For example, in your situation you could just iteratively update a Counter object.
for instance:
from collections import Counter
with open("your_file.txt", "rb") as source:
counter = Counter()
for line in source:
entry, count = line.rsplit(",", 1)
counter[entry] += int(count)
At which point you can either write the data back out as a csv or just continue to use it.
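For instance, a minimal way to write the merged counts back out in the same comma-separated shape (the output file name is just a placeholder):
with open("combined.csv", "w") as out:
    for entry, count in counter.items():
        # `entry` is everything before the last comma, e.g. "love, like"
        out.write("{}, {}\n".format(entry, count))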