Merge and sum similar CSV entries - python

Say my CSV file is like:
love, like, 200
love, like, 50
say, claim, 30
where the numbers stand for the counts of those words co-occurring together in different contexts.
I want to combine the counts of the similar words. So I want to output something like:
love, like, 250
say, claim, 30
I've been looking around but it seems that I'm stuck with this simple issue.

Without seeing an exact CSV it's hard to know what's appropriate. The code below assumes the last token on each line is a count, and it matches on everything before the last comma.
# You'd need to replace the below with the appropriate code to open your file
file = """love, like, 200
love, like, 50
love, 20
say, claim, 30"""
file = file.split("\n")
words = {}
for line in file:
    word, count = line.rsplit(",", 1)  # note: this uses str.rsplit(), NOT str.split()
    words[word] = words.get(word, 0) + int(count)
for word in words:
    print(word, ":", words[word])
And outputs this (dicts preserve insertion order in Python 3.7+):
love, like : 250
love : 20
say, claim : 30

Depending on what exactly your application is, I think I would actually recommend using a Counter here. Counter is a class in Python's collections module that lets you keep track of how many of each thing there are. For example, in your situation you could just iteratively update a Counter object.
for instance:
from collections import Counter

counter = Counter()
with open("your_file.txt", "r") as source:  # text mode, since we rsplit on a str
    for line in source:
        entry, count = line.rsplit(",", 1)
        counter[entry] += int(count)
At which point you can either write the data back out as a csv or just continue to use it.
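Writing the merged counts back out can be sketched like this (a StringIO stands in for a real output file, and the totals are just the question's example numbers; note that the csv module quotes entries containing commas automatically):

```python
import csv
import io
from collections import Counter

# merged totals from the loop above (values here are the question's example)
counter = Counter({"love, like": 250, "say, claim": 30})

out = io.StringIO()  # stands in for a real output file
writer = csv.writer(out)
for entry, total in counter.items():
    writer.writerow([entry, total])

# entries containing commas come out quoted, e.g.:  "love, like",250
print(out.getvalue())
```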

Related

Counting Occurrences of words in file

Just a preface, I have read--far too many--of the posts here about the same topic, and none of them quite cover the specific guidelines I'm under. I'm supposed to create an algorithm that counts the occurrence of each word in a text file, and display each as such:
"The: 4
Jump: 2
Fox: 6".
The terms I'm under are to use the skills we learned in our beginner Python class, which means we cannot use dictionaries, counters, sets, or lists (basically anything that would help shorten our code, tbh). I'm not the best at Python so I've been struggling... pretty hard, to say the least. The closest I've gotten was scrabbling my old notes together from my previous class and finding a demo code that I reformatted.
wordsinlist = "words.txt"
word = input("Enter word to be searched:")
count = 0
with open("words.txt", 'r') as wordlist:
    for line in wordlist:
        words = line.split()
        for i in words:
            if i == word:
                count = count + 1
print("Occurrences of the word:")
print(count)
The issue with this is that I need my code to display all of the words and their occurrences at once, with no search input. There's definitely a way to do this, but I'm not the sharpest tool in the shed, and I've been going at it for like 5 hours now haha.
It definitely needs to look a little closer to this:
#Output
The: 112
History: 29
Learning: 25
Any help or hints are much appreciated! Thank you in advance! I know it's a dumb question, these online classes are really frustrating.
Without lists (or similar) I think it's impossible... you're probably allowed to use lists, that is basic Python!!
If you need to count the occurrences of all words, you don't need to read them in with input(), right?
So this is one simple solution:
count = 0
with open("words.txt", 'r') as fp:
    lines = fp.readlines()
lines_1 = [element.strip() for element in lines]  # assumes one word per line
lines_2 = list(set(lines_1))
for w in lines_2:
    for l in lines_1:
        if l == w:
            count = count + 1
    print("Occurrences of {} : {}".format(w, count))
    count = 0
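If plain lists and sets are in fact allowed, the two-pass idea above collapses to a few lines; an inline string stands in for words.txt here, with counts chosen to match the question's example output:

```python
# stands in for the contents of words.txt
text = "the fox the jump fox the the fox jump fox fox fox"
words = text.lower().split()

# one pass over the distinct words, counting each in the full list
for w in sorted(set(words)):
    print("{}: {}".format(w, words.count(w)))
```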

Counting how many times a string appears in a CSV file

I have a piece of code that is supposed to tell me how many times a word occurs in a CSV file. Note: the file is pretty large (2 years of text messages).
This is my code:
key_word1 = 'Exmple_word1'
key_word2 = 'Example_word2'
counter = 0
with open('PATH_TO_FILE.csv', encoding='UTF-8') as a:
    for line in a:
        if (key_word1 or key_word2) in line:
            counter = counter + 1
print(counter)
There are two words because I did not know how to make it non-case sensitive.
To test it I used the find function in word on the whole file (using only one of the words as I was able to do a non-case sensitive search there) and I received more than double of what my code has calculated.
At first I did use the value_counts() function BUT I received different values for the same word (searching Exmple_word1 appeared 32 and 56 times and 2 times and so on). I kind of got stuck there for a while, but it got me thinking. I use two keyboards on my phone which I change regularly - could it be that the same words could actually be different, and that would explain why I am getting these results?
Also, I pretty much checked all sources regarding this matter and I found different approaches that did not actually do what I want them to do. ( the value_counts() method for example)
If that is the case, how can I fix this?
Notice some mistakes in your code:
1. key_word1 or key_word2 is "lazy": since the left part, key_word1, is a non-empty string and therefore truthy, the expression evaluates to key_word1 and key_word2 is never even looked at. This causes the code to check only whether key_word1 appeared in the line.
An example to emphasize:
w1 = 'word1'
w2 = 'word2'
s = 'bla word2'
(w1 or w2) in s
>> False
(w2 or w1) in s
>> True
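The fix is to test membership per word rather than or-ing the words themselves, for example:

```python
w1 = 'word1'
w2 = 'word2'
s = 'bla word2'

# test each word against the string; (w1 or w2) would collapse to just w1
print(w1 in s or w2 in s)             # True
print(any(w in s for w in (w1, w2)))  # True
```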
2. Reading a csv file: I recommend using the csv package (just import it), something like:
import csv
with open('PATH_TO_FILE.csv') as f:
    for line in csv.reader(f):
        pass  # do your logic here
3. Case sensitivity: don't work hard here; you can simply lower-case each line you read, so you don't need to keep two variants of the word.
I guess the solution you are looking for should look something like:
import csv

word_to_search = 'donald'
counter = 0
with open('PATH_TO_FILE.csv', encoding='UTF-8') as f:
    for line in csv.reader(f):
        if any(word_to_search in l for l in map(str.lower, line)):
            counter += 1
Running on input:
bla,some other bla,donald rocks
make,who,great
again, donald is here, hura
will result in:
counter = 2
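The asker's keyboard theory may be real: visually identical strings can differ at the code-point level. If that is suspected, normalizing each line before comparing may help; NFKC plus casefold is one common choice, sketched here:

```python
import unicodedata

def normalize(s):
    # unify composed/compatibility forms, then fold case
    return unicodedata.normalize("NFKC", s).casefold()

# casefold also handles cases plain lower() misses
print(normalize("Straße") == normalize("STRASSE"))  # True
```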

Create a code in python to get the most frequent tag and value pair from a list

I have a .txt file with 3 columns: word position, word and tag (NN, VB, JJ, etc.).
Example of txt file:
1 i PRP
2 want VBP
3 to TO
4 go VB
I want to find the frequency of the word and tag as a pair in the list in order to find the most frequently assigned tag to a word.
Example of Results:
3 (food, NN), 2 (Brave, ADJ)
My idea is to start by opening the file from the folder, read the file line by line and split, set a counter using dictionary and print with the most common to uncommon in descending order.
My code is extremely rough (I'm almost embarrassed to post it):
file = open("/Users/Desktop/Folder1/trained.txt")
wordcount = {}
for word in file.read().split():
    from collections import Counter
    c = Counter()
    for d in dicts.values():
        c += Counter(d)
    print(c.most_common())
file.close()
Obviously, I'm getting no results. Anything will help. Thanks.
UPDATE:
So I got this code posted on here, which worked, but my results are kinda funky. Here's the code (the author removed it so I don't know who to credit):
file = open("/Users/Desktop/Folder1/trained.txt").read().split('\n')
d = {}
for i in file:
    if i[1:] in d.keys():
        d[i[1:]] += 1
    else:
        d[i[1:]] = 1
print(sorted(d.items(), key=lambda x: x[1], reverse=True))
here are my results:
[('', 15866), ('\t.\t.', 9479), ('\ti\tPRP', 7234), ('\tto\tTO', 4329), ('\tlike\tVB', 2533), ('\tabout\tIN', 2518), ('\tthe\tDT', 2389), ('\tfood\tNN', 2092), ('\ta\tDT', 2053), ('\tme\tPRP', 1870), ('\twant\tVBP', 1713), ('\twould\tMD', 1507), ('0\t.\t.', 1427), ('\teat\tVB', 1390), ('\trestaurant\tNN', 1371), ('\tuh\tUH', 1356), ('1\t.\t.', 1265), ('\ton\tIN', 1237), ("\t'd\tMD", 1221), ('\tyou\tPRP', 1145), ('\thave\tVB', 1127), ('\tis\tVBZ', 1098), ('\ttell\tVB', 1030), ('\tfor\tIN', 987), ('\tdollars\tNNS', 959), ('\tdo\tVBP', 956), ('\tgo\tVB', 931), ('2\t.\t.', 912), ('\trestaurants\tNNS', 899),
There seems to be a mix of good results with words and other results with spaces or random numbers; does anyone know a way to remove the ones that aren't real words? Also, I know \t is supposed to signify a tab, is there a way to remove that as well? You guys really helped a lot.
You need to have a separate collections.Counter for each word. This code uses defaultdict to create a dictionary of counters, without checking every word to see if it is known.
from collections import Counter, defaultdict

counts = defaultdict(Counter)
with open("/Users/Desktop/Folder1/trained.txt") as file:
    for row in file:  # read one line into `row`
        if not row.strip():
            continue  # ignore empty lines
        pos, word, tag = row.split()
        counts[word.lower()][tag] += 1
That's it, you can now check the most common tag of any word:
print(counts["food"].most_common(1))
# Prints [("NN", 3)] or whatever
If you don't mind using pandas, which is a great library for tabular data, I would do the following:
import pandas as pd

df = pd.read_csv("/Users/Desktop/Folder1/trained.txt", sep=r"\s+",
                 header=None, names=["position", "word", "tag"])
df["word_tag_counts"] = df.groupby(["word", "tag"])["position"].transform("count")
(sep=r"\s+" handles both spaces and tabs, which your results suggest the file contains.)
Then if you only want the maximum one from each group you can do:
df.groupby(["word", "tag"]).max()["word_tag_counts"]
which should give you a table with the values you want.
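If all you need is the example output from the question (a count per (word, tag) pair), a single Counter keyed on the pair is enough; the inline lines below are a made-up stand-in for trained.txt:

```python
from collections import Counter

# stands in for the contents of trained.txt
lines = """1 i PRP
2 want VBP
3 to TO
4 go VB
5 i PRP""".splitlines()

pairs = Counter()
for line in lines:
    if not line.strip():
        continue  # skip blank lines
    pos, word, tag = line.split()
    pairs[(word.lower(), tag)] += 1

print(pairs.most_common(1))  # [(('i', 'PRP'), 2)]
```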

Trying to read text file and count words within defined groups

I'm a novice Python user. I'm trying to create a program that reads a text file and searches that text for certain words that are grouped (that I predefine by reading from csv). For example, if I wanted to create my own definition for "positive" containing the words "excited", "happy", and "optimistic", the csv would contain those terms. I know the below is messy - the txt file I am reading from contains 7 occurrences of the three "positive" tester words I read from the csv, yet the results print out to be 25. I think it's returning character count, not word count. Code:
import csv
import string
import re
from collections import Counter

remove = dict.fromkeys(map(ord, '\n' + string.punctuation))

# Read the .txt file to analyze.
with open("test.txt", "r") as f:
    textanalysis = f.read()
    textresult = textanalysis.lower().translate(remove).split()

# Read the CSV list of terms.
with open("positivetest.csv", "r") as senti_file:
    reader = csv.reader(senti_file)
    positivelist = list(reader)

# Convert term list into flat chain.
from itertools import chain
newposlist = list(chain.from_iterable(positivelist))

# Convert chain list into string.
posstring = ' '.join(str(e) for e in newposlist)
posstring2 = posstring.split(' ')
posstring3 = ', '.join('"{}"'.format(word) for word in posstring2)

# Count number of words as defined in list category
def positive(str):
    counts = dict()
    for word in posstring3:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    total = sum(counts.values())
    return total

# Print result; will write to CSV eventually
print("Positive: ", positive(textresult))
I'm a beginner as well but I stumbled upon a process that might help. After you read in the file, split the text at every space, tab, and newline. In your case, I would keep all the words lowercase and include punctuation in your split call. Save this as an array and then parse it with some sort of loop to get the number of instances of each 'positive,' or other, word.
Look at this, specifically the "train" function:
https://github.com/G3Kappa/Adjustable-Markov-Chains/blob/master/markovchain.py
Also, this link, ignore the JSON stuff at the beginning, the article talks about sentiment analysis:
https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-
Same applies with this link:
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Good luck!
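The split-and-loop idea above can be sketched like this; the lexicon and text are made-up stand-ins for positivetest.csv and test.txt:

```python
# hypothetical lexicon standing in for positivetest.csv
positive_words = {"excited", "happy", "optimistic"}

# hypothetical text standing in for test.txt
text = "I am happy and excited. Really happy about this optimistic plan."

# lowercase and strip simple punctuation before splitting into words
tokens = text.lower().replace(".", " ").replace(",", " ").split()

# count tokens (not characters) that appear in the lexicon
count = sum(1 for t in tokens if t in positive_words)
print("Positive:", count)  # Positive: 4
```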
I looked at your code and passed through some of my own as a sample. I have two ideas for you, based on what I think you may want.
First assumption: you want a basic sentiment count?
Getting to 'textresult' is great. Then you did the same with the 'positive lexicon' to [positivelist], which I thought would be the perfect action? Then you converted [positivelist] to essentially a big sentence.
Would you not just:
1. Pass a 'stop_words' list through [textresult]
2. Merge the two dataframes [textresult (less stopwords) and positivelist] for common words, as in an 'inner join'
3. Then basically do your term frequency
4. It is much easier to aggregate the score then
Second assumption: you are focusing on "excited", "happy", and "optimistic", and you are trying to isolate text themes into those 3 categories?
1. Again, stop at [textresult]
2. Download the 'nrc' and/or 'syuzhet' emotional valence dictionaries; they break down emotive words into 8 emotional groups, so if you only want 3 of the 8 groups, take a subset
3. Process it like you did to get [positivelist]
4. Do another join
Sorry, this is a bit hashed up, but if I was anywhere near what you were thinking let me know and we can make contact.
Second apology: I'm also a novice Python user; I am adapting what I use in R to Python in the above (it's not subtle either :) )

Getting count of certain word in txt file in Python?

I'm trying to get the number of occurrences of a certain word in a txt file.
I've tried this, but it's not working due to "AttributeError: 'list' object has no attribute 'split'":
words = 0
for wordcount in textfile.readlines().split(":"):
    if wordcount == event.getPlayer().getName():
        words += 1
Is there any easier or less complicated way to do this?
Here's my text file:
b2:PlayerName:Location{world=CraftWorld{name=world},x=224.23016231506807,y=71.0,z=190.2291303186236,pitch=31.349741,yaw=-333.30002}
What I want is to search for "PlayerName" which is players name and if player has 5 entries (actually, if word "PlayerName" has been five times written to file) it will add +5 to words.
P.S. I'm not sure if this is good for security, because it's an multiplayer game, so it could be many nicknames starting with "PlayerName" such as "PlayerName1337" or whatever, will this cause problem?
This should work:
words = 0
for wordcount in textfile.read().split(":"):
    if wordcount == event.getPlayer().getName():
        words += 1
Here's the difference: .readlines() produces a list of lines, while .read() produces a single string that you can then split into a list.
A better approach that won't count the wrong things:
words = 0
for line in textfile.readlines():
    # I assume that the player name position is fixed
    word = line.split(':')[1]
    if word == event.getPlayer().getName():
        words += 1
And yes, there is a security concern if there are players with the same name or with : in their names. The problem with equal names is that your code doesn't know to what player a line belongs. If there is a colon in a player's name, your code will split on it as well. I urge you to assign some sort of unique, immutable identifier to every player and use a database instead of text files; it will handle all this stuff for you.
There is an even easier way if you want to count multiple names at once: use the Counter from the collections module.
from collections import Counter
# note: count the name field itself; counting the whole split() list would
# fail, since lists are unhashable
counter = Counter(line.split(':')[1] for line in textfile.readlines())
Counter behaves like a dict, so you count all the names at once, and if you need to, you can efficiently look up the count for more than one name. At the moment your script counts only one name at a time per loop. You can access the count like so:
counter[event.getPlayer().getName()]
I bet you will eventually want to count more than one name. If you do, you should avoid reading the textfile more than once.
You can find how many times a word occurs in a string with count:
words = textfile.read().count('PlayerName')
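Combining the exact-field split with a Counter on the question's line format looks like this (the sample lines are made up):

```python
from collections import Counter

# sample lines in the same b2:<name>:<location> layout as the question
lines = [
    "b2:PlayerName:Location{world=CraftWorld{name=world},x=1.0}",
    "b2:PlayerName1337:Location{world=CraftWorld{name=world},x=2.0}",
    "b2:PlayerName:Location{world=CraftWorld{name=world},x=3.0}",
]

# count the second colon-separated field of every line
names = Counter(line.split(":")[1] for line in lines)
print(names["PlayerName"])  # 2 (exact match, so PlayerName1337 is not counted)
```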
