How to count word frequencies from an input file? [duplicate]

How to count word frequencies from an input file? [duplicate] - python

This question already has answers here:
How do I count the occurrences of a list item?
(29 answers)
Closed 2 years ago.
I'm trying to have my program read a single line formed by words separated by commas. For example if we have:
hello,cat,man,hey,dog,boy,Hello,man,cat,woman,dog,Cat,hey,boy
in the input file, the program would need to separate each word on a single line and ditch the commas. After that the program would count frequencies of the words in the input file.
f = open('input1.csv') # create file object
userInput = f.read()
seperated = userInput.split(',')
for word in seperated:
freq = seperated.count(word)
print(word, freq)
The problem with this code is it prints the initial count for the same word that's counted twice. The output for this program would be:
hello 1
cat 2
man 2
hey 2
dog 2
boy 1
Hello 1
man 2
cat 2
woman 1
dog 2
Cat 1
hey 2
boy
1
The correct output would be:
hello 1
cat 2
man 2
hey 2
dog 2
boy 2
Hello 1
woman 1
Cat 1
Question is how do I make my output look more polished by having the final count instead of the initial one?

This is a common pattern and core programming skill. You should try collecting and counting words each time you encounter them, in a dictionary. I'll give you the idea, but it's best you practise the exact implementation yourself. Happy hacking!
(Also recommend the "pretty print" python built-in method)
import pprint
for word in file:
word_dict[word] += 1
pprint.pprint(word_dict)
A couple of extra tips - you may want to f.close() your file when you're finished, (E: I misread so disregard the rest...) and it looks like you want to look at converting your words to lower case so that different capitalisations aren't counted seperately. There are python built in methods to do this you can find by searching

try using a dictionary:
f = open('input1.csv') # create file object
userInput = f.read()
seperated = userInput.split(',')
wordsDict = {}
for word in seperated:
if word not in wordsDict:
wordsDict[word] = 1
else:
wordsDict[word] = int(wordsDict.get(word)) + 1
for i in wordsDict:
print i, wordsDict[i]
)

Create a new dictionary. Add the word as key and the count of that as value to it
count_dict={}
for w in seperated:
count_dict[w]=seperated.count(w)
for key,value in count_dict.items():
print(key,value)

Related

Formatted strings, decimals and commas question

I have a .txt file that I read in and wish to create formatted strings using these values. Columns 3 and 4 need decimals and the last column needs a percent sign and 2 decimal places. The formatted string will say something like "The overall attendance at Bulls was 894659, average attendance was 21,820 and the capacity was 104.30%’
the shortened .txt file has these lines:
1 Bulls 894659 21820 104.3
2 Cavaliers 843042 20562 100
3 Mavericks 825901 20143 104.9
4 Raptors 812863 19825 100.1
5 NY_Knicks 812292 19812 100
So far my code looks like this and its mostly working, minus the commas and decimal places.
file_1 = open ('basketball.txt', 'r')
count = 0
list_1 = [ ]
for line in file_1:
count += 1
textline = line.strip()
items = textline.split()
list_1.append(items)
print('Number of teams: ', count)
for line in list_1:
print ('Line: ', line)
file_1.close()
for line in list_1: #iterate over the lines of the file and print the lines with formatted strings
a, b, c, d, e = line
print (f'The overall attendance at the {b} game was {c}, average attendance was {d}, and the capacity was {e}%.')
Any help with how to format the code to show the numbers with commas (21820 ->21,828) and last column with 2 decimals and a percent sign (104.3 -> 104.30%) is greatly appreciated.

You've got some options for how to tackle this.
Option 1: Using f strings (Python 3 only)
Since your provided code already uses f strings, this solution should work for you. For others reading here, this will only work if you are using Python 3.
You can do string formatting within f strings, signified by putting a colon : after the variable name within the curly brackets {}, after which you can use all of the usual python string formatting options.
Thus, you could just change one of your lines of code to get this done. Your print line would look like:
print(f'The overall attendance at the {b} game was {int(c):,}, average attendance was {int(d):,}, and the capacity was {float(e):.2f}%.')
The variables are getting interpreted as:
The {b} just prints the string b.
The {int(c):,} and {int(d):,} print the integer versions of c and d, respectively, with commas (indicated by the :,).
The {float(e):.2f} prints the float version of e with two decimal places (indicated by the :.2f).
Option 2: Using string.format()
For others here who are looking for a Python 2 friendly solution, you can change the print line to the following:
print("The overall attendance at the {} game was {:,}, average attendance was {:,}, and the capacity was {:.2f}%.".format(b, int(c), int(d), float(e)))
Note that both options use the same formatting syntax, just the f string option has the benefit of having you write your variable name right where it will appear in the resulting printed string.

This is how I ended up doing it, very similar to the response from Bibit.
file_1 = open ('something.txt', 'r')
count = 0
list_1 = [ ]
for line in file_1:
count += 1
textline = line.strip()
items = textline.split()
items[2] = int(items[2])
items[3] = int(items[3])
items[4] = float(items[4])
list_1.append(items)
print('Number of teams/rows: ', count)
for line in list_1:
print ('Line: ', line)
file_1.close()
for line in list_1:
print ('The overall attendance at the {:s} games was {:,}, average attendance was {:,}, and the capacity was {:.2f}%.'.format(line[1], line[2], line[3], line[4]))

regex on reading a txt file in Python

I have a txt that contains data for classification purposes. The first column is the class, that is 0 or 1 and the other four columns contain the features of the class. Yet the features has numbers before them, that is 1: for feature 1, 2: for feature 2 etc. I tried to use regex in numpy split but I failed. How can I take only the columns I need? Below is the txt with the data.
1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
1 1:9.133997e+01 2:2.935699e+02 3:1.423918e-01 4:1.605402e+02
1 1:5.537500e+01 2:1.792220e+02 3:1.654953e-01 4:1.112273e+02
1 1:2.956200e+01 2:1.913570e+02 3:9.901439e-02 4:1.034076e+02
1 1:1.451200e+02 2:2.088600e+02 3:-1.760859e-01 4:1.542257e+02
1 1:3.849699e+01 2:4.146600e+01 3:-1.886419e-01 4:1.239661e+02
1 1:2.927699e+01 2:1.072510e+02 3:1.149632e-01 4:1.077885e+02
1 1:2.886700e+01 2:1.090240e+02 3:-1.239433e-01 4:9.799130e+01
1 1:2.401300e+01 2:7.602000e+01 3:2.850990e-01 4:9.891692e+01
1 1:2.837900e+01 2:1.452160e+02 3:3.870011e-01 4:1.549975e+02
1 1:2.238140e+01 2:8.242810e+01 3:-2.814865e-01 4:8.998764e+01
1 1:1.232100e+02 2:4.561600e+02 3:-1.518468e-01 4:1.432996e+02
1 1:2.008405e+01 2:1.774510e+02 3:2.578101e-01 4:9.253101e+01
1 1:3.285699e+01 2:1.826750e+02 3:2.204406e-01 4:9.457175e+01
1 1:0.000000e+00 2:1.154780e+02 3:1.504970e-01 4:1.096315e+02
1 1:3.954504e+01 2:2.374420e+02 3:1.089429e-01 4:1.376333e+02
1 1:1.067980e+02 2:3.237560e+02 3:-1.509505e-01 4:1.754021e+02
1 1:3.408200e+01 2:1.198280e+02 3:2.200156e-01 4:1.383639e+02
1 1:0.000000e+00 2:8.671080e+01 3:4.201880e-01 4:1.298851e+02
1 1:4.865997e+01 2:3.071500e+02 3:1.756066e-01 4:1.640174e+02
1 1:2.341090e+01 2:8.347140e+01 3:1.766868e-01 4:9.803250e+01
1 1:1.222390e+02 2:4.357930e+02 3:-1.812907e-01 4:1.687663e+02
1 1:1.624560e+01 2:4.830620e+01 3:5.508614e-01 4:2.632639e+01
1 1:4.389899e+01 2:2.421300e+02 3:2.006008e-01 4:1.331948e+02
1 1:6.143698e+01 2:2.338500e+02 3:2.758731e-01 4:1.612433e+02
1 1:5.952499e+01 2:2.176700e+02 3:-8.601014e-02 4:1.170831e+02
1 1:2.915850e+01 2:1.259875e+02 3:1.910455e-01 4:1.279927e+02
1 1:5.059702e+01 2:2.430620e+02 3:1.863443e-01 4:1.352273e+02
1 1:6.024097e+01 2:1.977340e+02 3:-1.319924e-01 4:1.320220e+02
1 1:2.620490e+01 2:6.270790e+01 3:-1.402450e-01 4:1.135866e+02
1 1:2.847198e+01 2:1.483760e+02 3:-1.868249e-01 4:1.672337e+02
1 1:2.707990e+01 2:7.770390e+01 3:-2.509235e-01 4:9.798032e+01
1 1:2.068600e+01 2:8.446800e+01 3:1.761782e-01 4:1.199423e+02
1 1:1.962450e+01 2:4.923090e+01 3:4.302725e-01 4:9.361318e+01
1 1:4.961401e+01 2:3.234850e+02 3:-1.963741e-01 4:1.622486e+02
1 1:7.982401e+01 2:2.017540e+02 3:-1.412161e-01 4:1.310716e+02
1 1:6.696402e+01 2:2.214030e+02 3:-1.187778e-01 4:1.416626e+02
1 1:5.842999e+01 2:1.348610e+02 3:2.876077e-01 4:1.286684e+02
1 1:6.982007e+01 2:3.693401e+02 3:-1.539849e-01 4:1.511659e+02
1 1:1.902200e+01 2:2.210120e+02 3:1.689450e-01 4:1.368066e+02
1 1:4.582898e+01 2:2.215950e+02 3:2.419124e-01 4:1.627100e+02

I do hate pandas but try these three lines:
import pandas as pd
# Use pandas read_csv; sep is interpreted as a regex
x=pd.read_csv('file.txt',sep='[: ]').to_numpy()
# Now select the required columns
output=x[:,(2,4,6,8)]
print(output)
"""
array([[ 5.707397e+01, 2.214040e+02, 8.607959e-02, 1.229114e+02],
[ 1.725900e+01, 1.734360e+02, -1.298053e-01, 1.250318e+02],
[ 2.177940e+01, 1.249531e+02, 1.538853e-01, 1.527150e+02],
[ 9.133997e+01, 2.935699e+02, 1.423918e-01, 1.605402e+02],
[ 5.537500e+01, 1.792220e+02, 1.654953e-01, 1.112273e+02],
[ 2.956200e+01, 1.913570e+02, 9.901439e-02, 1.034076e+02]])
"""
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
You can also check out How to use regex as delimiter function while reading a file into numpy array or similar
I rediscovered the solution below independently but this answer follows the same strategy via sep:
https://stackoverflow.com/a/51195469/1021819

You wish to parse a line like this:
1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
Your parser reads the input with the usual idiom of:
for line in input_file:
Start by distinguishing the class label from the features.
label, *raw_features = line.split()
label = int(label)
Now it just remains to strip the unhelpful N: prefix from each feature.
features = [float(feat.split(':')[1])
for feat in raw_features]
Sure, you could solve this with a regex.
But that doesn't sound like the simplest solution, in this case.

I was bored :-) , So thought of writing a snippet for you. See below, which is kind of a dirty text processing and loading it to a dataframe.
lines = open("predictions.txt", "r").readlines()
column_lines = [
[fline[0]] + [feat[1] for feat in sorted([tuple(feature.split(":")) for feature in fline[1:]], key=lambda f: f[0])]
for fline in [line.split(" ") for line in lines]
]
import pandas as pd
table = pd.DataFrame(column_lines, columns = ["Class", "Feature1","Feature2","Feature3","Feature4"])
Instead of this, you can also think of tranforming the file to a csv, using a similar text processing and then use them directly to create a dataframe, so you dont need to run this code everytime.!
I hope this is helpful.

If you want to use regex to only extract you columns you can use this regex expression on each line:
import re
line = '1 1:1.067980e+02 2:3.237560e+02 3:-1.509505e-01 4:1.754021e+02'
reg = re.compile(r'(-*\d+\.\d+e[+|-]\d+)')
# Your columns:
reg.findall(line)
>>> ['1.067980e+02', '3.237560e+02', '-1.509505e-01', '1.754021e+02']
# Assuming you also want numbers of those values:
list(map(float, reg.findall(line)))
>>> [106.798, 323.756, -0.1509505, 175.4021]
What is does:
(-*\d+\.\d+e[+|-]\d+) the first brackets are used to create groups. Inside the group first -* is the optional minus sign. Thereafter there is at least 1 number, but there can be more than 1 \d+. The number does have a decimal point with decimals therefore \.\d+. Then there is an exponent with either + or - e[+|-] following with number \d+.

Python exception "too many vaues to unpack" thrown when assigning a string of numbers to a dictionary

I have a function that reads a file which contains a name followed by a space, then multiple numbers, each seperated by a space. I want to parse the name into one string, and all the numbers into another, then put them in a dictionary (with the name as the key). I have written the following code:
def read_users (user_file):
try:
file_in = open(user_file)
except:
return None
user_scores = {}
for line in file_in:
temp_lst = line.strip().split(' ', 1)
user_scores = [temp_lst[0]] = temp_lst[1]
return user_scores
This seems to do everything I need, but when it puts it into a dictionary it throws the exception "Too many values to unpack". I'm confused as to why this is thrown because I think I should be passing the dictionary a string with the name as the key, and a string with a bunch of numbers as the value.
If it's important the lines in the input file are formatted as follows:
Ben 1 0 2 3 4 -2 5 5 6 6 1
I have tried printing the list before I pass it to the dictionary and it appears as follows:
['Ben', '1 0 2 3 4 -1 5 5 6 6 1']
Anyone have any ideas? Thanks!

#I think the way you construct the dictionary is not quite right. Try below code to see if it works.
user_scores[temp_lst[0]] = temp_lst[1]

I need to make a simple file compression system for my GCSE Computer Science

I'm not that experienced with code and have a question pertaining to my GCSE Computer Science controlled assessment. I have got pretty far, it's just this last hurdle is holding me up.
This task requires me to use a previously made simple file compression system, and to "Develop a program that builds [upon it] to compress a text file with several sentences, including punctation. The program should be able to compress a file into a list of words and list of positions to recreate the original file. It should also be able to take a compressed file and recreate the full text, including punctuation and capitalisation, of the original file".
So far, I have made it possible to store everything as a text file with my first program:
sentence = input("Enter a sentence: ")
sentence = sentence.split()
uniquewords = []
for word in sentence:
if word not in uniquewords:
uniquewords.append(word)
positions = [uniquewords.index(word) for word in sentence]
recreated = " ".join([uniquewords[i] for i in positions])
print (uniquewords)
print (recreated)
positions=str(positions)
uniquewords=str(uniquewords)
positionlist= open("H:\Python\ControlledAssessment3\PositionList.txt","w")
positionlist.write(positions)
positionlist.close
wordlist=open("H:\Python\ControlledAssessment3\WordList.txt","w",)
wordlist.write(uniquewords)
wordlist.close
This makes everything into lists, and converts them into a string so that it is possible to write into a text document. Now, program number 2 is where the issue lies:
uniquewords=open("H:\Python\ControlledAssessment3\WordList.txt","r")
uniquewords= uniquewords.read()
positions=open("H:\Python\ControlledAssessment3\PositionList.txt","r")
positions=positions.read()
positions= [int(i) for i in positions]
print(uniquewords)
print (positions)
recreated = " ".join([uniquewords[i] for i in positions])
FinalSentence=
open("H:\Python\ControlledAssessment3\ReconstructedSentence.txt","w")
FinalSentence.write(recreated)
FinalSentence.write('\n')
FinalSentence.close
When I try and run this code, this error appears:
Traceback (most recent call last):
File "H:\Python\Task 3 Test 1.py", line 7, in <module>
positions= [int(i) for i in positions]
File "H:\Python\Task 3 Test 1.py", line 7, in <listcomp>
positions= [int(i) for i in positions]
ValueError: invalid literal for int() with base 10: '['
So, how do you suppose I get the second program to recompile the text into the sentence? Thanks, and I'm sorry if this was a lengthy post, I've spent forever trying to get this working.
I'm assuming this is something to do with the list that has been converted into a string including brackets, commas, and spaces etc. so is there a way to revert both strings back into their original state so I can recreate the sentence? Thanks.

So firstly, it is a big strange to save positions as a literal string; you should save each element (same with uniquewords). With this in mind, something like:
program1.py:
sentence = input("Type sentence: ")
# this is a test this is a test this is a hello goodbye yes 1 2 3 123
sentence = sentence.split()
uniquewords = []
for word in sentence:
if word not in uniquewords:
uniquewords.append(word)
positions = [uniquewords.index(word) for word in sentence]
with open("PositionList.txt","w") as f:
for i in positions:
f.write(str(i)+' ')
with open("WordList.txt","w") as f:
for i in uniquewords:
f.write(str(i)+' ')
program2.py:
with open("PositionList.txt","r") as f:
data = f.read().split(' ')
positions = [int(i) for i in data if i!='']
with open("WordList.txt","r") as f:
uniquewords = f.read().split(' ')
sentence = " ".join([uniquewords[i] for i in positions])
print(sentence)
PositionList.txt
0 1 2 3 0 1 2 3 0 1 2 4 5 6 7 8 9 10
WordList.txt
this is a test hello goodbye yes 1 2 3 123

Python: Search for the specific strings in a text file consisting of multiple columns

I have a text file which is tab-delimited (every column is responsible for a certain component.):
1 (who?) 2 (age) 3 (Does what?) 4 )(where lives?)
A Dog 4 Walks Lives in a box
B Parrot 2 Eats Lives in a cage
C Cat 3 Sleeps Lives in a house
User has to choose between two option of finding all the information about the animal. User can either type his guess into the first column or the fourth:
1=Cat*
OR
4=Live in a box**
And if there is a match it prints the whole line.
in the first case, if user decides to use first column(who?) to find information about the animal in would show:
Cat 3 Sleeps Lives in a house
in the second case, if user decides to use fourth column(where lives?) to find information about the animal in would show:
Dog 4 Walks Lives in a box
What I need is to find a proper function that will search for an animal. I tried using search() and match() but to no avail.
**EDIT after codesparkle's advice**
I can't seem to integrate the advised function into my program. I am particularly confused with line[column-1] bit. I tried to "play" with this function but I could not come up with anything useful. So here is the real piece of code that I need complete.
while True:
try:
OpenFile=raw_input(str("Please enter a file name: "))
infile=open(OpenFile,"r")
contents=infile.readlines()
infile.close()
user_input = raw_input(str("Enter A=<animal> for animal search or B=<where lives?> for place of living search: \n"))
if user_input.startswith("A="):
def find_animal(user_input,column):
return next(("\t".join(line) for line in contents
if line[column-1]==user_input),None)
find_animal(user_input[1:])
print str((find_animal(user_input[1:], "WHO?"))) # I messed up the first time and gave numeric names to the columns. In reality they should have been words that indicate the proper name of the column. In this case it is WHO- the first column which signifies the animal.
else:
print "Unknown option!"
except IOError:
print "File with this name does not exist!"

Assuming your input looks like this (the example you posted contained spaces, not tabs):
Dog 4 Walks Lives in a box
Parrot 2 Eats Lives in a cage
Cat 3 Sleeps Lives in a house
You can take the following approach:
Open the file and read the lines, splitting them by tab and removing the newline characters.
Create a function that returns the first line whose requested column contains the search term. In this example, None is returned if nothing is found.
with open('input.txt') as input_file:
lines = [line.rstrip().split('\t')
for line in input_file.readlines()]
def find_animal(search_term, column):
return next(('\t'.join(line) for line in lines
if line[column-1] == search_term), None)
print(find_animal('Cat', 1))
print(find_animal('Lives in a box', 4))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to count word frequencies from an input file? [duplicate] - python

Create a new dictionary. Add the word as key and the count of that as value to it count_dict={} for w in seperated: count_dict[w]=seperated.count(w) for key,value in count_dict.items(): print(key,value)

Related

Formatted strings, decimals and commas question

regex on reading a txt file in Python

Python exception "too many vaues to unpack" thrown when assigning a string of numbers to a dictionary

I need to make a simple file compression system for my GCSE Computer Science

Python: Search for the specific strings in a text file consisting of multiple columns

Categories

Resources