I am trying to get the counts of each word in a text file with the below code.
def count_words(file_name):
    with open(file_name, 'r') as f:
        return reduce(lambda acc, x: acc.get(x, 0) + 1, sum([line.split() for line in f], []), dict())
but I get the error
File "C:\Python27\abc.py", line 173, in count_words
with open(file_name, 'r') as f: return reduce(lambda acc, x: acc.get(x, 0) + 1, sum([line.split() for line in f], []), dict())
File "C:\Python27\abc.py", line 173, in <lambda>
with open(file_name, 'r') as f: return reduce(lambda acc, x: acc.get(x, 0) + 1, sum([line.split() for line in f], []), dict())
AttributeError: 'int' object has no attribute 'get'
I am not able to understand the error message here. Why does it complain that 'int' has no attribute 'get' even though I passed a dict as the accumulator?
You can use collections.Counter to count the words:
In [692]: t='I am trying to get the counts of each word in a text file with the below code'
In [693]: from collections import Counter
In [694]: Counter(t.split())
Out[694]: Counter({'the': 2, 'a': 1, 'code': 1, 'word': 1, 'get': 1, 'I': 1, 'of': 1, 'in': 1, 'am': 1, 'to': 1, 'below': 1, 'text': 1, 'file': 1, 'each': 1, 'trying': 1, 'with': 1, 'counts': 1})
In [695]: c=Counter(t.split())
In [696]: c['the']
Out[696]: 2
The problem is that your lambda function returns an int, not a dict.
So, even if you use a dict as the seed, when your lambda function is called the second time, acc will be the result of acc.get(x, 0) + 1 from the first call, and that is an int, not a dict.
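To see the fix in the reduce style the question used, make the reducer return the accumulator itself; the helper name below is made up for illustration, and the function takes a string rather than a filename to keep the sketch self-contained:

```python
from functools import reduce  # built in on Python 2; importable on both 2 and 3

def count_words(text):
    def bump(acc, word):
        # return the accumulator dict itself, not acc.get(...) + 1 (an int)
        acc[word] = acc.get(word, 0) + 1
        return acc
    return reduce(bump, text.split(), {})

print(count_words("the cat and the hat"))
# {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
```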
So if you are looking for a one-liner, I almost have a one-liner in the spirit of what you were trying to do with get.
>>> words = """One flew over the ocean
... One flew over the sea
... My Bonnie loves pizza
... but she doesn't love me"""
>>>
>>> f = open('foo.txt', 'w')
>>> f.writelines(words)
>>> f.close()
The "one-liner" (two-liner actually)
>>> word_count = {}
>>> with open('foo.txt', 'r') as f:
... _ = [word_count.update({word:word_count.get(word,0)+1}) for word in f.read().split()]
...
Result:
>>> word_count
{'but': 1, 'One': 2, 'the': 2, 'she': 1, 'over': 2, 'love': 1, 'loves': 1, 'ocean': 1, "doesn't": 1, 'pizza': 1, 'My': 1, 'me': 1, 'flew': 2, 'sea': 1, 'Bonnie': 1}
I imagine there's something you could do with a dict comprehension, but I couldn't see how to use get in that case.
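For what it's worth, a dict comprehension can do it without get, at the cost of one list scan per distinct word (O(n²), fine for small files); this sketch recreates the sample file from above:

```python
# recreate the sample file used in this answer
with open('foo.txt', 'w') as f:
    f.write("One flew over the ocean\n"
            "One flew over the sea\n"
            "My Bonnie loves pizza\n"
            "but she doesn't love me")

with open('foo.txt') as f:
    words = f.read().split()

# one list.count() pass per distinct word
word_count = {w: words.count(w) for w in set(words)}
print(word_count['One'])  # 2
```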
The f.read().split() gives you a nice list of words to work with, however, and should be easier than trying to get words out of a list of lines. It's a better approach unless you have a huge file.
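For a huge file, a sketch in the same spirit that streams line by line instead of calling read() (the function name is made up; collections.Counter is standard library):

```python
from collections import Counter

def count_words_streaming(file_name):
    counts = Counter()
    with open(file_name) as f:
        for line in f:              # only one line in memory at a time
            counts.update(line.split())
    return counts
```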
I have a text file and there are 3 lines of data in it:
[1, 2, 1, 1, 3, 1, 1, 2, 1, 3, 1, 1, 1, 3, 3]
[1, 1, 3, 3, 3, 1, 1, 1, 1, 2, 1, 1, 1, 3, 3]
[1, 2, 3, 1, 3, 1, 1, 3, 1, 3, 1, 1, 1, 3, 3]
I try to open it and get the data in it.
with open("rafine.txt") as f:
    l = [line.strip() for line in f.readlines()]
f.close()
Now I have a list in a list.
If I say print(l[0]) it shows me [1, 2, 1, 1, 3, 1, 1, 2, 1, 3, 1, 1, 1, 3, 3]
But I want to get the numbers in it.
So when I write print(l[0][0])
I want to see 1 but it shows me [
How can I fix this?
You can use literal_eval to parse the lines from the file & build the matrix:
from ast import literal_eval

with open("test.txt") as f:
    matrix = []
    for line in f:
        row = literal_eval(line)
        matrix.append(row)

print(matrix[0][0])
print(matrix[1][4])
print(matrix[2][8])
result:
1
3
1
import json

with open("rafine.txt") as f:
    for line in f.readlines():
        line = json.loads(line)
        print(line)
The best approach depends on what assumption you make about the data in your text file:
ast.literal_eval
If the data in your file is formatted the same way it would be inside Python source code, the best approach is to use literal_eval:
from ast import literal_eval

data = []  # will contain list of lists
with open("filename") as f:
    for line in f:
        row = literal_eval(line)
        data.append(row)
or, the short version:
with open(filename) as f:
    data = [literal_eval(line) for line in f]
re.findall
If you can make only a few assumptions about the data, using regular expressions to find all digits might be a way forward. The below builds lists by simply extracting any digits in the text file, regardless of separators or other characters in the file:
import re

data = []  # will contain list of lists
with open("filename") as f:
    for line in f:
        row = [int(i) for i in re.findall(r'\d+', line)]
        data.append(row)
or, in short:
with open(filename) as f:
    data = [[int(i) for i in re.findall(r'\d+', line)] for line in f]
handwritten parsing
If neither option is suitable, there is always the option to parse by hand, tailored to the exact format:
data = []  # will contain list of lists
with open(filename) as f:
    for line in f:
        row = [int(i) for i in line.strip()[1:-1].split(', ')]
        data.append(row)
The strip() removes the trailing newline, [1:-1] then removes the first and last characters (the brackets), and split(', ') splits the rest into a list. for i in ... iterates over the items in this list (assigning i to each item) and int(i) converts each i to an integer.
Is there a way to iterate your way through a dictionary to count the number of words with the string stored within the dictionary, and save it as a new dictionary that returns the word count for each item of that key?
For example...
#Input:
inputdict = {
    'key1': 'The brown fox is brown and a fox.',
    'key2': 'The red dog is the red and is a dog.'
}

newdict = {}
for k, v in inputdict:
    newdict(str(k) + "_" + str(v)) = count(v)
#Output:
newdict = {
    'key1_the': 1, 'key1_brown': 2, 'key1_is': 1, # ...
    'key2_the': 2, 'key2_red': 2, # ...
}
Side note:
This is kind of a follow-up to the article at https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/, where instead of reading in strings I'm trying to read in items from a dictionary.
Yes, just use the collections.Counter() class with a generator expression producing the values you want to count:
from collections import Counter
Counter(f"{k}_{word}" for k, v in inputdict.items() for word in v.split())
I've assumed a simple split on whitespace is sufficient, but if you need something more sophisticated then replace v.split() with something that produces an iterable of words to count.
Counter() is a subclass of dict, but with a few extra methods to help handle counts.
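For example, most_common is one of those extras, and missing keys yield 0 instead of a KeyError (the sample string here is arbitrary):

```python
from collections import Counter

c = Counter("the cat and the hat".split())
print(c.most_common(1))   # [('the', 2)]
print(c['missing'])       # 0 -- absent keys give 0 instead of a KeyError
```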
Demo:
>>> from collections import Counter
>>> inputdict = {
... 'key1': 'The brown fox is brown and a fox.',
... 'key2': 'The red dog is the red and is a dog.'
... }
>>> Counter(f"{k}_{word}" for k, v in inputdict.items() for word in v.split())
Counter({'key1_brown': 2, 'key2_red': 2, 'key2_is': 2, 'key1_The': 1, 'key1_fox': 1, 'key1_is': 1, 'key1_and': 1, 'key1_a': 1, 'key1_fox.': 1, 'key2_The': 1, 'key2_dog': 1, 'key2_the': 1, 'key2_and': 1, 'key2_a': 1, 'key2_dog.': 1})
Personally, I'd produce separate counts, using a nested dictionary structure:
{key: Counter(value.split()) for key, value in inputdict.items()}
and so produce:
{'key1': Counter({'brown': 2, 'The': 1, 'fox': 1, ... }),
'key2': Counter({'red': 2, 'is': 2, 'The': 1, ... })}
so you can access counts per sentence, with newdict["key1"] and newdict["key2"].
I have set up a function that finds the frequency with which words appear in a text file, but the frequency is wrong for a couple of words because the function is not separating words from symbols like "happy,".
I have already tried to use the split function to split on every "," and every ".", but that does not work. I am also not allowed to import anything into the function, as the professor does not want us to.
The code below turns the text file into a dictionary and then uses the word or symbol as the key and the frequency as the value.
def getTokensFreq(file):
    dict = {}
    with open(file, 'r') as text:
        wholetext = text.read().split()
        for word in wholetext:
            if word in dict:
                dict[word] += 1
            else:
                dict[word] = 1
    return dict
We are using a text file with the name "f". This is what is inside the file:
I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.
The desired result is this, where both words and symbols are counted:
{'i': 5, 'felt': 1, 'happy': 4, 'because': 2, 'saw': 1,
'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1,
'feel': 1, ',': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, '.': 1}
This is what I am getting, where a word plus an attached symbol is counted as a separate word:
{'I': 5, 'felt': 1, 'happy': 2, 'because': 2, 'saw': 1, 'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1, 'feel': 1, 'happy,': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, 'happy.': 1}
This is how to generate your desired frequency dictionary for one sentence. To do this for the whole file, just call this code for each line to update the contents of your dictionary.
# init vars
f = "I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy."
d = {}

# count punctuation chars
d['.'] = f.count('.')
d[','] = f.count(',')

# remove . and ,
for word in f.replace(',', '').replace('.', '').split(' '):
    if word not in d.keys():
        d[word] = 1
    else:
        d[word] += 1
Alternatively, you can use a mix of regex and list comprehensions, like the following:
import re

# filter words and symbols
words = re.sub(r'[^A-Za-z0-9\s]+', '', f).split(' ')
symbols = re.sub(r'[A-Za-z0-9\s]+', ' ', f).strip().split(' ')

# count occurrences
count_words = dict(zip(set(words), [words.count(w) for w in set(words)]))
count_symbols = dict(zip(set(symbols), [symbols.count(s) for s in set(symbols)]))

# merge the results into one dict
d = count_symbols.copy()
d.update(count_words)
Output:
{',': 1,
'.': 1,
'I': 5,
'and': 1,
'because': 2,
'but': 1,
'feel': 1,
'felt': 1,
'happy': 4,
'knew': 1,
'not': 1,
'others': 1,
'really': 1,
'saw': 1,
'should': 1,
'the': 1,
'was': 1,
'were': 1}
Running the previous two approaches 1000 times in a loop and capturing the run times shows that the second approach is faster than the first.
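As a sketch of how such a claim can be checked with timeit (the helper names and the 1000-iteration count are just for illustration; absolute numbers depend on the machine):

```python
import re
import timeit

SENT = ("I felt happy because I saw the others were happy and because "
        "I knew I should feel happy, but I was not really happy.")

def count_plain(f):
    # first approach: count punctuation, strip it, then count words
    d = {'.': f.count('.'), ',': f.count(',')}
    for word in f.replace(',', '').replace('.', '').split(' '):
        d[word] = d.get(word, 0) + 1
    return d

def count_regex(f):
    # second approach: separate words and symbols with regex, then count
    words = re.sub(r'[^A-Za-z0-9\s]+', '', f).split(' ')
    symbols = re.sub(r'[A-Za-z0-9\s]+', ' ', f).strip().split(' ')
    d = dict(zip(set(symbols), [symbols.count(s) for s in set(symbols)]))
    d.update((w, words.count(w)) for w in set(words))
    return d

print(timeit.timeit(lambda: count_plain(SENT), number=1000))
print(timeit.timeit(lambda: count_regex(SENT), number=1000))
```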
My solution first replaces all symbols with a space and then splits on spaces. We will need a little help from regular expressions.
import re
a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'
b = re.sub('[^A-Za-z0-9]+', ' ', a)
print(b)
wholetext = b.split(' ')
print(wholetext)
My solution is similar to Verse's, but it also builds a list of the symbols in the sentence. Afterwards, you can use the for loop and the dictionary to determine the counts.
import re

a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'
b = re.sub(r'[^A-Za-z0-9\s]+', ' ', a)
print(b)
wholetext = b.split(' ')
print(wholetext)

c = re.sub(r'[A-Za-z0-9\s]+', ' ', a)
symbols = c.strip().split(' ')
print(symbols)
# do the for loop stuff you did in your question but with wholetext and symbols
Oh, I missed that you couldn't import anything :(
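A no-import variant in the same spirit (the function name mirrors the question's; the padding trick is an assumption about which punctuation matters): pad each punctuation mark with spaces so split() yields it as its own token, then count with dict.get.

```python
def getTokensFreq(fname):
    counts = {}
    with open(fname) as fh:
        text = fh.read().lower()
    # pad each punctuation mark with spaces so split() treats it as a token
    for mark in ",.;:!?":
        text = text.replace(mark, " " + mark + " ")
    for token in text.split():
        counts[token] = counts.get(token, 0) + 1
    return counts
```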
I have a text file(dummy.txt) which reads as below:
['abc',1,1,3,3,0,0]
['sdf',3,2,5,1,3,1]
['xyz',0,3,4,1,1,1]
I expect this to be in lists in python as below:
article1 = ['abc',1,1,3,3,0,0]
article2 = ['sdf',3,2,5,1,3,1]
article3 = ['xyz',0,3,4,1,1,1]
As many article lists have to be created as there are lines in dummy.txt.
I was trying the following things:
I opened the file, split it by '\n' and appended the pieces to an empty list in Python; it had extra quotes and square brackets, so I tried to use ast.literal_eval, which did not work either.
my_list = []
fvt = open("dummy.txt", "r")
for line in fvt.read():
    my_list.append(line.split('\n'))
my_list = ast.literal_eval(my_list)
I also tried to manually remove additional quotes and extra square brackets using replace, that did not help me either. Any leads much appreciated.
This should help.
import ast

myLists = []
with open(filename) as infile:
    for line in infile:  # iterate over each line
        myLists.append(ast.literal_eval(line))  # convert to a Python object and append
print(myLists)
Output:
[['abc', 1, 1, 3, 3, 0, 0], ['sdf', 3, 2, 5, 1, 3, 1], ['xyz', 0, 3, 4, 1, 1, 1]]
fvt.read() will produce the entire file as one string, so line will contain a single-character string. That will not work very well; you also call literal_eval(..) on the entire list of strings, not on a single string.
You can obtain the results by iterating over the file handler, and each time call literal_eval(..) on a single line:
from ast import literal_eval

with open("dummy.txt", "r") as f:
    my_list = [literal_eval(line) for line in f]
or by using map:
from ast import literal_eval

with open("dummy.txt", "r") as f:
    my_list = list(map(literal_eval, f))
We then obtain:
>>> my_list
[['abc', 1, 1, 3, 3, 0, 0], ['sdf', 3, 2, 5, 1, 3, 1], ['xyz', 0, 3, 4, 1, 1, 1]]
ast.literal_eval is the right approach. Note that creating a variable number of variables like article1, article2, ... is not a good idea. Use a dictionary instead if your names are meaningful, a list otherwise.
As Willem mentioned in his answer fvt.read() will give you the whole file as one string. It is much easier to exploit the fact that files are iterable line-by-line. Keep the for loop, but get rid of the call to read.
Additionally,
my_list = ast.literal_eval(my_list)
is problematic because a) you evaluate the wrong data structure - you want to evaluate each line, not the list my_list to which you append - and b) you reassign the name my_list, so at this point the old my_list is gone.
Consider the following demo. (Replace fake_file with the actual file you are opening.)
>>> from io import StringIO
>>> from ast import literal_eval
>>>
>>> fake_file = StringIO('''['abc',1,1,3,3,0,0]
... ['sdf',3,2,5,1,3,1]
... ['xyz',0,3,4,1,1,1]''')
>>> result = [literal_eval(line) for line in fake_file]
>>> result
[['abc', 1, 1, 3, 3, 0, 0], ['sdf', 3, 2, 5, 1, 3, 1], ['xyz', 0, 3, 4, 1, 1, 1]]
Of course, you could also use a dictionary to hold the evaluated lines (after seeking back to the start of the demo file, which the list comprehension above has exhausted):
>>> _ = fake_file.seek(0)
>>> result = {'article{}'.format(i): literal_eval(line) for i, line in enumerate(fake_file, 1)}
>>> result
{'article2': ['sdf', 3, 2, 5, 1, 3, 1], 'article1': ['abc', 1, 1, 3, 3, 0, 0], 'article3': ['xyz', 0, 3, 4, 1, 1, 1]}
where now you can issue
>>> result['article2']
['sdf', 3, 2, 5, 1, 3, 1]
... but as these names are not very meaningful, I'd just go for the list instead which you can index with 0, 1, 2, ...
When I do this:
import ast
x = '[ "A", 1]'
x = ast.literal_eval(x)
print(x)
I get:
["A", 1]
So, your code should iterate over the file object itself, not over fvt.read() (which yields single characters):
for line in fvt:
    my_list.append(ast.literal_eval(line))
Try this split-based approach (no imports needed):
with open('dummy.txt', 'r') as f:
    l = [line.strip()[1:-1].replace("'", '').split(',') for line in f]
Now:
print(l)
Is:
[['abc', '1', '1', '3', '3', '0', '0'], ['sdf', '3', '2', '5', '1', '3', '1'], ['xyz', '0', '3', '4', '1', '1', '1']]
Note that the numbers stay strings this way; wrap each item in int() if you need real integers.
Given the problem: Find a repeated substring in a string, is it possible to use hashing? I want to create a dictionary with the substrings as keys and the number of repeated instances as values. Here is what I have so far. I am getting an error because I am using a substring as a key for the dictionary. Can anyone spot my mistake? Thank you!!!
def findsubs(str):
    d = {}
    for i in range(len(str)-1):
        for j in range(i+2, len(str)-2):
            if d[str[i:j]] > 1:
                return str[i:j]
            else:
                d[str[i:j]] = d[str[i:j]] + 1
    return 0

print findsubs("abcbc")
The general idea should work. It's just that if a key isn't found in the dictionary when you do a lookup, you get an error - so you have to check whether the key exists before doing a lookup and initialize it if it doesn't:
def findsubs(str):
    d = {}
    for i in range(len(str)-1):
        for j in range(i+2, len(str)-2):
            if str[i:j] not in d:
                d[str[i:j]] = 0
            if d[str[i:j]] > 1:
                return str[i:j]
            else:
                d[str[i:j]] = d[str[i:j]] + 1
    return 0
Note that instead of if str[i:j] not in d: d[str[i:j]] = 0, you can do d.setdefault(str[i:j], 0), which sets the value to 0 if the key isn't in the dict, and leaves it unchanged if it is.
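A tiny illustration of that setdefault behavior (the sample string is arbitrary):

```python
d = {}
for ch in "abca":
    d.setdefault(ch, 0)   # inserts 0 only when ch is not yet a key
    d[ch] += 1
print(d)                  # {'a': 2, 'b': 1, 'c': 1}
```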
A few more comments though:
You should return None, not 0, if you don't find anything.
You shouldn't call a variable str since that's a built-in function.
You want to iterate j until the end of the string.
As-written, it'll only return a substring if it's been found 3 times. Really you can re-write it using a set of previously-found substrings instead:
So:
def findsubs(s):
    found = set()
    for i in range(len(s)-1):
        for j in range(i+2, len(s)+1):
            substr = s[i:j]
            if substr in found:
                return substr
            found.add(substr)
    return None
You were almost there
def findsubs(instr):
    d = {}
    for i in range(len(instr)):
        for j in range(i+2, len(instr)+1):
            print instr[i:j]
            d[instr[i:j]] = d.get(instr[i:j], 0) + 1
    return d

instr = 'abcdbcab'
print instr
print findsubs('abcdbcab')
This will work; I added a print inside for debugging purposes, remove it after you test it.
The result is the dict with the substring counts, as you asked for :)
{'abcd': 1, 'ab': 2, 'cdb': 1, 'dbc': 1, 'cdbcab': 1, 'cd': 1, 'abc': 1, 'cdbc': 1, 'bcab': 1, 'abcdbc': 1, 'ca': 1, 'dbca': 1, 'bc': 2, 'dbcab': 1, 'db': 1, 'cab': 1, 'bcdbcab': 1, 'bcdbc': 1, 'abcdbca': 1, 'cdbca': 1, 'abcdbcab': 1, 'bcdb': 1, 'bcd': 1, 'abcdb': 1, 'bca': 1, 'bcdbca': 1}