NLTK FreqDist incomplete dictionary - python

I have a problem with the following script: I am not able to get the full list of items for each line. What I get is something like FreqDist({'be#v': 3, 'have#v': 2, 'get#v': 2, 'publicly#r': 1, 'communicate#v': 1, 'goal#n': 1, 'end#n': 1, 'delight#v': 1, 'prescription#n': 1, 'fertilize#v': 1, ...}), FreqDist({'be#v': 2, 'have#v': 2, 'get#v': 2, '20s#n': 1, 'like#v': 1, 'school#n': 1, 'think#v': 1, 'i#n': 1, 'go#v': 1, 'community#n': 1, ...}). Not every word with occurrence 1 is reported.
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer('\s+', gaps=True)
m = [FreqDist(tokenizer.tokenize(line)) for line in open('1_tagged_copy.txt')]
print m
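Solution: nothing is missing from the dictionaries. The repr of a FreqDist shows only the ten most common entries and then "...". To print every entry, ask for the items explicitly:
m = [FreqDist(tokenizer.tokenize(line)).items() for line in open('1_tagged_copy.txt')]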
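A minimal sketch of the same idea, using collections.Counter as a stand-in so that it runs without NLTK. The assumption here is that FreqDist subclasses Counter (as it does in NLTK 3), so most_common() and dict() should behave the same way on a FreqDist. The sample line is hypothetical, not taken from 1_tagged_copy.txt.

```python
from collections import Counter

# Hypothetical tagged line, standing in for one line of 1_tagged_copy.txt
line = "be#v have#v be#v get#v goal#n end#n be#v have#v get#v delight#v school#n think#v go#v"

# Counter stands in for FreqDist here (assumption: FreqDist subclasses Counter)
fd = Counter(line.split())

# Every entry, most frequent first; without an argument nothing is cut off
all_entries = fd.most_common()
print(all_entries)

# A plain dict is another way to see every entry
print(dict(fd))
```

Either form gives you the complete data; only the default printout of the FreqDist object is shortened.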

Related

How to loop through dictionary to get both frequency of words and symbols?

I have set up a function that counts how many times each word appears in a text file, but the count is wrong for a couple of words because the function does not separate words from symbols, as in "happy,".
I have already tried using split on every "," and every ".", but that does not work. I am also not allowed to import anything into the function, because the professor does not want us to.
The code below reads the text file into a dictionary, with the word or symbol as the key and the frequency as the value.
def getTokensFreq(file):
    dict = {}
    with open(file, 'r') as text:
        wholetext = text.read().split()
        for word in wholetext:
            if word in dict:
                dict[word] += 1
            else:
                dict[word] = 1
    return dict
We are using the text file named "f". This is what is inside the file:
I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.
The desired result is this, where both words and symbols are counted:
{'i': 5, 'felt': 1, 'happy': 4, 'because': 2, 'saw': 1,
'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1,
'feel': 1, ',': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, '.': 1}
This is what I am getting, where a word followed by a symbol is counted as a separate word:
{'I': 5, 'felt': 1, 'happy': 2, 'because': 2, 'saw': 1, 'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1, 'feel': 1, 'happy,': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, 'happy.': 1}
This is how to generate your desired frequency dictionary for one sentence. To do it for the whole file, just call this code for each line to update the content of your dictionary.
# init vars
f = "I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy."
d = {}
# count punctuation chars
d['.'] = f.count('.')
d[','] = f.count(',')
# remove . and ,
for word in f.replace(',', '').replace('.','').split(' '):
    if word not in d.keys():
        d[word] = 1
    else:
        d[word] += 1
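A possible import-free variant that handles a whole file is sketched below. Instead of deleting the symbols, it pads each one with spaces so that split() isolates it as its own token. The function names, the set of symbols, and the lowercasing (suggested by the 'i': 5 entry in the desired output) are assumptions, not part of the original answer.

```python
def count_tokens(text, symbols=",.;:!?"):
    """Count words and punctuation symbols in a string without any imports."""
    freq = {}
    # Pad every symbol with spaces so that split() treats it as a separate token
    for s in symbols:
        text = text.replace(s, ' ' + s + ' ')
    # Lowercase so that 'I' and 'i' are counted together (assumed from the desired output)
    for token in text.lower().split():
        freq[token] = freq.get(token, 0) + 1
    return freq


def get_tokens_freq(path):
    """Read the whole file at path and count its tokens."""
    with open(path, 'r') as text:
        return count_tokens(text.read())


sentence = ("I felt happy because I saw the others were happy and because "
            "I knew I should feel happy, but I was not really happy.")
print(count_tokens(sentence))
```

Using freq.get(token, 0) + 1 replaces the if/else membership test with a single line and behaves the same way.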
Alternatively, you can use a mix of regular expressions and list comprehensions, like the following:
import re
# filter words and symbols
words = re.sub(r'[^A-Za-z0-9\s]+', '', f).split(' ')
symbols = re.sub(r'[A-Za-z0-9\s]+', ' ', f).strip().split(' ')
# count occurrences
count_words = dict(zip(set(words), [words.count(w) for w in set(words)]))
count_symbols = dict(zip(set(symbols), [symbols.count(s) for s in set(symbols)]))
# parse results in dict
d = count_symbols.copy()
d.update(count_words)
Output:
{',': 1,
'.': 1,
'I': 5,
'and': 1,
'because': 2,
'but': 1,
'feel': 1,
'felt': 1,
'happy': 4,
'knew': 1,
'not': 1,
'others': 1,
'really': 1,
'saw': 1,
'should': 1,
'the': 1,
'was': 1,
'were': 1}
Running the previous 2 approaches 1000 times in a loop and capturing the run times shows that the second approach is faster than the first.
My solution is to first replace all symbols with a space and then split on whitespace. We will need a little help from regular expressions.
import re
a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'
b = re.sub('[^A-Za-z0-9]+', ' ', a)
print(b)
wholetext = b.split()  # split() with no argument drops the empty string left by the trailing space
print(wholetext)
My solution is similar to Verse's, but it also makes an array of the symbols in the sentence. Afterwards, you can use the for loop and the dictionary to determine the counts.
import re
a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'
b = re.sub(r'[^A-Za-z0-9\s]+', ' ', a)
print(b)
wholetext = b.split()  # split() with no argument skips the empty strings left by repeated spaces
print(wholetext)
c = re.sub(r'[A-Za-z0-9\s]+', ' ', a)
symbols = c.strip().split(' ')
print(symbols)
# do the for loop stuff you did in your question but with wholetext and symbols
Oh, I missed that you couldn't import anything :(

Reading a text document containing python list into a python program

I have a text file(dummy.txt) which reads as below:
['abc',1,1,3,3,0,0]
['sdf',3,2,5,1,3,1]
['xyz',0,3,4,1,1,1]
I expect this to be in lists in python as below:
article1 = ['abc',1,1,3,3,0,0]
article2 = ['sdf',3,2,5,1,3,1]
article3 = ['xyz',0,3,4,1,1,1]
As many articles have to be created as there are lines in dummy.txt.
I tried the following:
I opened the file, split it by '\n' and appended the pieces to an empty list in Python. The result had extra quotes and square brackets, so I tried 'ast.literal_eval', which did not work either.
my_list = []
fvt = open("dummy.txt","r")
for line in fvt.read():
    my_list.append(line.split('\n'))
    my_list = ast.literal_eval(my_list)
I also tried to manually remove additional quotes and extra square brackets using replace, that did not help me either. Any leads much appreciated.
This should help.
import ast
myLists = []
with open(filename) as infile:
    for line in infile: #Iterate Each line
        myLists.append(ast.literal_eval(line)) #Convert to python object and append.
print(myLists)
Output:
[['abc', 1, 1, 3, 3, 0, 0], ['sdf', 3, 2, 5, 1, 3, 1], ['xyz', 0, 3, 4, 1, 1, 1]]
fvt.read() returns the entire file as one string, so iterating over it makes line a single-character string, which cannot work. You also call literal_eval(..) on the entire list of strings, not on a single string.
You can obtain the result by iterating over the file handle and calling literal_eval(..) on each line:
from ast import literal_eval
with open("dummy.txt","r") as f:
    my_list = [literal_eval(line) for line in f]
or by using map:
from ast import literal_eval
with open("dummy.txt","r") as f:
    my_list = list(map(literal_eval, f))
We then obtain:
>>> my_list
[['abc', 1, 1, 3, 3, 0, 0], ['sdf', 3, 2, 5, 1, 3, 1], ['xyz', 0, 3, 4, 1, 1, 1]]
ast.literal_eval is the right approach. Note that creating a variable number of variables like article1, article2, ... is not a good idea. Use a dictionary instead if your names are meaningful, a list otherwise.
As Willem mentioned in his answer fvt.read() will give you the whole file as one string. It is much easier to exploit the fact that files are iterable line-by-line. Keep the for loop, but get rid of the call to read.
Additionally,
my_list = ast.literal_eval(my_list)
is problematic because a) you evaluate the wrong data structure - you want to evaluate the line, not the list my_list to which you append and b) because you reassign the name my_list, at this point the old my_list is gone.
Consider the following demo. (Replace fake_file with the actual file you are opening.)
>>> from io import StringIO
>>> from ast import literal_eval
>>>
>>> fake_file = StringIO('''['abc',1,1,3,3,0,0]
... ['sdf',3,2,5,1,3,1]
... ['xyz',0,3,4,1,1,1]''')
>>> result = [literal_eval(line) for line in fake_file]
>>> result
[['abc', 1, 1, 3, 3, 0, 0], ['sdf', 3, 2, 5, 1, 3, 1], ['xyz', 0, 3, 4, 1, 1, 1]]
Of course, you could also use a dictionary to hold the evaluated lines:
>>> _ = fake_file.seek(0)  # rewind: the list comprehension above consumed the file
>>> result = {'article{}'.format(i):literal_eval(line) for i, line in enumerate(fake_file, 1)}
>>> result
{'article2': ['sdf', 3, 2, 5, 1, 3, 1], 'article1': ['abc', 1, 1, 3, 3, 0, 0], 'article3': ['xyz', 0, 3, 4, 1, 1, 1]}
where now you can issue
>>> result['article2']
['sdf', 3, 2, 5, 1, 3, 1]
... but as these names are not very meaningful, I'd just go for the list instead which you can index with 0, 1, 2, ...
When I do this:
import ast
x = '[ "A", 1]'
x = ast.literal_eval(x)
print(x)
I get:
["A", 1]
So, your code should iterate over the file itself (iterating over fvt.read() yields single characters) and evaluate each line:
for line in fvt:
    my_list.append(ast.literal_eval(line))
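Try this split (no imports needed). It only works for flat lists of simple strings and non-negative integers like the ones shown, so ast.literal_eval remains the more general choice:
with open('dummy.txt','r') as f:
    rows = [i.strip()[1:-1].split(',') for i in f if i.strip()]  # drop the newline and the outer brackets
l = [[int(x) if x.isdigit() else x.strip("'") for x in row] for row in rows]  # digits become ints, quotes are removed
Now:
print(l)
Is:
[['abc', 1, 1, 3, 3, 0, 0], ['sdf', 3, 2, 5, 1, 3, 1], ['xyz', 0, 3, 4, 1, 1, 1]]
As expected!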

How to remove part of string from list

I have a file with data and I want to count how many times each MAC address occurs:
file.txt:
D8:6C:E9:3C:77:FF;2016/01/10 14:02:47
D8:6C:E9:3C:77:FF;2016/01/10 14:02:47
D8:6C:E9:43:52:BF;2016/01/10 13:41:29
F0:82:61:31:6B:88;2016/01/10 13:43:41
8C:10:D4:D4:83:E5;2016/01/10 13:44:35
54:64:D9:E8:64:36;2016/01/10 13:46:13
18:1E:78:5A:CD:25;2016/01/10 13:46:27
18:1E:78:5A:D7:A5;2016/01/10 13:46:35
54:64:D9:75:1B:4B;2016/01/10 13:30:28
54:64:D9:75:1B:4B;2016/01/10 13:30:28
etc....
I read it into a list:
with open ('file.txt') as f:
    mac = f.read().splitlines()
my_dic = {i:mac.count(i) for i in mac}
print my_dic
output:
{'18:1E:78:5A:D7:A5;2016/01/10 13:46:35': 1, 'D8:6C:E9:3C:77:FF;2016/01/10 14:02:47': 2, '54:64:D9:E8:64:36;2016/01/10 13:46:13': 1, 'D8:6C:E9:43:52:BF;2016/01/10 13:41:29': 1, 'F0:82:61:31:6B:88;2016/01/10 13:43:41': 1, '54:64:D9:75:1B:4B;2016/01/10 13:30:28': 2, '18:1E:78:5A:CD:25;2016/01/10 13:46:27': 1, '8C:10:D4:D4:83:E5;2016/01/10 13:44:35': 1}
How do I get rid of the dates? I expected:
{'18:1E:78:5A:D7:A5': 1, 'D8:6C:E9:3C:77:FF': 2, '54:64:D9:E8:64:36': 1, 'D8:6C:E9:43:52:BF': 1, 'F0:82:61:31:6B:88': 1, '54:64:D9:75:1B:4B': 2, '18:1E:78:5A:CD:25': 1, '8C:10:D4:D4:83:E5': 1}
Write a regexp that matches this date format, and use re.sub() to remove the matching part.
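One way this could look is sketched below. The pattern assumes every line ends with ';YYYY/MM/DD HH:MM:SS' exactly as in the sample data; the names DATE_SUFFIX, count_macs and sample are made up for this illustration, and a list of strings stands in for the lines of file.txt.

```python
import re
from collections import Counter

# Matches the ';YYYY/MM/DD HH:MM:SS' suffix shown in the sample data (assumed format)
DATE_SUFFIX = re.compile(r';\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\s*$')


def count_macs(lines):
    """Strip the date suffix from each non-empty line and count the MAC addresses."""
    macs = (DATE_SUFFIX.sub('', line.strip()) for line in lines if line.strip())
    return dict(Counter(macs))


# A few lines copied from the question, standing in for the contents of file.txt
sample = [
    'D8:6C:E9:3C:77:FF;2016/01/10 14:02:47',
    'D8:6C:E9:3C:77:FF;2016/01/10 14:02:47',
    'D8:6C:E9:43:52:BF;2016/01/10 13:41:29',
    '54:64:D9:75:1B:4B;2016/01/10 13:30:28',
    '54:64:D9:75:1B:4B;2016/01/10 13:30:28',
]
print(count_macs(sample))
```

If the separator is always ';', line.split(';')[0] would give the same MAC address without a regexp. Counter also avoids calling mac.count(i) once per line, which rescans the whole list each time.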

Trying to print vertically in Python

I am trying to create an image that needs to be printed vertically.
From the for loop, I can print an image fine by starting a new line for each value; however, I want the image rotated counter-clockwise by 90 degrees (is this a transpose?).
I tried to use from itertools import zip_longest but it gives:
TypeError: zip_longest argument #1 must support iteration
class Reservoir:
    def __init__(self,landscape):
        self.landscape = landscape
        self.image = ''
        for dam in landscape:
            self.image += '#'*dam + '\n'
        print(self.image)
landscape = [4, 3, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0,
             1, 1, 2, 5, 6, 5, 2, 2, 2, 3, 3, 3, 4, 5, 3, 2, 2]
lake = Reservoir(landscape)
print(lake)
I don't know of a function or a library that does this for you, but you can code the rotation by hand.
You don't want to display a real image here, but to print characters that represent a landscape. The "image" has to be printed line by line, but since your landscape array gives the number of '#' you want in each column, you have to loop over the total number of lines and, for each character in that line, print a ' ' or a '#' depending on the corresponding landscape column value.
With
h = max(landscape)
you calculate the total number of lines to print by finding the max of the landscape values.
Then you loop over these lines, from top to bottom:
for line in reversed(range(h)):
In that loop, line takes the values 5, 4, 3, ..., 0 (h is 6 here).
For each line, you loop over the whole landscape array to decide, for each column, whether to print a space or a '#', depending on the value of the landscape column (v) and the current line:
    for v in self.landscape:
        self.image += ' ' if line >= v else '#'
The full program:
class Reservoir:
    def __init__(self, landscape):
        self.landscape = landscape
        h = max(landscape)
        self.image = ''
        for line in reversed(range(h)):
            for v in self.landscape:
                self.image += ' ' if line >= v else '#'
            self.image += '\n'
landscape = [4, 3, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 6, 5, 2, 2, 2, 3, 3, 3, 4, 5, 3, 2, 2]
lake = Reservoir(landscape)
print(lake.image)
The result:
                 #
                ###       #
#               ###      ##
##              ###   ######
####           ###############
######       #################
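For comparison, here is a sketch of how the itertools.zip_longest attempt from the question could be made to work. The TypeError appears when zip_longest receives integers (for example zip_longest(*landscape)); it needs iterables, such as the '#' strings built for each column. The function name render_vertical is made up for this example.

```python
from itertools import zip_longest


def render_vertical(landscape):
    """Return the landscape as text, one column of '#' per value."""
    # One horizontal bar per column, e.g. 3 -> '###'
    bars = ['#' * height for height in landscape]
    # zip_longest transposes the bars: row i holds the i-th character of every bar,
    # with a space wherever a bar is shorter than the tallest one
    rows = [''.join(chars) for chars in zip_longest(*bars, fillvalue=' ')]
    # Row 0 is ground level, so reverse to print the top of the landscape first
    return '\n'.join(row.rstrip() for row in reversed(rows))


landscape = [4, 3, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0,
             1, 1, 2, 5, 6, 5, 2, 2, 2, 3, 3, 3, 4, 5, 3, 2, 2]
print(render_vertical(landscape))
```

This is the transpose the question asks about: each bar is a row of the original image, and zip_longest turns rows into columns.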

Why does python output to a file like this?

Trying to have a user input their name, copy that variable to a file, and then read it back. However, when read back, it only says [][]
My code looks like this (currently)
Name = raw_input("What is your Name? ")
print "you entered ", Name
fo = open("foo.txt", "r+")
fo.write (Name)
str = fo.read();
print "Read String is : ", str
fo.close()
When I look at the foo.txt file, it has all of this inside:
Mathew” ÿÿÿÿ _getresponse:16: thread woke up: response: ('OK', {'maybesave': 1, 'format': 1, 'runit': 1, 'remove_selection': 1, 'str': 1, '_file_line_helper': 1, '_asktabwidth': 1, '_filename_to_unicode': 1, 'open_stack_viewer': 1, 'get_region': 1, 'cut': 1, 'open_module': 1, 'showerror': 1, 'class': 1, 'smart_indent_event': 1, 'set_status_bar': 1, 'about_dialog': 1, 'indent_region_event': 1, 'load_extension': 1, 'set_region': 1, '_close': 1, 'cancel_callback': 1, 'postwindowsmenu': 1, 'subclasshook': 1, 'newline_and_indent_event': 1, 'toggle_debugger': 1, 'saved_change_hook': 1, 'eof_callback': 1, 'get_warning_stream': 1, 'get_standard_extension_names': 1, 'guess_indent': 1, 'ResetFont': 1, 'center_insert_event': 1, 'replace_event': 1, 'unload_extensions': 1, 'del_word_right': 1, 'close_debugger': 1, 'EditorWindow_extra_help_callback': 1, 'python_docs': 1, 'fill_menus': 1, 'flush': 1, 'close': 1, 'setattr': 1, 'set_notabs_indentwidth': 1, 'help_dialog': 1, 'set_saved': 1, 'get_selection_indices': 1, 'open_debugger': 1, 'tabify_region_event': 1, 'comment_region_event': 1, 'get_var_obj': 1, 'find_selection_event': 1, '_rmcolorizer': 1, 'goto_line_event': 1, 'load_standard_extensions': 1, 'reset_undo': 1, 'long_title': 1, 'paste': 1, 'close2': 1, 'reset_help_menu_entries': 1, 'set_indentation_params': 1, 'open_class_browser': 1, 'endexecuting': 1, 'delattr': 1, '_addcolorizer': 1, 'repr': 1, 'close_hook': 1, 'home_callback': 1, 'right_menu_event': 1, 'getlineno': 1, 'apply_bindings': 1, 'restart_shell': 1, '_make_blanks': 1, 'get_geometry': 1, 'ApplyKeybindings': 1, 'get_tabwidth': 1, 'ResetColorizer': 1, 'open_path_browser': 1, 'filename_change_hook': 1, '_build_char_in_string_func': 1, 'isatty': 1, 'find_event': 1, 'untabify_region_event': 1, 'reduce': 1, 'find_in_files_event': 1, 'new_callback': 1, 'getvar': 1, 'copy': 1, 'center': 1, 'writelines': 1, 'recall': 1, 'load_extensions': 1, 'showprompt': 1, 'close_event': 1, 'reindent_to': 1, 'askinteger': 1, 'hash': 
1, 'RemoveKeybindings': 1, 'dedent_region_event': 1, 'linefeed_callback': 1, 'is_char_in_string': 1, 'getattribute': 1, 'move_at_edge_if_selection': 1, 'beginexecuting': 1, 'enter_callback': 1, 'short_title': 1, 'getwindowlines': 1, 'smart_backspace_event': 1, 'sizeof': 1, 'set_tabwidth': 1, 'find_again_event': 1, 'init': 1, 'del_word_left': 1, 'get_saved': 1, 'reduce_ex': 1, 'new': 1, 'select_all': 1, 'gotoline': 1, 'view_restart_mark': 1, 'change_indentwidth_event': 1, 'write': 1, 'set_debugger_indicator': 1, 'config_dialog': 1, 'set_warning_stream': 1, 'setvar': 1, 'createmenubar': 1, 'begin': 1, 'toggle_tabs_event': 1, 'askyesno': 1, 'ispythonsource': 1, 'resetoutput': 1, 'set_close_hook': 1, 'goto_file_line': 1, 'readline': 1, 'toggle_jit_stack_viewer': 1, 'make_rmenu': 1, 'EditorWindow_recent_file_callback': 1, 'uncomment_region_event': 1, 'update_recent_files_list': 1, 'set_line_and_column': 1}) ã èã”po” èã”po”
Any idea why?
First, you've opened the file in mode "r+", which is read-write. This does not empty the file, and anything you write overwrites the existing bytes. That is almost certainly not what you want: use 'a' if you want to append to the file, or 'w' if you want to empty the file first if it already exists.
Second, you're reading from where the write left off, without repositioning the file cursor. In fact it's slightly worse than that: the behavior of file objects isn't well defined if you don't seek between reads and writes.
From C reference for fopen
For the modes where both read and writing (or appending) are allowed
(those which include a "+" sign), the stream should be flushed
(fflush) or repositioned (fseek, fsetpos, rewind) between either a
reading operation followed by a writing operation or a writing
operation followed by a reading operation.
The Python 2 reference makes it clear that open() is implemented using standard C file objects.
Here's what I would write:
with open('foo.txt', 'w') as f:
    f.write(name)
with open('foo.txt', 'r') as f:
    print 'Text is:', f.read()
The with statement is nice here as it automatically closes the file once the write is done. By closing the file and reopening it in read mode, you guarantee that the written text made it into the file and isn't being cached.
As for why you get nothing back, that's probably because you have to seek to the beginning first:
fo.seek(0)
result = fo.read()
There is a pointer which marks the "current" position in a file. When you open a file, it is set at the beginning of the file. The next thing you do is write to it, and as you write, the pointer keeps advancing. When you have finished writing, the pointer is at the end of what you wrote. If you start reading at that point (which is what you are doing here), you'll get nothing but junk. So you need to reset the pointer to the beginning before you start reading, either with seek as shown above, or by closing the file after writing and opening it again before reading.
Name = raw_input("What is your Name? ")
print "you entered ", Name
fo = open("foo.txt", "r+")
fo.write (Name)
fo.flush()
fo.close()
fo = open("foo.txt", "r+")
str = fo.read();
print "Read String is : ", str
fo.close()
It is also a good idea to call flush() after writing to the file.
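The points made in the answers can be combined in a short Python 3 sketch (the code in the question is Python 2). It opens the file in 'w+' mode, which empties the file and allows both writing and reading, and calls seek(0) between the write and the read. The helper name write_then_read and the temporary directory are assumptions for illustration only.

```python
import os
import tempfile


def write_then_read(path, text):
    """Write text to path, then read it back through the same file object."""
    # 'w+' truncates the file first and allows both writing and reading
    with open(path, 'w+') as f:
        f.write(text)
        # Move the file position back to the start; without this, read() starts
        # where the write ended and finds nothing left to read
        f.seek(0)
        return f.read()


# A throwaway location so the example does not touch an existing foo.txt
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
print('Read String is :', write_then_read(path, 'Mathew'))
```

Because the with block closes the file on exit, no explicit close() or flush() call is needed here.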
