Python: cycle through dict

I have a problem with writing a loop over a dict. I have a dictionary: the keys are unique numbers, and the values are words. I need to create a matrix: the rows are the numbers of the sentences, and the columns are the unique numbers of the words (from the dict). Each element of the matrix should show how many times the given word occurs in the given sentence. This is my code for creating the dict. (At the beginning I had a raw text file with sentences.)
with open('sentences.txt', 'r') as file_obj:
    lines = []
    for line in file_obj:
        line_split = re.split('[^a-z]', line.lower().strip()
        j = 0
        new_line = []
        while j <= len(line_split)-1:
            if (line_split[j]):
                new_line.append(line_split[j])
            j += 1
        lines.append(new_line)
vocab = {}
k = 1
for i in range(len(lines)):
    for j in range(len(lines[i])):
        if lines[i][j] not in vocab.values():
            vocab[k] = lines[i][j]
            k += 1
vocab = {}
k = 1
for i in range(len(lines)):
for j in range(len(lines[i])):
if lines[i][j] not in vocab.values():
vocab[k]=lines[i][j]
k+=1
import numpy as np //now I am trying to create a matrix
matr = np.array(np.zeros((len(lines), len(vocab))))
m = 0
l = 0
while l < 22:
    for f in range(len(lines[l])):
        if vocab[1] == lines[l][f]: //this works only for the 1 word in dict
            matr[l][0] += 1
    l += 1
print(matr[3][0])
matr = np.array(np.zeros((len(lines), len(vocab)))) // this also works
for values in range(len(vocab)):
    for line in lines:
        a = line.count(vocab[1])
        print(a)
But when I try to write a loop that goes through the whole dict, nothing works! Could you please tell me how I can fill the whole matrix?
Thank you very much in advance!

A few careless errors: the re.split(...) line is missing a closing parenthesis, and // is not Python comment syntax (use #).
Looking at your code, I can't tell what your general algorithm is for creating just a basic word-count dictionary. So I propose this much shorter code:
import re
import sys

def get_vocabulary(filename):
    vocab_dict = {}
    with open(filename, 'r') as file_obj:
        for line in file_obj:
            for word in re.findall(r'[a-z]+', line.lower()):
                if word in vocab_dict:  # see below for an interesting alternative
                    vocab_dict[word] += 1
                else:
                    vocab_dict[word] = 1
    return vocab_dict

if len(sys.argv) > 1:
    vocab = get_vocabulary(sys.argv[1])
    for word in vocab:
        print(word, '->', str(vocab[word]))
Note I replaced your own
line_split=re.split('[^a-z]',line.lower().strip())
with the reverse
re.findall(r'[a-z]+',line.lower())
because yours can return empty elements, and mine will not. Originally I had to add a test if word: before inserting it into the dictionary, to prevent adding lots of empties. With a better check for 'word', that is not necessary anymore.
(Fun with Python: The alternative for an if..else looks like this single line:
vocab_dict[word] = 1 if word not in vocab_dict else vocab_dict[word]+1
It is slightly less efficient because vocab_dict[word] has to be retrieved twice – you can't say .. + 1 on its own. Still, it's a nice line to read.)
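Another standard idiom for the same update -- not from the answer above, just a common alternative -- is dict.get with a default:
vocab_dict[word] = vocab_dict.get(word, 0) + 1
It reads just as cleanly and avoids the explicit membership test.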
Converting the dictionary to a 'matrix' (actually a simple array suffices) can be done, with a bit of help, using
matrix = [[vocab[word], word] for word in sorted(vocab)]
for row in matrix:
    print(row)


When I import a data set into an array, the length of each element appears as 1

I'm making a program where I import a long list of words from a .txt file into one array called wordlist. I then want to sort them into categories based on the length of the words. However, for some reason, when the words are stored in the array, the length shows up as 1 for every single one.
Here is the code
wordlist = []
with open('words.txt', 'r') as words:
    for line in words:
        strplines = line.strip()
        list = strplines.split()
        wordlist.append(list)
        loading = loading + 1
        print(loading, '/ 113809 words loaded')
If I then do something like this
print(len(wordlist[15000]))
The output is 1 despite that word actually being 6 characters long.
I tried this in another program, but the only difference was that I manually inputted a few elements into the array, and it worked. That means there's probably an issue with the way I strip the lines from the .txt file.
So the wordlist is an array of arrays? If so, when you check the len of one of its elements, it returns the number of elements in that inner array, i.e. 1. But if you do something like
len(wordlist[1500][0])
you get the len of the first word stored in the array at index 1500.
It looks like you do not want to append to the array (that adds a whole list as one element), but to extend it.
And please, please, even if builtins are not reserved words, avoid using them! So call your list lst or mylist or whatever, but not list...
The code could become:
wordlist = []
loading = 0  # initialize the counter (missing from the question's snippet)
with open('words.txt', 'r') as words:
    for line in words:
        strplines = line.strip()
        lst = strplines.split()
        wordlist.extend(lst)
        loading = loading + 1
        print(loading, '/ 113809 words loaded')
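To see the difference -- a quick illustration, independent of the question's file:
a = []
a.append(['hello', 'world'])   # a == [['hello', 'world']] -- one element (a list)
b = []
b.extend(['hello', 'world'])   # b == ['hello', 'world'] -- two string elements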

Reading random lines from a file in Python that don't repeat until 4 other lines have passed

So I am trying to make a program that can help people learn new languages, but I am already stuck at the beginning. One of the requirements is to have Python print the lines in a random order. So I made this.
import random

def randomline(file):
    with open(file) as f:
        lines = f.readlines()
    print(random.choice(lines))
But now I have a problem with one of the other requirements: there have to be 4 other words in between before the same word can show up again, and I have no idea how to do that.
I have a very primitive solution for you:
import random

def randomline(file):
    with open(file) as f:
        lines = f.readlines()
    return random.choice(lines)

LastFourWords = []
file = "text_file.txt"
for i in range(0, 15):
    new_word = randomline(file)
    print(LastFourWords)
    if new_word in LastFourWords:
        print("I have skipped")
        print(new_word)
        continue
    print(new_word)
    LastFourWords.append(new_word)
    if len(LastFourWords) > 4:
        LastFourWords.pop(0)
With a small test file of words, the output looks like this (showing only a partial result):
[]
New
['New\n']
Example
['New\n', 'Example\n']
After
['New\n', 'Example\n', 'After\n']
Some
['New\n', 'Example\n', 'After\n', 'Some\n']
I have skipped
Example
['New\n', 'Example\n', 'After\n', 'Some\n']
Please
['Example\n', 'After\n', 'Some\n', 'Please\n']
I have skipped
Please
['Example\n', 'After\n', 'Some\n', 'Please\n']
Only
['After\n', 'Some\n', 'Please\n', 'Only\n']
Word
['Some\n', 'Please\n', 'Only\n', 'Word']
New
So every time a word is already present in the list, it is skipped, and the list drops its first element once it holds more than 4 elements.
You can use a queue:
import random

# create a list of empty elements against which the choice is checked
queue = 4 * ['']

def randomline(file):
    with open(file) as f:
        lines = f.readlines()
    choice = random.choice(lines)
    if choice not in queue:
        print(choice)
        # append the current word to the queue
        queue.append(choice)
        # remove the first element of the list
        queue.pop(0)
You can utilise deque from the collections library. It lets you specify a max length for your seen-words list: once the deque is at max length, appending a new item drops the oldest one, which effectively gives you a cache. So create a deque with max length 4, choose a word, and check whether it's in the deque. If it is, choose another word; if it's not, print the word and add it to the deque. You don't have to worry about managing the items in the list, as the oldest automatically drops out when you append something new.
from collections import deque
from random import choice

with open('test.dat') as words_file:
    words = words_file.readlines()

word_cache = deque(maxlen=4)
for _ in range(30):
    word = choice(words).strip()
    while word in word_cache:
        word = choice(words).strip()
    print(word)
    word_cache.append(word)
I would use linecache. It's from the standard library and allows you to select a specific line. If you know the number of lines in your file, this could work:
import linecache
import random

def random_lines(filename, repeat_after=4):
    with open(filename, "r") as f:
        n_lines = len(f.readlines())
    last_indices = []
    while True:
        index = random.randint(1, n_lines)
        if index not in last_indices:
            last_indices.append(index)
            last_indices = last_indices[-repeat_after:]
            line = linecache.getline(filename, index)
            yield line
This creates a generator which outputs a random line from your file without needing to keep all the lines in memory (which is great once you have many lines).
As for your requirement of only allowing repetition after n other lines, this takes care of it. However, it has a very small chance of getting stuck in an infinite loop.
Another approach would be to create a list with all the indices (i.e. line numbers), shuffle it, and then loop through them. This has the advantage of not risking an infinite loop, but it also means that you'll go through every other line before you see the same line again, which may not be ideal for you.
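A minimal sketch of that shuffle-based approach -- an illustration of the idea, not code from the answer above:
import random

def shuffled_lines(filename):
    with open(filename) as f:
        lines = f.readlines()
    indices = list(range(len(lines)))
    while True:
        random.shuffle(indices)
        # caveat: across reshuffles a line can reappear sooner than 4 lines apart
        for i in indices:
            yield lines[i]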

Counting Hashtags

I'm writing a function called HASHcount(name, list), which receives 2 parameters. The name one is the name of the file to be analyzed, a text file structured like this:
Date|||Time|||Username|||Follower|||Text
So, basically, my input is a list of tweets, with several rows structured as above. The list parameter is a list of hashtags I want to count in that text file. I want my function to check how many times each word of the given list occurs as a hashtag in the tweets, and to output a dictionary with each word's count, even if the word never occurs.
For instance, with the call HASHcount(December, [Peace, Love]) the program should output a dictionary built by checking how many times the words Peace and Love have been used as hashtags in the Text field of each tweet in the file called December.
Also, in the dictionary the words have to be without the hashtag symbol.
I'm stuck on making this function; I'm at this point, but I'm having some issues concerning the dictionary:
def HASHcount(name, list):
    f = open(name, "r")
    dic = {}
    l = f.readline()
    for word in list:
        dic[word] = 0
        for line in f:
            li_lis = line.split("|||")
            li_tuple = tuple(li_lis)
            if word in li_tuple[4]:
                dic[word] = dic[word] + 1
    return dic
The main issue is that you are iterating over the lines in the file for each word, rather than the reverse. Thus the first word will consume all the lines of the file, and each subsequent word will have 0 matches.
Instead, you should do something like this:
def hash_count(name, words):
    dic = {word: 0 for word in words}
    with open(name) as f:
        for line in f:
            line_text = line.split('|||')[4]
            for word in words:
                # Check if word appears as a hashtag in line_text
                # If so, increment the count for word
                pass
    return dic
There are several issues with your code, some of which have already been pointed out, while others (e.g. concerning the identification of hashtags in a tweet's text) have not. Here's a partial solution not covering the fine points of the latter issue:
def HASHcount(name, words):
    dic = dict.fromkeys(words, 0)
    with open(name, "r") as f:
        for line in f:
            for w in words:
                if '#' + w in line:
                    dic[w] += 1
    return dic
This offers several simplifications keyed on the fact that hashtags in a tweet do start with # (which you don't want in the dic) -- as a result, it's not worth splitting each line into fields, since a # cannot be present except in the text field.
However, it still has a fraction of a problem seen in other answers (except the one which just commented out this most delicate of parts!) -- it can get false positives from partial matches. When the check is just word in linetext the problem is huge -- e.g. if a word is cat, it gets counted as a hashtag even when present in perfectly ordinary text (on its own or as part of another word, e.g. vindicative). With the '#' + approach it's a bit better, but prefix matches still lead to false positives, e.g. #catalog would erroneously be counted as a hit for cat.
As some suggested, regular expressions can help with that. However, here's an alternative for the body of the for w in words loop...
for w in words:
    where = line.find('#' + w)
    if where == -1: continue
    after = line[where + len(w) + 1]
    if after in chars_acceptable_in_hashes: continue
    dic[w] += 1
The only issue remaining is to determine which characters can be part of a hashtag, i.e., the set chars_acceptable_in_hashes -- I haven't memorized Twitter's specs, so I don't know it offhand, but surely you can find out. Note that this works at the end of a line, too, because line has not been stripped, so it's known to end with a \n, which is not in the acceptable set (so a hashtag at the very end of the line is "properly terminated" too).
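For illustration, a plausible definition of that set -- an assumption about hashtag rules, not Twitter's official spec:
import string

# assumption: hashtags may contain letters, digits, and underscores
chars_acceptable_in_hashes = set(string.ascii_letters + string.digits + '_')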
I like using the collections module. This worked for me:
from collections import defaultdict

def HASHcount(file_to_open, lst):
    with open(file_to_open) as my_file:
        my_dict = defaultdict(int)
        for line in my_file:
            line = line.split('|||')
            txt = line[4].strip(" ")
            if txt in lst:
                my_dict[txt] += 1
    return my_dict

Python: counting unique instance of words across several lines

I have a text file with several observations, each observation on its own line. I would like to detect the unique occurrence of each word in a line: if the same word occurs twice or more on the same line, it is still counted once. However, I would like to count the frequency of each word across all observations, i.e. if a word occurs in two or more lines, I want the number of lines it occurred in. Here is the program I wrote; it is really slow on large files. I also remove certain words by referencing another file. Please offer suggestions on how to improve its speed. Thank you.
import re, string
from itertools import chain, tee, izip
from collections import defaultdict

def count_words(in_file="", del_file="", out_file=""):
    d_list = re.split('\n', file(del_file).read().lower())
    d_list = [x.strip(' ') for x in d_list]
    dict2 = {}
    f1 = open(in_file, 'r')
    lines = map(string.strip, map(str.lower, f1.readlines()))
    for line in lines:
        dict1 = {}
        new_list = []
        for char in line:
            new_list.append(re.sub(r'[0-9#$?*_><#\(\)&;:,.!-+%=\[\]\-\/\^]', "_", char))
        s = ''.join(new_list)
        for word in d_list:
            s = s.replace(word, "")
        for word in s.split():
            try:
                dict1[word] = 1
            except:
                dict1[word] = 1
        for word in dict1.keys():
            try:
                dict2[word] += 1
            except:
                dict2[word] = 1
    freq_list = dict2.items()
    freq_list.sort()
    f1.close()
    word_count_handle = open(out_file, 'w+')
    for word, freq in freq_list:
        print>>word_count_handle, word, freq
    word_count_handle.close()
    return dict2

dict = count_words("in_file.txt", "delete_words.txt", "out_file.txt")
You're running re.sub on each character of the line, one at a time. That's slow. Do it on the whole line:
s = re.sub(r'[0-9#$?*_><#\(\)&;:,.!-+%=\[\]\-\/\^]', "_", line)
Also, have a look at sets and the Counter class in the collections module. It may be faster to just count everything and then discard the words you don't want afterwards.
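A minimal sketch of that Counter-based route -- assuming the same in_file/del_file layout as the question, with tokenization simplified to letters only:
import re
from collections import Counter

def count_line_frequencies(in_file, del_file):
    with open(del_file) as f:
        del_words = {w.strip() for w in f if w.strip()}
    counts = Counter()
    with open(in_file) as f:
        for line in f:
            # a set counts each word at most once per line
            words = set(re.findall(r'[a-z]+', line.lower())) - del_words
            counts.update(words)
    return counts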
Without having done any performance testing, the following come to mind:
1) you're using regexes -- why? Are you just trying to get rid of certain characters?
2) you're using exceptions for flow control -- although it can be pythonic (better to ask forgiveness than permission), throwing exceptions can often be slow, as seen here:
for word in dict1.keys():
    try:
        dict2[word] += 1
    except:
        dict2[word] = 1
3) turn d_list into a set, and use python's in to test for membership, and simultaneously ...
4) avoid heavy use of the replace method on strings -- I believe you're using this to filter out the words that appear in d_list. This could be accomplished instead by avoiding replace and just filtering the words in the line (a combined sketch follows after these points), either with a list comprehension:
[word for word in words if word not in del_words]
or with a filter (not very pythonic):
filter(lambda word: word not in del_words, words)
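Putting points 2-4 together -- a hedged sketch with hypothetical names, not the asker's exact pipeline:
def count_line_words(lines, del_words):
    del_set = set(del_words)  # point 3: O(1) membership tests
    counts = {}
    for line in lines:
        # point 4: filter words instead of repeated str.replace
        words = {w for w in line.split() if w not in del_set}
        for word in words:
            # point 2: dict.get instead of try/except flow control
            counts[word] = counts.get(word, 0) + 1
    return counts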
# assumes buff already holds the whole file's contents as one string
u_words = set()
u_words_in_lns = []
wordcount = {}

# get the unique words per line
for line in buff.split('\n'):
    u_words_in_lns.append(set(line.split(' ')))
# create a set of all unique words
for line_words in u_words_in_lns:
    u_words.update(line_words)
# count, for each unique word, the number of lines that contain it
for word in u_words:
    wordcount[word] = sum(1 for line_words in u_words_in_lns if word in line_words)

Python -- trying to count the lengths of the words from a file with dictionaries

def myfunc(filename):
    filename = open('hello.txt', 'r')
    lines = filename.readlines()
    filename.close()
    lengths = {}
    for line in lines:
        for punc in ".,;'!:&?":
            line = line.replace(punc, " ")
        words = line.split()
        for word in words:
            length = len(word)
            if length not in lengths:
                lengths[length] = 0
            lengths[length] += 1
    for length, counter in lengths.items():
        print(length, counter)
    filename.close()
Use Counter (for Python < 2.7 there is an equivalent recipe).
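A minimal sketch of that approach, reusing the question's punctuation handling (the function name is just illustrative):
from collections import Counter

def word_length_counts(filename):
    with open(filename) as f:
        text = f.read()
    for punc in ".,;'!:&?":  # same punctuation the question strips
        text = text.replace(punc, " ")
    lengths = Counter(len(word) for word in text.split())
    for length, count in sorted(lengths.items()):
        print(length, count)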
You are counting the frequency of words in a single line.
for line in lines:
    for word in length.keys():
        print(wordct, length)
Here length is a dict of all distinct words and their frequencies (built with length.get(word, 0) + 1), not their lengths,
so you probably want to replace the above with
for line in lines:
    ....
# keep this at this indentation - will have a very large dict, but of all words
for word in sorted(length.keys(), key=lambda x: len(x)):
    # word, freq, length
    print(word, length[word], len(word), "\n")
I would also suggest:
- Don't bring the whole file into memory like that; file objects are iterators and well optimised for reading from files.
- Drop the wordct and so on in the main lines loop.
- Rename length to something else - perhaps words or dict_words.
Errr, maybe I misunderstood -- are you trying to count the number of distinct words in the file (in which case use len(length.keys())), or the length of each word in the file, presumably ordered by length?
The question has been more clearly defined now, so here is a replacement for the above answer.
The aim is to get a frequency of word lengths throughout the whole file.
I would not even bother going line by line, but would use something like:
fo = open(file)
content = fo.read()
fo.close()
d_freq = {}
st = -1  # index of the space before the current word
while True:
    next_space_index = content.find(" ", st + 1)
    if next_space_index == -1:
        break
    word_len = next_space_index - st - 1
    if word_len > 0:  # skip runs of consecutive spaces
        d_freq[word_len] = d_freq.get(word_len, 0) + 1
    st = next_space_index
# note: a final word with no trailing space is not counted
print d_freq
I think that will work, not enough time to try it now. HTH
