Check Values of A Dictionary for Repeating Numbers - python

I am trying to take a text file, take all the words longer than three letters, and print them in a column. I then want to match them with the line numbers that they appear on, in a second column, e.g.
Chicken 8,7
Beef 9,4,1
....
The problem is I don't want to have duplicates. Right now I have the word kings which appears in a line twice, and I only want it to print once. I am thoroughly stumped and am in need of the assistance of a wise individual.
My Code:
storyFile=open('StoryTime.txt', 'r')
def indexMaker(inputFile):
    ''
    # Will scan in each word at a time and either place in index as a key or
    # add to value.
    index = {}
    lineImOn = 0
    for line in inputFile:
        individualWord = line[:-1].split(' ')
        lineImOn+=1
        placeInList=0
        for word in individualWord:
            index.get(individualWord[placeInList])
            if( len(word) > 3): #Makes sure all words are longer then 3 letters
                if(not individualWord[placeInList] in index):
                    index[individualWord[placeInList]] = [lineImOn]
                elif(not index.get(individualWord[placeInList]) == str(lineImOn)):
                    type(index.get(individualWord[placeInList]))
                    index[individualWord[placeInList]].append(lineImOn)
            placeInList+=1
    return(index)
print(indexMaker(storyFile))
Also if anyone knows anything about making columns you would be a huge help and my new best friend.

I would do this using a dictionary of sets to keep track of the line numbers. Actually, to simplify things a bit, I'd use a collections.defaultdict with values of type set. As mentioned in another answer, it's probably best to parse out the words using a regular expression via the re module.
from collections import defaultdict
import re
# Only process words at least a minimum number of letters long.
MIN_WORD_LEN = 3
WORD_RE = re.compile('[a-zA-Z]{%s,}' % MIN_WORD_LEN)
def make_index(input_file):
    index = defaultdict(set)
    for line_num, line in enumerate(input_file, start=1):
        for word in re.findall(WORD_RE, line.lower()):
            index[word].add(line_num)  # Make sure line number is in word's set.
    # Convert result into a regular dictionary of sorted tuples (sets are unordered).
    return {word: tuple(sorted(line_nums)) for word, line_nums in index.iteritems()}
Alternative not using the re module:
from collections import defaultdict
import string
# Only process words at least a minimum number of letters long.
MIN_WORD_LEN = 3
def find_words(line, min_word_len=MIN_WORD_LEN):
    # Remove punctuation and all whitespace characters other than spaces.
    line = line.translate(None, string.punctuation + '\t\r\n')
    return (word for word in line.split(' ') if len(word) >= min_word_len)
def make_index(input_file):
    index = defaultdict(set)
    for line_num, line in enumerate(input_file, start=1):
        for word in find_words(line.lower()):
            index[word].add(line_num)  # Ensure line number is in word's set.
    # Convert result into a regular dictionary of sorted tuples (sets are unordered).
    return {word: tuple(sorted(line_nums)) for word, line_nums in index.iteritems()}
Either way, the make_index() function could be used and the results output in two columns like this:
with open('StoryTime.txt', 'rt') as story_file:
    index = make_index(story_file)
longest_word = max((len(word) for word in index))
for word, line_nums in sorted(index.iteritems()):
    print '{:<{}} {}'.format(word, longest_word, line_nums)
As a test case I used the following passage (notice the word "die" is in the last line twice):
Now the serpent was more subtle than any beast of the field which
the LORD God had made. And he said unto the woman, Yea, hath God said,
Ye shall not eat of every tree of the garden? And the woman said
unto the serpent, We may eat of the fruit of the trees of the garden:
But of the fruit of the tree which is in the midst of the garden,
God hath said, Ye shall not eat of it, neither shall ye touch it, lest
ye die, or we all die.
And get the following results:
all     (7,)
and     (2, 3)
any     (1,)
beast   (1,)
but     (5,)
die     (7,)
eat     (3, 4, 6)
every   (3,)
field   (1,)
fruit   (4, 5)
garden  (3, 4, 5)
god     (2, 6)
had     (2,)
hath    (2, 6)
lest    (6,)
lord    (2,)
made    (2,)
may     (4,)
midst   (5,)
more    (1,)
neither (6,)
not     (3, 6)
now     (1,)
said    (2, 3, 6)
serpent (1, 4)
shall   (3, 6)
subtle  (1,)
than    (1,)
the     (1, 2, 3, 4, 5)
touch   (6,)
tree    (3, 5)
trees   (4,)
unto    (2, 4)
was     (1,)
which   (1, 5)
woman   (2, 3)
yea     (2,)
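The code in this answer is written for Python 2 (iteritems, the print statement). For anyone on Python 3, here is a minimal sketch of the same set-per-word idea. The names WORD_RE, format_index and the two-line sample are my own assumptions for illustration, and the minimum length is set to four letters to match "longer than three letters" from the question.

```python
import re
from collections import defaultdict

# Assumption: "longer than three letters" means at least four letters.
WORD_RE = re.compile(r'[a-zA-Z]{4,}')

def make_index(lines):
    """Map each word to the sorted list of line numbers it appears on."""
    index = defaultdict(set)
    for line_num, line in enumerate(lines, start=1):
        for word in WORD_RE.findall(line.lower()):
            index[word].add(line_num)  # A set ignores repeats on the same line.
    return {word: sorted(nums) for word, nums in index.items()}

def format_index(index):
    """Return one 'word  n,n,n' string per word, padded into two columns."""
    width = max(len(word) for word in index)
    return ['{:<{}} {}'.format(word, width, ','.join(map(str, nums)))
            for word, nums in sorted(index.items())]

# Hypothetical sample input; "kings" appears twice on line 1.
sample = ['kings and kings again', 'the kings return again']
print('\n'.join(format_index(make_index(sample))))
```

Because each value is a set while the index is being built, a word that occurs twice on one line still contributes that line number only once.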

First of all, I would use a regex to find words. To remove repeated line numbers, simply make a set() from the list (or use a set from the start). "Pretty" formatting is possible with str.format() from Python 2.6+ (other options: tabulate, clint, ..., column -t).
import re
data = {}
word_re = re.compile('[a-zA-Z]{4,}')
with open('/tmp/txt', 'r') as f:
    current_line = 1
    for line in f:
        words = re.findall(word_re, line)
        for word in words:
            if word in data.keys():
                data[word].append(current_line)
            else:
                data[word] = [current_line]
        current_line += 1
for word, lines in data.iteritems():
    # set() drops repeated line numbers; sorted() restores ascending order.
    print("{: >20} {: >20}".format(word, ", ".join([str(l) for l in sorted(set(lines))])))

Related

Preprocess list of tuples by keeping track of each word's counts in a tuple

I have the following list of tuples:
words = [('This', 0), ('showed', 0), ('good', 0), ('Patency', 140),
('of', 1), ('the', 1), ('vein', 7), ('graft', 30), ('with', 0),
('absence', 2), ('.', 0), ('FINDINGS', 0), (':', 0), (2, 5)]
The variable words has the structure [(token, n_of_occurrence)].
I want to preprocess words by removing every tuple whose token is a digit, NaN, a stopword, or punctuation, and by removing duplicates while keeping track of their occurrences. I expect the following output:
[('showed', 0), ('good', 0), ('patency', 140),
 ('vein', 7), ('graft', 30), ('absence', 2), ('findings', 0)]
I tried the following but ended up with the wrong output: an empty list.
tokens = [w[0].lower() for w in [str(w) for w in set(words)] if (w[0] != 'nan' and
w[0].isdigit() != True and not
w[0].replace('.', '', 1).isdigit())]
items = [t for t in tokens if (t not in stopwords.words('english') and
t not in string.punctuation)]
here items will be my final preprocessed list of tuples.
Thanks! I tried with what was proposed as solutions and found the correct answer.
tokens = [(w.lower(), cnt) for w, cnt in [(str(w), cnt) for w, cnt in set(words)]
if (w != 'nan' and
w.isdigit() != True and not
w.replace('.', '', 1).isdigit())]
items = [(t, cnt) for t, cnt in tokens if (t not in stopwords.words('english') and
t not in string.punctuation)]
This code will output exactly what was expected in the question.
For now, I'd stick preprocessWords in a def, and work on each case one by one. I.e., write the code to test for NaN, punctuation, etc., each in a different block of code, rather than trying to do it all at once, and then fault-find each area separately. Therefore:
from nltk.corpus import stopwords
def preprocessWords(words):
    goodwords = []  # This is returned as the answer
    temp = []  # Lowercased tokens already kept; used to spot duplicates
    stop = set(stopwords.words('english'))  # Look the stopword list up once
    for word, q in words:  # word gets tested against multiple criteria; if none match, the final "else" adds it to goodwords
        word = str(word)  # Tokens such as the integer 2 become strings
        if word.replace('.', '', 1).isdigit():
            pass  # Disregard purely numerical items, including strings such as "12.5"
        elif word in ";:,.!":  # Fill out full list of denied punctuation
            pass
        elif word.lower() in stop:  # Stopword (the NLTK list is lowercase)
            pass
        elif word.lower() in temp:  # Duplicate. temp only has .lower() values, so convert to lower
            pass
        else:
            goodwords.append((word.lower(), q))  # Gets returned as the answer
            temp.append(word.lower())  # List used to spot duplicates
    return goodwords
Structure: You COULD merge all of those into one mega line of "IF NOT (TEST1) OR (TEST2) OR (TEST3)" but whilst in development, I'd keep each test separate and it gives you a chance of adding print(word,"failed due to having punctuation") and print(word,"failed due to duplicate") whilst debugging.
Duplicates: If there are duplicates, surely you want to add up the q values? You didn't say in your question - but if so, on the "if word in temp" line, you'll want to add logic to += the q value in goodwords for that word.
Punctuation-words: You'll want to define a full list of punctuation. An alternative would be to look for any word that is just one character long and then check whether that character is a-z or A-Z. But that is more complicated than you might think, given that á, é, í and lots of other possible letters from different languages would also have to be specified.
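If the counts of duplicates should be added up, as suggested above, a dictionary keyed on the lowercased token keeps that simple. This is only a sketch under my own assumptions: STOPWORDS is a tiny hypothetical stand-in for stopwords.words('english') so the example runs without NLTK, and preprocess_words is an illustrative name.

```python
import string

# Hypothetical stand-in for stopwords.words('english') (assumption for this sketch).
STOPWORDS = {'this', 'of', 'the', 'with'}

def preprocess_words(words, stop=STOPWORDS):
    """Drop numbers, 'nan', stopwords and punctuation; sum counts of duplicates."""
    totals = {}  # lowercased token -> summed count (dicts keep insertion order)
    for token, count in words:
        token = str(token).lower()
        # Skip 'nan' and anything that is an integer or a simple decimal.
        if token == 'nan' or token.replace('.', '', 1).isdigit():
            continue
        # Skip stopwords and tokens made up only of punctuation characters.
        if token in stop or all(ch in string.punctuation for ch in token):
            continue
        totals[token] = totals.get(token, 0) + count
    return list(totals.items())

words = [('This', 0), ('showed', 0), ('good', 0), ('Patency', 140),
         ('of', 1), ('the', 1), ('vein', 7), ('graft', 30), ('with', 0),
         ('absence', 2), ('.', 0), ('FINDINGS', 0), (':', 0), (2, 5)]
print(preprocess_words(words))
```

With the list from the question there are no duplicates, so the counts come through unchanged; a repeated token such as ('Vein', 1) followed by ('vein', 2) would be merged into ('vein', 3).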

How would you use a loop to print an ordered list?

How would I get this to print an ordered list so if I entered kiwi, dog, cat, it would print
cat
kiwi
dog
Here is the code I have:
input_string = input("Enter a list element separated by comma:")
lisp = input_string.split(',')
for i in lisp:
    if 'cat' == i:
        print('cat')
    elif 'kiwi' == i:
        print('kiwi')
    else:
        print(i)
Here is what it produces:
kiwi
dog
cat
[Updated the code]
I know how to use the sort method to alphabetize, but I need the list to be in a certain order with the random words (ex.dog) just added at the bottom. I am not a coder, and am not a student, I am trying to just learn. So I appreciate all help, all approaches, and your patience.
Just do:
input_string = input("Enter a list element separated by comma")
lisp = input_string.split(',')
print(sorted(lisp))
Input:
c,b,a
Output:
['a', 'b', 'c']
The built-in sorted() function returns a new sorted list. You can customize the ordering with its optional arguments:
sorted(iterable, *, key=None, reverse=False)
It sounds like you want to print your list in order except that "cat" and "kiwi" should be moved to the front. This would have worked:
lisp = ['kiwi', 'cat', 'dog']
if 'cat' in lisp:
    print('cat')
if 'kiwi' in lisp:
    print('kiwi')
for i in lisp:
    if i not in ('cat', 'kiwi'):
        print(i)
Output:
cat
kiwi
dog
In response to the comment on my first answer: set up a key function and then use sorted. More here: https://www.programiz.com/python-programming/methods/built-in/sorted
Check example 3. It defines a custom sort function and passes it in as the key argument:
# take second element for sort
def takeSecond(elem):
    return elem[1]
# random list
random = [(2, 2), (3, 4), (4, 1), (1, 3)]
# sort list with key
sortedList = sorted(random, key=takeSecond)
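Applied to the question, a key function can put the "known" words first in a fixed order and leave everything else at the bottom. A rough sketch, where PRIORITY and priority_key are names I made up for illustration:

```python
# Hypothetical preferred order; any word not listed goes after these.
PRIORITY = ['cat', 'kiwi']

def priority_key(item):
    """Known words sort first, in PRIORITY order; all others tie for last."""
    if item in PRIORITY:
        return (0, PRIORITY.index(item))
    return (1, 0)

# strip() removes the spaces left over from input such as "kiwi, dog, cat".
items = [s.strip() for s in 'kiwi, dog, cat'.split(',')]
ordered = sorted(items, key=priority_key)
for item in ordered:
    print(item)
```

Since sorted() is stable, the words that are not in PRIORITY keep the order in which they were entered.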
This snippet of code prints the values in the list sorted by the third character. This is an example of using a lambda function to do your bidding. Remember that sort is a potentially destructive function. If you want to preserve the original list, you should wrap it in copy-list, as done here.
(loop for ele in (sort (copy-list '("cat" "kiwi" "dog"))
                       #'(lambda (x y) (char-lessp (elt x 2) (elt y 2))))
      do (print ele))

how to find the longest N words from a list, using python?

I am now studying Python, and I am trying to solve the following exercise:
Assuming there is a list of words in a text file,
My goal is to print the longest N words in this list.
Where there are several important points:
The print order does not matter
Words that appear later in the file are given priority to be selected (when there are several words with the same length, i added an example for it)
assume that each row in the file contains only one single word
Is there a simple and easy solution for a short list of words, as opposed to a more complex solution for a situation where the list contains several thousand words?
I have attached an example of the starting code to a single word with a maximum length,
And an example of output for N = 4, for an explanation of my question.
Thanks for your advice,
word_list1 = open('WORDS.txt', 'r')
def find_longest_word(word_list):
    longest_word = ''
    for word in word_list:
        if len(word) > len(longest_word):
            longest_word = word
    print(longest_word)
find_longest_word(word_list1)
example(N=4):
WORDS.TXT
---------
Mother
Dad
Cat
Bicycle
House
Hat
The result will be (as I said before, print order doesn't matter):
Hat
House
Bicycle
Mother
thanks in advance!
One alternative is to use a heap to maintain the top-n elements:
import heapq
from operator import itemgetter
def top(lst, n=4):
    heap = [(0, i, '') for i in range(n)]
    heapq.heapify(heap)
    for i, word in enumerate(lst):
        item = (len(word), i, word)
        if item > heap[0]:
            heapq.heapreplace(heap, item)
    return list(map(itemgetter(2), heap))
words = ['Mother', 'Dad', 'Cat', 'Bicycle', 'House', 'Hat']
print(top(words))
Output
['Hat', 'House', 'Bicycle', 'Mother']
In the heap we keep items that correspond to length and position, so in case of ties the last one to appear gets selected.
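For comparison, the standard library's heapq.nlargest does similar bookkeeping in one call. This is a sketch rather than a drop-in replacement for the function above; longest_words is just an illustrative name, and the key ranks by (length, position) so that later words win ties.

```python
import heapq

def longest_words(words, n=4):
    """Return the n longest words; among equal lengths, later words win."""
    # enumerate() yields (position, word); rank by length, then by position.
    ranked = heapq.nlargest(n, enumerate(words),
                            key=lambda pair: (len(pair[1]), pair[0]))
    return [word for _, word in ranked]

words = ['Mother', 'Dad', 'Cat', 'Bicycle', 'House', 'Hat']
print(longest_words(words))
```

nlargest returns its results from largest to smallest, which is fine here since the print order does not matter.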
Sort the word_list by the length of the words and then by a counter variable, so that words occurring later get higher priority:
>>> from itertools import count
>>> cnt = count()
>>> n = 4
>>> sorted(word_list, key=lambda word:(len(word), next(cnt)), reverse=True)[:n]
['Bicycle', 'Mother', 'House', 'Hat']
You can use sorted with a custom tuple key and then list slicing.
from io import StringIO
x = StringIO("""Mother
Dad
Cat
Bicycle
House
Hat
Brother""")
def find_longest_word(word_list, n):
    idx, words = zip(*sorted(enumerate(word_list), key=lambda x: (-len(x[1]), -x[0]))[:n])
    return words
res = find_longest_word(map(str.strip, x.readlines()), 4)
print(*res, sep='\n')
# Brother
# Bicycle
# Mother
# House

How i can remove space between integer between brakets in python

I have a problem that I want to submit to an online judge. It wants me to print the result as x,y coordinates, like this:
print (2,3)
(2, 3) # i want to remove this space between the , and 3 to be accepted
# i want it like that
(2,3)
I did it in C++, but I want a Python solution; I bet my friends that Python can do anything. Please help.
The whole code for the problem I am working on:
Bx,By,Dx,Dy=map(int, raw_input().split())
if Bx>Dx:
    Ax=Dx
    Ay=By
    Cx=Bx
    Cy=Dy
    print (Ax,Ay),(Bx,By),(Cx,Cy),(Dx,Dy) # I want this line to print without the space after each comma, i.e. (Ax,Ay) rather than (Ax, Ay), for every pair on the line
else:
    Ax=Bx
    Ay=Dy
    Cx=Dx
    Cy=By
    print (Ax,Ay),(Dx,Dy),(Cx,Cy),(Bx,By) # this one too
you can use format:
>>> print "({},{})".format(2,3)
(2,3)
your code should be like this:
print "({},{})({},{}),({},{}),({},{})".format(Ax,Ay,Bx,By,Cx,Cy,Dx,Dy)
To do this in the general case, manipulate the string representation. I've kept this a little too simple, as the last item demonstrates:
def print_stripped(item):
    item_str = item.__repr__()
    print item_str.replace(', ', ',')
tuple1 = (2, 3)
tuple2 = (2, ('a', 3), "hello")
tuple3 = (2, "this, will, lose some spaces", False)
print_stripped(tuple1)
print_stripped(tuple2)
print_stripped(tuple3)
My space removal is a little too simple; here's the output
(2,3)
(2,('a',3),'hello')
(2,'this,will,lose some spaces',False)
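One way around the lost spaces inside strings is to build the text piece by piece instead of editing the repr afterwards. A small Python 3 sketch, with compact_repr as a made-up helper name; it only handles tuples and lists specially and falls back to repr() for everything else.

```python
def compact_repr(item):
    """Render tuples and lists with no space after commas, recursively."""
    if isinstance(item, tuple):
        inner = ','.join(compact_repr(x) for x in item)
        # Keep the trailing comma that marks a one-element tuple.
        return '(' + inner + (',' if len(item) == 1 else '') + ')'
    if isinstance(item, list):
        return '[' + ','.join(compact_repr(x) for x in item) + ']'
    # Strings, numbers, booleans, etc. keep their normal repr untouched.
    return repr(item)

print(compact_repr((2, 3)))
print(compact_repr((2, ('a', 3), "hello")))
print(compact_repr((2, "this, will, lose some spaces", False)))
```

Because the commas are only joined between elements, text inside a string element such as "this, will, lose some spaces" is left alone.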
"Strip" the tuple whitespace with a list comprehension, by joining the elements yourself:
tuple_ = (2, 3)
stripped = '(' + ','.join([str(i) for i in tuple_]) + ')'
In a function:
def strip_tuple(tuple_):
    return '(' + ','.join([str(i) for i in tuple_]) + ')'

Create tuples consisting of pairs of words

I have a string (or a list of words). I would like to create tuples of every possible word pair combination in order to pass them to a Counter for dictionary creation and frequency calculation. The frequency is calculated as follows: if a pair exists in a string (regardless of order, or of any other words between them), its frequency for that string is 1 (even if word1 has a frequency of 7 and word2 of 3, the frequency of the pair word1 and word2 is still 1).
I am using loops to create tuples of all pairs but got stuck
tweetList = ('I went to work but got delayed at other work and got stuck in a traffic and I went to drink some coffee but got no money and asked for money from work', 'We went to get our car but the car was not ready. We tried to expedite our car but were told it is not ready')
words = set(tweetList.split())
n = 10
for tweet in tweetList:
    for word1 in words:
        for word2 in words:
            pairW = [(word1, word2)]
            c1 = Counter(pairW for pairW in tweet)
c1.most_common(n)
However, the output is very bizarre:
[('k', 1)]
It seems instead of words it is iterating over letters
How can this be addressed? Converting a string into a list of words using split() ?
Another question: how to avoid creating duplicate tuples such as: (word1, word2) and (word2, word1)? Enumerate?
As an Output I expect a dictionary where key = all word pairs (see duplicate comment though), and the value = frequency of a pair in the list
Thank you!
I wonder if that's what you want:
import itertools, collections
tweets = ['I went to work but got delayed at other work and got stuck in a traffic and I went to drink some coffee but got no money and asked for money from work',
          'We went to get our car but the car was not ready. We tried to expedite our car but were told it is not ready']
words = set(word.lower() for tweet in tweets for word in tweet.split())
_pairs = list(itertools.permutations(words, 2))
# We need to clean up similar pairs: sort words in each pair and then convert
# them to tuple so we can convert whole list into set.
pairs = set(map(tuple, map(sorted, _pairs)))
c = collections.Counter()
for tweet in tweets:
    for pair in pairs:
        if pair[0] in tweet and pair[1] in tweet:
            c.update({pair: 1})
print c.most_common(10)
Result is: [(('a', 'went'), 2), (('a', 'the'), 2), (('but', 'i'), 2), (('i', 'the'), 2), (('but', 'the'), 2), (('a', 'i'), 2), (('a', 'we'), 2), (('but', 'we'), 2), (('no', 'went'), 2), (('but', 'went'), 2)]
tweet is a string, so Counter(pairW for pairW in tweet) will compute the frequency of the letters in tweet, which is probably not what you want.
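To count word pairs rather than letters, each tweet can be split into a set of words first and the pairs generated with itertools.combinations. A minimal Python 3 sketch, assuming plain whitespace splitting (so punctuation such as "ready." stays attached to the word); pair_counts is an illustrative name.

```python
from collections import Counter
from itertools import combinations

def pair_counts(tweets):
    """Count, for every unordered word pair, how many tweets contain both words."""
    counts = Counter()
    for tweet in tweets:
        # Unique lowercased words, sorted so each pair has one canonical order
        # and (word1, word2) is never counted separately from (word2, word1).
        words = sorted(set(tweet.lower().split()))
        counts.update(combinations(words, 2))
    return counts

# Hypothetical short inputs to keep the example readable.
c = pair_counts(['a b a', 'b c a'])
print(c.most_common(3))
```

Each tweet contributes at most 1 to a pair because its words are reduced to a set before the pairs are built, no matter how often the individual words repeat.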
