How to print each letter in a string only once - python

Hello everyone, I have a Python question.
I'm trying to print each letter in the given string only once.
How do I do this using a for loop, with the letters sorted from a to z?
Here's what I have:
import string
sentence_str = ("No punctuation should be attached to a word in your list, "
                "e.g., end. Not a correct word, but end is.")
letter_str = sentence_str
letter_str = letter_str.lower()
badchar_str = string.punctuation + string.whitespace
Alist = []
for i in badchar_str:
    letter_str = letter_str.replace(i,'')
letter_str = list(letter_str)
letter_str.sort()
for i in letter_str:
    Alist.append(i)
    print(Alist)
Answer I get:
['a']
['a', 'a']
['a', 'a', 'a']
['a', 'a', 'a', 'a']
['a', 'a', 'a', 'a', 'a']
['a', 'a', 'a', 'a', 'a', 'b']
['a', 'a', 'a', 'a', 'a', 'b', 'b']
['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c']....
I need:
['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'l', 'n', 'o', 'p', 'r', 's', 't', 'u', 'w', 'y']
no errors...

Just check that the letter is not already in your list before appending it, and print once after the loop:
for i in letter_str:
    if i not in Alist:
        Alist.append(i)
print(Alist)
Alternatively, use the set data structure that Python provides instead of a list. Sets do not allow duplicates.
aSet = set(letter_str)
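For completeness, here is a minimal sketch that combines both ideas with the for loop the question asks about. The function name unique_sorted_letters is made up for illustration, and the sketch assumes that only the ASCII letters a to z count as letters.

```python
import string

def unique_sorted_letters(sentence):
    """Return each ASCII letter in `sentence` once, sorted a to z."""
    seen = set()  # a set silently ignores duplicates
    for ch in sentence.lower():
        if ch in string.ascii_lowercase:  # skips punctuation, digits, whitespace
            seen.add(ch)
    return sorted(seen)  # sorted() returns a new list in a-to-z order

sentence_str = ("No punctuation should be attached to a word in your list, "
                "e.g., end. Not a correct word, but end is.")
print(unique_sorted_letters(sentence_str))
```

Collecting into a set and sorting once at the end avoids the repeated `in` scan over a growing list.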

Using itertools.ifilter (available in Python 2 only), which you can think of as having an implicit for-loop:
In [20]: a=[i for i in itertools.ifilter(lambda x: x.isalpha(), sentence_str.lower())]
In [21]: set(a)
Out[21]:
set(['a',
'c',
'b',
'e',
'd',
'g',
'i',
'h',
'l',
'o',
'n',
'p',
's',
'r',
'u',
't',
'w',
'y'])
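In Python 3 the built-in filter is already lazy and takes the place of ifilter. A rough equivalent, offered as a sketch rather than as the answerer's own code, could look like this (the name letters is only illustrative):

```python
sentence_str = ("No punctuation should be attached to a word in your list, "
                "e.g., end. Not a correct word, but end is.")

# filter() keeps only the characters for which str.isalpha is true;
# set() removes duplicates and sorted() orders the result a to z.
letters = sorted(set(filter(str.isalpha, sentence_str.lower())))
print(letters)
```

Note that str.isalpha also accepts non-ASCII letters, which makes no difference for this sentence.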

Malvolio correctly states that the answer should be as simple as possible. For that we use Python's set type, which takes care of uniqueness simply and efficiently.
However, his answer does not deal with removing punctuation and whitespace. Furthermore, the other answers, as well as the code in the question, do that rather inefficiently (they loop through badchar_str and call replace on the string for every character).
The best (i.e., simplest, most efficient, and most idiomatic Python) way to find all unique letters in the sentence is this:
import string
sentence_str = ("No punctuation should be attached to a word in your list, "
                "e.g., end. Not a correct word, but end is.")
bad_chars = set(string.punctuation + string.whitespace)
unique_letters = set(sentence_str.lower()) - bad_chars
If you want them to be sorted, simply replace the last line with:
unique_letters = sorted(set(sentence_str.lower()) - bad_chars)

If the order in which you print doesn't matter, you can use:
import string
sentence_str = ("No punctuation should be attached to a word in your list, "
                "e.g., end. Not a correct word, but end is.")
letter_str = sentence_str.lower()
badchar_str = string.punctuation + string.whitespace
for i in badchar_str:
    letter_str = letter_str.replace(i,'')
print(set(letter_str))
Or, if you want to print in sorted order, convert the set back to a list, call sort() on it, and then print.

First principles, Clarice. Simplicity.
list(set(sentence_str))

You can use set() to remove duplicate characters and sorted() to order them:
import string
sentence_str = "No punctuation should be attached to a word in your list, e.g., end. Not a correct word, but end is."
letter_str = sentence_str
letter_str = letter_str.lower()
badchar_str = string.punctuation + string.whitespace
for i in badchar_str:
    letter_str = letter_str.replace(i,'')
characters = list(letter_str)
print(sorted(set(characters)))

How can I reference a string (e.g. 'A') to the index of a larger list (e.g. ['A', 'B', 'C', 'D', ...])?

I have been racking my brain and scouring the internet for some hours now, please help.
Effectively I am trying to create a self-contained function (in python) for producing a caesar cipher. I have a list - 'cache' - of all letters A-Z.
def caesarcipher(text, s):
    global rawmessage  # imports a string input - the 'raw message' which is to be encrypted.
    result = ''
    cache = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
             'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
Is it possible to analyze the string input (the 'rawmessage') and map each letter to its corresponding index position in the list 'cache'? E.g. if the input was 'AAA' then the console would recognise it as [0,0,0] in the list 'cache'. Or if the input was 'ABC' then the console would recognise it as [0,1,2] in the list 'cache'.
Thank you to anyone who makes the effort to help me out here.
Use a list comprehension:
positions = [cache.index(letter) for letter in rawmessage if letter in cache]
You can do this with a list comprehension. You can also get the letters from string.ascii_uppercase instead of writing the list out by hand.
import string
print([string.ascii_uppercase.index(c) for c in "AAA"])
# [0, 0, 0]
print([string.ascii_uppercase.index(c) for c in "ABC"])
# [0, 1, 2]
result = []
for i in rawmessage:
    result.append(cache.index(i))
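Each call to index scans the alphabet from the start. A possible alternative, sketched here under the assumption that only the letters A to Z matter and that other characters should be skipped, builds the letter-to-index mapping once as a dict. The names LETTER_INDEX and letter_positions are invented for this example.

```python
import string

# Build the lookup table once: {'A': 0, 'B': 1, ..., 'Z': 25}
LETTER_INDEX = {letter: i for i, letter in enumerate(string.ascii_uppercase)}

def letter_positions(message):
    """Return the alphabet index of every A-Z letter in `message`."""
    # Characters that are not in the table (spaces, digits, ...) are skipped.
    return [LETTER_INDEX[ch] for ch in message.upper() if ch in LETTER_INDEX]

print(letter_positions("AAA"))  # [0, 0, 0]
print(letter_positions("ABC"))  # [0, 1, 2]
```

For a Caesar cipher the same table can be reused for every message, so the function does not need a global variable.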

Why can't I concatenate this list in order to switch characters?

I'm having this problem and have been up for hours trying to work it out, does anyone know what might be going wrong?
When I write
plant = 'blackberry'
temp = list(plant)
def switch(xs):
    return [xs[1]] + xs[0] + [xs[2:-1]]
and then call it in
switch(temp)
I get the error 'TypeError: can only concatenate list (not "str") to list'
I'm wondering why? I'm trying to switch the first two letters of blackberry without modifying the input in memory, so that if I was to call
temp
afterwards, then it would still be 'blackberry' rather than the switched version - does anyone know how to get rid of this error?
I'd be very grateful for any clarification on this
Look at what each item you are returning actually is:
print([xs[1]]) # ['l']
print(xs[0]) # 'b'
print([xs[2:-1]]) # [['a', 'c', 'k', 'b', 'e', 'r', 'r']]
return [xs[1]] + xs[0] + [xs[2:-1]]
# ['l'] + 'b' + [['a', 'c', 'k', 'b', 'e', 'r', 'r']]
You cannot concatenate values of different types like this; every operand must be a list. Note also that xs[2:-1] drops the last letter, so use xs[2:] to keep the whole word, i.e.:
return [xs[1]] + [xs[0]] + xs[2:]
# ['l'] + ['b'] + ['a', 'c', 'k', 'b', 'e', 'r', 'r', 'y']
xs[1] is "l"
[xs[1]] is ["l"]
xs[0] is "b"
xs[2:-1] is ['a', 'c', 'k', 'b', 'e', 'r', 'r']
[xs[2:-1]] is [['a', 'c', 'k', 'b', 'e', 'r', 'r']]
As you can see:
[xs[2:-1]] is a list
[xs[1]] is a list
xs[0] is a string
In Python, a string and a list cannot be concatenated with +. That means one cannot add a string to a list.
In general, you can use the type() function to see the type of a value. For example:
print(type(xs[0])) results in this output: <class 'str'>
xs[2:-1] is a list, whereas xs[1] and xs[0] are single characters (strings), so they cannot be concatenated directly.
Furthermore, a list of strings is turned into a single string with the join method:
String = "<anyString>".join(listarr)
The following solution will work for your case:
plant = 'blackberry'
temp = list(plant)
def switch(xs):
    swapped = "".join([xs[1], xs[0]] + xs[2:])
    print(swapped)
    return swapped
switch(temp)
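If the result should stay a list, as in the question, a minimal sketch could look like the following. It assumes the input has at least two items; slicing and + both build new lists, so the original is left untouched.

```python
def switch(xs):
    """Return a new list with the first two items swapped; `xs` is unchanged."""
    # [xs[1], xs[0]] is a new two-item list and xs[2:] is a copy of the rest.
    return [xs[1], xs[0]] + xs[2:]

plant = 'blackberry'
temp = list(plant)
swapped = switch(temp)

print(''.join(swapped))  # lbackberry
print(''.join(temp))     # blackberry
```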

Remove stop words from list using only Numpy in Python

I am working on removing stop words in Python using only NumPy. The stop words file is imported as a list. Here is what I came up with:
Method 1: I loop through the stop words list and remove each one from tw_line.
# loop through the stop words list, and remove each one from the split line list
for line in stopwords:
    if line in words:
        words.remove(line)
        continue
print (tw_line)
Result: NO stop words are removed.
0 my whole body feels itchy and like its on fire
Method 2: I loop through the words in the line and check each one against the stop words list.
# loop through the line, and check with stop words, if not in stop words, add to clean_line
clean_line=[]
tw_line.split(" ")
for line in tw_line:
    if line in stopwords:
        clean_line.append(line)
print(clean_line)
Result: All words are broken into characters
['m', 'y', 'w', 'h', 'o', 'l', 'e', 'b', 'o', 'd', 'y', 'f', 'e', 'e', 'l', 's', 'i', 'c', 'h', 'y', 'a', 'n', 'd', 'l', 'i', 'k', 'e', 'i', 's', 'o', 'n', 'f', 'i', 'r', 'e']
Any help?
Try this:
>>> str1 = "my whole body feels itchy and like its on fire"
>>> str1.split()
['my', 'whole', 'body', 'feels', 'itchy', 'and', 'like', 'its', 'on', 'fire']
>>>
And then remove words which are in stopwords. BTW, I don't see any numpy here.
You should print words, not tw_line, since words is the list you removed the stop words from:
for line in stopwords:
    if line in words:
        words.remove(line)
print (words)
Method 2 is clearly what you want to do. However, there are some things you can improve:
As Paul Panzer stated, split doesn't work in place, so you need to assign its result:
tw_list = tw_line.split(" ")
You could use a list comprehension rather than a loop (or even a generator expression if you intend to join afterward), keeping the words that are not stop words:
clean_line = [word for word in tw_list if word not in stopwords]
I saw from your code comment that stopwords is a list. You might want to make it a set for efficiency reasons (https://wiki.python.org/moin/TimeComplexity).
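Put together, the suggestions above might look like this sketch. The function name remove_stopwords and the short stop word list are made up for illustration; the real list would come from the stop words file.

```python
def remove_stopwords(line, stopwords):
    """Return the words of `line` that are not in `stopwords`."""
    stopword_set = set(stopwords)  # set lookups avoid scanning the whole list
    # split() returns a new list of words; the original string is not changed
    return [word for word in line.split() if word not in stopword_set]

stopwords = ['my', 'and', 'its', 'on']  # hypothetical sample, not a real stop list
tw_line = "my whole body feels itchy and like its on fire"

print(remove_stopwords(tw_line, stopwords))
# ['whole', 'body', 'feels', 'itchy', 'like', 'fire']
```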

How to get all substrings in a list of characters (python)

I want to iterate over a list of characters
temp = ['h', 'e', 'l', 'l', 'o', '#', 'w', 'o', 'r', 'l', 'd']
so that I can obtain two strings, "hello" and "world"
My current way to do this is:
#temp is the name of the list
#temp2 is the starting index of the first alphabetical character found
for j in range(len(temp)):
    if temp[j].isalpha() and temp[j-1] != '#':
        temp2 = j
        while temp[temp2].isalpha() and temp2 < len(temp)-1:
            temp2 += 1
        print(temp[j:temp2+1])
        j = temp2
The issue is that this prints out
['h', 'e', 'l', 'l', 'o']
['e', 'l', 'l', 'o']
['l', 'l', 'o']
['l', 'o']
['o']
etc. How can I print out only the full valid string?
Edit: I should have been more specific about what constitutes a "valid" string. A string is valid as long as all characters within it are either alphabetical or numerical. I didn't include the "isnumerical()" method within my check conditions because it isn't particularly relevant to the question.
If you want only hello and world and your words are always separated by #, you can easily do it using join and split:
>>> temp = ['h', 'e', 'l', 'l', 'o', '#', 'w', 'o', 'r', 'l', 'd']
>>> "".join(temp).split('#')
['hello', 'world']
Furthermore, if you need to print the valid strings as one string, join them with a space:
>>> t = "".join(temp).split('#')
>>> print(' '.join(t))
hello world
You can do it like this:
''.join(temp).split('#')
List has the method index which returns position of an element. You can use slicing to join the characters.
In [10]: temp = ['h', 'e', 'l', 'l', 'o', '#', 'w', 'o', 'r', 'l', 'd']
In [11]: pos = temp.index('#')
In [14]: ''.join(temp[:pos])
Out[14]: 'hello'
In [17]: ''.join(temp[pos+1:])
Out[17]: 'world'
An alternate, itertools-based solution:
>>> temp = ['h', 'e', 'l', 'l', 'o', '#', 'w', 'o', 'r', 'l', 'd']
>>> import itertools
>>> ["".join(group)
...  for is_word, group in itertools.groupby(temp, lambda c: c != '#')
...  if is_word]
['hello', 'world']
itertools.groupby is used to ... well ... group consecutive items depending on whether or not they are equal to #. The list comprehension discards the groups containing only # and joins the non-# groups.
The only advantage is that this way you don't have to build the full string just to split it afterward. That is probably only relevant if the string is really long.
If you just want letters, use isalpha(), replacing the # and any other non-letters with a space, and then split if you want a list of words:
print("".join(x if x.isalpha() else " " for x in temp).split())
If you want both words in a single string, replace the # with a space using a conditional expression and join:
print("".join(x if x.isalpha() else " " for x in temp))
hello world
To do it with a loop like your own code, iterate over the items and add each one to the output string if isalpha() is true, else add a space:
out = ""
for s in temp:
    if s.isalpha():
        out += s
    else:
        out += " "
Using a loop to get a list of words:
words = []
out = ""
for s in temp:
    if s.isalpha():
        out += s
    elif out:
        # a separator ends the current word; skip empty runs
        words.append(out)
        out = ""
if out:
    # the last word has no separator after it
    words.append(out)
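The question's edit says a valid string may contain letters or digits. One way to cover that, sketched here with an invented function name valid_strings, is to group on str.isalnum, so any other character acts as a separator:

```python
from itertools import groupby

def valid_strings(chars):
    """Join consecutive alphanumeric characters; anything else separates words."""
    return ["".join(group)
            for is_valid, group in groupby(chars, key=str.isalnum)
            if is_valid]  # drop the runs of separator characters

temp = ['h', 'e', 'l', 'l', 'o', '#', 'w', 'o', 'r', 'l', 'd']
print(valid_strings(temp))  # ['hello', 'world']
```

Because groupby emits one group per run, leading, trailing, or repeated separators do not produce empty strings.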

Is that a tag list or something else?

I am new to NLP and NLTK, and I want to find ambiguous words, meaning words with at least n different tags. I have this method, but the output is more than confusing.
Code:
def MostAmbiguousWords(words, n):
    # wordsUniqeTags holds a list of unique tags that have been observed for a given word
    wordsUniqeTags = {}
    for (w,t) in words:
        if wordsUniqeTags.has_key(w):
            wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
        else:
            wordsUniqeTags[w] = set([t])
    # Starting to count
    res = []
    for w in wordsUniqeTags:
        if len(wordsUniqeTags[w]) >= n:
            res.append((w, wordsUniqeTags[w]))
    return res
MostAmbiguousWords(brown.tagged_words(), 13)
Output:
[("what's", set(['C', 'B', 'E', 'D', 'H', 'WDT+BEZ', '-', 'N', 'T', 'W', 'V', 'Z', '+'])),
("who's", set(['C', 'B', 'E', 'WPS+BEZ', 'H', '+', '-', 'N', 'P', 'S', 'W', 'V', 'Z'])),
("that's", set(['C', 'B', 'E', 'D', 'H', '+', '-', 'N', 'DT+BEZ', 'P', 'S', 'T', 'W', 'V', 'Z'])),
('that', set(['C', 'D', 'I', 'H', '-', 'L', 'O', 'N', 'Q', 'P', 'S', 'T', 'W', 'CS']))]
Now I have no idea what B, C, Q, etc. could represent. So, my questions:
What are these?
What do they mean? (In case they are tags)
I think they are not tags, because who and whats don't have the WH tag indicating "wh question words".
I'll be happy if someone could post a link that includes a mapping of all possible tags and their meaning.
It looks like you have a typo. In this line:
wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
you should have set([t]) (not set(t)), like you do in the else case.
This explains the behavior you're seeing because t is a string and set(t) is making a set out of each character in the string. What you want is set([t]) which makes a set that has t as its element.
>>> t = 'WHQ'
>>> set(t)
set(['Q', 'H', 'W']) # bad
>>> set([t])
set(['WHQ']) # good
By the way, you can correct the problem and simplify things by just changing that line to:
wordsUniqeTags[w].add(t)
But, really, you should make use of the setdefault method on dict and list comprehension syntax to improve the method overall. So try this instead:
def most_ambiguous_words(words, n):
    # wordsUniqeTags holds the set of unique tags that have been observed for a given word
    wordsUniqeTags = {}
    for (w,t) in words:
        wordsUniqeTags.setdefault(w, set()).add(t)
    # Starting to count
    return [(word,tags) for word,tags in wordsUniqeTags.items() if len(tags) >= n]
You are splitting your POS tags into single characters in this line:
wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
set('AT') results in set(['A', 'T']).
How about making use of the Counter and defaultdict functionality in the collections module?
from collections import defaultdict, Counter
def most_ambiguous_words(words, n):
    counts = defaultdict(Counter)
    for (word,tag) in words:
        counts[word][tag] += 1
    return [(w, counts[w].keys()) for w in counts if len(counts[w]) >= n]
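To check the logic without loading the Brown corpus, here is a small self-contained sketch using defaultdict(set). The sample pairs are hand-made for illustration and are not real corpus output.

```python
from collections import defaultdict

def most_ambiguous_words(tagged_words, n):
    """Return (word, sorted tags) for words seen with at least n distinct tags."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_words:
        tags_by_word[word].add(tag)  # add() keeps the whole tag string intact
    return [(word, sorted(tags))
            for word, tags in tags_by_word.items()
            if len(tags) >= n]

# Hand-made (word, tag) pairs, assumed only for this example
sample = [('that', 'CS'), ('that', 'DT'), ('that', 'WPS'),
          ('dog', 'NN'), ('that', 'CS')]

print(most_ambiguous_words(sample, 3))  # [('that', ['CS', 'DT', 'WPS'])]
```

With real data the call would be most_ambiguous_words(brown.tagged_words(), 13), as in the question.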
