remove characters from a python string - python

I have several python strings from which I want unwanted characters removed.
Examples:
"This is '-' a test"
should be "This is a test"
"This is a test L)[_U_O-Y OH : l’J1.l'}/"
should be "This is a test"
"> FOO < BAR"
should be "FOO BAR"
"I<<W5§!‘1“¢!°\" I"
should be ""
(because if only words are extracted then it returns I W I and none of them form words)
"l‘?£§l%nbia ;‘\\~siI.ve_rswinq m"
should be ""
"2|'J]B"
should be ""
This is what I have so far; however, it does not keep the original spaces between words.
>>> line = re.sub(r"\W+","","This is '-' a test")
>>> line
'Thisisatest'
>>> line = re.sub(r"\W+","","This is a test L)[_U_O-Y OH : l’J1.l'}/")
>>> line
'ThisisatestL_U_OYOHlJ1l'
# although I would prefer this to be "This is a test"; if that is not possible, I would prefer "This is a test L_U_OYOHlJ1l"
>>> line = re.sub(r"\W+","","> FOO < BAR")
>>> line
'FOOBAR'
>>> line = re.sub(r"\W+","","I<<W5§!‘1“¢!°\" I")
>>> line
'IW51I'
>>> line = re.sub(r"\W+","","l‘?£§l%nbia ;‘\\~siI.ve_rswinq m")
>>> line
'llnbiasiIve_rswinqm'
>>> line = re.sub(r"\W+","","2|'J]B")
>>> line
'2JB'
I will be filtering the regex cleaned words through a list of predefined words later.

I'd go with a split and filter, like this:
' '.join(word for word in line.split() if word.isalpha() and word.lower() in list)
This will remove all non-alphabetic words and alphabetic words that are not in the list.
Examples:
def myfilter(string):
    words = {'this', 'test', 'i', 'a', 'foo', 'bar'}
    # keep only purely alphabetic tokens that appear in the word set
    return ' '.join(word for word in string.split() if word.isalpha() and word.lower() in words)
>>> myfilter("This is '-' a test")
'This a test'
>>> myfilter("This is a test L)[_U_O-Y OH : l’J1.l'}/")
'This a test'
>>> myfilter("> FOO < BAR")
'FOO BAR'
>>> myfilter("I<<W5§!‘1“¢!°\" I")
'I'
>>> myfilter("l‘?£§l%nbia ;‘\\~siI.ve_rswinq m")
''
>>> myfilter("2|'J]B")
''

This one removes any group of non-space characters that contains at least one non-alphabetic character. It will leave some unwanted groups of letters, though:
re.sub(r"\w*[^a-zA-Z ]+\w*","","This is a test L)[_U_O-Y OH : l’J1.l'}/")
gives:
'This is a test  OH  '
It will also leave runs of more than one space:
re.sub(r"[^a-zA-Z ]+\w*","","This is '-' a test")
'This is  a test' # two spaces
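One way to deal with the leftover runs of spaces, sketched under the assumption that the first pattern above is otherwise acceptable: split the substituted string on whitespace and rejoin it with single spaces. The helper name strip_noisy_tokens is made up for illustration.

```python
import re

def strip_noisy_tokens(text):
    # Remove every run containing a non-letter, non-space character,
    # together with any word characters attached to it.
    cleaned = re.sub(r"\w*[^a-zA-Z ]+\w*", "", text)
    # split() with no argument drops empty strings, so rejoining
    # collapses repeated spaces and trims the ends.
    return " ".join(cleaned.split())

print(strip_noisy_tokens("This is '-' a test"))   # This is a test
print(strip_noisy_tokens("> FOO < BAR"))          # FOO BAR
```

Stray letter groups such as "OH" still survive this step, so a later filter against a list of known words remains necessary.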

Related

Python split() String into a list with spaces

user_words = raw_input()
word_list = user_words.split()
user_words = []
for word in word_list:
    user_words.append(word.capitalize())
user_words = " ".join(user_words)
print(user_words)
Current Output:
Input:
hello world(two spaces in between)
Output:
Hello World
Desired Output:
Input:
hello world(two spaces in between)
Output:
Hello World(two spaces in between)
Note: I want to be able to split the string by spaces, but still have the extra spaces between words in the original string that's inputted by the user.
If you split using the space character, you'll get extra '' in your list
>>> "Hello  world".split()
['Hello', 'world']
>>> "Hello  world".split(' ')
['Hello', '', 'world']
Those generate the extra spaces again after a join
>>> ' '.join(['Hello', '', 'world'])
'Hello  world'
Use re.split with a capturing group for this, so the whitespace separator is kept in the result, and join with that separator. Note that this version assumes exactly two words:
import re
user_words = raw_input()
word_list = re.split(r"(\s+)",user_words)
user_words = []
user_words.append(word_list[0].capitalize())
user_words.append(word_list[2].capitalize())
user_words = word_list[1].join(user_words)
print(user_words)
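The same idea can be extended to any number of words. This is only a sketch; the function name capitalize_keep_spacing is an assumption, not part of the answer above. Because the capturing group keeps every whitespace run in the split result, the pieces can be capitalized one by one and glued back together with an empty separator.

```python
import re

def capitalize_keep_spacing(text):
    # The capturing group makes re.split return the whitespace runs as well,
    # e.g. ['hello', '  ', 'world'].
    parts = re.split(r"(\s+)", text)
    # capitalize() leaves pure-whitespace pieces unchanged.
    return "".join(part.capitalize() for part in parts)

print(repr(capitalize_keep_spacing("hello  world")))   # 'Hello  World'
```

Like the original loop, this uses str.capitalize(), which also lowercases the rest of each word.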

Python: Grab each word after certain character in string

I want to grab each word that has a + before it
If I input the string:
word anotherword +aspecialword lameword +heythisone +test hello
I want it to return:
aspecialword heythisone test
Have a split combined with a list comp
>>> a = 'word anotherword +aspecialword lameword +heythisone +test hello'
>>> [i[1:] for i in a.split() if i[0] == '+']
['aspecialword', 'heythisone', 'test']
try like this:
>>> my_str = "word anotherword +aspecialword lameword +heythisone +test hello"
>>> " ".join(x[1:] for x in my_str.split() if x.startswith("+"))
'aspecialword heythisone test'
str.startswith(prefix[, start[, end]])
Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.
You could use a regular expression.
>>> import re
>>> re.findall(r'(?<=\+)\S+', "word anotherword +aspecialword lameword +heythisone +test hello")
['aspecialword', 'heythisone', 'test']
r'(?<=\+)\S+' matches any sequence of non-space characters that are preceded by a plus sign.
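One difference between the approaches is worth keeping in mind: the lookbehind pattern also fires on a plus sign in the middle of a token, while the split-based answers only look at the first character of each word. The sketch below uses a slightly modified input (the token a+b is added for illustration) and a stricter variant that requires whitespace or the start of the string before the plus sign.

```python
import re

text = "word a+b +aspecialword lameword +heythisone +test hello"

# Lookbehind only: also picks up the 'b' in "a+b".
anywhere = re.findall(r"(?<=\+)\S+", text)

# (?<!\S) means "not preceded by a non-space character", so the '+'
# must start a word; the group captures the word without the '+'.
word_initial = re.findall(r"(?<!\S)\+(\S+)", text)

print(anywhere)       # ['b', 'aspecialword', 'heythisone', 'test']
print(word_initial)   # ['aspecialword', 'heythisone', 'test']
```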

Python how to strip a string from a string based on items in a list

I have a list as shown below:
exclude = ["please", "hi", "team"]
I have a string as follows:
text = "Hi team, please help me out."
I want my string to look as:
text = ", help me out."
effectively stripping out any word that might appear in the list exclude
I tried the below:
if any(e in text.lower() for e in exclude):
    print text.lower().strip(e)
But the above if statement returns a boolean value and hence I get the below error:
NameError: name 'e' is not defined
How do I get this done?
Something like this?
>>> from string import punctuation
>>> ' '.join(x for x in (word.strip(punctuation) for word in text.split())
...          if x.lower() not in exclude)
'help me out'
If you want to keep the trailing/leading punctuation with the words that are not present in exclude:
>>> ' '.join(word for word in text.split()
...          if word.strip(punctuation).lower() not in exclude)
'help me out.'
The first one is equivalent to:
>>> out = []
>>> for word in text.split():
...     word = word.strip(punctuation)
...     if word.lower() not in exclude:
...         out.append(word)
...
>>> ' '.join(out)
'help me out'
You can use this (remember that it is case sensitive):
for word in exclude:
    text = text.replace(word, "")
This replaces everything that is not alphanumeric, or that matches an entry in the stopwords list, with a space, and then splits the result into the words you want to keep. Finally, the list is joined back into a single space-separated string. Note: case sensitive.
' '.join(re.sub(r'\W|' + '|'.join(stopwords), ' ', sentence).split())
Example usage:
>>> import re
>>> stopwords=['please','hi','team']
>>> sentence='hi team, please help me out.'
>>> ' '.join(re.sub(r'\W|' + '|'.join(stopwords), ' ', sentence).split())
'help me out'
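Be aware that the alternation above also matches stopwords inside longer words, so "this" would lose its "hi". A possible variant, sketched here with a made-up helper name remove_stopwords, wraps the alternation in \b word boundaries, escapes each stopword with re.escape, and ignores case:

```python
import re

def remove_stopwords(sentence, stopwords):
    # \b...\b restricts each stopword to whole words; re.escape guards
    # against stopwords that contain regex metacharacters.
    alternation = "|".join(re.escape(word) for word in stopwords)
    stopword_re = re.compile(r"\b(?:%s)\b" % alternation, re.IGNORECASE)
    without_stopwords = stopword_re.sub(" ", sentence)
    # Turn the remaining punctuation into spaces, then collapse whitespace.
    return " ".join(re.sub(r"\W+", " ", without_stopwords).split())

print(remove_stopwords("Hi team, please help me out.", ["please", "hi", "team"]))
# help me out
```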
Using simple methods:
import re
exclude = ["please", "hi", "team"]
text = "Hi team, please help me out."
l = []
# \w+ returns only non-empty words, so no empty-string check is needed
te = re.findall(r"\w+", text)
for a in te:
    if a.upper() not in (name.upper() for name in exclude):
        l.append(a)
print " ".join(l)
Hope it helps
if you are not worried about punctuation:
>>> import re
>>> text = "Hi team, please help me out."
>>> text = re.findall("\w+",text)
>>> text
['Hi', 'team', 'please', 'help', 'me', 'out']
>>> " ".join(x for x in text if x.lower() not in exclude)
'help me out'
In the above code, re.findall finds all words and puts them in a list.
\w matches A-Za-z0-9 and the underscore
+ means one or more occurrences

Python and Line Breaks

With Python I know that the "\n" breaks to the next line in a string, but what I am trying to do is replace every "," in a string with a '\n'. Is that possible? I am kind of new to Python.
Try this:
text = 'a, b, c'
text = text.replace(',', '\n')
print text
For lists:
text = ['a', 'b', 'c']
text = '\n'.join(text)
print text
>>> str = 'Hello, world'
>>> str = str.replace(',','\n')
>>> print str
Hello
world
>>> str_list=str.split('\n')
>>> print str_list
['Hello', ' world']
For further operations you may check: http://docs.python.org/library/stdtypes.html
You can insert a literal \n into your string by escaping the backslash, e.g.
>>> print '\n'; # prints an empty line
>>> print '\\n'; # prints \n
\n
The same principle is used in regular expressions. Use this expression to replace every , in a string with \n:
>>> re.sub(",", "\\n", "flurb, durb, hurr")
'flurb\n durb\n hurr'
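Note that a plain replacement keeps the space that followed each comma, so every line after the first starts with a blank. If that is unwanted, one option (a sketch; the name commas_to_lines is made up here) is to split on the comma, strip each piece, and join with newlines:

```python
def commas_to_lines(text):
    # Split on commas, trim surrounding whitespace from each piece,
    # then put one piece per line.
    return "\n".join(part.strip() for part in text.split(","))

print(commas_to_lines("flurb, durb, hurr"))
```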

Search list: match only exact word/string

How can I match an exact string/word while searching a list? I have tried, but the result is not correct. Below I have given the sample list, my code and the test results.
list = ['Hi, hello', 'hi mr 12345', 'welcome sir']
my code:
for str in list:
    if s in str:
        print str
test results:
s = "hello" ~ expected output: 'Hi, hello' ~ output I get: 'Hi, hello'
s = "123" ~ expected output: *nothing* ~ output I get: 'hi mr 12345'
s = "12345" ~ expected output: 'hi mr 12345' ~ output I get: 'hi mr 12345'
s = "come" ~ expected output: *nothing* ~ output I get: 'welcome sir'
s = "welcome" ~ expected output: 'welcome sir' ~ output I get: 'welcome sir'
s = "welcome sir" ~ expected output: 'welcome sir' ~ output I get: 'welcome sir'
My list contains more than 200K strings
It looks like you need to perform this search more than once, so I would recommend converting your list into a dictionary:
>>> l = ['Hi, hello', 'hi mr 12345', 'welcome sir']
>>> d = dict()
>>> for item in l:
...     for word in item.split():
...         d.setdefault(word, list()).append(item)
...
So now you can easily do:
>>> d.get('hi')
['hi mr 12345']
>>> d.get('come') # nothing
>>> d.get('welcome')
['welcome sir']
P.S. You will probably have to improve item.split() to handle commas, periods and other separators, maybe with a regex and \w.
P.P.S. As cularion mentioned, this won't match "welcome sir". If you want to match the whole string, it takes just one additional line in the proposed solution. But if you have to match part of a string bounded by spaces and punctuation, a regex should be your choice.
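A rough sketch of the regex-based tokenizing suggested in the P.S., assuming that case-insensitive lookup is wanted (the original dictionary is case sensitive). The names build_word_index and strings are illustrative only.

```python
import re
from collections import defaultdict

def build_word_index(strings):
    # Map each lowercased word to the list of strings that contain it.
    index = defaultdict(list)
    for item in strings:
        # \w+ skips commas and other punctuation; set() avoids listing
        # the same string twice when a word is repeated in it.
        for word in set(re.findall(r"\w+", item.lower())):
            index[word].append(item)
    return index

strings = ['Hi, hello', 'hi mr 12345', 'welcome sir']
index = build_word_index(strings)
print(index.get('hi'))     # ['Hi, hello', 'hi mr 12345']
print(index.get('come'))   # None
```

Building the index costs one pass over the list; each later lookup is a single dictionary access, which matters for a list of 200K strings.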
>>> l = ['Hi, hello', 'hi mr 12345', 'welcome sir']
>>> search = lambda word: filter(lambda x: word in x.split(),l)
>>> search('123')
[]
>>> search('12345')
['hi mr 12345']
>>> search('hello')
['Hi, hello']
if you search for exact match:
for str in list:
    if set(s.split()) & set(str.split()):
        print str
Provided s only ever consists of just a few words, you could do
s = s.split()
n = len(s)
for x in my_list:
    words = x.split()
    if s in (words[i:i+n] for i in range(len(words) - n + 1)):
        print x
If s consists of many words, there are more efficient, but also much more complex, algorithms for this.
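Use a regular expression here to match the exact word with the word boundary \b:
import re
.....
for str in list:
    # both \b must be in raw strings; a plain '\b' is a backspace character
    if re.search(r'\b' + wordToLook + r'\b', str):
        print str
\b matches only at a word boundary, that is, between a word character (\w) and a non-word character such as a space or line break, or at the start or end of the string.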
or do something like this to avoid typing the word for searching again and again.
import re
list = ['Hi, hello', 'hi mr 12345', 'welcome sir']
listOfWords = ['hello', 'Mr', '123']
reg = re.compile(r'(?i)\b(?:%s)\b' % '|'.join(listOfWords))
for str in list:
    if reg.search(str):
        print str
(?i) makes the search case-insensitive; remove it if you want a case-sensitive search.
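The word-boundary approach can be checked against the test cases from the question. The following is a sketch only: the helper name whole_word_matches is made up, re.escape is added so that the search text is treated literally, and it assumes the search text starts and ends with a word character (otherwise \b will not behave as intended).

```python
import re

def whole_word_matches(strings, phrase):
    # re.escape treats the phrase literally; \b on both sides rejects
    # matches that are only part of a longer word or number.
    pattern = re.compile(r"\b%s\b" % re.escape(phrase), re.IGNORECASE)
    return [s for s in strings if pattern.search(s)]

strings = ['Hi, hello', 'hi mr 12345', 'welcome sir']
print(whole_word_matches(strings, "123"))           # []
print(whole_word_matches(strings, "12345"))         # ['hi mr 12345']
print(whole_word_matches(strings, "welcome sir"))   # ['welcome sir']
```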
