How to find unknown character from a list of strings? - python

How would I delete an unknown character from a list of strings?
For example, my list is ['hi', 'h#w', 'are!', 'you;', '25'] and I want to delete all the characters that are not letters or digits.
How would I do this?

Regex:
import re
s = ['hi', 'h#w', 'are!', 'you;', '25']
[re.sub(r'[^A-Za-z0-9 ]+', '', x) for x in s]
['hi', 'hw', 'are', 'you', '25']

Use re.sub:
from re import sub
lst = ['hi', 'h#w', 'are!', 'you;', '25']
lst = [sub(r'[^\w]', '', i) for i in lst]
print(lst)
Output:
['hi', 'hw', 'are', 'you', '25']
Explanation:
The call sub(r'[^\w]', '', i) tells Python to replace every substring of i that matches the pattern [^\w] with an empty string and return the result.
The pattern [^\w] matches any character that is not a word character, i.e. not a letter, a digit, or an underscore.
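One caveat worth illustrating: \w also matches the underscore and, for str patterns in Python 3, non-ASCII letters, so the two patterns used in the answers above are not interchangeable. A minimal sketch, using sample strings that are my own and not from the question:

```python
import re

# Hypothetical inputs chosen to expose the difference between the two patterns
samples = ['h#w', 'snake_case!', 'café;']

# [^\w] removes everything except letters, digits and underscore (Unicode-aware)
keep_word_chars = [re.sub(r'[^\w]', '', s) for s in samples]

# The explicit ASCII class also removes the underscore and accented letters
keep_ascii_alnum = [re.sub(r'[^A-Za-z0-9 ]+', '', s) for s in samples]

print(keep_word_chars)   # ['hw', 'snake_case', 'café']
print(keep_ascii_alnum)  # ['hw', 'snakecase', 'caf']
```

If underscores should go as well, the explicit character class is probably the safer choice.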

Related

Python Regex Compile Split string so that words appear first

Say I was given a string like so
text = "1234 I just ? shut * the door"
I want to use a regex with re.compile() such that when I split the string all of the words are in front.
I.e. it should look like this.
text = ["I", "just", "shut", "the", "door", "1234", "?", "*"]
How can I use re.compile() to split the string this way?
import re
r = re.compile('regex to split string so that words are first').split(text)
Please let me know if you need any more information.
Thank you for the help.
IIUC, you don't need re. Just use str.split with sorted:
sorted(text.split(), key=lambda x: not x.isalpha())
Output:
['I', 'just', 'shut', 'the', 'door', '1234', '?', '*']
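Why this works may not be obvious: the key returns False for alphabetic tokens and True for everything else, False sorts before True, and sorted is stable, so tokens with the same key keep their original order. A small sketch that makes the keys visible (the variable names are mine):

```python
text = "1234 I just ? shut * the door"
tokens = text.split()

# False (0) for words, True (1) for everything else
keys = [not t.isalpha() for t in tokens]

# sorted() is stable: tokens with equal keys keep their relative order
result = sorted(tokens, key=lambda t: not t.isalpha())

print(keys)    # [True, False, False, True, False, True, False, False]
print(result)  # ['I', 'just', 'shut', 'the', 'door', '1234', '?', '*']
```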
You can use sorted with re.findall:
import re
text = "1234 I just ? shut * the door"
r = sorted(text.split(), key=lambda x: (x.isalpha(), x.isdigit(), bool(re.findall(r'^\W+$', x))), reverse=True)
Output:
['I', 'just', 'shut', 'the', 'door', '1234', '?', '*']
You can't do that with a single regex. You can write one regex to get all words, then another regex to get everything else.
import re
text = "1234 I just ? shut * the door"
r = re.compile(r'[a-zA-Z]+')
words = r.findall(text)
r = re.compile(r'[^a-zA-Z\s]+')
other = r.findall(text)
print(words + other) # ['I', 'just', 'shut', 'the', 'door', '1234', '?', '*']

split values in list based on multiple separators python

I have a string in a list, and I want to split it on several separators. I don't want to use a regular expression, even though a regex would do it in a single operation; I want to achieve it with for loops and split(). How can I do this?
Here's my code:
aa = ['prinec-how,are_you&&smile#isfine1']
separator = ["-",",","_","&","#"]
l1 = []
for sep in separator:
    for i in aa:
        #print("i:",i)
        split_list = i.split(sep)
        aa = split_list
        print("aa:",aa)
        #print("split_list:",split_list)
        l1 =l1 + split_list
print(l1)
Required output:
['prinec','how','are','you','smile','isfine1']
Using str.replace and str.split()
Ex:
aa = ['prinec-how,are_you&&smile#isfine1']
separator = ["-",",","_","&","#"]
for i in aa:
    for sep in separator:
        i = i.replace(sep, " ")
    print(i.split())
Output:
['prinec', 'how', 'are', 'you', 'smile', 'isfine1']
Instead of using a regular expression (which would be the sensible thing to do here), you could e.g. use itertools.groupby to group characters by whether they are separators or not, and then keep the groups that are not separators.
aa = ['prinec-how,are_you&&smile#isfine1']
separator = ["-",",","_","&","#"]
from itertools import groupby
res = [''.join(g) for k, g in groupby(aa[0], key=separator.__contains__) if not k]
# res: ['prinec', 'how', 'are', 'you', 'smile', 'isfine1']
As I understand your approach, you want to iteratively split the strings in the list by the different separators and add their parts back to the list. This way, it also makes sense for aa to be a list initially holding a single string. You could do this much more easily with a list comprehension, replacing aa with a new list holding the words from the previous aa split by the next separator:
aa = ['prinec-how,are_you&&smile#isfine1']
separator = ["-",",","_","&","#"]
for s in separator:
    aa = [x for a in aa for x in a.split(s) if x]
# aa: ['prinec', 'how', 'are', 'you', 'smile', 'isfine1']
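The same idea can be wrapped in a small reusable function. This is only a sketch; the name split_on_any and its parameters are hypothetical, not from the question:

```python
def split_on_any(strings, separators):
    """Split every string on each separator in turn, dropping empty pieces."""
    parts = list(strings)
    for sep in separators:
        # Re-split every piece produced so far on the next separator
        parts = [piece for s in parts for piece in s.split(sep) if piece]
    return parts

result = split_on_any(['prinec-how,are_you&&smile#isfine1'],
                      ["-", ",", "_", "&", "#"])
print(result)  # ['prinec', 'how', 'are', 'you', 'smile', 'isfine1']
```

The `if piece` filter is what removes the empty string that the doubled "&&" would otherwise leave behind.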
Using regex:
import re
a = re.compile(r'[^-,_&#]+')
ST = 'prinec-how,are_you&&smile#isfine1'
b = a.findall(ST)
print(b)
Output:
['prinec', 'how', 'are', 'you', 'smile', 'isfine1']
Using a for loop:
aa = ['prinec-how,are_you&&smile#isfine1','prinec-how,are_you&&smile#isfi-ne1']
separator = ["-",",","_","&","#"]
for i in range(len(aa)):
    j = aa[i]
    for sep in separator:
        j = j.replace(sep, ' ')
    aa[i] = j.split()
print(aa)
Output:
[['prinec', 'how', 'are', 'you', 'smile', 'isfine1'], ['prinec', 'how', 'are', 'you', 'smile', 'isfi', 'ne1']]

Get a string after a character in python

I want to get the string that comes after a '%' symbol and ends before any other character that is not a letter (no digits or symbols).
For example:
string = 'Hi %how are %YOU786$ex doing'
It should return a list:
['how', 'you']
I tried
string = text.split()
sample = []
for i in string:
    if '%' in i:
        sample.append(i[1:index].lower())
return sample
but I don't know how to get rid of the rest of 'you786$ex'.
EDIT: I don't want to import re
You can use a regular expression.
>>> import re
>>> s = 'Hi %how are %YOU786$ex doing'
>>> re.findall(r'%([a-z]+)', s.lower())
['how', 'you']
This can be most easily done with re.findall():
import re
re.findall(r'%([a-z]+)', string.lower())
This returns:
['how', 'you']
Or you can use str.split() and iterate over the characters:
sample = []
for token in string.lower().split('%')[1:]:
    word = ''
    for char in token:
        if char.isalpha():
            word += char
        else:
            break
    sample.append(word)
sample would become:
['how', 'you']
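Since the question asks to avoid importing re, the inner character loop can also be expressed with itertools.takewhile. A minimal sketch; the function name words_after_percent is hypothetical, and it skips a '%' that is not followed by a letter, which is an assumption about the desired behaviour:

```python
from itertools import takewhile

def words_after_percent(text):
    """Collect the run of letters that directly follows each '%'."""
    words = []
    for token in text.lower().split('%')[1:]:
        # Take characters until the first non-letter
        word = ''.join(takewhile(str.isalpha, token))
        if word:  # assumption: ignore a '%' with no letters after it
            words.append(word)
    return words

print(words_after_percent('Hi %how are %YOU786$ex doing'))  # ['how', 'you']
```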
Use Regex (Regular Expressions).
First, create a Regex pattern for your task. You could use online tools to test it. See regex for your task: https://regex101.com/r/PMSvtK/1
Then just use this regex in Python:
import re
def parse_string(string):
    return re.findall(r"%([a-zA-Z]+)", string)
print(parse_string('Hi %how are %YOU786$ex doing'))
Output:
['how', 'YOU']

Splitting a string after certain characters?

I will be given a string, and I need to split it every time that it has an "|", "/", "." or "_"
How can I do this fast? I know how to use the command split, but is there any way to give more than 1 split condition to it? For example, if the input given was
Hello test|multiple|36.strings/just36/testing
I want the output to give:
"['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']"
Use a regex with the re module:
>>> import re
>>> s='You/can_split|multiple'
>>> re.split(r'[/_|.]', s)
['You', 'can', 'split', 'multiple']
In this case, [/_|.] will split on any of those characters.
Or, you can use a list comprehension to insert a single (perhaps multiple character) delimiter and then split on that:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s]).split('-><-')
['You', 'can', 'split', 'multiple']
With the added example:
>>> s2="Hello test|multiple|36.strings/just36/testing"
Method 1:
>>> re.split(r'[/_|.]', s2)
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
Method 2:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s2]).split('-><-')
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
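If the separators arrive as a list rather than being hard-coded, the character class can be built with re.escape. This is a sketch under that assumption; the names separators and pattern are mine. Adding + to the class also collapses runs of adjacent separators, which the plain class does not:

```python
import re

separators = ['|', '/', '.', '_']

# re.escape keeps characters such as '-', '^' or ']' safe inside [...]
char_class = '[' + ''.join(re.escape(s) for s in separators) + ']'
pattern = re.compile(char_class + '+')

parts = pattern.split('Hello test|multiple|36.strings/just36/testing')
print(parts)  # ['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']

# Without '+', adjacent separators leave empty strings in the result
print(re.split(char_class, 'a//b'))  # ['a', '', 'b']
print(pattern.split('a//b'))         # ['a', 'b']
```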
Use groupby:
from itertools import groupby
s = 'You/can_split|multiple'
separators = set('/_|.')
result = [''.join(group) for k, group in groupby(s, key=lambda x: x not in separators) if k]
print(result)
Output
['You', 'can', 'split', 'multiple']

Splitting the sentences in python

I am trying to split the sentences in words.
words = content.lower().split()
this gives me the list of words like
'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'
and with this code:
def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!##$%^&*()_+`{}|\"?><`-=\\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    return clean_word_list
I get something like:
'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'
Notice the word "morningthe" in the list: it originally had "--" between the two words. Is there any way I can split it into two words, "morning" and "the"?
I would suggest a regex-based solution:
import re
def to_words(text):
    return re.findall(r'\w+', text)
This looks for all words, i.e. runs of word characters (letters, digits and underscores), ignoring symbols, separators and whitespace.
>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']
Note that if you're looping over the words, using re.finditer, which returns an iterator of match objects, is probably better, as you don't have to store the whole list of words at once.
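A minimal sketch of that lazy variant; the generator name iter_words is hypothetical:

```python
import re

def iter_words(text):
    """Yield words one at a time instead of building a full list."""
    for match in re.finditer(r'\w+', text):
        # Each item is a match object; group() returns the matched text
        yield match.group()

for word in iter_words("morning--the first day."):
    print(word)
```

Each word is produced only when the loop asks for it, so memory use stays flat even for a long text.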
Alternatively, you may also use itertools.groupby along with str.isalpha to extract alphabetic-only words from the string:
>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'
>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
PS: The regex-based solution is much cleaner. I have mentioned this only as a possible alternative.
Specific to OP: If all you want is to also split on -- in the resulting list, you can first replace each hyphen '-' with a space ' ' before calling split. Your code would then be:
words = content.lower().replace('-', ' ').split()
where words will hold the value you desire.
Trying to do this with regexes will send you crazy e.g.
>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']
Definitely look at the nltk package.
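For the apostrophe case specifically, a slightly more permissive pattern may be enough before reaching for a full tokenizer. The pattern below is my own assumption, not part of the answer above: it allows apostrophes only between ASCII letters and ignores digits and other edge cases.

```python
import re

# Letters, optionally followed by groups of "'" plus more letters
WORD_WITH_APOSTROPHE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

words = WORD_WITH_APOSTROPHE.findall("Don't read O'Rourke's books!")
print(words)  # ["Don't", 'read', "O'Rourke's", 'books']
```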
Besides the solutions given already, you could also improve your clean_up_list function to do a better job.
def clean_up_list(word_list):
    clean_word_list = []
    # Define the symbols once, outside the loop, so the string
    # doesn't have to be created again for every word.
    symbols = "~!##$%^&*()_+`{}|\"?><`-=\\][';/.,']"
    for word in word_list:
        current_word = ''
        for index in range(len(word)):
            if word[index] in symbols:
                # A symbol ends the current word, if there is one
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += word[index]
        if current_word:
            # Append possible last current_word
            clean_word_list.append(current_word)
    return clean_word_list
Actually, you could apply the block inside for word in word_list: to the whole sentence instead, provided you also add whitespace to symbols so that spaces act as separators.
You could also do this:
import re
def word_list(text):
    # filter(None, ...) drops the empty strings re.split leaves at the edges
    return list(filter(None, re.split(r'\W+', text)))
print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))
Returns:
['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']
