Removing specific set of characters in a list of strings

Removing specific set of characters in a list of strings - python

I have a list of strings, and want to use another list of strings and remove any instance of the combination of bad list in my list. Such as the output of the below would be foo, bar, foobar, foofoo... Currently I have tried a few things for example below
mylist = ['foo!', 'bar\\n', 'foobar!!??!!', 'foofoo::!*']
remove_list = ['\\n', '!', '*', '?', ':']
for remove in remove_list:
for strings in mylist:
strings = strings.replace(bad, ' ')
The above code doesnt work, I did at one point set it to a new variable and append that afterwords but that wasnt working well becuase if their was two issues in a string it would be appended twice.

You changed the temporary variable, not the original list. Instead, assign the result back into mylist
for bad in remove_list:
for pos, string in enumerate(mylist):
mylist[pos] = string.replace(bad, ' ')

Try this:
mylist = ['foo!', 'bar\\n', 'foobar!!??!!', 'foofoo::!*']
bads = ['\\n', '!', '*', '?', ':']
result = []
for s in mylist:
# s is a temporary copy
for bad in bads:
s = s.replace(bad, '') # for all bad remove it
result.append(s)
print(result)
Could be implemented more concise, but this way it's more understandable.

I had a hard time interpreting the question, but I see you have the result desired at the top of your question.
mylist = ['foo!', 'bar\\n', 'foobar!!??!!', 'foofoo::!*']
remove_list = ['\\n', '!', '*', '?', ':']
output = output[]
for strings in mylist:
for remove in remove_list:
strings = strings.replace(remove, '')
output.append(strings)

import re
for list1 in mylist:
t = regex.sub('', list1)
print(t)
If you just want to get rid of non-chars do this. It works a lot better than comparing two separate array lists.

Why not have regex do the work for you? No nested loops this way (just make sure to escape correctly):
import re
mylist = ['foo!', 'bar\\n', 'foobar!!??!!', 'foofoo::!*']
remove_list = [r'\\n', '\!', '\*', '\?', ':']
removals = re.compile('|'.join(remove_list))
print([removals.sub('', s) for s in mylist])
['foo', 'bar', 'foobar', 'foofoo']

Another solution you can use is a comprehension list and remove the characters you want. After that, you delete duplicates.
list_good = [word.replace(bad, '') for word in mylist for bad in remove_list]
list_good = list(set(list_good))

my_list = ["foo!", "bar\\n", "foobar!!??!!", "foofoo::*!"]
to_remove = ["!", "\\n", "?", ":", "*"]
for index, item in enumerate(my_list):
for char in to_remove:
if char in item:
item = item.replace(char, "")
my_list[index] = item
print(my_list) # outputs [“foo”,”bar”,”foobar”,”foofoo”]

Related

python string match partial match [duplicate]

How can I check if any of the strings in an array exists in another string?
For example:
a = ['a', 'b', 'c']
s = "a123"
if a in s:
print("some of the strings found in s")
else:
print("no strings found in s")
How can I replace the if a in s: line to get the appropriate result?

You can use any:
a_string = "A string is more than its parts!"
matches = ["more", "wholesome", "milk"]
if any([x in a_string for x in matches]):
Similarly to check if all the strings from the list are found, use all instead of any.

any() is by far the best approach if all you want is True or False, but if you want to know specifically which string/strings match, you can use a couple things.
If you want the first match (with False as a default):
match = next((x for x in a if x in str), False)
If you want to get all matches (including duplicates):
matches = [x for x in a if x in str]
If you want to get all non-duplicate matches (disregarding order):
matches = {x for x in a if x in str}
If you want to get all non-duplicate matches in the right order:
matches = []
for x in a:
if x in str and x not in matches:
matches.append(x)

You should be careful if the strings in a or str gets longer. The straightforward solutions take O(S*(A^2)), where S is the length of str and A is the sum of the lenghts of all strings in a. For a faster solution, look at Aho-Corasick algorithm for string matching, which runs in linear time O(S+A).

Just to add some diversity with regex:
import re
if any(re.findall(r'a|b|c', str, re.IGNORECASE)):
print 'possible matches thanks to regex'
else:
print 'no matches'
or if your list is too long - any(re.findall(r'|'.join(a), str, re.IGNORECASE))

A surprisingly fast approach is to use set:
a = ['a', 'b', 'c']
str = "a123"
if set(a) & set(str):
print("some of the strings found in str")
else:
print("no strings found in str")
This works if a does not contain any multiple-character values (in which case use any as listed above). If so, it's simpler to specify a as a string: a = 'abc'.

You need to iterate on the elements of a.
a = ['a', 'b', 'c']
str = "a123"
found_a_string = False
for item in a:
if item in str:
found_a_string = True
if found_a_string:
print "found a match"
else:
print "no match found"

a = ['a', 'b', 'c']
str = "a123"
a_match = [True for match in a if match in str]
if True in a_match:
print "some of the strings found in str"
else:
print "no strings found in str"

jbernadas already mentioned the Aho-Corasick-Algorithm in order to reduce complexity.
Here is one way to use it in Python:
Download aho_corasick.py from here
Put it in the same directory as your main Python file and name it aho_corasick.py
Try the alrorithm with the following code:
from aho_corasick import aho_corasick #(string, keywords)
print(aho_corasick(string, ["keyword1", "keyword2"]))
Note that the search is case-sensitive

The regex module recommended in python docs, supports this
words = {'he', 'or', 'low'}
p = regex.compile(r"\L<name>", name=words)
m = p.findall('helloworld')
print(m)
output:
['he', 'low', 'or']
Some details on implementation: link

A compact way to find multiple strings in another list of strings is to use set.intersection. This executes much faster than list comprehension in large sets or lists.
>>> astring = ['abc','def','ghi','jkl','mno']
>>> bstring = ['def', 'jkl']
>>> a_set = set(astring) # convert list to set
>>> b_set = set(bstring)
>>> matches = a_set.intersection(b_set)
>>> matches
{'def', 'jkl'}
>>> list(matches) # if you want a list instead of a set
['def', 'jkl']
>>>

Just some more info on how to get all list elements availlable in String
a = ['a', 'b', 'c']
str = "a123"
list(filter(lambda x: x in str, a))

It depends on the context
suppose if you want to check single literal like(any single word a,e,w,..etc) in is enough
original_word ="hackerearcth"
for 'h' in original_word:
print("YES")
if you want to check any of the character among the original_word:
make use of
if any(your_required in yourinput for your_required in original_word ):
if you want all the input you want in that original_word,make use of all
simple
original_word = ['h', 'a', 'c', 'k', 'e', 'r', 'e', 'a', 'r', 't', 'h']
yourinput = str(input()).lower()
if all(requested_word in yourinput for requested_word in original_word):
print("yes")

flog = open('test.txt', 'r')
flogLines = flog.readlines()
strlist = ['SUCCESS', 'Done','SUCCESSFUL']
res = False
for line in flogLines:
for fstr in strlist:
if line.find(fstr) != -1:
print('found')
res = True
if res:
print('res true')
else:
print('res false')

I would use this kind of function for speed:
def check_string(string, substring_list):
for substring in substring_list:
if substring in string:
return True
return False

Yet another solution with set. using set.intersection. For a one-liner.
subset = {"some" ,"words"}
text = "some words to be searched here"
if len(subset & set(text.split())) == len(subset):
print("All values present in text")
if subset & set(text.split()):
print("Atleast one values present in text")

If you want exact matches of words then consider word tokenizing the target string. I use the recommended word_tokenize from nltk:
from nltk.tokenize import word_tokenize
Here is the tokenized string from the accepted answer:
a_string = "A string is more than its parts!"
tokens = word_tokenize(a_string)
tokens
Out[46]: ['A', 'string', 'is', 'more', 'than', 'its', 'parts', '!']
The accepted answer gets modified as follows:
matches_1 = ["more", "wholesome", "milk"]
[x in tokens for x in matches_1]
Out[42]: [True, False, False]
As in the accepted answer, the word "more" is still matched. If "mo" becomes a match string, however, the accepted answer still finds a match. That is a behavior I did not want.
matches_2 = ["mo", "wholesome", "milk"]
[x in a_string for x in matches_1]
Out[43]: [True, False, False]
Using word tokenization, "mo" is no longer matched:
[x in tokens for x in matches_2]
Out[44]: [False, False, False]
That is the additional behavior that I wanted. This answer also responds to the duplicate question here.

data = "firstName and favoriteFood"
mandatory_fields = ['firstName', 'lastName', 'age']
# for each
for field in mandatory_fields:
if field not in data:
print("Error, missing req field {0}".format(field));
# still fine, multiple if statements
if ('firstName' not in data or
'lastName' not in data or
'age' not in data):
print("Error, missing a req field");
# not very readable, list comprehension
missing_fields = [x for x in mandatory_fields if x not in data]
if (len(missing_fields)>0):
print("Error, missing fields {0}".format(", ".join(missing_fields)));

Splitting a string with numbers and letters [duplicate]

I'd like to split strings like these
'foofo21'
'bar432'
'foobar12345'
into
['foofo', '21']
['bar', '432']
['foobar', '12345']
Does somebody know an easy and simple way to do this in python?

I would approach this by using re.match in the following way:
import re
match = re.match(r"([a-z]+)([0-9]+)", 'foofo21', re.I)
if match:
items = match.groups()
print(items)
>> ("foofo", "21")

def mysplit(s):
head = s.rstrip('0123456789')
tail = s[len(head):]
return head, tail
>>> [mysplit(s) for s in ['foofo21', 'bar432', 'foobar12345']]
[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]

Yet Another Option:
>>> [re.split(r'(\d+)', s) for s in ('foofo21', 'bar432', 'foobar12345')]
[['foofo', '21', ''], ['bar', '432', ''], ['foobar', '12345', '']]

>>> r = re.compile("([a-zA-Z]+)([0-9]+)")
>>> m = r.match("foobar12345")
>>> m.group(1)
'foobar'
>>> m.group(2)
'12345'
So, if you have a list of strings with that format:
import re
r = re.compile("([a-zA-Z]+)([0-9]+)")
strings = ['foofo21', 'bar432', 'foobar12345']
print [r.match(string).groups() for string in strings]
Output:
[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]

I'm always the one to bring up findall() =)
>>> strings = ['foofo21', 'bar432', 'foobar12345']
>>> [re.findall(r'(\w+?)(\d+)', s)[0] for s in strings]
[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]
Note that I'm using a simpler (less to type) regex than most of the previous answers.

here is a simple function to seperate multiple words and numbers from a string of any length, the re method only seperates first two words and numbers. I think this will help everyone else in the future,
def seperate_string_number(string):
previous_character = string[0]
groups = []
newword = string[0]
for x, i in enumerate(string[1:]):
if i.isalpha() and previous_character.isalpha():
newword += i
elif i.isnumeric() and previous_character.isnumeric():
newword += i
else:
groups.append(newword)
newword = i
previous_character = i
if x == len(string) - 2:
groups.append(newword)
newword = ''
return groups
print(seperate_string_number('10in20ft10400bg'))
# outputs : ['10', 'in', '20', 'ft', '10400', 'bg']

import re
s = raw_input()
m = re.match(r"([a-zA-Z]+)([0-9]+)",s)
print m.group(0)
print m.group(1)
print m.group(2)

without using regex, using isdigit() built-in function, only works if starting part is text and latter part is number
def text_num_split(item):
for index, letter in enumerate(item, 0):
if letter.isdigit():
return [item[:index],item[index:]]
print(text_num_split("foobar12345"))
OUTPUT :
['foobar', '12345']

This is a little longer, but more versatile for cases where there are multiple, randomly placed, numbers in the string. Also, it requires no imports.
def getNumbers( input ):
# Collect Info
compile = ""
complete = []
for letter in input:
# If compiled string
if compile:
# If compiled and letter are same type, append letter
if compile.isdigit() == letter.isdigit():
compile += letter
# If compiled and letter are different types, append compiled string, and begin with letter
else:
complete.append( compile )
compile = letter
# If no compiled string, begin with letter
else:
compile = letter
# Append leftover compiled string
if compile:
complete.append( compile )
# Return numbers only
numbers = [ word for word in complete if word.isdigit() ]
return numbers

Here is simple solution for that problem, no need for regex:
user = input('Input: ') # user = 'foobar12345'
int_list, str_list = [], []
for item in user:
try:
item = int(item) # searching for integers in your string
except:
str_list.append(item)
string = ''.join(str_list)
else: # if there are integers i will add it to int_list but as str, because join function only can work with str
int_list.append(str(item))
integer = int(''.join(int_list)) # if you want it to be string just do z = ''.join(int_list)
final = [string, integer] # you can also add it to dictionary d = {string: integer}
print(final)

In Addition to the answer of #Evan
If the incoming string is in this pattern 21foofo then the re.match pattern would be like this.
import re
match = re.match(r"([0-9]+)([a-z]+)", '21foofo', re.I)
if match:
items = match.groups()
print(items)
>> ("21", "foofo")
Otherwise, you'll get UnboundLocalError: local variable 'items' referenced before assignment error.

How to check for multi strings in a line [duplicate]

How can I check if any of the strings in an array exists in another string?
For example:
a = ['a', 'b', 'c']
s = "a123"
if a in s:
print("some of the strings found in s")
else:
print("no strings found in s")
How can I replace the if a in s: line to get the appropriate result?

You can use any:
a_string = "A string is more than its parts!"
matches = ["more", "wholesome", "milk"]
if any([x in a_string for x in matches]):
Similarly to check if all the strings from the list are found, use all instead of any.

any() is by far the best approach if all you want is True or False, but if you want to know specifically which string/strings match, you can use a couple things.
If you want the first match (with False as a default):
match = next((x for x in a if x in str), False)
If you want to get all matches (including duplicates):
matches = [x for x in a if x in str]
If you want to get all non-duplicate matches (disregarding order):
matches = {x for x in a if x in str}
If you want to get all non-duplicate matches in the right order:
matches = []
for x in a:
if x in str and x not in matches:
matches.append(x)

You should be careful if the strings in a or str gets longer. The straightforward solutions take O(S*(A^2)), where S is the length of str and A is the sum of the lenghts of all strings in a. For a faster solution, look at Aho-Corasick algorithm for string matching, which runs in linear time O(S+A).

Just to add some diversity with regex:
import re
if any(re.findall(r'a|b|c', str, re.IGNORECASE)):
print 'possible matches thanks to regex'
else:
print 'no matches'
or if your list is too long - any(re.findall(r'|'.join(a), str, re.IGNORECASE))

A surprisingly fast approach is to use set:
a = ['a', 'b', 'c']
str = "a123"
if set(a) & set(str):
print("some of the strings found in str")
else:
print("no strings found in str")
This works if a does not contain any multiple-character values (in which case use any as listed above). If so, it's simpler to specify a as a string: a = 'abc'.

You need to iterate on the elements of a.
a = ['a', 'b', 'c']
str = "a123"
found_a_string = False
for item in a:
if item in str:
found_a_string = True
if found_a_string:
print "found a match"
else:
print "no match found"

a = ['a', 'b', 'c']
str = "a123"
a_match = [True for match in a if match in str]
if True in a_match:
print "some of the strings found in str"
else:
print "no strings found in str"

jbernadas already mentioned the Aho-Corasick-Algorithm in order to reduce complexity.
Here is one way to use it in Python:
Download aho_corasick.py from here
Put it in the same directory as your main Python file and name it aho_corasick.py
Try the alrorithm with the following code:
from aho_corasick import aho_corasick #(string, keywords)
print(aho_corasick(string, ["keyword1", "keyword2"]))
Note that the search is case-sensitive

The regex module recommended in python docs, supports this
words = {'he', 'or', 'low'}
p = regex.compile(r"\L<name>", name=words)
m = p.findall('helloworld')
print(m)
output:
['he', 'low', 'or']
Some details on implementation: link

A compact way to find multiple strings in another list of strings is to use set.intersection. This executes much faster than list comprehension in large sets or lists.
>>> astring = ['abc','def','ghi','jkl','mno']
>>> bstring = ['def', 'jkl']
>>> a_set = set(astring) # convert list to set
>>> b_set = set(bstring)
>>> matches = a_set.intersection(b_set)
>>> matches
{'def', 'jkl'}
>>> list(matches) # if you want a list instead of a set
['def', 'jkl']
>>>

Just some more info on how to get all list elements availlable in String
a = ['a', 'b', 'c']
str = "a123"
list(filter(lambda x: x in str, a))

It depends on the context
suppose if you want to check single literal like(any single word a,e,w,..etc) in is enough
original_word ="hackerearcth"
for 'h' in original_word:
print("YES")
if you want to check any of the character among the original_word:
make use of
if any(your_required in yourinput for your_required in original_word ):
if you want all the input you want in that original_word,make use of all
simple
original_word = ['h', 'a', 'c', 'k', 'e', 'r', 'e', 'a', 'r', 't', 'h']
yourinput = str(input()).lower()
if all(requested_word in yourinput for requested_word in original_word):
print("yes")

flog = open('test.txt', 'r')
flogLines = flog.readlines()
strlist = ['SUCCESS', 'Done','SUCCESSFUL']
res = False
for line in flogLines:
for fstr in strlist:
if line.find(fstr) != -1:
print('found')
res = True
if res:
print('res true')
else:
print('res false')

I would use this kind of function for speed:
def check_string(string, substring_list):
for substring in substring_list:
if substring in string:
return True
return False

Yet another solution with set. using set.intersection. For a one-liner.
subset = {"some" ,"words"}
text = "some words to be searched here"
if len(subset & set(text.split())) == len(subset):
print("All values present in text")
if subset & set(text.split()):
print("Atleast one values present in text")

If you want exact matches of words then consider word tokenizing the target string. I use the recommended word_tokenize from nltk:
from nltk.tokenize import word_tokenize
Here is the tokenized string from the accepted answer:
a_string = "A string is more than its parts!"
tokens = word_tokenize(a_string)
tokens
Out[46]: ['A', 'string', 'is', 'more', 'than', 'its', 'parts', '!']
The accepted answer gets modified as follows:
matches_1 = ["more", "wholesome", "milk"]
[x in tokens for x in matches_1]
Out[42]: [True, False, False]
As in the accepted answer, the word "more" is still matched. If "mo" becomes a match string, however, the accepted answer still finds a match. That is a behavior I did not want.
matches_2 = ["mo", "wholesome", "milk"]
[x in a_string for x in matches_1]
Out[43]: [True, False, False]
Using word tokenization, "mo" is no longer matched:
[x in tokens for x in matches_2]
Out[44]: [False, False, False]
That is the additional behavior that I wanted. This answer also responds to the duplicate question here.

data = "firstName and favoriteFood"
mandatory_fields = ['firstName', 'lastName', 'age']
# for each
for field in mandatory_fields:
if field not in data:
print("Error, missing req field {0}".format(field));
# still fine, multiple if statements
if ('firstName' not in data or
'lastName' not in data or
'age' not in data):
print("Error, missing a req field");
# not very readable, list comprehension
missing_fields = [x for x in mandatory_fields if x not in data]
if (len(missing_fields)>0):
print("Error, missing fields {0}".format(", ".join(missing_fields)));

Partitioning elements in list by \t . Python

my_list = ['1\tMelkor\tMorgoth\tSauronAtDolGoldul','2\tThingols\tHeirIsDior\tSilmaril','3\tArkenstone\tIsProbablyA\tSilmaril']
I'm trying to split this list into sublists separated by \t
output = [['1','Melkor','Morgoth','SauronAtDolGoldul'],['2','Thigols','HeirIsDior','Silmaril'],['3','Arkenstone','IsProbablyA','Silmaril']]
I was thinking something on the lines of
output = []
for k_string in my_list:
temp = []
for i in k_string:
temp_s = ''
if i != '\':
temp_s = temp_s + i
elif i == '\':
break
temp.append(temp_s)
it gets messed up with the t . . i'm not sure how else I would go about doing it. I've seen people use .join for similar things but I don't really understand how to use .join

You want to use str.split(); a list comprehension lets you apply this to all elements in one line:
output = [sub.split('\t') for sub in my_list]
There is no literal \ in the string; the \t is an escape code that signifies the tab character.
Demo:
>>> my_list = ['1\tMelkor\tMorgoth\tSauronAtDolGoldul','2\tThingols\tHeirIsDior\tSilmaril','3\tArkenstone\tIsProbablyA\tSilmaril']
>>> [sub.split('\t') for sub in my_list]
[['1', 'Melkor', 'Morgoth', 'SauronAtDolGoldul'], ['2', 'Thingols', 'HeirIsDior', 'Silmaril'], ['3', 'Arkenstone', 'IsProbablyA', 'Silmaril']]

>>> import csv
>>> my_list = ['1\tMelkor\tMorgoth\tSauronAtDolGoldul','2\tThingols\tHeirIsDior\tSilmaril','3\tArkenstone\tIsProbablyA\tSilmaril']
>>> list(csv.reader(my_list, delimiter='\t'))
[['1', 'Melkor', 'Morgoth', 'SauronAtDolGoldul'], ['2', 'Thingols', 'HeirIsDior', 'Silmaril'], ['3', 'Arkenstone', 'IsProbablyA', 'Silmaril']]

How to prevent an procedure similar to the split () function (but with multiple separators) returns ' ' in its output

guys, I'm a programming newbie trying to improve the procedure bellow in a way that when I pass it this argument: split_string("After the flood ... all the colors came out."," .") it returns it:
['After', 'the', 'flood', 'all', 'the', 'colors', 'came', 'out']
and not this:
['After', 'the', 'flood', '', '', '', '', 'all', 'the', 'colors', 'came', 'out', '']
Any hint of how to do this? (I could just iterate again the list and delete the '' elements, but I wanted a more elegant solution)
This is the procedure:
def split_string(source, separatorList):
splited = [source]
for separator in splitlist:
source = splited
splited = []
print 'separator= ', separator
for sequence in source:
print 'sequence = ', sequence
if sequence not in splitlist and sequence != ' ':
splited = splited + sequence.split(separator)
return splited
print split_string("This is a test-of the,string separation-code!", " ,!-")
print
print split_string("After the flood ... all the colors came out."," .")

You can filter out the empty strings in the return statement:
return [x for x in split if x]
As a side note, I think it would be easier to write your function based on re.split():
def split_string(s, separators):
pattern = "|".join(re.escape(sep) for sep in separators)
return [x for x in re.split(pattern, s) if x]

print re.split('[. ]+', 'After the flood ... all the colors came out.')
or, better, the other way round
print re.findall('[^. ]+', 'After the flood ... all the colors came out.')

Let's see where did the empty strings come from first, try to execute this in shell:
>>> 'After the'.split(' ')
result:
['After', '', 'the']
This was because when split method came to ' ' in the string, it find nothing but '' between two spaces.
So the solution is simple, just check the boolean value of every item get from .split(
def split_string(source, separatorList):
splited = [source]
for separator in separatorList:
# if you want to exchange two variables, then write in one line can make the code more clear
source, splited = splited, []
for sequence in source:
# there's no need to check `sequence` in advance, just split it
# if sequence not in separatorList and sequence != ' ':
# splited = splited + sequence.split(separator)
# code to prevent appearance of `''` is here, do a if check in list comprehension.
# `+=` is equivalent to `= splited +`
splited += [i for i in sequence.split(separator) if i]
return splited
More details about [i for i in a_list if i] see PEP 202

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Removing specific set of characters in a list of strings - python

You changed the temporary variable, not the original list. Instead, assign the result back into mylist for bad in remove_list: for pos, string in enumerate(mylist): mylist[pos] = string.replace(bad, ' ')

import re for list1 in mylist: t = regex.sub('', list1) print(t) If you just want to get rid of non-chars do this. It works a lot better than comparing two separate array lists.

Another solution you can use is a comprehension list and remove the characters you want. After that, you delete duplicates. list_good = [word.replace(bad, '') for word in mylist for bad in remove_list] list_good = list(set(list_good))

my_list = ["foo!", "bar\\n", "foobar!!??!!", "foofoo::!"] to_remove = ["!", "\\n", "?", ":", ""] for index, item in enumerate(my_list): for char in to_remove: if char in item: item = item.replace(char, "") my_list[index] = item print(my_list) # outputs [“foo”,”bar”,”foobar”,”foofoo”]

Related

python string match partial match [duplicate]

Splitting a string with numbers and letters [duplicate]

How to check for multi strings in a line [duplicate]

Partitioning elements in list by \t . Python

How to prevent an procedure similar to the split () function (but with multiple separators) returns ' ' in its output

Categories

Resources