Removing empty strings from a list in python - python

I need to split a string. I am using this:
import re

def ParseStringFile(string):
    p = re.compile(r'\W+')
    result = p.split(string)
    return result
But I have an error: my result has two empty strings (''), one before 'Лев'. How do I get rid of them?

As nhahtdh pointed out, the empty string is expected since there's a \n at the start and end of the string, but if they bother you, you can filter them very quickly and efficiently.
>>> list(filter(None, ['', 'text', 'more text', '']))
['text', 'more text']

You could strip the leading and trailing newlines from the string before splitting it:
p.split(string.strip('\n'))
Alternatively, split the string and then remove the first and last element:
result = p.split(string)[1:-1]
The [1:-1] takes a copy of the result that includes all indexes starting at 1 (i.e. it drops the first element) and ending at -2 (i.e. the second-to-last element); the second index of the slice is exclusive.
A longer and less elegant alternative would be to modify the list in-place:
result = p.split(string)
del result[-1] # remove last element
del result[0] # remove first element
Note that in these two solutions the first and last element must be the empty string. If sometimes the input doesn't contain these empty strings at the beginning or end, then they will misbehave. However they are also the fastest solutions.
If you want to remove all empty strings in the result, even if they happen inside the list of results you can use a list-comprehension:
[word for word in p.split(string) if word]
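To see the approaches above side by side, here is a minimal sketch. The sample text is an assumption (the question's actual input is not shown); it simply starts and ends with a newline, which is what produces the empty strings.

```python
import re

# Hypothetical sample input: the leading and trailing '\n' are what
# produce the empty strings at both ends of the split result.
text = "\nЛев Толстой, Война и мир\n"

pattern = re.compile(r"\W+")

raw = pattern.split(text)                        # keeps the empty strings
filtered = list(filter(None, raw))               # drops every falsy item
comprehension = [word for word in raw if word]   # same result, more explicit
stripped = pattern.split(text.strip("\n"))       # avoids the empties up front

print(raw)
print(filtered)
```

The filter and list-comprehension versions also remove empty strings from the middle of the list, while the strip version only prevents the ones caused by newlines at the ends.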

Related

Split in python with character special

I split within a string traversing an array with values, this split must contain the following rule:
Split the string into two parts when there is a special character, and select the first part as a result;
SCRIPT
array = [
    'srv1 #s',
    'srv2;192.168.9.1'
]
result = []
for x in array:
    outfinally = [line.split(';')[0] and line.split()[0] for line in x.splitlines() if line and line[0].isalpha()]
    for srv in outfinally:
        if srv != None:
            result.append(srv)
for i in result:
    print(i)
OUTPUT
srv1
srv2;192.168.9.1
DESIRED OUTPUT
srv1
srv2
This should split on any special character and append the first part of the split to a new list:
import re

array = [
    'srv1 #s',
    'srv2;192.168.9.1'
]
sep = r'[`\-=~!##$%^&*()_+\[\]{};\'\\:"|<,./<>?]'
new_array = []
for i in array:
    new_array.append(re.split(sep, i)[0])
Output:
['srv1 ', 'srv2']
You can split twice with the two different separators instead:
result = [s.split()[0].split(';')[0] for s in array]
result becomes:
['srv1', 'srv2']
The problem is here: line.split(';')[0] and line.split()[0]
Your second condition splits on whitespace. As a result, it'll always return the whitespace-split version unless there's a semicolon at the start of the input (in which case you get empty string).
You probably want to chain the two splits instead:
line.split(';')[0].split()[0]
To see what the code in your question is doing, take a look at what your conditional expression does in a few different cases:
>>> array = ['srv1 s', 'srv2;192.168.9.1', ';192.168.1.1', 'srv1;srv2 192.168.1.1']
>>> for item in array:
...     print("Original: {}\n\tSplit: {}".format(item, item.split(';')[0] and item.split()[0]))
...
Original: srv1 s
    Split: srv1 # split on whitespace
Original: srv2;192.168.9.1
    Split: srv2;192.168.9.1 # split on whitespace!
Original: ;192.168.1.1
    Split:  # split on special char, returned empty which is falsey, returns empty str
Original: srv1;srv2 192.168.1.1
    Split: srv1;srv2 # split only on whitespace
Change
outfinally = [line.split(';')[0] and line.split()[0] for line in x.splitlines() if line and line[0].isalpha()]
To
outfinally = [line.replace(';', ' ').split()[0] for line in x.splitlines() if line and line[0].isalpha()]
When you use and like that, it returns the second operand whenever the first one is truthy. The split function returns the full string in a one-element list when the separator is not found, so line.split(';')[0] is truthy for any line that does not start with a semicolon, and the expression always evaluates to the whitespace split, line.split()[0]. (If you use or instead, like I first tried to do, the opposite happens: the first operand is returned whenever it is truthy, so you never reach the second split.) Instead of having 2 conditions, combine them into one, something like line.replace(';', ' ').split()[0]. blhsing's solution is even better.
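The difference between and, or, and chaining the two splits can be checked directly. This is a small sketch using the question's sample data; first_field is a hypothetical helper name, not something from the original code.

```python
def first_field(line):
    """Return the text before the first ';' or whitespace (hypothetical helper)."""
    return line.split(';')[0].split()[0]

lines = ['srv1 #s', 'srv2;192.168.9.1']

# 'and' yields its second operand when the first is truthy,
# so only the whitespace split is ever used.
with_and = [l.split(';')[0] and l.split()[0] for l in lines]

# 'or' yields its first operand when it is truthy,
# so only the ';' split is ever used.
with_or = [l.split(';')[0] or l.split()[0] for l in lines]

# Chaining applies both splits, one after the other.
chained = [first_field(l) for l in lines]

print(with_and)
print(with_or)
print(chained)
```

Note that first_field raises IndexError for an empty or whitespace-only line, which is why the question's comprehension filters lines before splitting them.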

Removing item in list during loop

I have the code below. I'm trying to remove two strings from lists predict strings and test strings if one of them has been found in the other. The issue is that I have to split up each of them and check if there is a "portion" of one string inside the other. If there is then I just say there is a match and then delete both strings from the list so they are no longer iterated over.
ValueError: list.remove(x): x not in list
I get the above error though and I am assuming this is because I can't delete the string from test_strings since it is being iterated over? Is there a way around this?
Thanks
for test_string in test_strings[:]:
    for predict_string in predict_strings[:]:
        split_string = predict_string.split('/')
        for string in split_string:
            if (split_string in test_string):
                no_matches = no_matches + 1
                # Found match so remove both
                test_strings.remove(test_string)
                predict_strings.remove(predict_string)
Example input:
test_strings = ['hello/there', 'what/is/up', 'yo/do/di/doodle', 'ding/dong/darn']
predict_strings =['hello/there/mister', 'interesting/what/that/is']
so I want there to be a match between hello/there and hello/there/mister and for them to be removed from the list when doing the next comparison.
After one iteration I expect it to be:
test_strings == ['what/is/up', 'yo/do/di/doodle', 'ding/dong/darn']
predict_strings == ['interesting/what/that/is']
After the second iteration I expect it to be:
test_strings == ['yo/do/di/doodle', 'ding/dong/darn']
predict_strings == []
You should never try to modify an iterable while you're iterating over it, which is still effectively what you're trying to do. Make a set to keep track of your matches, then remove those elements at the end.
Also, your line for string in split_string: isn't really doing anything. You're not using the variable string. Either remove that loop, or change your code so that you're using string.
You can use augmented assignment to increase the value of no_matches.
no_matches = 0
found_in_test = set()
found_in_predict = set()
for test_string in test_strings:
    test_set = set(test_string.split("/"))
    for predict_string in predict_strings:
        split_strings = set(predict_string.split("/"))
        if not split_strings.isdisjoint(test_set):
            no_matches += 1
            found_in_test.add(test_string)
            found_in_predict.add(predict_string)
for element in found_in_test:
    test_strings.remove(element)
for element in found_in_predict:
    predict_strings.remove(element)
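The same idea can be packaged as a function that leaves the input lists untouched and returns filtered copies. This is a sketch under that assumption; the name remove_matches and its return shape are illustrative, not part of the original answer.

```python
def remove_matches(test_strings, predict_strings):
    """Return (match_count, remaining_test, remaining_predict).

    Two strings match when they share at least one '/'-separated part.
    The input lists are not modified.
    """
    found_in_test = set()
    found_in_predict = set()
    no_matches = 0
    for test_string in test_strings:
        test_parts = set(test_string.split('/'))
        for predict_string in predict_strings:
            if not test_parts.isdisjoint(predict_string.split('/')):
                no_matches += 1
                found_in_test.add(test_string)
                found_in_predict.add(predict_string)
    remaining_test = [s for s in test_strings if s not in found_in_test]
    remaining_predict = [s for s in predict_strings if s not in found_in_predict]
    return no_matches, remaining_test, remaining_predict


test_strings = ['hello/there', 'what/is/up', 'yo/do/di/doodle', 'ding/dong/darn']
predict_strings = ['hello/there/mister', 'interesting/what/that/is']
result = remove_matches(test_strings, predict_strings)
print(result)
```

Because nothing is removed during the loops, a string that matches several partners is counted once per pair. If each string should match at most once, the inner loop would need to skip strings already recorded in the sets.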
From your code it seems likely that two split_strings match the same test_string. The first time through the loop removes test_string, the second time tries to do so but can't, since it's already removed!
You can try breaking out of the inner for loop if it finds a match, or use any instead.
import itertools

for test_string, predict_string in itertools.product(test_strings[:], predict_strings[:]):
    if any(s in test_string for s in predict_string.split('/')):
        no_matches += 1 # isn't this counter-intuitive?
        test_strings.remove(test_string)
        predict_strings.remove(predict_string)

How to Check if the substring is matching in a list of strings in Python

I have a list and I want to find if the string is present in the list of strings.
li = ['Convenience','Telecom Pharmacy']
txt = '1 convenience store'
I want to match the txt with the Convenience from the list.
I have tried
if any(txt.lower() in s.lower() for s in li):
    print(s)
print([s for s in li if txt in s])
Both the methods didn't give the output.
How to match the substring with the list?
You could use set() and intersection:
In [19]: set.intersection(set(txt.lower().split()), set(s.lower() for s in li))
Out[19]: {'convenience'}
I think split is your answer. Here is the description from the python documentation:
string.split(s[, sep[, maxsplit]])
Return a list of the words of the string s. If the optional second argument sep is absent or None, the words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed). If the second argument sep is present and not None, it specifies a string to be used as the word separator. The returned list will then have one more item than the number of non-overlapping occurrences of the separator in the string. If maxsplit is given, at most maxsplit number of splits occur, and the remainder of the string is returned as the final element of the list (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no limit on the number of splits (all possible splits are made).
The behavior of split on an empty string depends on the value of sep. If sep is not specified, or specified as None, the result will be an empty list. If sep is specified as any string, the result will be a list containing one element which is an empty string.
Use the split command on your txt variable. It will give you a list back. You can then do a compare on the two lists to find any matches. I personally would write the nested for loops to check the lists manually, but python provides lots of tools for the job. The following link discusses different approaches to matching two lists.
How can I compare two lists in python and return matches
Enjoy. :-)
I see two things.
Do you want to find if the pattern string matches EXACTLY an item in the list? In this case, nothing simpler:
if txt in list1:
    pass  # do something
You can also apply .upper() or .lower() to both txt and the list items if you want the comparison to be case-insensitive.
But If you want as I understand, to find if there is a string (in the list) which is part of txt, you have to use "for" loop:
def find(list1, txt):
    # return item if found, False otherwise
    for i in list1:
        if i.upper() in txt.upper():
            return i
    return False
It should work.
Console output:
>>>print(find(['Convenience','Telecom Pharmacy'], '1 convenience store'))
Convenience
>>>
You can try this,
>>> list1 = ['Convenience','Telecom Pharmacy']
>>> txt = '1 convenience store'
>>> list(filter(lambda x: txt.lower().find(x.lower()) >= 0, list1))
['Convenience']
# Or you can use this as well
>>> list(filter(lambda x: x.lower() in txt.lower(), list1))
['Convenience']
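For reference, the same case-insensitive check can be written as a list comprehension. This is a minimal sketch; find_matches is a hypothetical name, and it returns every list item that appears somewhere inside the text.

```python
def find_matches(candidates, text):
    """Return the candidates that occur, ignoring case, somewhere in text."""
    text_lower = text.lower()
    return [c for c in candidates if c.lower() in text_lower]


li = ['Convenience', 'Telecom Pharmacy']
txt = '1 convenience store'
matches = find_matches(li, txt)
print(matches)
```

This is plain substring matching, so a short candidate such as 'art' would also match inside 'department'. The set-intersection approach shown earlier compares whole words instead.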

Dot notation string manipulation

Is there a way to manipulate a string in Python using the following ways?
For any string that is stored in dot notation, for example:
s = "classes.students.grades"
Is there a way to change the string to the following:
"classes.students"
Basically, remove the last period and everything after it. So "restaurants.spanish.food.salty" would become "restaurants.spanish.food".
Additionally, is there any way to identify what comes after the last period? The reason I want to do this is I want to use isdigit().
So, if it was classes.students.grades.0 could I grab the 0 somehow, so I could use an if statement with isdigit, and say if the part of the string after the last period (so 0 in this case) is a digit, remove it, otherwise, leave it.
you can use split and join together:
s = "classes.students.grades"
print('.'.join(s.split('.')[:-1]))
You are splitting the string on . - it'll give you a list of strings, after that you are joining the list elements back to string separating them by .
[:-1] will pick all the elements from the list but the last one
To check what comes after the last .:
s.split('.')[-1]
Another way is to use rsplit. It works the same way as split but if you provide maxsplit parameter it'll split the string starting from the end:
>>> rest, last = s.rsplit('.', 1)
>>> rest
'classes.students'
>>> last
'grades'
You can also use re.sub to substitute the part after the last . with an empty string:
re.sub(r'\.[^.]+$', '', s)
As for the last part of your question, wrapping the words in [], I would recommend using format with a generator expression:
''.join("[{}]".format(e) for e in s.split('.'))
It'll give you the desired output:
[classes][students][grades]
The best way to do this is using the rsplit method and pass in the maxsplit argument.
>>> s = "classes.students.grades"
>>> before, after = s.rsplit('.', maxsplit=1) # in Python 2.x, pass it positionally: rsplit('.', 1)
>>> before
'classes.students'
>>> after
'grades'
You can also use the rfind() method with normal slice operation.
To get everything before last .:
>>> s = "classes.students.grades"
>>> last_index = s.rfind('.')
>>> s[:last_index]
'classes.students'
Then everything after last .
>>> s[last_index + 1:]
'grades'
If s contains a '.', s.rpartition('.') finds the last dot in s and returns the tuple (before_last_dot, dot, after_last_dot):
s = "classes.students.grades"
s.rpartition('.')[0]
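One detail worth checking is what happens when the string has no dot at all: rpartition then puts the whole string in the last slot, so index [0] is empty. The sketch below guards against that; drop_last_component is a hypothetical helper name.

```python
def drop_last_component(s):
    """Remove the last dot and what follows it; return s unchanged if it has no dot."""
    head, dot, tail = s.rpartition('.')
    # When no separator is found, rpartition returns ('', '', s),
    # so 'dot' is empty and the original string is kept.
    return head if dot else s


with_dots = drop_last_component("classes.students.grades")
without_dots = drop_last_component("classes")
raw_parts = "classes".rpartition('.')
print(with_dots, without_dots, raw_parts)
```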
If your goal is to get rid of a final component that's just a single digit, start and end with re.sub():
s = re.sub(r"\.\d$", "", s)
This will do the job, and leave other strings alone. No need to mess with anything else.
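As a quick check of that pattern, the sketch below compares it with a variant that accepts more than one digit. The second pattern is an assumption about what you might want for components like 786423; it is not part of the answer above.

```python
import re

one_digit = re.compile(r"\.\d$")     # a dot followed by exactly one digit at the end
any_digits = re.compile(r"\.\d+$")   # a dot followed by one or more digits at the end

a = one_digit.sub("", "classes.students.grades.0")
b = one_digit.sub("", "res.spa.f.sal.786423")      # no match: six digits after the dot
c = any_digits.sub("", "res.spa.f.sal.786423")
d = any_digits.sub("", "classes.students.grades")  # no match: last part is not numeric

print(a, b, c, d)
```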
If you do want to know about the general case (separate out the last component, no matter what it is), then use rsplit to split your string once:
>>> "hel.lo.there".rsplit(".", 1)
['hel.lo', 'there']
If there's no dot in the string you'll just get one element in your list, the entire string.
You can do it very simply with rsplit (str.rsplit([sep[, maxsplit]])), which returns a list by breaking the string along the given separator, starting from the right.
You can also specify how many splits should be performed:
>>> s = "res.spa.f.sal.786423"
>>> s.rsplit('.',1)
['res.spa.f.sal', '786423']
So the final function that you describe is:
def dimimak_cool_function(s):
    if '.' not in s:
        return s
    start, end = s.rsplit('.', 1)
    return start if end.isdigit() else s

>>> dimimak_cool_function("res.spa.f.sal.786423")
'res.spa.f.sal'
>>> dimimak_cool_function("res.spa.f.sal")
'res.spa.f.sal'

Remove substrings of variable length from string

I have a list of strings where all of the strings roughly follow the format 'foo\tbar\tfoo\n' in that there are three segments of variable length that are separated by two tabs (\t) and with a newline indicator at the end (\n).
I want to remove everything except for the text before the first \t, so that it would return 'foo'. Given that the first segment is of variable length, I'm not sure how to do that.
Use str.split():
>>> string = 'foo\tbar\tfoo\n'
>>> string.split('\t', 1)[0]
'foo'
This splits the string by the first occurrence of the '\t' tab character, which returns a list with two elements. The [0] selects the first element in the list, which is the part of the string before the first '\t' occurrence.
Just search for the first \t character, and get everything before it. Slicing makes this easy.
newstr = oldstr[:oldstr.find("\t")]
Try with:
t = 'foo\tbar\tfoo\n'
t[:t.index("\t")]
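The approaches above agree when the tab is present but differ when it is missing, which may matter since the strings only "roughly" follow the format. The sketch below compares them; the tab-less input is a hypothetical malformed line, and str.partition is added as one more option that was not in the answers.

```python
line = 'foo\tbar\tfoo\n'
no_tab = 'foo'   # hypothetical malformed line with no tab at all

# With a tab present, all approaches return the first segment.
by_split = line.split('\t', 1)[0]
by_partition = line.partition('\t')[0]
by_find = line[:line.find('\t')]

# Without a tab, split and partition return the whole string...
split_no_tab = no_tab.split('\t', 1)[0]
partition_no_tab = no_tab.partition('\t')[0]

# ...but find() returns -1, so the slice silently drops the last character.
find_no_tab = no_tab[:no_tab.find('\t')]

# index() raises ValueError instead of returning -1.
try:
    no_tab[:no_tab.index('\t')]
    index_raised = False
except ValueError:
    index_raised = True

print(by_split, split_no_tab, find_no_tab, index_raised)
```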
