How to delete multiple substrings by their positions in Python? - python

I have a string, and I have a list of positions of the substrings that I need to delete:
text = 'ab cd ef gh'
positions = [[2, 5], [8, 11]]
Every element of the list contains start and end position of substring. End position is exclusive, and start position is inclusive. So the string should be transformed to:
text = 'ab ef'
The length of the positions list is unknown, so the solution can't be hardcoded.
Is there an efficient way to delete multiple substrings by their positions? The positions never overlap.

Strings are immutable, so in-place deletion is a no-go. And successive concatenation is suboptimal.
You can convert the string to list so it can be mutated and simply wipe off the desired positions by deleting each unwanted slice. Use str.join to recreate your string:
text = 'ab cd ef gh'
lst = list(text)
for i in positions[::-1]:  # go back to front so earlier indices stay valid after each deletion
    del lst[slice(*i)]
text = ''.join(lst)
print(text)
# 'ab ef'
Note that conversion to list for the mutation of immutable types is also suggested by the docs as best practice:
Concatenating immutable sequences always results in a new object. This
means that building up a sequence by repeated concatenation will have
a quadratic runtime cost in the total sequence length. To get a linear runtime cost, you must switch to one of the alternatives below:
if concatenating str objects, you can build a list and use
str.join() at the end or else write to an io.StringIO instance and
retrieve its value when complete
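A variant of the same idea is to collect the slices you want to keep and join them once at the end. This is only a sketch: it assumes the positions are sorted in ascending order and do not overlap, as in the question, and remove_spans is a hypothetical helper name that does not appear in the answers here.

```python
def remove_spans(text, positions):
    """Return text with every [start, end) span removed (end is exclusive).

    Assumes positions are sorted ascending and do not overlap.
    """
    kept = []
    cursor = 0  # index of the first character not yet handled
    for start, end in positions:
        kept.append(text[cursor:start])  # keep the gap before this span
        cursor = end                     # skip the span itself
    kept.append(text[cursor:])           # keep whatever follows the last span
    return ''.join(kept)


result = remove_spans('ab cd ef gh', [[2, 5], [8, 11]])
print(result)  # 'ab ef'
```

Each character is copied once, so the cost stays linear in the length of the string.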

You have to offset the later indexes. Removing the first span with text[:2] + text[5:] shortens the string, so every later position shifts left by the number of characters already removed. Keep a running (negative) offset and add it to each start and end position.
text = 'ab cd ef gh'
positions = [[2,5],[8,11]]
offsetNextIndexes = 0
for position in positions:
    text = text[:position[0] + offsetNextIndexes] + text[position[1] + offsetNextIndexes:]
    offsetNextIndexes += position[0] - position[1]
print(text)

For this particular example, where the substrings to delete are every second whitespace-separated word, this also works:
" ".join(text.split()[0::2])
The slice skips every other item. The general form is
[start:stop:step]
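To make the step slice concrete, here is the example string taken apart one stage at a time. The intermediate names words and kept_words are only for illustration, and the approach relies on the unwanted parts being exactly the words at odd indexes.

```python
text = 'ab cd ef gh'

words = text.split()       # ['ab', 'cd', 'ef', 'gh']
kept_words = words[0::2]   # start at index 0, take every 2nd item
result = ' '.join(kept_words)

print(kept_words)  # ['ab', 'ef']
print(result)      # 'ab ef'
```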

Related

Return only the first instance of a value found in a for loop

I have a list of strings that are split in half like the following;
fhlist = [['BzRmmzZHzVBzgVQmZ'],['efmt']]
shlist = [['LPtqqffPqWqJmPLlL'], ['abcm']]
The first half is stored in a list fhlist whilst the second in shlist.
So the combined string of fhlist[0] and shlist[0] is BzRmmzZHzVBzgVQmZLPtqqffPqWqJmPLlL,
and the combined string of fhlist[1] and shlist[1] is efmtabcm.
I've written some code that iterates through each letter in the first-half and second-half strings, and if a letter appears in both halves it adds that character to another list, found:
found = []
for i in range(len(fhlist)):
    for char in fhlist[i]:
        if char in shlist[i]:
            found.append(char)
However, with the above example, found ends up as m m m, because the code appends every occurrence of the letter. I only want it to return m once for that pair of strings.
I previously had:
found = []
for i in range(len(fhlist)):
    for char in fhlist[i]:
        if char in shlist[i] and char not in found:
            found.append(char)
but this essentially 'blacklisted' any characters that appeared in other strings, so if another two strings both contained m such as the combined string efmtabcm it would ignore it as this character had already been found.
Thanks for any help!
Expanding my suggestion from the comments since it apparently solves the problem in the desired way:
To dedupe per pairing, you can replace:
found = []
for i in range(len(fhlist)):
    for char in fhlist[i]:
        if char in shlist[i]:
            found.append(char)
with (making some slight idiomatic improvements):
found = []
for fh, sh in zip(fhlist, shlist):  # Pair them up directly, don't operate by index
    found.extend(set(fh).intersection(sh))
or as a possibly too complex listcomp:
found = [x for fh, sh in zip(fhlist, shlist) for x in set(fh).intersection(sh)]
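This gets the unique overlapping items from each pairing (with set(fh).intersection(sh)) more efficiently (O(m+n) rather than O(m*n) in terms of the lengths of each string), then adds them all to found in bulk (keeping it as a list to avoid deduping across pairings). Note that this assumes each fh and sh is a string; with the nested single-element lists shown in the question, use set(fh[0]).intersection(sh[0]) instead.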
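A set intersection does not preserve the order of the characters. If the order of first appearance in the first half matters, one possible variation is to dedupe with dict.fromkeys instead. This sketch assumes flat lists of strings (not the nested lists from the question), and the names first_halves and second_halves are made up for the example.

```python
first_halves = ['BzRmmzZHzVBzgVQmZ', 'efmt']
second_halves = ['LPtqqffPqWqJmPLlL', 'abcm']

found = []
for fh, sh in zip(first_halves, second_halves):
    sh_chars = set(sh)  # constant-time membership tests
    # dict.fromkeys drops duplicates while keeping first-seen order
    common = dict.fromkeys(c for c in fh if c in sh_chars)
    found.extend(common)  # one entry per shared character, per pair

print(found)  # ['m', 'm']
```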
IIUC, you are trying to find common characters between each of the respective strings in fhlist and shlist
You can use set.intersection for this after using zip on the 2 lists and iterating on them together with a list comprehension, as follows -
[list(set(f[0]).intersection(s[0])) for f,s in zip(fhlist, shlist)]
[['m'], ['m']]
This works as follows -
1. BzRmmzZHzVBzgVQmZ, LPtqqffPqWqJmPLlL -> Common is `m`
2. efmt, abcm -> Common is `m`
...
till end of list
You can try this
fhlist = [['BzRmmzZHzVBzgVQmZ'],['efmt']]
shlist = [['LPtqqffPqWqJmPLlL'], ['abcm']]
found = []
for i in range(len(fhlist)):
    for char in ''.join(fhlist[i]):
        for a in ''.join(shlist[i]):
            if a==char and char not in found:
                found.append(char)
print(found)
Output:
['m']

Slicing the second last character

I have this string here '[2,3,1,1,]'
I'm new to slicing and I only know how to slice from the start and from the end, but not somewhere in between. I'm not even sure if that is possible.
Could someone tell me how I can slice this '[2,3,1,1,]' to this '[2,3,1,1]'?
So removing the second last character only.
If you just want to delete the second last character, you can do it like this:
s = "[2,3,1,1,]"
s[:-2] + s[-1]
# '[2,3,1,1]'
s[:-2] -> slices the string from index 0 up to, but not including, index -2
s[-1] -> fetches the last character
s[:-2] + s[-1] -> concatenates the two strings
If you're sure you have that string, slice both characters and add the ] back on!
source_string = "[2,3,1,1,]"
if source_string.endswith(",]"):
    source_string = source_string[:-2] + "]"
However, lists stored as strings are often not very useful. You may really want to convert the whole thing to a collection of numbers (perhaps manually removing the "[]" and splitting by ",", or using ast.literal_eval()), potentially converting it back to a string to display later.
>>> source_string = "[2,3,1,1,]"
>>> import ast
>>> my_list = ast.literal_eval(source_string)
>>> my_list
[2, 3, 1, 1]
>>> str(my_list)
'[2, 3, 1, 1]'
You can use this for this one case:
txt = "[2,3,1,1,]"
print(f"{txt[:-2]}{txt[-1]}")
Even though storing lists as strings is a bit weird.
txt[:-2] gets the characters from index 0 up to, but not including, index -2.
txt[-1] gets the last character, which is "]".
Then I concatenate both with an f-string.
You can use this if you don't want to use an f-string:
print(txt[:-2], txt[-1], sep="")
The sep argument makes print put nothing (instead of the default space) between the two values.
Using built-in methods such as str.rstrip
l = '[2,3,1,1,]'
l_new = f"{l.rstrip('],')}]"  # rstrip treats '],' as a set of characters, so their order doesn't matter
print(l_new)
or str.rpartition
a, _, b = l.rpartition(',')
l_new = a + b
print(l_new)
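you can avoid explicit slicing. See the docs for details.
The first approach also works when the string has no trailing comma, although rstrip removes every trailing ] and , (so a nested list such as '[[1,2],]' would lose a bracket). The second removes the last comma wherever it is, so it gives a wrong result if the character before the closing ] is not a comma. If necessary, guard it with a check such as str.endswith.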
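As a sketch of the guarded version mentioned above, the rpartition approach can be wrapped in a str.endswith check so that strings without a trailing comma pass through unchanged. The function name drop_trailing_comma is hypothetical.

```python
def drop_trailing_comma(s):
    """Remove the comma directly before the closing bracket, if there is one."""
    if not s.endswith(',]'):
        return s  # nothing to fix, leave the string untouched
    head, _, tail = s.rpartition(',')  # split at the last comma
    return head + tail


print(drop_trailing_comma('[2,3,1,1,]'))  # '[2,3,1,1]'
print(drop_trailing_comma('[2,3,1,1]'))   # '[2,3,1,1]'
```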

filter rows if alphabetical letter from a (smiles) string not from a list of elements

QUESTION
How can I filter out SMILES strings that contain any element (atom), matched case-insensitively, from a dataframe of elements such as H, He, Li, Be, B? This list is truncated; there are 80 of them.
BACKGROUND
I have a database containing SMILES strings:
The simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings.
(More info Wikipedia link)
The purpose of this was to get rid of rare elements and organometallics from the database.
I am starting with a single string to test the code before moving on to a dataframe. I wrote a loop to test for elements inside the string.
strings = "[O+]#C[Ni-4](C#[O+])(C#[O+])C#[O+]"
list = ['Ni']
for i in list:
    if i in strings:
        print(i)
How to iterate over a dataframe and filter?
For the simplified list version, go the other way round: loop over the list of elements and check whether each one occurs in the string.
strings = "[O+]#C[Ni-4](C#[O+])(C#[O+])C#[O+]"
elements = ['Ni', 'Sc']  # named elements to avoid shadowing the built-in list
for i in elements:
    if i in strings:
        print(i)
    else:
        print('nah')
> Ni
> nah
To do the same over a dataframe, use np.where:
import numpy as np
import pandas as pd

df = pd.DataFrame({'smiles': ['sdflk', '[O+]#C[Ni-4](C#[O+])(C#[O+])C#[O+]']})
elements = ['Ni', 'Sc']
df['element'] = np.where(df.smiles.str.contains('|'.join(elements)), 1, 0)  # 1 if the string contains any listed element, else 0
df[df['element'] == 0]  # keep only the rows without a listed element
Note that this is problematic when the dataframe contains a string like Sc1, where S and c actually mean sulfur and an aromatic carbon on a ring rather than scandium (Sc). So we need a way to recognize Sc if and only if there is no digit attached to it. A negative lookahead helps here. Use str.contains rather than str.match, because str.match only matches at the start of the string.
df['Sc'] = df['smiles'].str.contains(r'Sc(?!\d)')
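Another way to sidestep the Sc1 ambiguity is to look only at bracket atoms. This is a rough sketch in plain Python, assuming the strings follow the usual SMILES convention that atoms outside the organic subset are written in square brackets. The helper names bracket_elements and has_unwanted_element are hypothetical, and the regular expression is a simplification that ignores lowercase aromatic bracket atoms such as [se].

```python
import re

# Element symbol at the start of a bracket atom, after an optional isotope number.
# Assumption: simplified pattern, not a full SMILES parser.
BRACKET_ELEMENT = re.compile(r'\[\d*([A-Z][a-z]?)')


def bracket_elements(smiles):
    """Return the set of element symbols that appear inside square brackets."""
    return set(BRACKET_ELEMENT.findall(smiles))


def has_unwanted_element(smiles, unwanted):
    """True if any bracket atom in smiles is one of the unwanted elements."""
    return not bracket_elements(smiles).isdisjoint(unwanted)


unwanted = {'Ni', 'Sc'}
smiles_list = ['[O+]#C[Ni-4](C#[O+])(C#[O+])C#[O+]', 'Sc1ccccc1', 'CCO']
kept = [s for s in smiles_list if not has_unwanted_element(s, unwanted)]
print(kept)  # ['Sc1ccccc1', 'CCO']
```

The same predicate could be applied to a dataframe column row by row, at the cost of being slower than a vectorized str.contains.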

Python- Look for string only in the beginning of a string

I am using difflib.Differ() on two lists.
The way Differ works, it prefixes a line with + if it is unique to sequence 2 and with - if it is unique to sequence 1. The marker goes right at the beginning of the line.
I want to search for sequences in my list that begin with - or + but only if the string begins with this character as the majority of my sequences have these characters in other places within the string.
In the code snippet below, diff_list is the list. I want it to check for a + or - in the very first place in the string value of every sequence in this list:
for x in diff_list:
    if "+" or "-" in x[0]:
        print x
This output seems to print all of the lines even those that don't begin with - or +
Did you try startswith?
s = '+asdf'  # sample data
if s.startswith('+') or s.startswith('-'):
    pass  # do work here
Docs:
https://docs.python.org/3.4/library/stdtypes.html#str.startswith
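The original condition "+" or "-" in x[0] is always true because the non-empty string "+" is truthy on its own, which is why every line was printed. As a small sketch of the fix, str.startswith also accepts a tuple of prefixes, so both markers can be tested in one call. The sample diff_list below is made up for illustration and is written for Python 3.

```python
# Hypothetical sample resembling difflib.Differ output
diff_list = ['+ added line', '- removed line', '  unchanged a+b-c', '? hint']

# startswith accepts a tuple and returns True if any prefix matches
changed = [line for line in diff_list if line.startswith(('+', '-'))]

for line in changed:
    print(line)
```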

Save the index value of blanks in a string to a tuple in Python 3

How can I save the index positions of the spaces in a sentence to a tuple, so I can restore the sentence after removing them and converting the letters?
For example, this text contains spaces, which cause an error because there is no space in the ASCII alphabet. So I want to remove the spaces, convert the letters, and then put the spaces back in their original positions.
import string
text = "This is the text I wish to convert"
alpha = [c for c in list(string.ascii_lowercase)]
alphaTup = tuple(alpha)
myConvert = list(text.lower())
blanks = myConvert.index(' ')
print(blanks)
# This doesn't work
for item in myConvert:
    # find index of item in alphaTup and return value at index + 1
    newLet = alphaTup[alphaTup.index(item) + 1]
    print(newLet)
If you want to know the indices of all the blanks, I suppose using enumerate could be helpful:
blank_indices = [i for i, c in enumerate(text) if c == ' ']
(This gives a list instead of a tuple, but it's not hard to get the tuple if you really need it...)
With that said, looping over a string character by character is not the best way to accomplish most tasks in Python. It's hard for me to tell exactly what transform you're trying to perform on the string, so I'm not sure how to advise better ...
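Putting the enumerate idea to work, here is one possible sketch of the full round trip: record the blank positions, transform the letters, then re-insert the blanks. It assumes the intended transform is "replace each letter with the next one in the alphabet" (as the question's index + 1 suggests), that z wraps around to a, and that the text contains only ASCII letters and spaces. The function name is hypothetical.

```python
import string


def shift_letters_keep_spaces(text, shift=1):
    """Shift each letter forward in the alphabet, leaving spaces where they were."""
    alphabet = string.ascii_lowercase
    # Remember where every blank is, as a tuple of indices
    blank_indices = tuple(i for i, c in enumerate(text) if c == ' ')
    letters = [c for c in text.lower() if c != ' ']
    # Assumption: wrap around at the end of the alphabet (z -> a)
    shifted = [alphabet[(alphabet.index(c) + shift) % len(alphabet)] for c in letters]
    # Inserting in ascending order puts each blank back at its original index
    for i in blank_indices:
        shifted.insert(i, ' ')
    return ''.join(shifted)


print(shift_letters_keep_spaces('ab c'))  # 'bc d'
```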
