Python: Replace single character with multiple values

I have a string with "?" as placeholders. I need to loop through the string and replace each "?" with the next value in a list.
For example:
my_str = "Take the source data from ? and pass to ?;"
params = ['temp_table', 'destination_table']
Here's what I've tried, which works:
params_counter = 0
for letter in my_str:
    if letter == '?':
        # Overwrite my_str with a new version, where the first ? is replaced
        my_str = my_str.replace('?', params[params_counter], 1)
        params_counter += 1
The problem is it seems rather slow to loop through every single letter in the string, multiple times. My example is a simple string but real situations could be strings that are much longer.
What are more elegant or efficient ways to accomplish this?
I saw this question, which is relatively similar, but they're using dictionaries and are replacing multiple values with multiple values, rather than my case which is replacing one value with multiple values.

You don't need to iterate over every character: replace with a count of 1 replaces only the first occurrence of the string.
A clever solution is to split on the string you are searching for, zip the resulting list with a list of replacements of the same length, and then join the pieces:
list1 = my_str.split('?')
params = ['temp_table', 'destination_table']
zipped = zip(list1, params)
replaced = ''.join([elt for sublist in zipped for elt in sublist])
In [19]: replaced
Out[19]: 'Take the source data from temp_table and pass to destination_table'
This also works with multi-character search strings, which is what would kill your character-by-character method:
my_str = "Take the source data from magicword and pass to magicword;"
list1 = my_str.split('magicword')
params = ['temp_table', 'destination_table']
zipped = zip(list1, params)
replaced = ''.join([elt for sublist in zipped for elt in sublist])
In [25]: replaced
Out[25]: 'Take the source data from temp_table and pass to destination_table'
Note that if params has fewer items than there are occurrences of your search string, the result is cut off after the last parameter:
In []: my_str = "Take the source data from ? and pass to ?; then do again with ? and to ?"
Out[22]: 'Take the source data from temp_table and pass to destination_table'
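Also note that the last piece, after the final occurrence of the search string, is dropped :(. Something like
replaced = ''.join([elt for sublist in zip(list1, params) for elt in sublist] + [list1[-1]])
will do the trick (zip is called again here because a zip object can only be consumed once in Python 3).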
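A minimal sketch of the same split-and-interleave idea that keeps the text after the last placeholder and leaves any placeholder without a matching parameter untouched. The function name fill_placeholders and its placeholder argument are illustrative and not part of the answer above.

```python
def fill_placeholders(template, params, placeholder="?"):
    """Interleave the pieces of `template` with `params` (hypothetical helper)."""
    pieces = template.split(placeholder)
    out = [pieces[0]]
    for i, piece in enumerate(pieces[1:]):
        # Use the i-th parameter if there is one; otherwise keep the placeholder.
        out.append(params[i] if i < len(params) else placeholder)
        out.append(piece)
    return "".join(out)


print(fill_placeholders("Take the source data from ? and pass to ?;",
                        ["temp_table", "destination_table"]))
```

Because the piece after each placeholder is always appended, the trailing ";" survives, and extra placeholders stay visible instead of silently truncating the string.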

While @E. Serra's answer is a great answer, the comment from @jasonharper made me realize there's a MUCH simpler answer. So incredibly simple that I'm surprised that I completely missed it!
Instead of looping through the string, I should loop through the parameters. That'll allow me to replace the first instance of "?" with the current parameter that I'm looking at. Then I overwrite the string, allowing it to execute correctly on the next iteration.
Unlike the other solution posted, it won't cut off the end of my string either.
my_str = "Take the source data from ? and pass to ?;"
params = ['temp_table', 'destination_table']
for item in params:
    # Replace only the first remaining "?" with the current parameter
    my_str = my_str.replace('?', item, 1)
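Looping over the parameters still rescans the string from the start on every iteration. If that matters for very long strings, one possible alternative is a single pass with re.sub and a replacement function. This is a different technique from the loop above, and the helper name substitute_in_order is made up for illustration.

```python
import re


def substitute_in_order(template, params, placeholder="?"):
    """Replace each placeholder with the next parameter in one pass (hypothetical helper)."""
    values = iter(params)
    # The callback runs once per placeholder, left to right. When the
    # parameters run out, the placeholder itself is put back unchanged.
    return re.sub(re.escape(placeholder),
                  lambda match: next(values, match.group(0)),
                  template)


my_str = "Take the source data from ? and pass to ?;"
params = ['temp_table', 'destination_table']
print(substitute_in_order(my_str, params))
```

A side effect of the single pass is that a parameter which itself contains "?" is not substituted again, whereas the loop with replace would treat that "?" as the next placeholder.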

Related

How to remove a substrings from a list of strings?

I have a list of strings, all of which have a common property, they all go like this "pp:actual_string". I do not know for sure what the substring "pp:" will be, basically : acts as a delimiter; everything before : shouldn't be included in the result.
I have solved the problem using the brute force approach, but I would like to see a clever method, maybe something like regex.
Note : Some strings might not have this "pp:string" format, and could be already a perfect string, i.e. without the delimiter.
This is my current solution:
ll = ["pp17:gaurav","pp17:sauarv","pp17:there","pp17:someone"]
res=[]
for i in ll:
    g=""
    for j in range(len(i)):
        if i[j] == ':':
            index=j+1
    res.append(i[index:len(i)])
print(res)
Is there a way that I can do it without creating an extra list ?
Whilst regex is an incredibly powerful tool with a lot of capabilities, using a "clever method" is not necessarily the best idea if you are unfamiliar with its principles.
Your problem is one that can be solved without regex by splitting on the : character using the str.split() method, and just returning the last part by using the [-1] index value to represent the last (or only) string that results from the split. This will work even if there isn't a :.
list_with_prefixes = ["pp:actual_string", "perfect_string", "frog:actual_string"]
cleaned_list = [x.split(':')[-1] for x in list_with_prefixes]
print(cleaned_list)
This is a list comprehension that takes each of the strings in turn (x) and splits it on the : character. The split returns a list containing the prefix (if it exists) and the suffix, and the comprehension builds a new list with only the suffix (i.e. item [-1] in the list that results from the split). In this example, it returns:
['actual_string', 'perfect_string', 'actual_string']
Here are a few options, based upon different assumptions.
Most explicit
if s.startswith('pp:'):
    s = s[len('pp:'):] # aka 3
If you want to remove anything before the first :
s = s.split(':', 1)[-1]
Regular expressions (these need import re):
Same as startswith
s = re.sub(r'^pp:', '', s)
Same as split, but more careful with 'pp:' and slower
s = re.match(r'(?:^pp:)?(.*)', s).group(1)
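The options above differ once a string contains more than one colon or a prefix other than pp:. The following sketch runs three of them side by side; the sample strings are invented for illustration.

```python
import re

# Sample inputs (made up): a plain prefix, no prefix, and several colons.
samples = ["pp:actual_string", "perfect_string", "a:b:c"]

# Keep only what follows the LAST colon.
last_part = [s.split(':')[-1] for s in samples]

# Keep everything after the FIRST colon.
after_first = [s.split(':', 1)[-1] for s in samples]

# Remove only a literal leading "pp:".
only_pp = [re.sub(r'^pp:', '', s) for s in samples]

print(last_part)
print(after_first)
print(only_pp)
```

Which variant is right depends on whether colons can appear inside the part you want to keep.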

Need to match string in one list to string in another list

I have 2 lists, the first list keywords contains keywords ['aca','old'] and the second list Tbl_names contains table names from a database. I need to fetch the table names which match the keywords in the first list. The problem is that using the in operator in Python is giving me the wrong results if there is a staging_vaca_2019 or tapi_sold table in the second list, as these two outputs should not be returned. If I use the '==' operator then a table with the name 'aca_2019' will not be returned, which should be returned.
I am saving the matching table names in another list called Tbl_keywords.
The problem is that if I try to separate using delimiters, then I won't be able to append it like I am doing in the code below.
for a in keywords:
    for j in Tbl_names:
        if a in j:
            Tbl_keywords.append(j)
With the information given, I'd just add .split("_") to your j.
It all depends on the format of your table names. If your table names are always separated by an underscore (like your example 'aca_2019'), then you can split the table names on underscores into a new list. So, using the same example, you can use 'aca_2019'.split("_"), giving you the following list: ['aca', '2019'].
You can then check if 'aca' is in this list. Even if there is no underscore, you will always receive a list by using split(). This makes sure you will not match aca to vaca, as would be the case when using in against a string (like in your working example).
for a in keywords:
    for j in Tbl_names:
        if a in j.split("_"):
            Tbl_keywords.append(j)
But if your table names are stored differently (we cannot know), then I'd start looking into regular expressions (the re module in Python).
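As a rough idea of what a regular-expression version could look like, the sketch below still assumes underscore-delimited names and matches a keyword only when it forms a whole token. The function name tables_matching and the table name 'old_data' are invented for illustration, and the pattern would need adjusting for other naming schemes.

```python
import re


def tables_matching(keywords, table_names):
    """Return table names containing a keyword as a whole token (hypothetical helper)."""
    # Assumes `keywords` is non-empty. re.escape guards against special characters.
    alternatives = "|".join(re.escape(k) for k in keywords)
    # (?:^|_) and (?:_|$) require an underscore or a string boundary on each side.
    pattern = re.compile(rf"(?:^|_)(?:{alternatives})(?:_|$)")
    return [name for name in table_names if pattern.search(name)]


keywords = ['aca', 'old']
Tbl_names = ['aca_2019', 'staging_vaca_2019', 'tapi_sold', 'old_data']
print(tables_matching(keywords, Tbl_names))
```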
If you want to continue to use the double loop like you have.
Change the if statement.
Old:
if a in j:
New:
if a == j:
I provided both answers, in case you want exact matches only, or just whole words in the table name. You want keywords to be a set. You can iterate over just the table names (not both lists, which is O(n^2)) and do an O(1) set lookup, n times.
I used list comprehension syntax instead of a for loop. It's slightly more efficient, but not a big algorithmic problem like your double for loop. If you want, I can translate those to for loops, but I recommend getting used to them.
# keywords should be a set for faster lookup
keywords = {'aca', 'old', 'exact', 'partial'}
# Tbl_names will be a list (resultset)
table_names = ['staging_vaca_2019', 'tapi_sold', 'exact', 'find_partial_match']
# exact matches
exact_matches = [table for table in table_names if table in keywords]
print ('exact matches:', exact_matches)
# keywords that appear as whole words inside the table name
all_table_keywords = [word for table in table_names for word in table.split('_') if word in keywords]
print ('partial matches:', all_table_keywords)
exact matches: ['exact']
partial matches: ['exact', 'partial']
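The "partial matches" list above collects the matching keywords. If the table names themselves are what is needed, a small variation could keep each table whose underscore-separated words overlap the keyword set. The sample table names here are assumptions based on the question.

```python
# Keywords as a set for fast membership tests.
keywords = {'aca', 'old'}
# Sample table names (assumed for illustration).
table_names = ['aca_2019', 'staging_vaca_2019', 'tapi_sold', 'old']

# Keep a table when at least one of its words is a keyword.
matching_tables = [
    table for table in table_names
    if not keywords.isdisjoint(table.split('_'))
]
print(matching_tables)
```

This also covers exact matches, since a name without underscores splits into a single word.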
It should be possible to iterate only through Tbl_names:
result = [item for item in Tbl_names if item in keywords]

Searching for duplicates and remove them

sometimes I have a string like this
string = "Hett, Agva,"
and sometimes I will have duplicates in it.
string = "Hett, Agva, Delf, Agva, Hett,"
how can I check if my string has duplicates and then if it does remove them?
UPDATE.
So in the second string I need to remove the repeated Agva, and Hett, because each of them appears twice in the string.
Iterate over the parts (words) and add each part to a set of seen parts and to a list of parts if it is not already in that set. Finally, reconstruct the string:
seen = set()
parts = []
for part in string.split(','):
    if part.strip() not in seen:
        seen.add(part.strip())
        parts.append(part)
no_dups = ','.join(parts)
(note that I had to add some calls to .strip() as there are spaces at the start of some of the words which this method removes)
which gives:
'Hett, Agva, Delf,'
Why use a set?
Querying whether an element is in a set is O(1) in the average case, since elements are stored by hash, which makes lookup constant time. On the other hand, lookup in a list is O(n), as Python must iterate over the list until the element is found. This means a set is much more efficient for this task: for each new word, you can instantly check whether you have seen it before, whereas you'd otherwise have to iterate over a list of seen elements, which takes much longer for a large list.
Oh, and to just check if there are duplicates, query whether the length of the split list is the same as the length of the set of that list (which removes the duplicates but loses the order).
I.e.
def has_dups(string):
    parts = string.split(',')
    return len(parts) != len(set(parts))
which works as expected:
>>> has_dups('Hett, Agva,')
False
>>> has_dups('Hett, Agva, Delf, Agva, Hett,')
True
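One caveat: the check above compares the raw pieces, so 'Hett' and ' Hett' (with a leading space) count as different items. A possible variant strips each piece first and ignores the empty piece left by a trailing comma. The name has_dups_stripped is made up for this sketch.

```python
def has_dups_stripped(string):
    """Duplicate check on whitespace-stripped, non-empty parts (hypothetical variant)."""
    parts = [part.strip() for part in string.split(',') if part.strip()]
    return len(parts) != len(set(parts))


print(has_dups_stripped('Hett, Agva,'))
print(has_dups_stripped('Hett, Delf, Hett'))
```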
You can use toolz.unique, or equivalently the unique_everseen recipe in the itertools docs, or equivalently @JoeIddon's explicit solution.
Here's the solution using 3rd party toolz:
x = "Hett, Agva, Delf, Agva, Hett,"
from toolz import unique
res = ', '.join(filter(None, unique(x.replace(' ', '').split(','))))
print(res)
'Hett, Agva, Delf'
I've removed whitespace and used filter to clean up a trailing , which may not be required.
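If you will only receive strings in this format, you can do the following:
import numpy as np
# Split on commas, strip spaces, and drop the empty piece left by the trailing comma
string_words = [w.strip() for w in string.split(',') if w.strip()]
# np.unique returns the unique words in sorted order
uniq_words = np.unique(string_words)
string = ", ".join(uniq_words)
This code splits the string into a list of words, finds the unique items, and then merges them back into a string. Note that np.unique sorts its result, so the original word order is not kept.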
If the order of the words is important, you can make a list of the words in the string and then iterate over that list to build a new list of unique words.
string = "Hett, Agva, Delf, Agva, Hett,"
words_list = string.split()
unique_words = []
for w in words_list:
    if w not in unique_words:
        unique_words.append(w)
new_string = ' '.join(unique_words)
print(new_string)
Output:
'Hett, Agva, Delf,'
Quick and easy approach (note that a set does not keep the original order):
', '.join(
    set(
        filter(None, [i.strip() for i in string.split(',')])
    )
)
Hope it helps. Please feel free to ask if anything is not clear :)
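If the original order should be kept without a third-party package, one option is dict.fromkeys, which keeps the first occurrence of each key and preserves insertion order in Python 3.7 and later. This is a sketch using the string from the question.

```python
string = "Hett, Agva, Delf, Agva, Hett,"

# Strip spaces and drop the empty piece left by the trailing comma.
words = [w.strip() for w in string.split(',') if w.strip()]

# dict.fromkeys keeps only the first occurrence of each word, in order.
no_dups = ', '.join(dict.fromkeys(words))
print(no_dups)
```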

Removing item in list during loop

I have the code below. I'm trying to remove two strings from the lists predict_strings and test_strings if one of them has been found in the other. The issue is that I have to split each of them up and check whether a "portion" of one string is inside the other. If there is, then I just say there is a match and delete both strings from the lists so they are no longer iterated over.
ValueError: list.remove(x): x not in list
I get the above error though and I am assuming this is because I can't delete the string from test_strings since it is being iterated over? Is there a way around this?
Thanks
for test_string in test_strings[:]:
    for predict_string in predict_strings[:]:
        split_string = predict_string.split('/')
        for string in split_string:
            if (split_string in test_string):
                no_matches = no_matches + 1
                # Found match so remove both
                test_strings.remove(test_string)
                predict_strings.remove(predict_string)
Example input:
test_strings = ['hello/there', 'what/is/up', 'yo/do/di/doodle', 'ding/dong/darn']
predict_strings =['hello/there/mister', 'interesting/what/that/is']
so I want there to be a match between hello/there and hello/there/mister and for them to be removed from the list when doing the next comparison.
After one iteration I expect it to be:
test_strings == ['what/is/up', 'yo/do/di/doodle', 'ding/dong/darn']
predict_strings == ['interesting/what/that/is']
After the second iteration I expect it to be:
test_strings == ['yo/do/di/doodle', 'ding/dong/darn']
predict_strings == []
You should never try to modify an iterable while you're iterating over it, which is still effectively what you're trying to do. Make a set to keep track of your matches, then remove those elements at the end.
Also, your line for string in split_string: isn't really doing anything. You're not using the variable string. Either remove that loop, or change your code so that you're using string.
You can use augmented assignment to increase the value of no_matches.
no_matches = 0
found_in_test = set()
found_in_predict = set()
for test_string in test_strings:
    test_set = set(test_string.split("/"))
    for predict_string in predict_strings:
        split_strings = set(predict_string.split("/"))
        if not split_strings.isdisjoint(test_set):
            no_matches += 1
            found_in_test.add(test_string)
            found_in_predict.add(predict_string)
for element in found_in_test:
    test_strings.remove(element)
for element in found_in_predict:
    predict_strings.remove(element)
From your code it seems likely that two split_strings match the same test_string. The first time through the loop removes test_string, the second time tries to do so but can't, since it's already removed!
You can try breaking out of the inner for loop if it finds a match, or use any instead.
import itertools
for test_string, predict_string in itertools.product(test_strings[:], predict_strings[:]):
    if any(s in test_string for s in predict_string.split('/')):
        no_matches += 1 # isn't this counter-intuitive?
        test_strings.remove(test_string)
        predict_strings.remove(predict_string)
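Iterating over copies and removing from the originals can still try to remove the same string twice if it matches more than one partner. A minimal sketch of the "break after the first match" idea follows, assuming each test string should pair with at most one predict string, which is my reading of the expected output in the question. The function name remove_matches is hypothetical.

```python
def remove_matches(test_strings, predict_strings):
    """Pair each test string with at most one predict string (hypothetical helper)."""
    remaining_predict = list(predict_strings)
    remaining_test = []
    matches = 0
    for test_string in test_strings:
        test_parts = set(test_string.split('/'))
        for predict_string in remaining_predict:
            if test_parts & set(predict_string.split('/')):
                # Removing here is safe because we leave the loop immediately.
                remaining_predict.remove(predict_string)
                matches += 1
                break
        else:
            # No predict string shared a part with this test string: keep it.
            remaining_test.append(test_string)
    return remaining_test, remaining_predict, matches


test_strings = ['hello/there', 'what/is/up', 'yo/do/di/doodle', 'ding/dong/darn']
predict_strings = ['hello/there/mister', 'interesting/what/that/is']
print(remove_matches(test_strings, predict_strings))
```

The inputs are left untouched and new lists are returned, which avoids mutating a list while looping over it.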

How to replace a list of words with a string and keep the formatting in python?

I have a list containing the lines of a file.
list1[0]="this is the first line"
list1[1]="this is the second line"
I also have a string.
example="TTTTTTTaaaaaaaaaabcccddeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeefffff"
I want to replace list1[0] with the string (example). However I want to keep the word length. For example the new list1[0] should be "TTTT TT TTa aaaaa aaaa". The only solution I could come up with was to turn the string example into a list and use a for loop to read letter by letter from the string list into the original list.
for line in open(input, 'r'):
    list1[i] = listString[i]
    i=i+1
However this does not work from what I understand because Python strings are immutable? What's a good way for a beginner to approach this problem?
I'd probably do something like:
orig = "this is the first line"
repl = "TTTTTTTaaaaaaaaaabcccddeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeefffff"
def replace(orig, repl):
    r = iter(repl)
    result = ''.join([' ' if ch.isspace() else next(r) for ch in orig])
    return result
If repl could be shorter than orig, consider r = itertools.cycle(repl)
This works by creating an iterator out of the replacement string, then iterating over the original string, keeping the spaces, but using the next character from the replacement string instead of any non-space characters.
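For the case where repl may be shorter than orig, the itertools.cycle suggestion can be sketched as follows. The function name replace_keep_spaces is made up, and the sketch assumes repl is not empty.

```python
from itertools import cycle


def replace_keep_spaces(orig, repl):
    """Fill the non-space positions of `orig` with characters from `repl`.

    Assumes `repl` is non-empty; cycle() starts over when `repl` runs out.
    """
    r = cycle(repl)
    return ''.join(' ' if ch.isspace() else next(r) for ch in orig)


print(replace_keep_spaces("this is", "ab"))
```

With cycle, the replacement characters simply repeat from the beginning once they are exhausted, so the result always has the same length as orig.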
The other approach you could take would be to note the indexes of the spaces in one pass through orig, then insert them at those indexes in a pass of repl and return a slice of the result
def replace(orig, repl):
    spaces = [idx for idx,ch in enumerate(orig) if ch.isspace()]
    repl = list(repl)
    for idx in spaces:
        repl.insert(idx, " ")
        # add a space before that index
    return ''.join(repl[:len(orig)])
However, I can't imagine the second approach being any faster, it is certain to be less memory-efficient, and I don't find it easier to read (in fact I find it HARDER to read!). It also doesn't have a simple workaround if repl is shorter than orig (I guess you could do repl *= 2, but that's uglier than sin and still doesn't guarantee it'll work).
