I am trying to find the most efficient, Pythonic way to find every substring of a string that is surrounded by $$ signs and replace each one with the value held in another variable.
The string in question looks like this: "$$firstOccurance$$ some other words here then $$secondOccurance$$"
My solution below is not working, I think because it's not able to differentiate between the first time it finds the $$ sign and the second time it finds it again. This results in nothing getting printed. There can be many occurrences of strings between the $$ value.
the_long_string = "$$firstOccurance$$ some other words here then $$secondOccurance$$"
replacement_value = "someNewString"
print(the_long_string[the_long_string.index('$$')+len('$$'):the_long_string.index('$$')])
What would be the way to correct and fix what I have done so far?
The end result has to look like this someNewString some other words here then someNewString, where there is no $$ sign left.
You would use a regular expression substitution:
re.sub(r'\$\$\w+\$\$', replacement_value, the_long_string)
Here is a complete example building on @mbuchove's example. Use the regex module (re) to make it easier for you.
The key is you need to use re.sub() to find and replace all matches.
import re
the_long_string = "$$firstOccurance$$ some other words here then $$secondOccurance$$"
replacement_value = "someNewString"
the_long_string = re.sub(r"\$\$\w+\$\$", replacement_value, the_long_string)
print(the_long_string)
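As a small sketch of why the original attempt printed nothing: both calls to index('$$') return the position of the same first match, so the slice runs from 2 to 0 and is empty. The variable names start, end, empty_slice, end_fixed and first_token are only illustrative.

```python
import re

the_long_string = "$$firstOccurance$$ some other words here then $$secondOccurance$$"

start = the_long_string.index('$$') + len('$$')   # 2
end = the_long_string.index('$$')                 # 0: the same first match again
empty_slice = the_long_string[start:end]          # slice [2:0] is empty

# Passing a start offset makes index() look for the closing $$ instead.
end_fixed = the_long_string.index('$$', start)
first_token = the_long_string[start:end_fixed]

# re.sub() handles every $$...$$ pair in one pass.
result = re.sub(r'\$\$\w+\$\$', "someNewString", the_long_string)

print(repr(empty_slice))
print(first_token)
print(result)
```

Note that \w+ only matches letters, digits and underscores, so a placeholder containing spaces or punctuation would need a different pattern.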
If you want to continue with the search manually (without regex), you would need to do something a lot more complicated, still finding all of the $$[n]Occurence$$ by hand.
prev_index = -1
search_start = 0
index = the_long_string.find("$$", search_start)
while index != -1:
    print(index)
    if prev_index != -1:
        # Closing set of $$
        end_index = index + len("$$")
        value_to_replace = the_long_string[prev_index:end_index]
        the_long_string = the_long_string.replace(value_to_replace, replacement_value)
        # The string changed length, so resume right after the inserted value
        search_start = prev_index + len(replacement_value)
        prev_index = -1 # Reset the search for an opening set of $$
    else:
        # Opening set of $$
        prev_index = index
        search_start = index + len("$$")
    # Check for the next match
    index = the_long_string.find("$$", search_start)
print(the_long_string)
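Another non-regex option, offered only as a sketch, is to split on the delimiter: with balanced $$ pairs, every odd-numbered piece of the split is a placeholder. The helper name replace_delimited and its parameters are hypothetical, and the code assumes the delimiters are balanced and not nested.

```python
def replace_delimited(text, replacement, delimiter="$$"):
    """Replace every delimiter-wrapped token in text with replacement."""
    parts = text.split(delimiter)
    # Balanced delimiters always produce an odd number of pieces.
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced delimiter in input")
    # Even positions are ordinary text, odd positions are the wrapped tokens.
    return "".join(
        replacement if i % 2 else part
        for i, part in enumerate(parts)
    )


the_long_string = "$$firstOccurance$$ some other words here then $$secondOccurance$$"
print(replace_delimited(the_long_string, "someNewString"))
```

This avoids tracking indices by hand, at the cost of rejecting input where a $$ has no partner.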
Related
How can I make a loop that eliminates zeroes from a list of strings that looks something like the following?
List
GR0030
GR00000000013
GR093
I'd like to eliminate the zeroes between the GR and the first number different than zero. I've thought I could solve this problem with something like this:
entry = ""
for x in list:
    if x.isalpha():
        entry = entry + x
    else:
        if x == 0:
            entry = entry
        else:
            entry = entry + x[(list.index(x)):-1]
            break
list1.append(entry) # the answer list
But, it does not work. I'm just getting a list full of GR in each row. What am I doing wrong?
A regular expression will do here. The expression matches the first group of zeroes, and replaces them with an empty string. To prevent us from reading past the first group, we set count=1.
Your approach could work, but you'd have to keep track of whether or not you've seen a zero before. You also should try to avoid repeated concatenation of strings, as it isn't very efficient.
import re
def strip_intermediate_zeroes(s):
    return re.sub('0+', '', s, count=1)
items = ['GR0030', 'GR00000000013', 'GR093']
print(list(map(strip_intermediate_zeroes, items)))
The above code snippet assumes that there's at least one zero after "GR". If such an assumption cannot be made, you can explicitly check for that assumption as a quick fix:
def strip_intermediate_zeroes(s):
    if s.startswith('GR0'):
        return re.sub('0+', '', s, count=1)
    return s
This seems like a natural fit for a regex combined with re.sub(). (?<=^GR)0* means 0 or more zeros that follow 'GR' at the beginning of a string.
import re
l = [
    'GR0030',
    'GR00000000013',
    'GR093',
]
rx = re.compile(r'(?<=^GR)0*')
[rx.sub('', s) for s in l]
# ['GR30', 'GR13', 'GR93']
This is very specific in that it won't change strings like 'SP0091', 'ADGR0000400013', or '000ab'.
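A quick check of that claim, as a sketch: the same compiled pattern is applied to the strings it should change and to the ones it should leave alone. 'GR93' is an extra input added here to show that a string with no zeroes after "GR" also passes through untouched.

```python
import re

rx = re.compile(r'(?<=^GR)0*')

# Strings with zeroes directly after a leading "GR" are shortened.
changed = [rx.sub('', s) for s in ['GR0030', 'GR00000000013', 'GR093']]

# Anything else is returned as-is.
unchanged_inputs = ['SP0091', 'ADGR0000400013', '000ab', 'GR93']
unchanged = [rx.sub('', s) for s in unchanged_inputs]

print(changed)
print(unchanged)
```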
so I have this list:
tokens = ['<greeting>', 'Hello World!', '</greeting>']
the task is to count the number of strings that have XML tags. what I have so far (that works) is this:
tokens = ['<greeting>', 'Hello World!', '</greeting>']
count = 0
for i in range(len(tokens)):
    if tokens[i].find('>') >1:
        print(tokens[i])
        count += 1
        print(count)
    else:
        count += 0
what puzzles me is that I'm inclined in using the following line for the if statement
if tokens[i].find('>') == True:
but it won't work.
what's the optimal way of writing this loop, in your opinion?
many thanks!
alex.
One issue I see with your approach is that it might capture false positives (e.g. "gree>ting"), so checking only for a closing > is not enough.
If your definition of "contains a tag" simply means checking whether the string contains a < followed by some characters, then another >, you could use a regular expression (keeping this in mind in case you were thinking about something more complex).
This, combined with the compact list comprehension proposed by @aws_apprentice in the comments, gives us:
import re
regex = "<.+>"
count = sum([1 if re.search(regex, t) else 0 for t in tokens])
print(count) #done!
Explanation:
The one-liner we used is a list comprehension, which builds a list of ones and zeros. For each string t in tokens, if the string contains a tag, 1 is appended to the new list, otherwise 0. re.search checks whether the string (or a substring of it) matches the given regex.
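The same count can also be written with a generator expression, which skips building the intermediate list. This is a minimal sketch; 'gree>ting' is added to the sample list only to show that the false positive mentioned above is not counted, and tag_pattern is an illustrative name.

```python
import re

tokens = ['<greeting>', 'Hello World!', '</greeting>', 'gree>ting']
tag_pattern = re.compile(r"<.+>")

# Count one for each token in which the pattern is found anywhere.
count = sum(1 for t in tokens if tag_pattern.search(t))
print(count)
```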
The following approach checks for the opening < at the start of the string and also checks for > at the end of the string.
In [4]: tokens = ['<greeting>', 'Hello World!', '</greeting>']
In [5]: sum([1 if i.startswith('<') and i.endswith('>') else 0 for i in tokens])
Out[5]: 2
Anis R.'s answer should work fine, but this is a non-regex alternative (and not as elegant; in fact I would call it clumsy).
This code just looks at the beginning and end of each list element for angle brackets. I'm a novice to the extreme, but I think range(len(tokens)) is redundant and the loop can be simplified like this as well.
tokens = ['<greeting>', 'Hello World!', '</greeting>']
count = 0
for i in tokens:
    if i[0].find('<') == 0 and i[-1].find('>') != -1:
        print(i)
        count += 1
print(count)
str.find() returns an index position, not a boolean, as others have noted, so your if statement must reflect that. A .find() with no result returns -1. For the first angle bracket, checking for an index of 0 works as long as your data follows the scheme in your example list. The second check uses != -1 on the last character of the list item; because i[-1] is a single character, .find() can only return 0 or -1 there, so == 0 would work equally well. The only time this approach wouldn't work is if your list items can look like '<greeting> Hello'.
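To make the point about return values concrete, here is a small sketch of what find() gives back for the sample list, why comparing it to True misleads, and how the in operator expresses a plain containment test. The list names are illustrative.

```python
tokens = ['<greeting>', 'Hello World!', '</greeting>']

# find() returns the index of '>' or -1 when it is absent.
positions = [t.find('>') for t in tokens]

# True equals 1 in Python, so this only holds when '>' sits at index 1.
equals_true = [t.find('>') == True for t in tokens]

# The `in` operator gives a real boolean for "does it contain '>'".
contains = ['>' in t for t in tokens]
count = sum(contains)

print(positions)
print(equals_true)
print(contains, count)
```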
Happy to be corrected by others, that's why I'm here.
This is my first SO post, so go easy! I have a script that counts how many matches occur in a string named postIdent for the substring ff. Based on this it then iterates over postIdent and extracts all of the data following it, like so:
substring = 'ff'
global occurences
occurences = postIdent.count(substring)
x = 0
while x <= occurences:
    for i in postIdent.split("ff"):
        rawData = i
        required_Id = rawData[-8:]
        x += 1
To explain further, if we take the string "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff", it is clear there are 3 instances of ff. I need to get the 8 preceding characters at every instance of the substring ff, so for the first instance this would be 909a9090.
With the rawData, I essentially need to offset the variable required_Id by -1 when I get the data out of the split() method, as I am currently getting the last 8 characters of the current string, not the string I have just split. Another way of doing it could be to pass the current required_Id to the next iteration, but I've not been able to do this.
The split method gets everything after the matching string ff.
Using the partition method can get me the data I need, but does not allow me to iterate over the string in the same way.
Get the last 8 characters of each split using a slice operation in a list comprehension:
s = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
print([x[-8:] for x in s.split('ff') if x])
# ['909a9090', '90434390', 'sdfs9000']
Not a difficult problem, but tricky for a beginner.
If you split the string on 'ff' then you appear to want the eight characters at the end of every substring but the last. The last eight characters of string s can be obtained using s[-8:]. All but the last element of a sequence x can similarly be obtained with the expression x[:-1].
Putting both those together, we get
subject = '090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff'
for x in subject.split('ff')[:-1]:
    print(x[-8:])
This should print
909a9090
90434390
sdfs9000
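The two split-based versions agree on the sample string, but they differ when the input does not end with 'ff'. The sketch below compares them; the function names and the "aaaaaaaaffbbbb" input are invented for the illustration.

```python
def preceding_with_filter(s, sep='ff', width=8):
    # Drops empty pieces, but keeps trailing text that follows the last separator.
    return [x[-width:] for x in s.split(sep) if x]


def preceding_drop_last(s, sep='ff', width=8):
    # Drops the final piece, which never precedes a separator.
    return [x[-width:] for x in s.split(sep)[:-1]]


sample = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
trailing = "aaaaaaaaffbbbb"

print(preceding_with_filter(sample))
print(preceding_drop_last(sample))
print(preceding_with_filter(trailing))   # includes 'bbbb', which precedes no 'ff'
print(preceding_drop_last(trailing))
```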
I wouldn't do this with split myself, I'd use str.find. This code isn't fancy but it's pretty easy to understand:
fullstr = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
search = "ff"
found = None # offset of the next match
last = 0
l = 8
print(fullstr)
while True:
    found = fullstr.find(search, last)
    if found == -1:
        break
    # Clamp at 0 so a match near the start doesn't produce a negative slice index
    preceding = fullstr[max(0, found-l):found]
    print("At position {} found preceding characters '{}' ".format(found, preceding))
    last = found + len(search)
Overall I like Austin's answer more; it's a lot more elegant.
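The same position-based idea can be sketched with re.finditer, which yields a match object for each non-overlapping 'ff'. The function name preceding_chars and its parameters are hypothetical.

```python
import re


def preceding_chars(text, marker="ff", width=8):
    """Return the characters (up to width) before each occurrence of marker."""
    results = []
    for m in re.finditer(re.escape(marker), text):
        # Clamp at 0 so a match near the start cannot wrap around.
        start = max(0, m.start() - width)
        results.append(text[start:m.start()])
    return results


fullstr = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
print(preceding_chars(fullstr))
```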
I have a regex python script to go over Hex data and find patterns which looks like this
r"(.{6,}?)\1{2,}"
All it does is look for hex strings at least 6 characters long that repeat at least two more times. My issue is that it also finds substrings inside larger strings it has already found. For example:
If the data were "a00b00a00b00a00b00a00b00a00b00a00b00", it would find 2 instances of "a00b00a00b00a00b00" and 6 instances of "a00b00". How could I keep only the longest patterns found and skip looking for shorter ones, without adding more hardcoded parameters?
#!/usr/bin/python
import fnmatch
pattern_string = "abcdefabcdef"
def print_pattern(pattern, num):
    n = num
    # splits the pattern into chunks of n characters, in this case 6
    new_pat = [pattern[i:i+n] for i in range(0, len(pattern), n)]
    # this is the hit counter for matches
    match = 0
    # stores the most recent chunk examined
    new_match = ""
    # loops through the chunks to see how many match the first one
    for new in new_pat:
        new_match = new
        print(new)
        # if it matches the first chunk, keep adding to match
        if fnmatch.fnmatch(new, new_pat[0]):
            match += 1
    if match:
        print("Count: %d\nPattern:%s" % (match, new_match))
    # returns the last chunk examined
    return new_match
print_pattern(pattern_string, 6)
Regex is better, but this was more fun to write.
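For the regex itself, one direction that may be worth trying is making the group greedy instead of lazy, so the engine tries the longest candidate unit first. This is only a sketch on the sample data from the question, not a general solution: a greedy group can backtrack heavily on long inputs, and "longest unit that still repeats at least twice more" may not match every definition of "longest pattern".

```python
import re

data = "a00b00" * 6  # the 36-character sample from the question

# Lazy group: settles for the shortest unit that repeats.
lazy = re.search(r"(.{6,}?)\1{2,}", data)

# Greedy group: tries the longest unit first and backs off until it repeats.
greedy = re.search(r"(.{6,})\1{2,}", data)

print(lazy.group(1))
print(greedy.group(1))
```

On this input the lazy pattern reports the unit "a00b00", while the greedy one reports "a00b00a00b00"; both matches span the whole string.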
I have a file and I want to find the index of a special character sequence (\*) in it. This sequence might appear in several places in the file. For example:
hello \*this is a file*/
print "good\* morning"
I want to use the find command to find the index of \* only outside double quotes, not inside them. Is there a way to implement this in Python?
I know that find returns the index of the first match, and I have a for loop that checks for this sequence and prints the index. But whenever it encounters \* inside double quotes, it should skip that occurrence and move on to the next one in the file.
# Raw strings keep the backslash literal and avoid invalid-escape warnings
str1 = r'hello \*this is a file*/'
str2 = r'print "good\* morning"'
def find_index(_str):
    is_in_quotes = 0
    idx = 0
    while idx < len(_str):
        if _str[idx] == '"':
            # Toggle the flag each time a double quote is seen
            is_in_quotes = 1 - is_in_quotes
        elif not is_in_quotes:
            if _str[idx: idx+2] == r'\*':
                return idx
        idx += 1
    return -1
print(find_index(str1))
print(find_index(str2))
The function returns -1 if it doesn't find it.
Let me know if it meets all your needs.
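Since the question mentions that the sequence can appear in several places, here is a sketch of a variant that collects every index outside double quotes instead of stopping at the first. The function name find_all_outside_quotes and the third sample string are made up, and the code assumes quotes are never escaped.

```python
def find_all_outside_quotes(text, needle=r'\*'):
    """Return the index of every needle that is not inside double quotes."""
    in_quotes = False
    hits = []
    idx = 0
    while idx < len(text):
        if text[idx] == '"':
            # Toggle quoted state on each double quote.
            in_quotes = not in_quotes
            idx += 1
        elif not in_quotes and text.startswith(needle, idx):
            hits.append(idx)
            idx += len(needle)
        else:
            idx += 1
    return hits


print(find_all_outside_quotes(r'hello \*this is a file*/'))
print(find_all_outside_quotes(r'print "good\* morning"'))
print(find_all_outside_quotes(r'a \* "b \*" c \*'))
```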