Slicing by start and stop string values in Python - python

I have a string in which there are certain values that I need to extract from it. For example: "FEFEWFSTARTFFFPENDDCDC". How could I make an expression that would take a slice from "START" all the way to "END"?
I tried doing this previously by creating functions which used a for loop and string.find("START") to locate the beginning and ends, but this didn't appear to work effectively and seemed overly complex. Is there an easier way to do this without using complex loops?
EDIT:
Forgot this part. What if there were different end values? In other words, instead of just ending with "END", the values "DONE" and "NOMORE" would also end it? And in addition to that, there were multiple starts and ends throughout the string. For example: "STARTFFEFFDONEFEWFSTARTFEFFENDDDW".
EDIT2: Sample run: Start value: ATG. End values: TAG,TAA,TGA
"Enter a string": TTATGTTTTAAGGATGGGGCGTTAGTT
TTT
GGGCGT
And
"Enter a string": TGTGTGTATAT
"No string found"

That's a perfect fit for a regular expression:
>>> import re
>>> s = "FEFEWFSTARTFFFPENDDCDCSTARTDOINVOIJHSDFDONEDFOIER"
>>> re.findall("START.*?(?:END|DONE|NOMORE)", s)
['STARTFFFPEND', 'STARTDOINVOIJHSDFDONE']
.* matches any number of characters (except newlines); the additional ? makes the quantifier lazy, telling it to match as few characters as possible. Otherwise, there would be only one match, namely STARTFFFPENDDCDCSTARTDOINVOIJHSDFDONE.
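To see the difference concretely, here is a small sketch that runs the greedy and the lazy form on the same string (the variable names greedy and lazy are just for illustration):

```python
import re

s = "FEFEWFSTARTFFFPENDDCDCSTARTDOINVOIJHSDFDONEDFOIER"

# Greedy: .* runs as far as it can, so the match ends at the LAST terminator.
greedy = re.findall(r"START.*(?:END|DONE|NOMORE)", s)

# Lazy: .*? stops at the FIRST terminator after each START.
lazy = re.findall(r"START.*?(?:END|DONE|NOMORE)", s)

print(greedy)  # ['STARTFFFPENDDCDCSTARTDOINVOIJHSDFDONE']
print(lazy)    # ['STARTFFFPEND', 'STARTDOINVOIJHSDFDONE']
```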
As @BurhanKhalid noted, if you add a capturing group, only the substring matched by that part of the regex will be captured:
>>> re.findall("START(.*?)(?:END|DONE|NOMORE)", s)
['FFFP', 'DOINVOIJHSDF']
Explanation:
START # Match "START"
( # Match and capture in group number 1:
.*? # Any character, any number of times, as few as possible
) # End of capturing group 1
(?: # Start a non-capturing group that matches...
END # "END"
| # or
DONE # "DONE"
| # or
NOMORE # "NOMORE"
) # End of non-capturing group
And if your real goal is to match gene sequences, you need to make sure that you always match triplets:
re.findall("ATG(?:.{3})*?(?:TA[AG]|TGA)", s)
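As a rough sketch of how that could reproduce the sample run from the question: the version below adds a capturing group around the triplets so that only the part between the start and stop codons is returned. The names ORF and find_sequences are made up for this example, and the "No string found" message is taken from the question.

```python
import re

# Capture only the triplets between ATG and the first in-frame stop codon.
# ORF and find_sequences are hypothetical names chosen for this sketch.
ORF = re.compile(r"ATG((?:.{3})*?)(?:TA[AG]|TGA)")

def find_sequences(dna):
    """Return the captured stretches, or an empty list if there are none."""
    return ORF.findall(dna)

for dna in ("TTATGTTTTAAGGATGGGGCGTTAGTT", "TGTGTGTATAT"):
    found = find_sequences(dna)
    if found:
        print("\n".join(found))
    else:
        print("No string found")
```

For the two inputs from the question this prints TTT and GGGCGT for the first string and "No string found" for the second.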

a="FEFEWFSTARTFFFPENDDCDC"
a[a.find('START'):]
'STARTFFFPENDDCDC'

The simple way (no loop, no regex):
s = "FEFEWFSTARTFFFPENDDCDC"
tmp = s[s.find("START") + len("START"):]
result = tmp[:tmp.find("END")]
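One caveat with this approach is that find() returns -1 when a marker is missing, which silently produces a wrong slice. A minimal sketch of a guarded version, assuming you want None in that case (slice_between is a hypothetical helper name, not part of the standard library):

```python
def slice_between(s, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    i = s.find(start)
    if i == -1:
        return None  # start marker not present
    i += len(start)
    j = s.find(end, i)  # only look for the end marker AFTER the start marker
    if j == -1:
        return None  # end marker not present
    return s[i:j]

print(slice_between("FEFEWFSTARTFFFPENDDCDC", "START", "END"))  # FFFP
print(slice_between("FEFEWF", "START", "END"))                  # None
```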

yourString = 'FEFEWFSTARTFFFPENDDCDC'
substring = yourString[yourString.find("START") + len("START") : yourString.find("END")]

Not that efficient but does work.
>>> s = "FEFEWFSTARTFFFPENDDCDC"
>>> s[s.index('START'):s.index('END')+len('END')]
'STARTFFFPEND'

Related

Regex python - find match items on list that have the same digit between the second character "_" to character "."

I have the following list :
list_paths=imgs/foldeer/img_ABC_21389_1.tif.tif,
imgs/foldeer/img_ABC_15431_10.tif.tif,
imgs/foldeer/img_GHC_561321_2.tif.tif,
imgs_foldeer/img_BCL_871125_21.tif.tif,
...
I want to run a for loop that matches the string with a specific number, which is the number between the second occurrence of "_" and the ".tif.tif". For example, when the number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif"; for number 2, the matching string is "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use a regular expression. Based on this answer, I have tested this expression on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and also doesn't make sure that it will take the exact number, so if number is 1, it might also select items with number 10 .
My end goal is to match the items in the list that have the requested number between the 2nd occurrence of "_" and the first occurrence of ".tif", using a regular expression. I am looking for help with that expression.
EDIT: The output should be the whole path and not only the number.
Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything, because you are matching an underscore \_ (note that you don't have to escape it) after matching a dot, and that does not occur in the example data.
Then you want to match a digit, but the available digits only occur before any of the dots.
In your question, the numbers that you are referring to are after the 3rd occurrence of the _.
What you could do to get the path(s) is to build the pattern with a variable that holds the number you want to find:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
\S*? Match optional non whitespace characters, as few as possible
/ Match literally
(?:[^\s_/]+_){3} Match 3 times (non consecutive) _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif followed by any char except /
$ End of string
See a regex demo and a Python demo.
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif",
"imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']
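If you need to look up several numbers, it may be cheaper to scan the list once and group the paths by their trailing number. This is only a sketch of that idea: it uses a simpler pattern than the one above (it only looks at the last "_<digits>.tif" in the path), and TRAILING_NUMBER and group_by_number are names invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical helper pattern: "_<digits>.tif" followed only by non-slash chars.
TRAILING_NUMBER = re.compile(r"_(\d+)\.tif\b[^/]*$")

def group_by_number(paths):
    """Map each trailing number to the list of paths that carry it."""
    groups = defaultdict(list)
    for path in paths:
        m = TRAILING_NUMBER.search(path)
        if m:  # paths without a matching suffix are skipped
            groups[int(m.group(1))].append(path)
    return dict(groups)

list_paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.png.tif",
]
groups = group_by_number(list_paths)
print(groups[1])   # ['imgs/foldeer/img_ABC_21389_1.tif.tif']
print(groups[10])  # ['imgs/foldeer/img_ABC_15431_10.tif.tif']
```

Because the number is compared as an int key, asking for 1 cannot accidentally return the path numbered 10.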
I'll show you something working and equally ugly as regex which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif"]
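numbers = [int(x.rsplit("_", 1)[-1].split(".")[0]) for x in data]
The first split (from the right, on the last "_") gives e.g. ["imgs/foldeer/img_ABC_21389", "1.tif.tif"]. Splitting from the right keeps this working for paths such as "imgs_foldeer/..." that contain an extra underscore.
Take the last element ("1.tif.tif").
Split it again, this time on the dot, and take the first element (that's your number as a string), then cast it to int.
Please keep in mind that this only works for the format you provided; there is no flexibility at all in this solution (on the other hand, regex doesn't give you any either).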
Mostly plain string slicing, with only a trivial regex at the end:
import re
s= 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last =s[s.rindex('_')+1:]
print(re.findall(r'\d+', last)[0])
Gives:
10
[0-9]+(?=\.tif\.tif)
This regular expression uses a lookahead to match the last run of digits, the one directly before ".tif.tif" (what you're looking for). Using + rather than * avoids empty matches.
Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(rf".*_{number}\.tif.*", s)
number = 21
res21 = re.findall(rf".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']

Using Regex to move some letter of a string to a new location in the same string in a Series of strings in python

I have a list of 4000 strings. The naming convention needs to be changed for each string and I do not want to go through and edit each one individually.
The list looks like this:
data = list()
data = ['V2-FG2110-EMA-COMPRESSION',
'V2-FG2110-SA-COMPRESSION',
'V2-FG2110-UMA-COMPRESSION',
'V2-FG2120-EMA-DISTRIBUTION',
'V2-FG2120-SA-DISTRIBUTION',
'V2-FG2120-UMA-DISTRIBUTION',
'V2-FG2140-EMA-HEATING',
'V2-FG2140-SA-HEATING',
'V2-FG2140-UMA-HEATING',
'V2-FG2150-EMA-COOLING',
'V2-FG2150-SA-COOLING',
'V2-FG2150-UMA-COOLING',
'V2-FG2160-EMA-TEMPERATURE CONTROL']
I need each 'SA', 'UMA' and 'EMA' to be moved to before the -FG.
Desired output is:
V2-EMA-FG2110-Compression
V2-SA-FG2110-Compression
V2-UMA-FG2110-Compression
...
The V2-FG2 does not change throughout the list so I have started there and I tried re.sub and re.search but I am pretty new to python so I have gotten a mess of different results. Any help is appreciated.
You can rearrange the strings.
new_list = []
for word in data:
    arr = word.split('-')
    new_word = '%s-%s-%s-%s' % (arr[0], arr[2], arr[1], arr[3])
    new_list.append(new_word)
You can replace matches of the following regular expression with the contents of capture group 1:
(?<=^[A-Z]\d)(?=.*(-(?:EMA|SA|UMA))(?=-))|-(?:EMA|SA|UMA)(?=-)
Demo
The regular expression can be broken down as follows.
(?<=^[A-Z]\d) # current string position must be preceded by a capital
# letter followed by a digit at the start of the string
(?= # begin a positive lookahead
.* # match >= 0 chars other than a line terminator
(-(?:EMA|SA|UMA)) # match a hyphen followed by one of the three strings
# and save to capture group 1
(?=-) # the next char must be a hyphen
) # end positive lookahead
| # or
-(?:EMA|SA|UMA) # match a hyphen followed by one of the three strings
(?=-) # the next character must be a hyphen
(?=-) is a positive lookahead.
Evidently this may not work for versions of Python prior to 3.5, because the match in the second part of the alternation does not assign a value to capture group 1: "Before Python 3.5, backreferences to failed capture groups in Python re.sub were not populated with an empty string." This quote is from @WiktorStribiżew's answer at the link. For what it's worth, I confirmed that Ruby has the same behaviour ("V2-FG2110-EMA-COMPRESSION".gsub(rgx,'\1') #=> "V2-EMA-FG2110-COMPRESSION").
One could of course instead replace matches of (?<=^[A-Z]\d)(-[A-Z]{2}\d{4})(-(?:EMA|SA|UMA))(?=-) with $2 + $1 (written r'\2\1' in Python's re.sub). That's probably more sensible even if it's less interesting.
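In Python, that simpler swap could look roughly like the sketch below. It is a variation rather than the exact pattern above: it captures the leading "V2" in a group instead of using a lookbehind, so the replacement is r'\1\3\2'. SWAP and move_type_tag are names made up for the example.

```python
import re

# Group 1: prefix such as "V2"; group 2: "-FG2110"; group 3: "-EMA" / "-SA" / "-UMA".
# SWAP and move_type_tag are hypothetical names used only in this sketch.
SWAP = re.compile(r"^([A-Z]\d)(-[A-Z]{2}\d{4})(-(?:EMA|SA|UMA))(?=-)")

def move_type_tag(name):
    """Move the EMA/SA/UMA part in front of the -FGxxxx part."""
    return SWAP.sub(r"\1\3\2", name)

data = ['V2-FG2110-EMA-COMPRESSION',
        'V2-FG2110-SA-COMPRESSION',
        'V2-FG2160-EMA-TEMPERATURE CONTROL']
for name in data:
    print(move_type_tag(name))
# V2-EMA-FG2110-COMPRESSION
# V2-SA-FG2110-COMPRESSION
# V2-EMA-FG2160-TEMPERATURE CONTROL
```

Strings that do not fit the pattern are returned unchanged, which makes it reasonably safe to run over the whole list.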

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print re.sub(x, '_', s)
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no simple "not" for whole words in REs; ^ inside [ ] only negates a single character (as you found). Here is my solution:
import re
def subit(m):
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print(re.sub(r'(.+?)(going|you|$)', subit, s))
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that when you are dealing with whole words and want to exclude some of them, you need to remember that most regex engines (most of them use a traditional NFA) analyze strings character by character. Since you want to exclude two words with a negative lookahead, you need to define the allowed strings as words (using word boundaries). And since sub replaces each match with its replacement string, you can't just pass '_': a part like "I am" would then be replaced with three underscores (one each for 'I', ' ' and 'am'). Instead, pass a function as the second argument of sub and multiply '_' by the length of the matched string.
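Another way to get the same result, shown here only as a sketch, is to put the words to keep first in an alternation and let every other single character fall through to a catch-all. The function name mask_except and its parameters are invented for this example; it assumes keep_words is non-empty and does not add word boundaries, so "you" inside "your" would also be kept.

```python
import re

def mask_except(text, keep_words, fill="_"):
    """Replace every character with `fill`, except occurrences of keep_words."""
    # Words to keep are tried first; anything else matches "." one char at a time.
    keep = "|".join(re.escape(w) for w in keep_words)
    pattern = re.compile(rf"({keep})|.", re.DOTALL)
    # group(1) is None when the catch-all "." matched, so we emit the fill char.
    return pattern.sub(lambda m: m.group(1) or fill, text)

s = 'I am going home now, thank you.'
print(mask_except(s, ["going", "you"]))
# _____going_________________you_
```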

returning info from a string

return poker_hand(list_of_five_cards) returns a string similar to this:
**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)
and I have created a string out of it. I want the information inside the brackets. In this vein I have tried:
s = str(poker_hand(one_man))
print s
the_search = re.search(r"\((\w+)\)", s)
and this returns None when you type print the_search. I have also tried
s[s.find("(")+1:s.find(')')]
print s
which returns the whole string. Does anyone know what I am doing wrong?
EDIT: Sorry for the confusion, I should have been clearer.
input is 7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)
desired output is One pair
re the assigning... trying to assign it now, will post the results
The pattern you are using to find the item in brackets is not right: \w+ matches neither the space nor the period in "One pair.".
You can test your regex at http://regexr.com/
import re
s = '**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'
pattern = r'\(.+\.\)'
for item in re.findall(pattern, s):
    print(item.strip('().'))
output:
One pair
IIUC at the end of your string you always have the closed brackets. Then try this:
'**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'.split('(')[1][:-1]
Out[1]: 'One pair.'
The idea is to split by the opening brackets, taking what's after, and deleting the closing brackets.
input is 7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)
desired output is One pair
You can use something like:
import re
string = "7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)"
result = re.findall(r"\((.*?)\.?\)", string )
print(result[0])
Ideone Demo
Regex Explanation:
\((.*?)\.?\)
Match the character “(” literally «\(»
Match the regex below and capture its match into backreference number 1 «(.*?)»
Match any single character that is NOT a line break character (line feed) «.»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “.” literally «\.»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character “)” literally «\)»
Use the groups:
import re
s = '**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'
print (s)
m = re.search(r'\(([\s\S]+)\.\)', s)
print(m.group(1))
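Note that re.search returns None when nothing matches, so m.group(1) would raise an AttributeError on a string without brackets. A small sketch that guards against this, assuming the bracketed part is always at the end of the string (extract_hand_name is a hypothetical helper name):

```python
import re

def extract_hand_name(s):
    """Return the text inside the trailing (...) without a final period, or None."""
    m = re.search(r"\(([^)]*?)\.?\)\s*$", s)
    return m.group(1) if m else None

print(extract_hand_name("7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)"))
# One pair
print(extract_hand_name("7-Spades/4-Clubs"))
# None
```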

Matching character "/" in a string

This is my first post and I am a newbie to Python. I am trying to get this to work.
string 1 = [1/0/1, 1/0/2]
string 2 = [1/1, 1/2]
Trying to check the string if I see two / then I just need to replace the 0 with 1 so it becomes 1/1/1 and 1/1/2.
If I don't have two / then I need to add one in along with a 1 and change it to the format 1/1/1 and 1/1/2 so string 2 becomes [1/1/1,1/1/2]
Ultimate goal is to get all strings to match the pattern x/1/x. Thanks for all the input on this. I tried this and it seems to work:
for a in Port:
    if re.search(r'././', a):
        z.append(a.replace('/0/','/1/') )
    else:
        t1= a.split('/')
        if len(t1)>1 :
            t2= t1[0] + "/1/" + t1[1]
            z.append(t2)
A few lines are there to take care of some exceptions, but it seems to do the job.
The regex pattern for identifying a / is just / (in Python it has no special meaning in a regex, so escaping it as \/ is optional).
This could be solved rather simply using the built in string functions without having to add all of the overhead and additional computational time caused by using the RegEx engine.
For example:
# The string to test:
sTest = '1/0/2'
# Test the string:
if(sTest.count('/') == 2):
    # There are two forward slashes in the string
    # If the middle number is a 0, we'll replace it with a one:
    sTest = sTest.replace('/0/', '/1/')
elif(sTest.count('/') == 1):
    # One forward slash in string
    # Insert a 1 between first portion and the last portion:
    sTest = sTest.replace('/', '/1/')
else:
    print('Error: Test string is of an unknown format.')
# End If
If you really want to use RegEx, though, you could simply match the string against these two patterns: \d+/0/\d+ and \d+/\d+(?!/). If matching against the first pattern fails, then attempt to match against the second pattern. Then you can use either grouping, splitting, or simply calling .replace() (like I'm doing above) to format the string as you need.
EDIT: for clarification, I'll explain the two patterns:
Pattern 1: \d+/0/\d+ could essentially be read as "match any number (consisting of one (1) or more digits) followed by a forward slash, a zero (0), another forward slash and then followed by any number (consisting of one (1) or more digits)."
Pattern 2: \d+/\d+(?!/) could be read as "match any number (consisting of one (1) or more digits) followed by a forward slash and any other number (consisting of one (1) or more digits) which is then NOT followed by another forward slash." The last part in this pattern could be a little confusing because it uses the negative lookahead abilities of the RegEx engine.
If you wanted to add stricter rules to these patterns to make sure there are not any leading or trailing non-digit characters, you could add ^ to the start of the patterns and $ to the end, to signify the start of the string and the end of the string respectively. This would also allow you to remove the lookahead expression from the second pattern ((?!/)). As such, you would end up with the following patterns: ^\d+/0/\d+$ and ^\d+/\d+$.
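Put together, the two anchored patterns could be used along these lines. This is only a sketch: the capture groups are added so the first and last numbers can be reused, and THREE_PART, TWO_PART and normalize_port are names invented for the example. Strings that match neither pattern (including ones already in x/1/x form) are returned unchanged, which is an assumption about the desired behaviour.

```python
import re

# The two anchored patterns from above, with capture groups added.
THREE_PART = re.compile(r"^(\d+)/0/(\d+)$")
TWO_PART = re.compile(r"^(\d+)/(\d+)$")

def normalize_port(port):
    """Rewrite x/0/y and x/y as x/1/y; leave anything else untouched."""
    m = THREE_PART.match(port) or TWO_PART.match(port)
    if m:
        return f"{m.group(1)}/1/{m.group(2)}"
    return port

for port in ["1/0/1", "1/0/2", "1/1", "1/2", "1/1/2"]:
    print(port, "->", normalize_port(port))
# 1/0/1 -> 1/1/1
# 1/0/2 -> 1/1/2
# 1/1 -> 1/1/1
# 1/2 -> 1/1/2
# 1/1/2 -> 1/1/2
```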
https://regex101.com/r/rE6oN2/1
Click code generator on the left side. You get:
import re
p = re.compile(r'\d/1/\d')
test_str = "1/1/2"
re.search(p, test_str)
