How to extract string information from these two strings?

I want to write a single regular expression that extracts a substring from both of these strings:
string1 = '#HISEQ:625:HC2T5BCXY:1:1101:1177:2101'
string2 = '#SRR7216015.1 HISEQ:630:HC2VKBCXY:1:1101:1177:2073/1'
I want to extract everything right after the # up to the first space or the end of the string, to get
HISEQ:625:HC2T5BCXY:1:1101:1177:2101 from string1
and
SRR7216015.1 from string2
How can I do this? I've tested a number of regular expressions but couldn't get it to work.
Below is the code I tried:
import re
string1 = '#HISEQ:625:HC2T5BCXY:1:1101:1177:2101'
string2 = '#SRR7216015.1 HISEQ:630:HC2VKBCXY:1:1101:1177:2073/1'
pattern1 = re.compile(r'#(\w*.*:*\d*:*\w*:*\d*:*\d*[$|\s])')
print(pattern1.search(string1).group(1))
Thanks in advance!

Just use
#(\S+)
and take the first group. Lookarounds or alternations - as suggested in other answers - are not needed here and usually cost more than a plain capture group.
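A minimal sketch of this approach in Python, using the two strings from the question. The helper name extract_id is made up for illustration and is not part of the original answer.

```python
import re

# \S+ matches a run of non-whitespace characters, so the capture stops
# at the first space or at the end of the string, whichever comes first.
PATTERN = re.compile(r'#(\S+)')


def extract_id(line):
    """Return the token that follows '#', or None if nothing matches."""
    match = PATTERN.search(line)
    return match.group(1) if match else None


string1 = '#HISEQ:625:HC2T5BCXY:1:1101:1177:2101'
string2 = '#SRR7216015.1 HISEQ:630:HC2VKBCXY:1:1101:1177:2073/1'

print(extract_id(string1))  # HISEQ:625:HC2T5BCXY:1:1101:1177:2101
print(extract_id(string2))  # SRR7216015.1
```

Checking for None before calling group(1) avoids the AttributeError you get when search finds nothing.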

You could use this regex for that:
(?<=#).*?(?= |$)
This uses lookarounds: (?<=#) checks that a # sign comes immediately before, (?= |$) asserts that a space or the end of the string follows, and .*? lazily matches everything in between.
https://regex101.com/r/p7kI2O/1
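For comparison, here is a small sketch of the lookaround pattern applied to the question's strings. Because the # and the space are only asserted, not consumed, the whole match (group 0) is already the wanted text. The variable names are illustrative.

```python
import re

string1 = '#HISEQ:625:HC2T5BCXY:1:1101:1177:2101'
string2 = '#SRR7216015.1 HISEQ:630:HC2VKBCXY:1:1101:1177:2073/1'

# Lookbehind for '#', lazy match, lookahead for a space or the end.
lookaround = re.compile(r'(?<=#).*?(?= |$)')

results = [lookaround.search(line).group(0) for line in (string1, string2)]
print(results)
# ['HISEQ:625:HC2T5BCXY:1:1101:1177:2101', 'SRR7216015.1']
```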

Related

How can I avoid matching a string based on a regex?

I am trying to fetch a string that contains only digits (using the regex below), but it returns matches from both strings.
string1 = '1234843847394645362'
string2 = 'this is what I have 1297643847381737345is a multi'
Regex used :
'\d{15,20}'
This gives me the numbers from both string1 and string2.
Can we avoid getting the number from string2?
I need help.

Try with this regex: ^\d{15,20}$
Demo here

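If you also don't want to match when the digits are followed by a trailing newline, use \Z instead of $ (in Python, $ also matches just before a newline at the end of the string):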
\A\d{15,20}\Z
Regex demo
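A short sketch of the difference between the unanchored and anchored patterns, using the strings from the question. re.fullmatch is shown as an alternative that anchors both ends without writing the anchors; the variable names are illustrative.

```python
import re

string1 = '1234843847394645362'
string2 = 'this is what I have 1297643847381737345is a multi'

unanchored = re.compile(r'\d{15,20}')
anchored = re.compile(r'\A\d{15,20}\Z')

# Without anchors the digits inside string2 are found as well.
print(bool(unanchored.search(string2)))   # True

# With anchors only a string made up entirely of digits matches.
print(bool(anchored.search(string1)))     # True
print(bool(anchored.search(string2)))     # False

# $ tolerates one trailing newline, \Z does not.
print(bool(re.search(r'^\d{15,20}$', string1 + '\n')))  # True
print(bool(anchored.search(string1 + '\n')))            # False

# fullmatch requires the whole string to match the pattern.
print(bool(re.fullmatch(r'\d{15,20}', string1)))        # True
```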

findall not returning all the results in Python 3.7

I am trying to create a list of tuples with the data that follows string1 and string3, but I am not getting the expected result.
s = 'string1:1234string2string3:a1b2c3string1:2345string3:b5c6d7'
re.findall(r'string1:(\d+)[\s,\S]+string3:([\s\S]+)', s)
Actual result:
[('1234', 'b5c6d7')]
Expected result:
[('1234', 'a1b2c3'), ('2345', 'b5c6d7')]

Your current regex uses [\s,\S]+, which is greedy: it first matches all characters up to the end of the string and then backtracks only as far as the last string3.
You could make it non-greedy and use a positive lookahead (?=string|$) after the last group, which asserts that what follows is either string or the end of the line $.
string1:(\d+).*?string3:(.*?)(?=string|$)
import re
s = 'string1:1234string2string3:a1b2c3string1:2345string3:b5c6d7'
print(re.findall(r'string1:(\d+).*?string3:(.*?)(?=string|$)', s))
Demo

The problem is that [\s,\S]+ is greedy and therefore consumes everything between the first string1 and the last string3 (the comma inside the class is redundant, since [\s\S] already matches any character).
You can fix that by making the quantifiers non-greedy and adding a positive lookahead, like this:
string1:(\d+)[\s\S]*?string3:([\s\S]+?(?=string|$))
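To see the effect side by side, here is a small sketch that runs a greedy and a non-greedy pattern against the string from the question. The variable names greedy and lazy are illustrative.

```python
import re

s = 'string1:1234string2string3:a1b2c3string1:2345string3:b5c6d7'

# Greedy: [\s\S]+ runs to the end and backtracks to the LAST 'string3:'.
greedy = re.findall(r'string1:(\d+)[\s\S]+string3:([\s\S]+)', s)

# Lazy: each quantifier stops as early as possible, and the lookahead
# ends the second group before the next 'string' or at the end.
lazy = re.findall(r'string1:(\d+)[\s\S]*?string3:([\s\S]+?)(?=string|$)', s)

print(greedy)  # [('1234', 'b5c6d7')]
print(lazy)    # [('1234', 'a1b2c3'), ('2345', 'b5c6d7')]
```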

How to use a Python regex to find matched strings?

For the string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find substrings of the form "#..'...'", such as "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use the regex ((#.+)\~(\'.*?\')), it finds "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". How can I modify the regex to find the strings successfully?

Use non-capturing groups for the inner brackets and, instead of a greedy .+, match anything that is not the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]

Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
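A runnable sketch of that idea, with x set to the string from the question (the one-liner above assumes x is already defined). The permissive pattern is included only to show what goes wrong.

```python
import re

x = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"

# Permissive: the greedy .+ runs past the first attribute.
permissive = re.findall(r"#.+~'.*?'", x)

# Restrictive: only lowercase letters before '~', and only lowercase
# letters and hyphens between the quotes.
restrictive = re.findall(r"#[a-z]+~'[-a-z]*'", x)

print(permissive)
# ["#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'"]
print(restrictive)
# ["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
```

If attribute names or values can contain digits or underscores, the character classes would need to be widened accordingly.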

For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# match everything that begins with '#' up to, but not including, ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]

How to remove substrings marked with special characters from a string?

I have a string in Python:
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print Tt
'This is a <"string">string, It should be <"changed">changed to <"a">a nummber.'
As you can see, some words are repeated inside the <\" \"> parts.
My question is: how do I delete those repeated parts (delimited by the characters shown)?
The result should be like:
'This is a string, It should be changed to a nummber.'

Use regular expressions:
import re
Tt = re.sub('<\".*?\">', '', Tt)
Note the ? after *. It makes the expression non-greedy,
so it matches as few characters between <\" and \"> as possible.
James's solution works only when the delimiting substrings
consist of a single character each (< and >). In that case it is possible to use negated character classes like [^>]. If you want to remove a substring delimited by character sequences (e.g. by begin and end), you should use non-greedy regular expressions (i.e. .*?).
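A small sketch of the multi-character case, using the words begin and end as delimiters. The sample text is made up for illustration.

```python
import re

text = 'A begin x end B begin y end C'

# Non-greedy: each 'begin ... end' pair is removed on its own.
lazy_result = re.sub(r'begin.*?end', '', text)

# Greedy: everything from the first 'begin' to the last 'end' goes.
greedy_result = re.sub(r'begin.*end', '', text)

print(repr(lazy_result))    # 'A  B  C'
print(repr(greedy_result))  # 'A  C'
```

If the delimited parts can span several lines, pass flags=re.DOTALL so that . also matches newlines.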

I'd use a quick regular expression:
import re
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print(re.sub("<[^<]+>", "", Tt))
#Out: This is a string, It should be changed to a nummber.
Ah - similar to Igor's post; he beat me to it by a bit. Rather than making the expression non-greedy, I exclude any match that contains another start tag "<", so the pattern only matches a start tag that is followed by an end tag ">".

Replacing Certain Parts of a String Python

I cannot seem to solve this. I have many different strings, and I need to strip the end of each one, but the ends always have different lengths. Here is an example of a couple of strings:
string1 = "thisisnumber1(111)"
string2 = "itsraining(22252)"
string3 = "fluffydog(3)"
Now when I print these out it will of course print the following:
thisisnumber1(111)
itsraining(22252)
fluffydog(3)
What I would like it to print, though, is the following:
thisisnumber1
itsraining
fluffydog
I would like to remove the part in the parentheses from each string, but I do not know how, since the lengths are always changing. Thank you.

You can use str.rsplit for this:
>>> string1 = "thisisnumber1(111)"
>>> string2 = "itsraining(22252)"
>>> string3 = "fluffydog(3)"
>>>
>>> string1.rsplit("(")
['thisisnumber1', '111)']
>>> string1.rsplit("(")[0]
'thisisnumber1'
>>>
>>> string2.rsplit("(")
['itsraining', '22252)']
>>> string2.rsplit("(")[0]
'itsraining'
>>>
>>> string3.rsplit("(")
['fluffydog', '3)']
>>> string3.rsplit("(")[0]
'fluffydog'
>>>
str.rsplit splits the string from right to left rather than left to right like str.split (the direction only makes a difference when a maxsplit argument is passed). Here we split on ( and retrieve the element at index 0 (the first element), which is everything before the (...) at the end of each string, provided the string contains no other (.
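One caveat, sketched below with a made-up input: without a maxsplit argument, rsplit returns the same list as split, so a name that itself contains ( gets cut at the first parenthesis. Passing maxsplit=1 limits the split to the last ( only.

```python
name = "fluffy(dog)(3)"  # hypothetical input with an extra '(' in the name

# No maxsplit: every '(' is a split point, same result as str.split.
print(name.rsplit("("))        # ['fluffy', 'dog)', '3)']

# maxsplit=1: only the right-most '(' is used.
print(name.rsplit("(", 1))     # ['fluffy(dog)', '3)']
print(name.rsplit("(", 1)[0])  # fluffy(dog)

# str.rpartition does the same job and always returns three parts.
head, sep, tail = name.rpartition("(")
print(head)                    # fluffy(dog)
```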

Your other option is to use regular expressions, which can give you more precise control over what you want to get.
import re
regex = r"(.+)\(\d+\)"
print(re.match(regex, string1).groups()[0])  # prints thisisnumber1
print(re.match(regex, string2).groups()[0])  # prints itsraining
print(re.match(regex, string3).groups()[0])  # prints fluffydog
Breakdown of what's happening:
regex = r"(.+)\(\d+\)" is the regular expression, the formula for the string you're trying to find
.+ means match 1 or more characters of any kind except newline
\d+ means match 1 or more digits
\( and \) are the literal "(" and ")" characters
Putting .+ in parentheses puts that string sequence in a group, meaning that group of characters is one that you want to be able to access later on. We don't put the sequence \(\d+\) in a group because we don't care about those characters.
re.match(regex, string1).groups() returns a tuple of every substring in string1 that was captured by a group. Since there is only one group, you just access the 0th element.
There's a nice tutorial on regular expressions on Tutorials Point if you want to learn more.
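The same idea as a self-contained sketch that also copes with strings that have no trailing (digits) part. The helper name strip_suffix and the added $ anchor are my own assumptions, not part of the answer above.

```python
import re

# Group 1 is everything before a final "(digits)" at the end of the string.
TRAILING_NUMBER = re.compile(r"(.+)\(\d+\)$")


def strip_suffix(text):
    """Return text without a trailing '(digits)'; unchanged if absent."""
    match = TRAILING_NUMBER.match(text)
    return match.group(1) if match else text


for value in ("thisisnumber1(111)", "itsraining(22252)", "fluffydog(3)", "noparens"):
    print(strip_suffix(value))
```

Returning the input unchanged when there is no match avoids calling .groups() on None.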

Since you say in a comment:
"all that will be in the parentheses will be numbers"
so you'll always have digits between your parens, I'd recommend taking a look at removing them with the regular expression module:
import re
string1 = "thisisnumber1(111)"
string2 = "itsraining(22252)"
string3 = "fluffydog(3)"
strings = string1, string2, string3
for s in strings:
    s_replaced = re.sub(
        r'''
        \(   # must escape the parens, since these are special characters in regex
        \d+  # one or more digits, 0-9
        \)
        ''',  # this regular expression will be replaced by the next argument
        '',  # replace the above with an empty string
        s,  # the string we're modifying
        flags=re.VERBOSE)  # verbose flag allows us to comment regex clearly
    print(s_replaced)
prints:
thisisnumber1
itsraining
fluffydog
