regex to extract data between quotes - python

As the title says, the string is '="24-digit number"' and I want to extract the number between the quotes (example: ="000021484123647598423458" should give me '000021484123647598423458').
There are answers that explain how to get data between quotes, but in my case I also need to confirm that =" exists, without capturing it (there are also other "\d{24}" strings, but they are for other stuff).
I couldn't modify these answers to get what I need.
My latest regex was ((?<=\")\d{24}(?=\")) and the string is ="000021484123647598423458".
UPDATE: I think I will settle on the pattern r'^(?:\=\")(\d{24})(?:\")' because I just want to capture the digit characters.
import re
word = '="000021484123647598423458"'
pattern = r'^(?:\=\")(\d{24})(?:\")'
match = re.findall(pattern, word)[0]
Thank you all for suggestions.

You could have it like:
=(['"])(\d{24})\1
See a demo on regex101.com.
In Python:
import re
string = '="000021484123647598423458"'
rx = re.compile(r'''=(['"])(\d{24})\1''')
print(rx.search(string).group(2))
# 000021484123647598423458

Any one of the following matches the quoted number (note that each result includes the surrounding quotes):
>>> st = '="000021484123647598423458"'
>>> import re
>>> re.findall(r'".*\d+.*"',st)
['"000021484123647598423458"']
or
>>> re.findall(r'".*\d{24}.*"',st)
['"000021484123647598423458"']
or
>>> re.findall(r'"\d{24}"',st)
['"000021484123647598423458"']
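None of the three patterns above checks for the leading =, and all of them return the surrounding quotes. A minimal sketch of one way to require =" while returning only the digits; the sample text and the variable names are made up for illustration:

```python
import re

# Hypothetical input: one value introduced by =" and one quoted value without it.
text = 'id="000021484123647598423458" other "' + '1' * 24 + '"'

# The literal =" must be present, but only the 24 digits are captured.
with_group = re.findall(r'="(\d{24})"', text)

# Same idea with lookarounds, so the whole match is just the digits.
with_lookaround = re.findall(r'(?<==")\d{24}(?=")', text)

print(with_group)
print(with_lookaround)
```

Both calls should skip the second quoted number, because it is not preceded by =.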

Related

I want to extract data using regular expression in python

I have a string = "ProductId%3D967164%26Colour%3Dbright-royal" and I want to extract data using a regex so that the output is 967164bright-royal.
I tried (?:ProductId%3D|Colour%3D)(.*) in Python, but the output I get is 967164%26Colour%3Dbright-royal.
Can anyone please help me find a regex for it?
You don't need a regex here; use the urllib.parse module:
from urllib.parse import parse_qs, unquote
qs = "ProductId%3D967164%26Colour%3Dbright-royal"
d = parse_qs(unquote(qs))
print(d)
# Output:
{'ProductId': ['967164'], 'Colour': ['bright-royal']}
Final output:
>>> ''.join(i[0] for i in d.values())
'967164bright-royal'
Update
>>> import re
>>> ''.join(re.findall(r'%3D(\S*?)(?=%26|$)', qs))
'967164bright-royal'
The alternation matches at the first part, and the greedy (.*) then consumes the rest of the string; you cannot get a single match that covers two separate parts of the string.
If you want to capture both values using a regex in a capture group:
(?:ProductId|Colour)%3D(\S*?)(?=%26|$)
Regex demo
import re
pattern = r"(?:ProductId|Colour)%3D(\S*?)(?=%26|$)"
s = "ProductId%3D967164%26Colour%3Dbright-royal"
print(''.join(re.findall(pattern, s)))
Output
967164bright-royal
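To see why the original attempt returns too much, the two patterns can be run side by side on the string from the question. This is only a sketch; the variable names greedy and lazy are illustrative:

```python
import re

s = "ProductId%3D967164%26Colour%3Dbright-royal"

# Original attempt: the greedy (.*) runs to the end of the string after the first key.
greedy = re.findall(r"(?:ProductId%3D|Colour%3D)(.*)", s)

# Lazy group plus lookahead: each value stops before %26 or at the end of the string.
lazy = re.findall(r"(?:ProductId|Colour)%3D(\S*?)(?=%26|$)", s)

print(greedy)  # ['967164%26Colour%3Dbright-royal']
print(lazy)    # ['967164', 'bright-royal']
```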
If you must use a regular expression and you can guarantee that the string will always be formatted the way you expect, you could try this.
import re
pattern = r"ProductId%3D(\d+)%26Colour%3D(.*)"
string = "ProductId%3D967164%26Colour%3Dbright-royal"
matches = re.match(pattern, string)
print(f"{matches[1]}{matches[2]}")

Getting word from string

How can I get the word example from a string like this:
str = "http://test-example:123/wd/hub"
I wrote something like this:
print(str[10:str.rfind(':')])
but it doesn't work correctly if the string is like
"http://tests-example:123/wd/hub"
You can use this regex to capture the value preceded by - and followed by : using lookarounds
(?<=-).+(?=:)
Regex Demo
Python code,
import re
str = "http://test-example:123/wd/hub"
print(re.search(r'(?<=-).+(?=:)', str).group())
Outputs,
example
Non-regex way to get the same is using these two splits,
str = "http://test-example:123/wd/hub"
print(str.split(':')[1].split('-')[1])
Prints,
example
You can use following non-regex because you know example is a 7 letter word:
s.split('-')[1][:7]
For any arbitrary word, that would change to:
s.split('-')[1].split(':')[0]
many ways
using splitting:
example_str = str.split('-')[-1].split(':')[0]
This is fragile, and could break if there are more hyphens or colons in the string.
using regex:
import re
pattern = re.compile(r'-(.*):')
example_str = pattern.search(str).group(1)
This still expects a particular format, but is more easily adaptable (if you know how to write regexes).
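If the input is always a URL, another option is to let the standard library isolate the host name first, so that hyphens or colons in the path cannot interfere. A minimal sketch, assuming the wanted word is whatever follows the first hyphen in the host name; the helper name word_after_hyphen is hypothetical:

```python
from urllib.parse import urlsplit

def word_after_hyphen(url):
    """Return the part of the host name after its first hyphen, or None."""
    host = urlsplit(url).hostname  # e.g. 'test-example'; the port and path are dropped
    if host is None or '-' not in host:
        return None
    return host.split('-', 1)[1]

print(word_after_hyphen("http://test-example:123/wd/hub"))
print(word_after_hyphen("http://tests-example:123/wd/hub"))
```

Both calls should print example, regardless of how long the prefix before the hyphen is.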
I am not sure why you want to get a particular word from a string. I guess you want to check whether this word is present in the given string.
If that is the case, the code below can be used.
import re
str1 = "http://tests-example:123/wd/hub"
matched = re.findall('example',str1)
Split on the -, and then on :
s = "http://test-example:123/wd/hub"
print(s.split('-')[1].split(':')[0])
#example
using re
import re
text = "http://test-example:123/wd/hub"
m = re.search('(?<=-).+(?=:)', text)
if m:
    print(m.group())
Python strings have a built-in method find:
a="http://test-example:123/wd/hub"
b="http://test-exaaaample:123/wd/hub"
print(a.find('example'))
print(b.find('example'))
will print:
12
-1
It is the index of the found substring. If it equals -1, the substring was not found in the string. You can also use the in keyword:
'example' in 'http://test-example:123/wd/hub'
True

Regex to retrieve the last few characters of a string

Regex to retrieve the last portion of a string:
https://play.google.com/store/apps/details?id=com.lima.doodlejump
I'm looking to retrieve the string that follows id=
The following regex didn't seem to work in Python
sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
re.search("id=(.*?)", sampleURL).group(1)
The above should give me an output:
com.lima.doodlejump
Is my search group right?
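Your regular expression
(.*?)
will not work because it matches between zero and unlimited times, as few times as possible (because of the ?). With nothing after it in the pattern, the fewest possible is zero characters, so the group is always empty. So, you have the following choices of RegEx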
(.*) # Matches the rest of the string
(.*?)$ # Matches till the end of the string
But, you don't need RegEx at all here, simply split the string like this
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
print(data.split("id=", 1)[-1])
Output
com.lima.doodlejump
If you really have to use RegEx, you can do it like this
import re
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
print(re.search("id=(.*)", data).group(1))
Output
com.lima.doodlejump
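The lazy-versus-greedy behaviour described in this answer can be checked directly. A small sketch using the URL from the question; the variable names are illustrative:

```python
import re

sample_url = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"

# Lazy group with nothing after it: it is satisfied by zero characters.
lazy = re.search(r"id=(.*?)", sample_url).group(1)

# Lazy group followed by $: it has to stretch to the end of the string.
anchored = re.search(r"id=(.*?)$", sample_url).group(1)

# Greedy group: it takes everything that is left.
greedy = re.search(r"id=(.*)", sample_url).group(1)

print(repr(lazy))  # ''
print(anchored)    # com.lima.doodlejump
print(greedy)      # com.lima.doodlejump
```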
I'm surprised that nobody has mentioned urlparse yet...
>>> import urlparse  # Python 2 module; Python 3 moved it to urllib.parse
>>> s = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> urlparse.urlparse(s)
ParseResult(scheme='https', netloc='play.google.com', path='/store/apps/details', params='', query='id=com.lima.doodlejump', fragment='')
>>> urlparse.parse_qs(urlparse.urlparse(s).query)
{'id': ['com.lima.doodlejump']}
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id']
['com.lima.doodlejump']
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id'][0]
'com.lima.doodlejump'
The HUGE advantage here is that if the url query string gets more components then it could easily break the other solutions which rely on a simple str.split. It won't confuse urlparse however :).
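The session above uses the Python 2 urlparse module. A rough Python 3 equivalent is sketched below with urllib.parse; the extra &hl=en parameter is a made-up addition to show what happens when the query string grows:

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL: the question's URL with one more query parameter appended.
url = "https://play.google.com/store/apps/details?id=com.lima.doodlejump&hl=en"

# parse_qs maps each key to a list of values, so take the first value for 'id'.
query = parse_qs(urlsplit(url).query)
app_id = query["id"][0]

# A plain split keeps everything after 'id=', including the extra parameter.
naive = url.split("id=")[-1]

print(app_id)  # com.lima.doodlejump
print(naive)   # com.lima.doodlejump&hl=en
```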
Just split it in the place you want:
id = url.split('id=')[1]
If you print id, you'll get:
com.lima.doodlejump
Regex isn't needed here :)
However, in case there are multiple id=s in your string, and you only wanted the last one:
id = url.split('id=')[-1]
Hope this helps!
This works:
>>> import re
>>> sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> re.search("id=(.+)", sampleURL).group(1)
'com.lima.doodlejump'
>>>
Instead of capturing non-greedily for zero or more characters, this code captures greedily for one or more.

python find string after linebreak

I have 2 strings that contain the following:
name = 'Kalvo'
info = 'PC1:\nKalvo (Read)(Write)\nKL27 (Read)(Write)'
Now what I want to achieve here is to search info for the word found in name and print out everything after name.
Let's say I'm searching the string info for the string name; it should then print out:
Kalvo (Read)(Write)
I tried using re.search and re.findall but I can't get them to work.
Help is much appreciated.
Br,
Toby
You can use str.format to insert the name in the Regex pattern. Then, using .*, you can get any characters after it. See a demonstration below:
>>> from re import findall
>>> name = 'Kalvo'
>>> info = 'PC1:\nKalvo (Read)(Write)\nKL27 (Read)(Write)'
>>> findall("{}.*".format(name), info)[0]
'Kalvo (Read)(Write)'
>>>
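One caveat with formatting the name straight into the pattern: if the name contains regex metacharacters such as ( or ., they are interpreted as pattern syntax. A hedged variation, assuming the name should match literally and only at the start of a line:

```python
import re

name = 'Kalvo'
info = 'PC1:\nKalvo (Read)(Write)\nKL27 (Read)(Write)'

# re.escape keeps characters such as ( or . in the name from acting as regex syntax.
# re.MULTILINE lets ^ match at the start of each line, not just the start of the string.
pattern = re.compile(r'^' + re.escape(name) + r'.*', re.MULTILINE)

match = pattern.search(info)
line = match.group(0) if match else None
print(line)  # Kalvo (Read)(Write)
```

Because . does not match a newline by default, the match stops at the end of the line that starts with the name.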

regex in python 2.4

I have a string in python as below:
"\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
I want to get the string as
"B1xxA1xxMdl1zzInoAEROzzMofIN"
I think this can be done using regex but could not achieve it yet. Please give me an idea.
import re
st = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
s = re.sub(r"\\", "", st)  # drop every backslash
idx = s.rindex("B1")       # start of the last "B1"
print(s[idx:])
output = 'B1xxA1xxMdl1zzInoAEROzzMofIN'
OR
st = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
idx = st.rindex("\\")  # position of the last backslash
print(st[idx+1:])
output = 'B1xxA1xxMdl1zzInoAEROzzMofIN'
Here is a try:
import re
s = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
s = re.sub(r"\\[^\\]+\\","", s)
print(s)
Tested on http://py-ide-online.appspot.com (couldn't find a way to share though)
[EDIT] For some explanation, have a look at the Python regex documentation page and the first comment of this SO question:
How to remove symbols from a string with Python?
because using brackets [] can be tricky (IMHO)
In this case, [^\\] means any single character except a backslash (inside the raw-string pattern, \\ is one escaped backslash).
So [^\\]+ means one or more characters that are not a backslash.
If the desired section of the string is always on the RHS of a \ char then you could use:
string = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
string.rpartition("\\")[2]
output = 'B1xxA1xxMdl1zzInoAEROzzMofIN'
