import regex
frase = "text https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one other text https://www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr"
x = regex.findall(r"/((http[s]?:\/\/)?(www\.)?(gamivo\.com\S*){1})", frase)
print(x)
Result:
[('www.gamivo.com/product/sea-of-thieves-pc-xbox-one', '', 'www.', 'gamivo.com/product/sea-of-thieves-pc-xbox-one'), ('www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr', '', 'www.', 'gamivo.com/product/fifa-21-origin-eng-pl-cz-tr')]
I want something like:
[('https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one', 'https://gamivo.com/product/fifa-21-origin-eng-pl-cz-tr')]
How can I do this?
You need to:
Remove the leading / from the pattern. It prevents https:// or http:// from matching, because in the input the / comes after http, not before it.
Remove the unnecessary outer capturing group and the {1} quantifier.
Convert the optional capturing groups into non-capturing ones, (?:...).
See this Python demo:
import re
frase = "text https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one other text https://www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr"
print( re.findall(r"(?:https?://)?(?:www\.)?gamivo\.com\S*", frase) )
# => ['https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one', 'https://www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr']
See the regex demo, too. Also, see the related post "re.findall behaves weird".
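To see why the original pattern returned tuples, here is a minimal sketch comparing capturing and non-capturing groups with re.findall. The sample text and the variable names are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample: one URL with scheme and www, one bare domain.
text = "see https://www.gamivo.com/product/a and gamivo.com/product/b"

# With capturing groups, findall returns one tuple of group contents per match.
with_groups = re.findall(r"(https?://)?(www\.)?(gamivo\.com\S*)", text)

# With non-capturing groups (?:...), findall returns each whole match as a string.
without_groups = re.findall(r"(?:https?://)?(?:www\.)?gamivo\.com\S*", text)

print(with_groups)
print(without_groups)
```

An optional group that does not take part in a match shows up as an empty string in the tuple, which is where the '' entries in the question's output come from.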
Try this. It matches each substring that starts with http:// or https:// and runs up to the next whitespace character (or the end of the string).
import re
frase = "text https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one other text https://www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr"
x = re.findall(r'https?://\S*', frase)
print(x)
# ['https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one', 'https://www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr']
I want to replace words and spaces that appear before a digit in a string with nothing. For example, for the string = 'Juice of 1/2', I want to return '1/2'. I tried the following, but it did not work.
string = "Juice of 1/2"
new = string.replace(r"^.+?(?=\d)", "")
I am also trying to apply this to every cell in a list of columns using the following code. How would I incorporate the new regex pattern into the existing pattern r"\(|\)|,"?
df[pd.Index(cols2) + "_clean"] = (
df[cols2]
.apply(lambda col: col.str.replace(r"\(|\)|,", "", regex=True))
)
You might be able to phrase this using str.extract:
df["col2"] = df["col2"].str.extract(r'([0-9/-]+)')
.+? can match any character, including digits and the / in 1/2. Since you only want to replace letters and spaces, use [a-z\s]+.
You also have to use re.sub(), not str.replace(), which only replaces literal substrings (in pandas, .str.replace() treats the pattern as a regular expression when regex=True is passed; since pandas 2.0 the default is regex=False).
new = re.sub(r'[a-z\s]+(?=\d)', '', string, flags=re.I)
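The question also asks how to combine this with the existing r"\(|\)|," pattern. A minimal sketch, assuming one alternation is acceptable: the helper name clean_cell and the sample cells are hypothetical, and it uses plain re on a list rather than a DataFrame.

```python
import re

# One alternation: punctuation to drop, or letters/spaces that directly precede a digit.
CLEAN = re.compile(r"[(),]|[a-z\s]+(?=\d)", flags=re.I)

def clean_cell(value: str) -> str:
    """Remove parentheses, commas, and any words/spaces right before a digit."""
    return CLEAN.sub("", value)

cells = ["Juice of 1/2", "(2 cups), flour"]  # made-up sample cells
cleaned = [clean_cell(c) for c in cells]
print(cleaned)
```

In pandas, the same pattern string could presumably be passed to col.str.replace(pattern, "", regex=True, flags=re.I) inside the existing apply call.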
Something like this might work.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"[A-Za-z\s]+"
test_str = "Juice of 1/2 hede"
subst = ""
# count=0 (the default) replaces every match; set it higher to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
I'm not getting the desired output: with this Python regular expression, re.sub leaves only the last occurrence. Please explain what I'm doing wrong.
srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
re.sub("http://.*[#]", "", srr)
'image-1CE005XG03'
Desired output, with every http://www.google.com/# removed from the above string:
image-1CCCC|image-1VVDD|image-123|image-1CE005XG03
I would use re.findall here, rather than trying to do a replacement to remove the portions you don't want:
src = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
matches = re.findall(r'https?://www\.\S+#([^|\s]+)', src)
output = '|'.join(matches)
print(output) # image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
Note that if you want to be more specific and match only Google URLs, you may use the following pattern instead:
https?://www\.google\.\S+#([^|\s]+)
>>> "|".join(re.findall(r'#([^|\s]+)', srr))
'image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03'
Here is another solution, without regex:
"|".join(i.split("#")[-1] for i in srr.split("|"))
image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
Using correct regex in re.sub as suggested in comment above:
import re
srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-1VVDD| http://www.google.com/#image-123| http://www.google.com/#image-123| http://www.google.com/#image-1CE005XG03"
print (re.sub(r"\s*https?://[^#\s]*#", "", srr))
Output:
image-1CCCC|image-1VVDD|image-123|image-123|image-1CE005XG03
RegEx Details:
\s*: Match 0 or more whitespaces
https?: Match http or https
://: Match ://
[^#\s]*: Match 0 or more of any characters that are not # and whitespace
#: Match a #
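As for why the original re.sub left only the last item: .* is greedy, so a single match runs from the first http:// to the last # in the string. A minimal sketch of the difference, using a shortened version of the input (the variable names are illustrative):

```python
import re

srr = "http://www.google.com/#image-1CCCC| http://www.google.com/#image-123"

# Greedy .* swallows everything up to the LAST '#', so one match removes
# both URLs and the first image name along with them.
greedy = re.sub(r"http://.*#", "", srr)

# Lazy .*? stops at the FIRST '#' after each http://, so every URL prefix
# is removed separately; \s* also drops the space after each '|'.
lazy = re.sub(r"\s*http://.*?#", "", srr)

print(greedy)
print(lazy)
```

The [^#\s]* class used above reaches the same result as .*? here and cannot run past a # by construction.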
message = "<#U0104FGR7SL> test111 <#U0106LSJ> test33"
I have the string above.
I'd like to split the text at each tag that matches the pattern <#U......>, keeping the tag with the text that follows it.
regex = re.compile("<#U[^>]+>")
match = regex.split(message)
If I do this, the tags are dropped and I only get something like "test, test22".
I'd like to split it this way instead:
<#U0104FGR7SL> test111
<#U0106LSJ> test33
Please advise me what to do.
You can do the following:
import re
message = "<#U0104FGR7SL> test111 <#U0106LSJ> test33"
matches = re.findall(r"<\S+>\s\S+", message)
for x in matches:
    print(x)
# <#U0104FGR7SL> test111
# <#U0106LSJ> test33
Another one - using the newer regex module which supports splitting by lookarounds:
import regex as re
string = "<#U0104FGR7SL> test111 <#U0106LSJ> test33"
parts = re.split(r'(?<!\A)(?=<#)', string)
print(parts)
This yields
['<#U0104FGR7SL> test111 ', '<#U0106LSJ> test33']
See a demo on regex101.com.
You may use either of the two re.split solutions:
re.split(r'\s+(?=<#U[^>]+>)', message) # Any Python version, if matches are whitespace separated
[x.strip() for x in re.split(r'(?=<#U[^>]+>)', message) if x] # Starting with Python 3.7
NOTE: In Python 3.7, re.split finally was fixed to split with empty matches.
Details
\s+ - 1+ whitespaces
(?=<#U[^>]+>) - a positive lookahead that requires <#U, 1+ chars other than > and then > immediately to the right of the current location.
See the Python demo:
import re
message = '<#U0104FGR7SL> test111 <#U0106LSJ> test33'
print ( re.split(r'\s+(?=<#U[^>]+>)', message) )
# => ['<#U0104FGR7SL> test111', '<#U0106LSJ> test33']
print ( [x.strip() for x in re.split(r'(?=<#U[^>]+>)', message) if x] )
# => ['<#U0104FGR7SL> test111', '<#U0106LSJ> test33']
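If the tag and its text are needed separately rather than as one string, a variation with re.finditer may help. This is only a sketch: it assumes the text after a tag never contains a < character, and the name pairs is made up.

```python
import re

message = "<#U0104FGR7SL> test111 <#U0106LSJ> test33"

# Group 1 is the tag; group 2 is everything up to the next '<' (or the end).
pairs = [
    (m.group(1), m.group(2).strip())
    for m in re.finditer(r"(<#U[^>]+>)([^<]*)", message)
]

print(pairs)
```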
I have a string "Name(something)" and I am trying to extract the portion of the string within the parentheses!
I've tried the following, but I don't seem to be getting the results I'm looking for.
n.split('()')
name, something = n.split('()')
You can use a simple regex to catch everything between the parentheses:
>>> import re
>>> s = 'Name(something)'
>>> re.search(r'\(([^)]+)', s).group(1)
'something'
The regex matches the first "(", then it matches everything that's not a ")":
\( matches the character "(" literally
the capturing group ([^)]+) greedily matches anything that's not a ")"
As an improvement on @Maroun Maroun's answer:
re.findall(r'\(([^)]+)', s)
It finds every string that appears between parentheses, not just the first.
You can use split as in your example, but like this:
val = s.split('(', 1)[1].split(')')[0]
or using regex
You can use re.match:
>>> import re
>>> s = "name(something)"
>>> na, so = re.match(r"(.*)\((.*)\)" ,s).groups()
>>> na, so
('name', 'something')
The pattern has two (.*) groups, each matching any text; the second one sits between the literal parentheses \( and \).
You can look for ( and ) (need to escape these using backslash in regex) and then match every character using .* (capturing this in a group).
Example:
import re
s = "name(something)"
regex = r'\((.*)\)'
# re.search, not re.match: re.match only matches at the start of the string, and "(" is not there
text_inside_paranthesis = re.search(regex, s).group(1)
print(text_inside_paranthesis)
Outputs:
something
Without regex you can do the following:
text_inside_paranthesis = s[s.find('(')+1:s.find(')')]
Outputs:
something
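All of the regex variants above raise AttributeError when the string has no parentheses, because the match object is None. A minimal sketch of a guard, assuming None is an acceptable "not found" value (the function name text_in_parens is hypothetical):

```python
import re
from typing import Optional

def text_in_parens(s: str) -> Optional[str]:
    """Return the text inside the first (...) pair, or None if there is none."""
    m = re.search(r"\(([^)]*)\)", s)
    return m.group(1) if m else None

print(text_in_parens("Name(something)"))
print(text_in_parens("Name"))
```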
I use some code to scrape information from a website, pass it to Google, and print directions.
I'm having an issue with some of the information: the site I use sometimes adds a # followed by 3 random digits, then a / and another 3 digits, e.g. #037/100.
How can I use Python to ignore this "#037/100" string?
I currently use
for i, part in enumerate(list(addr_p)):
    if '#' in part:
        del addr_p[i]
        break
to remove the # if found, but I'm not sure how to do it for the random numbers.
Any ideas?
If you want to remove "a # followed by three digits, a forward slash, and three more digits" from a string s, you could do
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\d{3}\/\d{3}', '', s)
print(t)
Result (two spaces remain where the tag was removed):
this is a string  with other stuff
Explanation:
# - literal character '#'
\d{3} - exactly three digits
\/ - forward slash (the backslash is optional in Python, where / has no special meaning in a regex)
\d{3} - exactly three digits
And the whole thing that matches the above (if it's present) is replaced with '' - i.e. "removed".
import re
addr_p[i] = re.sub(r'#[0-9]+/[0-9]+$', '', addr_p[i])
I'm no wizard with regular expressions, but I'd imagine you could do something like this.
The regexp handles the '#' as well, so the separate check for it is not needed.
If the format is always the same, then you could check if the line starts with a #, then set the string to itself without the first 8 characters.
if part[0:1] == '#':
    part = part[8:]
If the first character is a #, this drops the first 8 characters (the length of a tag such as #037/100) and keeps the rest of the string.
I'd double your problems and match against a regular expression for this.
import re
regex = re.compile(r'([\w\s]+)#\d+\/\d+([\w\s]+)')
m = regex.match('This is a string with a #123/987 in it')
if m:
    s = m.group(1) + m.group(2)
    print(s)
A more concise way:
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\S+', '', s)
print(t)
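Applied to a list of address parts like the addr_p in the question, the same idea might look like the sketch below. The function name strip_tags and the sample parts are made up, and it assumes the tag is always # plus three digits, a slash, and three digits.

```python
import re

# '#', three digits, '/', three digits, plus any whitespace right before it.
TAG = re.compile(r"\s*#\d{3}/\d{3}")

def strip_tags(parts):
    """Remove #ddd/ddd tags from each part and drop parts that end up empty."""
    cleaned = (TAG.sub("", part).strip() for part in parts)
    return [part for part in cleaned if part]

addr_p = ["12 Main St #037/100", "#037/100", "Springfield"]  # hypothetical input
print(strip_tags(addr_p))
```

Building a new list avoids deleting from addr_p while looping over it, and it handles more than one tagged part.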