How to remove text between variables - python

I currently have a string that has variables in it.
domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh
I'm trying to delete
&thingy=(all text that is in here)
The order might not always be that, and the text after the = will change.
I started doing something like this, but I feel there has to be quicker alternative:
cleanlist = []
variables = url.split('&')
for t in variables:
    if t.split('=', 1)[0] != 'thingy':
        cleanlist.append(t)

I don't know Python, but from experience with other programming languages, the question I think you should have asked is "How do you parse a URL in Python?" or "How do you parse a url query string in Python?"
Just Googling this I got the following info that may help:
from urllib.parse import urlparse, parse_qs  # Python 2: from urlparse import urlparse, parse_qs
o = urlparse('domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh')
q = parse_qs(o.query, True)
>>> q['hello']
['randomtext']
>>> q['thingy']
['randotext2']
Once you parse the URL and query string, just grab what you want.
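To make "just grab what you want" concrete for the original goal (dropping one parameter), here is a minimal sketch using only the standard library on Python 3. The helper name drop_param is made up for illustration; it rebuilds the query string without the unwanted key, whatever its position.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_param(url, name):
    """Return url with every occurrence of query parameter `name` removed."""
    parts = urlsplit(url)
    # parse_qsl keeps the original order of the remaining parameters.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))

url = "domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
print(drop_param(url, "thingy"))
```

Note that urlencode re-encodes the values, so characters that were percent-encoded in the input may come back in a slightly different but equivalent form.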

You can substitute using regex.
import re
# [^&]* stops at the next parameter, so this also works when thingy is the last one
p = re.compile(r'&thingy=[^&]*')
test_str = "domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
subst = ""
result = re.sub(p, subst, test_str)
>>> result
'domain.com/?hello=randomtext&stuff=1231kjh'

If I understand your question correctly, you're trying to delete everything from "&thingy=" to the end of the string, i.e. "&thingy=randotext2&stuff=1231kjh".
This can be easily achieved by doing something like this:
current_str = "domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
cursor = current_str.find("&thingy=")
clean_str = current_str[:cursor]
The clean_str variable now holds what you're looking for, which is only:
domain.com/?hello=randomtext
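One caveat with the slicing approach: str.find returns -1 when the marker is absent, so the slice would silently drop the last character of the string. A small sketch that avoids this with str.partition (the helper name truncate_at is hypothetical):

```python
def truncate_at(text, marker):
    """Return the part of text before the first marker, or all of text if the marker is absent."""
    head, _sep, _tail = text.partition(marker)
    return head

url = "domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
print(truncate_at(url, "&thingy="))    # everything before the marker
print(truncate_at(url, "&missing="))   # marker not present: string is returned unchanged
```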

If you only want to delete the value of a query string argument, such as thingy, while keeping the &thingy= key itself, a regular expression can do it like this:
import re
domain = "domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
x = re.sub(r'(&thingy=)[^&]*(&?.*)$', r'\1\2', domain)
This works regardless of what follows the given parameter.
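If the parameter name is not fixed, the same idea can be wrapped in small helpers. This is only a sketch: blank_param and remove_param are made-up names, and re.escape is used so that a name containing regex metacharacters is treated literally.

```python
import re

def blank_param(url, name):
    """Empty the value of query parameter `name`, keeping the key."""
    pattern = r"([?&]" + re.escape(name) + r"=)[^&#]*"
    return re.sub(pattern, r"\1", url)

def remove_param(url, name):
    """Remove `&name=value` entirely (assumes the parameter is not the first one)."""
    return re.sub(r"&" + re.escape(name) + r"=[^&#]*", "", url)

domain = "domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
print(blank_param(domain, "thingy"))
print(remove_param(domain, "thingy"))
```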

Related

Regex issue in python

I have a string "value=4020a345-f646-4984-a848-3f7f5cb51f21"
if re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x ):
    x = re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x )
    m = x.group(1)
m only gives me 4020a345, not sure why it does not give me the entire "4020a345-f646-4984-a848-3f7f5cb51f21"
Can anyone tell me what I am doing wrong?
Try this regex; it looks like you are trying to match a GUID:
value=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
This should match what you want, if all the strings are of the form you've shown:
value=((\w*\d*\-?)*)
You can also use this website to validate your regular expressions:
http://regex101.com/
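The regex below works as you expect, although everything inside the square brackets is a single character class, so it is equivalent to value=([\w*|\-]+).
value=([\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*]+)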
You are trying to match hex digits, which is why this regex is more correct than using [\w\d]:
import re
pattern = "value=([0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12})"
data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
res = re.search(pattern, data)
print(res.group(1))
If you don't care about validation, i.e. checking that the value is correct hex, there is no reason not to use simple string manipulation as shown below. Note that "value=" is 6 characters long, so the slice starts at index 6.
>>> data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
>>> print(data[6:])
4020a345-f646-4984-a848-3f7f5cb51f21
>>> # or maybe
...
>>> print(data[6:].replace('-',''))
4020a345f6464984a8483f7f5cb51f21
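If some validation is wanted without writing a regex, one option is the standard uuid module. This is a sketch, assuming the value really is meant to be a GUID/UUID; note that uuid.UUID is fairly lenient (it also accepts the hex digits without hyphens, for example), so it checks the content rather than the exact layout.

```python
import uuid

data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
_key, _sep, raw = data.partition("=")   # split once on the first "="

try:
    parsed = uuid.UUID(raw)             # raises ValueError on malformed input
except ValueError:
    parsed = None

if parsed is not None:
    print(str(parsed))   # canonical form with hyphens
    print(parsed.hex)    # 32 hex digits, no hyphens
```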
You can get the subparts of the value as a list:
import re
txt = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
parts = re.findall(r'\w+', txt)[1:]
parts is ['4020a345', 'f646', '4984', 'a848', '3f7f5cb51f21']
If you really want the entire string:
full = "-".join(parts)
A simpler way:
full = re.findall(r"[\w-]+", txt)[-1]
full is 4020a345-f646-4984-a848-3f7f5cb51f21
value=([\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*)
Try this and grab the capture group. Your regex did not return the whole value because you used the | operator: if the pattern on the left side of | matches, the alternatives on the right are never tried.
See demo.
http://regex101.com/r/hQ1rP0/45
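The effect of | can be reproduced in a few lines. The sketch below uses a shortened version of the question's pattern to show that alternation splits the whole expression into independent branches, and then a grouped pattern (one possible fix, not the only one) that captures the full value.

```python
import re

data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"

# Alternation has the lowest precedence, so this reads as
# "value=\w*" OR "\d*\-\w*", not as one long sequence.
loose = re.search(r"value=\w*|\d*\-\w*", data)
first_branch = loose.group(0)

# Putting the whole value inside one capture group avoids the problem.
strict = re.search(r"value=([0-9a-fA-F]+(?:-[0-9a-fA-F]+){4})", data)
whole_value = strict.group(1)

print(first_branch)   # only the first branch matched
print(whole_value)    # the complete value
```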

Python splitting string to find specific content

I am trying to split a string in Python to extract a particular part. I am able to get the part of the string before the symbol <, but how do I get the bit after, e.g. the emailaddress part?
>>> s = 'texttexttextblahblah <emailaddress>'
>>> s = s[:s.find('<')]
>>> print(s)
The above code gives the output texttexttextblahblah
s = s[s.find('<')+1:-1]
or
s = s.split('<')[1][:-1]
cha0site's and ig0774's answers are pretty straightforward for this case, but it would probably help you to learn regular expressions for times when it's not so simple.
import re
fullString = 'texttexttextblahblah <emailaddress>'
m = re.match(r'(\S+) <(\S+)>', fullString)
part1 = m.group(1)
part2 = m.group(2)
Perhaps being a bit more explicit with a regex isn't a bad idea in this case:
import re
subject = 'texttexttextblahblah <emailaddress>'
match = re.search(r"""
    (?<=<)   # Make sure the match starts after a <
    [^<>]*   # Match any number of characters except angle brackets""",
    subject, re.VERBOSE)
if match:
    result = match.group()
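If the strings really are of the form "display name <address>", the standard library can also split them without any slicing or regex. A minimal sketch with email.utils.parseaddr; the sample name and address are made up for illustration.

```python
from email.utils import parseaddr

# Hypothetical sample input; the name and address are invented.
s = "Jane Doe <jane@example.com>"

# parseaddr returns a (display name, address) pair.
name, address = parseaddr(s)
print(name)
print(address)
```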

regex in python 2.4

I have a string in python as below:
"\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
I want to get the string as
"B1xxA1xxMdl1zzInoAEROzzMofIN"
I think this can be done using regex but could not achieve it yet. Please give me an idea.
import re
st = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
s = re.sub(r"\\", "", st)
idx = s.rindex("B1")
print(s[idx:])
Output: B1xxA1xxMdl1zzInoAEROzzMofIN
OR
st = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
idx = st.rindex("\\")
print(st[idx+1:])
Output: B1xxA1xxMdl1zzInoAEROzzMofIN
Here is a try:
import re
s = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
s = re.sub(r"\\[^\\]+\\","", s)
print s
Tested on http://py-ide-online.appspot.com (couldn't find a way to share though)
[EDIT] For some explanation, have a look at the Python regex documentation page and the first comment of this SO question:
How to remove symbols from a string with Python?
because using brackets [] can be tricky (IMHO)
In this case, [^\\] means any single character except a backslash (inside the pattern, \\ is one escaped backslash).
So [^\\]+ means one or more characters that are not a backslash.
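To check what [^\\]+ actually matches, it can be tried on its own. The sketch below lists the runs of non-backslash characters in the sample string and then applies the substitution from the answer.

```python
import re

s = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"   # contains two real backslashes

# In the raw pattern, \\ is one escaped backslash, so [^\\] is
# "any single character that is not a backslash".
runs = re.findall(r"[^\\]+", s)
print(runs)

# backslash, a run of non-backslashes, backslash: removes the leading \B1\
cleaned = re.sub(r"\\[^\\]+\\", "", s)
print(cleaned)
```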
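If the desired section of the string is always on the right-hand side of the last \ character, then you could use the following (note that str.rpartition was added in Python 2.5; on Python 2.4, string.split("\\")[-1] gives the same result):
string = "\\B1\\B1xxA1xxMdl1zzInoAEROzzMofIN"
string.rpartition("\\")[2]
output = 'B1xxA1xxMdl1zzInoAEROzzMofIN'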

How to find values in string, manipulate and replace them

How to find values in string, add specific value to each of them and replace output with fixed string.
import re
def _replace(content):
    #x = float(content.group(4))+20
    #y = float(content.group(6))+20
    return content.group(6)
print(re.sub('<g(\s)transform="matrix\((.*)(\s)(.*)(\s)(.*)\)\"', _replace, '<g transform="matrix(0.412445 -0.0982513 0.0982513 0.412445 -5.77618 67.0025)">'))
First off, I should repeat the usual warning about not parsing XML with regexes. It's a bad idea, and it will never work for all cases. If you're actually trying to parse the full XML document, use an XML parser.
That having been said, I'm guilty of doing quick and dirty stuff like this all the time. If you really just need a one-off solution, a simple regex can often get the job done. Just be aware that it will come back to haunt you as soon as you run into something more complex!
Next, I confess to not being much of a regex wiz, but here's how I'd modify your code snippet:
import re
def _replace(content):
    # group(2) holds the six space-separated matrix values
    values = [float(val) for val in content.group(2).split()]
    values[3] += 20
    values[5] += 100
    values = ['{0}'.format(val) for val in values]
    return content.group(1) + ' '.join(values) + content.group(3)
test_string = '<g transform="matrix(0.412445 -0.0982513 0.0982513 0.412445 -5.77618 67.0025)">'
pattern = r'(transform=\"matrix\()(.*?)(\))'
print(test_string)
print(re.sub(pattern, _replace, test_string))
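For comparison, here is a sketch of the parser-based route mentioned above, using xml.etree.ElementTree from the standard library. It assumes a well-formed element (the sample tag is closed here) and that the goal is to shift the last two matrix values, which are the translation components in an SVG matrix; the helper name shift_matrix and the offsets are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

svg = '<g transform="matrix(0.412445 -0.0982513 0.0982513 0.412445 -5.77618 67.0025)"></g>'
elem = ET.fromstring(svg)

def shift_matrix(transform, dx, dy):
    """Add dx and dy to the last two values of a 'matrix(a b c d e f)' string."""
    inner = re.match(r"matrix\((.*)\)$", transform).group(1)
    values = [float(v) for v in inner.split()]
    values[4] += dx
    values[5] += dy
    return "matrix({})".format(" ".join(str(v) for v in values))

# The parser finds the attribute; the regex only has to handle its value.
elem.set("transform", shift_matrix(elem.get("transform"), 20, 20))
print(ET.tostring(elem, encoding="unicode"))
```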

How to make string from regex and value of group

I have regexp for twitter profile url and someone's twitter profile url. I can easily extract username from url.
>>> twitter_re = re.compile(r'twitter.com/(?P<username>\w+)/')
>>> twitter_url = 'twitter.com/dir01/'
>>> username = twitter_re.search(twitter_url).groups()[0]
>>> username
'dir01'
But if I have regexp and username, how do I get url?
Regexes are not a two-way street. You can use them for parsing strings, but not for generating strings back from the result. You should probably look into another way of getting the URLs back, like basic string interpolation or URI templates (see http://code.google.com/p/uri-templates/).
If you are not looking for a general solution to convert any regex into a formatting string, but something that you can hardcode:
twitter_url = 'twitter.com/%(username)s/' % {'username': 'dir01'}
...should give you what you need.
If you want a more general (but not incredibly robust) solution:
import re
def format_to_re(format):
    # Replace Python string formatting syntax with named group re syntax.
    # The backslash in the replacement is doubled so that \w is inserted literally.
    return re.compile(re.sub(r'%\((\w+)\)s', r'(?P<\1>\\w+)', format))
twitter_format = 'twitter.com/%(username)s/'
twitter_re = format_to_re(twitter_format)
m = twitter_re.search('twitter.com/dir01/')
print(m.groupdict())
print(twitter_format % m.groupdict())
Gives me:
{'username': 'dir01'}
twitter.com/dir01/
And finally, the slightly larger and more complete solution that I have been using myself can be found in the Pattern class here.
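As a middle ground, the short version can be made a little stricter by escaping the literal parts of the format string, so that a dot only matches a dot. This is a sketch under the same assumptions as the short version (placeholders of the form %(name)s, values matching \w+), not the Pattern class mentioned above.

```python
import re

PLACEHOLDER = re.compile(r"%\((\w+)\)s")

def format_to_re(fmt):
    """Build a regex from a %(name)s format string, escaping the literal text."""
    pieces = []
    pos = 0
    for m in PLACEHOLDER.finditer(fmt):
        pieces.append(re.escape(fmt[pos:m.start()]))          # literal text
        pieces.append(r"(?P<{}>\w+)".format(m.group(1)))      # named group
        pos = m.end()
    pieces.append(re.escape(fmt[pos:]))
    return re.compile("".join(pieces))

twitter_format = "twitter.com/%(username)s/"
twitter_re = format_to_re(twitter_format)
m = twitter_re.search("twitter.com/dir01/")
print(m.groupdict())
print(twitter_format % m.groupdict())
print(twitter_re.search("twitterXcom/dir01/"))   # the escaped dot no longer matches "X"
```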
Why do you need a regex for that? Just concatenate the strings.
base_url = "twitter.com/"
twt_handle = "dir01"
twit_url = base_url + twt_handle + "/"
