Regex to find values between 2 strings - python

I have a string: a = '*1357*0123456789012345678901234567890123456789*2468*'
I want to find the value between 1357 and 2468, which is 0123456789012345678901234567890123456789.
I want to use a regex or an easier method to extract the value.
I tried re.findall(r'1357\.(.*?)2468', a), but I don't know what I'm doing wrong.

You have a couple of problems here:
You're escaping the . after 1357, which matches a literal ., which isn't what you meant to have.
You aren't handling the * characters (which do need to be escaped, of course).
To make a long story short:
re.findall(r'1357\*(.*?)\*2468', a)
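A quick sanity check of the corrected pattern, runnable as-is:
import re

a = '*1357*0123456789012345678901234567890123456789*2468*'
print(re.findall(r'1357\*(.*?)\*2468', a))
# ['0123456789012345678901234567890123456789']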

If you want a slightly more general or flexible method, you can use this:
re.findall(r'\*\d+\*(\d+)\*\d+\*',a)
Which gives you the same output:
['0123456789012345678901234567890123456789']
The advantage is that it extracts the value between any pair of numeric markers surrounded by *. For instance, this works for your string, but also for the string a = '*0101*0123456789012345678901234567890123456789*0*', etc.
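For instance, a quick check of the more general pattern against both of those strings:
import re

for s in ('*1357*0123456789012345678901234567890123456789*2468*',
          '*0101*0123456789012345678901234567890123456789*0*'):
    print(re.findall(r'\*\d+\*(\d+)\*\d+\*', s))
# ['0123456789012345678901234567890123456789'] is printed for both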

Related

Python: is there a way to count the number of string inserts into a string?

Say there's a string in Python like
'f"{var_1} is the same as {var_2}"'
or something like
"{} is the same as {}".format(var_1, var_2)
Is there a way to count the number of insertion strings that exist in the string?
I'm trying to create a function that counts the number of insertions in a string. I have code for generating a middle name that can produce either one or two of them, and to keep the code consistent I'd rather have it count how many insertion points exist in the string.
You could use a regular expression:
import re
s = 'f"{var_1} is the same as {var_2}"'
len(list(re.finditer(r'{.+?}', s)))
output:
2
For simple cases you can just count the number of opening braces:
nsubst = "{var_1} is the same as {var_2}".count("{")
For complex cases this doesn't work, however, and there is no easy solution: you would need a full parser for the format syntax, or you would have to handle quite a few special cases (escaped braces, for example, or nested field substitutions inside format specs). Moreover, f-strings allow a fairly large subset of valid Python expressions as fields, including literal nested dictionaries, so things get even more complex.
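If your templates are always str.format-style strings (not arbitrary f-string expressions), one option worth knowing about is string.Formatter.parse, which already understands escaped braces; a small sketch (the helper name count_fields is mine):
from string import Formatter

def count_fields(template):
    # Count only real replacement fields; '{{' and '}}' come back as literal
    # text from Formatter.parse, so escaped braces are not counted.
    return sum(1 for _, field_name, _, _ in Formatter().parse(template)
               if field_name is not None)

print(count_fields('{var_1} is the same as {var_2}'))  # 2
print(count_fields('{{literal}} {x}'))                 # 1
print('{{literal}} {x}'.count('{'))                    # 3 - this is where brace counting breaks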

Python - find substring and then replace all characters within it

Let's say I have this string:
<div>Object</div><img src=#/><p> In order to be successful...</p>
I want to substitute every character between < and > with a #.
So, after some operation, I want my string to look like:
<###>Object<####><##########><#> In order to be successful...<##>
Notice that every character between the two symbols was replaced with a # (including whitespace).
This is the closest I could get:
r = re.sub('<.*?>', '<#>', string)
The problem with my code is that all characters between < and > are replaced by a single #, whereas I would like every individual character to be replaced by a #.
I tried a mixture of various back references, but to no avail. Could someone point me in the right direction?
What about...:
def hashes(mo):
    replacing = mo.group(1)
    return '<{}>'.format('#' * len(replacing))
and then
r = re.sub(r'<(.*?)>', hashes, string)
The ability to use a function as the second argument to re.sub gives you huge flexibility in building up your substitutions (and, as usual, a named def results in much more readable code than any cramped lambda -- you can use meaningful names, normal layouts, etc, etc).
The re.sub function can be called with a function as the replacement, rather than a new string. Each time the pattern is matched, the function will be called with a match object, just like you'd get using re.search or re.finditer.
So try this:
re.sub(r'<(.*?)>', lambda m: "<{}>".format("#" * len(m.group(1))), string)
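For reference, running the lambda version against the string from the question reproduces the expected output:
import re

string = '<div>Object</div><img src=#/><p> In order to be successful...</p>'
print(re.sub(r'<(.*?)>', lambda m: '<{}>'.format('#' * len(m.group(1))), string))
# <###>Object<####><##########><#> In order to be successful...<##>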

Pyparsing delimited list only returns first element

Here is my code :
l = "1.3E-2 2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser,delim='\t ')
print(grammar.parseString(l))
It returns :
['1.3E-2']
Obviously, I want both values, not just a single one. Any idea what is going on?
As #dawg explains, delimitedList is intended for cases where you have an expression with separating non-whitespace delimiters, typically commas. Pyparsing implicitly skips over whitespace, so in the pyparsing world, what you are really seeing is not a delimitedList, but OneOrMore(realnumber). Also, parseString internally calls str.expandtabs on the provided input string, unless you use the parseWithTabs=True argument. Expanding tabs to spaces helps preserve columnar alignment of data when it is in tabular form, and when I originally wrote pyparsing, this was a prevalent use case.
If you have control over this data, then you might want to use a different delimiter than <TAB>, perhaps commas or semicolons. If you are stuck with this format, but determined to use pyparsing, then use OneOrMore.
As you move forward, you will also want to be more precise about the expressions you define and the variable names that you use. The name "parser" is not very informative, and the pattern Word(alphanums + '+-.') will match a lot of things besides valid real values in scientific notation. I understand if you are just trying to get anything working; this is a reasonable first cut, and you can come back and tune it once you get something going. If in fact you are going to be parsing real numbers, here is an expression that might be useful:
realnum = Regex(r'[+-]?\d+\.\d*([eE][+-]?\d+)?').setParseAction(lambda t: float(t[0]))
Then you can define your grammar as "OneOrMore(realnum)", which is also a lot more self-explanatory. And the parse action will convert your strings to floats at parse time, which will save you a step later when actually working with the parsed values.
Good luck!
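Putting those pieces together, a minimal sketch of the OneOrMore approach (whitespace between the numbers is skipped implicitly, so no delimiter is needed):
from pyparsing import OneOrMore, Regex

realnum = Regex(r'[+-]?\d+\.\d*([eE][+-]?\d+)?').setParseAction(lambda t: float(t[0]))
grammar = OneOrMore(realnum)
print(grammar.parseString('1.3E-2 2.5E+1'))
# [0.013, 25.0]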
Works if you switch to raw strings:
l = r"1.3E-2\t2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser, delim=r'\t')
print(grammar.parseString(l))
Prints:
['1.3E-2', '2.5E+1']
In general, delimitedList works with something like PDPDP where P is the parse target and D is the delimiter or delimiting sequence.
You have delim='\t '. That specifically is a delimiter of 1 tab followed by 1 space; it is not either tab or space.
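For comparison, a small sketch of the kind of input delimitedList is actually designed for, with an explicit non-whitespace delimiter:
from pyparsing import Word, delimitedList, nums

csv_values = delimitedList(Word(nums + '.'), delim=',')
print(csv_values.parseString('1.5, 2, 3.25'))
# ['1.5', '2', '3.25']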

Regex named conditional lookahead (in Python)

I'm hoping to match the beginning of a string differently based on whether a certain block of characters is present later in the string. A very simplified version of this is:
re.search("""^(?(pie)a|b)c.*(?P<pie>asda)$""", 'acaaasda')
Where, if <pie> is matched, I want to see a at the beginning of the string, and if it isn't then I'd rather see b.
I'd use normal numerical lookahead but there's no guarantee how many groups will or won't be matched between these two.
I'm currently getting error: unknown group name. The sinking feeling in my gut tells me that this is because what I want is impossible (look-ahead to named groups isn't exactly a feature of a regular language parser), but I really really really want this to work -- the alternative is scrapping 4 or 5 hours' worth of regex writing and redoing it all tomorrow as a recursive descent parser or something.
Thanks in advance for any help.
Unfortunately, I don't think there is a way to do what you want to do with named groups. If you don't mind duplication too much, you could duplicate the shared conditions and OR the expressions together:
^(ac.*asda|bc.*)$
If it is a complicated expression you could always use string formatting to share it (rather than copy-pasting the shared part):
common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)
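A quick check of what that formatted pattern expands to and how it behaves:
import re

common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)
print(final_regex)                               # ^(ac.*asda|bc.*)$
print(bool(re.search(final_regex, 'acaaasda')))  # True
print(bool(re.search(final_regex, 'acaaa')))     # False - without 'asda' the string must start with 'b'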
You can use something like this:
^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$
or without .*$ if you don't need it.
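For example, the lookahead version captures the named group when it is present and rejects an 'a' start when it is not:
import re

pattern = r'^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$'
m = re.search(pattern, 'acaaasda')
print(m.group(0), m.group('pie'))            # acaaasda asda
print(re.search(pattern, 'acaaa'))           # None - leading 'a' requires 'asda' at the end
print(re.search(pattern, 'bcaaa').group(0))  # bcaaa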

Python efficient mass replacing unknown characters

A PHP4+MySQL4 based project posts to a Django 1.1 project and it mixes up some letters.
What is the best (most efficient) way to do replacements of this kind?
The problem for me is that I cannot get the values for those letters. Is there an online tool to do that?
I have a textField with various letters and I want to replace them in this fashion:
àèæëáðøûþ => ąčęėįšųūž
ÀÈÆËÁÐØÛÞ => ĄČĘĖĮŠŲŪŽ
I had a similar case where I had to clean up the code, so I used this:
def clean(string):
    return ''.join([c for c in string if ord(c) > 31 or ord(c) in [9, 10, 13]])
Update: I managed to extract the Unicode values by looking at the Django debug messages (replace_from: replace_to):
{'\xe0':'\u0105', '\xe8':'\u010d', '\xe6':'\u0119', '\xeb':'\u0117', '\xe1':'\u012f',
 '\xf0':'\u0161', '\xf8':'\u0179', '\xfb':'\u016b', '\xfe':'\u017e',
 '\xc0':'\u0104', '\xc8':'\u010c', '\xc6':'\u0118', '\xcb':'\u0116', '\xc1':'\u012e',
 '\xd0':'\u0160', '\xd8':'\u0172', '\xdb':'\u016a', '\xde':'\u017d'}
So the main problem remains: the replacing itself.
Try the str.replace() method; it should work with Unicode strings.
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
Make sure your old and new strings are of type Unicode (that applies to your input data as well). Find out what encoding your input (non-Unicode) string is supposed to be in; for example, it may be latin1. Use the built-in str.decode() method to create a Unicode version of your data, and feed that to str.replace().
>>> unioldchars = oldchars.decode("latin1")
>>> newdata = data.replace(unioldchars, newchars)
I'd do it myself. The built-in replace function is of little use if you want multiple, efficient replacements.
Give this a look: http://code.activestate.com/recipes/81330-single-pass-multiple-replace/
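The core idea of that recipe, sketched here for modern Python 3 strings (multi_replace is a hypothetical helper name, not part of the recipe):
import re

def multi_replace(text, mapping):
    # Build one alternation from all the keys and look each match up in the
    # dict, so the whole string is scanned only once.
    pattern = re.compile('|'.join(re.escape(k) for k in mapping))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

print(multi_replace('àèæ ÀÈÆ', {'à': 'ą', 'è': 'č', 'æ': 'ę', 'À': 'Ą', 'È': 'Č', 'Æ': 'Ę'}))
# ąčę ĄČĘ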
EDIT: WAIT, you wanted to do the replacement client-side, like in the text-box?
string.translate(s, table[, deletechars])
Delete all characters from s that are in deletechars (if present), and then translate the characters using table, which must be a 256-character string giving the translation for each character value, indexed by its ordinal. If table is None, then only the character deletion step is performed.
See also http://docs.python.org/library/string.html#string.maketrans
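The question and quote above predate Python 3; for completeness, a sketch of the same character-for-character mapping with Python 3's str.maketrans/str.translate, which also does the whole replacement in a single pass:
# Build a translation table mapping each source character to its replacement.
table = str.maketrans('àèæëáðøûþÀÈÆËÁÐØÛÞ', 'ąčęėįšųūžĄČĘĖĮŠŲŪŽ')
print('àèæëáðøûþ ÀÈÆËÁÐØÛÞ'.translate(table))
# ąčęėįšųūž ĄČĘĖĮŠŲŪŽ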
