How to get the index length of an array in Python - python

I have Python code like
for i in re.finditer('something(.+?)"', html):
I am trying to find out how many times it is going to loop before entering the loop, in other words, the length of the array that i iterates over.
Could anyone give me alternative but similar code with which I can get the length of the loop?

x = list(re.finditer('something(.+?)"', html))
if len(x):
    ....
for i in x:
    ....
findall is not an adequate replacement since it returns strings, not match objects.
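A minimal sketch of the list-based approach, using a made-up html string (the sample text and the variable names are assumptions for illustration). Materializing the iterator makes the count available before the loop while still keeping the match objects:

```python
import re

# Hypothetical sample input; any string works here.
html = 'something one" filler something two" something three"'

# Turn the lazy iterator into a list so it can be counted before looping.
matches = list(re.finditer(r'something(.+?)"', html))
count = len(matches)

# The elements are still match objects, so group() and start() remain available.
captured = [m.group(1) for m in matches]
print(count, captured)
```

The trade-off is memory: every match object is held at once, which is usually fine for a single page of HTML.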

You can't do that with re.finditer because it returns an iterator, which doesn't know it is finished until it runs out (it finds the next match on each iteration). You'll have to use re.findall:
matches = re.findall('something(.+?)"', html)
num_loops = len(matches)
or use @thg435's approach if you do in fact need the match objects.

finditer returns the results as it finds them. There is no way finditer can tell you how many times you will loop in advance.
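You need to use something else, such as re.findall, whose result is a list you can pass to len(). Note that re.search returns only the first match, so it can tell you whether there is a match but not how many.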
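If only the number is needed and the matches should not be stored, one option is to count by consuming a finditer iterator. This is a sketch with a made-up input string, not the only way to do it:

```python
import re

# Hypothetical sample input.
html = 'something a" something b"'
pattern = re.compile(r'something(.+?)"')

# Counting consumes the iterator, so no match objects are kept around.
total = sum(1 for _ in pattern.finditer(html))

# The consumed iterator is empty now; call finditer again to loop over the matches.
values = [m.group(1) for m in pattern.finditer(html)]
print(total, values)
```

This scans the text twice, so it trades time for memory compared with building a list.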

Related

If I am using re.finditer(), how can I limit the number of occurrences in a Python regex?

I am working on a Python script that needs to go through a file and find certain paragraphs. I am able to match the pattern using regex; however, the same paragraph occurs more than once. I simply need the first occurrence of the paragraph to be printed.
Is there anything that I could add to my regular expression so that it only returns the first occurrence?
This is my regex thus far: pattern = re.compile(//#|//\s#).+[\S\s]. Then I did matches = pattern.finditer(file_name), and lastly I looped over the matches and printed each one with print(i.group()). Note: the reason I used finditer() instead of findall() is that I need the output printed as a string rather than a list.
Any guidance as to how I can tweak my current approach to only yield the first matched paragraph would be great!
You might simply use .search rather than .finditer, for example:
import re
text = 'A1B2C3'
pattern = re.compile(r'([0-9])')
found = pattern.search(text).group()
print(found) # 1
print(isinstance(found,str)) # True
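Two small additions to the example above, as a sketch: search returns None when nothing matches, so a guard avoids an AttributeError, and itertools.islice can cap finditer at any number of occurrences. The limit of 2 is an arbitrary value chosen for illustration:

```python
import re
from itertools import islice

text = 'A1B2C3'
pattern = re.compile(r'[0-9]')

# search returns None when nothing matches, so check before calling group().
first = pattern.search(text)
first_value = first.group() if first else None

# islice stops pulling from the lazy iterator after `limit` matches.
limit = 2  # arbitrary cap for this example
first_two = [m.group() for m in islice(pattern.finditer(text), limit)]
print(first_value, first_two)
```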

Python regex findall() returns unwanted substrings (including the correct answer)

I have to write a Python function which gets a line of code as input and returns True if that line contains a ternary operator (and counts them!), else False. I wrote a few versions of the regex which worked perfectly on this site https://regexr.com/, but on Google Colab, for example, none of them worked.
import re

def ternaryOp(line):
    found_operator=re.findall(r'(((=|==|<|>|<=|>=|!=)[\s\t]*)?[\s\t]*.+[\s\t]*\?[\s\t]*((.+:.*)|(.*:.+)))',line)
    if found_operator:
        print(len(found_operator))
        print(found_operator)
        return True
    else:
        return False
ternaryOp('category=age<18?child:adult')
Expected result:
1
[('category=age<18?child:adult')]
True
Actual result:
6
[('category=age<18?child:adult', '', '', 'child:adult', 'child:adult', '')]
True
It's doing exactly what it's supposed and documented to do:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Your regex has 6 capture groups, therefore each match is a 6-tuple, with each element of the tuple being a capture group. Either work with that, or use non-capturing groups ((?:pattern)) for groups you don't specifically care about, or use re.finditer which yields match objects and thus much richer and more flexible results.
Incidentally, you're working very inefficiently: if you just want to know whether a pattern can be found in a string, use re.search (or re.match, which only matches at the start of the string). The code you posted has no need for the capabilities of findall, since you're just checking whether it found anything.
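The effect of groups on findall can be seen with a deliberately simplified pattern. The patterns below are illustrative assumptions, not the asker's regex and not a complete ternary detector:

```python
import re

line = 'category=age<18?child:adult'

# One capture group: findall returns the group's text, not the whole match.
with_group = re.findall(r'(\w+)\?\w+:\w+', line)

# Non-capturing group: findall returns the whole match as a plain string.
without_group = re.findall(r'(?:\w+)\?\w+:\w+', line)

# finditer yields match objects; group(0) is always the whole match,
# no matter how many capture groups the pattern contains.
whole = [m.group(0) for m in re.finditer(r'(\w+)\?(\w+):(\w+)', line)]
print(with_group, without_group, whole, len(whole))
```

Counting the matches is then just len() on the findall result, or on a list built from finditer.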
I think I have found the solution which works for me. Thank you all!
re.findall(r'(?:(?:=|==|<|>|<=|>=|!=)?[\s\t]*[\s\t]*[^?:]+[\s\t]*\?[\s\t]*(?:.*?:[^ ]*))', line)

Regex subsequence matching

I'm using python but code in any language will do as well for this question.
Suppose I have 2 strings.
sequence ='abcd'
string = 'axyzbdclkd'
In the above example sequence is a subsequence of string
How can I check if sequence is a subsequence of string using regex? Also check the examples here for difference in subsequence and subarray and what I mean by subsequence.
The only thing I could think of is this, but it's far from what I want.
import re
c = re.compile('abcd')
c.match('axyzbdclkd')
Just allow arbitrary strings in between:
c = re.compile('.*a.*b.*c.*d.*')
# .* any character, zero or more times
You can, for an arbitrary sequence construct a regex like:
import re
sequence = 'abcd'
rgx = re.compile('.*'.join(re.escape(x) for x in sequence))
which will, for 'abcd', result in the regex 'a.*b.*c.*d'. You can then use rgx.search(..):
the_string = 'axyzbdclkd'
if rgx.search(the_string):
    # ... the sequence is a subsequence.
    pass
By using re.escape(..) you know for sure that for instance '.' in the original sequence will be translated to '\.' and thus not match any character.
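Put together, the idea can be wrapped in a small helper. The function name is_subsequence is made up for this sketch, and passing re.DOTALL is an added assumption so that the gaps may also span newlines:

```python
import re

def is_subsequence(sequence, string):
    """Return True if the characters of `sequence` appear in `string` in order."""
    # Escape each character so regex metacharacters such as '.' match literally.
    pattern = '.*'.join(re.escape(ch) for ch in sequence)
    # DOTALL (an assumption of this sketch) lets '.*' cross line breaks too.
    return re.search(pattern, string, flags=re.DOTALL) is not None

print(is_subsequence('abcd', 'axyzbdclkd'))
print(is_subsequence('abcd', 'dcba'))
print(is_subsequence('a.c', 'abc'))
```

The last call shows why the escaping matters: without re.escape, the '.' in 'a.c' would match the 'b' in 'abc'.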
I don't think the solution is as simple as @schwobaseggl claims. Let me show you another sequence from your database: ab1b2cd. By using the abcd subsequence for pattern matching you can get 2 results: ab(1b2)cd and a(b1)b(2)cd. So for testing purposes the proposed ^.*a.*b.*c.*d.*$ is ok(ish), but for parsing, ^a(.*)b(.*)cd$ is always greedy and yields the second result, a(b1)b(2)cd. To get the first result, ab(1b2)cd, you need to make the first group lazy: ^a(.*?)b(.*)cd$.
So if you need this for parsing, you should know how many variables are expected. To optimize the regex pattern, parse a few example strings and put gaps with capturing groups only at the positions where you really need them. An advanced version of this would inject the pattern of the actual variable instead of .*, so for example ^ab(\d\w\d)cd$ or ^a(\w\d)b(\d)cd$ in the second case.
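The two ways of splitting ab1b2cd can be checked directly. This sketch only runs the two patterns quoted above against that one string:

```python
import re

text = 'ab1b2cd'

# Greedy first group: (.*) takes as much as it can, so it stops at the last 'b'.
greedy = re.match(r'^a(.*)b(.*)cd$', text).groups()

# Lazy first group: (.*?) takes as little as it can, so it stops at the first 'b'.
lazy = re.match(r'^a(.*?)b(.*)cd$', text).groups()

print(greedy)  # ('b1', '2'), i.e. a(b1)b(2)cd
print(lazy)    # ('', '1b2'), i.e. ab(1b2)cd
```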

Python Regex returns me the value with parentheses

I'm trying to run this code:
picture = re.search("#4F9EFF;\"><img src=\"(.+?)\" width=\"120\" height=\"90\"", data)
and when I do print picture.groups(1)
it returns the value, but in parentheses. Why?
Output:
('http://sample.com/img/file.jpg',)
The result of groups() is a tuple containing one element. You can access the string (which is the first captured group) as output[0]. The important part is the comma after the string.
BUT
DON'T PARSE HTML WITH REGEX
You should use a proper HTML parser. This will save you innumerable headaches in the future, when your regex fails to match or gets too much. Look into BeautifulSoup or lxml.
Notice the comma before the closing parenthesis? This is a tuple (albeit one with just one element in it).
As the documentation for MatchObject.groups() says:
groups([default])
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
As noted by other posters, you want to use MatchObject.group() instead.
You should be using
picture.group(1)
not groups() in plural if you're only looking for one specific group. groups() always returns a tuple, group() is the one you're looking for.
groups() returns a tuple of all the groups. You want picture.group(1), which returns the string that matched group 1.
As the groups help says, it returns "a tuple containing all the subgroups of the match".
If you want a single group use the group method.
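A short comparison of the two methods, using a made-up data string and a shortened pattern (both are assumptions for illustration). Note that the argument in groups(1) is not a group number: it is the default value used for groups that did not take part in the match.

```python
import re

# Hypothetical input resembling the markup in the question.
data = '<img src="http://sample.com/img/file.jpg" width="120" height="90">'
match = re.search(r'<img src="(.+?)" width="120" height="90"', data)

all_groups = match.groups()   # tuple with one entry per capture group
first_group = match.group(1)  # the string captured by group 1

print(all_groups)   # ('http://sample.com/img/file.jpg',)
print(first_group)  # http://sample.com/img/file.jpg
```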

How to get a list of character positions in Python?

I'm trying to write a function to sanitize unicode input in a web application, and I'm currently trying to reproduce the PHP function at the end of this page : http://www.iamcal.com/understanding-bidirectional-text/
I'm looking for an equivalent of PHP's preg_match_all in Python. The re function findall returns matches without positions, and search only returns the first match. Is there any function that would return every match along with its position in the text?
With a string abcdefa and the pattern a|c, I want to get something like [('a',0),('c',2),('a',6)]
Thanks :)
Try:
import re
text = 'abcdefa'
pattern = re.compile('a|c')
[(m.group(), m.start()) for m in pattern.finditer(text)]
I don't know of a way to get re.findall to do this for you, but the following should work:
Use re.findall to find all the matching strings.
Use str.index to find the associated index of each string returned by re.findall. However, be careful when you do this: if the same substring occurs at two distinct locations, re.findall will return both, so you'll need to tell str.index that you're looking for the second (or nth) occurrence. Otherwise, it will return an index that you already have. The best way I can think of to do this would be to maintain a dictionary that has the strings from the result of re.findall as keys and a list of indices as values.
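The two steps above can be sketched as follows. Instead of a dictionary, this variant passes a start offset to str.index so that repeated substrings are not reported twice. The helper name find_with_positions is made up, and the sketch assumes a pattern without capture groups or lookarounds that never matches the empty string; outside those assumptions the finditer approach is the safer choice.

```python
import re

def find_with_positions(pattern, text):
    """Pair each re.findall result with its index, scanning left to right."""
    results = []
    start = 0
    for found in re.findall(pattern, text):
        # Search only from the end of the previous match onward.
        index = text.index(found, start)
        results.append((found, index))
        start = index + len(found)
    return results

pairs = find_with_positions('a|c', 'abcdefa')
print(pairs)  # [('a', 0), ('c', 2), ('a', 6)]
```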
Hope this helps
