I was wondering if any of the following exist in python:
A: non-regex equivalent of "re.findall()".
B: a way of neutralizing regex special characters in a variable before passing to findall().
I am passing a variable to re.findall, which runs into problems when the variable contains a period, a slash, a caret, etc., because I would like these characters to be interpreted literally. I realize it is not necessary to use regex to do this job, but I like the behavior of re.findall() because it returns a list of every match it finds. This allows me to easily count how many times the substring occurs by using len().
Here's an example of my code:
>>substring_matches = re.findall(randomVariableOfCharacters, document_to_be_searched)
>>
>>#^^ will return something like ['you', 'you', 'you']
>>#but could also return something like ['end.', 'end.', 'ends']
>>#if my variable is 'end.' because "." is a wildcard.
>>#I would rather it return ['end.', 'end.']
>>
>>occurrences_of_substring = len(substring_matches)
I'm hoping to not have to use string.find(), if possible. Any help and/or advice is greatly appreciated!
You can use str.count() if you only want the number of occurrences, but it is not equivalent to re.findall(); it only returns the count.
document_to_be_searched = "blabla bla bla."
numOfOcur = document_to_be_searched.count("bl")
Sure: looking at your code, I think that what you're looking for is str.count.
>>> 'abcdabc'.count('abc')
2
Note, however, that this is not equivalent to re.findall, although it looks more appropriate in your case.
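For option B in the question, the standard library's re.escape() backslash-escapes regex metacharacters so that the variable is matched literally. A minimal sketch, reusing the variable names from the question; the sample document text is made up for illustration:

```python
import re

# Hypothetical sample data, not from the original post.
document_to_be_searched = "The end. Then it ends, and that is the end."
randomVariableOfCharacters = "end."

# re.escape makes the "." match a literal period instead of any character.
substring_matches = re.findall(
    re.escape(randomVariableOfCharacters), document_to_be_searched
)
occurrences_of_substring = len(substring_matches)

print(substring_matches)
print(occurrences_of_substring)
```

Both this and str.count() count non-overlapping occurrences, so for a plain literal substring they should agree.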
I'm trying to search strings using variables with regular expression operations. I browsed around and found this useful code:
s = "These are oranges and apples and pears, but not pinapples or .."
r = re.compile(r'\bAND\b | \bOR\b | \bNOT\b', flags=re.I | re.X)
r.findall(s)
['and', 'and', 'not', 'or'] #result
This code uses the exact string values 'AND', 'OR' and 'NOT'. What should I do if I have something like this?
a = 'AND'
b = 'OR'
(I'm getting these string values by running a loop.) This code uses '|' (or) and re.findall(). What should I do if I need to search for both a and b and use re.search()?
Note: I think I need to use r'\bfoo\b' because sometimes my matches will look like 'foo.', '(foo)' or 'cod.foo'. Because of this I can't use a condition like if a in s: or if a and b in s:. Please give some suggestions. Thank you.
If you want to search for both variables, you can call re.search twice. Something like this:
if((re.search(rf"\b(?=\w){(a)}\b(?!\w)", s, re.IGNORECASE)) and (re.search(rf"\b(?=\w){(b)}\b(?!\w)", s, re.IGNORECASE))):
Hope it helps
I might not know exactly what you are trying to do, but if your intention is to use variables inside a regex, then remember that a regex, before it is sent to re.compile, is just plain text. So you can do with it everything you can do with text, like:
re.compile(f"\\b({a}|{b})\\b")
or in older python:
re.compile("\\b(" + a + "|" + b + ")\\b")
You are not restricted to using r"text" to define regex patterns.
I think you need something like this (a raw string, so that \b means a word boundary rather than a backspace character), and add the "case insensitive" flag too:
pattern = r'\b(and|or|not)\b'
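The answers above can be combined into one sketch that builds the pattern from variables. The keywords list is a hypothetical stand-in for the values collected in the asker's loop, and re.escape() is added on the assumption that the keywords should be matched literally. Note that \b only behaves as intended when each keyword starts and ends with a word character.

```python
import re

s = "These are oranges and apples and pears, but not pinapples or .."
keywords = ["AND", "OR"]  # hypothetical: values gathered in a loop

# One alternation for findall, equivalent to \b(?:AND|OR)\b
combined = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    re.IGNORECASE,
)
found = combined.findall(s)

# One re.search per keyword, to require that every keyword occurs in s
all_present = all(
    re.search(r"\b" + re.escape(k) + r"\b", s, re.IGNORECASE) for k in keywords
)

print(found)
print(all_present)
```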
For example, if the string is really2, I want to replace it with really-really.
I want the duplicated words hyphenated as well.
Can I avoid using RegEx?
Thanks so much!
Though you can do it without regex, it would be much easier to do it with regex. The code will be more readable and easier to modify, if needed.
import re
s = "if the string is really2, I want to replace it"
re.sub(r'(\w+)2', r'\1-\1', s)
# 'if the string is really-really, I want to replace it'
Don't need regex for this:
>>> a = 'blah foo2 bar'
>>> ' '.join((i[:-1]+'-'+i[:-1]) if i.endswith('2') else i for i in a.split())
'blah foo-foo bar'
>>>
It might get more complicated if you decide to include all other numbers and repeat the word several times. But since you only asked about duplicating -- this one works well enough.
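If other numbers should be supported as well, one possible generalization passes a function to re.sub. This is only a sketch: the name expand_repeats is made up here, and it assumes the words consist of ASCII letters followed directly by the repeat count.

```python
import re

def expand_repeats(text):
    """Replace e.g. 'really3' with 'really-really-really'."""
    def repl(match):
        word, count = match.group(1), int(match.group(2))
        if count < 1:
            # Leave tokens such as 'foo0' untouched.
            return match.group(0)
        return "-".join([word] * count)

    # Letters followed by digits, as a whole word.
    return re.sub(r"\b([A-Za-z]+)(\d+)\b", repl, text)

print(expand_repeats("if the string is really2, I want to replace it"))
print(expand_repeats("blah foo3 bar"))
```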
Programming in Python3.
I am having difficulty in controlling whether a string meets a specific format.
So, I know that Python does not have a .contain() method like Java but that we can use regex.
My code hence will probably look something like this, where lowpan_headers is a dictionary with a field that is a string that should meet a specific format.
So the code will probably be like this:
import re
lowpan_headers = self.converter.lowpan_string_to_headers(lowpan_string)
pattern = re.compile("^([A-Z][0-9]+)+$")
pattern.match(lowpan_headers[dest_addrS])
However, my issue is in the format and I have not been able to get it right.
The format should be like bbbb00000000000000170d0000306fb6, where the first 4 characters should be bbbb and all the rest, with that exact length, should be hexadecimal values (so from 0-9 and a-f).
So two questions:
(1) any easier way of doing this except through importing re
(2) If not, can you help me out with the regex?
As for the regex you're looking for I believe that
^bbbb[0-9a-f]{28}$
should validate correctly for your requirements.
As for whether there is an easier way than using the re module, I would say there isn't really one for what you're trying to achieve. While the in keyword in Python works the way you would expect a contains method to work for a string, you actually want to know whether a string is in the correct format. As such, the best solution, as it is relatively simple, is to use a regular expression, and thus the re module.
Here is a solution that does not use regex:
lowpan_headers = 'bbbb00000000000000170d0000306fb6'
if lowpan_headers[:4] == 'bbbb' and len(lowpan_headers) == 32:
    try:
        int(lowpan_headers[4:], 16)  # tries interpreting the last 28 characters as hexadecimal
        print('Input is valid!')
    except ValueError:
        print('Invalid Input')  # hex test failed!
else:
    print('Invalid Input')  # either length test or 'bbbb' prefix test failed!
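One caveat with the int() approach: int(text, 16) also accepts an optional sign, surrounding whitespace, a 0x prefix and underscores between digits, so it is slightly more lenient than the format described in the question. A stricter regex-free sketch, assuming only lowercase hex digits are allowed (the names HEX_DIGITS and is_valid_address are made up for this example):

```python
HEX_DIGITS = set("0123456789abcdef")

def is_valid_address(value):
    """True if value is 'bbbb' followed by exactly 28 lowercase hex digits."""
    return (
        len(value) == 32
        and value.startswith("bbbb")
        and set(value[4:]) <= HEX_DIGITS  # every remaining character is a hex digit
    )

print(is_valid_address("bbbb00000000000000170d0000306fb6"))
print(is_valid_address("bbbb0x" + "0" * 26))  # passes int(..., 16), rejected here
```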
In fact, Python does have an equivalent to the .contains() method. You can use the in operator:
if 'substring' in long_string:
    return True
A similar question has already been answered here.
For your case, however, I'd still stick with regex, as you're indeed trying to validate a certain string format. To ensure that your string only contains hexadecimal characters, i.e. 0-9 and a-f, the following regex should do it: ^[a-fA-F0-9]+$. The additional "complication" is the four 'b' characters at the start of your string. I think an easy fix would be to include them as follows: ^(bbbb)?[a-fA-F0-9]+$.
>>> import re
>>> pattern = re.compile('^(bbbb)?[a-fA-F0-9]+$')
>>> test_1 = 'bbbb00000000000000170d0000306fb6'
>>> test_2 = 'bbbb00000000000000170d0000306fx6'
>>> pattern.match(test_1)
<_sre.SRE_Match object; span=(0, 32), match='bbbb00000000000000170d0000306fb6'>
>>> pattern.match(test_2)
>>>
The part that is currently missing is checking for the exact length of the string for which you could either use the string length method or extend the regex -- but I'm sure you can take it from here :-)
As I mentioned in the comment, Python does have a contains() equivalent.
if "blah" not in somestring:
    continue
(source) (PythonDocs)
If you would prefer to use a regex instead to validate your input, you can use this:
^b{4}[0-9a-f]{28}$ - Regex101 Demo with explanation
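For completeness, here is how such a pattern might be used from Python. This is a sketch that assumes the bbbb prefix is mandatory and that only lowercase hex digits are valid; ADDRESS_RE and matches_format are names invented for the example. re.fullmatch() is used instead of ^...$ because $ also matches just before a trailing newline.

```python
import re

# 'bbbb' followed by exactly 28 lowercase hex digits (32 characters in total)
ADDRESS_RE = re.compile(r"bbbb[0-9a-f]{28}")

def matches_format(value):
    """True only if the whole string matches the pattern."""
    return ADDRESS_RE.fullmatch(value) is not None

print(matches_format("bbbb00000000000000170d0000306fb6"))
print(matches_format("bbbb00000000000000170d0000306fx6"))
```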
I have a long string like this:
'[("He tended to be helpful, enthusiastic, and encouraging, even to studentsthat didn\'t have very much innate talent.\\n",), (\'Great instructor\\n\',), (\'He could always say something nice and was always helpful.\\n\',), (\'He knew what he was doing.\\n\',), (\'Likes art\\n\',), (\'He enjoys the classwork.\\n\',), (\'Good discussion of ideas\\n\',), (\'Open-minded\\n\',), (\'We learned stuff without having to take notes, we just applied it to what we were doing; made it an interesting and fun class.\\n\',), (\'Very kind, gave good insight on assignments\\n\',), (\' Really pushed me in what I can do; expanded how I thought about art, the materials used, and how it was visually.\\n\',)
and I want to remove all [, (, ", \, \n from this string at once. I can remove them one by one, but I always fail with '\n'. Is there an efficient way to remove or translate all these characters and newline symbols?
Since my sentences are not long, I do not want to use dictionary methods like those in earlier questions.
Maybe you could use a regex to find all the characters that you want to remove:
import re
s = s.strip().replace("\\n", "")  # remove the literal backslash-n pairs first
r = re.compile(r"[\[\]()\\\"',]")  # brackets, parentheses, backslash, quotes, comma
s = r.sub('', s)
print(s)
I had some problems with the "\n", but it is easy to remove with a plain replace before applying the regex.
If the string is a valid Python expression, you can use literal_eval from the ast module to turn the string into tuples, and after that you can process each tuple.
from ast import literal_eval
' '.join(el[0].strip() for el in literal_eval(your_string))
If not then you can use this:
import re
def get_part_string(your_string):
    for part in re.findall(r'\((.+?)\)', your_string):
        # remove literal backslash-n pairs, quotes and backslashes (but not the letter n)
        yield re.sub(r'\\n|["\'\\]', '', part).strip(', ')
''.join(get_part_string(your_string))
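Another regex-free option is str.translate() with a deletion table built by str.maketrans(). The sketch below removes the characters listed in the question plus their closing counterparts; the function name strip_chars and the shortened sample string are made up for illustration.

```python
# Characters to delete: brackets, parentheses, double quote and backslash.
UNWANTED = str.maketrans("", "", "[]()\"\\")

def strip_chars(text):
    # Drop the literal backslash-n pairs first, then delete the single characters.
    return text.replace("\\n", "").translate(UNWANTED)

# Hypothetical shortened sample in the same shape as the question's string.
raw = "[('Great instructor\\n',), ('Likes art\\n',)]"
print(strip_chars(raw))
```

Add "'" or "," to the third argument of str.maketrans() if those should go as well; keep in mind that deleting apostrophes also changes words such as "didn't".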
I am trying to make Python's str.partition function ignore case during the search, so
>>>partition_tuple = 'Hello moon'.partition('hello')
('', 'Hello', ' moon')
and
>>>partition_tuple = 'hello moon'.partition('hello')
('', 'hello', ' moon')
return as shown above.
Should I be using regular expressions instead?
Thanks,
EDIT:
Apologies, I should have been more specific. I want to find a keyword in a string, change it (by adding stuff around it) and then put it back in. My plan was to make partitions, change the middle section, and then put it all back together.
Example:
'this is a contrived example'
with keyword 'contrived' would become:
'this is a <<contrived>> example'
and I need it to perform the <<>> even if 'contrived' was spelled with a capital 'C.'
Note that any letter in the word could be capitalized, not just the starting one.
The case needs to be preserved.
Another unique point to this problem is that there can be several keywords. In fact, there can even be a key phrase. That is to say, in the above example, the keywords could have been 'a contrived' and 'contrived' in which case the output would need to look like:
'this is <<a contrived>> example.'
How about
re.split('[Hh]ello', 'Hello moon')
This gives
['', ' moon']
Now you have the pieces and you can put them back together as you please.
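And it preserves the case of the remaining pieces (the matched keyword itself is dropped unless the pattern is wrapped in a capturing group).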
[Edit]
You can put multiple keywords in one regex (but read caution below)
re.split(r'[Hh]ello | moon', 'Hello moon')
Caution: re will use the FIRST one that matches and then ignore the rest.
So, putting in multiple keywords is only useful if there is a SINGLE keyword in each target.
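As a sketch of a partition-like variant: when the pattern is wrapped in a capturing group, re.split() keeps the matched text in the result, and the re.IGNORECASE flag replaces the [Hh] character class. With maxsplit=1 the result has the same shape as str.partition() whenever the keyword is found.

```python
import re

# The capturing group keeps the separator; maxsplit=1 stops after the first match.
parts = re.split("(hello)", "Hello moon", maxsplit=1, flags=re.IGNORECASE)
print(parts)

# When nothing matches, the list contains only the original string.
no_match = re.split("(hello)", "goodbye moon", maxsplit=1, flags=re.IGNORECASE)
print(no_match)
```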
How about
'Hello moon'.lower().partition('hello')
What is the actual problem you are trying to solve using partition()?
No, partition() is case-sensitive and there is no way around it except by normalizing the primary string.
You can do this if you don't need to preserve the case:
>>> partition_tuple = 'Hello moon'.lower().partition('hello')
>>> partition_tuple
('', 'hello', ' moon')
>>>
However as you can see, this makes the resulting tuple lowercase as well. You cannot make partition case insensitive.
Perhaps more info on the task would help us give a better answer.
For example, is Bastien's answer sufficient, or does case need to be preserved?
If the string has an embedded space, you could just use the str.split(sep) function.
But I am guessing you have a more complex task in mind.
Please describe it more.
You could also do this by writing your own case_insensitive_partition which could look something like this (barely tested but it did work at least in trivial cases):
def case_partition(text, sep):
    ltext = text.lower()
    lsep = sep.lower()
    ind = ltext.find(lsep)
    if ind == -1:  # separator not found: mirror str.partition
        return (text, '', '')
    seplen = len(lsep)
    return (text[:ind], text[ind:ind+seplen], text[ind+seplen:])
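For the edited question (wrap every keyword or key phrase in <<>> while preserving its case), a regex substitution may be simpler than partitioning. A minimal sketch, assuming the keywords should be matched literally and case-insensitively; mark_keywords and its parameters are hypothetical names:

```python
import re

def mark_keywords(text, keywords, left="<<", right=">>"):
    """Wrap each case-insensitive keyword match, keeping the original case."""
    if not keywords:
        return text
    # Longer alternatives first, so a phrase is preferred over a keyword
    # that is a prefix of it when both could start at the same position.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = "|".join(re.escape(k) for k in ordered)
    return re.sub(
        pattern,
        lambda m: left + m.group(0) + right,  # m.group(0) is the text as written
        text,
        flags=re.IGNORECASE,
    )

print(mark_keywords("this is a Contrived example", ["contrived", "a contrived"]))
```

This sketch does not add word boundaries, so a keyword is also matched inside longer words; wrapping the pattern in \b...\b is one way to tighten it if that matters.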