python regex find lowercase followed by capital letter - python

I feel like I have to apologize in advance for this one, but I've searched for answers and they seem to tell me what I'm doing is correct.
I'm trying to set a DataFrame column to True if another column has instances of a lowercase letter immediately followed by an uppercase letter.
What I tried was this:
cities['multiteam'] = cities['team'].apply(lambda x: pd.notna(re.search(r'[A][a]',x)))
That's setting all the results to False, so I figured maybe I was doing something wrong with my lambda function and I made the following to debug just the re.search() part:
cities['multiteam'] = pd.notna(re.search(r'[a][A]','OneTwo'))
That's also setting all the results to False. And there I'm stuck.

The pattern in your code, [A][a], only matches the literal letter 'A' followed by the lowercase letter 'a':
cities['multiteam'] = cities['team'].apply(lambda t: pd.notna(re.search(r'[A][a]',t)))
To check for any lowercase letter immediately followed by any uppercase letter, use character ranges instead:
cities['multiteam'] = cities['team'].apply(lambda t: pd.notna(re.search(r'[a-z][A-Z]',t)))

You should never apologise for asking questions. Using apply is quite slow; try str.contains instead, which accepts a regex pattern.
cities = cities.assign(multiteam=cities.team.str.contains('[a-z][A-Z]'))
assign is one of the ways pandas offers for adding columns. It returns a new DataFrame, so the result has to be assigned back.
str.contains works with both regex patterns and fixed strings, and is usually faster than apply.
The regex pattern above matches one character in the range a-z immediately followed by one character in the range A-Z.
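The pattern itself can be checked without pandas. The following is a minimal sketch using only the re module; the sample strings and the helper name has_lower_then_upper are made up for illustration and stand in for the 'team' column.

```python
import re

# Pattern from the answers above: one lowercase ASCII letter, then one uppercase.
LOWER_THEN_UPPER = re.compile(r'[a-z][A-Z]')

def has_lower_then_upper(text):
    """Return True if text has a lowercase letter directly followed by an uppercase one."""
    return LOWER_THEN_UPPER.search(text) is not None

# Hypothetical sample values standing in for cities['team'].
teams = ['OneTwo', 'Lakers', 'JetsGiants', 'NY Mets']
flags = [has_lower_then_upper(t) for t in teams]

# The debugging pattern from the question only matches the literal text 'aA',
# which is why it never matched 'OneTwo'.
literal_match = re.search(r'[a][A]', 'OneTwo')

print(flags)
print(literal_match)
```

The same pattern string is what str.contains receives in the pandas version.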

Related

Extract values in name=value lines with regex

I'm really sorry for asking, because there are already some questions like this around, but I can't adapt their answers to my problem.
This are the input lines (e.g. from a config file)
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values after the = sign using a regex in Python 3.9.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=. I want exactly the opposite.
The expected results (what Python's re.findall() should return) are
['share2', 'share8', 'shareSSH', 'share9']
Try the pattern profile\d+\.name=(.*) (see the Regex 101 example):
import re
re.findall(r'profile\d+\.name=(.*)', txt)  # txt holds the input lines; raw string keeps \d and \. intact
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex, split should work absolutely fine:
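A minimal sketch of the split approach, assuming the input is available as one multi-line string (the variable name text is an assumption, not from the question):

```python
# Input lines from the question, held in a hypothetical variable.
text = """profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9"""

# Split each line once at the first '=' and keep the part after it.
values = [line.split('=', 1)[1] for line in text.splitlines() if '=' in line]

print(values)
```

Passing 1 as the second argument to split keeps any further = signs inside the value.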
Try removing the ? quantifier. It will make your capture group match an empty st
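The effect of the lazy quantifier can be seen by running both versions of the pattern. This is a sketch that assumes the re.MULTILINE flag is used so that ^ matches at the start of every line; the variable names are illustrative.

```python
import re

text = "\n".join([
    "profile2.name=share2",
    "profile8.name=share8",
    "profile4.name=shareSSH",
    "profile9.name=share9",
])

# Lazy group at the end of the pattern: it is satisfied by matching nothing.
lazy = re.findall(r'^profile[0-9]\.name=(.*?)', text, re.MULTILINE)

# Greedy group: it consumes the rest of the line.
greedy = re.findall(r'^profile[0-9]\.name=(.*)', text, re.MULTILINE)

print(lazy)
print(greedy)
```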

Find and replace semi-common strings in dataframe?

I am attempting to find a semi-common occurring string and remove all other data in the column. Pandas and Re have been imported. For instance, I have dataframe...
>>>df
COLUMN COUNT DATA
1 this row RA-123: data 8b43a
2 here RA-5372: data 94h63c
I need to keep just the RA-'number that follows' and remove everything before and after. The numbers that follow are not always the same length and the 'RA-' string does not always occur in the same position. There is a colon after every instance that can be used as a delimiter.
I tried this (a friend wrote the regex search piece for me because I am not familiar with it).
df.assign(DATA= df['DATA'].str.extract(re.search('RA[^:]+')))
But python returned
TypeError: search() missing 1 required positional argument: 'string'
What am I missing here? Thanks in advance!
You should use a capturing group with extract:
df['DATA'].str.extract(r'(RA-\d+)')
Here, (RA-\d+) is a capturing group matching RA, then a hyphen and then one or more digits.
You may use your own pattern, but you still need to wrap it with capturing parentheses, r'(RA[^:]+)'.
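What the capturing group returns can be checked with plain re, without a DataFrame. This is a minimal sketch; the row strings are taken from the question's sample data and the helper name extract_ra is hypothetical.

```python
import re

# Sample DATA values based on the question.
rows = ['this row RA-123: data 8b43a', 'here RA-5372: data 94h63c', 'no code here']

def extract_ra(text, pattern=r'(RA-\d+)'):
    """Return the text captured by the first group, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

codes = [extract_ra(r) for r in rows]

# The asker's own pattern gives the same result once it is wrapped in parentheses.
codes_alt = [extract_ra(r, r'(RA[^:]+)') for r in rows]

print(codes)
print(codes_alt)
```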
Looking at the docs, you don't need the re.search method. You can just call df['DATA'] = df['DATA'].str.extract(r'(RA[^:]+)', expand=False). Note that extract requires the pattern to contain a capturing group.
As I mentioned earlier, no need for re here.
Other answers addressed well how to use extract directly. However, to answer your question specifically: if you really want to use re, the way to go is re.compile instead of re.search.
df.assign(DATA= df['DATA'].str.extract(re.compile(regex_str)))

Beginner with regular expressions; need help writing a specific query - space, followed by 1-3 numbers, followed by any number of letters

I'm working with some poorly formatted HTML and I need to find every instance of a certain type of pattern. The issue is as follows:
a space, followed by a 1 to 3 digit number, followed by letters (a word, usually). Here are some examples of what I mean.
hello 7Out
how 99In
are 123May
So I would be looking for the expression to get the "7Out", "99In", "123May", etc. The initial space does not need to be included. I hope this is descriptive enough, as I am literally just starting to expose myself to regular expressions and am still struggling a bit. In the end, I will want to count the total number of these instances and add the total count to a df that already exists, so if you have any suggestions on how to do that I would be open to that as well. Thanks for your help in advance!
Your regular expression will be: r'\w\s(\d{1,3}[a-zA-Z]+)'
To get the count, call len() on the list returned by findall. The code will be
import re
string='hello 70qwqeqwfwe123 12wfgtr123 34wfegr123 dqwfrgb'
result=re.findall(r'\w\s(\d{1,3}[a-zA-Z]+)',string)
print("result =", result)  # all the found occurrences, as a list
print("len(result) =", len(result))  # total number of occurrences
The result will be:
result = ['70qwqeqwfwe', '12wfgtr', '34wfegr']
len(result) = 3
Hint: findall evaluates the regular expression and returns results based on its capturing groups. I'm using that to solve this problem.
Try these:
re.findall(r'(\w\s((\d{1,3})[a-zA-Z]+))',string)
re.findall(r'\w\s((\d{1,3})[a-zA-Z]+)',string)
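How the grouping changes the result of findall can be seen with the examples from the question. A minimal sketch, with illustrative variable names:

```python
import re

# Example text built from the question's samples.
text = 'hello 7Out how 99In are 123May'

# One capturing group: findall returns a list of strings.
one_group = re.findall(r'\w\s(\d{1,3}[a-zA-Z]+)', text)

# Nested groups: findall returns one tuple per match, with one item per group.
nested = re.findall(r'\w\s((\d{1,3})[a-zA-Z]+)', text)

# Total number of occurrences.
count = len(one_group)

print(one_group)
print(nested)
print(count)
```

Note that the leading \w\s means a token at the very start of the text, with nothing before the space, is not matched.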
To learn more about regular expressions, refer to the Python re documentation and the Tutorials Point tutorial, and experiment with the matching characters in an online regex tester.

regex does not match only upper case letters, despite being instructed to do so

I'm making a script to crawl through a web page and find all upper case names, equalling a number (ex. DUP_NB_FUNC=8). The part where my regular expression has to match only upper case letters however, does not seem to be working properly.
value = re.findall(r"[A-Z0-9_]*(?==\d).{2,}", input)
|tc_apb_conf_00.v:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=4
|:-:DUP_NB_FUNC=5
|tc_apb_conf_01.v:-:DUP_NB_FUNC=8
Desired output should look something like the above. However, I am getting:
|tc_apb_conf_00.v:-:=1" name="viewport"/>
|:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=4
|:-:DUP_NB_FUNC=5
|tc_apb_conf_01.v:-:DUP_NB_FUNC=8
Based on the input I can see it's finding a match starting at =1. However, I don't understand why, as I've put only A-Z in the regex range. I'd really appreciate a bit of assistance and clearing up.
This should help:
[A-Z0-9_]+(?==\d).{2,}
or
\b[A-Z0-9_]*(?==\d).{2,}\b
Note that the second form still uses *, so it can match an empty name right before the = sign; the first form is the safer of the two.
In any case your regex is quite unusual. According to your requirement above, I suggest this:
[A-Z0-9_]+=\d+
Instead of using
(?==\d).{2,}, which matches two or more of any character and requires the first two to be = and a digit,
you can just use
=\d+
Try this.
value = re.findall(r"[A-Z0-9_]+(?==\d).{2,}", input)
You want the case sensitive match to match at least once, which means you want the + quantifier, not the * quantifier, that matches between zero and unlimited times.
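The difference between the two quantifiers can be reproduced on a small input. This is a sketch; the page string is a made-up stand-in for the crawled HTML, built around the viewport fragment shown in the question.

```python
import re

# Hypothetical input: one wanted line and one HTML fragment like the one in the question.
page = 'DUP_NB_FUNC=8\n<meta content="width=1" name="viewport"/>'

# '*' allows zero name characters, so the match can start directly at '=1'.
with_star = re.findall(r'[A-Z0-9_]*(?==\d).{2,}', page)

# '+' requires at least one upper-case letter, digit or underscore before '='.
with_plus = re.findall(r'[A-Z0-9_]+(?==\d).{2,}', page)

# The simplified pattern from the answer above.
simple = re.findall(r'[A-Z0-9_]+=\d+', page)

print(with_star)
print(with_plus)
print(simple)
```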
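I suggest you compile your pattern once and check each input line against it:
value = re.compile(r"[A-Z0-9_:.-]+=\d+")  # hyphen placed last so it is a literal, not a range
for i in tlist:
    jee = value.match(i)  # match() only succeeds at the start of the line
    if jee is not None:
        print(i)
Here tlist contains your input lines.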

Conditional Removal of suffix from words in a python list

The task that I have to perform is as follows :
Say I have a list of words (Just an example...the list can have any word):
'yappingly', 'yarding', 'yarly', 'yawnfully', 'yawnily', 'yawning','yawningly',
'yawweed', 'yealing', 'yeanling', 'yearling', 'yearly', 'yearnfully','yearning',
'yearnling', 'yeastily', 'yeasting', 'yed',
I have to create a new list of words from which words having the suffix ing are added after removing the suffix (i.e yeasting is added to the new list as yeast) and the remaining words are added as it is
Now as far as insertion of string ending with ing is concerned, i wrote the following code and it works fine
Data=[w[0:-3] for w in wordlist if re.search('ing$',w)]
But how do I add the remaining words to the list? How do I add an else clause to the above if statement? I was unable to find suitable documentation for this. I did come across several questions on SO regarding the shorthand if else statement, but simply adding the else statement at the end of the above code doesn't work. How do I go about it?
Secondly, if I have to extend the above regular expression for multiple suffixes say as follows:
re.search('(ing|ed|al)$',w)
How do I perform the "trim" operation to remove the suffix accordingly and simultaneously add the word to the new list??
Please Help.
First, what makes you think you need a regexp at all? There are easier ways to strip suffixes.
Second, if you want to use regexps, why not just re.sub instead of trying to use regexps and slicing together? For example:
Data = [re.sub('(ing|ed|al)$', '', w) for w in wordlist]
Then you don't need to work out how much to slice off (which would require you to keep track of the result of re.search so you can get the length of the group, instead of just turning it into a bool).
But if you really want to do things your way, just replace your if filter with a conditional expression, as in iCodez's answer.
Finally, if you're stuck on how to fit something into a one-liner, just take it out of the one-liner. It should be easy to write a strip_suffixes function that returns the suffix-stripped string (which is the original string if there was no suffix). Then you can just write:
Data = [strip_suffixes(w) for w in wordlist]
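One possible shape for such a function, as a sketch without regexps. The name strip_suffixes comes from the answer above; the SUFFIXES tuple and the choice to strip at most one suffix per word are assumptions.

```python
# Suffixes from the question; the tuple name is an assumption.
SUFFIXES = ('ing', 'ed', 'al')

def strip_suffixes(word, suffixes=SUFFIXES):
    """Return word without its suffix, or word unchanged if no suffix matches."""
    for suffix in suffixes:
        if word.endswith(suffix):
            # Slice off exactly as many characters as the suffix has.
            return word[:-len(suffix)]
    return word

wordlist = ['yarding', 'yarly', 'yeasting', 'yed']
Data = [strip_suffixes(w) for w in wordlist]

print(Data)
```

Because the suffix length is taken from the suffix itself, there is no need to track the match object or hard-code a slice like w[0:-3].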
Regarding your first question, you can use a ternary placed just before the for:
Data=[w[0:-3] if re.search('ing$',w) else w for w in wordlist]
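Regarding your second, well, the best answer in my opinion is to use re.sub as @abarnert demonstrated. However, you could also make a slight adaptation to your use of re.search. The suffix group has to be optional, otherwise re.search returns None for words without a suffix and .group(1) fails:
Data=[re.search('(.*?)(?:ing|ed|al)?$', w).group(1) for w in wordlist]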
Finally, here is a link for more information on comprehensions.
