How to extract with excluding some characters by python regex - python

I have been using Python regex to extract address patterns.
For example, I have a list of addresses as below:
12buixuongtrach
34btrannhatduat
25bachmai
78bhoangquocviet
I want to refine the addresses like these:
12 buixuongtrach
34b trannhatduat
25 bachmai
78b hoangquocviet
Could anyone please give me a hint or some code?
Many thanks

You can use a pretty simple regex to split the numbers off from the letters, but like people have said in the comments, there's no way to know when those b's should be part of the number and when they're part of the text.
import re
text = """12buixuongtrach
34btrannhatduat
25bachmai
78bhoangquocviet"""
unmatched = text.split()
matched = [re.sub(r'(\d+)(.*)', r'\1 \2', s) for s in unmatched]
Which gives:
>>> matched
['12 buixuongtrach', '34 btrannhatduat', '25 bachmai', '78 bhoangquocviet']
The regex is just grabbing one or more digits at the start of the string and putting them into group \1, then putting the rest of the string into group \2.

Thanks all for your responses. I finally found a workaround.
I used the pattern below and it works like a charm :)
'[a-zA-Z]+|[\/0-9abcd]+(?!a|u|c|h|o|e)'
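To see what this pattern actually returns, here is a small check against the four sample strings from the question. It is only a sketch: the helper name `split_address` is made up for illustration, and it simply joins the pieces that `re.findall` returns with a space.

```python
import re

# Pattern from the post above: a letters-only run, or a run of digits, '/',
# and a/b/c/d that is not followed by one of the listed letters.
PATTERN = re.compile(r'[a-zA-Z]+|[\/0-9abcd]+(?!a|u|c|h|o|e)')

def split_address(raw):
    """Hypothetical helper: join the pieces found by the pattern with spaces."""
    return ' '.join(PATTERN.findall(raw))

samples = ['12buixuongtrach', '34btrannhatduat', '25bachmai', '78bhoangquocviet']
results = [split_address(s) for s in samples]
print(results)
```

On these inputs the first three come out as requested, but the last one stays '78 bhoangquocviet', because the b is followed by h, which is one of the letters in the lookahead. So the pattern is a heuristic tuned to the data, and it is worth checking against a larger sample of real addresses.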

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that, because of the parentheses in (941), the regex treats 941 as a group. The regex for the 3rd step may be wrong, which I can fix later, but first I need help finding the 2nd (941) so that I can then apply the 3rd step to it.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regexes: (?:...), (941){1}, and escaping the parentheses with \ to make them literal, like this: \(941\), all with no useful results. Maybe I am using them wrong.
I just wanted to know if this is possible using regex. I thought it might be useful for others too, or a good share for future viewers.
Thanks!
Assuming:
You want to skip matches that consist only of digits;
You want to match a substring made of word characters (thus possibly including digits);
Escape the variable and interpolate it into the regular expression through an f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
print(re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s))
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
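As a quick illustration of why re.escape() matters here, the sketch below compares the variable used as-is with the escaped version. The variable names are just for illustration.

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# Unescaped: the parentheses are regex syntax (a capturing group), so the
# pattern only looks for the digits "941" and findall returns the group.
unescaped = re.findall(num, s)

# Escaped: re.escape() backslash-escapes the parentheses so they match literally.
escaped_pattern = re.escape(num)
escaped = re.findall(escaped_pattern, s)

print(escaped_pattern)
print(unescaped)
print(escaped)
```

The unescaped pattern happens to find something, but what it returns is the group content without the parentheses, which is the "taking 941 as a group" behaviour described in the question.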

Regex - everything up to a group of possibilities, or everything

I'm trying to make a regex which can get everything until it finds some specific words, but if these words are not present in the text, then it should just grab everything. In this example, let's consider our group of words: ['ABC', 'HIJ', 'TUV']
I have no ideia ABC about who i am
I have no ideia
I may have an idea about who you HIJ think you are
I may have an idea about who you
Sometimes i just wish you are not here
Sometimes i just wish you are not here
It should capture everything up to one of the words I defined, but if no such word is present, as in the last string, it should capture everything.
My attempt:
(.*)(?:ABC|HIJ|TUV|$)
But it always gets the entire string, even when the string contains one of the words in the group.
P.S: I'm applying this in python
With your shown samples, please try the following, which uses Python's findall function.
import re
lst = ['ABC', 'HIJ', 'TUV']
var=""" have no ideia ABC about who i am
I have no ideia
I may have an idea about who you HIJ think you are
I may have an idea about who you
Sometimes i just wish you are not here
Sometimes i just wish you are not here"""
regex = r'(.*?)(?:' + '|'.join(lst) + r'|$)'
re.findall(regex,var)
[' have no ideia ', 'I may have an idea about who you ', 'Sometimes i just wish you are not here', '']
Explanation: import Python's re library, store the text in the variable var, build the regex by joining the words in lst with |, and then apply findall with that regex to var to get the text that precedes each listed word.
Explanation of the regex '(.*?)(?:ABC|HIJ|TUV|$)': the non-greedy (.*?) captures as little as possible, stopping at the first of the words from lst (or at the end of the string), which are matched inside a non-capturing group.
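If the strings are processed one at a time rather than as one block of text, a per-string variant avoids the empty match at the end of the findall result. This is a minimal sketch, assuming each input is a single line; the function name `text_before_keywords` is hypothetical, and re.escape is added in case a keyword ever contains regex metacharacters.

```python
import re

def text_before_keywords(line, keywords):
    """Return the part of `line` before the first keyword, or all of it."""
    alternation = '|'.join(re.escape(k) for k in keywords)
    # Lazy (.*?) stops at the first keyword; `$` is the fallback when none occurs.
    match = re.match(rf'(.*?)(?:{alternation}|$)', line)
    return match.group(1)

words = ['ABC', 'HIJ', 'TUV']
print(text_before_keywords('I have no ideia ABC about who i am', words))
print(text_before_keywords('Sometimes i just wish you are not here', words))
```

The first call keeps the trailing space before ABC; add .rstrip() to the result if that is not wanted.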

Remove duplicated punctuation in a string

I'm working on cleaning some text like the one below:
Great talking with you. ? See you, the other guys and Mr. Jack Daniels next week, I hope-- ? Bobette ? ? Bobette Riner??????????????????????????????? Senior Power Markets Analyst?????? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com ? ? - cinhrly020101.doc
It has multiple spaces and question marks, to clean it I'm using regular expressions:
import re
def remove_duplicate_characters(text):
    # Collapse every run of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    # Collapse a run of question marks (and any whitespace before it) into one "?"
    text = re.sub(r"\s*\?+", "?", text)
    return text
remove_duplicate_characters(msg)
Which gives me the following result:
'Great talking with you.? See you, the other guys and Mr. Jack Daniels next week, I hope--? Bobette? Bobette Riner? Senior Power Markets Analyst? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com? - cinhrly020101.doc'
For this particular case it does work, but it does not look like the best approach if I want to add more characters to remove. Is there an optimal way to solve this?
To replace all consecutive punctuation chars with their single occurrence you can use
re.sub(r"([^\w\s]|_)\1+", r"\1", text)
If the leading whitespace must be removed, use the r"\s*([^\w\s]|_)\1+" regex.
See the regex demo online.
In case you want to introduce exceptions to this generic regex, you may add an alternative on the left where you capture all the contexts in which you want the consecutive punctuation to be kept:
re.sub(r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', text)
See this regex demo.
The ((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+ regex matches and captures a ... (not enclosed by other dots on either end) and a :// string (commonly seen in URLs); the rest is the original regex with the backreference adjusted (since there are now two capturing groups).
The \1\2 in the replacement pattern puts the captured values back into the resulting string.
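To make the difference between the two patterns concrete, here is a small side-by-side run on a made-up sample string. It assumes Python 3.5 or later, where a group that did not take part in the match is substituted as an empty string in the replacement.

```python
import re

# Made-up sample containing an ellipsis, a URL and repeated punctuation.
sample = 'Wait... see http://a.com??!!'

# Generic version: collapse any run of the same punctuation character.
generic = re.sub(r'([^\w\s]|_)\1+', r'\1', sample)

# Version with exceptions: "..." and "://" are captured by group 1 and kept.
with_exceptions = re.sub(
    r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', sample)

print(generic)
print(with_exceptions)
```

The generic version also shortens the ellipsis to a single dot and turns "//" in the URL into "/", which is exactly what the exceptions in the second pattern are there to prevent.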

RegEx for re-occurring phrase

I have the following phrase:
05/30/2016 07:02 AM (GMT+02:00) added by XXX YYY (PID-000301):\tSome_alphanum_text_Some_alphanum_text_Some_alphanum_text_Some_alphanum_text\t\t*************************************************************************************************\t05/12/2016 02:03 PM (GMT+02:00) added by ZZZ AAA (PID-000301):\tSome_other_alphanum_text_Some_other_alphanum_text_Some_other_alphanum_text_Some_other_alphanum_text\t\t
I would like to write a RegEx which is just going to scoop up for me only 'Some_alphanum_text' and 'Some_other_alphanum_text'.
So far I was trying my luck with something like this:
r'(?:.+\(PID-\d{6}\):)(.+)'
But it is only giving me the 'Some_other_alphanum_text' occurrence.
There can be more than 2 unique strings I will need to scoop out from this mess of a text. Any ideas?
You need to replace .+ with something that only matches what you want to return. Since you only want to match alphanumeric text, use \w instead of .
r'(?:\(PID-\d{6}\):)\s*(\w+)'
You need \s* before the second group because the whitespace before the alphanumeric text won't match \w+.
You also don't need .+ at the beginning. The match will just begin where it finds PID.
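Here is a short run of that pattern with re.findall. The input is a shortened stand-in for the phrase in the question, and it assumes the \t sequences are real tab characters rather than a literal backslash followed by t.

```python
import re

# Shortened stand-in for the phrase in the question, assuming real tab characters.
log = ('05/30/2016 07:02 AM (GMT+02:00) added by XXX YYY (PID-000301):'
       '\tSome_alphanum_text\t\t*****\t'
       '05/12/2016 02:03 PM (GMT+02:00) added by ZZZ AAA (PID-000301):'
       '\tSome_other_alphanum_text\t\t')

# findall returns only the capturing group, i.e. the text after each PID marker.
texts = re.findall(r'(?:\(PID-\d{6}\):)\s*(\w+)', log)
print(texts)
```

Keep in mind that \w+ stops at the first character that is not a letter, digit or underscore, so a text containing spaces or punctuation would be cut short.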
I believe you need this regex:
\(PID-\d{6}\):\\t(.+?)(?:\\t){2}
I think you could use this to find all the instances of text occurring between "\t"s.
One thing you should consider is that there could be no '\t' at all. However, every text you want to match is followed either by a date such as 05/12/2016 02:03 or by the end of the string, so a lookahead can mark where it ends:
\(PID-\d{6}\)[\n\r\t\s]*:(?:.|[\n\r\t\s])*?(?=[0-9]{2}\/[0-9]{2}\/[0-9]{4}[\n\r\t\s]*[0-9]{2}:[0-9]{2}|$)

most efficient way to go about identifying sub-strings in a string in python?

I need to search a fairly lengthy string for CPV (Common Procurement Vocabulary) codes.
At the moment I'm doing this with a simple for loop and str.find().
The problem is that if a CPV code has been listed in a slightly different format, this algorithm won't find it.
What's the most efficient way of searching for all the different variations of a code within the string? Is it simply a case of reformatting each of the up to 10,000 CPV codes and using str.find() for each instance?
An example of different formatting could be as follows
30124120-1
301241201
30124120 - 1
30124120 1
30124120.1
etc.
Thanks :)
Try a regular expression:
>>> import re
>>> cpv = re.compile(r'([0-9]+[-\. ]?[0-9])')
>>> print(cpv.findall('foo 30124120-1 bar 21966823.1 baz'))
['30124120-1', '21966823.1']
(Modify until it matches the CPVs in your data closely.)
Try using any of the functions in re (regular expressions for Python). See the docs for more info.
You can craft a regular expression to accept a number of different formats for these codes, and then use re.findall or something similar to extract the information. I'm not certain what a CPV is so I don't have a regular expression for it (though maybe you could see if Google has any?)
import re
# Sample data from the question, one code per line
ex = "30124120-1\n301241201\n30124120 - 1\n30124120 1\n30124120.1"
# '-' is placed last in the class so it is a literal hyphen, not a range
cpv = re.compile(r'(\d{8})(?:[ .\t/\\-]*)(\d{1}\b)')
for m in cpv.finditer(ex):
    cpval, chk = m.groups()
    print("{0}-{1}".format(cpval, chk))
applied to your sample data returns
30124120-1
30124120-1
30124120-1
30124120-1
30124120-1
The regular expression can be read as
(\d{8})          # eight digits
(?:              # followed by a sequence which does not get returned
  [ .\t/\\-]*    # consisting of 0 or more
)                # spaces, periods, tabs, forward- or backslashes, hyphens
(\d{1}\b)        # followed by one digit, ending at a word boundary
                 # (ie whitespace or the end of the string)
Hope that helps!
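On the efficiency part of the question: rather than reformatting each of the 10,000 codes and calling str.find() for every variant, one option is to scan the text once, normalise every candidate to a single canonical form, and test it against a set. The sketch below assumes the known codes are stored as "NNNNNNNN-N"; the names `KNOWN_CODES`, `CPV_RE` and `find_known_codes` and the two codes in the set are placeholders, not real reference data.

```python
import re

# Hypothetical set of known codes, stored in one canonical form "NNNNNNNN-N".
KNOWN_CODES = {'30124120-1', '21966823-1'}

# Eight digits, optional separators, one check digit (same idea as above).
CPV_RE = re.compile(r'\b(\d{8})[ .\t/\\-]*(\d)\b')

def find_known_codes(text):
    """Return the known CPV codes that occur in `text`, in canonical form."""
    found = set()
    for match in CPV_RE.finditer(text):
        canonical = '{0}-{1}'.format(*match.groups())
        if canonical in KNOWN_CODES:  # constant-time set lookup per candidate
            found.add(canonical)
    return found

hits = find_known_codes('lot A: 30124120 - 1, lot B: 21966823.1, lot C: 99999999-9')
print(hits)
```

This way the text is scanned a single time, and the number of known codes only affects the size of the set, not the number of searches.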
