Regex - everything up to a group of possibilities, or everything - python

I'm trying to write a regex that captures everything up to certain specific words; if none of those words is present in the text, it should just grab everything. In this example, let's consider our group of words: ['ABC', 'HIJ', 'TUV']
I have no ideia ABC about who i am
I have no ideia
I may have an idea about who you HIJ think you are
I may have an idea about who you
Sometimes i just wish you are not here
Sometimes i just wish you are not here
It finds everything up to one of the words I defined, but if none of them is present, as in the last string, it gets everything.
My attempt:
(.*)(?:ABC|HIJ|TUV|$)
But it always gets the entire string, even when the string contains one of the words in the group.
P.S: I'm applying this in python

With your samples, you could try the following, which uses Python's re.findall function.
import re
lst = ['ABC', 'HIJ', 'TUV']
var=""" have no ideia ABC about who i am
I have no ideia
I may have an idea about who you HIJ think you are
I may have an idea about who you
Sometimes i just wish you are not here
Sometimes i just wish you are not here"""
regex = r'(.*?)(?:' + '|'.join(lst) + r'|$)'
re.findall(regex,var)
[' have no ideia ', 'I may have an idea about who you ', 'Sometimes i just wish you are not here', '']
Explanation: The code imports Python's re library and stores the sample text in var. It then builds the regex by joining the words in lst with |, and applies findall with that regex to var to get the text that precedes each listed word.
Explanation of the regex '(.*?)(?:ABC|HIJ|TUV|$)': the lazy quantifier in (.*?) captures as few characters as possible, so the capture stops at the first of ABC, HIJ or TUV, which sit in a non-capturing group, or at the end of the string ($) if none of them occurs. The original attempt used the greedy (.*), which consumes the whole string first and is then satisfied by $.
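If each string is handled on its own rather than as one multi-line block, the same lazy pattern can be applied per string. The following is a minimal sketch under that assumption; the helper name text_before_stop_word is made up for illustration, and re.escape is added only so that words containing regex metacharacters would still be treated literally.

```python
import re

words = ['ABC', 'HIJ', 'TUV']

# Escape each word so it is matched literally, then join into an alternation.
stop = "|".join(re.escape(w) for w in words)
pattern = re.compile(r"(.*?)(?:" + stop + r"|$)")

def text_before_stop_word(line):
    """Return the part of a single-line string before the first stop word.

    If no stop word occurs, the whole string is returned.
    Assumes `line` contains no newline characters.
    """
    return pattern.match(line).group(1)

for line in [
    "I have no ideia ABC about who i am",
    "I may have an idea about who you HIJ think you are",
    "Sometimes i just wish you are not here",
]:
    print(repr(text_before_stop_word(line)))
```

The captured text keeps the space before the stop word; calling .rstrip() on the result would drop it if that is not wanted.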

Related

Use regex to remove a substring that matches a beginning of a substring through the following comma

I haven't found any helpful Regex tools to help me figure this complicated pattern out.
I have the following string:
Myfirstname Mylastname, Department of Mydepartment, Mytitle, The University of Me; 4-1-1, Hong,Bunk, Tokyo 113-8655, Japan E-mail:my.email#example.jp, Tel:00-00-222-1171, Fax:00-00-225-3386
I am trying to learn enough Regex patterns to remove the substrings one at a time:
E-mail:my.email#example.jp
Tel:00-00-222-1171
Fax:00-00-225-3386
So I think the correct pattern would be to remove a given word (ie., "E-mail", "Tel") all the way through the following comma.
Is this type of dynamic pattern possible in Regex?
I am performing the match in Python, however, I don't think that would matter too much.
Also, I know the data string looks comma separated, and it is. However there is no guarantee of preserving the order of those fields. That's why I'm trying to use a Regex match.
How about this regex:
<YOUR_WORD>.*?(?=(,|($)))
Explanation:
It looks for the word specified in <YOUR_WORD> placeholder
It looks for any kind of character afterwards
The search stops when it hits one of the two options:
It finds the character ,
It finds an end of the line
So:
E-mail.*?(?=(,|($)))
Will result in:
E-mail:my.email#example.jp
And
Fax.*?(?=(,|($)))
Will result in:
Fax:00-00-225-3386
If there are edge cases it misses - I would like to know, and whether it affects the performance/ is necessary.
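Since the question asks about removing the fields rather than only finding them, here is a rough sketch of both operations in Python. The function names extract_field and remove_field are hypothetical, and the removal pattern (label, everything up to the next comma, the comma itself and trailing whitespace) is one possible choice, not the only one.

```python
import re

s = ("Myfirstname Mylastname, Department of Mydepartment, Mytitle, "
     "The University of Me; 4-1-1, Hong,Bunk, Tokyo 113-8655, Japan "
     "E-mail:my.email#example.jp, Tel:00-00-222-1171, Fax:00-00-225-3386")

def extract_field(label, text):
    """Return the label plus everything up to the next comma or end of string."""
    m = re.search(re.escape(label) + r".*?(?=,|$)", text)
    return m.group(0) if m else None

def remove_field(label, text):
    """Delete the field, its trailing comma and the whitespace after it."""
    cleaned = re.sub(re.escape(label) + r"[^,]*,?\s*", "", text)
    # Removing the last field leaves a dangling ", " at the end; trim it.
    return cleaned.rstrip(", ")

print(extract_field("E-mail", s))
print(remove_field("Tel", s))
```

One edge case to keep in mind: the label is matched anywhere in the string, so a short label such as "Tel" would also match inside an unrelated word that happens to contain it.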

Twitter data analysis

I have a question for my Thesis project.
In order to do a sentiment analysis, I would like to eliminate all hashtags, but with this Python code I remove only the "#". I would like to remove also the word associated to "#".
Thanks everyone
df['text']=df['text'].apply(lambda x:' '.join(re.findall(r'\w+', x)))
Assuming you want the rest of the words after the hashtag to remain intact, try this:
import re
df['text'] = df['text'].apply(lambda x: re.sub(r"#\S+", '', x))
It removes the # together with every non-whitespace character that follows it, up to the next whitespace.
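You can use the re.sub method. Something like this:
df["text"] = df["text"].apply(lambda x: re.sub(r"#\S+\s?", "", x))
This replaces everything that matches the pattern "#\S+\s?" (a hashtag, followed by one or more non-whitespace characters, followed by an optional whitespace character) with an empty string. Note that a greedy pattern such as "#.*\s" would delete everything from the first hashtag through the last whitespace character on the line. You may need to tweak the regex a bit depending on your data.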
Check the documentation about the re module here: https://docs.python.org/3/library/re.html
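To see how the choice of pattern matters, here is a small sketch on plain strings (no DataFrame needed). The sample tweet and the helper name strip_hashtags are invented for illustration; the same function could be passed to df['text'].apply.

```python
import re

def strip_hashtags(text):
    """Remove each hashtag (the # and the word attached to it)."""
    no_tags = re.sub(r"#\w+", "", text)
    # Collapse the double spaces left behind where a hashtag was removed.
    return " ".join(no_tags.split())

tweet = "Loving the #sunny weather in #Rome today"

print(strip_hashtags(tweet))          # Loving the weather in today
# A greedy pattern runs from the first # to the last whitespace on the line:
print(re.sub(r"#.*\s", "", tweet))    # Loving the today
```

Using \w+ instead of \S+ keeps punctuation that directly follows a hashtag, such as a trailing comma or period; which one is preferable depends on the data.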

Matching data between padding

I'm trying to match some strings in a binary file and the strings appear to be padded. As an example, the word PROGRAM could be in the binary like this:
%$###P^&#!)00000R{]]]////O.......G"""""R;;$#!*%&#*A/////847M
In that example, the word PROGRAM is there but it is split up and it's between random data, so I'm trying to use regex to find it.
Currently, this is what I came up with but I don't think this is very effective:
(?<=P)(.*?)(?=R)(.*?)(?=O)(.*?)(?=G)(.*?)(?=R)(.*?)(?=A)(.*?)(?=M)
If you want to get PROGRAM from the string, one option might be to use re.sub with a negated character class to remove all that you don't want.
[^A-Z]+
For example:
import re
test_str = "%$###P^&#!)00000R{]]]////O.......G\"\"\"\"\"R;;$#!*%&#*A/////847M"
pattern = r'[^A-Z]+'
print(re.sub(pattern, '', test_str))
Result
PROGRAM
This should work for you and is more efficient than your current solution:
P[^R]+R[^O]+O[^G]+G[^R]+R[^A]+A[^M]+M
Explanation:
P[^R]+ - match P, then one or more characters other than R; each of the following letters is handled the same way
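The same kind of pattern can be generated for any word instead of being written by hand. This is only a sketch: padded_pattern is a hypothetical helper, and it uses * rather than + between letters so that two letters with no padding between them would also match, which is a small departure from the pattern above.

```python
import re

def padded_pattern(word):
    """Build a regex matching the letters of `word` in order, with padding
    between them that never contains the next expected letter."""
    parts = []
    for cur, nxt in zip(word, word[1:]):
        # Current letter, then any run of characters that are not the next letter.
        parts.append(re.escape(cur) + "[^" + re.escape(nxt) + "]*")
    parts.append(re.escape(word[-1]))
    return "".join(parts)

data = '%$###P^&#!)00000R{]]]////O.......G"""""R;;$#!*%&#*A/////847M'

pattern = padded_pattern("PROGRAM")
print(pattern)   # P[^R]*R[^O]*O[^G]*G[^R]*R[^A]*A[^M]*M

m = re.search(pattern, data)
print(m.span() if m else "not found")
```

Because each negated class can only stop at the next expected letter, the engine has very little to backtrack over, unlike a chain of lazy .*? groups.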
I'm not quite sure what the desired output might be, I'm guessing maybe this expression,
(?=.*?P.*?R.*?O.*?G.*?R.*?A.*?M).*?(P).*?(R).*?(O).*?(G).*?(R).*?(A).*?(M)
might be a start.
The expression is explained on the top right panel of this demo, if you wish to explore further or simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.

How can I find all substrings that have this pattern: some_word.some_other_word with python?

I am trying to clean up some very noisy user-generated web data. Some people do not add a space after a period that ends the sentence. For example,
"Place order.Call us if you have any questions."
I want to extract each sentence, but when I try to parse a sentence using nltk, it fails to recognize that these are two separate sentences. I would like to use regular expressions to find all patterns that contain "some_word.some_other_word" and all patterns that contain "some_word:some_other_word" using python.
At the same time I want to avoid matching patterns like "U.S.A.", i.e. avoid just_a_character.just_another_character
Thanks very much for your help :)
The easiest solution:
>>> import re
>>> re.sub(r'([.:])([^\s])', r'\1 \2', 'This is a test. Yes, test.Hello:world.')
'This is a test. Yes, test. Hello: world.'
The first argument — the pattern — tells that we want to match a period or a colon followed by a non-whitespace character. The second argument is the replacement, it puts the first matched symbol, then a space, then the second matched symbol back.
It seems that you are asking two different questions:
1) If you want to find all patterns like "some_word.some_other_word" or "some_word:some_other_word"
import re
re.findall(r'\w+[.:?!]\w+', your_text)
This finds all patterns in the text your_text
2) If you want to extract all sentences, you could do
import re
re.split(r'[.!?]', your_text)
This should return a list of sentences. For example,
text = 'Hey, this is a test. How are you?Fine, thanks.'
import re
re.findall(r'\w+[.:?!]\w+', text) # returns ['you?Fine']
re.split(r'[.!?]', text) # returns ['Hey, this is a test', ' How are you', 'Fine, thanks', '']
Here's some cases that might be in your text:
sample = """
Place order.Call us (period: split)
ever after.(The end) (period: split)
U.S.A.(abbreviation: don't split internally)
1.3 How to work with computers (dotted numeral: don't split)
ever after...The end (ellipsis: don't split internally)
(This is the end.) (period inside parens: don't split)
"""
So: Don't add space to periods after digits, after a single capital letter, or before a paren or another period. Add space otherwise. This will do all that:
sample = re.sub(r"(\w[A-Z]|[a-z.])\.([^.)\s])", r"\1. \2", sample)
Result:
Place order. Call us (period: split)
ever after. (The end) (period: split)
U.S.A.(abbreviation: don't split internally)
1.3 How to work with computers (dotted numeral: don't split)
ever after... The end (ellipsis: don't split internally)
(This is the end.) (period inside parens: don't split)
This fixed every problem in the sample except the last period after U.S.A., which should have a space added after it. I left that aside because combinations of conditions are tricky. The following regexp will handle everything, but I do not recommend it:
sample = re.sub(r"(\w[A-Z]|[a-z.]|\b[A-Z](?!\.[A-Z]))\.([^.)\s])", r"\1. \2", sample)
Complex regexps like this are a maintainability nightmare: just try adding another pattern, or restricting it to omit some more cases. Instead, I recommend using a separate regexp to catch just the missing case: a period after a single capital letter, but not followed by a single capital, paren, whitespace, or another period.
sample = re.sub(r"(\b[A-Z]\.)([^.)A-Z\s])", r"\1 \2", sample)
For a complex task like this, it makes sense to use a separate regexp for each type of replacement. I'd split the original into subcases, each of which adds spaces only for a very specific pattern. You can have as many as you want, and it won't get out of hand (at least, not too much...)
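One way to organise "a separate regexp for each type of replacement" is a list of (pattern, replacement) pairs applied in order. The sketch below assumes that layout; RULES and add_missing_spaces are names invented here, and the second pattern also excludes whitespace in its final class, an adjustment made here so that an existing space is not doubled.

```python
import re

# Each rule handles one narrow case; add more pairs as new cases turn up.
RULES = [
    # Period after a lowercase letter, another period, or a multi-letter
    # capitalised run, directly followed by text: insert a space.
    (re.compile(r"(\w[A-Z]|[a-z.])\.([^.)\s])"), r"\1. \2"),
    # Period after a single capital letter (end of an abbreviation),
    # not followed by a capital, paren, period or whitespace.
    (re.compile(r"(\b[A-Z]\.)([^.)A-Z\s])"), r"\1 \2"),
]

def add_missing_spaces(text):
    """Apply every rule in turn and return the repaired text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(add_missing_spaces("Place order.Call us"))     # Place order. Call us
print(add_missing_spaces("U.S.A.(abbreviation)"))    # U.S.A. (abbreviation)
print(add_missing_spaces("1.3 How to work"))         # 1.3 How to work
```

Keeping the rules in a list means each one can be tested and reordered on its own, which is the maintainability point made above.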
You could use something like
import re
test = "some_word.some_other_word"
r = re.compile(r'(\D+)\.(\D+)')
print(r.match(test).groups())

How to extract with excluding some characters by python regex

I have been using python regex to extract address patterns.
For example, I have a list of addresses as below:
12buixuongtrach
34btrannhatduat
25bachmai
78bhoangquocviet
i want to refine the addresses like these:
12 buixuongtrach
34b trannhatduat
25 bachmai
78b hoangquocviet
Could anyone please give me a hint or some code?
Many thanks
You can use a pretty simple regex to split the numbers off from the letters, but like people have said in the comments, there's no way to know when those b's should be part of the number and when they're part of the text.
import re
text = """12buixuongtrach
34btrannhatduat
25bachmai
78bhoangquocviet"""
unmatched = text.split()
matched = [re.sub(r'(\d+)(.*)', r'\1 \2', s) for s in unmatched]
Which gives:
>>> matched
['12 buixuongtrach', '34 btrannhatduat', '25 bachmai', '78 bhoangquocviet']
The regex is just grabbing one or more digits at the start of the string and putting them into group \1, then putting the rest of the string into group \2.
Thanks all for your response. i finally found a work around.
I used the pattern as below and it works like a charm :)
'[a-zA-Z]+|[\/0-9abcd]+(?!a|u|c|h|o|e)'
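The workaround does not say how the pattern was applied. One plausible reading, assumed here, is re.findall followed by joining the pieces with a space; split_address is a hypothetical helper name. The negative lookahead refuses to end the number part directly before a, u, c, h, o or e, so whether a trailing b stays with the number depends on the letter that follows it.

```python
import re

# Pattern copied from the workaround above.
pattern = re.compile(r'[a-zA-Z]+|[\/0-9abcd]+(?!a|u|c|h|o|e)')

addresses = [
    "12buixuongtrach",
    "34btrannhatduat",
    "25bachmai",
    "78bhoangquocviet",
]

def split_address(address):
    """Join the pieces found by the pattern with a single space (assumed usage)."""
    return " ".join(pattern.findall(address))

for address in addresses:
    print(split_address(address))
# 12 buixuongtrach
# 34b trannhatduat
# 25 bachmai
# 78 bhoangquocviet
```

Under this assumed usage the first three samples come out as requested, while the last one gives '78 bhoangquocviet' because h is in the lookahead list, so the original data or usage may have differed from the samples shown in the question.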
