How to check two dots placed together and quotation marks in email? - python

I've got a task to solve:
Write a function in Python that checks whether an e-mail complies with these rules:
1. the e-mail consists of a name part and a domain part, with the "#" mark between them;
2. the domain part is between 3 and 256 symbols long and is a set of non-empty strings, consisting of a-z 0-9_- symbols, separated by dots;
3. each component of the domain part can't begin or end with the "-" symbol;
4. the name part (before #) is no more than 128 symbols long and consists of a-z0-9"._-;
5. the name part can't contain two consecutive dots "..";
6. if the name part contains double quotes ("), they must be paired ("blabla");
7. the symbols "!,:" may also appear in the name part, but only between double quotes.
I wrote a small regular expression step by step, up to the 4th point:
((?!-)[A-Z0-9"\.\-_]{1,128}(?<!-)#(?!-)[A-Z0-9\-_.]{3,256}(?<!-))
but I'm stuck on the 5th and 6th.
How can I implement these conditions in my regexp? I tried adding
|(?:\.(?!\.))
at the end, but it doesn't work.

Do not try to do this with a regexp; this is an example of an email validator written as a regex in Perl, and to this day that monstrosity haunts my dreams.
Use a proper parser. Try looking at the source of the validate_email library and adapting it to your purposes. This might also be a good source to use as a base.
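As a rough illustration of the parser approach, here is a minimal sketch that checks the seven rules from the question one at a time instead of in a single regex. The function name is_valid_email is made up for this example, and a few points are assumptions not stated in the task: the name part must be non-empty, only lowercase letters are accepted, and ".." is rejected even inside quotes.

```python
import string

# Characters allowed outside quotes (assumption: lowercase only, as in the rules)
NAME_CHARS = set(string.ascii_lowercase + string.digits + "._-")
DOMAIN_CHARS = set(string.ascii_lowercase + string.digits + "_-")


def is_valid_email(email):
    """Hypothetical validator for the 'name#domain' rules in the question."""
    name, sep, domain = email.partition("#")
    if not sep:
        return False  # rule 1: no '#' separator

    # Rule 2: domain length and non-empty dot-separated components
    if not 3 <= len(domain) <= 256:
        return False
    for part in domain.split("."):
        if not part or not set(part) <= DOMAIN_CHARS:
            return False
        # Rule 3: a component can't begin or end with '-'
        if part.startswith("-") or part.endswith("-"):
            return False

    # Rules 4 and 5: name length (assumed non-empty) and no ".."
    if not 1 <= len(name) <= 128 or ".." in name:
        return False

    # Rules 6 and 7: quotes must pair up; "!,:" only inside quotes
    in_quotes = False
    for ch in name:
        if ch == '"':
            in_quotes = not in_quotes
        elif ch in "!,:":
            if not in_quotes:
                return False
        elif ch not in NAME_CHARS:
            return False
    return not in_quotes


print(is_valid_email('john.doe#example.com'))
print(is_valid_email('john..doe#example.com'))
```

Each rule maps to a few lines, so adding or changing a rule does not require rewriting one large pattern.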

Related

Finding a substring in Python RegEx considering pairing of LaTeX brackets and ignoring eol

I need to find the contents of certain LaTeX macros in LaTeX articles using Python, taking into account the pairing of brackets, but ignoring single eol characters. For example,
\title[Synthesis of knowledge: problems and methods\ldots]
{Synthesis of knowledge: problems and methods when building
models of intelligent systems}
\abstract{Problems of using {\it regular structures} of memory
$\Sigma =\{\Sigma(z) \vert z\in M\}$
and $\tilde{\Sigma} = \{\tilde{\Sigma}(z) \vert z\in R\}$
to model the processes of synthesis of knowledge structures
with the help of knowledge processing operations from special
classes of such operations are considered. These structures
define the formats of the memory subdomains used.
Area formats match domain element structures
(domains of definition and meaning of operations)...}
\keywords{knowledge structures, ...}
Briefly about the syntax. Macros in LaTeX start with a \ then, after the name of the macro, may or may not be followed by an optional parameter in a pair of square brackets [], and the required parameter in a pair of curly braces {}. Curly braces are also used to create so-called groups (something like a scope restriction). If you need to display the curly brace itself, write, for example, \{ in the text. LaTeX does not take into account multiple spaces (they turn into a single space) and single newlines (they also turn into a single space later). A new paragraph is indicated by two newlines. This is a simplified description of the LaTeX syntax, but in my case it is completely correct.
I need a function that returns the content of the required argument of the word macro from a previously read file, like this:
import re

def tag(file, word):  # file - content of the LaTeX file, word - name of the macro
    regex = re.compile(r'\\' + word + r'\[?.*?\]?\{(.*)\}', re.MULTILINE)
    return regex.findall(file)[0]
This function works well only if the content of the desired command is on one line (which is logical), because with re.DOTALL the pattern turns out to be too greedy. I need it to be neither too greedy nor too lazy: it must take into account the pairing of curly braces, whose nesting can be quite deep, and also ignore single eol characters (just replace them with spaces), while two consecutive newline characters should be converted to a single line terminator.
That is, as a result, I need tag(file, 'abstract') to return this:
Problems of using {\it regular structures} of memory $\Sigma =\{\Sigma(z) \vert z\in M\}$ and $\tilde{\Sigma} = \{\tilde{\Sigma}(z) \vert z\in R\}$ to model the processes of synthesis of knowledge structures with the help of knowledge processing operations from special classes of such operations are considered. These structures define the formats of the memory subdomains used.
Area formats match domain element structures (domains of definition and meaning of operations)...
I would be grateful for advice on constructing a correct regular expression.
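Python's built-in re module has no construct for balanced braces, so one possible approach is to use a regex only to find the start of the macro and then count braces by hand. The following is a minimal sketch under the simplified LaTeX syntax described above; it treats any backslash as escaping the next character (so \{ and \} are not counted) and does not handle comments or verbatim text.

```python
import re


def tag(text, word):
    """Return the required {...} argument of \\word, or None if not found."""
    # Macro name, optional [..] argument, then the opening brace
    start_re = re.compile(r'\\' + re.escape(word) + r'\s*(?:\[[^\]]*\])?\s*\{')
    m = start_re.search(text)
    if m is None:
        return None

    start = m.end()
    depth = 1
    i = start
    while i < len(text):
        ch = text[i]
        if ch == '\\':
            i += 2  # skip the escaped character, e.g. \{ or \}
            continue
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                break
        i += 1
    else:
        return None  # ran off the end: braces are unbalanced

    body = text[start:i]
    # A blank line separates paragraphs; single newlines become spaces
    paragraphs = re.split(r'\n\s*\n', body)
    return '\n'.join(' '.join(p.split()) for p in paragraphs)


sample = "\\title[Short\\ldots]{Long {\\it title}}\n\\abstract{Sets $\\{a\\}$ of\nthings.\n\nNext}"
print(tag(sample, 'abstract'))
```

Note that this also collapses runs of spaces into one, which matches how LaTeX treats multiple spaces.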

How to wrap all words in a document with vim?

Here is the file. On line 97 and below are the lines that I want to put in a Python dictionary. The idea is that the words on the left side of the colon ':' will become keys and those on the right side will be values. All keys and values must be strings, so I need to wrap all words (from line 97 and below) in quotation marks. So the question is: how do I wrap all words in a document in quotes?
My purpose is to obtain column names for preprocessing for machine learning. If you are interested, you can find the columns without names here.
Difficult to know exactly what you mean if you don't post the code (and no, I am not going to download and open a file called adult.names from a random person on the internet). However, if all you want is for every word to be wrapped in quotes, you can use a global substitution:
:%s/\w\+\ze[\s, \n, :]\+/"\0"/g
Explanation:
:s/regex/text will replace whatever is matched by regex with text on the current line.
Add a % at the beginning and it will do it for all lines.
If you only want to do this for a section of your document, make a visual selection and then run this command without the %.
\w matches a word character
\ze ends the match (so you can specify what comes after whatever you're matching)
The [\s, \n, :] is a collection intended to match spaces, newlines, and colons, and the \+ following it means match a non-zero number of those (i.e. at least one space or newline or colon). Note that Vim does not expand \s inside [], so it is the literal space in the collection that matches spaces, and the commas are matched literally as well.
All of that together means it's matching each word individually.
Then, for each of those matched words, it is replacing it with a quotation mark, then \0, which stands for the whole matched text, and another quotation mark.
The /g at the end means that it will do this substitution as many times as it finds the regex on each line. Without that it would only substitute the first match on each line.
The result should be that it wraps every word in quotes. But again, it's difficult to test and find the right solution without seeing what you're working with. In the future please put the relevant pieces of code in your post.
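Since the end goal is a Python dictionary, the same substitution can also be tried out in Python. This is only a sketch of the idea: quote_words is a made-up helper, the sample text is invented for illustration, and \w+ will split hyphenated words, just as \w does in Vim.

```python
import re


def quote_words(text):
    # Wrap every run of word characters in double quotes;
    # \g<0> is the whole match, like \0 in the Vim command.
    return re.sub(r'\w+', r'"\g<0>"', text)


# Hypothetical sample, not taken from the actual file
sample = "age: continuous\nsex: Female, Male\n"
print(quote_words(sample))
```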

regex- capturing text between matches

In the following text, I try to match a number followed by ")" and number followed by a period. I am trying to retrieve the text between the matches.
Example:
"1) there is a dsfsdfsd and 2) there is another one and 3) yet another
case"
so I am trying to output: ["there is a dsfsdfsd and", "there is another one and", "yet another case"]
I've used this regex: (?:\d)|\d.)
Adding a .* at the end matches the entire string; I only want it to match the words in between.
also in this string:
"we will give 4. there needs to be another option and 6.99 USD is a
bit amount"
I want to only match the 4. and not the 6.99
Any pointers will be appreciated. Thank you.
TL;DR
Regular expressions are tricky beasts and you should avoid them if at all possible.
If you can't avoid them, then make sure you have lots of test cases for all the edge cases that can occur.
Build up your regular expression slowly and systematically, testing your assumptions at every step.
If this code will go into production, then please write unit tests that explain the thinking process to the poor soul who has to maintain it one day.
The long version
Regular expressions are finicky. Your best approach may be to solve the problem a different way.
For example, your language might have a library function that allows you to split up strings using a regular expression to define what comes between the numbers. That will let you get away with writing a simpler regex to match the numbers and brackets/dots.
If you still decide to use regular expressions, then you need to be very structured about how you build up your regular expressions. It's extremely easy to miss edge cases.
So let's break this down piece by piece...
1. Set up a test environment for quickly experimenting with your regex.
There are lots of options here, depending on your programming language and OS. Ones I sometimes use are:
a Powershell window for testing .Net regexes (NB: the cli gives you a history of past attempts, so you can go back a few steps if you mess things up too badly)
a Python console for testing Python regexes (which are slightly different to .Net regexes in their syntax for named capture groups).
an html page with JavaScript to test the regex
an online or desktop regex tool (I still use the ancient Regular Expression Workbench from Eric Gunnerson, but I'm sure there are better alternatives these days)
Since you didn't specify a language or regex version, I'll assume .Net regular expressions
2. Create a single test string for testing a wider variety of options.
Your goal is to include as many edge cases as you can think of. Here's what I would use: "ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10."
Note that I've added a few extra cases you didn't mention:
empty strings between two round bracket numbers: "4)" and "5)"
white space string between two round bracket numbers: "5)" and "6)"
empty strings between a round bracket number and a dotted number: "6)" and "10."
empty string after the dotted number "10." at the end of the string
random text and empty space, which should be ignored, before the first number
I'm going to make a few assumptions here, which you will need to vary based on your actual requirements:
You DO want to capture the white space after the dot or round bracket.
You DO want to capture the white space before the next dotted number or round bracket number.
You might have numbers that go beyond 9, so I've included "10" in the test cases.
You want to capture empty strings at the end e.g. after the "10."
NOTES:
Thinking through this test case forces you to be more rigorous about your requirements.
It will also help you be more efficient while you are manually testing your regular expression.
HOWEVER, this is assuming you aren't following a TDD approach. If you are, then you should probably do things a little differently... create unit tests for each scenario separately and get the regex working incrementally.
This test string doesn't cover all cases. For example, there are no newline or tab characters in the test string. Also it can't test for an empty string following a round bracket number at the very end.
3. First get a regex working that just captures the round brackets and dotted brackets.
Don't worry about the $6.99 edge case yet.
Drop the "(?:" non-capturing group syntax from your regex for now: "\d)|\d."
This doesn't even parse, because you have an unescaped round bracket.
The revised string is "\d\)|\d.", which parses, but which also matches "99" which you probably weren't expecting. That's because you forgot to escape the "."
The revised string is "\d\)|\d\.". This no longer matches "99", but it now matches "0." at the end instead of "10.". That's because it assumes that numbers will be single digit only.
The following string seems to work: "\d+\)|\d+\."
Time to deal with that pesky "$6.99" now...
4. Modify the regex so that it doesn't capture a floating point number.
You need to use a negative look ahead pattern to prevent a digit being after the decimal point.
Result: "\d+\)|\d+\.(?!\d)"
Count how many matches this produces. You're going to use this number for checking later results.
Hint: Save the regex pattern somewhere. You want to be able to go back to it any time you mess up your regex pattern beyond repair.
If you found a string splitting function, then you should use it now and avoid the complexity that follows. [I've included an example of this at the end.]
Simple is better, but I'm going to continue with the longer solution in the interests of showing an approach to staying in control of regex'es that start getting horribly complicated
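For readers following along in Python rather than .Net, the delimiter pattern built up above can be checked in a console like this. It is only a sketch; the variable names are arbitrary.

```python
import re

test = ("ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one "
        "and 3. yet another case 4)5) 6)10.")

# Digits followed by ")" or by "." that is not part of a decimal number
delimiter = re.compile(r'\d+\)|\d+\.(?!\d)')

matches = delimiter.findall(test)
print(matches)       # ['1.', '2)', '3.', '4)', '5)', '6)', '10.']
print(len(matches))  # 7
```

The count of 7 is the number to compare against in the later steps.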
5. Decide how to exclude that pattern
You used the non-capture group pattern in your question i.e. "(?:"
That approach can work. But it's a bit cumbersome, because you need to have a capturing group after it that you will look for instead.
It would be much nicer if your entire pattern matched what you are looking for.
So wrap the number pattern inside a zero-width positive look behind pattern (if your language supports it) i.e. "(?<=".
This checks for the pattern, but doesn't include it in what gets captured.
So now your regex looks like this: "(?<=\d+\)|\d+\.(?!\d))"
6. Test it!
It might seem silly to test this on its own - all the matches are empty strings.
Do it anyway. You want to sanity check every step of the way.
Make sure that it still produces the same number of matches as in step 4.
7. Decide how to match the text in between the numbers.
You rightly mention that ".*" will match the entire string, not just the parts in between.
There's a neat trick that allows you to reuse the pattern from step 5 to get the text in between.
Start by just matching the next character
The trick is that you want to match any character unless it's the start of the next number
That sounds like a negative look ahead pattern again: "(?!"
Let X be the pattern you saved in step 4. Matching a single character will look like this: "(?!X)."
You want to match lots of those characters. So put that pattern into a non-capturing group and repeat it: "(?:(?!X).)*"
This assumes you want to capture empty text.
If you don't, then change the "*" to a "+".
Hint: This is such a common pattern that you will want to reuse it in future, pasting in different patterns in place of X.
I used a non-capturing group instead of a normal group so that you can also embed this pattern in regexes where you do care about the capturing groups.
Resulting pattern: "(?:(?!\d+\)|\d+\.(?!\d)).)*"
I suggest testing this pattern on its own to see what it does
8. Now put parts 5 and 7 together: "(?<=\d+\)|\d+\.(?!\d))(?:(?!\d+\)|\d+\.(?!\d)).)*"
9. Test it!
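One caveat if you try this in Python: the built-in re module only accepts fixed-width look-behind patterns, so the "(?<=\d+..." part is rejected there. A possible workaround, sketched below, is to fall back to the capturing-group approach mentioned earlier: match the number with a non-capturing group and capture the text after it. The variable names are arbitrary.

```python
import re

test = ("ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one "
        "and 3. yet another case 4)5) 6)10.")

# X is the delimiter pattern: "1)" or "1." but not the "6." in "6.99"
X = r'\d+\)|\d+\.(?!\d)'

# Delimiter (not captured), then any characters that do not start a new delimiter
pattern = re.compile(rf'(?:{X})((?:(?!{X}).)*)')

pieces = pattern.findall(test)
print(pieces)
# [' there is a dsfsdfsd costing $6.99 and ', ' there is another one and ',
#  ' yet another case ', '', ' ', '', '']
print(len(pieces))  # 7, the same count as the delimiter matches
```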
10. Unit tests!
If this is going into production, then please write lots of unit tests that will explain each step of this thought process
Have pity on the poor soul who has to maintain your regex in future!
By rights that person should be you
I suggest putting a note in your calendar to return to this code in 6 months' time and make sure you can still understand it from the unit tests alone!
11. Refactor
In six months' time, if you can't understand the code any more, use your newfound insight (and incentive) to solve the problem without using regular expressions (or only very simple ones)
Addendum
As an example of using a string splitting function to get away with a simpler regex, here's a solution in Powershell:
$string = 'ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10.'
$pattern = [regex] '\d+\)|\d+\.(?!\d)'
$string -split $pattern | select-object -skip 1
Judging by the task you have, it might be easier to match the delimiters and use re.split (as also pointed out by bobblebubble in the comments).
I suggest simply:
\d+[.)]\B\s*
It matches 1 or more digits, then a . or a ), then it makes sure there is no word letter (digit, letter or underscore) after it and then matches zero or more whitespace.
Python demo:
import re
rx = r'\d+[.)]\B\s*'
test_str = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case\n\"we will give 4. there needs to be another option and 6.99 USD is a bit amount"
print([x for x in re.split(rx,test_str) if x])
Try the following regex with the g modifier:
([A-Za-z\s\-_]+|\d(?!(\)|\.)\D)|\.\d)
Example: https://regex101.com/r/kB1xI0/3
[A-Za-z\s\-_]+ automatically matches all alphabetical characters + whitespace
\d(?!(\)|\.)\D) match any numeric sequence of digits not followed by a closing parenthesis ) or decimal value (.99)
\.\d match any period followed by numeric digit.
I used this pattern:
(?<=\d.\s)(.*?)(?=\d.\s)
This looks for the contents between any digit, any character, then a space.
Edit: Updated pattern to handle the currency issue and line ends better:
This is with flag 'g'
(?<=[0-9].\s)(.*?)(?=\s[0-9].\s|\n|\r)
import re
s = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case"
s1 = "we will give 4. there needs to be another option and 6.99 USD is a bit amount"
regex = re.compile(r"\d\)\s.*?|\s\d\.\D.*?")
print([x for x in regex.split(s) if x])
print(regex.split(s1))
Output:
['there is a dsfsdfsd and ', 'there is another one and ', 'yet another case']
['we will give', 'there needs to be another option and 6.99 USD is a bit amount']

parsing string with specific name in python

I have a string like this:
<name:john student male age=23 subject=\computer\sience_{20092973}>
I am confused by the ":" and "=" separators.
I want to parse this string and split it into a list like this:
name:john
job:student
sex:male
age:23
subject:{20092973}
How can I parse a string into fields with specific names (name, job, sex, etc.) in Python?
I have already searched but couldn't find anything, sorry.
Thank you.
It's generally a good idea to give more than one example of the strings you're trying to parse. But I'll take a guess. It looks like your format is pretty simple, and primarily whitespace-separated. It's simple enough that using regular expressions should work, like this, where line_to_parse is the string you want to parse:
import re
matchval = re.match(r"<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})", line_to_parse)
matchgroups = matchval.groups()
Now matchgroups will be a tuple of the values you want. It should be trivial for you to take those and get them into the desired format.
If you want to do many of these, it may be worth compiling the regular expression; take a look at the re documentation for more on this.
As for the way the expression works: I won't go into regular expressions in general (that's what the re docs are for) but in this case, we want to get a bunch of strings that don't have any whitespace in them, and have whitespace between them, and we want to do something odd with the subject, ignoring all the text except the part between { and }.
Each "(...)" in the expression saves whatever is inside it as a group. Each "\S+" stands for one or more ("+") characters that aren't whitespace ("\S"), so "(\S+)" will match and save a string of length at least one that has no whitespace in it. Each "\s+" does the opposite: it has no parentheses around it, so it doesn't save what it matches, and it matches one or more ("+") whitespace characters ("\s"). This suffices for most of what we want. At the end, though, we need to deal with the subject. "[...]" allows us to list multiple types of characters. "[^...]" is special, and matches anything that isn't in there. {, like [, (, and so on, can have a special meaning in a regular expression, so we escape it with \ to make sure it is treated literally, and in the end, that means "[^\{]*" matches zero or more ("*") characters that aren't "{". Since "*" and "+" are "greedy", and will try to match as much as they can and still have the expression match, we now only need to deal with the last part. From what I've talked about before, it should be pretty clear what "(\{\S+\})" does.
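Putting the pieces together, a minimal sketch that turns the match groups into the key:value lines from the question might look like this. The helper name parse_record and the tuple of field names are chosen for this example; the field names simply follow the desired output in the question.

```python
import re

# Same pattern as above, compiled once for reuse
RECORD = re.compile(
    r"<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})"
)
FIELDS = ("name", "job", "sex", "age", "subject")


def parse_record(line):
    """Return ['name:john', ...] for a matching line, or None otherwise."""
    match = RECORD.match(line)
    if match is None:
        return None
    return [f"{key}:{value}" for key, value in zip(FIELDS, match.groups())]


line_to_parse = r"<name:john student male age=23 subject=\computer\sience_{20092973}>"
print("\n".join(parse_record(line_to_parse)))
```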

Extract numbers with EXPONENTS from heterogeneous text file

I need to extract some unformatted numerical data from a text file. In the text file, the numbers are separated in some places by a single space, in others by multiple spaces, and in others by tabs; pretty heterogeneous text :(
I want Python to ignore all spaces/tabs and identify whole numerical values and put them in an array/list. Is it possible to do this using Python?
EDIT: There are many numbers written in scientific/exponential notation e.g. 1.2345E+06, and Python does not recognize them as numbers. So \d does not work simply :(
I don't want to use a normal string search for this purpose (given there are many strings/words which are of no interest/use). The regular expression module documentation has nothing mentioned about this issue.
If lines are like " 2.3e4 " or "2.6" or so, try:
^\s*?([+-]?\d+(\.\d+)?(e[+-]?\d+)?)\s*$
Notice the \s*? mark (non-greedy zero or more spaces). Since \s can never match a digit, a greedy \s* would capture the same number, so the question mark is harmless but not required. Also note that the pattern uses a lowercase e, so compile it with re.IGNORECASE if your numbers look like 1.2345E+06.
AFAIK Python has no special symbol, other than \d for digits, to capture numbers.
You could use a regular expression like \s+([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s+ (adapted from here). Take a look at this to see how you can search for a regular expression in a file.
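As a sketch of how such a pattern could be applied to a whole text, the example below uses look-arounds instead of \s+ on both sides, so that two numbers separated by a single space are both found. The pattern and the helper name extract_numbers are assumptions made for this example, not taken from the linked sources; tokens glued to letters (such as "x12") are deliberately skipped.

```python
import re

# Optional sign, digits with optional fraction (or a bare fraction),
# optional exponent; must not touch a word character or dot on either side.
NUMBER = re.compile(
    r'(?<![\w.])[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?(?![\w.])'
)


def extract_numbers(text):
    """Return all standalone numbers in text as floats."""
    return [float(token) for token in NUMBER.findall(text)]


sample = "x 1.2345E+06\t42   -3.5e-2 end"
print(extract_numbers(sample))  # [1234500.0, 42.0, -0.035]
```

float() accepts scientific notation directly, so no extra conversion step is needed once the tokens are isolated.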
