Python Regex: how to not select whitespace before last string? - python

I am a newbie struggling to separate a database into columns with regex.findall().
I want to separate these Dutch street names into name and number.
Roemer Visscherstraat 15
Vondelstraat 102-huis
For the number I use
\S*$
Which works just fine. For the street name I use
^\S.+[^\S$]
Or: use everything but the last element, which may be a number or a combination of a number and something else.
Problem is: Python then also keeps the last whitespace after the last name, so I get:
'Roemer Visscherstraat '
Any way I can stop this from happening?
Also, findall() returns a list containing the bit of the database I wanted and an empty string. How does this happen, and can I prevent it somehow?
Thanks so much in advance for your help.

You can rstrip() the name to remove any spaces at the end of it:
>>>'Roemer Visscherstraat '.rstrip()
'Roemer Visscherstraat'
But if the input is similar to the one you posted, you can simply use split() instead of regex, for example:
st = 'Roemer Visscherstraat 15'
data = st.split()
num = data[-1]
name = ' '.join(data[:-1])
print('Name: {}, Number: {}'.format(name, num))
output:
Name: Roemer Visscherstraat, Number: 15

For the number you should use the following:
\S+$
Using a + instead of a * will ensure that you have at least one character in the match.
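This also explains the empty string in the findall() result from the question: \S* can match zero characters, so after matching the number it matches once more, with an empty string, at the very end of the input. A minimal sketch, assuming Python 3.7 or later (the variable names are only for illustration):

```python
import re

street = 'Roemer Visscherstraat 15'

# \S* may match zero characters, so it also matches the empty string at the end.
with_star = re.findall(r'\S*$', street)

# \S+ requires at least one character, so only the number is returned.
with_plus = re.findall(r'\S+$', street)

print(with_star)  # ['15', '']
print(with_plus)  # ['15']
```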
For the street name you can use the following:
^.+(?=\s\S+$)
This selects the text up to, but not including, the whitespace before the number.
However, what you may consider doing is using one regex match with capture groups instead. The following would work:
^(.+(?=\s\S+$))\s(\S+$)
In this case, the first capture group gives you the street name, and the second gives you the number.
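Applied in Python, the combined pattern can be used with re.match(). This is a sketch rather than a definitive implementation; the helper name split_address is made up for this example.

```python
import re

# The combined pattern from above: group 1 is the street name, group 2 the number.
ADDRESS = re.compile(r'^(.+(?=\s\S+$))\s(\S+$)')

def split_address(line):
    """Return (street name, house number), or None if the line does not match."""
    m = ADDRESS.match(line)
    return m.groups() if m else None

for line in ['Roemer Visscherstraat 15', 'Vondelstraat 102-huis']:
    print(split_address(line))
```

A line without any whitespace, such as a bare street name, does not match and yields None, so the caller can decide how to handle it.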

([^\d]*)\s+(\d.*)
In this regex, the first group captures everything before the whitespace that precedes the number, and the second group captures the number itself.
My assumption is that the number begins with a digit and that the name contains no digits.
Take a look at https://regex101.com/r/eW0UP2/1
Roemer Visscherstraat 15
Full match 0-24 `Roemer Visscherstraat 15`
Group 1. 0-21 `Roemer Visscherstraat`
Group 2. 22-24 `15`
Vondelstraat 102-huis
Full match 24-46 `Vondelstraat 102-huis`
Group 1. 24-37 `Vondelstraat`
Group 2. 38-46 `102-huis`

Related

I have a list and I want to print a specific string from it. How can I do that?

So far I have done this, but it returns the movie name; I want the year (for example 1995) in a separate list.
moviename=[]
for i in names:
    moviename.append(i.split(' (',1)[0])
One issue with the code you have is that you're getting the first element of the list returned by split, which is the movie title. You want the second element, split()[1].
That being said, this solution won't work very well for a couple of reasons.
You will still have the closing parenthesis in the year: "1995)"
It won't work if the title itself contains parentheses (e.g. for Shanghai Triad)
If the year is always at the end of each string, you could do something like this.
movie_years = []
for movie in names:
    movie_years.append(movie[-5:-1])
You could use a regular expression.
\(\d*\) will match an opening bracket, followed by any number of digit characters (0-9), followed by a closing bracket.
Put only the \d+ part inside a capturing group to get only that part.
import re
year_regex = r'\((\d+)\)'
moviename=[]
for i in names:
    if re.search(year_regex, i):
        moviename.append(re.search(year_regex, i).group(1))
By the way, you can make this all more concise using a list comprehension:
year_regex = r'\((\d+)\)'
moviename = [re.search(year_regex, name_and_year).group(1)
             for name_and_year in names
             if re.search(year_regex, name_and_year)]
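Both versions above run the search twice per title. A possible variant searches once and anchors the year at the end of the string, so parentheses inside a title are not mistaken for the year. The sample list is hypothetical, since the original names list is not shown in the question, and the pattern assumes a four-digit year.

```python
import re

# Hypothetical sample data; the original list `names` is not shown in the question.
names = [
    'Toy Story (1995)',
    'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)',
    'No year here',
]

# Four digits in parentheses at the very end of the string.
year_regex = re.compile(r'\((\d{4})\)\s*$')

movie_years = []
for name in names:
    match = year_regex.search(name)  # search once and reuse the match object
    if match:
        movie_years.append(match.group(1))

print(movie_years)  # ['1995', '1995']
```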

Regex substitution that returns a trimmed version of the input?

I am dealing with a variety of "five and two" strings that refer to an individual. The strings have the first five letters of an individual's last name, and then the first two letters of the individual's first name. Each string concludes with a two digit numeral that acts as a "tiebreaker" if more than two individuals have the same "five and two." The numerals are to be considered strings. In the event of an individual who possesses a last name shorter than five letters, the entire last name is included in the string with no extra characters to fill in the gap.
Examples:
adamsjo02
allenje01
alstoga01
ariasge01
aucoide01
ayraujo01
belkti01 #This individual has a last name with only four letters
I wish to convert each of these strings into a "four and one" string that has a three digit numeral. The result of the above examples after being converted should look like this:
adamj002
allej001
alstg001
ariag001
aucod001
ayraj001
belkt001
I am using python throughout my project. I suspect that a regex substitution would be the best course of action to achieve what I need. I have little experience with regexes, and have come up with this thus far to detect the regex:
re.compile(r'(/w){2,5}(/w/w)(/w/w)')
While this does not work for me, it does lay out that I perceive there to be three groupings in each string. The last name portion, the first name portion, and the numerals (to be treated as strings). Each of those groupings ought to be undergoing a change, with exception to any individual that may have a last name of four or fewer letters.
You can do this with the proper escape character (\ rather than /) and an f-string:
import re
text = '''adamsjo02
allenje01
alstoga01
ariasge01
aucoide01
ayraujo01
belkti01
maja01'''
p = re.compile(r"(\w{2,5})(\w{2})(\d{2})")
output = [f"{m.group(1):_<4.4}{m.group(2):1.1}{m.group(3):0>3}" for m in map(p.search, text.splitlines())]
print(output)
# ['adamj002', 'allej001', 'alstg001', 'ariag001', 'aucod001', 'ayraj001', 'belkt001', 'ma__j001']
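Since the question asks specifically for a substitution, the same idea can also be sketched with re.sub(). This assumes every code consists of lowercase ASCII letters followed by exactly two digits; the pattern and the function name to_four_and_one are illustrative, not taken from the answers. Unlike the f-string version, it does not pad short last names with underscores.

```python
import re

# Assumes lowercase ASCII letters followed by exactly two digits.
# Group 1: up to four letters of the last name (an optional fifth is skipped).
# Group 2: first letter of the first name (the second letter is skipped).
# Group 3: the two-digit numeral.
FIVE_AND_TWO = re.compile(r'^([a-z]{1,4})[a-z]?([a-z])[a-z](\d{2})$')

def to_four_and_one(code):
    # Rebuild the code and pad the numeral with one leading zero.
    return FIVE_AND_TWO.sub(r'\g<1>\g<2>0\g<3>', code)

codes = ['adamsjo02', 'allenje01', 'belkti01', 'maja01']
print([to_four_and_one(c) for c in codes])
# ['adamj002', 'allej001', 'belkt001', 'maj001']
```

Strings that do not fit the pattern are returned unchanged, because re.sub() only replaces what matches.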
In this case, since you have a very specific format, I'd say regex is not necessary, though it does the job. I'm proposing, then, an alternate solution without using it.
def to_four_one(code: str) -> str:
    last, first, number = code[:-4][:4], code[-4:-2], int(code[-2:])
    return f"{last}{first[0]}{number:03}"
It's a simple function that rearranges the elements of the string. It extracts the last name, first name and number as separate elements and rewrites them in the new format (clipping the last name to at most four characters and the first name to one, and formatting the number with three digits).
Usage below. I added two more names with even fewer characters to show it doesn't break in those cases.
codes = [
"adamsjo02",
"allenje01",
"alstoga01",
"ariasge01",
"aucoide01",
"ayraujo01",
"belkti01",
"jorma03",
"baka02"]
for code in codes:
    print(to_four_one(code))
output:
adamj002
allej001
alstg001
ariag001
aucod001
ayraj001
belkt001
jorm003
bak002

separating a string into two using whitespace as the separator as the base for separation

I'm currently struggling to separate a string into two while using a white space as the base of the split.
I'm writing a program where a person is supposed to input their first and last name as a string. Then the program should take the first three letters of the first name and the first three letters of the last name, put these together, and append them to a domain name to create an email address. For example, Firstname Lastname becomes firlas@example.com. I know I could easily solve the problem by just using two inputs, but I want it to be only one input.
def create_email_address():
    name = input("Please input your first and last name: ")
    domain = "@aperturescience.com"
    email_address = str.lower(name[:3]) + str.lower(name[10:13]) + domain
    print(str(email_address))
This is what I have so far, and while I have experimented using list splits and indexes I can't get it to work properly. I can get the index to work when I use my own name as an example name. But if I input something that doesn't have the same amount of characters in it it doesn't perform as I want it to.
What I would like is to separate the first name and last name using the white space as the indicator where the split is supposed to be. That way it won't matter how long or short the names are, I can always index after the first three letters in each name.
Is this possible? and if so, any tips?
Thanks!
def create_email_address():
    name = input("Please input your first and last name: ")
    domain = "@aperturescience.com"
    names = name.split()
    email_address = str.lower(names[0][:3]) + str.lower(names[1][:3]) + domain
    print(str(email_address))
The above code has been slightly modified to use split - which will create a list from your input string assuming that it is delimited by whitespaces.
Use the str.split() method to get the first and last name. Note: if you don't specify any arguments, it will split not only on spaces but on all whitespace characters (e.g. "\n", the newline).
Here are the parts of your code which should be corrected:
1) It's better to use name[:3].lower() instead of str.lower(name[:3]), though the code works fine in both cases. In Python, we almost always call a method on an already created instance (str_instance.lower() instead of str.lower(str_instance)).
2) You shouldn't use print(str(...)), because print will convert the result to string even if you don't specify it explicitly. Moreover, in this case it is already a string.
So, here is the full fixed code:
first, last = input("Please input your first and last name: ").split()
domain = "@aperturescience.com"
email_address = first[:3].lower() + last[:3].lower() + domain
print(email_address)
You can use str.split() like so:
# get user input:
full_name = input('Enter your full name: ')
# split user input on white space, this will return a list:
first_name = full_name.split()[0]
last_name = full_name.split()[1]
# get first three characters in string:
first_name_first_three = first_name[0:3]
last_name_first_three = last_name[0:3]
print(first_name_first_three)
print(last_name_first_three)
One of the best ways to do that would be to use regular expressions. Please refer to the examples below.
import re
pattern = re.compile(r'\W+')
name = "Jerry Maguire"
split_names = pattern.split(name)
split_names
Out[24]: ['Jerry', 'Maguire']
name = "FrodoBaggins"
split_names = pattern.split(name)
split_names
Out[27]: ['FrodoBaggins']
This would handle scenarios where there are no spaces or if name needs to be split on multiple lines.
While we're all giving solutions, I'll just give you the bad answer. A one liner:
>>> print("".join(["".join([i[:3].lower() for i in input("Enter your full name: ").split(" ")]), "@example.com"]))
Enter your full name: Albert Finksson
albfin@example.com
This one is almost unreadable though, for your sake and others in the future use the other solutions.
I see that the use case where the name contains more than two words (e.g. a middle name) hasn't been addressed. So here's my attempt at an answer:
def create_email():
    name = input("Enter Full Name : ")
    domain = "example.com"
    email = name[:3].lower() + name.split()[-1][:3].lower() + "@" + domain
    print(email)
Couple of pointers on the code above:
you don't really need to split to get the first three letters of first name as it's just the first three letters of full name
name.split()[-1] will get the last name from full name and [:3] will get the first three letters of last name
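Pulling these suggestions together, the logic can be separated from the input() call so that it is easy to test. This is only a sketch: the function name make_email, the default domain, and the decision to raise ValueError for a single-word name are assumptions, not part of the original answers.

```python
def make_email(full_name, domain="example.com"):
    """Build an address from the first three letters of the first and last name."""
    parts = full_name.split()  # splits on any run of whitespace
    if len(parts) < 2:
        raise ValueError("expected at least a first and a last name")
    first, last = parts[0], parts[-1]  # middle names are ignored
    return f"{first[:3]}{last[:3]}@{domain}".lower()

print(make_email("Albert Finksson"))      # albfin@example.com
print(make_email("  Mary Ann   Smith "))  # marsmi@example.com
```

Taking parts[0] rather than name[:3] keeps the result correct when the input has leading spaces or a first name shorter than three letters.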

Incorrect output due to regular expression

I had a pdf in which names are written after a '/'
Eg: /John Adam Will Newman
I want to extract the names starting with '/'.
The code which I wrote is:
names=re.compile(r'((/)((\w)+(\s)))+')
However, it produces only the first name in the string, "John", and returns it twice; the rest of the name is missing.
Your + is at the wrong position; your regexp, as it stands, would demand /John /Adam /Will /Newman, with a trailing space.
r'((/)((\w)+(\s))+)' is a little better; it will accept /John Adam Will, with a trailing space; won't take Newman, because there is nothing to match \s.
r'((/)(\w+(\s\w+)*))' matches what you posted. Note that it is necessary to repeat one of the sequences that match a name, because we want N-1 spaces if there are N words.
(As Ondřej Grover says in comments, you likely have too many unneeded capturing brackets, but I left that alone as it hurts nothing but performance.)
I think you define way too many unnamed regexp groups. I would do something like this
import re
s = '/John Adam Will Newman'
name_regexp = re.compile(r'/(?P<name>(\w+\s*)+)')
match_obj = name_regexp.match(s) # match object
group_dict = match_obj.groupdict() # dict mapping {group name: value}
name = group_dict['name']
(?P<name>...) starts a named group
(\w+\s*) is a group matching one or more alphanum characters, possibly followed by some whitespace
the match object returned by the .match(s) method has a method groupdict() which returns a dict which is mapping from group names to their contents
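If only the full name is needed, the extra groups can be made non-capturing with (?:...), so that re.findall() returns just the names. A small sketch with made-up sample text:

```python
import re

# Hypothetical sample text with two slash-prefixed names.
text = 'Signed: /John Adam Will Newman, approved: /Jane Doe'

# One word, then zero or more "whitespace + word" repetitions.
# (?:...) groups without capturing, so findall returns only the outer group.
name_regexp = re.compile(r'/(\w+(?:\s\w+)*)')

print(name_regexp.findall(text))
# ['John Adam Will Newman', 'Jane Doe']
```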

python: regular expressions, how to match a string of undefined length which has a structure and finishes with a specific group

I need to create a regexp to match strings like this 999-123-222-...-22
The string can be finished by &Ns=(any number) or without this... So valid strings for me are
999-123-222-...-22
999-123-222-...-22&Ns=12
999-123-222-...-22&Ns=12
And following are not valid:
999-123-222-...-22&N=1
I have been testing this for several hours already but have not managed to solve it. I really need some help.
Not sure if you want to literally match 999-123-22-...-22 or if that can be any sequence of numbers/dashes. Here are two different regexes:
/^[\d-]+(&Ns=\d+)?$/
/^999-123-222-\.\.\.-22(&Ns=\d+)?$/
The key idea is the (&Ns=\d+)?$ part, which matches an optional &Ns=<digits>, and is anchored to the end of the string with $.
If you just want to allow the strings 999-123-222-...-22 and 999-123-222-...-22&Ns=12, you are better off using a string function.
If you want to allow any numbers between - you can use the regex:
^(\d+-){3}[.]{3}-\d+(&Ns=\d+)?$
If the numbers must be of only 3 digits and the last number of only 2 digits you can use:
^(\d{3}-){3}[.]{3}-\d{2}(&Ns=\d{2})?$
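In Python, the looser of these two patterns could be applied as follows. The sketch uses re.fullmatch(), which anchors both ends itself, so ^ and $ are dropped; it also avoids the detail that $ would accept a trailing newline. The function name is_valid is made up for this example.

```python
import re

# Same pattern as above, without ^ and $ because fullmatch() anchors both ends.
PATTERN = re.compile(r'(\d+-){3}[.]{3}-\d+(&Ns=\d+)?')

def is_valid(s):
    """Return True if s has the expected structure, with an optional &Ns=<digits> suffix."""
    return PATTERN.fullmatch(s) is not None

for candidate in ['999-123-222-...-22',
                  '999-123-222-...-22&Ns=12',
                  '999-123-222-...-22&N=1']:
    print(candidate, is_valid(candidate))
```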
This looks like a phone number and extension information.
Why not make things simpler for yourself (and anyone who has to read this later) and split the input rather than use a complicated regex?
s = '999-123-222-...-22&Ns=12'
parts = s.split('&Ns=') # splits on Ns and removes it
If the piece before the "&" is a phone number, you could do another split and get the area code etc into separate fields, like so:
phone_parts = parts[0].split('-') # breaks up the digit string and removes the '-'
area_code = phone_parts[0]
The portion found after the optional '&Ns=' can be checked to see if it is numeric with the string method isdigit(), which returns True if all characters in the string are digits and there is at least one character, and False otherwise.
if len(parts) > 1:
    extra_digits_ok = parts[1].isdigit()
