I have the following string which contains a repeating pattern of text followed by parentheses with an ID number.
The New York Yankees (12980261666)\n\nRedsox (1901659429)\nMets (NYC)
(21135721896)\nKansas City Royals (they are 7-1) (222497247812331)\n\n
other team (618006)\n
I'm struggling to write a regex that would return:
The New York Yankees (12980261666)
Redsox (1901659429)
Mets (NYC) (21135721896)
Kansas City Royals (they are 7-1) (222497247812331)
other team (618006)
The newline characters could be removed later with string.replace('\n', '').
Use a negated character class to achieve this.
String pat="([^\\n])"
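For comparison, here is a minimal Python sketch of one way to get the desired lines. It is not the negated-class approach above: it assumes every record ends with a parenthesized run of digits, lazily matches up to that point, and then collapses internal line breaks to a single space. The name split_records is made up for illustration, and a team name that itself contains something like "(123)" would end the record too early.

```python
import re

def split_records(text):
    # Lazily match from the first non-space character up to the first
    # "(digits)" group; re.S lets "." cross line breaks inside a record.
    raw = re.findall(r"\S.*?\(\d+\)", text, flags=re.S)
    # Collapse internal whitespace (including newlines) to single spaces.
    return [" ".join(r.split()) for r in raw]

text = (
    "The New York Yankees (12980261666)\n\nRedsox (1901659429)\nMets (NYC)\n"
    "(21135721896)\nKansas City Royals (they are 7-1) (222497247812331)\n\n"
    "other team (618006)\n"
)

for record in split_records(text):
    print(record)
```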
I have a string as follows:
27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
This string contains many newline characters. I want to explicitly ignore these, as well as all numbers with 6 or more digits. I came up with the following regex:
[^\n\d{6,}].+
This almost takes me there as it returns all the company names, however in cases where the company name itself contains a new line character these get returned as two different company names. For instance Westcon is a match and Group European Operations Netherlands Branch is also a match. I would like to tweak the above expression to make sure that the final match is Westcon European Operations Netherlands Branch. What regex concepts should I use to achieve this?
Edit
I tried the following based on the comment below but got the wrong result
text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'
re.findall(r'[^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+)',text)
I think that you only want the company names. If so, this should work.
import re
from pprint import pprint

input = '''27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
'''
company_name_regex = re.findall(r'[A-Za-z].*|[A-Za-z].*\d{1,5}.*', input)
pprint(company_name_regex)
['West Food Group B.V.9',
'Westcon',
'Group European Operations Netherlands Branch',
'Westland Infra Netbeheer B.V.',
'Wetransfer 85 B.V.',
'WETRAVEL B.V.']
This will create one group for lines that don't have numbers.
regex: /(?!(\d{6,}|\n))[a-zA-Z .\n]+/g
Demo: https://regex101.com/r/MMLGw6/1
Assuming your company names start with a letter, you may use this regex with the re.M modifier:
^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)
In python:
regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)
This matches a line that starts with [a-zA-Z] until end of line and then matches more lines separated by \n that also start with [a-zA-Z] characters.
(?=\n+\d{6,}$) is a lookahead assertion to make sure our company names have a newline and 6+ digits ahead.
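A small sketch of how this could be used on a shortened copy of the sample data. It assumes the lines are separated by plain newlines with no whitespace-only lines in between (the text in the question's edit has " " lines, which would need stripping first). Joining the pieces of each match with a space is my addition, not part of the regex.

```python
import re

text = """27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
"""

regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)

# A match may span several lines; join its pieces with a single space.
names = [" ".join(m.split()) for m in regex.findall(text)]
print(names)
```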
If you can solve this without regex it should be solved without regex:
useful = []
for line in text.splitlines():
    line = line.strip()
    # keep non-empty lines that are not purely digits
    if line and not line.isdigit():
        useful.append(line)
That should work - more or less. Replying from my phone so can't test.
Given two different types of address strings, how can I determine whether a city name consists of more than one word? I am working in Python, so I split the string and take s[0] for the street number, s[-1] for the zip code, and so on. But how do I figure out whether the city name is split across several words, such as New York or San Jose?
E.g.: 123 Main Street St. Louisville OH 43071 [City name is a single word]
E.g.: 45 Holy Grail Al. Niagara Town ZP 32908 [City name 'Niagara Town' is two words]
Forgive the noob question.
Thank you,
I am making two assumptions here:
1) That the code before the town name is always purely numeric
2) That no town name contains a purely numeric word
index = list(filter(lambda x: x[1].isnumeric(),enumerate(x.split())))[-1][0]
" ".join(x.split()[index+1:])
So what is happening: We try to identify the last part of the split that is purely numeric, and then get the index of that element. Then we join all elements after that numeric element.
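The same idea written out as a function, as a rough sketch. The function name and the sample address (with the numeric code placed before the town) are hypothetical and chosen to fit the two assumptions above. Note that for the strings in the question, where the zip code comes last, nothing follows the last numeric token, so the result would be empty.

```python
def text_after_last_number(address):
    tokens = address.split()
    # Positions of all tokens that consist only of numeric characters.
    numeric_positions = [i for i, tok in enumerate(tokens) if tok.isnumeric()]
    if not numeric_positions:
        return ""
    # Join everything that follows the last purely numeric token.
    return " ".join(tokens[numeric_positions[-1] + 1:])

# Hypothetical input: numeric code directly before the town name.
print(text_after_last_number("45 Holy Grail Al. 32908 Niagara Town"))
```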
The strings have two types.
The first type:
'The Five College Region of Western Massachusetts:'
#Doesn't contain "("
The second type:
'Tuskegee (Tuskegee University)[5]'
#Containing "("
If the string contains "(", remove the "(" and all characters after it, along with the white space before it.
If not, extract all characters.
I have figured out how to extract the second type of string.
r'(.+) \('
You don't need regex for this.
university = 'Tuskegee (Tuskegee University)[5]'
print(university.split("(", 1)[0].strip())
Use re.sub to remove everything after ( if you want to use regex:
import re
re.sub(r' \(.*', '', 'Tuskegee (Tuskegee University)[5]')
# 'Tuskegee'
re.sub(r' \(.*', '', 'The Five College Region of Western Massachusetts:')
# 'The Five College Region of Western Massachusetts:'
You can use re.sub(r'\s*\(.*', ...) to match optional whitespace, a "(", and everything after it.
If this matches, the match is replaced with the empty string. If not, nothing is replaced.
import re
re.sub(r'\s*\(.*', '', 'The Five College Region of Western Massachusetts:')
#'The Five College Region of Western Massachusetts:'
re.sub(r'\s*\(.*', '', 'Tuskegee (Tuskegee University)[5]')
#'Tuskegee'
I want to extract postal codes of Alberta (Canada) region from an address string.
For example:
addr = '12345-67 Ave, Edmonton, AB T1A 2B3, Canada'
Should extract T1A 2B3.
The regular expression to match the postal code is [T]\d[A-Z] *\d[A-Z]\d. However, given an entire address, I do not know how to extract only the postal code. I guess it has something to do with backreferences () but I cannot figure it out.
How can I achieve this in Python?
Extracting just the substring that matched the regexp is easy enough:
import re
test = re.compile(r'[T]\d[A-Z] *\d[A-Z]\d')
addr = '12345-67 Ave, Edmonton, AB T1A 2B3, Canada'
test.search(addr).group()
test.search will return a match object, which has all kinds of stuff you can extract.
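One thing worth guarding against: search returns None when nothing matches, so calling .group() directly raises AttributeError on an address without a postal code. A minimal sketch, where find_postal_code is a made-up helper name:

```python
import re

postal_re = re.compile(r"T\d[A-Z] *\d[A-Z]\d")

def find_postal_code(addr):
    match = postal_re.search(addr)
    if match is None:
        # No Alberta-style postal code in this address.
        return None
    # group() is the matched text, span() its (start, end) position.
    return match.group(), match.span()

addr = '12345-67 Ave, Edmonton, AB T1A 2B3, Canada'
print(find_postal_code(addr))
print(find_postal_code('no postal code here'))
```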
Building on @Peter's answer, here is how you can do it for some more postal codes:
US:
addr= 'Statue of liberty, New York, NY 10004, USA'
test = re.compile(r'\d{5}')
test.search(addr).group()
UK:
addr= 'Olympic Park, Montfichet Rd, London E20 1EJ, United Kingdom'
test = re.compile(r'[A-Z]\d\d\s\d[A-Z][A-Z]')
Canada:
addr= 'Toronto City Hall, 100 Queen St W, Toronto, ON M5H 2N2'
test = re.compile(r'[A-Z]\d[A-Z]\s\d[A-Z]\d')
[A-Z] matches any uppercase letter in the range A-Z
[a-zA-Z] matches any letter in the range a-z or A-Z, regardless of case
\d matches any digit
\d{n} matches exactly n consecutive digits
\s matches any whitespace character
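The three patterns can be collected in a lookup table, roughly as sketched here. The dictionary, the country keys, and the function name are my own; the \b word boundaries are an addition so that, for example, five digits inside a longer number are not picked up. The UK pattern only covers the "A99 9AA" shape used in the example, not every valid UK postcode.

```python
import re

# Hypothetical lookup table: one compiled pattern per country key.
PATTERNS = {
    "US": re.compile(r"\b\d{5}\b"),
    "UK": re.compile(r"\b[A-Z]\d\d\s\d[A-Z][A-Z]\b"),
    "CA": re.compile(r"\b[A-Z]\d[A-Z]\s\d[A-Z]\d\b"),
}

def extract_postal_code(addr, country):
    match = PATTERNS[country].search(addr)
    # search() returns None when the address has no matching code.
    return match.group() if match else None

print(extract_postal_code('Statue of liberty, New York, NY 10004, USA', 'US'))
print(extract_postal_code('Olympic Park, Montfichet Rd, London E20 1EJ, United Kingdom', 'UK'))
print(extract_postal_code('Toronto City Hall, 100 Queen St W, Toronto, ON M5H 2N2', 'CA'))
```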
You can also use Regex101, which is a very helpful tool for testing Regexes.
I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile(r'\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile(r'NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print(i)
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?
It's missing it because his state name contains two words: SOUTH CAROLINA.
Change your second regex to this; it should help:
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
which is an optional, non-capturing group matching a space followed by one or more word characters (letters, digits, or underscores).
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly, after you remove the Nos and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the state in $3, and the name in $4.
Each capturing group works as follows:
(#[\w]+?\s)
This matches an # sign followed by at least one but as few characters as possible until a space.
((?:(?:[\w]+?)\s){1,2})
This matches and captures one or two words, which should be the state. This only works because of the next piece, which MUST have two words.
((?:[\w]+?\s){2})
This matches and captures exactly two words, where each word is the shortest possible run of word characters followed by a space.
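In Python the same pattern could be tried out like this. The cleaned string is a hypothetical example of what the input looks like once the NOs and dashes are gone. Every word has to be followed by whitespace, so the last name needs a trailing space (or newline); without it the final entry would be split incorrectly.

```python
import re

master = re.compile(r"((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))")

# Hypothetical cleaned input; note the trailing space after the last name.
cleaned = "#markwarner VIRGINIA Mark Warner #jimdemint SOUTH CAROLINA Jim DeMint "

# findall returns one (full, handle, state, name) tuple per entry;
# each captured piece still carries its trailing space, so strip it.
rows = [(h.strip(), s.strip(), n.strip()) for _, h, s, n in master.findall(cleaned)]

for handle, state, name in rows:
    print(handle, "|", state, "|", name)
```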
text=re.sub(' (NO|-+)(?= |$)','',text)
And to capture everything:
re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))',text)
Or all at once:
re.findall(r'(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))',text)
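A sketch of the two-step version run on a few of the sample lines. It assumes one entry per line, which is why re.M is passed so that $ matches at the end of every line; that flag is my assumption and does not appear in the snippets above.

```python
import re

text = """#senatorleahy VERMONT - Patrick Leahy NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee"""

# Step 1: drop " NO" and runs of dashes that are followed by a space
# or by the end of a line.
cleaned = re.sub(r" (NO|-+)(?= |$)", "", text, flags=re.M)

# Step 2: capture the handle, the upper-case state, and the rest of the
# line as the name.
rows = re.findall(r"(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))", cleaned, flags=re.M)

for handle, state, name in rows:
    print(handle, "|", state, "|", name)
```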