I am trying to normalize company suffixes to a common form, for example mapping Limited and Limiteed to LTD.
Here is my code:
re.sub(r"\s+?(CORPORATION|CORPORATE|CORPORATIO|CORPORATTION|CORPORATIF|CORPORATI|CORPORA|CORPORATN)", r" CORP", 'ABC CORPORATN')
I'm trying 'ABC CORPORATN' and it's not converting it to CORP. I can't see what the issue is. Any help would be great.
Edit: I have tried the other endings included in the regex and they all work except for CORPORATN (the one mentioned above).
I see that all the patterns begin with "CORPORA", so we can just do:
import re
print(re.sub(r"CORPORA\w*", "CORP", 'ABC CORPORATN'))
Output:
ABC CORP
The same goes for the variants of Limited; if they all begin with "Limit", you can do:
import re
print(re.sub(r"Limit\w+", "LTD", 'Shoe Shop Limited.'))
Output:
Shoe Shop LTD.
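As for why the original pattern misses CORPORATN: Python's re tries the alternatives from left to right and stops at the first one that matches, so CORPORA wins before CORPORATN is ever tried and the trailing TN is left behind. A minimal sketch of that behaviour and of one possible fix (a \b after the group); the helper name normalize_corp is made up for this illustration:

```python
import re

# The alternatives from the question, in their original order.
VARIANTS = ["CORPORATION", "CORPORATE", "CORPORATIO", "CORPORATTION",
            "CORPORATIF", "CORPORATI", "CORPORA", "CORPORATN"]

# Original pattern: "CORPORA" is tried before "CORPORATN" and matches first.
ORIGINAL = re.compile(r"\s+?(" + "|".join(VARIANTS) + ")")
print(ORIGINAL.sub(" CORP", "ABC CORPORATN"))  # ABC CORPTN

# Possible fix: require a word boundary after the suffix, so a short
# alternative cannot match only the front of a longer word.
ANCHORED = re.compile(r"\s+(?:" + "|".join(VARIANTS) + r")\b")

def normalize_corp(name):
    """Hypothetical helper: replace any listed variant with CORP."""
    return ANCHORED.sub(" CORP", name)

print(normalize_corp("ABC CORPORATN"))  # ABC CORP
```

Listing the longer alternatives before the shorter ones would work here too; the word boundary just avoids having to maintain that order by hand.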
I'm trying to make lists of companies from long strings.
The company names tend to be randomly dispersed through the strings, but they always have a comma and a space before the names ', ', and they always end in Inc, LLC, Corporation, or Corp.
In addition, there is always a company listed at the very beginning of the string. It goes something like:
Companies = 'Apples Inc, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, Bananas LLC, Carrots Corp, xxxx.'
I've been trying to use regex to crack this nut, but I am too inexperienced with python.
My closest attempt went like this:
r = re.compile(r' .*? Inc | .*? LLC | .*? Corporation | .*? Corp',
flags = re.I | re.X)
r.findall(Companies)
But my output is always some variation of
['Apples Inc', ', xxxxxxxxxxxxxxxxxxx, Bananas LLC', ', Carrots Corp']
When I need it to be like
['Apples Inc', 'Bananas LLC', 'Carrots Corp']
I am vexed and I humbly ask for assistance.
EDIT:
I have figured out a way to handle company names that include a comma, like Apples, Inc.
Before I run any analysis on the long string, I will have the program check whether a comma sits two characters before the Inc. and, if so, delete it.
Then I will run the program to list out the company names.
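The comma clean-up described in this edit could look roughly like the sketch below. It assumes the comma is always followed by exactly one space and then one of the four suffixes; the function name drop_comma_before_suffix and the shortened sample string are made up for the illustration.

```python
import re

SUFFIXES = ("Inc", "LLC", "Corporation", "Corp")

def drop_comma_before_suffix(text):
    """Remove a comma that is directly followed by ' <suffix>'."""
    # The lookahead leaves the space and the suffix themselves untouched.
    pattern = r",(?= (?:" + "|".join(SUFFIXES) + r")\b)"
    return re.sub(pattern, "", text)

text = "Apples, Inc., xxxx, Bananas, LLC, Carrots Corp, xxxx."
print(drop_comma_before_suffix(text))
# Apples Inc., xxxx, Bananas LLC, Carrots Corp, xxxx.
```

The commas that separate the entries are kept, because none of them is followed by a suffix.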
I think that this is a perfect example of when not to use regex. Your result can be achieved by just splitting the string based on commas and checking if the suffixes you specify exist in any of the divided segments.
For example:
paragraph = 'Apples Inc, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, Bananas LLC, Carrots Corp, xxxx.'
suffixes = ["Inc", "Corp", "Corporation", "LLC"]
companies = []
# Split the paragraph on ", "
for term in paragraph.split(", "):
    # Go through the suffixes and see if any of them occurs in the term
    for suffix in suffixes:
        if suffix in term:
            companies.append(term)
            break  # stop at the first hit so a term is not added twice
print(companies)
This code is a lot more readable and is probably a lot easier to understand than regex.
In this particular case, you can do:
>>> targets = ('Inc', 'LLC', 'Corp', 'Corporation')
>>> [x for x in Companies.split(', ') if any(x.endswith(y) for y in targets)]
['Apples Inc', 'Bananas LLC', 'Carrots Corp']
However, this does not work if there is a comma in the name or between the name and the entity type.
If you potentially have Apple, Inc. (which would be typical) you can do:
>>> Companies = 'Apples, Inc., xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, Bananas, LLC, Carrots Corp., xxxx.'
>>> targets = ('Inc', 'LLC', 'Corp', 'Corporation')
>>> re.findall(rf'([^,]+?(?:, )?(?:{"|".join(targets)})\.?)', Companies)
['Apples, Inc.', ' Bananas, LLC', ' Carrots Corp.']
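For readers who want the pattern spelled out, here is the same regex written with re.VERBOSE and a comment per piece. This is only an annotated sketch of the expression above; the longest-first ordering of the suffixes and the final strip() are additions of this sketch, not part of the original answer.

```python
import re

Companies = ('Apples, Inc., xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, '
             'Bananas, LLC, Carrots Corp., xxxx.')
targets = ('Inc', 'LLC', 'Corp', 'Corporation')

# Assumption of this sketch: try longer suffixes first, so that
# "Corporation" is not cut short at "Corp".
alternation = "|".join(sorted(targets, key=len, reverse=True))

pattern = re.compile(rf"""
    (                     # capture the whole company name
      [^,]+?              # shortest possible run of non-comma characters
      (?:,[ ])?           # optional ", " between name and entity type
      (?:{alternation})   # one of the entity types
      \.?                 # optional trailing period
    )
""", re.VERBOSE)

# strip() drops the leading space left over from the ", " separators.
names = [name.strip() for name in pattern.findall(Companies)]
print(names)  # ['Apples, Inc.', 'Bananas, LLC', 'Carrots Corp.']
```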
I have a string in the following format
test_str = '{"keywords": {"Associate Director Information Technology Services": ["Director of Technology Services"]}}'
My Regex code below
import re
matches= re.findall(r'\{(.*?)\}',test_str)
gives the output
['"keywords": {"Associate Director Information Technology Services": ["Director of Technology Services"]']
What change should I make to my regex so that it outputs only
"Director of Technology Services"
re.findall(r"\[(.*?)\]", test_str)
print(re.findall(r"\[(.*?)\]", test_str)[0])
Escape [ and ] in the pattern instead of { and }, so that the capture group grabs what is inside the square brackets.
Alternative solution using a capturing group:
import re
regex = re.compile(r"\[(.*?)\]")
test_str = '{"keywords": {"Associate Director Information Technology Services": ["Director of Technology Services"]}}'
print(regex.search(test_str).group(1))
Output:
"Director of Technology Services"
Try the code below, which parses the string as JSON instead of using a regex:
import json
test_str = '{"keywords": {"Associate Director Information Technology Services": ["Director of Technology Services"]}}'
test_str_json = json.loads(test_str)
output = test_str_json["keywords"]["Associate Director Information Technology Services"][0]
print(output)
Output:
Director of Technology Services
I have a list of properly-formatted company names, and I am trying to find when those companies appear in a document. The problem is that they are unlikely to appear in the document exactly as they do in the list. For example, Visa Inc may appear as Visa or American Airlines Group Inc may appear as American Airlines.
How would I go about iterating over the entire contents of the document and then return the properly formatted company name when a close match is found?
I have tried both fuzzywuzzy and difflib.get_close_matches, but the problem is it looks at each individual word rather than clusters of words:
from fuzzywuzzy import process
from difflib import get_close_matches
company_name = ['American Tower Inc', 'American Airlines Group Inc', 'Atlantic American Corp', 'American International Group']
text = 'American Tower is one company. American Airlines is another while there is also Atlantic American Corp but we cannot forget about American International Group Inc.'
# using fuzzywuzzy
for word in text.split():
    print('- ' + word + ', ', ', '.join(map(str, process.extractOne(word, company_name))))
# using get_close_matches
for word in text.split():
    match = get_close_matches(word, company_name, n=1, cutoff=.4)
    print(match)
I was working on a similar problem. Fuzzywuzzy internally uses difflib and both of them perform slowly on large datasets.
Chris van den Berg's pipeline converts company names into vectors of 3-grams using a TF-IDF matrix and then compares the vectors using cosine similarity.
The pipeline is quick and gives accurate results for partially matched strings too.
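To make the idea concrete, here is a small pure-Python sketch of TF-IDF over character 3-grams compared with cosine similarity. It is not Chris van den Berg's actual pipeline (which is not reproduced in this thread); the function names, the smoothed IDF formula and the space padding are all assumptions of this sketch, and it scores candidate mentions you already have rather than scanning a document for them.

```python
import math
from collections import Counter

def trigrams(text):
    """Character 3-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_idf(names):
    """Smoothed inverse document frequency of each 3-gram across the names."""
    doc_freq = Counter(g for name in names for g in set(trigrams(name)))
    n = len(names)
    return {g: math.log((1 + n) / (1 + df)) + 1 for g, df in doc_freq.items()}

def vectorize(text, idf):
    """L2-normalised TF-IDF vector as a dict; unseen 3-grams are ignored."""
    counts = Counter(g for g in trigrams(text) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())

def best_match(mention, names, idf):
    """Return the name whose 3-gram vector is closest to the mention."""
    q = vectorize(mention, idf)
    return max(names, key=lambda name: cosine(q, vectorize(name, idf)))

company_name = ['American Tower Inc', 'American Airlines Group Inc',
                'Atlantic American Corp', 'American International Group']
idf = build_idf(company_name)
for mention in ['American Tower', 'American Airlines', 'Atlantic American Corp']:
    print(mention, '->', best_match(mention, company_name, idf))
```

Because the vectors are built from 3-grams of the whole phrase, "American Airlines" is scored as a unit instead of word by word. On a large list you would precompute the name vectors once rather than on every call.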
For that type of task I use a record linkage algorithm; it will find those clusters for you with the help of ML. You will have to provide some actual examples so the algorithm can learn to label the rest of your dataset properly.
Here is some info:
https://pypi.org/project/pandas-dedupe/
Cheers,
I'm new to NLP and to Python.
I'm trying to use object standardization to replace abbreviations with their full meaning. I found code online and altered it to test it out on a Wikipedia excerpt, but all the code does is print out the original text. Can anyone help out a newbie in need?
Here's the code:
import nltk
lookup_dict = {'EC': 'European Commission', 'EU': 'European Union', "ECSC": "European Coal and Steel Community",
               "EEC": "European Economic Community"}
def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    print(new_text)
    return new_text
_lookup_words(
    "The High Authority was the supranational administrative executive of the new European Coal and Steel Community ECSC. It took office first on 10 August 1952 in Luxembourg. In 1958, the Treaties of Rome had established two new communities alongside the ECSC: the eec and the European Atomic Energy Community (Euratom). However their executives were called Commissions rather than High Authorities")
Thanks in advance, any help is appreciated!
Your lookup dict contains abbreviations such as ECSC that do appear in your input text. However, split() only splits on whitespace, so the tokens you actually get are "ECSC." and "ECSC:" rather than "ECSC", and those never match a dictionary key. I would suggest stripping the punctuation first and running it again.
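A minimal sketch of that suggestion is shown below. Note one more thing it has to change: the question's code looks up word.lower(), while the dictionary keys are upper case, so even a clean token like "eec" would never match. This sketch upper-cases the token instead. The function name expand_abbreviations and the short sample sentence are made up for the illustration.

```python
import string

lookup_dict = {'EC': 'European Commission', 'EU': 'European Union',
               'ECSC': 'European Coal and Steel Community',
               'EEC': 'European Economic Community'}

def expand_abbreviations(text, lookup):
    """Replace known abbreviations, ignoring surrounding punctuation and case."""
    expanded = []
    for token in text.split():
        # "ECSC:" -> "ECSC", "(Euratom)." -> "Euratom"
        core = token.strip(string.punctuation)
        full = lookup.get(core.upper())
        if full:
            # Swap only the abbreviation and keep the punctuation around it.
            token = token.replace(core, full, 1)
        expanded.append(token)
    return " ".join(expanded)

print(expand_abbreviations("alongside the ECSC: the eec and Euratom.", lookup_dict))
# alongside the European Coal and Steel Community: the European Economic Community and Euratom.
```

Matching case-insensitively is a design choice of this sketch; if lower-case words like "ec" can occur in your text with another meaning, compare against the token as written instead.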
I am trying to use regular expressions to find a UK postcode within a string.
I have got the regular expression working inside RegexBuddy, see below:
\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b
I have a bunch of addresses and want to grab the postcode from them, example below:
123 Some Road Name Town, City County PA23 6NH
How would I go about this in Python? I am aware of the re module for Python but I am struggling to get it working.
Cheers
Eef
Repeating your address three times with the postcodes PA23 6NH, PA2 6NH and PA2Q 6NH as a test for your pattern, and running the regex from Wikipedia alongside yours, the code is:
import re
s="123 Some Road Name\nTown, City\nCounty\nPA23 6NH\n123 Some Road Name\nTown, City"\
"County\nPA2 6NH\n123 Some Road Name\nTown, City\nCounty\nPA2Q 6NH"
# custom
print(re.findall(r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b', s))
# regex from http://en.wikipedia.org/wiki/UK_postcodes#Validation
print(re.findall(r'[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}', s))
The result is:
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
Both regexes give the same result.
Try
import re
re.findall(r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}", x)  # x is the string to search
You don't need the \b.
#!/usr/bin/env python
import re
ADDRESS = """123 Some Road Name
Town, City
County
PA23 6NH"""
reobj = re.compile(r'(\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b)')
matchobj = reobj.search(ADDRESS)
if matchobj:
    print(matchobj.group(1))
Example output:
[user@host]$ python uk_postcode.py
PA23 6NH
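Since the question mentions a bunch of addresses, the same pattern can be wrapped in a small helper and applied to each one. This is only a sketch: the function name find_postcode and the second sample address are made up, and the pattern is the one from the question, not a full postcode validator.

```python
import re

# Pattern taken from the question; compiled once and reused.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b")

def find_postcode(address):
    """Return the first postcode found in the address, or None."""
    match = POSTCODE_RE.search(address)
    return match.group(0) if match else None

addresses = [
    "123 Some Road Name Town, City County PA23 6NH",
    "No postcode on this line",  # hypothetical entry without a postcode
]
print([find_postcode(a) for a in addresses])  # ['PA23 6NH', None]
```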