Geocode an address written in a native language using English letters - python

Friends,
I am analyzing some texts. My requirement is to geocode addresses that are written in English letters but in a different native language.
Ex: chandpur market ke paas, village gorthaniya, UP, INDIA
In the above sentence, "ke paas" is a Hindi phrase (Hindi is India's national language) meaning "near" in English, and "chandpur market" is a proper noun (it can be ignored for conversion).
My challenge is to convert thousands of such phrases to English, identify the street name, and geocode it. Unfortunately, I do not have a postal code or an exact address.
Can anyone please help here?
Thanks in advance!

Google's Geocoding API supports Hindi, so you don't necessarily have to translate the address to English. Here's an example using my googleway package (in R), specifying the language = "hi" argument.
You'll need an API key to use the Google API through googleway.
library(googleway)
set_key("your_api_key")
res <- google_geocode(address = "village gorthaniya, UP, INDIA",
                      language = "hi")
geocode_address(res)
# [1] "गोर्थानिया, उत्तर प्रदेश 272181, भारत"
geocode_coordinates(res)
# lat lng
# 1 26.85848 82.50099
geocode_address_components(res)
# long_name short_name types
# 1 गोर्थानिया गोर्थानिया locality, political
# 2 बस्ती बस्ती administrative_area_level_2, political
# 3 उत्तर प्रदेश उ॰ प्र॰ administrative_area_level_1, political
# 4 भारत IN country, political
# 5 272181 272181 postal_code
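
Since the question is tagged python, roughly the same request can be made with the googlemaps Python client instead of googleway (a minimal sketch, assuming the googlemaps package is installed and you have a valid API key):
import googlemaps

gmaps = googlemaps.Client(key="your_api_key")

# Geocode the transliterated address and ask for results in Hindi,
# mirroring the language = "hi" argument in the R example above.
results = gmaps.geocode("village gorthaniya, UP, INDIA", language="hi")

if results:
    top = results[0]
    print(top["formatted_address"])       # formatted address (in Hindi)
    print(top["geometry"]["location"])    # {'lat': ..., 'lng': ...}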

Related

How to label multi-word entities?

I'm quite new to data analysis (and Python in general), and I'm currently a bit stuck in my project.
For my NLP-task I need to create training data, i.e. find specific entities in sentences and label them. I have multiple csv files containing the entities I am trying to find, many of them consisting of multiple words. I have tokenized and lemmatized the unlabeled sentences with spaCy and loaded them into a pandas.DataFrame.
My main problem is: how do I now compare the tokenized sentences with the entity-lists and label the (often multi-word) entities? Having around 0.5 GB of sentences, I don't think it is feasible to just for-loop every sentence and then for-loop every entity in every class-list and do a simple substring-search. Is there any smart way to use pandas.Series or DataFrame to do this labeling?
As mentioned, I don't really have any experience with pandas/numpy etc., and after a lot of web searching I still haven't found an answer to my problem.
Say that this is a sample of finance.csv, one of my entity lists:
"Frontwave Credit Union",
"St. Mary's Bank",
"Center for Financial Services Innovation",
...
And that this is a sample of sport.csv, another one of my entity lists:
"Christiano Ronaldo",
"Lewis Hamilton",
...
And an example (dumb) sentence:
"Dear members of Frontwave Credit Union, any credit demanded by Lewis Hamilton is invalid, said Ronaldo"
The result I'd like would be something like a table of tokens with the matching entity labels (with IOB labeling):
"Dear "- O
"members" - O
"of" - O
"Frontwave" - B-FINANCE
"Credit" - I-FINANCE
"Union" - I-FINANCE
"," - O
"any" - O
...
"Lewis" - B-SPORT
"Hamilton" - I-SPORT
...
"said" - O
"Ronaldo" - O
Use:
import pandas as pd

FINANCE = ["Frontwave Credit Union",
           "St. Mary's Bank",
           "Center for Financial Services Innovation"]
SPORT = ["Christiano Ronaldo",
         "Lewis Hamilton"]

# Turn the entity list into a single regex alternation
FINANCE = '|'.join(FINANCE)

sent = pd.DataFrame({'sent': ["Dear members of Frontwave Credit Union, any credit demanded by Lewis Hamilton is invalid, said Ronaldo"]})

# Find every occurrence of a finance entity in each sentence
home = sent['sent'].str.extractall(f'({FINANCE})')

def labeler(row, group):
    # First token of a matched entity gets B-, the remaining tokens get I-
    l = len(row.split())
    return [f'I-{group}' if i != 0 else f'B-{group}' for i in range(l)]

home[0].apply(labeler, group='FINANCE').explode()
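The same chain can be reused for the other entity list, for example (a sketch; the away variable name is just for illustration):
SPORT = '|'.join(SPORT)
away = sent['sent'].str.extractall(f'({SPORT})')
away[0].apply(labeler, group='SPORT').explode()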

Spacy PhraseMatcher - match not one of the keywords but ALL keywords within string

I am trying to solve the task of classifying texts into buckets based on keywords. It is fairly easy when I need to match the text against one or several keywords (i.e. at least one of the keywords must appear in the text), but I have trouble understanding how to do the matching when I need to ensure that several keywords all exist within the string.
Below is a small sample. Let's say that my dfArticles is a pandas dataframe which has a column Text with the text articles I am trying to match:
dfArticles['Text']
Out[2]:
0 (Reuters) - Major Middle Eastern markets ended...
1 MIDEAST STOCKS-Oil price fall hurts major Gulf...
2 DUBAI, 21st September, 2020 (WAM) -- The Minis...
3 DUBAI, (UrduPoint / Pakistan Point News / WAM ...
4 Brent crude was down 99 cents or 2.1% at $42.2.
Let's also say that my dataframe dfTopics holds a list of keywords I am trying to match against and buckets associated with keywords:
dfTopics
Out[3]:
Topic Keywords
0 Regulations law
1 Regulations regulatory
2 Regulations regulation
3 Regulations legislation
4 Regulations rules
5 Talent capability
6 Talent workforce
When I just need to check whether the text matches one of these keywords, it is simple:
def prep_match_patterns(dfTopics):
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    for topic in dfTopics['Topic'].unique():
        keywords = dfTopics.loc[dfTopics['Topic'] == topic, 'Keywords'].to_list()
        patterns_topic = [nlp.make_doc(text) for text in keywords]
        matcher.add(topic, None, *patterns_topic)
    return matcher
Then I can easily check in one shot which buckets the text falls into:
nlp = spacy.load("en_core_web_lg")
nlp.disable_pipes(["parser"])
# extract the sentences from the documents
nlp.add_pipe(nlp.create_pipe('sentencizer'))
matcher = prep_match_patterns(dfTopics)

dfResults = pd.DataFrame([], columns=['ArticleID', 'Topic'])
articles = []
topics = []
for index, row in tqdm(dfArticles.iterrows(), total=len(dfArticles)):
    doc = nlp(row['Text'])
    matches = matcher(doc)
    if len(matches) < 1:
        continue
    else:
        for match_id, start, end in matches:
            string_id = nlp.vocab.strings[match_id]  # Get string representation
            articles.append(row['ID'])
            topics.append(string_id)
dfResults['ArticleID'] = articles
dfResults['Topic'] = topics
dfResults.drop_duplicates(inplace=True)
But now the trick is that sometimes, to classify a text into a bucket, I need to ensure it matches several keywords at the same time.
Let's say I have a new topic called "Healthcare system context", and for a text to fall into this bucket it needs to contain all three substrings: "fragmentation", "approval process" and "drug". Order doesn't matter, but all three keywords need to be there. Is there any way to do this with PhraseMatcher?
I think you're overcomplicating this. You can achieve what you want with plain Python.
Suppose we have:
df_topics
Topic Keywords
0 Regulations law
1 Regulations regulatory
2 Regulations regulation
3 Regulations legislation
4 Regulations rules
5 Talent capability
6 Talent workforce
Then you can organize your topic keywords into a dictionary:
topics = df_topics.groupby("Topic")["Keywords"].agg(lambda x: x.to_list()).to_dict()
topics
{'Regulations': ['law', 'regulatory', 'regulation', 'legislation', 'rules'],
'Talent': ['capability', 'workforce']}
Finally, define a func to match keywords:
def textToTopic(text, topics):
    t = []
    for k, v in topics.items():
        if all([topic in text.split() for topic in v]):
            t.append(k)
    return t
Demo:
textToTopic("law regulatory regulation rules legislation workforce", topics)
['Regulations']
textToTopic("law regulatory regulation rules legislation workforce capability", topics)
['Regulations', 'Talent']
You can apply this function to the text column of your DataFrame.
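For example, a sketch using the dfArticles column from the question (the new Topics column name is just for illustration):
dfArticles['Topics'] = dfArticles['Text'].apply(lambda t: textToTopic(t, topics))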

Object Standardization Using NLTK

I'm new to NLP and to Python.
I'm trying to use object standardization to replace abbreviations with their full meaning. I found code online and altered it to test it on a Wikipedia excerpt, but all the code does is print out the original text. Can anyone help out a newbie in need?
Here's the code:
import nltk

lookup_dict = {'EC': 'European Commission', 'EU': 'European Union', 'ECSC': 'European Coal and Steel Community',
               'EEC': 'European Economic Community'}

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    print(new_text)
    return new_text

_lookup_words(
    "The High Authority was the supranational administrative executive of the new European Coal and Steel Community ECSC. It took office first on 10 August 1952 in Luxembourg. In 1958, the Treaties of Rome had established two new communities alongside the ECSC: the eec and the European Atomic Energy Community (Euratom). However their executives were called Commissions rather than High Authorities")
Thanks in advance, any help is appreciated!
In your case, the lookup dict does contain abbreviations (EC, ECSC) that occur in your input sentence. However, calling split() only splits the input on whitespace, so your sentence yields the tokens "ECSC." and "ECSC:" (with punctuation attached) rather than "ECSC", and the lookup therefore never matches. I would suggest stripping punctuation from the tokens and running it again.
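A minimal sketch of that depunctuation (note that the dict keys are also lowercased here so that word.lower() can match them, which is an extra adjustment beyond the answer's point):
import string

# Lowercased copy of the lookup table so that word.lower() can match its keys
lookup_lower = {k.lower(): v for k, v in lookup_dict.items()}

def _lookup_words(input_text):
    new_words = []
    for word in input_text.split():
        # Strip leading/trailing punctuation such as '.' and ':' before the lookup
        key = word.strip(string.punctuation).lower()
        new_words.append(lookup_lower.get(key, word))
    return " ".join(new_words)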

Python regex solution for extracting German address format

I'm trying hard to write Python regex code for extracting German addresses like the ones shown below.
Abc Gmbh Ensisheimer Straße 6-8 79346 Endingen
Def Gmbh Keltenstr . 16 77971 Kippenheim Deutschland
Ghi Deutschland Gmbh 53169 Bonn
Jkl Gmbh Ensisheimer Str . 6 -8 79346 Endingen
I wrote the code below to extract the individual address components and also put them together in a single regex, but it is still unable to detect the above addresses. Can anyone please help me with it?
# TEST COMPANY NAME
string = 'Telekom Deutschland Gmbh 53169 Bonn Datum'
result = re.findall(r'([a-zA-Zäöüß]+\s*?[A-Za-zäöüß]+\s*?[A-Za-zäöüß]?)',string,re.MULTILINE)
print(result)
# TEST STREET NAME
result = re.findall(r'([a-zA-Zäöüß]+\s*\.)',string)
print(result)
# TEST STREET NUMBER
result = re.findall(r'(\d{1,3}\s*[a-zA-Z]?[+|-]?\s*[\d{1,3}]?)',string)
print(result)
# TEST POSTAL CODE
result = re.findall(r'(\d{5})',string)
print(result)
# TEST CITY NAME
result = re.findall(r'([A-Za-z]+)?',string)
print(result)
# TEST COMBINED ADDRESS COMPONENTS GROUP
result = re.findall(r'([a-zA-Zäöüß]+\s+?[A-Za-zäöüß]+\s+?[A-Za-zäöüß]+\s+([a-zA-Zäöüß]+\s*\.)+?\s+(\d{1,3}\s*[a-zA-Z]?[+|-]?\s*[\d{1,3}]?)+\s+(\d{5})+\s+([A-Za-z]+))',string)
print(result)
Please note that my objective is that, if any of these addresses are present in a huge paragraph of text, the regex should extract and print only the addresses.
I would opt against a regex solution and use libpostal instead; it has bindings for several other languages (in your case, for Python, use the postal package). You will have to install libpostal separately, since it includes about 1.8 GB of training data.
The good thing is that you can give it address parts in any order, and it will pick out the right parts most of the time.
It uses machine learning, trained on OpenStreetMap data in many languages.
For the examples given, it does not necessarily require you to cut the company name and country from the string:
from postal.parser import parse_address
parse_address('Telekom Deutschland Gmbh 53169 Bonn Datum')
[('telekom deutschland gmbh', 'house'),
('53169', 'postcode'),
('bonn', 'city'),
('datum', 'house')]
parse_address('Keltenstr . 16 77971 Kippenheim')
[('keltenstr', 'road'),
('16', 'house_number'),
('77971', 'postcode'),
('kippenheim', 'city')]
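
If you then want just the street-level address fields, here is a small sketch that keeps selected components from the (value, label) tuples parse_address returns (the set of labels kept here is an assumption; adjust it to your needs):
from postal.parser import parse_address

def extract_address(text):
    # Keep only the address-like components from libpostal's output
    wanted = {'road', 'house_number', 'postcode', 'city'}
    return {label: value for value, label in parse_address(text) if label in wanted}

extract_address('Def Gmbh Keltenstr . 16 77971 Kippenheim Deutschland')
# e.g. {'road': 'keltenstr', 'house_number': '16', 'postcode': '77971', 'city': 'kippenheim'}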

How to split text into paragraphs using NLTK nltk.tokenize.texttiling?

I found this question, "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed a text into texttiling; however, I am unable to actually get back a text tokenized by paragraph / topic change as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html.
When I feed my text into texttiling, I get the same untokenized text back, but as a list, which is of no use to me.
tt = nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)
tiles = tt.tokenize(text) # same text returned
What I have are emails that follow this basic structure
From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL
If we call this email string s, it would look like
s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"
What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?
*** Not all emails follow this same structure or have the same wording, so I can't use regular expressions.
What about using splitlines? Or do you have to use the nltk package?
email = """ From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""
y = [s.strip() for s in email.splitlines()]
print(y)
What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?
The texttiling algorithm {1,4,5} isn't designed to perform sequential text classification {2,3} (which is the task you described). Instead, from http://people.ischool.berkeley.edu/~hearst/research/tiling.html:
TextTiling is [an unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.
References:
{1} Marti A. Hearst, Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.
{2} Lee, J.Y. and Dernoncourt, F., 2016. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 515-520). https://www.aclweb.org/anthology/N16-1062.pdf
{3} Dernoncourt, Franck, Ji Young Lee, and Peter Szolovits. "Neural Networks for Joint Sentence Classification in Medical Paper Abstracts." In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 694-700. 2017. https://www.aclweb.org/anthology/E17-2110.pdf
{4} Hearst, M., TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23 (1), pp. 33-64, March 1997.
{5} Pevzner, L. and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28 (1), March 2002, pp. 19-36.
