I have a list of strings (sentences) that might contain one or more Dutch city names. I also have a list of Dutch cities, and their various spellings. I am currently working in Python, but a solution in another language would also work.
What would be the best and most efficient way to retrieve a list of cities mentioned in the sentences?
What I do at the moment is loop through the sentence list, and then within that loop, loop through the cities list and one by one check if
place_name in sentence.lower(), so I have:
for sentence in sentences:
    for place_name in place_names:
        if place_name in sentence.lower():
            places[place_name] = places[place_name] + 1
Is this the most efficient way to do this? I also run into the problem that cities like "Ee" exist in Holland, and that words with "ee" in them are quite common. For now I solved this by checking if place_name + ' ' in sentence.lower(), but this is of course suboptimal and ugly: it disregards sentences like "Huis in Amsterdam", which doesn't end with a space, and it doesn't handle punctuation well either. I tried using regex, but that is way too slow. Is there a better way to solve this particular problem, or to solve the problem in general? I am leaning somewhat toward an NLP solution, but I also feel like that would be massive overkill.
You may look into Named Entity Recognition solutions in general. This can be done in nltk as well, but here is a sample in spaCy; cities would be marked with GPE labels (GPE stands for "Geopolitical Entity": countries, states, cities, etc.):
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(u'Some company is looking at buying an Amsterdam startup for $1 billion')
for ent in doc.ents:
    print(ent.text, ent.label_)
Prints:
Amsterdam GPE
$1 billion MONEY
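If full NER feels like overkill for a fixed list of known city names, a lighter-weight alternative is a single precompiled alternation with word boundaries. This is only a sketch under the question's assumptions (place_names holds lowercase spellings, sentences is the list of sentences); it sidesteps the "Ee" substring issue and scans each sentence once instead of once per city:
import re
from collections import Counter

# Sketch: one compiled pattern for all spellings; \b stops "Ee" from matching inside "een".
# Sorting longer names first makes the alternation prefer a long name over a shorter prefix.
sorted_names = sorted(place_names, key=len, reverse=True)
city_re = re.compile(r'\b(' + '|'.join(map(re.escape, sorted_names)) + r')\b')

places = Counter()
for sentence in sentences:
    # set() counts each city once per sentence, matching the original loop's behaviour
    for name in set(city_re.findall(sentence.lower())):
        places[name] += 1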
Let me try to explain my issue with an example. I have a large corpus and a substring like the ones below:
corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""
substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""
The substring and the corpus are very similar, but not exactly the same. If I do something like:
import re
re.search(substring, corpus, flags=re.I)  # this fails because the substring is similar but not exact
In the corpus the matching passage looks like the text below, which is a bit different from the substring I have, and because of that the regular expression search is failing. Can someone suggest a really good alternative for similar-substring lookup?
until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now
I did try the difflib library, but it did not satisfy my use case.
Some background information: the substring I have was obtained some time ago from the pre-processed corpus using re.sub("[^a-zA-Z]", " ", corpus).
Now I need to use that substring to do a reverse lookup in the corpus text and find the start and end indices in the corpus.
You don't actually need to fuzzy match all that much, at least for the example given; the text can only change at the spaces within substring, and it can only change by adding at least one non-alphabetic character (which can replace a space, but a space can't be deleted without a replacement). This means you can construct a regex directly from substring with wildcards between the words, search (or finditer) the corpus for it, and the resulting match object will tell you where the match(es) begin and end:
import re
# Allow any character between whitespace-separated "words" except ASCII
# alphabetic characters
ssre = re.compile(r'[^a-z]+'.join(substring.split()), re.IGNORECASE)
if m := ssre.search(corpus):
    print(m.start(), m.end())
    print(repr(m.group(0)))
which correctly identifies where the match began (index 217) and ended (index 771) in corpus; .group(0) can directly extract the matching text for you if you prefer (it's uncommon to need the indices, so there's a decent chance you were asking for them solely to extract the real text, and .group(0) does that directly). The output is:
217 771
"until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now"
If spaces might be deleted without being replaced, just change the + quantifier to * (the regex will run a little slower since it can't short-circuit as easily, but would still work, and should run fast enough).
If you need to handle non-ASCII alphabetic characters, the regex joiner can change from r'[^a-z]+' to r'[\W\d_]+' (which means "match runs of non-word characters [anything that is not alphanumeric or underscore], plus digits and underscores", i.e. anything that is not a letter); it's a little more awkward to read, but it handles characters like é properly (treating them as part of a word rather than as connector characters).
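For concreteness, a minimal sketch of that variant (only the joiner changes from the code above; re.escape is an optional safety net I've added in case the substring ever contains regex metacharacters):
import re

# Unicode-friendly joiner: any run of characters that are not letters
ssre_unicode = re.compile(r'[\W\d_]+'.join(re.escape(w) for w in substring.split()),
                          re.IGNORECASE)
if m := ssre_unicode.search(corpus):
    print(m.start(), m.end())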
While it's not going to be as flexible as difflib, when you know no words are removed or added (it's just a matter of spacing and punctuation) this works perfectly, and it should run significantly faster than a true fuzzy-matching solution (which has to do far more work to handle the concept of close matches).
While you cannot find an exact match if the strings differ by even one character, you can find similar strings.
So here I made use of the built-in difflib.SequenceMatcher in order to check the similarity of two differing strings.
If you need the indices of where the substring starts within the corpus, that could easily be added. If you have any questions, please comment.
Hope it helps. (Adapted to your edited question; implements @ShadowRanger's note.)
import re
from difflib import SequenceMatcher


def similarity(a, b) -> float:
    """Return the similarity ratio between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def find_similar_match(a, b, threshold=0.7) -> list:
    """Find string b in a, even though the strings differ slightly."""
    corpus_lst = a.split()
    substring_lst = b.split()
    nonalpha = re.compile(r"[^a-zA-Z]")
    # candidate start/end word positions: corpus words equal to the first/last substring word
    start_indices = [i for i, x in enumerate(corpus_lst)
                     if nonalpha.sub("", x) == substring_lst[0]]
    end_indices = [i for i, x in enumerate(corpus_lst)
                   if nonalpha.sub("", x) == substring_lst[-1]]
    rebuild_substring = " ".join(substring_lst)
    max_sim = 0
    result = []
    for start_idx in start_indices:
        for end_idx in end_indices:
            corpus_search_string = " ".join(corpus_lst[start_idx: end_idx])
            sim = similarity(corpus_search_string, rebuild_substring)
            if sim > max_sim:
                max_sim = sim
                result = [start_idx, end_idx]
    if max_sim >= threshold:
        print(f"Found a match with similarity : {max_sim}")
    return result
The result of calling find_similar_match(corpus, substring):
Found a match with similarity : 0.8429752066115702
[38, 156]
Not exactly the best solution but this might help.
from difflib import SequenceMatcher

match = SequenceMatcher(None, corpus, substring).find_longest_match(0, len(corpus), 0, len(substring))
print(match)
print(corpus[match.a:match.a + match.size])
print(substring[match.b:match.b + match.size])
This may help you to visualise the similarity of the two strings based on the
percentage of words in the corpus that are in the substring.
The code below aims to:
use the substring as a bag of words
find these words in the corpus (and, if found, make them uppercase)
display the modifications in the corpus
calculate the percentage of modified words in the corpus
show the number of words in the substring that were not in the corpus
This way you can see which of the substring words were matched in the corpus, and then identify the percentage similarity by word (but not necessarily in the right order).
Code:
import re
corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""
substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""
sub_list = set(substring.split(" "))
unused_words = []
for word in sub_list:
    if word in corpus:
        r = r"\b" + word + r"\b"
        ru = f"{word.upper()}"
        corpus = re.sub(r, ru, corpus)
    else:
        unused_words.append(word)
print(corpus)
lower_strings = len(re.findall("[a-z']+", corpus))
upper_strings = len(re.findall("[A-Z']+", corpus))
print(f"\nWords Matched = {(upper_strings)/(upper_strings + lower_strings)*100:.1f}%")
print(f"Unused Substring words: {len(unused_words)}")
Output:
very quick service, polite workers(cory, I think THAT'S his name), I
basically just drove there AND got A quote(which SEEMS TO be very fair
priced), THEN DROPPED OFF MY CAR 4 days later(because THEY WERE fully
booked UNTIL THEN), THEN I DROPPED OFF MY CAR ON MY APPOINTMENT DAY, THEN
THE SAME DAY THE SHOP CALLED ME AND NOTIFIED ME THAT THE THE JOB IS DONE I
CAN GO PICKUP MY CAR. WHEN I GO CHECKED OUT MY CAR I WAS AMAZED BY THE JOB
THEY'VE DONE TO IT, AND THEY EVEN GAVE THAT DIRTY CAR A WASH( PROB EVEN
WAXED IT OR COATED IT, CUZ IT WAS SHINY AS HELL), TIRES SHINE, MATS WERE
VACUUMED TOO. I GAVE THEM A DIRTY, BROKEN CAR, THEY GAVE ME BACK A WHAT
SEEMS LIKE A BRAND NEW CAR. I'M HAPPY WITH THE RESULT, AND I WILL DEF HAVE
ALL MY CAR'S WORK DONE BY THIS PLACE FROM NOW.
Words Matched = 82.1%
Unused Substring words: 0
I've been searching for this for a long time, and most of the materials I've found were about named entity recognition. I'm running topic modeling, but in my data there are too many names in the texts.
Is there any Python library which contains (English) names of people? Or if not, what would be a good way to remove names of people from each document in the corpus?
Here's a simple example:
texts=['Melissa\'s home was clean and spacious. I would love to visit again soon.','Kevin was nice and Kevin\'s home had a huge parking spaces.']
I would suggest using a tokenizer with some capability to recognize and differentiate proper nouns. spacy is quite versatile and its default tokenizer does a decent job of this.
There are hazards to using a list of names as if they're stop words - let me illustrate:
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

texts = ["Melissa's home was clean and spacious. I would love to visit again soon.",
         "Kevin was nice and Kevin's home had a huge parking spaces.",
         "Bill sold a work of art to Art and gave him a bill"]

tokenList = []
for i, sentence in enumerate(texts):
    doc = nlp(sentence)
    for token in doc:
        tokenList.append([i, token.text, token.lemma_, token.pos_, token.tag_, token.dep_])

tokenDF = pd.DataFrame(tokenList, columns=["i", "text", "lemma", "POS", "tag", "dep"]).set_index("i")
So the first two sentences are easy, and spacy identifies the proper nouns "PROPN":
Now, the third sentence has been constructed to show the issue: lots of people have names that are also things. spacy's default tokenizer isn't perfect, but it does a respectable job with the two sides of the task: don't remove names when they are being used as regular words (e.g. bill of goods, work of art), and do identify them when they are being used as names. (You can see that it messed up one of the references to Art, the person.)
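If the end goal is to actually drop the names, here is a minimal sketch built on the same tokenization; whether to filter on pos_ == "PROPN", on ent_type_ == "PERSON", or on both is a judgment call that depends on your data, and leftover possessive clitics ("'s") may still need separate handling:
def remove_person_names(text, nlp):
    """Rebuild the text, dropping tokens spacy tags as PERSON entities."""
    doc = nlp(text)
    kept = [tok.text_with_ws for tok in doc if tok.ent_type_ != "PERSON"]
    return "".join(kept)

cleaned = [remove_person_names(t, nlp) for t in texts]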
Not sure if this solution is efficient and robust but it's simple to understand (to me at the very least):
import re

# get a list of existing names (over 18,000) from the file
with open('names.txt', 'r') as f:
    NAMES = set(f.read().splitlines())

# your list of texts
texts = ["Melissa's home was clean and spacious. I would love to visit again soon.",
         "Kevin was nice and Kevin's home had a huge parking spaces."]

# join the texts into one string
texts = ' | '.join(texts)

# find all the words that look like names
pattern = r"(\b[A-Z][a-z]+('s)?\b)"
found_names = re.findall(pattern, texts)

# strip the possessive forms and remove duplicates
found_names = set([name[0].replace("'s", "") for name in found_names])

# remove all the words that look like names but are not included in NAMES
found_names = [name for name in found_names if name in NAMES]

# loop through the found names and remove every name from the texts
for name in found_names:
    texts = re.sub(name + "('s)?", "", texts)  # also removes possessive forms

# split the texts back into a list
texts = texts.split(' | ')

print(texts)
Output:
[' home was clean and spacious. I would love to visit again soon.',
' was nice and home had a huge parking spaces.']
The list of names was obtained here: https://www.usna.edu/Users/cs/roche/courses/s15si335/proj1/files.php%3Ff=names.txt.html
And I completely endorse @James_SO's recommendation to use smarter tools.
I have a huge number of names from different sources.
I need to extract all the groups (parts of the names) that repeat from one name to another.
In the example below the program should locate: Post, Office, Post Office.
I also need a popularity count.
So I want to extract a list of phrases sorted by popularity.
Here is an example of names:
Post Office - High Littleton
Post Office Pilton Outreach Services
Town Street Post Office
post office St Thomas
Basically I need to find an algorithm, or better a library, to get results like these:
Post Office: 16999
Post: 17934
Office: 16999
Tesco: 7300
...
Here is the full example of names.
I wrote some code which is fine for single words, but not for phrases:
from textblob import TextBlob
import operator

title_file = open("names.txt", 'r')
blob = TextBlob(title_file.read())
counts = sorted(blob.word_counts.items(), key=operator.itemgetter(1))
print(counts)
You are not looking for clustering (and that is probably why "all of them suck" for @andrewmatte).
What you are looking for is word counting (or more precisely, n-gram counting), which is actually a much easier problem. That is why you won't find a dedicated library for it...
Well, actually you do have some libraries. In Python, for example, the collections module has the Counter class, which covers much of the reusable code.
An untested, very basic piece of code:
from collections import Counter

counter = Counter()
for s in sentences:
    words = s.split(" ")
    for i in range(len(words)):
        counter[words[i]] += 1                    # count single words
        if i > 0:
            counter[(words[i-1], words[i])] += 1  # count adjacent word pairs (bigrams)
You can get the most frequent entries from the counter (for example via most_common). If you want words and word pairs separate, feel free to use two counters. If you need longer phrases, add an inner loop. You may also want to clean the sentences (e.g. lowercase them) and use a regexp for splitting.
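For the sorted-by-popularity list asked for in the question, a small usage sketch on top of that counter (assuming the counter built above, where single words are string keys and pairs are tuple keys):
# print the 20 most frequent words / word pairs, most popular first
for ngram, count in counter.most_common(20):
    phrase = ngram if isinstance(ngram, str) else " ".join(ngram)
    print(f"{phrase}: {count}")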
Are you looking for something like this?
workspace = {}
with open('names.txt', 'r') as f:
    for name in f:
        name = name.strip()
        if len(name):  # makes sure the line isn't empty
            if name in workspace:
                workspace[name] += 1
            else:
                workspace[name] = 1

for name in workspace:
    print("{}: {}".format(name, workspace[name]))
I have a task to search for a group of specific terms (around 138,000 terms) in a table made of 4 columns and 187,000 rows. The column headers are id, title, scientific_title and synonyms, and each column might contain more than one term.
I should end up with a CSV table with the id where a term has been found and the term itself. What would be the best and fastest way to do this?
In my script, I tried creating phrases by iterating over the different words in a term in order and comparing each word with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
    for place_length in range(len(sentence_array)):
        last_element = place_length + 1
        phrase = ' '.join(sentence_array[0:last_element])
        if phrase in literalhash:
            final_dict.setdefault(trial_id, [])
            if phrase not in final_dict[trial_id]:
                final_dict[trial_id].append(phrase)
How should I be doing this?
The code on the website you link to is case-sensitive: it will only work when the terms in tumorabs.txt and neocl.xml are in exactly the same case. If you can't change your data, then change the code as follows.
After:
for line in text:
add:
line = line.lower()
(this is indented four spaces)
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
To clarify the problem: we are running a small scientific project where we need to extract all text parts containing particular keywords. We have used the coded dictionary and Python script posted at http://www.julesberman.info/coded.htm, but it seems that something is not working properly.
For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! We are a biologist and a medical doctor, with only a little bit of knowledge of Python!
If you need some more code, I can post it online.
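As a minimal illustration of the case-sensitivity point, using the title above (the keyword string here stands in for one of the terms in the coded dictionary):
title = ("A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate "
         "on Ischemic Heart Disease After Drug-eluting Stent Implantation in "
         "Patients With Diabetes Mellitus or Renal Impairment")
keyword = "heart disease"

print(keyword in title)          # False: the cases differ, so the match fails
print(keyword in title.lower())  # True: lowercasing the text first makes the match work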
I use this code to split some data into a list with three sublists, splitting where there is a * or a -. But it also keeps the \n\n * sequences, and I don't know why.
I don't want those in the result. Can someone tell me what I'm doing wrong?
This is the data:
*Quote of the Day
-Education is the ability to listen to almost anything without losing your temper or your self-confidence - Robert Frost
-Education is what survives when what has been learned has been forgotten - B. F. Skinner
*Fact of the Day
-Fractals, an important part of chaos theory, are very useful in studying a huge amount of areas. They are present throughout nature, and so can be used to help predict many things in nature. They can also help simulate nature, as in graphics design for movies (animating clouds etc), or predict the actions of nature.
-According to a recent survey by Just-Eat, not everyone in The United Kingdom actually knows what the Scottish delicacy, haggis is. Of the 1,623 British people polled:\n\n * 18% of Brits thought haggis was some sort of Scottish animal.\n\n * 15% thought it was a Scottish musical instrument.\n\n * 4% thought it was a character from Harry Potter.\n\n * 41% didn't even know what Scotland's national dish was.\n\nWhile a small number of Scots admitted not knowing what haggis was either, they also discovered that 68% of Scots would like to see Haggis delivered as takeaway.
-With the growing concerns involving Facebook and its ever changing privacy settings, a few software developers have now engineered a website that allows users to trawl through the status updates of anyone who does not have the correct privacy settings to prevent it.\n\nNamed Openbook, the ultimate aim of the site is to further expose the problems with Facebook and its privacy settings to the general public, and show people just how easy it is to access this type of information about complete strangers. The site works as a search engine so it is easy to search terms such as 'don't tell anyone' or 'I hate my boss', and searches can also be narrowed down by gender.
*Pet of the Day
-Scottish Terrier
-Land Shark
-Hamster
-Tse Tse Fly
END
I use this code:
contents = open("data.dat").read()
data = contents.split('*') #split the data at the '*'
newlist = [item.split("-") for item in data if item]
but the list it produces is not quite the one I want.
The "\n\n" is part of the input data, so it's preserved in python. Just add a strip() to remove it:
finallist = [item.strip() for item in newlist]
See the strip() docs: http://docs.python.org/library/stdtypes.html#str.strip
UPDATED FROM COMMENT:
finallist = [[subitem.replace("\\n", "\n").strip() for subitem in item] for item in newlist]
open("data.dat").read() - reads all symbols in file, not only those you want.
If you don't need '\n' you can try content.replace("\n",""), or read lines (not whole content), and truncate the last symbol'\n' of each line.
This is going to split on any asterisk you have in the text as well.
A better implementation would be to do something like:
lines = []
for line in open("data.dat"):
    if line.lstrip().startswith("*"):
        lines.append([line.strip()])  # append a list with your line
    elif line.lstrip().startswith("-"):
        lines[-1].append(line.strip())
For more homework, research what's happening when you use the open() function in this way.
The following solves your problem, I believe:
result = [[subitem.replace(r'\n\n', '\n') for subitem in item.split('\n-')]
          for item in open('data.dat').read().split('\n*')]

# now let's pretty-print the result
for i in result:
    print('***', i[0], '***')
    for j in i[1:]:
        print('\t--', j)
    print()
Note that I split on a newline followed by * or -, so it won't split on dashes inside the text. Also, I replace the literal character sequence \n\n (r'\n\n') with a real newline character '\n'. The one-liner is a list comprehension, a way to construct lists in one go without multiple .append() calls or + concatenation.