In my list comprehension, I am trying to remove occurrences of "RT #tiktoksaudi2:" from the strings in a list if it is present, yet I get an empty list back even though that substring doesn't appear anywhere in the list.
test=["The way to a man's heart is through his stomach. - Sarah Willis Parton", 'A first rate soup is better than a second rate painting. - Abraham Maslow', 'A good cook is like a sorceress who dispenses happiness. - Elsa Schiaparelli', "A man's palate can, in time, become accustomed to anything. - Napoleon Bonaparte", 'Savory seasonings stimulate the appetite. - Latin Proverb', 'Cooking is an observation-based process that you can’t do if you’re so completely focused on a recipe. - Alton Brown', 'Happiness is finding three olives in your martini when you’re hungry. - Johnny Carson', 'A good meal makes a man feel more charitable toward the world than any sermon. - Arthur Pendenys', 'Wine and cheese are ageless companions, like aspirin and aches, or June and moon, or good people and noble ventures. - M. F. K. Fisher', 'For hunger is a sauce, well blended and prepared, for any food. - Chrétien de Troyes', "Without my morning coffee I'm just like a dried up piece of roast goat. - Johann Sebastian Bach", 'You know how I feel about tacos. It’s the only food shaped like a smile. A beef smile. - Earl Hickey']
print(test)
sh3rtext=[text.replace("RT #tiktoksaudi2:","") for text in test if "RT #tiktoksaudi2:" in text]
print(sh3rtext)
The if clause in a list comprehension filters out any items that don't match the condition; i.e., in your example, any items that don't contain "RT #tiktoksaudi2:" (which is all of them). Just leave out the if "RT #tiktoksaudi2:" in text and do the replace call on all elements (replace does nothing and returns the original string when an element doesn't contain your substring), and you'll get the entire list back.
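For example, dropping the filter gives every element back, with the substring stripped wherever it occurs:
sh3rtext = [text.replace("RT #tiktoksaudi2:", "") for text in test]
print(sh3rtext)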
I decided I wanted to take a text and find how close some labels are in the text. Basically, the idea is to check whether two persons are less than 14 words apart; if they are, we say that they are related.
My naive implementation is working, but only if the person is a single word, because I iterate over words.
text = """At this moment Robert who rises at seven and works before
breakfast came in He glanced at his wife her cheek was
slightly flushed he patted it caressingly What s the
matter my dear he asked She objects to my doing nothing
and having red hair said I in an injured tone Oh of
course he can t help his hair admitted Rose It generally
crops out once in a generation said my brother So does the
nose Rudolf has got them both I must premise that I am going
perforce to rake up the very scandal which my dear Lady
Burlesdon wishes forgotten--in the year 1733 George II
sitting then on the throne peace reigning for the moment and
the King and the Prince of Wales being not yet at loggerheads
there came on a visit to the English Court a certain prince
who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
            'a Knight of the Garter', 'James', 'Lady Burlesdon']

# my naive implementation
ws = text.split()
l = len(ws)
for wi, w in enumerate(ws):
    # Skip if the word is not a person
    if w not in involved:
        continue
    # Check next x words for any involved person
    x = 14
    for i in range(wi + 1, wi + x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if ws[i] not in involved:
            continue
        # Print related
        print(ws[wi], ws[i])
Now I would like to upgrade this script to allow for multi-word names such as 'Lady Burlesdon'. I am not entirely sure of the best way to proceed. Any hints are welcome.
You could first preprocess your text so that all the names in the text are replaced with single-word ids. The ids would have to be strings that you would not expect to appear as ordinary words in the text. As you preprocess the text, you could keep a mapping of ids to names so you know which name corresponds to which id. This would allow you to keep your current algorithm as is.
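A minimal sketch of that idea (the PERSON<n> tokens are just an assumed placeholder format; anything guaranteed not to collide with real words will do):
ids = {}
processed = text
# Replace longer names first, so a name that contains another name
# (e.g. 'Rudolf the Third' vs. a bare 'Rudolf') is matched first.
for n, name in enumerate(sorted(involved, key=len, reverse=True)):
    token = 'PERSON{}'.format(n)
    ids[token] = name
    processed = processed.replace(name, token)

# The existing loop then runs unchanged on processed.split(), testing
# membership in ids instead of involved, and mapping back when printing:
ws = processed.split()
l = len(ws)
for wi, w in enumerate(ws):
    if w not in ids:
        continue
    for i in range(wi + 1, wi + 14):
        if i >= l:
            break
        if ws[i] in ids:
            print(ids[ws[wi]], ids[ws[i]])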
What is the fastest way to remove items in a list that match substrings in a set?
For example,
the_list =
['Donald John Trump (born June 14, 1946) is an American businessman, television personality',
'and since June 2015, a candidate for the Republican nomination for President of the United States in the 2016 election.',
'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
'Trumps career',
'branding efforts',
'personal life',
'and outspoken manner have made him a celebrity.',
'Trump is a native of New York City and a son of Fred Trump, who inspired him to enter real estate development.',
'While still attending college he worked for his fathers firm',
'Elizabeth Trump & Son. Upon graduating in 1968 he joined the company',
'and in 1971 was given control, renaming the company The Trump Organization.',
'Since then he has built hotels',
'casinos',
'golf courses',
'and other properties',
'many of which bear his name. He is a major figure in the American business scene and has received prominent media exposure']
The list is actually a lot longer than this (millions of string elements), and I'd like to remove whichever elements contain any of the strings in the set, for example,
{"Donald Trump", "Trump Organization","Donald J. Trump", "D.J. Trump", "dump", "dd"}
What would be the fastest way? Is looping through the list the fastest?
The Aho-Corasick algorithm was specifically designed for exactly this task. It has the distinct advantage of a much lower time complexity, O(n+m), than nested loops, O(n*m), where n is the combined length of the strings to find and m is the length of the text being searched.
There is a good Python implementation of Aho-Corasick with accompanying explanation. There are also a couple of implementations at the Python Package Index but I've not looked at them.
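As a sketch, using the pyahocorasick package from PyPI (an assumption on my part; any Aho-Corasick implementation with a similar interface would work the same way):
import ahocorasick

bad_substrings = {"Donald Trump", "Trump Organization", "Donald J. Trump",
                  "D.J. Trump", "dump", "dd"}

automaton = ahocorasick.Automaton()
for s in bad_substrings:
    automaton.add_word(s, s)
automaton.make_automaton()

def contains_any(line):
    # iter() yields an (end_index, value) pair for every occurrence,
    # so one hit is enough to reject the line
    for _ in automaton.iter(line):
        return True
    return False

filtered = [line for line in the_list if not contains_any(line)]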
Use a list comprehension if you have your strings already in memory:
new = [line for line in the_list if not any(item in line for item in set_of_words)]
If you don't need them all in memory at once, you can use a generator expression as a more memory-efficient approach:
new = (line for line in the_list if not any(item in line for item in set_of_words))
So I have a list of words describing a particular group. For example, one group is based around pets.
The words for the example group pets, are as follows:
[pets, pet, kitten, cat, cats, kitten, puppies, puppy, dog, dogs, dog walking, begging, catnip, lol, catshit, thug life, poop, lead, leads, bones, garden, mouse, bird, hamster, hamsters, rabbits, rabbit, german shepherd, moggie, mongrel, tomcat, lolcatz, bitch, icanhazcheeseburger, bichon frise, toy dog, poodle, terrier, russell, collie, lab, labrador, persian, siamese, rescue, Celia Hammond, RSPCA, battersea dogs home, rescue home, battersea cats home, animal rescue, vets, vet, supervet, Steve Irwin, pugs, collar, worming, fleas, ginger, maine coon, smelly cat, cat people, dog person, Calvin and Hobbes, Calvin & Hobbes, cat litter, catflap, cat flap, scratching post, chew toy, squeaky toy, pets at home, cruft's, crufts, corgi, best in show, animals, Manchester dogs' home, manchester dogs home, cocker spaniel, labradoodle, spaniel, sheepdog, Himalayan, chinchilla, tabby, bobcat, ragdoll, short hair, long hair, tabby cat, calico, tabbies, looking for a good home, neutring, missing, spayed, neutered, declawing, deworming, declawed, pet insurance, pet plan, guinea pig, guinea pigs, ferret, hedgehogs, minipigs, mastiff, leonburger, great dane, four-legged friend, walkies, goldfish, terrapin, whiskas, mr dog, sheba, iams]
Now I plan on enriching this list using NLTK.
So as a start I can get the synsets of each word. If we take cats as an example, we obtain:
Synset('cat.n.01')
Synset('guy.n.01')
Synset('cat.n.03')
Synset('kat.n.01')
Synset('cat-o'-nine-tails.n.01')
Synset('caterpillar.n.02')
Synset('big_cat.n.01')
Synset('computerized_tomography.n.01')
Synset('cat.v.01')
Synset('vomit.v.01')
For this we use NLTK's WordNet: from nltk.corpus import wordnet as wn.
We can then obtain the lemmas for each synset. By simply adding these lemmas I in turn add quite a bit of noise; however, I also add some interesting words.
But what I would like to look at is noise reduction, and would appreciate any suggestions or alternate methods to the above.
One such idea I am trying is to check whether the word 'cats' appears in the synset name or definition, and to include or exclude those lemmas accordingly.
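A minimal sketch of that filtering idea, assuming NLTK and its WordNet data are installed (wn.morphy maps 'cats' back to its base form 'cat'):
from nltk.corpus import wordnet as wn

word = 'cats'
base = wn.morphy(word) or word  # 'cat'
kept = set()
for synset in wn.synsets(word):
    # Keep a synset's lemmas only if the base form appears in its name
    # or definition, which drops senses like guy.n.01 or vomit.v.01.
    name_words = synset.name().split('.')[0].split('_')
    if base in name_words or base in synset.definition().split():
        kept.update(lemma.name() for lemma in synset.lemmas())
print(kept)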
I'd propose using semantic similarity here with a variant of kNN: for each candidate word, compute the pairwise semantic similarity to all gold-standard words, keep only the k most similar gold-standard words (try different k from 5 to 100), and compute the average (or sum) of the similarities to these k words. Then use this value to discard noise candidates, either by sorting and keeping only the n best, or by cutting off at an experimentally determined threshold.
Semantic similarity can be computed on the basis of WordNet, see related question, or on the basis of vector models learned by word2vec or similar techniques, see related question again.
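A minimal sketch of the WordNet-based variant (the gold/candidate lists, k, and the 0.2 threshold are placeholders to tune experimentally):
from nltk.corpus import wordnet as wn

def similarity(w1, w2):
    # Max path similarity over all synset pairs; 0 when undefined.
    scores = [s1.path_similarity(s2) or 0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0)

def knn_score(candidate, gold_words, k=10):
    # Average similarity to the k nearest gold-standard words.
    sims = sorted((similarity(candidate, g) for g in gold_words),
                  reverse=True)[:k]
    return sum(sims) / len(sims) if sims else 0

gold = ['cat', 'dog', 'kitten', 'puppy', 'hamster', 'rabbit']
candidates = ['feline', 'terrier', 'tractor']
kept = [c for c in candidates if knn_score(c, gold, k=5) > 0.2]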
Actually, you can try this technique with all words as candidates, or with all/some words occurring in domain-specific texts. In the latter case the task is called automatic term recognition, and its methods can be used for your problem directly or as a source of candidates; search for them on Google Scholar. For an example with a short description of existing approaches and links to surveys, see this paper:
Fedorenko, D., Astrakhantsev, N., & Turdakov, D. (2013). Automatic recognition of domain-specific terms: an experimental evaluation. In SYRCoDIS (pp. 15-23).
Assuming the Western naming convention of FirstName MiddleName(s) LastName,
What would be the best way to correctly parse out the last name from a full name?
For example:
John Smith --> 'Smith'
John Maxwell Smith --> 'Smith'
John Smith Jr --> 'Smith Jr'
John van Damme --> 'van Damme'
John Smith, IV --> 'Smith, IV'
John Mark Del La Hoya --> 'Del La Hoya'
...and the countless other permutations from this.
Probably the best answer here is not to try. Names are individual and idiosyncratic and, even limiting yourself to the Western tradition, you can never be sure that you'll have thought of all the edge cases. A friend of mine legally changed his name to be a single word, and he's had a hell of a time dealing with various institutions whose procedures can't deal with this. You're in the unique position of being the one creating the software that implements a procedure, and so you have an opportunity to design something that isn't going to annoy the crap out of people with unconventional names. Think about why you need to parse out the last name to begin with, and see if there's something else you could do.
That being said, as a purely technical matter the best way would probably be to trim off specifically the strings " Jr", ", Jr", ", Jr.", "III", ", III", etc. from the end of the string containing the name, and then take everything from the last space in the string to the (new, after having removed Jr, etc.) end. This wouldn't get, say, "Del La Hoya" from your example, but you can't even really count on a human to get that. I'm making an educated guess that John Mark Del La Hoya's last name is "Del La Hoya" and not "Mark Del La Hoya" because I'm a native English speaker and I have some intuition about what Spanish last names look like; if the name were, say, "Gauthip Yeidze Ka Illunyepsi", I would have absolutely no idea whether to count that "Ka" as part of the last name, because I have no idea what language it comes from.
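A rough sketch of that heuristic (the suffix list is deliberately incomplete, and the whole approach fails on names like "Del La Hoya", as discussed above):
# Longest variants first, so ", jr" is matched before " jr".
SUFFIXES = (', jr.', ', jr', ' jr', ', iii', ' iii', ', iv', ' iv')

def naive_last_name(full_name):
    name, suffix = full_name, ''
    lowered = name.lower()
    for s in SUFFIXES:
        if lowered.endswith(s):
            # Peel the suffix off, keeping its original casing.
            suffix = name[len(name) - len(s):]
            name = name[:len(name) - len(s)]
            break
    return name.split()[-1] + suffix

print(naive_last_name('John Maxwell Smith'))  # Smith
print(naive_last_name('John Smith Jr'))       # Smith Jr
print(naive_last_name('John Smith, IV'))      # Smith, IV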
Came across a lib called "nameparser" at
https://pypi.python.org/pypi/nameparser
It handles four out of six cases above:
#!/usr/bin/env python

from nameparser import HumanName

def get_lname(somename):
    name = HumanName(somename)
    return name.last

people_names = [
    ('John Smith', 'Smith'),
    ('John Maxwell Smith', 'Smith'),
    # ('John Smith Jr', 'Smith Jr'),
    ('John van Damme', 'van Damme'),
    # ('John Smith, IV', 'Smith, IV'),
    ('John Mark Del La Hoya', 'Del La Hoya')
]

for name, target in people_names:
    print('{} --> {} <-- {}'.format(name, get_lname(name), target))
    assert get_lname(name) == target
I'm seconding Tnekutippa here: you should check out named entity recognition. It might help automate some of the process. This is, however, as noted, quite difficult. I'm not quite sure whether the Stanford NER can extract first and last names out of the box, but a machine learning approach could prove very useful for this task. The Stanford NER could be a nice starting point, or you could try to make your own classifiers and training corpora.
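To illustrate the general idea with NLTK's built-in chunker rather than the Stanford NER (this assumes the relevant NLTK data packages, such as punkt and maxent_ne_chunker, have been downloaded):
import nltk

sentence = "John Mark Del La Hoya spoke with John van Damme yesterday."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
# Yields whole PERSON spans; splitting them into first/last names is
# still up to you.
people = [' '.join(leaf[0] for leaf in subtree.leaves())
          for subtree in tree.subtrees()
          if subtree.label() == 'PERSON']
print(people)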
I have a long string (multiple paragraphs) which I need to split into a list of line strings. The determination of what makes a "line" is based on:
The number of characters in the line is less than or equal to X (where X is a fixed number of columns per line),
OR, there is a newline in the original string (which will force a new "line" to be created).
I know I can do this algorithmically, but I was wondering if Python has something that can handle this case. It's essentially word-wrapping a string.
And, by the way, the output lines must be broken on word boundaries, not character boundaries.
Here's an example of input and output:
Input:
"Within eight hours of Wilson's outburst, his Democratic opponent, former-Marine Rob Miller, had received nearly 3,000 individual contributions raising approximately $100,000, the Democratic Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a strong national defense and reining in the size of government, won a special election to the House in 2001, succeeding the late Rep. Floyd Spence, R-S.C. Wilson had worked on Spence's staff on Capitol Hill and also had served as an intern for Sen. Strom Thurmond, R-S.C."
Output:
"Within eight hours of Wilson's outburst, his"
"Democratic opponent, former-Marine Rob Miller,"
" had received nearly 3,000 individual "
"contributions raising approximately $100,000,"
" the Democratic Congressional Campaign Committee"
" said."
""
"Wilson, a conservative Republican who promotes a "
"strong national defense and reining in the size "
"of government, won a special election to the House"
" in 2001, succeeding the late Rep. Floyd Spence, "
"R-S.C. Wilson had worked on Spence's staff on "
"Capitol Hill and also had served as an intern"
" for Sen. Strom Thurmond, R-S.C."
What you are looking for is textwrap, but that's only part of the solution, not the complete one. To take newlines into account, you need to do this:
from textwrap import wrap
'\n'.join(['\n'.join(wrap(block, width=50)) for block in text.splitlines()])
>>> print '\n'.join(['\n'.join(wrap(block, width=50)) for block in text.splitlines()])
Within eight hours of Wilson's outburst, his
Democratic opponent, former-Marine Rob Miller, had
received nearly 3,000 individual contributions
raising approximately $100,000, the Democratic
Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a
strong national defense and reining in the size of
government, won a special election to the House in
2001, succeeding the late Rep. Floyd Spence,
R-S.C. Wilson had worked on Spence's staff on
Capitol Hill and also had served as an intern for
Sen. Strom Thurmond
You probably want to use the textwrap module from the standard library:
http://docs.python.org/library/textwrap.html
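For instance, a minimal sketch wrapping one paragraph of the question's input at 50 columns:
import textwrap

paragraph = ("Within eight hours of Wilson's outburst, his Democratic "
             "opponent, former-Marine Rob Miller, had received nearly "
             "3,000 individual contributions raising approximately "
             "$100,000, the Democratic Congressional Campaign Committee "
             "said.")

for line in textwrap.wrap(paragraph, width=50):
    print(line)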