Deduplicating content by removing similar rows of text in Python

I'm fairly new to Python. While I know it's possible to deduplicate rows in Pandas with drop_duplicates for identical text results, is there a way to drop similar rows of text?
E.g. for this fictional collection of online article headlines, populated in chronological order
1 "The dog ate my homework" says confused child in Banbury
2 Confused Banbury child says dog ate homework
3 Why are dogs so cute
4 Teacher in disbelief as child says dog ate homework - Banbury Times
5 Dogs don't like eggs, here's why
6 The moment a senior stray is adopted - try not to cry
7 Dog smugglers in Banbury arrested in police sting operation
My ideal outcome would be that only rows 1, 3, 5, 6 and 7 remain, with rows 1, 2 and 4 having been grouped for similarity and then only 1, the oldest/ 'first' entry, kept.
(How) could I get there? Even advice purely about the grouping approach would be very helpful. I would want to be able to run this on hundreds of rows of text, without having a specific, manually pre-determined article or headline to measure similarity against, just group similar rows.
Thank you so much for your thoughts and time!

You can try to embed your headlines with doc2vec (example of usage), then cluster the resulting vectors by cosine distance using k-medoids or hierarchical clustering.
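A hedged sketch of that pipeline, assuming gensim and SciPy are installed. Doc2Vec usually needs far more training text than a few hundred headlines to give meaningful vectors, and the 0.5 cosine-distance cut-off is an arbitrary starting point, so treat this as an outline of the approach (a TF-IDF vectorizer could be swapped in for the embedding step) rather than tuned code:

import numpy as np
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

headlines = [
    '"The dog ate my homework" says confused child in Banbury',
    'Confused Banbury child says dog ate homework',
    'Why are dogs so cute',
    'Teacher in disbelief as child says dog ate homework - Banbury Times',
    "Dogs don't like eggs, here's why",
    'The moment a senior stray is adopted - try not to cry',
    'Dog smugglers in Banbury arrested in police sting operation',
]

# 1. Embed each headline as a fixed-length vector.
tagged = [TaggedDocument(words=h.lower().split(), tags=[i]) for i, h in enumerate(headlines)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=100)
vectors = np.array([model.infer_vector(h.lower().split()) for h in headlines])

# 2. Hierarchical (agglomerative) clustering on cosine distance;
#    fcluster cuts the tree wherever clusters are closer than the threshold.
labels = fcluster(linkage(pdist(vectors, metric="cosine"), method="average"),
                  t=0.5, criterion="distance")

# 3. Rows are already in chronological order, so keeping the first row
#    of each cluster keeps the oldest headline per group.
df = pd.DataFrame({"headline": headlines, "cluster": labels})
deduped = df.drop_duplicates(subset="cluster", keep="first")
print(deduped["headline"].tolist())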


How to solve Python Pandas assign error when creating new column

I have a dataframe containing home descriptions:
description
0 Beautiful, spacious skylit studio in the heart...
1 Enjoy 500 s.f. top floor in 1899 brownstone, w...
2 The spaceHELLO EVERYONE AND THANKS FOR VISITIN...
3 We welcome you to stay in our lovely 2 br dupl...
4 Please don’t expect the luxury here just a bas...
5 Our best guests are seeking a safe, clean, spa...
6 Beautiful house, gorgeous garden, patio, cozy ...
7 Comfortable studio apartment with super comfor...
8 A charming month-to-month home away from home ...
9 Beautiful peaceful healthy homeThe spaceHome i...
I'm trying to count the number of sentences on each row (using sent_tokenize from nltk.tokenize) and append those values as a new column, sentence_count, to the df. Since this is part of a larger data pipeline, I'm using pandas assign so that I could chain operations.
I can't seem to get it to work, though. I've tried:
df.assign(sentence_count=lambda x: len(sent_tokenize(x['description'])))
and
df.assign(sentence_count=len(sent_tokenize(df['description'])))
but both raise the following error:
TypeError: expected string or bytes-like object
I've confirmed that each value in the column is a str. Perhaps it's because description has dtype('O')?
What am I doing wrong here? Using a pipe with a custom function works fine, but I'd prefer to use assign.
x['description'], as passed to sent_tokenize in your first example, is not a string: assign calls the lambda with the whole DataFrame, so x['description'] is the entire column, a pandas.Series (similar to a list) of strings.
So instead apply sent_tokenize element-wise and take the length of each result:
df.assign(sentence_count=df['description'].apply(sent_tokenize).str.len())
Or, if you need to pass extra parameters to sent_tokenize:
df.assign(sentence_count=df['description'].apply(lambda x: sent_tokenize(x)).str.len())
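If you want to keep the pipeline chainable on an intermediate result (the stated reason for using assign), a callable still works, as long as the element-wise tokenization happens inside it via apply. A minimal variant, assuming the same df and imports as above:

# assign calls the lambda with the intermediate DataFrame `d`; apply runs
# sent_tokenize per row, and .str.len() turns each list of sentences into a count.
df.assign(sentence_count=lambda d: d['description'].apply(sent_tokenize).str.len())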

Speed up a loop filtering a string [duplicate]

This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 3 years ago.
I want to filter a column containing tweets (3+million rows) in a pandas dataframe by dropping those tweets that do not contain a keyword/s. To do this, I'm running the following loop (sorry, I'm new to python):
filter_word_indicators = []
for i in range(1, len(df)):
    if 'filter_word' in str(df.tweets[0:i]):
        indicator = 1
    else:
        indicator = 0
    filter_word_indicators.append(indicator)
The idea is then to drop tweets where the indicator equals 0. The problem is that this loop takes forever to run. I'm sure there is a better way to drop tweets that do not contain my 'filter_word', but I don't know how to code it. Any help would be great.
Check out pandas.Series.str.contains. To keep only the tweets that contain your keyword (i.e. drop the rest), you can use:
df[df.tweets.str.contains('filter_word')]
MWE
In [0]: df = pd.DataFrame(
            [[1, "abc"],
             [2, "bce"]],
            columns=["number", "string"]
        )
In [1]: df
Out[1]:
   number string
0       1    abc
1       2    bce
In [2]: df[df.string.str.contains("ab")]
Out[2]:
   number string
0       1    abc
Timing
Ran a small timing test on the following synthetic DataFrame with three million random strings the size of a tweet
import random
import string

import pandas as pd

df = pd.DataFrame(
    [
        "".join(random.choices(string.ascii_lowercase, k=280))
        for _ in range(3000000)
    ],
    columns=["strings"],
)
and the keyword abc, comparing the original solution, map + regex and this proposed solution (str.contains). The results are as follows.
original 99s
map + regex 21s
str.contains 2.8s
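A rough sketch of how the str.contains timing might be reproduced on the df built above; the benchmark harness isn't shown in the answer, so the timeit call here is illustrative only:

import timeit

# One pass over the 3M-row frame with the vectorized filter.
elapsed = timeit.timeit(lambda: df[df.strings.str.contains("abc")], number=1)
print(f"str.contains: {elapsed:.1f}s")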
I created the following example:
df = pd.DataFrame("""Suggested order for Amazon Prime Doctor Who series
Why did pressing the joystick button spit out keypresses?
Why tighten down in a criss-cross pattern?
What exactly is the 'online' in OLAP and OLTP?
How is hair tissue mineral analysis performed?
Understanding the reasoning of the woman who agreed with King Solomon to "cut the baby in half"
Can Ogre clerics use Purify Food and Drink on humanoid characters?
Heavily limited premature compiler translates text into excecutable python code
How many children?
Why are < or > required to use /dev/tcp
Hot coffee brewing solutions for deep woods camping
Minor traveling without parents from USA to Sweden
Non-flat partitions of a set
Are springs compressed by energy, or by momentum?
What is "industrial ethernet"?
What does the hyphen "-" mean in "tar xzf -"?
How long would it take to cross the Channel in 1890's?
Why do all the teams that I have worked with always finish a sprint without completion of all the stories?
Is it illegal to withhold someone's passport and green card in California?
When to remove insignificant variables?
Why does Linux list NVMe drives as /dev/nvme0 instead of /dev/sda?
Cut the gold chain
Why do some professors with PhDs leave their professorships to teach high school?
"How can you guarantee that you won't change/quit job after just couple of months?" How to respond?""".split('\n'), columns = ['Sentence'])
You can just create a simple function with a regular expression (more flexible, e.g. if you want to ignore capitalization):
import re

def tweetsFilter(s, keyword):
    return bool(re.match('(?i).*(' + keyword + ').*', s))
This function can be called to obtain a boolean series marking the strings which contain the specific keyword. Using map can speed up your script (you need to test it on your data!):
keyword = 'Why'
sel = df.Sentence.map(lambda x: tweetsFilter(x, keyword))
df[sel]
And we obtained:
Sentence
1 Why did pressing the joystick button spit out ...
2 Why tighten down in a criss-cross pattern?
9 Why are < or > required to use /dev/tcp
17 Why do all the teams that I have worked with a...
20 Why does Linux list NVMe drives as /dev/nvme0 ...
22 Why do some professors with PhDs leave their p...
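For comparison (not part of the original answer), the same case-insensitive filter can be written with the vectorized str.contains from the other answer, which its timing test suggests is considerably faster:

df[df.Sentence.str.contains(keyword, case=False)]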

Too many lists of Unique Words

This is a homework project from last week. I had problems, so I did not turn it in, but I'd like to go back and see if I can make it work. Now that I have it printing the right words in alphabetical order, the problem is that it prints 3 separate lists of unique words, each with a different number of words. How can I fix this?
import string

def process_line(line_str, word_set):
    line_str = line_str.strip()
    list_of_words = line_str.split()
    for word in list_of_words:
        if word != "--":
            word = word.strip()
            word = word.strip(string.punctuation)
            word = word.lower()
            word_set.add(word)

def pretty_print(word_set):
    list_of_words = []
    for w in word_set:
        list_of_words.append(w)
    list_of_words.sort()
    for w in list_of_words:
        print(w, end=" ")

word_set = set([])
fObject = open("gettysburg.txt")
for line_str in fObject:
    process_line(line_str, word_set)
    print("\nlength of the word set: ", len(word_set))
    print("\nUnique words in set: ")
    pretty_print(word_set)
Below is the output I get; I only want it to give me the last one, with the 138 words. I'd appreciate any help.
length of the word set: 29
Unique words in set:
a ago all and are brought conceived continent created dedicated equal fathers forth four in liberty men nation new on our proposition score seven that the this to years
length of the word set: 71
Unique words in set:
a ago all altogether and any are as battlefield brought can civil come conceived continent created dedicate dedicated do endure engaged equal fathers field final fitting for forth four gave great have here in is it liberty live lives long men met might nation new now of on or our place portion proper proposition resting score seven should so testing that the their this those to war we whether who years
length of the word set: 138
Unique words in set:
a above add advanced ago all altogether and any are as battlefield be before birth brave brought but by can cause civil come conceived consecrate consecrated continent created dead dedicate dedicated detract devotion did died do earth endure engaged equal far fathers field final fitting for forget forth fought four freedom from full gave god government great ground hallow have here highly honored in increased is it larger last liberty little live lives living long measure men met might nation never new nobly nor not note now of on or our people perish place poor portion power proper proposition rather remaining remember resolve resting say score sense seven shall should so struggled take task testing that the their these they this those thus to under unfinished us vain war we what whether which who will work world years
Take the last 3 lines out of the for loop:
....
for line_str in fObject:
    process_line(line_str, word_set)

print("\nlength of the word set: ", len(word_set))
print("\nUnique words in set: ")
pretty_print(word_set)

How do I create a list index that loops through integers in another list

I have this list:
sentences = ['The number of adults who read at least one novel within the past 12 months fell to 47%.',
'Fiction reading rose from 2002 to 2008.', 'The decline in fiction reading last year occurred mostly among men.',
'Women read more fiction.', '50% more.', 'Though it decreased 10% over the last decade.', 'Men are more likely to read nonfiction.',
'Young adults are more likely to read fiction.', 'Just 54% of Americans cracked open a book of any kind last year.',
'But novels have suffered more than nonfiction.']
And I have another list containing the indexes of all sequences of sentences in the above list that contain a number.
index_groupings = [[0, 1], [4, 5], [8]]
I want to extract specified sentence sequences in the variable "sentences" by using the indexes in the variable "index_groupings" so that I get the following output:
The number of adults who read at least one novel within the past 12 months fell to 47%.Fiction reading rose from 2002 to 2008.
50% more.Though it decreased 10% over the last decade.
Just 54% of Americans cracked open a book of any kind last year.
So I do the following:
for grouping in index_groupings:
    if len(grouping) > 1:
        print sentences[grouping[:]]
    else:
        print sentences[grouping[0]]
When I run that, I get an error message that says
TypeError: list indices must be integers, not list
The line print sentences[grouping[:]] trips it up. Is there a way to loop through those index sequences in the list index_groupings so that it returns the correct output?
print ["".join(map(lambda x:sentences[x],i)) for i in index_groupings]
You can use join and list comprehension. here.
Output:['The number of adults who read at least one novel within the past 12 months fell to 47%. Fiction reading rose from 2002 to 2008.', '50% more. Though it decreased 10% over the last decade.', 'Just 54% of Americans cracked open a book of any kind last year.']
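An equivalent comprehension without map and lambda (same sentences and index_groupings as above) reads a little more directly:

print ["".join(sentences[x] for x in grouping) for grouping in index_groupings]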
Python's got some batteries for this:
for grouping in index_groupings:
    if len(grouping) > 1:
        # the second index in each grouping is inclusive, so add 1 for the slice stop
        print sentences[slice(grouping[0], grouping[-1] + 1)]
    else:
        print sentences[grouping[0]]
The slice constructor is what gets used under the hood when you write a[x:y:z]: it is converted to a[slice(x, y, z)], except that specialized byte codes are used to avoid some overhead. Here we just create the slice explicitly, since each grouping already gives us the start index and the (inclusive) end index, which is more convenient than indexing or unpacking and then using slicing syntax.
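A tiny illustration of that equivalence with a throwaway list:

nums = list(range(10))
assert nums[2:7] == nums[slice(2, 7)]            # slicing syntax builds a slice object
assert nums[::2] == nums[slice(None, None, 2)]   # omitted parts become None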

Comparing similarity between multiple strings with a random starting point

I have a bunch of people names that are tied to their respective identifying numbers (e.g. Social Security Number/National ID/Passport Number). Due to duplication, though, one identity number can have up to 100 names, which could be similar or totally different. E.g. ID 221 could have the names Richard Parker, Mary Parker, Aunt May, Parker Richard, M#rrrrryy Richard, etc. Some are typos, but some are totally different names.
Initially, I want to display only 3 (or a similarly small number) of the names that are as different as possible from the rest, so as to alert the viewer that the multiple names might not just be typos but could even be a case of identity theft, negligent data capture, or anything else!
I've read up on algorithms to detect similarity and am currently looking at this one, which lets you compute a score: a score of 1 means the two strings are the same, while a lower score means they are dissimilar. In my use case, how can I go through, say, the 100 names and display the 3 that are most dissimilar? The algorithm for that escapes me, as I feel like I need a starting point, then compare against all the others, and loop again, etc.
Take the function from https://stackoverflow.com/a/14631287/1082673 as you mentioned and iterate over all combinations in your list. This will work if you don't have too many entries; otherwise the computation time can increase pretty fast…
Here is how to generate the pairs for a given list:
import itertools

persons = ['person1', 'person2', 'person3']
for p1, p2 in itertools.combinations(persons, 2):
    print "Compare", p1, "and", p2
