I am working with a csv file, and many of its rows contain duplicated words that I want to remove (without losing the order of the sentences).
csv file example (userID and description are the column names):
userID, description
12, hello world hello world
13, I will keep the 2000 followers same I will keep the 2000 followers same
14, I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car
.
.
I would like to have the output as:
userID, description
12, hello world
13, I will keep the 2000 followers same
14, I paid $2000 to the car
.
.
I already tried the posts 1, 2 and 3, but none of them fixed my problem; nothing changed. (The order in my output file matters, since I don't want to lose the ordering.) It would be great if you could provide a code sample that I can run on my side and learn from.
Thank you
[I am using python 3.7 version]
To remove duplicates, I'd suggest a solution involving the OrderedDict data structure, which remembers the order in which keys were first inserted:

from collections import OrderedDict

df['Desired'] = (df['Current'].str.split()
                 .apply(lambda x: ' '.join(OrderedDict.fromkeys(x))))
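Incidentally, since you're on Python 3.7, where regular dicts preserve insertion order, plain dict.fromkeys works just as well. A minimal sketch, assuming the same 'Current'/'Desired' column names:

df['Desired'] = (df['Current'].str.split()
                 .apply(lambda words: ' '.join(dict.fromkeys(words))))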
The code below works for me:

import pandas as pd

a = pd.Series(["hello world hello world",
               "I will keep the 2000 followers same I will keep the 2000 followers same",
               "I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car"])
a.apply(lambda x: " ".join([w for i, w in enumerate(x.split()) if x.split().index(w) == i]))
Basically the idea is, for each word, to keep it only if its position is the first occurrence in the list (obtained by splitting the string on spaces). That means if the word has occurred before, the .index() call will return a position smaller than that of the current occurrence, and the word will be eliminated.
This will give you:
0 hello world
1 I will keep the 2000 followers same
2 I paid $2000 to the car
dtype: object
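Note that the repeated x.split() and .index() calls make this quadratic in the number of words per row. If that matters, here is a hedged sketch of a linear-time variant of the same idea, tracking words already seen in a set:

def dedupe_words(text):
    seen = set()
    kept = []
    for word in text.split():
        if word not in seen:  # keep only the first occurrence of each word
            seen.add(word)
            kept.append(word)
    return ' '.join(kept)

a.apply(dedupe_words)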
Solution taken from here:
def principal_period(s):
    i = (s + s).find(s, 1)
    return s[:i]

df['description'].apply(principal_period)
Output:
0 hello world
1                 I will keep the 2000 followers same
2 I paid $2000 to the car
Name: description, dtype: object
Since this uses apply on strings, it might be slow. Note also that the repetition must be exact for this to work; for space-delimited repeats like these you may need to append a trailing space first, as the fuller version below does.
Answer taken from How can I tell if a string repeats itself in Python?
import pandas as pd

def principal_period(s):
    s += ' '
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

df = pd.read_csv(r'path\to\filename_in.csv')
df['description'] = df['description'].apply(principal_period)
df.to_csv(r'output\path\filename_out.csv')
Explanation:
I have added a space at the end to account for the fact that the repeating strings are delimited by spaces. The code then looks for the string within the string added to itself: searching from index 1 skips the trivial match at position 0, and stopping one character before the end rejects the match that is just the second copy (which is all there is when nothing repeats). This efficiently finds the position where the second occurrence starts, i.e. where the first, shortest repeating unit ends, and that repeating unit is returned.
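To make the mechanics concrete, here is a hand trace on the first sample row:

s = "hello world hello world"
s += ' '                    # "hello world hello world " (24 chars, period 12)
doubled = s + s
i = doubled.find(s, 1, -1)  # start=1 skips the trivial match at 0; end=-1
                            # rejects the match that is just the second copy
print(i)                    # 12 -- the length of one "hello world "
print(repr(s[:i]))          # 'hello world '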
I have a large dataframe to compare with another dataframe in order to correct the id column. I'll illustrate my problem with this simple example.
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
df = pd.DataFrame({'id': ['nan', 'nan', 'nan'],
                   'description': ['JOHN HAS 25 YEAR OLD LIVES IN At/12',
                                   'STEVE has 50 OLD LIVES IN At.14',
                                   'ALICIE HAS 10 YEAR OLD LIVES IN AT13']})
print(df)

df1 = pd.DataFrame({'id': [1203, 1205, 1045],
                    'description': ['JOHN HAS 25year OLD LIVES IN At 2',
                                    'STEVE has 50year OLD LIVES IN At 14',
                                    'ALICIE HAS 10year OLD LIVES IN At 13']})
print(df1)

age = ["50year", "25year", "10year"]
for a in age:
    ruler.add_patterns([{"label": "age", "pattern": a}])

names = ["JOHN", "STEVE", "ALICIA"]
for n in names:
    ruler.add_patterns([{"label": "name", "pattern": n}])

ref = ["AT 2", "At 13", "At 14"]
for r in ref:
    ruler.add_patterns([{"label": "ref", "pattern": r}])

# example to check text differences
doc = nlp("JOHN has 25 YEAR OLD LIVES IN At.12 ")
for ent in doc.ents:
    print(ent, ent.label_)
Actually, there is a difference in the text between the two dataframes df and df1, namely in the reference part (At/12 vs At 2, At.14 vs At 14, and so on).
I don't know how to get 100% similarity in this case.
I tried to use spaCy, but I don't know how to handle the differences and correct the id in df.
This is my dataframe1:
id description
0 nan STEVE has 50 OLD LIVES IN At.14
1 nan JOHN HAS 25 YEAR OLD LIVES IN At/12
2 nan ALICIE HAS 10 YEAR OLD LIVES IN AT15
This my reference dataframe:
id description
0 1203 STEVEN HAS 25year OLD lives in At 6
1 1205 JOHN HAS 25year OLD LIVES IN At 2
2 1045 ALICIE HAS 50year OLD LIVES IN At 13
3 3045 STEVE HAS 50year OLD LIVES IN At 14
4 3465 ALICIE HAS 10year OLD LIVES IN At 13
My expected output:
id description
0 3045 STEVE has 50 OLD LIVES IN At.14
1 1205 JOHN HAS 25 YEAR OLD LIVES IN At/12
2 3465 ALICIE HAS 10year OLD LIVES IN AT15
NB: The sentences are not in the same order, and the dataframes don't have equal length.
If the batch size is very large (and because using fuzzywuzzy is slow), we might be able to construct a KNN index using NMSLIB on some substring ngram embeddings (idea lifted from this article and this follow-up):
import re
import pandas as pd
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer
def ngrams(description, n=3):
    description = description.lower()
    description = re.sub(r'[,-./]|\sBD', r'', description)
    ngrams = zip(*[description[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]

def build_index(descriptions, vectorizer):
    ref_vectors = vectorizer.fit_transform(descriptions)
    index = nmslib.init(method='hnsw',
                        space='cosinesimil_sparse',
                        data_type=nmslib.DataType.SPARSE_VECTOR)
    index.addDataPointBatch(ref_vectors)
    index.createIndex()
    return index

def search_index(queries, vectorizer, index):
    query_vectors = vectorizer.transform(queries)  # transform the argument, not a global
    results = index.knnQueryBatch(query_vectors, k=1)
    return [res[0][0] for res in results]
# ref_df = df1, query_df = df
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
index = build_index(ref_df['description'], vectorizer)
results = search_index(query_df['description'], vectorizer, index)
query_ids = [ref_df['id'].iloc[ref_idx] for ref_idx in results]
query_df['id'] = query_ids
print(query_df)
This gives:
id description
0 3045 STEVE has 50 OLD LIVES IN At.14
1 1205 JOHN HAS 25 YEAR OLD LIVES IN At/12
2 3465 ALICIE HAS 10 YEAR OLD LIVES IN AT13
We can do more pre-processing in ngrams, e.g. stop words, handling symbols, etc.
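For reference, this is what the ngrams analyzer above produces on one of the sample strings, so you can see what actually gets embedded:

print(ngrams("STEVE has 50 OLD"))
# ['ste', 'tev', 'eve', 've ', 'e h', ' ha', 'has', 'as ', 's 5', ' 50', '50 ', '0 o', ' ol', 'old']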
As your strings are "almost" identical, here is a simpler suggestion using the string-matching module fuzzywuzzy which, as the name suggests, performs fuzzy string matching.
It offers a number of functions to compute string similarity; you can try out different ones and pick the one that seems to work best. Given your example dataframes...
id description
0 nan STEVE has 50 OLD LIVES IN At.14
1 nan JOHN HAS 25 YEAR OLD LIVES IN At/12
2 nan ALICIE HAS 10 YEAR OLD LIVES IN AT15
id description
0 1203 STEVEN HAS 25year OLD lives in At 6
1 1205 JOHN HAS 25year OLD LIVES IN At 2
2 1045 ALICIE HAS 50year OLD LIVES IN At 13
3 3045 STEVE HAS 50year OLD LIVES IN At 14
4 3465 ALICIE HAS 10year OLD LIVES IN At 13
...even the most basic ratio function seems to give us the correct result.
from fuzzywuzzy import fuzz
import numpy as np
import pandas as pd
fuzzy_ratio = np.vectorize(fuzz.ratio)
dist_matrix = fuzzy_ratio(df.description.values[:, None], df1.description.values)
dist_df = pd.DataFrame(dist_matrix, df.index, df1.index)
Result:
0 1 2 3 4
0 52 59 66 82 63
1 49 82 65 66 62
2 39 58 78 65 81
The row-wise maximum values suggest the following mappings:
'STEVE has 50 OLD LIVES IN At.14', 'STEVE HAS 50year OLD LIVES IN At 14'
'JOHN HAS 25 YEAR OLD LIVES IN At/12', 'JOHN HAS 25year OLD LIVES IN At 2'
'ALICIE HAS 10 YEAR OLD LIVES IN AT15', 'ALICIE HAS 10year OLD LIVES IN At 13'
Note, however, that it's a very close call in the last case (81 vs. 78), so this is not guaranteed to always be correct. Depending on what your data looks like, you might need more sophisticated heuristics. If all else fails, you might even give vector-based similarity metrics like Word Mover's Distance a try, but that seems like overkill if the strings aren't really all that different.
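If the row-wise maxima do look reliable for your data, here is a small follow-up sketch (assuming df and df1 are named as in the question) to turn them into id assignments:

best_match = dist_df.idxmax(axis=1)         # df1 label with the highest score per df row
df['id'] = df1.loc[best_match, 'id'].values
print(df)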
Since you're looking for almost-identical strings, spaCy is not really the right tool for this. Word vectors are about meaning, but you're looking for string similarity.
Maybe this is only possible because of your simplified example, but you can normalize your strings by removing anything that doesn't make a difference. For example,
text = "Alice lives at..."
text = text.lower()              # lower-case first so "YEAR" and "year" match
text = text.replace(" ", "")     # remove spaces
text = text.replace("/", "")     # remove slashes
text = text.replace("year", "")  # remove "year"
It seems like in most (all?) of your examples that would make your strings identical. You can then match strings by using their normalized forms as keys for a dictionary, for example.
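A minimal sketch of that dictionary-based matching, assuming the normalization rules above actually cover your data:

def normalize(text):
    # lower-case first so "YEAR" and "year" are treated alike,
    # then drop spaces, slashes, dots and the literal "year"
    text = text.lower()
    for junk in (" ", "/", ".", "year"):
        text = text.replace(junk, "")
    return text

lookup = {normalize(desc): id_ for id_, desc in zip(df1['id'], df1['description'])}
df['id'] = [lookup.get(normalize(desc)) for desc in df['description']]

Rows where .get returns None (e.g. the At/12 vs At 2 case) have no exact normalized match and would still need a fuzzy fallback.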
This approach has an important advantage over the fuzzy matching described in the prior answer. Once you have two candidate strings, using a string-distance measure to check that they're close enough is useful, but you really don't want to compute the string distance for every pair of entries across both tables. If you normalize strings as suggested here, you can find matches without comparing each string to every string in the other table.
If the normalization strategy here doesn't work, look at simhash or other locality-sensitive hashing techniques. A simplified version would be to use rare words, like the names in your example data, to create "buckets" of entries. Computing the string similarity of everything in a bucket is somewhat slow, but better than using the whole table.
I think spaCy is not the correct tool here. What you need is (1) regex and (2) a Jaccard match. Since most of your tokens are supposed to match exactly, Jaccard similarity, which measures how many words two sentences share, should work well. For the regex part, I would use the following cleaning function:
import re

def text_clean(text):
    # keep only letters and periods; replace everything else with a space
    text = re.sub('[^A-Za-z.]', ' ', text)
    text = text.lower()
    return text
The function above, applied to all the strings, will remove the digits and characters such as '/' (note that the character class as written keeps periods). After that, applying Jaccard similarity should give good matches.
I have suggested removing the digits because in one of your examples /12 turned into 2 and you still expected a match; that suggests you are mainly concerned with the words rather than the exact digits.
You may not get 100% accuracy using the Jaccard match alone. The important thing is that not all true matches will reach a 100% Jaccard score, so you will have to put a cut-off on the Jaccard value above which you consider a pair a match.
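A hedged sketch of that cut-off idea, built on the text_clean function above (the 0.8 threshold is just a placeholder to tune on your data):

def jaccard(a, b):
    # share of common words between the two cleaned sentences
    wa = set(text_clean(a).split())
    wb = set(text_clean(b).split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def best_match(query, references, cutoff=0.8):
    score, ref = max((jaccard(query, ref), ref) for ref in references)
    return ref if score >= cutoff else None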
You may want to come up with a more complex approach using both spacy and Jaccard match on the cleaned strings, and then putting custom cut off on both match scores and picking your matches.
Also, I noted that in some cases two words occur joined together. Does that only happen with digits, such as At13, or with two words as well? To use the Jaccard match efficiently, you will need to resolve that too, but that's a whole other process and a bit out of scope for this answer.
I'm trying to extract ONLY one string that contains the $ character. The input is based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this:
2000
You don't need a regex. Instead you can iterate over the lines, and over each word in a line, to check whether it starts with '$' and extract it:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple, you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like this (provided text is the string you wrote above; I've renamed input to text so it doesn't shadow the built-in):

price_start = text.find('$')
price = text[price_start:].split(' ')[0]

if there is only one occurrence, like you said. (Careful: the sample text also contains a bare "$" in "several hundred $", which find would hit first.)
Alternatively, you could use a regex:

import re

price = re.findall(r'\S*\$\S*\d', text)[0]
price = price.replace('$', '')
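A quick sanity check on the post body above (assuming it is stored in text): the pattern requires a digit somewhere after the $, so the bare "$" in "several hundred $ to purchase" is not matched:

import re

print(re.findall(r'\S*\$\S*\d', text))  # ['$2000'] -- only the amount survives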
Novice programmer here seeking help.
I have a Dataframe that looks like this:
Current
0 "Invest in $APPL, $FB and $AMZN"
1 "Long $AAPL, Short $AMZN"
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I also have a list with all the cashtags: cashtags = ["$AAPL", "$FB", "$AMZN"]
Basically, I want to go through all the rows in this column of the DataFrame, keep the rows that mention exactly one unique cashtag (regardless of whether it is in caps or not), and delete all the others.
Desired Output:
Desired
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I've tried to count how many times the words appear in each string and add that value to a new column, so that I can delete rows based on that number:
for i in range(0, len(df) - 1):
    print(i, end="\r")
    tweet = df["Current"][i]
    count = 0
    for word in cashtags:
        count += str(tweet).count(word)
    df["Word_count"][i] = count
However, if I do this, I will be deleting rows that I don't want to delete, for example rows where the unique cashtag is mentioned several times ([3], [5]).
How can I achieve my desired output?
Rather than summing the count of each cashtag, you should sum its presence or absence, since you don't care how many times each cashtag occurs, only how many cashtags.
for tag in cashtags:
    count += tag in tweet
Or more succinctly: sum(tag in tweet for tag in cashtags)
To make the comparison case-insensitive, you can uppercase the tweets beforehand. Additionally, it is more idiomatic to filter on a temporary series than to loop explicitly over the dataframe (though you may need to read up on Pandas to understand how this works):
df[df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags)) == 1]
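Putting it together on your sample data, a minimal runnable sketch (the Current column name is taken from your frame):

import pandas as pd

cashtags = ["$AAPL", "$FB", "$AMZN"]
df = pd.DataFrame({'Current': [
    "Invest in $APPL, $FB and $AMZN",
    "Long $AAPL, Short $AMZN",
    "$AAPL earnings announcement soon",
    "$FB is releasing a new product. Will $FB's product be good?",
    "$Fb doing good today",
    "$AMZN high today. Will $amzn continue like this?",
]})

mask = df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags)) == 1
print(df[mask])  # keeps rows 2-5, as desired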
If you ever want to generalise your question to any tag, then this is a good place for a regular expression.
You want to match against (\$\w+)(?!.*\1) (see e.g. here for a detailed explanation), but the general structure is:
\$\w+: find a dollar sign followed by one or more letters/digits (or an underscore); if you just wanted to count how many tags you had, this is all you need,
e.g.
df.Current.str.count(r'\$\w+')
will print
0 3
1 2
2 1
3 2
4 1
5 2
but this counts repeated tags more than once, so you need to add a negative lookahead:
(?!.*\1): a negative lookahead, meaning don't match if the same captured tag appears again later on. This ensures only the last occurrence of each tag is counted in the string.
Using this, you can then use the pandas string methods, specifically Series.str.count (the re.I makes the match case-insensitive):
import re
df[df.Current.str.count(r'(\$\w+)(?!.*\1)', re.I) == 1]
which will give you your desired output
Current
2 $AAPL earnings announcement soon
3 $FB is releasing a new product. Will $FB's pro...
4 $Fb doing good today
5 $AMZN high today. Will $amzn continue like this?
I have extracted some invoice-related information from email bodies into Python strings; my next task is to extract the invoice numbers from those strings.
The format of the emails can vary, hence it is difficult to find the invoice number in the text. I also tried named entity recognition from spaCy, but since in most cases the invoice number comes on the line below the heading 'Invoice' or 'Invoice#', the NER doesn't understand the relation and returns incorrect details.
Below are 2 examples of the text extracted from mail body:
Example - 1.
Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.
Example - 2.
Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19
My problem is that if I convert this entire text to a single string, then it becomes something like this:
Invoice Date Purchase Order Due Date Balance 8754321 8/17/17
7200016508 9/16/18 140.72
As is visible, the invoice number (8754321 in this case) changed its position and no longer follows the keyword "Invoice", which makes it harder to find.
My desired output is something like this:
Output Example - 1 -
8754321
5245344
Output Example - 2 -
7651234
9872341
I don't know how I can retrieve the text just under the keyword "Invoice" or "Invoice#", which is the invoice number.
Please let me know if further information is required. Thanks!!
Edit: The invoice number doesn't have a pre-defined length; it can be 7 digits or more.
Code, as per my comments:
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
index = -1

# Get the first line of the table; print the line and the index of 'Invoice'
for line in email.split('\n'):
    if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
        print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
        index = line.find('Invoice')
This uses the heuristic that the column-header row is always camel case or all capitals (e.g. ID). It would fail if, say, a heading were exactly 'Account no.' rather than 'Account No.'.
# Get all numbers at a certain index
for line in email.split('\n'):
    words = line[index:].split()
    if words == []:
        continue
    word = words[0]
    try:
        print(int(word))
    except ValueError:
        continue
Reliability here depends on the data: in my code, the Invoice column must come first in the table header, i.e. you can't have 'Invoice Date' before 'Invoice'. Obviously this would need fixing.
Going off what Andrew Allen was saying, as long as these two assumptions are true:
Invoice numbers are always exactly 7 numerical digits
Invoice numbers are always preceded and followed by whitespace
regex should work. Something along the lines of:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
invoices = re.findall(r'\s(\d\d\d\d\d\d\d)\s', email)
invoices in this case is a list of two strings: ['8754321', '5245344']
Using regex, re.findall.
Example:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19 """
for eml in [email, email2]:
    print(re.findall(r"\b\d{7}\b", eml))
Output:
['8754321', '5245344']
['7651234', '9872341']
\b - regex word boundary
\d{7} - match exactly 7 digits
There are two sentences in "test_tweet1.txt":
#francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
#mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
In "Personal.txt"
The Game (rapper)
The Notorious B.I.G.
The Undertaker
Thor
Tiësto
Timbaland
T.I.
Tom Cruise
Tony Romo
Trajan
Triple H
My code:
import re

popular_person = open('C:/Users/Personal.txt')
rpopular_person = popular_person.read()
file1 = open("C:/Users/test_tweet1.txt").readlines()
array = []
count1 = 0
for line in file1:
    array.append(line)
    count1 = count1 + 1
    print "\n", count1, line
    ltext1 = line.split(" ")
    for i, text in enumerate(ltext1):
        if text in rpopular_person:
            print text
    text2 = ' '.join(ltext1)
The results from the code were:
1 #francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
Tony
The
man
to
the
the
2 #mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
aga
I am trying to match words from "test_tweet1.txt" against the names in "Personal.txt".
Expected result:
Tony
Romo
Any suggestions?
Your problem seems to be that rpopular_person is just a single string. Therefore, when you ask things like 'to' in rpopular_person, you get True because the characters 't', 'o' appear in sequence. I am assuming the same goes for 'the' elsewhere in Personal.txt.
What you want to do is split up Personal.txt into individual words, the way you're splitting your tweets. You can also make the resulting list of words into a set, since that'll make your lookup much faster. Something like this:
people = set(popular_person.read().split())
It's also worth noticing that I'm calling split() with no arguments. This splits on all whitespace: newlines, tabs, and so on. That way you get every word separately, as you intend. Or, if you don't actually want that (since it will match "The" all the time, given your edited contents of Personal.txt), make it:
people = set(popular_person.read().split('\n'))
This way you split on newlines, so you only look for full name matches.
You're not getting "Romo" because that's not a word in your tweet; the word in your tweet is "Romo.", with a period. This is very likely to remain a problem for you, so (assuming speed isn't an issue) I would rearrange your logic: rather than looking at each word in your tweet, look at each name in your Personal.txt file and check whether it appears in the full tweet. That way you don't have to deal with punctuation and so on. Here's how I'd rewrite your functionality:
with open("Personal.txt") as p:
    people = p.read().split('\n')  # Get full names rather than partial names

with open("test_tweet1.txt") as tweets:
    for tweet in tweets:
        for person in people:
            if person in tweet:
                print person
You need to split rpopular_person to make it match whole words instead of substrings:

rpopular_person = open('C:/Users/Personal.txt').read().split()
this gives:
Tony
The
The reason Romo isn't showing up is that after your line split you have "Romo." with a period. Maybe you should look for each popular person in the lines, instead of the other way around. Something like this:
popular_person = open('C:/Users/Personal.txt').read().split("\n")
file1 = open("C:/Users/test_tweet1.txt")
array = []
for count1, line in enumerate(file1):
    print "\n", count1, line
    for person in popular_person:
        if person in line:
            print person
print person