when I import this dataset:
dataset = pd.read_csv('lyrics.csv', delimiter = '\t', quoting = 2)
it prints like so:
lyrics,classification
0 I should have known better with a girl like yo...
1 You can shake an apple off an apple tree\nShak...
2 It's been a hard day's night\nAnd I've been wo...
3 Michelle, ma belle\nThese are words that go to...
4 Can't buy me love, love\nCan't buy me love\nI'...
5 I love you\nCause you tell me things I want to...
6 I dig a Pygmy by Charles Hawtrey and the Deaf ...
7 The song a robin sings,\nThrough years of endl...
8 Love me tender, love me sweet,\nNever let me g...
9 Well, it's one for the money,\nTwo for the sho...
10 All the words that I let her know\nStill could...
and if I print (dataset.columns), I get:
Index([u'lyrics,classification'], dtype='object')
but if I try to print the lyrics, like so:
for i in range(0, len(dataset)):
    lyrics = dataset['lyrics'][i]
    print lyrics
I get the following error:
KeyError: 'lyrics'
what am I missing here?
Since you set the delimiter to a tab (\t), the header isn't being parsed the way you expect: 'lyrics,classification' is read as a single column name. If you want to keep the tab delimiter, the file needs a tab rather than a comma between lyrics and classification. Otherwise, set the delimiter to a comma so that it matches the file.
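To see the effect on a small scale, here is a minimal sketch that parses the same text with both delimiters. The inline sample data and the names raw, df_tab and df_comma are made up for illustration; the real lyrics.csv may need further options (such as the quoting argument from the question).

```python
import io

import pandas as pd

# Hypothetical stand-in for lyrics.csv: comma-separated header and rows.
raw = "lyrics,classification\nLove me tender,0\nI love you,1\n"

# Tab delimiter: the text contains no tabs, so every line is one column.
df_tab = pd.read_csv(io.StringIO(raw), delimiter="\t")
print(list(df_tab.columns))    # ['lyrics,classification']

# Comma delimiter: the header splits into two column names.
df_comma = pd.read_csv(io.StringIO(raw), delimiter=",")
print(list(df_comma.columns))  # ['lyrics', 'classification']

# Now the 'lyrics' column exists and can be iterated directly.
for lyrics in df_comma["lyrics"]:
    print(lyrics)
```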
I need some help: I have a column of words, and I want to remove the duplicated words inside each cell.
What I want to get is something like this:
words                expected
car apple car good   car apple good
good bad well good   good bad well
car apple bus food   car apple bus food
I've tried this, but it is not working:
from collections import OrderedDict
df['expected'] = (df['words'].str.split().apply(lambda x: OrderedDict.fromkeys(x).keys()).str.join(' '))
I'll be very grateful if somebody can help me
If order is important use dict.fromkeys in a list comprehension:
df['expected'] = [' '.join(dict.fromkeys(w.split())) for w in df['words']]
output:
words expected
0 car apple car good car apple good
1 good bad well good good bad well
2 car apple bus food car apple bus food
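For a single string, the same idea can be sketched without pandas. The helper name dedupe_words is made up for this example, and the ordering guarantee assumes Python 3.7 or later, where dict keys keep insertion order.

```python
def dedupe_words(text):
    """Drop repeated words, keeping the first occurrence of each."""
    # dict keys are unique and remember the order they were inserted in
    return " ".join(dict.fromkeys(text.split()))


print(dedupe_words("car apple car good"))   # car apple good
print(dedupe_words("good bad well good"))   # good bad well
```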
If you don't need to retain the original order of the words, you can create an intermediate set which will remove duplicates.
df["expected"] = df["words"].str.split().apply(set).str.join(" ")
if words are string "word1 word2":
df['expected'] = [" ".join(set(wrds.strip().split())) for wrds in df.words]
I want to split some strings in Python on \n and use the parts in that format, but some of those strings contain unexpected newlines that I want to ignore.
TO CLARIFY: Both examples have only one string.
For example this is a regular string with no unexpected newlines:
Step 1
Cut peppers into strips.
Step 2
Heat a non-stick skillet over medium-high heat. Add peppers and cook on stove top for about 5 minutes.
Step 3
Toast the wheat bread and then spread hummus, flax seeds, and spinach on top
Step 4
Lastly add the peppers. Enjoy!
but some of them are like this:
Step 1
Using a fork, mash up the tuna really well until the consistency is even.
Step 2
Mix in the avocado until smooth.
Step 3
Add salt and pepper to taste. Enjoy!
I have to say I am new at regex and if the solution is obvious, please forgive
Edit: Here is my regex
stepOrder = []
# STEPS
txtSteps = re.split("\n",directions.text)
listOfLists = [[] for i in range(len(txtSteps)) if i % 2 == 0]
for i in range(len(listOfLists)):
    listOfLists[i] = [txtSteps[i*2],txtSteps[i*2+1]]
recipe["steps"] = listOfLists
print(listOfLists)
directions.text is every one of these examples I gave. I can share what it is too, but I think it's irrelevant.
You can solve this problem by matching the following regex:
(?<=\d\n).*
Basically, .* matches the rest of a line, provided that line is preceded by one digit (\d) and one newline character (\n).
Check the regex demo here.
Your whole Python snippet then becomes simplified as follows using the re.findall method:
# STEPS
steps = re.findall(r"(?<=\d\n).*", directions.text)
out = [[{'order':i+1, 'step': step}] for i, step in enumerate(steps)]
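As a quick check of that pattern, here is a small sketch run on an in-memory string instead of directions.text. The sample text is hypothetical, and the blank line after the first step is only an assumption about what the "unexpected newlines" look like in the real data.

```python
import re

# Hypothetical sample; the empty line stands in for an unexpected newline.
text = (
    "Step 1\n"
    "Using a fork, mash up the tuna really well.\n"
    "\n"
    "Step 2\n"
    "Mix in the avocado until smooth.\n"
)

# Keep only the line that directly follows "<digit>\n".
steps = re.findall(r"(?<=\d\n).*", text)
out = [{"order": i + 1, "step": step} for i, step in enumerate(steps)]
print(out)
```

Note that any line ending in a digit triggers the lookbehind for the line after it, so an instruction that wraps right after a number would be picked up as well.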
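import re

# Read the whole file; the with-block closes it automatically
with open("your_file_name") as f:
    content = f.read()
for line in content.split("\n"):
    # Skip lines that start with "&"
    if re.match("^&", line):
        continue
    print(line)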
I have a large dataframe that I need to compare with another dataframe in order to correct the id. I'll illustrate my problem with this simple example.
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
df = pd.DataFrame({'id':['nan','nan','nan'],
'description':['JOHN HAS 25 YEAR OLD LIVES IN At/12','STEVE has 50 OLD LIVES IN At.14','ALICIE HAS 10 YEAR OLD LIVES IN AT13']})
print(df)
df1 = pd.DataFrame({'id':[1203,1205,1045],
'description':['JOHN HAS 25year OLD LIVES IN At 2','STEVE has 50year OLD LIVES IN At 14','ALICIE HAS 10year OLD LIVES IN At 13']})
print(df1)
age = ["50year", "25year", "10year"]
for a in age:
    ruler.add_patterns([{"label": "age", "pattern": a}])
names = ["JOHN", "STEVE", "ALICIA"]
for n in names:
    ruler.add_patterns([{"label": "name", "pattern": n}])
ref = ["AT 2", "At 13", "At 14"]
for r in ref:
    ruler.add_patterns([{"label": "ref", "pattern": r}])
#exp to check text difference
doc = nlp("JOHN has 25 YEAR OLD LIVES IN At.12 ")
for ent in doc.ents:
    print(ent, ent.label_)
Actually, there is a difference between the text of the two dataframes, df and df1 (the reference), as shown in the picture below.
I don't know how to get 100% similarity in this case.
I tried to use spaCy, but I don't know how to handle the differences and correct the id in df.
This is my dataframe1:
id description
0 nan STEVE has 50 OLD LIVES IN At.14
1 nan JOHN HAS 25 YEAR OLD LIVES IN At/12
2 nan ALICIE HAS 10 YEAR OLD LIVES IN AT15
This is my reference dataframe:
id description
0 1203 STEVEN HAS 25year OLD lives in At 6
1 1205 JOHN HAS 25year OLD LIVES IN At 2
2 1045 ALICIE HAS 50year OLD LIVES IN At 13
3 3045 STEVE HAS 50year OLD LIVES IN At 14
4 3465 ALICIE HAS 10year OLD LIVES IN At 13
My expected output:
id description
0 3045 STEVE has 50 OLD LIVES IN At.14
1 1205 JOHN HAS 25 YEAR OLD LIVES IN At/12
2 3465 ALICIE HAS 10year OLD LIVES IN AT15
NB:The sentences are not in the same order / The dataframes don't have equal length
If the batch size is very large (and because using fuzzywuzzy is slow), we might be able to construct a KNN index using NMSLIB on some substring ngram embeddings (idea lifted from this article and this follow-up):
import re
import pandas as pd
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer
def ngrams(description, n=3):
    # Lowercase, strip some punctuation, then emit character n-grams
    description = description.lower()
    description = re.sub(r'[,-./]|\sBD',r'', description)
    ngrams = zip(*[description[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]
def build_index(descriptions, vectorizer):
    ref_vectors = vectorizer.fit_transform(descriptions)
    index = nmslib.init(method='hnsw',
                        space='cosinesimil_sparse',
                        data_type=nmslib.DataType.SPARSE_VECTOR)
    index.addDataPointBatch(ref_vectors)
    index.createIndex()
    return index
def search_index(queries, vectorizer, index):
    # Use the queries argument rather than the global query_df
    query_vectors = vectorizer.transform(queries)
    results = index.knnQueryBatch(query_vectors, k=1)
    return [res[0][0] for res in results]
# ref_df = df1, query_df = df
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
index = build_index(ref_df['description'], vectorizer)
results = search_index(query_df['description'], vectorizer, index)
query_ids = [ref_df['id'].iloc[ref_idx] for ref_idx in results]
query_df['id'] = query_ids
print(query_df)
This gives:
id description
0 3045 STEVE has 50 OLD LIVES IN At.14
1 1205 JOHN HAS 25 YEAR OLD LIVES IN At/12
2 3465 ALICIE HAS 10 YEAR OLD LIVES IN AT13
We could do more pre-processing in ngrams, e.g. removing stop words or handling more symbols.
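To get a feel for what the vectorizer sees, this sketch prints the character trigrams for two spellings of the same reference. It repeats the ngrams helper from above in simplified form: the \sBD alternative is left out, since it cannot match text that has already been lowercased.

```python
import re


def ngrams(description, n=3):
    # Same cleaning as above: lowercase, then drop , - . and /
    description = description.lower()
    description = re.sub(r"[,-./]", "", description)
    return ["".join(g) for g in zip(*[description[i:] for i in range(n)])]


print(ngrams("At.14"))  # ['at1', 't14']
print(ngrams("At 14"))  # ['at ', 't 1', ' 14']
```

The two spellings share no trigram here, which suggests that also stripping whitespace might be a pre-processing step worth trying for fragments like this.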
As your strings are "almost" identical, here is a simpler suggestion using the string matching module fuzzywuzzy which, as the name suggests, performs fuzzy string matching.
It offers a number of functions to compute string similarity; you can try out different ones and pick the one that seems to work best. Given your example dataframes...
id description
0 nan STEVE has 50 OLD LIVES IN At.14
1 nan JOHN HAS 25 YEAR OLD LIVES IN At/12
2 nan ALICIE HAS 10 YEAR OLD LIVES IN AT15
id description
0 1203 STEVEN HAS 25year OLD lives in At 6
1 1205 JOHN HAS 25year OLD LIVES IN At 2
2 1045 ALICIE HAS 50year OLD LIVES IN At 13
3 3045 STEVE HAS 50year OLD LIVES IN At 14
4 3465 ALICIE HAS 10year OLD LIVES IN At 13
...even the most basic ratio function seems to give us the correct result.
from fuzzywuzzy import fuzz
import numpy as np
import pandas as pd
fuzzy_ratio = np.vectorize(fuzz.ratio)
dist_matrix = fuzzy_ratio(df.description.values[:, None], df1.description.values)
dist_df = pd.DataFrame(dist_matrix, df.index, df1.index)
Result:
0 1 2 3 4
0 52 59 66 82 63
1 49 82 65 66 62
2 39 58 78 65 81
The row-wise maximum values suggest the following mappings:
'STEVE has 50 OLD LIVES IN At.14', 'STEVE HAS 50year OLD LIVES IN At 14'
'JOHN HAS 25 YEAR OLD LIVES IN At/12', 'JOHN HAS 25year OLD LIVES IN At 2'
'ALICIE HAS 10 YEAR OLD LIVES IN AT15', 'ALICIE HAS 10year OLD LIVES IN At 13'
Note, however, that it's a very close call in the last case (81 versus 78), so this is not guaranteed to always be correct. Depending on what your data looks like, you might need more sophisticated heuristics. If all else fails, you might even give vector-based similarity metrics like word mover's distance a try, but that seems like overkill if the strings aren't really all that different.
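The step from row-wise maxima to ids could be sketched like this. Note the swap: this uses difflib.SequenceMatcher from the standard library rather than fuzzywuzzy, so the scores are on a 0 to 1 scale and will not equal the numbers above. The best_match helper, the lowercasing and the 0.6 cutoff are assumptions made for illustration.

```python
from difflib import SequenceMatcher

queries = [
    "STEVE has 50 OLD LIVES IN At.14",
    "JOHN HAS 25 YEAR OLD LIVES IN At/12",
]
references = {
    1205: "JOHN HAS 25year OLD LIVES IN At 2",
    3045: "STEVE HAS 50year OLD LIVES IN At 14",
}


def best_match(query, references, cutoff=0.6):
    """Return (id, score) of the most similar reference, or (None, score)."""
    scored = [
        (SequenceMatcher(None, query.lower(), text.lower()).ratio(), ref_id)
        for ref_id, text in references.items()
    ]
    score, ref_id = max(scored)
    # Below the cutoff, report no match instead of a doubtful one.
    return (ref_id if score >= cutoff else None), score


for q in queries:
    print(q, "->", best_match(q, references))
```

Returning the score together with the id makes it easy to flag close calls for manual review.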
Since you're looking for almost-identical strings, spaCy is not really the right tool for this. Word vectors are about meaning, but you're looking for string similarity.
Maybe this is just possible because of your simplified example, but you can normalize your strings by removing stuff that doesn't make a difference. For example,
text = "Alice lives at..."
text = text.lower() # lowercase first so "YEAR" is removed as well
text = text.replace(" ", "") # remove spaces
text = text.replace("/", "") # remove slashes
text = text.replace("year", "") # remove "year"
It seems like in most (all?) of your examples that would make your strings identical. You can then match strings by using their normalized forms as keys for a dictionary, for example.
This approach has an important advantage over the fuzzy matching described in the prior answer. Once you have two candidates, using a string distance measure to check whether they're close enough is important, but you really don't want to compute string distance for every pair of entries across both tables. If you normalize strings like I've suggested here, you can find matches without comparing each string with every string in the other table.
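A minimal sketch of the dictionary lookup, using rows from the question. The normalize helper is hypothetical; it lowercases before removing "year" and also strips ".", which goes slightly beyond the replacements listed above.

```python
def normalize(text):
    # Lowercase first so that "YEAR" and "year" are treated alike.
    text = text.lower()
    for junk in (" ", "/", ".", "year"):
        text = text.replace(junk, "")
    return text


reference = {
    1205: "JOHN HAS 25year OLD LIVES IN At 2",
    3045: "STEVE HAS 50year OLD LIVES IN At 14",
    3465: "ALICIE HAS 10year OLD LIVES IN At 13",
}
queries = [
    "STEVE has 50 OLD LIVES IN At.14",
    "JOHN HAS 25 YEAR OLD LIVES IN At/12",
    "ALICIE HAS 10 YEAR OLD LIVES IN AT13",
]

# One pass over the reference table builds the lookup ...
lookup = {normalize(text): ref_id for ref_id, text in reference.items()}
# ... and each query is then a single dictionary access.
matches = [lookup.get(normalize(q)) for q in queries]
print(matches)  # [3045, None, 3465]
```

The JOHN row finds no match because "At/12" and "At 2" still differ after normalization; rows like that are the ones that would need a fuzzy fallback.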
If the normalization strategy here doesn't work, look at simhash or other locality sensitive hashing techniques. A simplified version would be to use rare words, like the names in your example data, to create "buckets" of entries. Computing the string similarity of everything in a bucket is somewhat slow, but better than using the whole table.
I don't think spaCy is the right tool here. What you need is (1) a regex clean-up and (2) a Jaccard match. Since most of your tokens are supposed to match exactly, Jaccard similarity, which measures how many words two sentences share, should work well. For the regex part, I would use the following clean-up:
import re
def text_clean(text):
    # replace everything except letters with a space
    text = re.sub('[^A-Za-z]', ' ', text)
    text = text.lower()
    return text
Now the above function, applied on all the strings, will remove the digits and all the '.', '/' etc characters. After that, if you apply Jaccard similarity, then you should get good matches.
I have suggested removing the digits, as in one of your examples, /12 turned into 2 and you still matched. So that meant that you are mainly concerned with the words and not the digits to be exact.
You may not get 100% accuracy using just the Jaccard match. More importantly, not every true match will have a Jaccard score of 100%, so you will have to choose a cut-off value above which you consider two strings a match.
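Here is a small sketch of Jaccard similarity on cleaned strings, using rows from the question. It assumes a cleaner that turns every non-letter into a space; the jaccard helper and the CUTOFF value are made up for illustration.

```python
import re


def text_clean(text):
    # Assumed cleaner: every non-letter becomes a space, then lowercase.
    return re.sub(r"[^A-Za-z]", " ", text).lower()


def jaccard(a, b):
    """Shared words divided by all distinct words of the two strings."""
    words_a = set(text_clean(a).split())
    words_b = set(text_clean(b).split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


query = "STEVE has 50 OLD LIVES IN At.14"
print(jaccard(query, "STEVE HAS 50year OLD LIVES IN At 14"))  # 6/7, about 0.857
print(jaccard(query, "STEVEN HAS 25year OLD lives in At 6"))  # 5/8 = 0.625

CUTOFF = 0.8  # hypothetical threshold; tune it on real data
```

Because the digits are removed, two reference rows that differ only in their numbers get the same score, so ties would still need a second criterion.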
You may want to come up with a more complex approach using both spacy and Jaccard match on the cleaned strings, and then putting custom cut off on both match scores and picking your matches.
Also, I noticed that in some cases two words run together. Does that only happen with digits, as in At13, or does it happen with two words as well? To use the Jaccard match effectively, you will need to resolve that too. But that's a whole other process and a bit out of scope for this answer.
Novice programmer here seeking help.
I have a Dataframe that looks like this:
Current
0 "Invest in $APPL, $FB and $AMZN"
1 "Long $AAPL, Short $AMZN"
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I also have a list with all the cashtags: cashtags = ["$AAPL", "$FB", "$AMZN"]
Basically, I want to go through all the lines in this column of the DataFrame and keep the rows with a unique cashtag, regardless if it is in caps or not, and delete all others.
Desired Output:
Desired
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I've tried to basically count how many times the word appears in the string and add that value to a new column so that I can delete the rows based on the number.
for i in range(0,len(df)-1):
    print(i, end = "\r")
    tweet = df["Current"][i]
    count = 0
    for word in cashtags:
        count += str(tweet).count(word)
    df["Word_count"][i] = count
However if I do this I will be deleting rows that I don't want to. For example, rows where the unique cashtag is mentioned several times ([3],[5])
How can I achieve my desired output?
Rather than summing the count of each cashtag, you should sum its presence or absence, since you don't care how many times each cashtag occurs, only how many cashtags.
for tag in cashtags:
    count += tag in tweet
Or more succinctly: sum(tag in tweet for tag in cashtags)
To make the comparison case insensitive, you can upper case the tweets beforehand. Additionally, it would be more idiomatic to filter on a temporary series and avoid explicitly looping over the dataframe (though you may need to read up more about Pandas to understand how this works):
df[df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags)) == 1]
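The presence-based count can be checked on plain Python lists, without pandas. This sketch reuses the tweets from the question; the helper name distinct_tag_count is made up.

```python
cashtags = ["$AAPL", "$FB", "$AMZN"]
tweets = [
    "Invest in $APPL, $FB and $AMZN",
    "Long $AAPL, Short $AMZN",
    "$AAPL earnings announcement soon",
    "$FB is releasing a new product. Will $FB's product be good?",
    "$Fb doing good today",
    "$AMZN high today. Will $amzn continue like this?",
]


def distinct_tag_count(tweet):
    upper = tweet.upper()
    # True counts as 1, so this sums presence, not occurrences.
    return sum(tag in upper for tag in cashtags)


kept = [i for i, tweet in enumerate(tweets) if distinct_tag_count(tweet) == 1]
print(kept)  # [2, 3, 4, 5]
```

Since this is a substring test, a tag such as "$FB" would also be found inside a longer tag that starts with the same characters.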
If you ever want to generalise your question to any tag, then this is a good place for a regular expression.
You want to match against (\$\w+)(?!.*\1); see e.g. here for a detailed explanation, but the general structure is:
\$\w+: find a dollar sign followed by one or more letters/numbers (or an _). If you just wanted to count how many tags you had, this is all you need,
e.g.
df.Current.str.count(r'\$\w+')
will print
0 3
1 2
2 1
3 2
4 1
5 2
but this counts the same tag more than once when it is repeated, which would wrongly remove rows such as 3 and 5, so you need to add a negative lookahead:
(?!.*\1): a negative lookahead, meaning don't match if the same text appears again later in the string. As a result, only the last occurrence of each tag is counted.
Using this, you can then use the pandas Series.str methods, specifically Series.str.count (the re.I makes the match case-insensitive):
import re
df[df.Current.str.count(r'(\$\w+)(?!.*\1)', re.I) == 1]
which will give you your desired output
Current
2 $AAPL earnings announcement soon
3 $FB is releasing a new product. Will $FB's pro...
4 $Fb doing good today
5 $AMZN high today. Will $amzn continue like this?
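The same pattern can be tried with the re module alone. This sketch counts distinct tags per tweet and cross-checks the result against a simpler set-based count; both helper names are made up for the example.

```python
import re

tweets = [
    "Invest in $APPL, $FB and $AMZN",
    "Long $AAPL, Short $AMZN",
    "$AAPL earnings announcement soon",
    "$FB is releasing a new product. Will $FB's product be good?",
    "$Fb doing good today",
    "$AMZN high today. Will $amzn continue like this?",
]

# A tag only matches if the same text does not appear again later on.
last_occurrence = re.compile(r"(\$\w+)(?!.*\1)", re.I)


def unique_tags(tweet):
    return len(last_occurrence.findall(tweet))


def unique_tags_via_set(tweet):
    # Alternative: collect every tag, lowercase it, and let a set dedupe.
    return len({tag.lower() for tag in re.findall(r"\$\w+", tweet)})


counts = [unique_tags(t) for t in tweets]
print(counts)  # [3, 2, 1, 1, 1, 1]
```

The set-based variant gives the same counts for these tweets and may be easier to reason about when one tag is a prefix of another.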
There are two sentences in "test_tweet1.txt"
#francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
#mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
In "Personal.txt"
The Game (rapper)
The Notorious B.I.G.
The Undertaker
Thor
Tiësto
Timbaland
T.I.
Tom Cruise
Tony Romo
Trajan
Triple H
My code:
import re
popular_person = open('C:/Users/Personal.txt')
rpopular_person = popular_person.read()
file1 = open("C:/Users/test_tweet1.txt").readlines()
array = []
count1 = 0
for line in file1:
    array.append(line)
    count1 = count1 + 1
    print "\n",count1, line
    ltext1 = line.split(" ")
    for i,text in enumerate(ltext1):
        if text in rpopular_person:
            print text
    text2 = ' '.join(ltext1)
The code printed these results:
1 #francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
Tony
The
man
to
the
the
2 #mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
aga
I tried to match word from "test_tweet1.txt" with "Personal.txt".
Expected result:
Tony
Romo
Any suggestion?
Your problem seems to be that rpopular_person is just a single string. Therefore, when you ask things like 'to' in rpopular_person, you get a value of True, because the characters 't', 'o' appear in sequence. I am assuming that the same goes for 'the' elsewhere in Personal.txt.
What you want to do is split up Personal.txt into individual words, the way you're splitting your tweets. You can also make the resulting list of words into a set, since that'll make your lookup much faster. Something like this:
people = set(popular_person.read().split())
It's also worth noticing that I'm calling split(), with no arguments. This splits on all whitespace--newlines, tabs, and so on. This way you get everything separately like you intend. Or, if you don't actually want this (since this will give you results of "The" all the time based on your edited contents of Personal.txt), make it:
people = set(popular_person.read().split('\n'))
This way you split on newlines, so you only look for full name matches.
You're not getting "Romo" because that's not a word in your tweet. The word in your tweet is "Romo." with a period. This is very likely to remain a problem for you, so what I would do is rearrange your logic (assuming speed isn't an issue). Rather than looking at each word in your tweet, look at each name in your Personal.txt file, and see if it's in your full tweet. This way you don't have to deal with punctuation and so on. Here's how I'd rewrite your functionality:
with open("Personal.txt") as p:
    people = p.read().split('\n') # Get full names rather than partial names
with open("test_tweet1.txt") as tweets:
    for tweet in tweets:
        for person in people:
            # Skip empty entries, which would match every tweet
            if person and person in tweet:
                print person
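The difference between the three lookups can be seen on in-memory data. This is a Python 3 sketch; names_text is a shortened, hypothetical stand-in for Personal.txt, and the tweet is abridged from the question.

```python
names_text = (
    "The Notorious B.I.G.\nThe Undertaker\nThor\nTom Cruise\nTony Romo\n"
)
tweet = (
    "#francesco_con40 2nd worst QB. DEFINITELY Tony Romo. "
    "The man who likes to share the ball with everyone."
)

# Substring test against one big string: "to" hides inside "Notorious".
print("to" in names_text)              # True

# Membership test against a set of whole words.
words = set(names_text.split())
print("to" in words, "Tony" in words)  # False True

# Full names, one per line; splitlines() leaves no empty trailing entry.
people = names_text.splitlines()
found = [person for person in people if person in tweet]
print(found)                           # ['Tony Romo']
```

Checking each full name against the whole tweet also sidesteps the punctuation issue, because "Tony Romo" is found even though the tweet contains "Romo." with a period.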
You need to split rpopular_person so that it matches whole words instead of substrings:
rpopular_person = open('C:/Users/Personal.txt').read().split()
this gives:
Tony
The
The reason Romo isn't showing up is that splitting the line gives you "Romo." with the period attached. Maybe you should look for each entry of rpopular_person in the lines, instead of the other way around. Maybe something like this:
popular_person = open('C:/Users/Personal.txt').read().split("\n")
file1 = open("C:/Users/test_tweet1.txt")
array = []
for count1, line in enumerate(file1):
    print "\n", count1, line
    for person in popular_person:
        if person and person in line:  # skip empty entries from blank lines
            print person