How do I remove numbers, and everything after them, using pandas? Specifically, whenever a number appears as a separate word, I want to remove that number and everything that follows it.
For example:
ABC,2 QUEEN = ABC
ABC 3 QUEEN = ABC
ABC PTE LTD YES123 = ABC PTE LTD YES123
ABC PTE LTD YES 123 = ABC PTE LTD
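Try this (note that regex=True is required, otherwise replace only matches whole cell values):
df['MyCol'] = df['MyCol'].replace(r'[,\s]+\d+\b.*', '', regex=True)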
I don't think pandas is the best tool for this task. You could use NLTK tokenization to split each row into words, then iterate through the tokenized words, keeping them in a separate list until a number is encountered, at which point you can use a 'break' statement and move on to the next row.
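A minimal sketch of that loop-and-break idea, assuming a plain regex split on commas and whitespace is an acceptable stand-in for an NLTK tokenizer (the function name strip_from_first_number is made up for illustration):

```python
import re

def strip_from_first_number(text):
    """Keep the words that come before the first stand-alone number."""
    # Split on commas and whitespace; a plain-Python stand-in for a tokenizer.
    tokens = [tok for tok in re.split(r"[,\s]+", text) if tok]
    kept = []
    for tok in tokens:
        if tok.isdigit():  # a number standing as its own word
            break          # drop it and everything after it
        kept.append(tok)
    return " ".join(kept)

examples = ["ABC,2 QUEEN", "ABC 3 QUEEN", "ABC PTE LTD YES123", "ABC PTE LTD YES 123"]
cleaned = [strip_from_first_number(s) for s in examples]
print(cleaned)
```

On a DataFrame this could be applied with df['MyCol'].map(strip_from_first_number). Note that the kept words are re-joined with single spaces, so any original commas between them are lost.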
This is quite crude, but please try:
df['MyCol'].str.split(r'[ ,][0-9]+')
The drawback is that you will have to extract index 0 of each returned list to overwrite the original column. Alternatively, set the parameter expand=True and drop all the subsequent columns that are generated.
df['MyCol'].str.split(r'[ ,][0-9]+', expand=True)
Output (without expand=True):
0 [ABC, QUEEN]
1 [ABC, QUEEN]
2 [ABC PTE LTD YES123]
3 [ABC PTE LTD YES, ]
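Extracting index 0 does not need a loop: the .str accessor can index into each list. A small sketch, assuming the column is called MyCol as above and holds the four example strings from the question:

```python
import pandas as pd

df = pd.DataFrame({"MyCol": ["ABC,2 QUEEN", "ABC 3 QUEEN",
                             "ABC PTE LTD YES123", "ABC PTE LTD YES 123"]})

# Split on "<space or comma><digits>" and keep only the first piece of each list.
df["MyCol"] = df["MyCol"].str.split(r"[ ,][0-9]+").str[0]
print(df["MyCol"].tolist())
```

Because the pattern does not check for a word boundary after the digits, a value such as "ABC 3QUEEN" would also be cut at " 3"; whether that matters depends on your data.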
Let's say I have a df of python strings:
string
0 this house has 3 beds inside
1 this is a house with 2 beds in it
2 the house has 4 beds
I want to extract how many beds each house has. I felt a good way to do this would be to just find the item before beds.
While attempting to complete this problem, I of course noticed strings are indexed by character. That means I would have to turn the strings into a list with str.split(' ').
Then, I can find the index of 'beds' in each of the strings, and return the previous index. I tried both a list comprehension and df.iterrows() for this and can't seem to figure out the right way to do it. My desired output is:
string beds
0 this house has 3 beds inside 3
1 this is a house with 2 beds in it 2
2 the house has 4 beds 4
Have a look at "efficient way to get words before and after substring in text (python)".
In your case, you could do
for index, row in df.iterrows():
    df.loc[index, 'beds'] = row['string'].partition('bed')[0].strip()[-1]
The partition function splits the string based on a word and returns a tuple
The strip function is just used to remove white spaces.
If everything works, then the number you are looking for will be at the end of the first value of the tuple. Hence the [0]
If the above code is broken down for better readability:
for index, row in df.iterrows():
    split_str = row['string'].partition('bed')
    text_before_bed = split_str[0].strip()
    number_of_beds = text_before_bed[-1]  # last character before 'bed'
    df.loc[index, 'beds'] = number_of_beds  # write the value into the df itself; assigning to row would not change df
print(df.head())
The output df will then have the additional beds column.
Note: this is a quick "hack". There is no error checking in the loop, and taking only the last character means it works for single-digit counts only. You should add error checking, as you never know whether the word "bed" shows up at all in a row.
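As an alternative to the loop, a vectorized sketch using Series.str.extract. It assumes the count is written as digits directly before "bed" or "beds"; rows where that pattern is missing simply get NaN rather than raising:

```python
import pandas as pd

df = pd.DataFrame({"string": ["this house has 3 beds inside",
                              "this is a house with 2 beds in it",
                              "the house has 4 beds"]})

# Capture the digits that immediately precede the word "bed"/"beds".
# Rows without a match get NaN instead of raising an error.
df["beds"] = df["string"].str.extract(r"(\d+)\s+beds?\b", expand=False)
print(df)
```

The extracted values are strings; pd.to_numeric(df["beds"]) converts them if numbers are needed. Multi-digit counts are handled as well, since \d+ captures the whole number.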
I have a large dataframe that I need to compare with another (reference) dataframe in order to correct the id. I'll illustrate my problem with this simple example.
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
df = pd.DataFrame({'id':['nan','nan','nan'],
                   'description':['JOHN HAS 25 YEAR OLD LIVES IN At/12','STEVE has 50 OLD LIVES IN At.14','ALICIE HAS 10 YEAR OLD LIVES IN AT13']})
print(df)
df1 = pd.DataFrame({'id':[1203,1205,1045],
                    'description':['JOHN HAS 25year OLD LIVES IN At 2','STEVE has 50year OLD LIVES IN At 14','ALICIE HAS 10year OLD LIVES IN At 13']})
print(df1)
age = ["50year", "25year", "10year"]
for a in age:
    ruler.add_patterns([{"label": "age", "pattern": a}])
names = ["JOHN", "STEVE", "ALICIA"]
for n in names:
    ruler.add_patterns([{"label": "name", "pattern": n}])
ref = ["AT 2", "At 13", "At 14"]
for r in ref:
    ruler.add_patterns([{"label": "ref", "pattern": r}])
# example to check the text difference
doc = nlp("JOHN has 25 YEAR OLD LIVES IN At.12 ")
for ent in doc.ents:
    print(ent, ent.label_)
There are differences between the text of the two dataframes, df and df1 (the reference), as shown in the picture below.
I don't know how to get 100% similarity in this case.
I tried to use spaCy, but I don't know how to handle the differences and correct the id in df.
This is my dataframe1:
id description
0 nan STEVE has 50 OLD LIVES IN At.14
1 nan JOHN HAS 25 YEAR OLD LIVES IN At/12
2 nan ALICIE HAS 10 YEAR OLD LIVES IN AT15
This is my reference dataframe:
id description
0 1203 STEVEN HAS 25year OLD lives in At 6
1 1205 JOHN HAS 25year OLD LIVES IN At 2
2 1045 ALICIE HAS 50year OLD LIVES IN At 13
3 3045 STEVE HAS 50year OLD LIVES IN At 14
4 3465 ALICIE HAS 10year OLD LIVES IN At 13
My expected output:
id description
0 3045 STEVE has 50 OLD LIVES IN At.14
1 1205 JOHN HAS 25 YEAR OLD LIVES IN At/12
2 3465 ALICIE HAS 10year OLD LIVES IN AT15
NB: The sentences are not in the same order, and the dataframes don't have equal length.
If the batch size is very large (and because using fuzzywuzzy is slow), we might be able to construct a KNN index using NMSLIB on some substring ngram embeddings (idea lifted from this article and this follow-up):
import re
import pandas as pd
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(description, n=3):
    # lowercase, strip a few symbols, then emit overlapping character n-grams
    description = description.lower()
    description = re.sub(r'[,-./]|\sBD', r'', description)
    grams = zip(*[description[i:] for i in range(n)])
    return [''.join(gram) for gram in grams]

def build_index(descriptions, vectorizer):
    ref_vectors = vectorizer.fit_transform(descriptions)
    index = nmslib.init(method='hnsw',
                        space='cosinesimil_sparse',
                        data_type=nmslib.DataType.SPARSE_VECTOR)
    index.addDataPointBatch(ref_vectors)
    index.createIndex()
    return index

def search_index(queries, vectorizer, index):
    # use the queries argument rather than the global query_df
    query_vectors = vectorizer.transform(queries)
    results = index.knnQueryBatch(query_vectors, k=1)
    return [res[0][0] for res in results]

# ref_df = df1, query_df = df
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
index = build_index(ref_df['description'], vectorizer)
results = search_index(query_df['description'], vectorizer, index)
query_ids = [ref_df['id'].iloc[ref_idx] for ref_idx in results]
query_df['id'] = query_ids
print(query_df)
This gives:
id description
0 3045 STEVE has 50 OLD LIVES IN At.14
1 1205 JOHN HAS 25 YEAR OLD LIVES IN At/12
2 3465 ALICIE HAS 10 YEAR OLD LIVES IN AT13
We can do more pre-processing in ngrams, e.g. removing stop words, handling symbols, etc.
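To see why the pre-processing matters, here is a small stand-alone sketch of the character n-gram step. The helper char_ngrams is a simplified, hypothetical variant of the ngrams function above: it strips the same symbols but leaves whitespace in place.

```python
import re

def char_ngrams(description, n=3):
    """Lowercase, drop a few punctuation marks, and return overlapping n-grams."""
    cleaned = re.sub(r"[,\-./]", "", description.lower())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]

a = char_ngrams("At.14")
b = char_ngrams("At/14")
c = char_ngrams("At 14")
print(a)  # ['at1', 't14']
print(b)  # ['at1', 't14']
print(c)  # ['at ', 't 1', ' 14']
```

"At.14" and "At/14" end up with identical n-grams, while "At 14" shares none with them because the space is kept. Also removing whitespace in the cleaning step would be one way to close that gap.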
As your strings are "almost" identical, here is a simpler suggestion using the string matching module fuzzywuzzy which, as the name suggests, performs fuzzy string matching.
It offers a number of functions to compute string similarity; you can try out different ones and pick the one that seems to work best. Given your example dataframes...
id description
0 nan STEVE has 50 OLD LIVES IN At.14
1 nan JOHN HAS 25 YEAR OLD LIVES IN At/12
2 nan ALICIE HAS 10 YEAR OLD LIVES IN AT15
id description
0 1203 STEVEN HAS 25year OLD lives in At 6
1 1205 JOHN HAS 25year OLD LIVES IN At 2
2 1045 ALICIE HAS 50year OLD LIVES IN At 13
3 3045 STEVE HAS 50year OLD LIVES IN At 14
4 3465 ALICIE HAS 10year OLD LIVES IN At 13
...even the most basic ratio function seems to give us the correct result.
from fuzzywuzzy import fuzz
import numpy as np
import pandas as pd
fuzzy_ratio = np.vectorize(fuzz.ratio)
dist_matrix = fuzzy_ratio(df.description.values[:, None], df1.description.values)
dist_df = pd.DataFrame(dist_matrix, df.index, df1.index)
Result:
0 1 2 3 4
0 52 59 66 82 63
1 49 82 65 66 62
2 39 58 78 65 81
The row-wise maximum values suggest the following mappings:
'STEVE has 50 OLD LIVES IN At.14', 'STEVE HAS 50year OLD LIVES IN At 14'
'JOHN HAS 25 YEAR OLD LIVES IN At/12', 'JOHN HAS 25year OLD LIVES IN At 2'
'ALICIE HAS 10 YEAR OLD LIVES IN AT15', 'ALICIE HAS 10year OLD LIVES IN At 13'
Note, however, that it's a very close call in the last case (81 vs. 78), so this is not guaranteed to always be correct. Depending on what your data looks like, you might need more sophisticated heuristics. If all else fails, you might even give vector-based similarity metrics like Word Mover's Distance a try, but that seems overkill if the strings aren't really all that different.
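One way to turn the score matrix into id assignments while flagging close calls is sketched below. The scores and ids are copied from the tables above; MIN_MARGIN is a made-up threshold that you would need to tune.

```python
# Scores copied from the result table above: rows are queries, columns are references.
scores = [
    [52, 59, 66, 82, 63],
    [49, 82, 65, 66, 62],
    [39, 58, 78, 65, 81],
]
ref_ids = [1203, 1205, 1045, 3045, 3465]  # ids of the reference rows, in column order

MIN_MARGIN = 5  # hypothetical threshold: smaller gaps are flagged as ambiguous

matches = []
for row in scores:
    # Rank reference columns from best to worst score for this query.
    ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    margin = row[best] - row[runner_up]
    matches.append((ref_ids[best], row[best], margin >= MIN_MARGIN))

for ref_id, score, confident in matches:
    print(ref_id, score, "ok" if confident else "ambiguous")
```

With these numbers the first two queries map to 3045 and 1205 with a comfortable gap, while the third maps to 3465 but is flagged, since the runner-up is only 3 points behind.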
Since you're looking for almost-identical strings, spaCy is not really the right tool for this. Word vectors are about meaning, but you're looking for string similarity.
Maybe this is just possible because of your simplified example, but you can normalize your strings by removing stuff that doesn't make a difference. For example,
text = "Alice lives at..."
text = text.lower() # lowercase first, so "YEAR" and "year" are treated alike
text = text.replace(" ", "") # remove spaces
text = text.replace("/", "") # remove slashes
text = text.replace("year", "") # remove "year"
It seems like in most (all?) of your examples that would make your strings identical. You can then match strings by using their normalized forms as keys for a dictionary, for example.
This approach has an important advantage over the fuzzy matching described in the prior answer. Once you have two candidates, using a string distance measure to check whether they're close enough is important, but you really don't want to compute string distance for every pair of entries across both tables. If you normalize strings like I've suggested here, you can find matches without comparing each string with every string in the other table.
If the normalization strategy here doesn't work, look at simhash or other locality sensitive hashing techniques. A simplified version would be to use rare words, like the names in your example data, to create "buckets" of entries. Computing the string similarity of everything in a bucket is somewhat slow, but better than using the whole table.
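A rough sketch of the normalize-then-look-up idea, using the example rows from the question. Removing "." as well is an extra assumption on my part (needed for "At.14"), and the function name normalize is made up:

```python
def normalize(text):
    """Lowercase and strip characters that differ between the two tables."""
    text = text.lower()
    for junk in (" ", "/", ".", "year"):  # "." is an extra assumption for "At.14"
        text = text.replace(junk, "")
    return text

reference = {
    1203: "STEVEN HAS 25year OLD lives in At 6",
    1205: "JOHN HAS 25year OLD LIVES IN At 2",
    1045: "ALICIE HAS 50year OLD LIVES IN At 13",
    3045: "STEVE HAS 50year OLD LIVES IN At 14",
    3465: "ALICIE HAS 10year OLD LIVES IN At 13",
}
queries = [
    "STEVE has 50 OLD LIVES IN At.14",
    "JOHN HAS 25 YEAR OLD LIVES IN At/12",
    "ALICIE HAS 10 YEAR OLD LIVES IN AT15",
]

# One pass over the reference table builds the lookup; no pairwise comparison needed.
lookup = {normalize(desc): ref_id for ref_id, desc in reference.items()}
found = [lookup.get(normalize(q)) for q in queries]
print(found)
```

On this data only the STEVE row matches exactly (3045). The other two differ in the trailing number (12 vs. 2, 15 vs. 13), so they come back as None and would still need a string-distance check against a small set of candidates.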
I think using spaCy here is not the correct way. What you need is (1) regex and (2) Jaccard matching. Since most of your tokens are supposed to match exactly, Jaccard similarity, which measures how many words two sentences share relative to all the distinct words in both, should work well. For the regex part, I would use the following cleaning function:
import re
def text_clean(text):
    # replace everything except letters with a space
    text = re.sub('[^A-Za-z]', ' ', text)
    text = text.lower()
    return text
Now the above function, applied on all the strings, will remove the digits and all the '.', '/' etc characters. After that, if you apply Jaccard similarity, then you should get good matches.
I have suggested removing the digits, as in one of your examples, /12 turned into 2 and you still matched. So that meant that you are mainly concerned with the words and not the digits to be exact.
You may not get 100% accuracy using just the Jaccard match. More importantly, not every true match will have a Jaccard score of 100%, so you will have to pick a cut-off value above which you consider two strings a match.
You may want to come up with a more complex approach using both spacy and Jaccard match on the cleaned strings, and then putting custom cut off on both match scores and picking your matches.
Also, I noted that in some cases two words occur fused together. Does that only happen with digits, such as At13, or does it happen with two words as well? To use the Jaccard match effectively, you will need to resolve that too. But that's a whole other process and a bit out of scope for this answer.
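For reference, a minimal sketch of Jaccard similarity on the cleaned strings. It uses a letters-only cleaning step as described above; the CUTOFF value is a made-up example, not a recommendation:

```python
import re

def text_clean(text):
    """Replace everything except letters with a space, then lowercase."""
    return re.sub(r"[^A-Za-z]", " ", text).lower()

def jaccard(a, b):
    """Shared words divided by all distinct words across both strings."""
    words_a, words_b = set(text_clean(a).split()), set(text_clean(b).split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

CUTOFF = 0.8  # hypothetical threshold; tune it on your own data

score = jaccard("STEVE has 50 OLD LIVES IN At.14",
                "STEVE HAS 50year OLD LIVES IN At 14")
print(round(score, 3), score >= CUTOFF)  # 0.857 True
```

Here the two strings share 6 of 7 distinct words. One caveat with dropping digits: reference rows that differ only in a number (such as the two ALICIE rows, 50year vs. 10year) become indistinguishable after cleaning.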
I have a dataframe that contains a news dataset. I want to remove one sentence with two specific initial words, i.e. "baca juga:, .... laga." for example. Have an idea how to do it?
This is additional information if you need it.
You can try df.loc to find it and then change it to be blank:
df.loc[df['news'].astype(str).str.contains(r'(?:baca juga)', regex=True), 'news']
and if that works, you can set it to blank with = '' (note that this blanks the entire cell, not just the one sentence).
Using regex, find the sentence, then replace it with an empty string.
I don't see baca juga in your example, but assuming it's in one of the rows:
import re
df['news'] = df['news'].map(lambda x: re.sub(r'baca juga[^.]+\.', '', x))
Explanation
baca juga: the match starts with this text
[^.]+: matches one or more characters that are not a period
\.: then matches the period itself, so that it is removed as well
Example
input_df
news
0 dskfl fsdg wer. baca juga: fgads awr yut. dfaw...
1 rwepu fsan apsj lis. fja jp ios jos lfslt
Output_df
0 dskfl fsdg wer. dfaw top fapw asf
1 rwepu fsan apsj lis. fja jp ios jos lfslt
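A runnable sketch of the same idea on plain strings. The two rows are modeled on the example above; swallowing the whitespace after the period and ignoring case are assumptions added here, not part of the original pattern:

```python
import re

# "baca juga", then everything up to and including the next period,
# plus any whitespace that follows it.
pattern = re.compile(r"baca juga[^.]*\.\s*", flags=re.IGNORECASE)

rows = [
    "dskfl fsdg wer. baca juga: fgads awr yut. dfaw top fapw asf",
    "rwepu fsan apsj lis. fja jp ios jos lfslt",
]
cleaned = [pattern.sub("", text) for text in rows]
print(cleaned)
```

On a DataFrame the compiled pattern can be reused as df['news'].map(lambda x: pattern.sub('', x)). Keep in mind that a period inside the sentence (an abbreviation or a decimal number) would end the match early.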
I am trying to remove special characters like ",",".","-"(except comma) from the "Actors" column of my pandas data-frame. For this I use the apply method on the "Actors" column
df['Actors'] = df['Actors'].apply(lambda x: x.lower().replace("[^a-zA-Z,]", ""))
df['Actors'].head()
The output of the above snippet is shown below and we can see no special characters have been replaced:
1 tim robbins, morgan freeman, bob gunton, willi...
2 marlon brando, al pacino, james caan, richard ...
3 al pacino, robert duvall, diane keaton, robert...
4 christian bale, heath ledger, aaron eckhart, m...
5 martin balsam, john fiedler, lee j. cobb, e.g....
Name: Actors, dtype: object
But when I try resolving the above issue using the snippet below, the code works:
df['Actors'] = df['Actors'].str.lower().str.replace("[^a-zA-Z,]","")
df['Actors'].head()
1 timrobbins,morganfreeman,bobgunton,williamsadler
2 marlonbrando,alpacino,jamescaan,richardscastel...
3 alpacino,robertduvall,dianekeaton,robertdeniro
4 christianbale,heathledger,aaroneckhart,michael...
5 martinbalsam,johnfiedler,leejcobb,egmarshall
Name: Actors, dtype: object
I want to know why the apply version doesn't replace the characters properly.
You call apply on a Series, so x in the lambda is the single string in each row of the Series. That means x.lower().replace is Python's built-in str.replace, which doesn't support regex. It treats "[^a-zA-Z,]" as a literal substring and looks for it in each x. It can't find it, so nothing gets replaced.
On the other hand, pandas str.replace defaulted to regex=True in the pandas version used here, so it treats "[^a-zA-Z,]" as a regex pattern and replaces everything properly. Since pandas 2.0 the default is regex=False, so pass regex=True explicitly.
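The difference is easy to reproduce on a single string, without pandas. A short sketch (the sample name is taken from the output above):

```python
import re

name = "lee j. cobb, e.g. marshall"

# Built-in str.replace looks for the literal text "[^a-zA-Z,]", which is absent.
literal = name.replace("[^a-zA-Z,]", "")

# re.sub treats the same text as a character class and strips what matches.
regex = re.sub(r"[^a-zA-Z,]", "", name)

print(literal)  # lee j. cobb, e.g. marshall
print(regex)    # leejcobb,egmarshall
```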
It does not work because you are doing a replace on a plain string; effectively you call str.replace("[^a-zA-Z,]", ""). Your string does not contain the literal text [^a-zA-Z,], so nothing is removed. Put differently, Python does not interpret those characters as a regex, but simply as literal string elements.
To make it work you could do it like this. This is just to answer your question, because the preferred way to do it is your second example.
import re
remove = re.compile(r"[^a-zA-Z,]")
df['Actors'] = df['Actors'].apply(lambda x: remove.sub("", x.lower()))
Here is some documentation:
python str replace
pandas str replace
when I import this dataset:
dataset = pd.read_csv('lyrics.csv', delimiter = '\t', quoting = 2)
it prints like so:
lyrics,classification
0 I should have known better with a girl like yo...
1 You can shake an apple off an apple tree\nShak...
2 It's been a hard day's night\nAnd I've been wo...
3 Michelle, ma belle\nThese are words that go to...
4 Can't buy me love, love\nCan't buy me love\nI'...
5 I love you\nCause you tell me things I want to...
6 I dig a Pygmy by Charles Hawtrey and the Deaf ...
7 The song a robin sings,\nThrough years of endl...
8 Love me tender, love me sweet,\nNever let me g...
9 Well, it's one for the money,\nTwo for the sho...
10 All the words that I let her know\nStill could...
and if I print (dataset.columns), I get:
Index([u'lyrics,classification'], dtype='object')
but if I try to print the lyrics, like so:
for i in range(0, len(dataset)):
    lyrics = dataset['lyrics'][i]
    print lyrics
I get the following error:
KeyError: 'lyrics'
what am I missing here?
Since you set the delimiter to be a tab (\t), the header isn't being parsed the way you expect: 'lyrics,classification' is read as a single column name. If you want to keep the delimiter as a tab, then there should be a tab rather than a comma between lyrics and classification in the file. Otherwise, set the delimiter to a comma.
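The effect can be reproduced with a tiny in-memory file. This is only a sketch: the data row and the "ballad" label are made up, and only the header mirrors the one in the question.

```python
import io
import pandas as pd

# Hypothetical two-line file with a comma-separated header, like the one described.
raw = "lyrics,classification\nLove me tender,ballad\n"

as_tab = pd.read_csv(io.StringIO(raw), delimiter="\t")
as_comma = pd.read_csv(io.StringIO(raw), delimiter=",")

print(list(as_tab.columns))    # one column named 'lyrics,classification'
print(list(as_comma.columns))  # two columns: 'lyrics' and 'classification'
```

With the comma delimiter, as_comma['lyrics'] works as expected. Lyrics that themselves contain commas would need to be quoted in the file for this to parse cleanly.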