Check data in one CSV based on another CSV using pandas - Python

I have two CSV files.
The first:
word,centroid
she,1
great,0
good,3
mother,2
father,2
After,4
before,4
.....
The second:
sentences,label
good mother,1
great father,1
I want to check each sentence against the cluster results: if the sentence is "good mother", the word "good" is on centroid 3, so the array becomes [0,0,0,1,0]; then "mother" is on centroid 2, so the array becomes [0,0,1,1,0], and so on.
I have complicated and wrong code ... can anyone help me?
this is my code:
import pandas as pd
import re

array = []
data = pd.read_csv('data/data_komentar.csv', encoding="ISO-8859-1")
df = pd.read_csv('data/hasil_cluster.csv', encoding="ISO-8859-1")
for index, row in data.iterrows():
    kalimat = row[0]
    words = re.sub(r'([^\s\w]|_)', '', str(kalimat))
    words = re.sub(r'[0-9]+', '', words)
    for word in words.split():
        kata = word.lower()
        df = df[df.eq(kata)]
        if df.empty:
            print("empty")
        else:
            print(kata)
            if df['centroid;'] is 0:
                array = array + [1,0,0,0,0]
            if df['centroid'] is 1:
                array = array + [0,1,0,0,0]
            if df['centroid'] is 2:
                array = array + [0,0,1,0,0]
            if df['centroid;'] is 3:
                array = array + [0,0,0,1,0]
            if df['centroid;'] is 4:
                array = array + [0,0,0,0,1]
print(array)

You can use apply() on the sentences column of the DataFrame:
import numpy as np
MAX_CENTROIDS = 5
def get_centroids(row):
    centroids = np.zeros(MAX_CENTROIDS, dtype=int)
    for word in row.split(' '):
        if word in df1['word'].values:
            centroids[df1[df1['word'] == word]['centroid'].values] += 1
    return centroids
df2['centroid'] = df2['sentences'].apply(get_centroids)
Result df2:
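With the sample data from the question this would look like (a sketch; numpy arrays print without commas):
      sentences  label     centroid
0   good mother      1  [0 0 1 1 0]
1  great father      1  [1 0 1 0 0]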
df1 is the DataFrame with your words and centroids, df2 the one with the sentences. You have to specify the maximum number of centroids in MAX_CENTROIDS (= length of the centroid list).
Edit
To read the data sample you provided:
# Maybe remove encoding on your system
df1 = pd.read_csv('hasil_cluster.csv', sep=',', encoding='iso-8859-1')
# Drop Values without a centroid:
df1.dropna(inplace=True)
# Remove ; from every centroid value and convert the column to integers
df1['centroid'] = df1['centroid;'].apply(lambda x:str(x).replace(';', '')).astype(int)
# Remove unused column
df1.drop('centroid;', inplace=True, axis=1)

Related

How can I exclude words from a word frequency analysis of a list of articles in Python?

I have a dataframe df with a column "Content" that contains a list of articles extracted from the internet. I already have code that constructs a dataframe with the expected output (two columns, one for the word and the other for its frequency). However, I would like to exclude some words (connectors, for instance) from the analysis. Below you will find my code; what should I add to it?
Is it possible to use get_stop_words('fr') for more efficient use? (My articles are in French.)
Source Code
import csv
from collections import Counter
from collections import defaultdict
import pandas as pd
df = pd.read_excel('C:/.../df_clean.xlsx', sheet_name='Articles Scraping')
df = df[df['Content'].notnull()]
d1 = dict()
for line in df[df.columns[6]]:
    words = line.split()
    # print(words)
    for word in words:
        if word in d1:
            d1[word] += 1
        else:
            d1[word] = 1
sort_words = sorted(d1.items(), key=lambda x: x[1], reverse=True)
There are a few ways you can achieve this. You can use the isin() method:
data = {'test': ['x', 'NaN', 'y', 'z', 'gamma']}
df = pd.DataFrame(data)
words = ['x', 'y', 'NaN']
df = df[~df.test.isin(words)]
Or you can negate str.contains() with a joined pattern:
df = df[~df.test.str.contains('|'.join(words))]
If you want to utilize the stop words package for French, you can also do that, but you must preprocess all of your texts before you start doing any frequency analysis.
import stopwordsiso as stopwords  # assumption: the stopwords call below comes from the stopwordsiso package
french_stopwords = set(stopwords.stopwords("fr"))
STOPWORDS = list(french_stopwords)
STOPWORDS.extend(['add', 'new', 'words', 'here'])
I think the extend() will help you tremendously.
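A minimal sketch of plugging such a stop word list into the counting loop from the question (assumes the Content column and the STOPWORDS list from above; Counter.most_common() replaces the manual sort):
from collections import Counter

stop_set = set(STOPWORDS)
counts = Counter()
for line in df[df['Content'].notnull()]['Content']:
    # count every word that is not a stop word
    counts.update(w for w in str(line).lower().split() if w not in stop_set)
sort_words = counts.most_common()  # list of (word, frequency), most frequent first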

Count occurrences of column values in another dataframe column

I have two dataframes and I want to count the occurrences of each "classifier" in "fullname". My problem is that my script counts a word like "carrepair" for only one classifier, and I would like it to count for both classifiers. I would also like to add one random coordinate that matches the classifier.
First dataframe:
Second dataframe:
Result so far:
Desired Result:
My script so far:
import pandas as pd
fl = pd.read_excel (r'fullname.xlsx')
clas= pd.read_excel (r'classifier.xlsx')
fl.fullname= fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()
pat = '({})'.format('|'.join(clas['classifier'].unique()))
fl['fullname'] = fl['fullname'].str.extract(pat, expand = False)
clas['count_of_classifier'] = clas['classifier'].map(fl['fullname'].value_counts())
print(clas)
Thanks!
You could try this:
import pandas as pd
fl = pd.read_excel (r'fullname.xlsx')
clas= pd.read_excel (r'classifier.xlsx')
fl.fullname= fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()
# Add a new column to 'fl' containing either 'repair' or 'car'
for value in clas["classifier"].values:
    fl.loc[fl["fullname"].str.contains(value, case=False), value] = value
# Count values and create a new dataframe
new_clas = pd.DataFrame(
    {
        "classifier": [col for col in clas["classifier"].values],
        "count": [fl[col].count() for col in clas["classifier"].values],
    }
)
# Merge 'fl' and 'new_clas'
new_clas = pd.merge(
    left=new_clas, right=fl, how="left", left_on="classifier", right_on="fullname"
).reset_index(drop=True)
# Keep only expected columns
new_clas = new_clas.reindex(columns=["classifier", "count", "coordinate"])
print(new_clas)
# Outputs
classifier count coordinate
repair 3 52.520008, 13.404954
car 3 54.520008, 15.404954
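If you only need the counts, a shorter sketch sums boolean matches per classifier, so a name like "carrepair" is counted once for every classifier it contains (assumes the fl and clas frames from above):
# one boolean Series per classifier; .sum() counts the True values
clas['count_of_classifier'] = [
    fl['fullname'].str.contains(c, case=False, regex=False).sum()
    for c in clas['classifier']
]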

How can I convert separate text columns into rows using Python/Pandas?

I am teaching myself machine learning and am working on a dataset that has the following columns:
There are two sentence columns, sent0 and sent1, and sol contains whether sent0 or sent1 is against common sense. If sent0 is against common sense, there is a 0 in the sol column.
What I want is to combine both sent0 and sent1 into one column.
Now if sent0 is against common sense I want to write it as 0 in the sol column, and 1 if the sentence is right. I have tried different ways but failed until now.
This should do what you want
import pandas as pd

df = pd.DataFrame()  # this should be your original df
rows = []
for i, row in df.iterrows():
    id_ = row["id"]
    sent0 = row["sent0"]
    sent1 = row["sent1"]
    sol = row["sol"]
    # calculate solutions
    if sol == 0:  # sent0 is against common sense
        sol0 = 0
        sol1 = 1  # sent1 is common sense
    else:
        sol0 = 1  # sent0 is common sense
        sol1 = 0
    rows.append([id_, sent0, sol0])
    rows.append([id_, sent1, sol1])
# DataFrame.append was removed in pandas 2.0, so collect rows and build once
data = pd.DataFrame(rows, columns=["id", "sent", "sol"])
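For larger frames, a vectorized sketch using melt() avoids iterrows() entirely (assumes the columns id, sent0, sent1, sol as above, with sol always 0 or 1):
import numpy as np
import pandas as pd

# stack sent0/sent1 into a single 'sent' column
data = df.melt(id_vars=["id", "sol"], value_vars=["sent0", "sent1"],
               var_name="src", value_name="sent")
# sent0 rows keep sol as their label, sent1 rows get the opposite
data["sol"] = np.where(data["src"] == "sent0", data["sol"], 1 - data["sol"])
data = data.drop(columns="src").sort_values("id")[["id", "sent", "sol"]]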

pandas str.contains match exact substring not working with regex boundary

I have two dataframes, and I am trying to find a way to match an exact substring from one dataframe in another dataframe.
First DataFrame:
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl', 'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'],
'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
print(dataframe)
Second DataFrame
test_data = {'code name': ['PB', 'PB', 'PB'],
'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Approach
for k, l, m in zip(test_dataframe.iloc[:, 0], test_dataframe.iloc[:, 1], test_dataframe.iloc[:, 2]):
    dataframe['Site'] = np.select([dataframe['Place Name'].str.contains(r'\b{}~{}\b'.format(k, m), regex=False)], [l],
                                  default=dataframe['Site'])
The current output is below; I expect an exact substring match, which the code above does not produce.
Current Output:
Place Name Site
TS~HOT_MD~h_PB~progra_VV~gogl programmatic-mechanics
FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev
Expected Output:
Place Name Site
TS~HOT_MD~h_PB~progra_VV~gogl programmatic me
FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev
Data
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl',
'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'], 'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
test_data = {'code name': ['PB', 'PB', 'PB'], 'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Map the test_dataframe code and Actual columns into a dictionary as keys and values respectively:
keys=test_dataframe['code'].values.tolist()
dicto=dict(zip(test_dataframe.code, test_dataframe.Actual))
dicto
Join the keys with | to enable searching for any of the phrases:
k = '|'.join(r"{}".format(x) for x in dicto.keys())
k
Extract the substring of the dataframe matching any of the phrases in k and map it through the dictionary:
dataframe['Site'] = dataframe['Place Name'].str.extract('('+ k + ')', expand=False).map(dicto)
dataframe
Output
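With the sample data this produces the expected result, because the alternation tries 'progra' before 'prog':
                      Place Name             Site
0  TS~HOT_MD~h_PB~progra_VV~gogl  programmatic me
1  FM~uiosv_PB~emo_SZ~1x1_TG~bhv          emoteev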
Not the most elegant solution, but this does the trick.
Set up data
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl',
'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'], 'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
test_data = {'code name': ['PB', 'PB', 'PB'], 'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Solution
Create a column in test_dataframe with the substring to match:
test_dataframe['match_str'] = test_dataframe['code name'] + '~' + test_dataframe.code
print(test_dataframe)
code name Actual code match_str
0 PB programmatic me progra PB~progra
1 PB emoteev emo PB~emo
2 PB programmatic-mechanics prog PB~prog
Define a function to apply to test_dataframe:
def match_string(row, dataframe):
    ind = row.name
    try:
        if row.iloc[-1] in dataframe.loc[ind, 'Place Name']:
            return row.iloc[1]
        else:
            return dataframe.loc[ind, 'Site']
    except KeyError:
        # More rows in test_dataframe than there are in dataframe
        pass
# Apply match_string and assign back to dataframe
dataframe['Site'] = test_dataframe.apply(match_string, args=(dataframe,), axis=1)
Output:
Place Name Site
0 TS~HOT_MD~h_PB~progra_VV~gogl programmatic me
1 FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev

How to build features using pandas column and a dictionary efficiently?

I have a machine learning problem where I am calculating the bigram Jaccard similarity of a pandas dataframe text column with the values of a dictionary. Currently I store the results as a list and then convert the list to columns. This is proving to be very slow in production. Is there a more efficient way to do it?
Following are the steps I am currently following:
For each key in dict:
1. Get bigrams for the pandas column and the dict[key]
2. Calculate Jaccard similarity
3. Append to an empty list
4. Store the list in the dataframe
5. Convert the list to columns
from itertools import tee, islice
import numpy as np

def count_ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break

def n_gram_jaccard_similarity(str1, str2, n):
    a = set(count_ngrams(str1.split(), n))
    b = set(count_ngrams(str2.split(), n))
    intersection = a.intersection(b)
    union = a.union(b)
    try:
        return len(intersection) / float(len(union))
    except ZeroDivisionError:
        return np.nan

def jc_list(sample_dict, row, n):
    sim_list = []
    for key in sample_dict:
        sim_list.append(n_gram_jaccard_similarity(sample_dict[key], row["text"], n))
    return str(sim_list)
Using the above functions to build the bigram Jaccard similarity features as follows:
df["bigram_jaccard_similarity"]=df.apply(lambda row: jc_list(sample_dict,row,2),axis=1)
df["bigram_jaccard_similarity"] = df["bigram_jaccard_similarity"].map(lambda x:[float(i) for i in [a for a in [s.replace(',','').replace(']', '').replace('[','') for s in x.split()] if a!='']])
df[[i for i in sample_dict]] = pd.DataFrame(df["bigram_jaccard_similarity"].values.tolist(), index= df.index)
Sample input:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
Expected output:
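(Something like the following, computed by hand from the word-bigram Jaccard definition above: only (this, is) is shared with r1, nothing with r2, and (sample, text) with r3.)
  id                   text        r1   r2   r3
0  1  this is a sample text  0.166667  0.0  0.2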
So, this was more difficult than I thought, due to some broadcasting issues with sparse matrices. Additionally, in the short period of time I was not able to fully vectorize it.
I added an additional text row to the frame:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
df.loc[1] = ["2","this is a second sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
We will use the following modules/functions/classes:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix
import numpy as np
and define a CountVectorizer to create character-based n-grams:
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
Feel free to choose the n-grams you need. I'd advise taking an existing tokenizer and n-gram creator; you should find plenty of those. The CountVectorizer can also be tweaked extensively (e.g. convert to lowercase, get rid of whitespace, etc.), as sketched below.
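As an illustration, a hedged sketch of a word-level variant closer to the question's word bigrams (all standard CountVectorizer parameters):
# word bigrams instead of character bigrams, with lowercasing and
# unicode accent stripping switched on
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="word",
                                   lowercase=True, strip_accents="unicode")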
We concatenate all the data:
all_data = np.concatenate((df.text.to_numpy(),np.array(list(sample_dict.values()))))
We do this because our vectorizer needs a common indexing scheme for all the tokens that appear.
Now let's fit the Count vectorizer and transform the data accordingly:
ngrammed = ngram_vectorizer.fit_transform(all_data) >0
ngrammed is now a sparse matrix of boolean indicators for the tokens appearing in the respective rows, not the counts anymore as before. You can inspect ngram_vectorizer to find the mapping from tokens to column ids.
Next we want to compare every n-grammed entry from the sample dict against every row of our n-grammed text data. We need some magic here:
texts = ngrammed[:len(df)]
samples = ngrammed[len(df):]
text_rows = len(df)
jaccard_similarities = []
for key, ngram_sample in zip(sample_dict.keys(), samples):
    repeated_row_matrix = (csr_matrix(np.ones([text_rows, 1])) * ngram_sample).astype(bool)
    support = texts.maximum(repeated_row_matrix)
    intersection = texts.multiply(repeated_row_matrix).todense()
    jaccard_similarities.append(pd.Series((intersection.sum(axis=1) / support.sum(axis=1)).A1, name=key))
support is the boolean array that measures the union of the n-grams over the two rows being compared; intersection is only True where a token is present in both. Note that .A1 returns the underlying matrix as a flattened 1-D base array.
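A tiny illustration of this union/intersection trick on a single pair of boolean rows (values chosen arbitrarily; csr_matrix and np are imported above):
a = csr_matrix(np.array([[True, True, False]]))
b = csr_matrix(np.array([[True, False, True]]))
union = a.maximum(b)          # element-wise OR  -> 3 nonzero entries
intersection = a.multiply(b)  # element-wise AND -> 1 nonzero entry
print(intersection.sum() / union.sum())  # 0.333... = Jaccard similarity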
Now
pd.concat(jaccard_similarities, axis=1)
gives
r1 r2 r3
0 0.631579 0.444444 0.500000
1 0.480000 0.333333 0.384615
You can also concat it to df and obtain:
pd.concat([df, pd.concat(jaccard_similarities, axis=1)], axis=1)
id text r1 r2 r3
0 1 this is a sample text 0.631579 0.444444 0.500000
1 2 this is a second sample text 0.480000 0.333333 0.384615
