Convert raw data to pandas dataframe? - python

I have raw data in a string, which is basically multiple keywords in this form:
Law, of, three, stages
Alienation
Social, Facts
Theory, of, Social, System
How do I import it into a dataframe such that it counts repetitions and returns a count of each word?
Edit: I've converted it into the following format:
Law,of,three,stages,Alienation,Social,Facts,Theory,of,Social,System
I want to convert it into a dataframe because I want to eventually predict which word has the highest probability of recurring.

import pandas as pd

df = pd.DataFrame({
    'name': ['Law', 'of', 'three', 'stages', 'Alienation', 'Social',
             'Facts', 'Theory', 'of', 'Social', 'System']
})
print(df)

# every entry is already a single word, so counting is just value_counts()
word_freq = df['name'].value_counts()
print(word_freq)

Use a dictionary
word_count_dict = {}
with open("Yourfile.txt") as file_stream:
    for line in file_stream:
        # strip the trailing newline, then split the comma-separated keywords
        for item in line.strip().split(","):
            if item in word_count_dict:
                word_count_dict[item] += 1
            else:
                word_count_dict[item] = 1
Now that you have counts for all the words, if you want a probability-based ordering, it is recommended to divide each value by the total count of occurrences:
total = sum(word_count_dict.values())
probability_words = {k: v / total for k, v in word_count_dict.items()}
Now probability_words maps each word to its chance of occurrence.
Reverse ordering based on probabilities (note the .items() call, so we sort (word, probability) pairs rather than bare keys):
sorted_probability_words = sorted(probability_words.items(), key=lambda x: x[1], reverse=True)
Getting the first element, the one with the highest chance:
print(sorted_probability_words[0])     # the (word, probability) pair
print(sorted_probability_words[0][0])  # the word itself
print(sorted_probability_words[0][1])  # its probability
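For completeness, here is a compact sketch of the same pipeline using collections.Counter on the comma-separated string from the edit above:
from collections import Counter

raw = "Law,of,three,stages,Alienation,Social,Facts,Theory,of,Social,System"
counts = Counter(raw.split(','))
print(counts.most_common())         # [('of', 2), ('Social', 2), ('Law', 1), ...]
print(counts.most_common(1)[0][0])  # 'of': the word most likely to recur, under a simple frequency estimate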

Going from code that compares one value to all other values to all values of all other values

I've written code that takes a number of n-grams, a specific index, and a threshold, and returns the values that fall within that threshold. However, it currently only compares the set of tokens at a given index to the set of tokens at every other index. I want to compare each set of tokens at all indices to every other set of tokens at all indices. I don't think this is a difficult question, but Python is my main language and I struggle with for loops a bit.
So essentially, the variable token in the function should iterate over each string in the column and be compared with comp_token, and the index argument would be removed, since it would be iterating over all indices.
Let me know if that isn't clear enough and I will think more about how to say this: it is just difficult because the thing I am asking about is the thing I am struggling with.
import pandas as pd
import py_stringmatching as sm

data = ['Time', "NY Times", 'Atlantic']
ph = pd.DataFrame(data, columns=['companies'])
ph.reset_index(inplace=True)

jac = sm.Jaccard()

def predict_label(num, index, thresh):
    qg_num_tok = sm.QgramTokenizer(qval=num)
    companies = ph.companies.to_list()
    ids = ph['index']
    companies_qg_num_token = {}
    companies_id2index = {}
    for i in range(len(companies)):
        companies_id2index[i] = companies[i]
        companies_qg_num_token[i] = qg_num_tok.tokenize(companies[i])
    predicted_label = [1]
    token = companies_qg_num_token[index]  # index you want: to get all the tokens
    for comp_name in ids[1:]:
        comp_token = companies_qg_num_token[comp_name]
        sim = jac.get_sim_score(token, comp_token)
        if sim > thresh:
            predicted_label.append(1)
        else:
            predicted_label.append(0)
    # companies_id2index must be equal to token number
    ph.loc[ph['index'] != companies_id2index[index], 'label'] = 0  # if not equal to index
    ph['prediction'] = predicted_label
    print(token)
    print(companies_id2index[index])
    return ph.query('prediction==1')

predict_label(3, 1, .5)  # first argument is the q-gram size (an int), not the DataFrame
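Since the actual goal is comparing every index against every other index, here is a minimal sketch of that all-pairs version (pairwise_labels is an illustrative name; it assumes the same ph DataFrame and py_stringmatching setup as above):
import itertools

import pandas as pd
import py_stringmatching as sm

data = ['Time', "NY Times", 'Atlantic']
ph = pd.DataFrame(data, columns=['companies'])

def pairwise_labels(num, thresh):
    # tokenize every company name once up front
    qg_tok = sm.QgramTokenizer(qval=num)
    tokens = [qg_tok.tokenize(name) for name in ph.companies]
    jac = sm.Jaccard()
    results = []
    # combinations() yields each unordered pair of row positions exactly once
    for i, j in itertools.combinations(range(len(tokens)), 2):
        sim = jac.get_sim_score(tokens[i], tokens[j])
        results.append((i, j, sim, int(sim > thresh)))
    return pd.DataFrame(results, columns=['i', 'j', 'sim', 'label'])

print(pairwise_labels(3, 0.5))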

pandas: calculate overlapping words between rows only if values in another column match (issue with multiple instances)

I have a dataframe that looks like the following, but with many rows:
import pandas as pd

data = {'intent': ['order_food', 'order_food', 'order_taxi', 'order_call',
                   'order_call', 'order_call', 'order_taxi'],
        'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab',
                 'call me at 6', 'she called me', 'order call',
                 'i would like a new taxi'],
        'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'],
                      ['call', '6'], ['call'], ['order', 'call'], ['new', 'taxi']]}
df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])
I have calculated the word overlap between rows using the code below (not my solution):
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
and modified the code given by @Amit Amola to compare overlapping words between every possible pair of rows, and created a dataframe out of it:
from itertools import combinations

overlapping_word_list = []
for val in list(combinations(range(len(data_new)), 2)):
    overlapping_word_list.append(
        f"the shared keywords between {data_new.iloc[val[0], 0]} and "
        f"{data_new.iloc[val[1], 0]} sentences are: "
        f"{lexical_overlap(data_new.iloc[val[0], 1], data_new.iloc[val[1], 1])}")
# creating an overlap dataframe
banking_overlapping_words_per_sent = pd.DataFrame(overlapping_word_list,
                                                  columns=['overlapping_list'])
@gold_cy's answer has helped me, and I made some changes to it to get the output I like:
for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the columns
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = rows
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
The issue is that when there are more instances of the same intent, I run into the error:
ValueError: too many values to unpack (expected 2)
and I do not know how to handle that for the many more examples I have in my dataset.
Do you want this? (The ValueError itself comes from the line x, y = rows, which tries to unpack every row for that intent at once; it should unpack combo instead.)
from itertools import combinations
from operator import itemgetter

items_to_consider = []
for item in list(combinations(zip(df.Sent.values, map(set, df.key_words.values)), 2)):
    keywords = list(map(itemgetter(1), item))
    intersect = keywords[0].intersection(keywords[1])
    if len(intersect) > 0:
        str_list = list(map(itemgetter(0), item))
        str_list.append(intersect)
        items_to_consider.append(str_list)
for i in items_to_consider:
    for item in i[2]:
        if item in i[0] and item in i[1]:
            print(f"Overlap of intent (order_food) for ({i[0]}) and ({i[1]}) is {item}")

Counting term frequency in list of strings in pd dataframe

I have a dataframe where one column contains the lemmatized words of a paragraph. I wish to count the frequency of each word within the whole dataframe, not just within each record. There are over 40,000 records, so the computation has to be quick and stay within my RAM limits.
For example, this basic input:
ID lemm
1 ['test','health']
2 ['complete','health','science']
would have this desired output:
'complete':1
'health':2
'science':1
'test':1
This is my current code:
from collections import Counter

cnt = Counter()
for entry in df.lemm:
    for word in entry:
        cnt[word] += 1
cnt
This works when I manually enter a list of lists of strings (e.g. [['completing', 'dog', 'cat'], ['completing', 'degree', 'health', 'health']]), but not when it iterates through the df.
I have also tried this:
import nltk

top_N = 20
word_dist = nltk.FreqDist(df_main.stem)
print('All frequencies')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)
to return the top 20 terms, but the output lists the frequencies of terms within each entry, not across the entire dataframe.
Any help would be appreciated!
You can try explode if you have pandas 0.25+:
df.lemm.explode().value_counts()
Or maybe you can change the Counter line to:
cnt = Counter(word for entry in df.lemm for word in entry)
Refer to: How to find the lemmas and frequency count of each word in list of sentences in a list?
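If you build cnt that way, the top-20 table the question asks for is then just:
rslt = pd.DataFrame(cnt.most_common(20), columns=['Word', 'Frequency'])
print(rslt)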
Assuming your column names and input data:
import pandas as pd

data = {
    "ID": [1, 2],
    "lemm": [['test', 'health'], ['complete', 'health', 'science']]
}
df = pd.DataFrame(data)
freq = df.explode("lemm").groupby(["lemm"]).count().rename(columns={"ID": "Frequency"})
Output:
          Frequency
lemm
complete          1
health            2
science           1
test              1
from collections import Counter

cnt = df.apply(lambda x: Counter(x['lemm']), axis=1).sum()
will do it for you. That makes cnt a Counter object, so you can call most_common() on it or use anything else Counter offers.

index 0 is out of bounds for axis 0 with size 0 Python

PLEASE READ:
I have looked at all the other answers related to this question and none of them solve my specific problem, so please carry on reading below.
I have the code below. What it basically does is keep the Title column and concatenate the rest of the columns into one in order to be able to create a cosine similarity matrix.
The main point is the recommendations function, which is supposed to take a title as input and return the top 10 matches based on that title, but what I get at the end is the error index 0 is out of bounds for axis 0 with size 0, and I have no idea why.
import pandas as pd
import numpy as np
from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7')
df = df[['Title', 'Genre', 'Director', 'Actors', 'Plot']]
df.head()

df['Key_words'] = ""
for index, row in df.iterrows():
    plot = row['Plot']
    # instantiating Rake; by default it uses english stopwords from NLTK
    # and discards all punctuation characters as well
    r = Rake()
    # extracting the words by passing the text
    r.extract_keywords_from_text(plot)
    # getting the dictionary with key words as keys and their scores as values
    key_words_dict_scores = r.get_word_degrees()
    # assigning the key words to the new column for the corresponding movie
    row['Key_words'] = list(key_words_dict_scores.keys())

# dropping the Plot column
df.drop(columns=['Plot'], inplace=True)

# instantiating and generating the count matrix
df['bag_of_words'] = df[df.columns[1:]].apply(lambda x: ' '.join(x.astype(str)), axis=1)
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])

# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim

indices = pd.Series(df.index)
# defining the function that takes in a movie title
# as input and returns the top 10 recommended movies
def recommendations(title, cosine_sim=cosine_sim):
    # initializing the empty list of recommended movies
    recommended_movies = []
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    print('idx is', idx)
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
    return recommended_movies
This line:
idx = indices[indices == title].index[0]
will fail if you do not return a match:
df.loc[df['Title']=='This is not a valid title'].index[0]
returns:
IndexError: index 0 is out of bounds for axis 0 with size 0
You need to confirm that the title you are passing in is actually in the DF before trying to access any data associated with it:
def recommendations(title, cosine_sim=cosine_sim):
    # initializing the empty list of recommended movies
    recommended_movies = []
    # note: `in` on a Series checks the index, so test against the values
    if title not in indices.values:
        raise KeyError("title is not in indices")
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    print('idx is', idx)
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
    return recommended_movies
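Beyond that guard, the likely root cause is that indices was built as pd.Series(df.index), i.e. a series of integer positions, so indices == title can never be true for a string title. A minimal sketch of a fix, assuming titles are unique:
# map each title to its integer row position so title lookups can succeed
indices = pd.Series(df.index, index=df['Title'])
idx = indices[title]
This also makes the membership test behave as intended, since "in" on a Series checks its index, and the index now holds the titles.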
This expression also seems to be doing nothing:
for index, row in df.iterrows():
    plot = row['Plot']
If you just want a single Plot record to develop against, try:
plot = df['Plot'].sample(n=1)
Finally, recommendations uses the global variable indices. In general this is bad practice: if indices changes outside the scope of recommendations, the function might break. I would consider refactoring this to be a little less brittle overall.

Python Import data dictionary and pattern

If I have data as:
Code, data_1, data_2, data_3, [....], data204700
a,1,1,0, ... , 1
b,1,0,0, ... , 1
a,1,1,0, ... , 1
c,0,1,0, ... , 1
b,1,0,0, ... , 1
etc. — the same code can appear with different values (0, 1, or ? for unknown).
I need to create a big matrix that I want to analyze.
How can I import the data into a dictionary?
I want to use a dictionary for the columns (204,700 + 1).
Is there a built-in function (or package) that returns a pattern to me?
I expect a percentage pattern, e.g. 90% of 1s in column 1, 80% in column 2.
Alright, so I am going to assume you want this in a dictionary for storage purposes, and I will tell you that you don't want that with this kind of data: use a pandas DataFrame.
This is how you get your data into a dataframe:
import pandas as pd
my_file = 'file_name'
df = pd.read_csv(my_file)
Now you don't need a package to find the pattern you are looking for; just write a simple function that returns it:
def one_percentage(data):
    # get the total number of rows for calculating percentages
    size = len(data)
    # get the dtype of a data column so we only grab the right columns
    x = data.columns[1]
    x = data[x].dtype
    # list of tuples holding the column names and the number of 1s
    ones = [(i, sum(data[i])) for i in data if data[i].dtype == x]
    my_dict = {}
    # build a dictionary mapping column names to the fraction of 1s
    for x in ones:
        percent = x[1] / float(size)
        my_dict[x[0]] = percent
    return my_dict
Now if you want the fraction of 1s in any one column, this is what you do:
percentages = one_percentage(df)
column_name = 'any_column_name'
print(percentages[column_name])
Now if you want it for every single column, you can loop over all of the column names:
for name in percentages:
    print("{:.0%} of 1s in column {}".format(percentages[name], name))
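As a pandas-native alternative sketch (assuming the data columns are strictly 0/1 and that ? marks unknowns), the fraction of 1s per column is just a mean, since mean() skips NaN:
import pandas as pd

df = pd.read_csv(my_file, na_values=['?'])    # read '?' entries as NaN
fractions = df.drop(columns=['Code']).mean()  # per-column fraction of 1s
print(fractions.map('{:.0%}'.format))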
Let me know if you need anything else!
