convert bi-gram dataframe to vectors - python

I have a text data-set (SMSCollection) in which every line is a document (one or more sentences). Using the code below, I got its bi-grams as a pandas Series (I actually got the code from the internet, but unfortunately I forgot the link):
import re
import unicodedata
import nltk
import pandas as pd

# nltk.download('stopwords'); nltk.download('wordnet')  # needed once
ADDITIONAL_STOPWORDS = []  # placeholder; add any extra stop words here

news = pd.read_csv('/content/drive/MyDrive/data.txt', encoding='utf-8',
                   lineterminator='\n', names=['text'], index_col=False)

def basic_clean(text):
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    text = (unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
            .decode('utf-8', 'ignore').lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]

words = basic_clean(''.join(str(news['text'].tolist())))
bigram_seri = pd.Series(nltk.ngrams(words, 2)).value_counts()
The result is something like this:
(prize, guaranteed) 22
(customer, service) 20
(1000, cash) 20
(urgent, mobile) 18
(every, week) 18
(show, 800) 18
(send, stop) 17
(valid, 12hrs) 17
(account, statement) 16
(land, line) 16
(free, entry) 15
(identifier, code) 15
(dating, service) 15
....
Finally I want to plot "Top bi-grams found in the database" using the plotly module from this link, based on this project, which uses a vectorized bi-gram data-set like this. But I do not know how to construct such a vectorized bi-gram data-frame (which I then need to feed to the t-SNE algorithm). Actually I don't know what this data-set is or how to make it.
The GitHub project is available here. The data-set used in that project is different from mine.
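One common way to get such a vectorized bi-gram matrix (which can then be fed to t-SNE) is scikit-learn's CountVectorizer with ngram_range=(2, 2). The following is only a minimal sketch, not the approach from the linked project, and it assumes the raw texts live in news['text'] as in the code above:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

docs = news['text'].astype(str)                    # one document per SMS line
vec = CountVectorizer(ngram_range=(2, 2), stop_words='english')
X = vec.fit_transform(docs)                        # sparse (n_docs, n_bigrams) count matrix

# Keep only the 100 most frequent bi-gram columns before running t-SNE,
# since t-SNE needs a dense array
top_cols = np.asarray(X.sum(axis=0)).ravel().argsort()[::-1][:100]
dense = X[:, top_cols].toarray()
coords = TSNE(n_components=2, random_state=0).fit_transform(dense)
The resulting coords array (one 2-D point per document) is what would typically be passed to a plotly scatter plot.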

Related

Twitter Sentiment Analysis in Python, Lemmatization in Pandas

I am trying to produce a very simple Twitter sentiment analysis. I have so far been able to pre-process my tweets; however, I am greatly struggling to lemmatize within my data frame. This is my code so far:
import nltk
import pandas as pd
from nltk.corpus import stopwords # Importing Natural Language Toolkit
from nltk.stem import WordNetLemmatizer
df = pd.read_csv(r'/Users/sarfrazkhan/Desktop/amazon.csv') # Loading Amazon data set into code
df = df['x'].str.replace(r'http\S+|www.\S+', '', case=False) # Removing URL's from data set
df = df.str.replace(r'\<.*\>', '') # Removing noise contained in '< >' parentheses
df = df.str.replace('RT ', '', case=False) # Removing the phrase 'RT' from all strings
df = df.str.replace(r'#[^\s]+', '', case=False) # Removing '#' and the following twitter handle from strings
df = df.str.replace(r'[^\w\s]', ' ') # Removing any punctuation
df = df.str.replace(r'\r\n', ' ') # Removing '\r\n' which is present in some strings
df = df.str.replace(r'\d+', '').str.lower().str.strip() # Removing numbers, capitalisation and whitespace
df = df.apply(nltk.word_tokenize) # Tokenizing data set
stop = nltk.download('stopwords') # Downloading stop words
stop = set(stopwords.words('english')) # Selecting English stop words
df = df.apply(lambda x: [item for item in x if item not in stop]) # Removing stop words from each string
lemmatizer = WordNetLemmatizer()
lemma_words = [lemmatizer.lemmatize(w, pos='a') for w in df]
I am struggling to get my lemmatizer to work and am constantly met with errors, possibly because my dataset is in list form (which I am struggling to work around). The Excel file which I am trying to process is simply a long list of tweets with the heading 'x'. You can see on line 6 of my code that I focus primarily on this column, however I'm unsure if this is the correct way to do it!
My expected outcome would be a list of words which have been lemmatised correctly within their respective rows, to which I can then carry out a sentiment analysis.
These are the first few lines of my data frame before attempting the lemmatising process:
1 [swinging, pendulum, wall, clock, love, give, ...
2 [enter, via, gleam, l]
3 [screw, every, follow, gets, nude, dms, dm, pr...
4 [bishop, law, coming, soon, bishop, series, bo...
5 [adventures, bella, amp, emily, book, series, ...
6 [written, books, various, genres, amazon, kind...
7 [author, books, amwriting, fantasy, mystery, p...
8 [wonderful, mentor, recent, times, graham, kee...
9 [available, amazon, ebay, disabilities, hidden...
10 [screw, every, follow, gets, nude, dms, dm, pr...
Your code is trying to lemmatize an actual list, hence the error.
... for w in df -> here, w is a whole list, rather than each element of each list.
To get around this you could use pandas apply to pass each row to a function (assuming df is a pd.DataFrame and not a pd.Series; if it's a Series and the below doesn't work, try df = df.to_frame() first):
def df_lemmatize(row):
    lemmatizer = WordNetLemmatizer()
    row.at['lemma_words'] = [lemmatizer.lemmatize(w, pos='a') for w in row.x]
    return row

df = df.apply(df_lemmatize, axis=1)
df_lemmatize will iterate over each element in the list, lemmatize it and then add the new list to a new column lemma_words.
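If df is still a plain Series of token lists rather than a DataFrame, the same idea can be written without axis=1. A minimal, self-contained sketch on toy data (not the original Amazon file):
import pandas as pd
from nltk.stem import WordNetLemmatizer
# requires nltk.download('wordnet') the first time

# Toy stand-in for the pre-processed tweets: one list of tokens per row
df = pd.Series([['swinging', 'clocks', 'loved'], ['entered', 'via', 'gleam']])

lemmatizer = WordNetLemmatizer()
# Lemmatize each token inside each row's list
lemma_words = df.apply(lambda tokens: [lemmatizer.lemmatize(w, pos='a') for w in tokens])
print(lemma_words)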

ways to improve efficiency of Python script

I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes), and 12 files corresponding to DNA reads. I have a Python script that searches for reads overlapping each gene's coordinates and stores the values in a dictionary. I then use this dictionary to create a pandas dataframe and save it as a CSV. (I will be using these to create a scatterplot.)
The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):
MSTRG.38 NC_008066.1 9204 9987 48395.347656
MSTRG.36 NC_008066.1 7582 8265 47979.933594
MSTRG.33 NC_008066.1 5899 7437 43807.781250
MSTRG.49 NC_008066.1 14732 15872 26669.763672
MSTRG.38 NC_008066.1 8363 9203 19514.273438
MSTRG.34 NC_008066.1 7439 7510 16855.662109
And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):
JQ673480.1 697 778 SRX6359746.5505370/2 8 +
JQ673480.1 744 824 SRX6359746.5505370/1 8 -
JQ673480.1 1712 1791 SRX6359746.2565519/2 27 +
JQ673480.1 3445 3525 SRX6359746.7028440/2 23 -
JQ673480.1 4815 4873 SRX6359746.6742605/2 37 +
JQ673480.1 5055 5092 SRX6359746.5420114/2 40 -
JQ673480.1 5108 5187 SRX6359746.2349349/2 24 -
JQ673480.1 7139 7219 SRX6359746.3831446/2 22 +
The RNA file has >9,000 lines, and the DNA files have > 12,000,000 lines.
I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage one DNA file at a time, like so:
# import modules
import csv
import pandas as pd
import matplotlib.pyplot as plt

# set sample name
sample = 'CON-2'
# set fraction number
f = 6
# dictionary to store values
d = {}

# load file name into variable
fileRNA = "top500_R8_7-{}-RNA.gtf".format(sample)
print(fileRNA)
# read tsv file
tsvRNA = open(fileRNA)
readRNA = csv.reader(tsvRNA, delimiter="\t")
expGenes = []
# convert tsv file into Python list
for row in readRNA:
    gene = row[0], row[1], row[2], row[3], row[4]
    expGenes.append(gene)
# print(expGenes)

# establish file name for DNA reads
fileDNA = "D2_7-{}-{}.bed".format(sample, f)
print(fileDNA)
tsvDNA = open(fileDNA)
readDNA = csv.reader(tsvDNA, delimiter="\t")
# put file into Python list
MCNgenes = []
for row in readDNA:
    read = row[0], row[1], row[2]
    MCNgenes.append(read)

# find read counts
for r in expGenes:
    # include FPKM in the dictionary
    d[r[0]] = [r[4]]
    regionCount = 0
    # set start and stop points based on transcript file
    chrom = r[1]
    start = int(r[2])
    stop = int(r[3])
    # print("start:", start, "stop:", stop)
    for row in MCNgenes:
        if start < int(row[1]) < stop:
            regionCount += 1
    d[r[0]].append(regionCount)

df = pd.DataFrame.from_dict(d)
# convert to heatmap
df.to_csv("7-CON-2-6_forHeatmap.csv")
This script also runs quite slowly, however. Are there any changes I can make to get it to run more efficiently?
If I understood correctly, you are trying to match coordinates of genes between different files; I believe the best option would be to use something like a KDTree partitioning algorithm.
You can use KDTree to partition your DNA and RNA data. I'm assuming you're using 'start' and 'stop' as coordinates:
import pandas as pd
import numpy as np
from sklearn.neighbors import KDTree

dna = pd.DataFrame()  # this is your dataframe with DNA data
rna = pd.DataFrame()  # same for RNA

# Let's assume you are using 'start' and 'stop' columns as coordinates
dna_coord = dna.loc[:, ['start', 'stop']]
rna_coord = rna.loc[:, ['start', 'stop']]

dna_kd = KDTree(dna_coord)
rna_kd = KDTree(rna_coord)

# Now you can go through your data and match it against the DNA tree:
my_data = pd.DataFrame()
for start, stop in zip(my_data.start, my_data.stop):
    coord = np.array([[start, stop]])
    dist, idx = dna_kd.query(coord, k=1)
    # Assuming you need an exact match
    if np.isclose(dist, 0):
        # Now that you have the positional index of the matching row in the
        # DNA data, you can extract information using it and do whatever
        # you want with it
        dna_gene_data = dna.iloc[idx[0][0], :]
You can adjust your search parameters to get the desired results, but this will be much faster than searching every time.
Generally, Python is extremely easy to work with, at the cost of being inefficient. Scientific libraries (such as pandas and numpy) help here by paying the Python overhead only a limited number of times to map the work into a convenient space, then doing the "heavy lifting" in a more efficient language (which may be quite painful/inconvenient to work with directly).
General advice
try to get data into a dataframe whenever possible and keep it there (do not convert data into some intermediate Python object like a list or dict)
try to use methods of the dataframe or parts of it to do work (such as .apply() and .map()-like methods)
whenever you must iterate in native Python, iterate on the shorter side of a dataframe (i.e. there may be only 10 columns but 10,000 rows; go over the columns) -- a vectorized sketch for your overlap counting follows below
More on this topic here:
How to iterate over rows in a DataFrame in Pandas?
Answer: DON'T*!
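To make the "keep it in pandas/numpy" advice concrete for this problem, here is a minimal sketch (file names, separators and column names are assumptions, not taken from your script) that counts, for each gene, how many reads start strictly inside its interval using two binary searches per gene instead of a nested Python loop:
import pandas as pd
import numpy as np

# Hypothetical inputs: adjust file names / column names to your data
rna = pd.read_csv("rna.gtf", sep="\t", header=None,
                  names=["gene", "chrom", "start", "stop", "fpkm"])
dna = pd.read_csv("dna.bed", sep="\t", header=None,
                  names=["chrom", "start", "stop", "read", "coverage", "strand"])

# Sort read start positions once, then count reads whose start falls in
# (gene_start, gene_stop) with two binary searches per gene
read_starts = np.sort(dna["start"].values)
lo = np.searchsorted(read_starts, rna["start"].values, side="right")
hi = np.searchsorted(read_starts, rna["stop"].values, side="left")
rna["read_count"] = hi - lo

rna[["gene", "fpkm", "read_count"]].to_csv("gene_read_counts.csv", index=False)
Like the inner loop in the original script, this ignores the chromosome column; if reads from different chromosomes must not be mixed, the same idea can be applied per chromosome group.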
Once you have a program, you can benchmark it by collecting runtime information. There are many libraries for this, but there is also a builtin one called cProfile which may work for you.
docs: https://docs.python.org/3/library/profile.html
python3 -m cProfile -o profile.out myscript.py
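The resulting profile can then be inspected with the builtin pstats module, for example:
import pstats

p = pstats.Stats("profile.out")
# Show the 10 functions with the largest cumulative time
p.sort_stats("cumulative").print_stats(10)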

Properly calculate cosine similarities for low memory on large datasets?

I am following this tutorial here to just learn a bit about content recommenders: https://www.datacamp.com/community/tutorials/recommender-systems-python
but I ran into a MemoryError when running the "content based" part of the tutorial. Upon some reading I found that this has to do with just how large the dataset being used is. I couldn't really find an exact way to run this specific case with low memory, so instead I modified it a little bit: I split the original dataframe into 6 pieces, ran the cosine similarity calculation for each split dataframe, merged the results together, then ran the calculation one last time to get a final result. Here is my code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, indices, cosine_sim, final=False):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    if not final:
        return metadata.iloc[movie_indices, :]
    else:
        return metadata['title'].iloc[movie_indices]
# Load Movies Metadata
metadata = pd.read_csv('dataset/movies_metadata.csv', low_memory=False)
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')
#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')
split_db = np.array_split(metadata, 6)
source_db = None
search_db = None
db_remove_idx = None
new_db_list = list()
for x, db in enumerate(split_db):
    search = db.loc[db['title'] == 'The Dark Knight Rises']
    if not search.empty:
        source_db = db
        new_db_list.append(source_db)
        search_db = search
        db_remove_idx = x
        break

split_db.pop(db_remove_idx)

for x, db in enumerate(split_db):
    new_db_list.append(db.append(search_db, ignore_index=True))
del(split_db)
refined_db = None
for db in new_db_list:
    small_db = db.reset_index()
    # Construct the required TF-IDF matrix by fitting and transforming the data
    tfidf_matrix = tfidf.fit_transform(small_db['overview'])
    # Compute the cosine similarity matrix
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    # cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    # Construct a reverse map of indices and movie titles
    indices = pd.Series(small_db.index, index=small_db['title']).drop_duplicates()
    result = get_recommendations('The Dark Knight Rises', indices, cosine_sim)
    if type(refined_db) != pd.core.frame.DataFrame:
        refined_db = result.append(search_db, ignore_index=True)
    else:
        refined_db = refined_db.append(result, ignore_index=True)
final_db = refined_db.reset_index()
#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(final_db['overview'])
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
#Construct a reverse map of indices and movie titles
indices = pd.Series(final_db.index, index=final_db['title']).drop_duplicates()
final_result = (get_recommendations('The Dark Knight Rises', indices, cosine_sim, final=True))
print(final_result)
I thought this would work, but the results are not even close to what is given in the tutorial:
11 Dracula: Dead and Loving It
13 Nixon
12 Balto
15 Casino
20 Get Shorty
18 Ace Ventura: When Nature Calls
14 Cutthroat Island
16 Sense and Sensibility
19 Money Train
17 Four Rooms
Name: title, dtype: object
Could anyone explain what I am doing wrong here? I figured that since the dataset was too large, splitting it up, running this "cosine similarity" process first as a refinement, and then running the process again on the combined result would give a similar outcome. So why is the result I am getting so different from what is expected?
And this is the data I am using: https://www.kaggle.com/rounakbanik/the-movies-dataset/data

time series trend recognition in python

I have a CSV containing selling figures for various dates.
Here is an example of the file:
DATE, ARTICLENO, QUANTITY
2018-07-17, 101, 50
2018-07-16, 101, 55
2018-07-16, 105, 36
2018-07-15, 105, 23
I read this into a pandas dataframe and ran a basic k-means algorithm on it, but I need more help.
Data description:
The date column is the index of the dataframe and describes the date for the selling value. There are multiple (Date, Quantity, ArticleNo) tuples, so there is a time series for each article number. These can have different lengths and starting dates, which makes predicting and recognizing trends (e.g. good sales in summer or winter) even harder. The CSV is sorted by ArticleNo and Date.
Goal:
Cluster a given set of data from a CSV and create labels for articles that sell well in summer or winter (seasonal trends), and match future articles to them.
Here is what I did so far (currently I do not have the date as index yet, but that is the goal):
from __future__ import absolute_import, division, print_function

import pandas as pd
import numpy as np
from matplotlib import pyplot as plp
from sklearn import preprocessing
from sklearn.cluster import KMeans
import sys


def extract_articles(data, article_numbers):
    return pd.DataFrame(
        [
            data[data['ARTICLENO'] == article_no]['QUANTITY'].values
            for article_no in article_numbers
        ]
    ).fillna(0)


def read_csv_file(file_name, number_of_lines):
    return pd.read_csv(file_name, parse_dates=['DATE'],
                       nrows=number_of_lines)


def get_unique_article_numbers(data):
    return data['ARTICLENO'].unique()


def main():
    data = read_csv_file('statistic.csv', 400000)
    modeling_article_numbers = get_unique_article_numbers(data)
    print("Clustering on", len(modeling_article_numbers), "article numbers")

    modeling_data = extract_articles(data, modeling_article_numbers)
    modeling_data = modeling_data.iloc[:50, :]
    # 'switch' dataframe
    modeling_data = modeling_data.T
    modeling_data = modeling_data.pct_change().fillna(0)
    normalized_modeling_data = preprocessing.normalize(modeling_data,
                                                       norm='l2', axis=0)
    print(modeling_data)

    predicting_article_numbers = [30079229, 30079854, 30086845]
    predicting_article_data = extract_articles(data,
                                               predicting_article_numbers)
    predicting_article_data = predicting_article_data.pct_change().fillna(0)
    normalized_predicting_article_data = preprocessing.normalize(
        predicting_article_data, norm='l2'
    )

    kmeans = KMeans(n_clusters=5,
                    random_state=0).fit(normalized_modeling_data)
    print(kmeans.labels_)

    # for data, article_no in [
    #         (normalized_predicting_article_data, 430079229),
    #         (normalized_predicting_article_data, 430079854),
    #         (modeling_data, 430074590),
    # ]:
    #     print('Predicting article {0}'.format(article_no))
    #     print(kmeans.predict([data[0]]))

    for i, cluster_center in enumerate(kmeans.cluster_centers_):
        plp.plot(cluster_center, label='Center {0}'.format(i))

    plp.legend(loc='best')
    plp.title('Cluster based on ' + str(len(modeling_article_numbers)) +
              ' article numbers')
    plp.show()


main()
I transposed the dataframe because it did not contain the series for each article number along axis 1.
My question is: how can I get the 'description' of a label? Can I name the labels?
Maybe k-means is the wrong algorithm for my intentions?
Have you tried making each article a row in your dataset?
I'm not sure from reading your question whether you did.
After you do that, you can aggregate your data, e.g. as quantity sold per week. If you have more than one year of data, use the average quantity per week. That gives you a table with 52 features for every article ({week 1: sold 500; week 2: sold 520; ...}).
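A minimal sketch of that weekly aggregation in pandas (the column names follow the CSV shown above; pandas >= 1.1 is assumed for dt.isocalendar()):
import pandas as pd

df = pd.read_csv('statistic.csv', parse_dates=['DATE'])

# Quantity sold per article and per ISO calendar week,
# averaged across years if the data spans more than one year
weekly = (df.assign(week=df['DATE'].dt.isocalendar().week)
            .groupby(['ARTICLENO', 'week'])['QUANTITY']
            .mean()
            .unstack(fill_value=0))   # one row per article, one column per week

print(weekly.head())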
I don't think k-means is what you are looking for, because you know pretty well what you want, and that makes you a good "teacher" for your algorithm; ergo: use supervised algorithms.
Therefore you need to label at least some (ideally all) of your aggregated product data by hand, but it should be worth the work due to better results.
You could also look into time-series seasonality analysis / time series decomposition.
Anyway, if you are familiar with scikit-learn, I would give the supervised algorithms (Decision Trees, Random Forest, SVM, MLPClassifier ...) a chance; it might be much easier to accomplish.
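For the supervised route, a small self-contained sketch (the weekly table and labels here are synthetic stand-ins; the real labels would come from your hand labeling):
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the article x week table from the aggregation sketch above
rng = np.random.default_rng(0)
weekly = pd.DataFrame(rng.integers(0, 100, size=(60, 52)),
                      columns=['week_{}'.format(i) for i in range(1, 53)])
# Hand-made seasonal labels, one per article (hypothetical)
labels = rng.choice(['summer', 'winter', 'none'], size=len(weekly))

X_train, X_test, y_train, y_test = train_test_split(weekly, labels,
                                                    test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
# clf.predict(...) can then assign a seasonal label to future articles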

Python/Pandas aggregation combined with NLTK

I want to do some text processing on a dataset containing Twitter messages. So far I'm able to load the data (.CSV) in a Pandas dataframe and index that by a (custom) column 'timestamp'.
df = pandas.read_csv(f)
df.index = pandas.to_datetime(df.pop('timestamp'))
Looks a bit like this:
user_name user_handle
timestamp
2015-02-02 23:58:42 Netherlands Startups NLTechStartups
2015-02-02 23:58:42 shareNL share_NL
2015-02-02 23:58:42 BreakngAmsterdamNews iAmsterdamNews
[49570 rows x 8 columns]
I can create a new object (Series) containing just the text like so:
texts = pandas.Series(df['text'])
Which creates this:
2015-06-02 14:50:54 Business Update Meer cruiseschepen dan ooit in...
2015-06-02 14:50:53 RT #ProvincieNH: Provincie maakt Markermeerdij...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat: In ...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat http...
2015-06-02 14:50:53 Lugar secreto em Amsterdam: Begijnhof // Hidde...
Name: text, Length: 49570
1. Is this new object of the same sort of type (dataframe) as my initial df variable, just with different columns/rows?
Now, together with the nltk toolkit, I'm able to tokenize the strings using this:
for w in texts:
    print(nltk.word_tokenize(w))
This iterates the array instead of mapping the 'text' column to a multiple-column 'words' array. 2. How would I do this, and moreover, how do I then count the occurrences of each word?
I know there is a unique() method which I could use to create a distinct list of words. But then again I'd need an extra column which is a count over the array which I'm unable to produce in the first place. :) 3. Or would the next step towards 'counting' occurrences of those words be grouping?
EDIT. 3: I seem to need "CountVectorizer", thanks EdChum
from sklearn.feature_extraction.text import CountVectorizer

documents = df['text'].values
vectorizer = CountVectorizer(min_df=0, stop_words=[])
X = vectorizer.fit_transform(documents)
print(X.toarray())
My main goal is to count the occurrences of each word and select the top X results. I feel I'm on the right track, but I can't get the final steps just right.
Building on EdChum's comments, here is a way to get the (I assume global) word counts from CountVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
df = pd.DataFrame({'text': ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
                            'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
                   'class': ['a', 'a', 'a', 'a', 'c', 'c', 'b', 'e']})
X = vect.fit_transform(df['text'].values)
y = df['class'].values
Convert the sparse matrix returned by CountVectorizer to a dense matrix, and pass it and the feature names to the DataFrame constructor. Then transpose the frame and sum along axis=1 to get the total per word:
word_counts = pd.DataFrame(X.todense(), columns=vect.get_feature_names()).T.sum(axis=1)
# on scikit-learn >= 1.0, use vect.get_feature_names_out() instead
word_counts = word_counts.sort_values(ascending=False)
word_counts[:3]
If all you are interested in is the frequency distribution of the words, consider using FreqDist from NLTK:
import nltk
import itertools
from nltk.probability import FreqDist
texts = ['cat on the cat','angel eyes has','blue red angel','one two blue','blue whales eat','hot tin roof','angel eyes has','have a cat']
texts = [nltk.word_tokenize(text) for text in texts]
# collapse into a single list
tokens = list(itertools.chain(*texts))
FD =FreqDist(tokens)
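The top X words can then be read straight off the FreqDist (it behaves like a collections.Counter):
print(FD.most_common(3))  # the 3 most frequent (word, count) pairs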
