I'm creating a program that calculates word embeddings for the lyrics of many songs in a dataset (almost 6k songs).
The main issue is that the calculation takes far too long.
Is there a way to speed up the process?
The code is the following:
import os
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load every CSV file in the songs/ folder into one DataFrame
songs = []
for filename in os.listdir('songs'):
    df = pd.read_csv('songs/'+filename, index_col=None, header=0)
    songs.append(df)
songs = pd.concat(songs, axis=0, ignore_index=True)
songs = songs.drop(columns=['Album', 'Date','Unnamed: 0'])

import spacy
nlp = spacy.load('en_core_web_lg')
from tqdm.notebook import tqdm

songEmbeddings = []
for i in tqdm(range(len(songs))):
    text = nlp(songs.iloc[i]['Lyric'])
    # lemmatize_token is a helper that is not shown in the question
    tokens = [lemmatize_token(token) for token in text if not token.is_stop and not token.is_punct]
    wordEmbeddings = []
    for t in tokens:
        wordEmbeddings.append(nlp(t).vector)
    emb = np.mean(wordEmbeddings,0)
    songEmbeddings.append(emb)
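One common way to speed up a loop like the one above is to stop calling nlp again for every token and to process the lyrics in batches with nlp.pipe. The sketch below is an assumption about what would help, not a measured fix. The function name song_embeddings and the batch_size value are made up for illustration. It uses the vector of each original token rather than the vector of the lemmatized text, so results can differ slightly from the code above, and it falls back to a zero vector for lyrics with no usable tokens.

```python
import numpy as np


def song_embeddings(nlp, lyrics, batch_size=64):
    """Return one mean word vector per lyric.

    `nlp` is expected to be a loaded spaCy pipeline (for example the
    en_core_web_lg model from the question); `lyrics` is an iterable of str.
    """
    embeddings = []
    # nlp.pipe processes the texts in batches instead of one call per song
    for doc in nlp.pipe(lyrics, batch_size=batch_size):
        # Reuse the vectors spaCy already attached to each token,
        # instead of running the whole pipeline again for every token
        vectors = [
            token.vector
            for token in doc
            if not token.is_stop and not token.is_punct
        ]
        if vectors:
            embeddings.append(np.mean(vectors, axis=0))
        else:
            # Hypothetical fallback: no usable tokens -> zero vector
            embeddings.append(np.zeros(nlp.vocab.vectors_length))
    return embeddings
```

With the DataFrame from the question this could be called as songEmbeddings = song_embeddings(nlp, songs['Lyric'].astype(str)). Whether it is fast enough for 6k songs is worth timing on a small subset first.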
So I'm learning PySpark by playing around with the DMOZ dataset in a Jupyter notebook attached to an EMR cluster. The process I'm trying to achieve is as follows:
1. Load a csv with the location of files in an s3 public dataset into a PySpark DataFrame (~130k rows)
2. Map over the DF with a function that retrieves the file contents (html) and rips the text
3. Join the output with the original DF as a new column
4. Write the joined DF to s3 (the problem: it seems to hang forever; it's not a large job and the output json should only be a few gigs)
All of the writing is done in a function called run_job()
I let it sit for about 2 hours on a cluster with 10 m5.8xlarge instances, which should be enough (?). All of the other steps execute fine on their own, except for the df.write(). I have tested on a much smaller subset and it wrote to s3 with no issue, but when I go to do the whole file it seemingly hangs at "0/n jobs complete."
I am new to PySpark and distributed computing in general, so it's probably a simple "best practice" that I am missing. (Edit: Maybe it's in the config of the notebook? I'm not using any magics to configure spark currently, do I need to?)
Code below...
import html2text
import boto3
import botocore
import os
import re
import zlib
import gzip
from bs4 import BeautifulSoup as bs
from bs4 import Comment
# from pyspark import SparkContext, SparkConf
# from pyspark.sql import SQLContext, SparkSession
# from pyspark.sql.types import StructType, StructField, StringType, LongType
import logging

def load_index():
    input_file='s3://cc-stuff/uploads/DMOZ_bussineses_ccindex.csv'
    df = spark.read.option("header",True) \
        .csv(input_file)
    #df = df.select('url_surtkey','warc_filename', 'warc_record_offset', 'warc_record_length','content_charset','content_languages','fetch_time','fetch_status','content_mime_type')
    return df

def process_warcs(id_,iterator):
    html_textract = html2text.HTML2Text()
    html_textract.ignore_links = True
    html_textract.ignore_images = True
    no_sign_request = botocore.client.Config(signature_version=botocore.UNSIGNED)
    s3client = boto3.client('s3', config=no_sign_request)
    text = None
    s3pattern = re.compile('^s3://([^/]+)/(.+)')
    PREFIX = "s3://commoncrawl/"
    for row in iterator:
        try:
            start_byte = int(row['warc_record_offset'])
            stop_byte = (start_byte + int(row['warc_record_length']))
            s3match = s3pattern.match((PREFIX + row['warc_filename']))
            bucketname = s3match.group(1)
            path = s3match.group(2)
            #print('Bucketname: ',bucketname,'\nPath: ',path)
            resp = s3client.get_object(Bucket=bucketname, Key=path, Range='bytes={}-{}'.format(start_byte, stop_byte))
            content = resp['Body'].read()#.decode()
            data = zlib.decompress(content, wbits = zlib.MAX_WBITS | 16).decode('utf-8',errors='ignore')
            data = data.split('\r\n\r\n',2)[2]
            soup = bs(data,'html.parser')
            for x in soup.findAll(text=lambda text:isinstance(text, Comment)):
                x.extract()
            for x in soup.find_all(["head","script","button","form","noscript","style"]):
                x.decompose()
            text = html_textract.handle(str(soup))
        except Exception as e:
            pass
        yield (id_,text)

def run_job(write_out=True):
    df = load_index()
    df2 = df.rdd.repartition(200).mapPartitionsWithIndex(process_warcs).toDF()
    df2 = df2.withColumnRenamed('_1','idx').withColumnRenamed('_2','page_md')
    df = df.join(df2.select('page_md'))
    if write_out:
        output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML"
        df.coalesce(4).write.json(output)
    return df

df = run_job(write_out=True)
So I managed to make it work. I attribute this to one or both of the 2 changes below. I also changed the hardware configuration and opted for a larger number of smaller instances. Gosh, I just LOVE it when I spend an entire day in a deep state of utter confusion when all I needed to do was add a "/" to the save location...
1. I added a trailing "/" to the output file location in s3.
Old:
    output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML"
New:
    output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML/"
2. I removed the "coalesce" in the "run_job()" function. I have 200 output files now, but it worked and it was super quick (under 1 min).
Old:
    df.coalesce(4).write.json(output)
New:
    df.write.mode('overwrite').json(output)
I have a csv file of 550,000 rows of text. I read it into a pandas dataframe, loop over it, and perform some operation on it. Here is some sample code:
import pandas as pd

def my_operation(row_str):
    #perform operation on row_str to create new_row_str
    return new_row_str

df = pd.read_csv('path/to/myfile.csv')
results_list = []
for ii in range(df.shape[0]):
    my_new_str = my_operation(df.iloc[ii, 0])
    results_list.append(my_new_str)
I started to implement dask.delayed but after reading the Delayed Best Practices section, I am not sure I am using dask.delayed in the most optimal way for this problem. Here is the same code with dask.delayed:
import pandas as pd
import dask

def my_operation(row_str):
    #perform operation on row_str to create new_row_str
    return new_row_str

df = pd.read_csv('path/to/myfile.csv')
results_list = []
for ii in range(df.shape[0]):
    my_new_str = dask.delayed(my_operation)(df.iloc[ii, 0])
    results_list.append(my_new_str)
results_list = dask.compute(*results_list)
I'm running this on a single machine with 8 cores and would like to know if there is a better way to load this large dataset and perform the same operation on each of the rows.
Thanks in advance for your help and let me know what else I can provide!
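The Delayed Best Practices page mentioned above warns against creating a very large number of tiny tasks, which is what one dask.delayed call per row amounts to for 550,000 rows. A minimal sketch of the batching idea follows, written with only the standard library so that it stays self-contained. The names make_batches, process_batch and run_batched, the batch size, and the upper-casing stand-in for my_operation are all assumptions for illustration. The mapper argument is where a parallel map, such as the map method of a concurrent.futures.ProcessPoolExecutor, would be plugged in.

```python
from functools import partial


def make_batches(rows, batch_size):
    """Split a list of rows into consecutive lists of at most batch_size items."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def process_batch(func, batch):
    """Apply func to every row of one batch; a batch is the unit of parallel work."""
    return [func(row) for row in batch]


def run_batched(rows, func, batch_size=10_000, mapper=map):
    """Process rows batch by batch and return the results in the original order.

    mapper defaults to the built-in map (serial). Passing a parallel map with
    the same call shape distributes whole batches instead of single rows.
    """
    batches = make_batches(list(rows), batch_size)
    results = []
    for batch_result in mapper(partial(process_batch, func), batches):
        results.extend(batch_result)
    return results


def my_operation(row_str):
    # Stand-in for the real per-row operation from the question
    return row_str.upper()
```

With the DataFrame from the question, rows would be df.iloc[:, 0].tolist(). Each batch could equally be wrapped in a single dask.delayed(process_batch)(my_operation, batch) call, so that the scheduler sees a few dozen tasks instead of 550,000.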
I'm new to Python and NLTK. I'm trying to prepare text for tokenization using NLTK in Python after I import the text from a csv. There's only one column in the file with free text. I want to isolate that specific column, which I did.... I think.
import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
import re
import unicodedata

pd.set_option('display.max_colwidth',50)
oiw = pd.read_csv(r'C:\Users\tgray\Documents\PythonScripts\Worksheets.csv')
text = oiw.drop(oiw.columns[[1,2,3]],axis=1)
for row in text:
    for text['value'] in row:
        tokens = word_tokenize(row)
print(tokens)
When I run the code, the output it gives me is ['values'] which is the column name. How do I get the rest of the rows to show up in the output?
Sample data I have in the 'values' column:
The way was way too easy to order online.
Everything is great.
It's too easy for me to break.
The output I'm hoping to receive is:
['The','way','was','too','easy','to','order','online','Everything','is','great','It''s','for','me','break']
The correction needs to be made in this segment:
oiw = pd.read_csv(r'C:\Users\tgray\Documents\PythonScripts\Worksheets.csv')
text = oiw.drop(columns=oiw.columns[[1, 2, 3]])  # drop the columns at positions 1, 2 and 3
for row in text['values']:  # iterate over the cells of the column, not over the column names
    tokens = word_tokenize(row)
    print(tokens)  # prints the tokens of each row
print(tokens)  # prints the tokens of the last row only
This way you iterate over the rows of the correct column of the dataframe, rather than over the column names.
I'm loading Excel sheets into Python in order to clean (tokenize, stem et cetera) rows of text. I'm using Pandas to clean each individual line and return a new, cleaned Excel file in the same format as the original. In order for the tokenizer and stemmer to be able to read the Excel file, the Pandas dataframe needs to be in string format.
It more or less works, but the below code splits the text in each row by individual words, resulting in each row only containing one (cleaned) word and not a sentence like the original file. How can I make sure it doesn't split each row of text?
(simplified) code below:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer

tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))
excel = pd.read_excel(open('example.xls', 'rb'))
data_to_string = pd.DataFrame.to_string(excel)
for line in data_to_string:
    tokens = tokenizer.tokenize(data_to_string)
    stopped = [word for word in tokens if not word in stop_words] #removes stop words
    trimmed = [ word for word in stopped if len(word) >= 3 ] #takes out all words of two characters or less.
    stemmed = [stemmer.stem(word) for word in trimmed] #stems the words
return_to_dataframe = pd.DataFrame(stemmed) #resets back to pandas dataframe
I've thought about using this, but it doesn't work:
data_to_string = excel.astype(str).apply(' '.join, axis=1)
Edit: Maarten asked if I could upload an image of what my current and desired output would be. The format of the original input file (uncleaned) is on the left. The middle is the desired outcome (stemmed and stop words removed etc.), and the right image is the current output.
EDIT: I managed to solve it; the main problem was with the tokenization. First, I had to convert the pandas dataframe to a list of lists (see strdata in the code below), and then tokenize each item in each list. The rest was solved with a simple for loop, appending the cleaned rows back to a list and converting the list back to a pandas dataframe. The remove_NaN is there because pandas saw each None-type element as a string of alphanumeric characters (namely the word "None") instead of an empty cell, so this string had to be removed. Also, pandas put each tokenized word into a separate column; mergeddf is there to merge all words back into the same column.
The working code looks like this:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
import pandas as pd
import numpy as np

#load tokenizer, stemmer and stop words
tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

excel = pd.read_excel(open(inFilePath, 'rb')) #use pandas to read excel file
strdata = excel.values.tolist() #convert values to list of lists (each row becomes a separate list)
tokens = [tokenizer.tokenize(str(i)) for i in strdata] #tokenize words in lists
cleaned_list = []
for m in tokens:
    stopped = [i for i in m if str(i).lower() not in stop_words] #remove stop words
    stemmed = [stemmer.stem(i) for i in stopped] #stem words
    cleaned_list.append(stemmed) #append stemmed words to list
backtodf = pd.DataFrame(cleaned_list) #convert list back to pandas dataframe
remove_NaN = backtodf.replace(np.nan, '', regex=True) #remove None (which return as words (str))
mergeddf = remove_NaN.astype(str).apply(lambda x: ' '.join(x), axis=1) #convert cells to strings, merge columns
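The key idea in the fix above is to tokenize each row on its own and then join the cleaned words back into one string per row. Here is a stripped-down sketch of just that step, with stand-ins so that it runs without NLTK data: a regular expression replaces RegexpTokenizer(r'\w+'), STOP_WORDS is a tiny made-up set rather than NLTK's English list, and stemming is left out. The function name clean_row and the sample rows are hypothetical.

```python
import re

# Tiny stand-in for stopwords.words('english'); not the real NLTK list
STOP_WORDS = {"the", "is", "a", "to", "and", "of"}


def clean_row(text, min_length=3):
    """Tokenize one row, drop stop words and short words, return one string."""
    tokens = re.findall(r"\w+", str(text))
    kept = [
        word for word in tokens
        if word.lower() not in STOP_WORDS and len(word) >= min_length
    ]
    return " ".join(kept)


rows = ["The service is great", "Easy to order online"]
cleaned = [clean_row(row) for row in rows]  # one cleaned sentence per input row
```

Because the function takes one cell and returns one string, it could also be applied column by column with the pandas map method, which would keep the original row layout without the list-of-lists round trip.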
I am new to Scikit-Learn and I want to convert a collection of data that I have already labelled into a dataset. I have converted the .csv file of the data into a NumPy array; however, one problem I have run into is splitting the data into training sets based on the presence of a flag in the second column. I want to know how to access a particular row and column of a .csv file using Pandas. The following is my code:
import numpy as np
import pandas as pd
import csv
import nltk
import pickle
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from nltk.classify import ClassifierI
from statistics import mode

def numpyfy(fileid):
    data = pd.read_csv(fileid,encoding = 'latin1')
    #pd.readline(data)
    target = data["String"]
    data1 = data.ix[1:,:-1]
    #print(data)
    return data1

def learn(fileid):
    trainingsetpos = []
    trainingsetneg = []
    datanew = numpyfy(fileid)
    if(datanew.ix['Status']==1):
        trainingsetpos.append(datanew.ix['String'])
    if(datanew.ix['Status']==0):
        trainingsetneg.append(datanew.ix['String'])
    print(list(trainingsetpos))
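You can use boolean indexing to split the data. Something like:
import pandas as pd

def numpyfy(fileid):
    df = pd.read_csv(fileid, encoding='latin1')
    target = df.pop('String')  # remove the text column and keep it as the target
    # .ix was removed in pandas 1.0; .iloc does the same positional slicing here
    data = df.iloc[1:, :-1]  # skip the first row and the last column
    return target, data

def learn(fileid):
    target, data = numpyfy(fileid)
    trainingsetpos = data[data['Status'] == 1]  # rows flagged 1
    trainingsetneg = data[data['Status'] == 0]  # rows flagged 0
    print(trainingsetpos)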
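To make the boolean-indexing step concrete, here is a small self-contained sketch on an in-memory DataFrame. The column names String and Status come from the question; the three sample rows are invented for illustration.

```python
import pandas as pd

# Invented sample data; the real data comes from the CSV file in the question
df = pd.DataFrame({
    "String": ["good product", "bad service", "works well"],
    "Status": [1, 0, 1],
})

# df["Status"] == 1 is a boolean Series; indexing with it keeps the matching rows
trainingsetpos = df.loc[df["Status"] == 1, "String"].tolist()
trainingsetneg = df.loc[df["Status"] == 0, "String"].tolist()
```

Selecting the String column in the same .loc call gives the two lists of texts directly, without looping over the rows.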