I'm iterating over M dataframes, each containing a column with N URLs. For each URL, I extract paragraph text, then conduct standard cleaning for textual analysis before calculating "sentiment" scores.
Is it more efficient for me to:
Continue as is (compute scores inside the URL for-loop itself), or
Extract all of the text from the URLs first, and then separately iterate over the list / column of text?
Or does it not make any difference?
I'm currently running the calculations within the loop itself. Each DF has about 15,000 - 20,000 URLs, so it's taking an insane amount of time too!
# DFs are stored on a website
# I extract links to each .csv file and store it as a list in "df_links"
for link in df_links:
    cleaned_articles = []
    df = pd.read_csv(link, sep="\t", header=None)
    # Conduct df cleaning
    # URLs for articles to scrape are stored in 1 column, which I iterate over as...
    for url in df['article_url']:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        para_text = [text.get_text() for text in soup.findAll('p')]
        text = " ".join(para_text)
        words = text.split()
        if len(words) > 500:
            # Conduct Text Cleaning & Scores Computations
            # Cleaned text stored as a variable "clean_text"
            cleaned_articles.append(clean_text)
    df['article_text'] = cleaned_articles
    df.to_csv('file_name.csv')
To answer the question: it shouldn't make much of a difference whether you download all the data first and then analyse it. You'd just be rearranging the order in which you do a set of tasks that would effectively take the same time.
The only difference may come if the text corpora are rather large, at which point read/write time to disk starts to play a part, so running the analytics all in memory could be a little faster. But this still isn't going to really solve your problem.
May I be so bold as to reinterpret your question as: "My analysis is taking too long, help me speed it up!"
This sounds like a perfect use case for multiprocessing! Since this sounds like a data science project, you'll need to pip install multiprocess if you're using an IPython notebook (like Jupyter), or import multiprocessing if you're using a Python script. This is because of the way Python passes information between processes. Don't worry though, the APIs of multiprocess and multiprocessing are identical!
A basic and easy way to speed up your analysis is to indent your for loop and put it in a function. That function can then be passed to a multiprocessing map, which spawns multiple processes and does the analysis on several URLs at once:
from multiprocess import Pool
import numpy as np
import os
import pandas as pd
num_cpus = os.cpu_count()
def analytics_function(links):
    # Your full function, including fetching the data, goes here; it accepts an array of links
    return something
df_links_split = np.array_split(df_links, num_cpus * 2)  # I normally just use 2 as a rule of thumb
pool = Pool(num_cpus * 2)  # Start a pool with num_cpus * 2 processes
list_of_returned = pool.map(analytics_function, df_links_split)
This will spin up a load of processes and utilise your full CPU. You won't be able to do much else on your computer, and you'll need to keep your resource monitor open to check you're not maxing out your memory and slowing down or crashing the process. But it should significantly speed up your analysis. For CPU-bound work the gain is limited by the number of cores; the extra processes (num_cpus * 2) mainly help while other processes are waiting on network requests.
Extracting all of the texts and then processing them, or extracting one text and processing it before extracting the next, won't make any difference.
Doing ABABAB takes as much time as doing AAABBB.
You might however be interested in using threads or asynchronous requests to fetch all of the data in parallel.
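As a minimal sketch of the threaded approach, assuming the per-URL download is the slow part: the helper below runs any fetch callable over a list of URLs with a thread pool. The name fetch_all and the default of 16 workers are illustrative and not from the question; in the real script, fetch would be something like lambda url: requests.get(url).text.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=16):
    """Call fetch(url) for every URL concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as `urls`,
        # even though the calls themselves overlap in time.
        return list(executor.map(fetch, urls))
```

The cleaning and scoring can then run over the returned list as before; only the waiting on the network is overlapped.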
I have a local Linux server with 4 cores. I am running a PySpark job on it locally which reads two tables from a database and saves the data into 2 dataframes. I then use these 2 dataframes to do some processing and save the resulting processed df into Elasticsearch. Below is the code:
import time
from pyspark.sql import SparkSession, SQLContext
def save_to_es(df):
    df.write.format('es').option('es.nodes', 'es_node').option('es.port', some_port_no).option('es.resource', index_name).option('es.mapping', es_mappings).save()
def coreFun():
    start_time = int(time.time())  # assumed: the original snippet never set start_time
    spark = SparkSession.builder.master("local[1]").appName('test').getOrCreate()
    spark.catalog.clearCache()
    spark.sparkContext.setLogLevel("ERROR")
    sc = spark.sparkContext
    sqlContext = SQLContext(sc)
    select_sql = """(select * from db."master_table")"""
    df_master = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql).option("user", "username").option("password", "password").option("driver", "database_driver").load()
    select_sql_child = """(select * from db."child_table")"""
    df_child = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql_child).option("user", "username").option("password", "password").option("driver", "database_driver").load()
    merged_df = merged_python_file.merged_function(df_master,df_child,sqlContext)
    logic1_df = logic1_python_file.logic1_function(df_master,sqlContext)
    logic2_df = logic2_python_file.logic2_function(df_master,sqlContext)
    logic3_df = logic3_python_file.logic3_function(df_master,sqlContext)
    logic4_df = logic4_python_file.logic4_function(df_master,sqlContext)
    logic5_df = logic5_python_file.logic5_function(df_master,sqlContext)
    save_to_es(merged_df)
    save_to_es(logic1_df)
    save_to_es(logic2_df)
    save_to_es(logic3_df)
    save_to_es(logic4_df)
    save_to_es(logic5_df)
    end_time = int(time.time())
    print(end_time-start_time)
    sc.stop()
if __name__ == "__main__":
    coreFun()
The different processing logic is written in separate Python files, e.g. logic1 in logic1_python_file, etc. I send my df_master to separate functions and they return the resulting processed df back to the driver. I then save this processed df into Elasticsearch.
It works fine, but the problem is that everything happens sequentially: first merged_df gets processed, and while it is being processed the others simply wait, even though they do not depend on the output of merged_function; then logic1 gets processed while the others wait, and so on. This is not an ideal design, considering the output of one logic does not depend on another.
I am sure asynchronous processing can help me here, but I am not sure how to implement it in my use case. I know I may have to use some kind of queue (JMS, Kafka, etc.) to accomplish this, but I don't have a complete picture.
Please let me know how I can use asynchronous processing here. Any other input that can help improve the performance of the job is welcome.
If, during the processing of a single step (like merged_python_file.merged_function), only one core of the CPU is heavily utilized and the others are nearly idle, multiprocessing can speed things up. This can be achieved with Python's multiprocessing module. For more details, see the answers to "How to do parallel programming in Python?".
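One possible way to overlap the independent steps, sketched here with threads rather than the processes suggested above: Spark can run several jobs concurrently when they are submitted from separate threads of the same application, whereas a SparkSession is not designed to be shared across processes. The helper run_independent and the task labels are hypothetical; each task would wrap one "compute, then save_to_es" step. Note that with master("local[1]") Spark itself only has one core to work with, so the master setting would probably need to change too.

```python
from concurrent.futures import ThreadPoolExecutor


def run_independent(tasks, max_workers=4):
    """Run zero-argument callables concurrently and return {name: result}.

    `tasks` maps a label to a callable; nothing here is Spark-specific.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {name: executor.submit(task) for name, task in tasks.items()}
        # .result() blocks until that task finishes and re-raises its exception.
        return {name: future.result() for name, future in futures.items()}
```

In the job above this might be called with entries such as "logic1": lambda: save_to_es(logic1_python_file.logic1_function(df_master, sqlContext)), one per independent step.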
I have 1080 .txt files, each of which contains over 100k rows of values in three columns. I have to compute the average of the first column in each of these .txt files.
Any method that performs looping is proving to be too slow, as only one file is loaded by numpy.loadtxt at a time.
The kicker is that I have 38 of these folders on which I need to perform this operation. So 38*1080 files in total. Timing each numpy.loadtxt call with the time module gives me around 1.7 seconds. So the total time to run over all folders is over 21 hours, which seems a bit too much.
So this has me wondering if there is a way to perform multiple operations at once, by opening multiple txt files and averaging the first column of each. I also need to store each average in the order of the corresponding txt files, since the order is important.
Since I am a beginner, I'm not sure if this even is the fastest way. Thanks in advance.
import numpy as np
import glob
import os
i = 0
while i < 39:
    source_directory = "something/" + str(i)  # Go to specific folder with the numbering
    hw_array = sorted(glob.glob(source_directory + "/data_*.txt"))  # read paths of 1080 txt files
    velocity_array = np.zeros((30,36,3))
    for probe in hw_array:
        x = 35 - int((i-0.0001)/30)  # describing position of the probes where velocities are measured
        y = (30 - int(round((i)%30)))%30
        velocity_column = np.loadtxt(probe, usecols=(0))  # step that takes most time
        average_array = np.mean(velocity_column, axis=0)
        velocity_array[y,x,0] = average_array
        velocity_array[y,x,1] = y*(2/29)
        velocity_array[y,x,2] = x*0.5
    np.save(r"C:/Users/md101/Desktop/AE2/Project2/raw/" + "R29" + "/data" + "R29", velocity_array)  # save velocity array for use in later analysis
    i += 1
Python has pretty slow I/O, and most of your time is spent talking to the operating system and on other costs associated with opening files.
Threading in Python is strange and only provides an improvement in certain situations. Here is why it is good for your case. A Python thread only does work while it holds permission to run (this is called acquiring the GIL, or global interpreter lock; read about it). While it is waiting for something, like I/O, it gives up the GIL to another thread. This allows one file to be operated on (averaging the first column) by the thread holding the GIL while another thread is waiting for its file to be opened.
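A minimal sketch of that idea, assuming each file is whitespace-separated text with the value of interest in the first column. The helpers first_column_mean and means_in_order are hypothetical, and the parsing is plain Python rather than np.loadtxt just to keep the example self-contained. executor.map returns results in the order of the input paths, which covers the ordering requirement.

```python
from concurrent.futures import ThreadPoolExecutor


def first_column_mean(path):
    """Average the first whitespace-separated column of one text file."""
    total = 0.0
    count = 0
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                total += float(fields[0])
                count += 1
    return total / count if count else float("nan")


def means_in_order(paths, max_workers=8):
    """Process files concurrently; the i-th result belongs to paths[i]."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(first_column_mean, paths))
```

Threads only overlap the time spent waiting on the disk. If parsing dominates, concurrent.futures.ProcessPoolExecutor offers the same map interface and sidesteps the GIL.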
It's completely possible to write a function that loads the files from one directory, spawn one process per directory with multiprocessing, and get it done in close to 1/39th of the time it was taking, provided you have enough cores. Or don't parallelize by directory but by file: queue up the work and read from that work queue.
Some pseudocode:
import multiprocessing
import os
pool = multiprocessing.Pool()
workers = []
for d in os.listdir("."):
    for f in os.listdir(d):
        workers.append(pool.apply_async(counter_function_for_file, (os.path.join(d, f),)))
s = sum(worker.get() for worker in workers)
...and put your code for reading from the file in that counter_function_for_file(filename) function.
So, I have this database with thousands of rows and columns. At the start of the program I load the data and assign a variable to it:
data=np.loadtxt('database1.txt',delimiter=',')
Since this database contains many elements, it takes minutes to start the program. Is there a way in Python (similar to .mat files in MATLAB) to load the data only once, even when I stop the program and then run it again? Currently my time is wasted waiting for the program to load the data if I just change a small thing for testing.
Firstly, NumPy isn't well suited to reading large text files; pandas is much stronger at this.
So just stop using np.loadtxt and start using pd.read_csv instead.
But if you want to stick with NumPy,
I think the np.fromfile() function is more efficient and faster than np.loadtxt(). Note that np.fromfile returns a flat 1-D array and does no row handling, so check the result before relying on it.
So, my advice is to try:
data = np.fromfile('database1.txt', sep=',')
instead of:
data = np.loadtxt('database1.txt',delimiter=',')
You could use pickle to cache your data.
import pickle
import os
import numpy as np
if os.path.isfile("cache.p"):
    with open("cache.p","rb") as f:
        data=pickle.load(f)
else:
    data=np.loadtxt('database1.txt',delimiter=',')
    with open("cache.p","wb") as f:
        pickle.dump(data,f)
The first time it will be very slow, then in later executions it will be pretty fast.
I just tested this with a file containing 1 million rows and 20 columns of random floats: it took ~30s the first time, and ~0.4s the following times.
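One caveat with a plain cache file is that it goes stale when the source data changes. A possible variant, sketched below under the assumption that comparing file modification times is good enough: reload only when the source is newer than the cache. The name load_cached and its parameters are made up for illustration; loader stands in for the slow call, e.g. lambda p: np.loadtxt(p, delimiter=',').

```python
import os
import pickle


def load_cached(source_path, cache_path, loader):
    """Return loader(source_path), reusing a pickle cache while it is fresh."""
    cache_is_fresh = (
        os.path.isfile(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)
    )
    if cache_is_fresh:
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = loader(source_path)  # slow path: parse the original file
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```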
I am trying to use the BALANCED ShardingStrategy to get more than one stream, and the Python multiprocessing library to read the streams in parallel.
However, when reading the streams in parallel, the same number of rows and the same data are returned. If I understand correctly, no data is assigned to a stream before it starts reading and is finalized, so two parallel streams try to read the same data, and part of the data is never read as a result.
Using the LIQUID strategy we can read all the data from one stream, which cannot be split.
According to the documentation, it is possible to read multiple streams in parallel with the BALANCED one. However, I cannot figure out how to read in parallel and assign different data to each stream.
I have the following toy code:
import pandas as pd
from google.cloud import bigquery_storage_v1beta1
import os
import google.auth
from multiprocessing import Pool
import multiprocessing
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='key.json'
credentials, your_project_id = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
bq_storage_client = bigquery_storage_v1beta1.BigQueryStorageClient(credentials=credentials)
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "bigquery-public-data"
table_ref.dataset_id = "ethereum_blockchain"
table_ref.table_id = "contracts"
parent = "projects/{}".format(your_project_id)
session = bq_storage_client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)
def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]), timeout=100000).to_arrow(session).to_pandas()
    return reader
if __name__ == '__main__':
    p = Pool(2)
    output = p.map(read_rows,([i for i in range(0,2)]))
    print(output)
I need assistance with reading multiple streams in parallel.
There is probably a way to assign data to a stream before the reading starts. Any code examples, explanations or tips would be appreciated.
I apologize for the partial answer, but it didn't fit in a comment.
LIQUID or BALANCED just affect how data is allocated to streams, not the fact that data arrives in multiple streams (see here).
When I ran a variant of your code with this read_rows function, I saw different data for the first row of both streams, so I was unable to replicate your problem of seeing the same data on this dataset with either sharding strategy.
def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]),timeout=100000)
    for row in reader.rows(session):
        print(row)
        break
I was running this code on a Linux compute engine instance.
I do worry that the output you are asking for in the map call is otherwise going to be quite large, however.
I am trying to run a series of operations on a json file using Dask and read_text but I find that when I check Linux Systems Monitor, only one core is ever used at 100%. How do I know if the operations I am performing on a Dask Bag are able to be parallelized? Here is the basic layout of what I am doing:
import dask.bag as db
import json
js = db.read_text('path/to/json').map(json.loads).filter(lambda d: d['field'] == 'value')
result = js.pluck('field')
result = result.map(cleantext, tbl=tbl).str.lower().remove(exclusion).str.split()
result.map(stopwords,stop=stop).compute()
The basic premise is to extract text entries from the json file and then perform some cleaning operations. This seems like something that can be parallelized since each piece of text could be handed off to a processor since each text and the cleaning of each text is independent of any of the other. Is this an incorrect thought? Is there something I should be doing differently?
Thanks.
The read_text function breaks up a file into chunks based on byte ranges. My guess is that your file is small enough to fit into one chunk. You can check this by looking at the .npartitions attribute.
>>> js.npartitions
1
If so, then you might consider reducing the blocksize to increase the number of partitions:
>>> js = db.read_text(..., blocksize=1e6)... # 1MB chunks