BigQuery Storage (Python): issue reading multiple streams in parallel with multiprocessing

I am trying to use the BALANCED ShardingStrategy to get more than one stream, and the Python multiprocessing library to read the streams in parallel.
However, when reading the streams in parallel, each one returns the same number of rows and the same data. If I understand correctly, no data is assigned to a stream until it starts reading and is finalized, so two parallel streams try to read the same data, and part of the data is never read as a result.
With the LIQUID strategy we can read all the data from one stream, which cannot be split.
According to the documentation, it is possible to read multiple streams in parallel with the BALANCED strategy. However, I cannot figure out how to read in parallel and assign different data to each stream.
I have the following toy code:
import pandas as pd
from google.cloud import bigquery_storage_v1beta1
import os
import google.auth
from multiprocessing import Pool
import multiprocessing
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='key.json'
credentials, your_project_id = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
bq_storage_client = bigquery_storage_v1beta1.BigQueryStorageClient(credentials=credentials)
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "bigquery-public-data"
table_ref.dataset_id = "ethereum_blockchain"
table_ref.table_id = "contracts"
parent = "projects/{}".format(your_project_id)
session = bq_storage_client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)
def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]), timeout=100000).to_arrow(session).to_pandas()
    return reader
if __name__ == '__main__':
    p = Pool(2)
    output = p.map(read_rows,([i for i in range(0,2)]))
    print(output)
I need help getting multiple streams read in parallel.
There is probably a way to assign data to a stream before reading starts. Any code examples, explanations, or tips would be appreciated.

I apologize for the partial answer, but it didn't fit in a comment.
LIQUID or BALANCED only affects how data is allocated to streams, not whether data arrives in multiple streams (see here).
When I ran a variant of your code with this read_rows function, I saw different data for the first row of both streams, so I was unable to replicate your problem of seeing the same data on this dataset with either sharding strategy.
def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]),timeout=100000)
    for row in reader.rows(session):
        print(row)
        break
I ran this code on a Linux Compute Engine instance.
I do worry, however, that the output you are asking for in the map call is going to be quite large.

Related

Python and Dask - reading and concatenating multiple files

I have some parquet files, all coming from the same domain but with some differences in structure. I need to concatenate all of them. Below are some examples of these files:
file 1:
A,B
True,False
False,False
file 2:
A,C
True,False
False,True
True,True
What I am looking to do is to read and concatenate these files in the fastest way possible obtaining the following result:
A,B,C
True,False,NaN
False,False,NaN
True,NaN,False
False,NaN,True
True,NaN,True
To do that I am using the following code, adapted from (Reading multiple files with Dask, Dask dataframes: reading multiple files & storing filename in column):
import glob
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
import dask
def read_parquet(path):
    return pd.read_parquet(path)
if __name__=='__main__':
    files = glob.glob('test/*/file.parquet')
    print('Start dask client...')
    client = Client()
    # One delayed pandas read per file, each wrapped as a Dask dataframe
    results = [dd.from_delayed(dask.delayed(read_parquet)(path)) for path in files]
    results = dd.concat(results).compute()
    client.close()
This code works, and it is already the fastest version I could come up with (I tried sequential pandas and multiprocessing.Pool). My idea was that Dask could ideally start part of the concatenation while still reading some of the files, however, from the task graph I see some sequential reading of the metadata of each parquet file, see the screenshot below:
The first part of the task graph is a mixture of read_parquet followed by read_metadata. The first part always shows only 1 task executed (in the task processing tab). The second part is a combination of from_delayed and concat and it is using all of my workers.
Any suggestion on how to speed up the file reading and reduce the execution time of the first part of the graph?
The problem with your code is that you use the Pandas version of read_parquet.
Instead use:
the dask version of read_parquet,
the map and gather methods offered by Client,
the dask version of concat.
Something like:
def read_parquet(path):
    return dd.read_parquet(path)
def myRead():
    L = client.map(read_parquet, glob.glob('file_*.parquet'))
    lst = client.gather(L)
    return dd.concat(lst)
result = myRead().compute()
Before that, I created a client, once only. The reason is that during my earlier experiments I got an error message when I attempted to create it again (in a function), even though the first instance had been closed before.
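For reference, the target table in the question is what a plain column-union concatenation produces. The sketch below uses pandas only (no Dask), on small in-memory frames that stand in for the two parquet files; the names file_1, file_2 and combined are made up for this illustration.

```python
import pandas as pd

# Small in-memory stand-ins for the two parquet files from the question
file_1 = pd.DataFrame({"A": [True, False], "B": [False, False]})
file_2 = pd.DataFrame({"A": [True, False, True], "C": [False, True, True]})

# Concatenating frames with different columns takes the union of the columns
# and fills the cells that a file does not provide with NaN
combined = pd.concat([file_1, file_2], ignore_index=True, sort=False)
print(combined)
```

dd.concat in Dask follows the same column-union idea, so the shape of the result should not depend on which reader is used.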

How to read chunks of multiple large CSV files from google cloud storage using Dask without overloading the memory all at once

I'm trying to read a number of large CSV files from Google Cloud Storage. I use the Dask distributed library for parallel computation, but the problem I'm facing is that, although I specify the blocksize (100 MB), I'm not sure how to read partition by partition and save each one to my Postgres database so that I don't overload my memory.
from dask.distributed import Client
from dask.diagnostics import ProgressBar
client = Client(processes=False)
import dask.dataframe as dd
def read_csv_gcs():
    with ProgressBar():
        df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
        pd = df.compute(scheduler='threads')
    return pd
def write_df_to_db(df):
    try:
        from sqlalchemy import create_engine
        # SQLAlchemy URLs separate the credentials from the host with '@'
        engine = create_engine('postgresql://usr:pass@localhost:5432/sampledb')
        df.to_sql('sampletable', engine, if_exists='replace',index=False)
    except Exception as e:
        print(e)
        pass
pd = read_csv_gcs()
write_df_to_db(pd)
The above code is my basic implementation, but as I said, I would like to read the data in chunks and update the db. Something like:
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
for chunk in df:
    write_it_to_db(chunk)
Is it possible to do this in Dask? Or should I use pandas's chunksize, iterate, and then save to the DB (but then I lose parallel computation)?
Can someone shed some light?
This line
df.compute(scheduler='threads')
says: load the data in chunks in worker threads, and concatenate them all into a single in-memory dataframe. This is not what you wanted. You wanted to insert the chunks as they arrive and then drop them from memory.
You probably wanted to use map_partitions
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
df.map_partitions(write_it_to_db).compute()
or use df.to_delayed().
Note that, depending on your SQL driver, you might not be able to get parallelism this way; if not, the pandas iter-chunk method would work just as well.
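The "insert each chunk, then drop it from memory" pattern can be sketched without Dask or pandas. The example below is only an illustration using the standard library: an in-memory CSV and an in-memory SQLite database stand in for the GCS files and the Postgres table, and the helper names iter_chunks and load_csv_in_chunks are hypothetical.

```python
import csv
import io
import itertools
import sqlite3

def iter_chunks(rows, chunk_size):
    # Yield lists of at most chunk_size rows; only one chunk is held at a time
    rows = iter(rows)
    while True:
        chunk = list(itertools.islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk

def load_csv_in_chunks(csv_file, conn, chunk_size=2):
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    total = 0
    for chunk in iter_chunks(reader, chunk_size):
        # Write this chunk, commit, then let it go out of scope
        conn.executemany("INSERT INTO sampletable (a, b) VALUES (?, ?)", chunk)
        conn.commit()
        total += len(chunk)
    return total

# Stand-in data and database (assumptions for this sketch only)
csv_file = io.StringIO("a,b\n1,x\n2,y\n3,z\n4,w\n5,v\n")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sampletable (a INTEGER, b TEXT)")

inserted = load_csv_in_chunks(csv_file, conn)
row_count = conn.execute("SELECT COUNT(*) FROM sampletable").fetchone()[0]
print(inserted, row_count)
```

With pandas, the same loop shape would come from iterating over read_csv with a chunksize and calling to_sql with if_exists='append' on each chunk. The question's if_exists='replace' would overwrite the table on every chunk.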

Nested For Loops With Calculations Vs. Linear Process

I'm iterating over M dataframes, each containing a column with N URLs. For each URL, I extract paragraph text, then conduct standard cleaning for textual analysis before calculating "sentiment" scores.
Is it more efficient for me to:
Continue as it is (compute scores in the URL for-loop itself)
Extract all of the text from URLs first, and then separately iterate over the list / column of text ?
Or does it not make any difference?
I am currently running the calculations within the loop itself. Each DF has about 15,000 to 20,000 URLs, so it takes an extremely long time.
# DFs are stored on a website
# I extract links to each .csv file and store it as a list in "df_links"
for link in df_links:
    cleaned_articles = []
    df = pd.read_csv(link, sep="\t", header=None)
    # Conduct df cleaning
    # URLs for articles to scrape are stored in 1 column, which I iterate over as...
    for url in df['article_url']:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        para_text = [text.get_text() for text in soup.findAll('p')]
        text = " ".join(para_text)
        words = text.split()
        if len(words) > 500:
            # Conduct Text Cleaning & Scores Computations
            # Cleaned text stored as a variable "clean_text"
            cleaned_articles.append(clean_text)
    df['article_text'] = cleaned_articles
    df.to_csv('file_name.csv')
To answer the question: it shouldn't make much of a difference whether you download the data first and then analyse it. You'd just be rearranging the order of a set of tasks that take effectively the same total time.
The only difference may arise if the text corpora are rather large: read/write time to disk will then start to play a part, so running the analytics entirely in memory could be a little faster. But this still isn't going to really solve your problem.
May I be so bold as to reinterpret your question as: "My analysis is taking too long, help me speed it up!"
This sounds like a perfect use case for multiprocessing! Since this sounds like a data science project, you'll need to pip install multiprocess if you're using an IPython notebook (like Jupyter), or import multiprocessing if you're using a Python script. This is because of the way Python passes information between processes. Don't worry, though: the APIs of multiprocess and multiprocessing are identical!
A basic and easy way to speed up your analysis is to indent your for loop and put it in a function. That function can then be passed to a multiprocessing map, which spawns multiple processes and runs the analysis on several URLs at once:
from multiprocess import Pool
import numpy as np
import os
import pandas as pd
num_cpus = os.cpu_count()
def analytics_function(*args):
    #Your full function including fetching data goes here and accepts an array of links
    return something
df_links_split = np.array_split(df_links, num_cpus * 2) #I normally just use 2 as a rule of thumb
pool = Pool(num_cpus * 2) #Start a pool with num_cpus * 2 processes
list_of_returned = pool.map(analytics_function, df_links_split)
This will spin up a load of processes and utilise your full CPU. You won't be able to do much else on your computer, and you'll need to keep your resource monitor open to check that you're not maxing out your memory and slowing down or crashing the process. But it should significantly speed up your analysis, by roughly a factor of num_cpus * 2!
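To make the splitting step concrete, here is a pure-Python sketch of how np.array_split divides a list into nearly equal parts: the first len(items) % n_chunks parts receive one extra element. The helper name split_evenly and the sample links are made up for this illustration.

```python
def split_evenly(items, n_chunks):
    # Same sizing rule as np.array_split: the first `extra` chunks get one more item
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        size = base + 1 if i < extra else base
        chunks.append(items[start:start + size])
        start += size
    return chunks

# Hypothetical sample links, standing in for df_links
links = ["link_%d" % i for i in range(10)]
chunks = split_evenly(links, 3)
print([len(chunk) for chunk in chunks])
```

Each chunk is then what a single call of the worker function receives from pool.map.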
Extracting all of the texts and then processing them, or extracting one text and processing it before extracting the next, won't make any difference.
Doing ABABAB takes as much time as doing AAABBB.
You might, however, be interested in using threads or asynchronous requests to fetch all of the data in parallel.
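A minimal sketch of the thread-based variant, using the standard library's concurrent.futures. The fetch_text function here is a hypothetical stand-in that returns a dummy string so the example runs offline; in the real script it would wrap requests.get(url) and the BeautifulSoup parsing.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_text(url):
    # Hypothetical stand-in for requests.get(url).text, so the sketch runs offline
    return "text of " + url

def fetch_all(urls, max_workers=8):
    # Threads overlap the time spent waiting on the network;
    # executor.map returns results in the same order as the input URLs
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_text, urls))

# Placeholder URLs for illustration only
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
texts = fetch_all(urls)
print(texts)
```

Because the results keep the input order, they can be assigned back to a dataframe column directly, provided every URL yields exactly one entry.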

How to random access avro records in python?

I have a set of index/avro files (KB/GB in size), and I use the following program to read them:
import avro.schema
from avro.datafile import DataFileReader
from avro.io import DatumReader, DatumWriter
schema = avro.schema.Parse(open(r"hmd.avsc", "rb").read())
reader = DataFileReader(open(r"data", "rb"), DatumReader())
reader_index = DataFileReader(open(r"index", "rb"), DatumReader())
The problem is that the reader is very slow: when the data is as large as 5 GB, it takes around 1 hour to iterate every record into memory. I want to use multiple threads to speed up the process. The idea is to read the index, which is small; since I then have the keys in hand, I can divide them into 10 parts and read those parts concurrently. Is there any Python API that supports random access with the Avro reader?
Edit1:
I do see there is a seek method in the 1.2 API version, https://avro.apache.org/docs/1.2.0/api/py/avro.io.html, but it seems to be gone in 1.8.2. Is there any alternative?
I will speak from the Java point of view, but I would guess the Python side has the same thing. Did you try the seek method of the DataFileReader object? It allows random access to the file and can speed up your process. The hard part will be pointing to the correct sync point. I would recommend saving the sync points while writing the files.
UPDATE: The link has been changed to point to the most recent docs.

Python Dask Running Bag operations in parallel

I am trying to run a series of operations on a json file using Dask and read_text but I find that when I check Linux Systems Monitor, only one core is ever used at 100%. How do I know if the operations I am performing on a Dask Bag are able to be parallelized? Here is the basic layout of what I am doing:
import dask.bag as db
import json
js = db.read_text('path/to/json').map(json.loads).filter(lambda d: d['field'] == 'value')
result = js.pluck('field')
result = result.map(cleantext, tbl=tbl).str.lower().remove(exclusion).str.split()
result.map(stopwords,stop=stop).compute()
The basic premise is to extract text entries from the json file and then perform some cleaning operations. This seems like something that can be parallelized since each piece of text could be handed off to a processor since each text and the cleaning of each text is independent of any of the other. Is this an incorrect thought? Is there something I should be doing differently?
Thanks.
The read_text function breaks up a file into chunks based on byte ranges. My guess is that your file is small enough to fit into one chunk. You can check this by looking at the .npartitions attribute.
>>> js.npartitions
1
If so, then you might consider reducing the blocksize to increase the number of partitions
>>> js = db.read_text(..., blocksize=1e6)... # 1MB chunks
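As a rough way to pick a blocksize, the partition count can be estimated from the file size. This is only a back-of-the-envelope sketch, assuming the file is cut into byte ranges of at most blocksize bytes; Dask adjusts chunk boundaries to line delimiters, so the real .npartitions may differ. The helper name estimated_partitions is made up here.

```python
import math

def estimated_partitions(file_size_bytes, blocksize_bytes):
    # Approximation only: one partition per blocksize-sized byte range, at least one
    return max(1, math.ceil(file_size_bytes / blocksize_bytes))

# A 25 MB file with 1 MB blocks gives about 25 partitions
print(estimated_partitions(25_000_000, 1_000_000))
# A file smaller than the blocksize stays in a single partition
print(estimated_partitions(500_000, 1_000_000))
```

As a rule of thumb, aim for at least as many partitions as cores so that every worker has something to do.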
