How to randomly access Avro records in Python?

I have a pair of Avro files, a small index file (KB) and a large data file (GB), and I use the following program to read them:
import avro.schema
from avro.datafile import DataFileReader
from avro.io import DatumReader, DatumWriter
schema = avro.schema.Parse(open(r"hmd.avsc", "rb").read())
reader = DataFileReader(open(r"data", "rb"), DatumReader())
reader_index = DataFileReader(open(r"index", "rb"), DatumReader())
The problem is that the reader is very slow: when the data file is around 5 GB, it takes about an hour to iterate over every record in memory. I want to use multiple threads to speed up the process. The idea is to read the index, which is small, and since I then have the keys in hand, I can divide them into 10 parts and process them concurrently. So is there any Python API that supports random access with an Avro reader?
Edit1:
I do see there is a seek method in the 1.2 API (https://avro.apache.org/docs/1.2.0/api/py/avro.io.html), but it seems to be gone in 1.8.2. Is there any other alternative?

I will talk from the Java point of view, but I would guess the Python side is the same. Did you try the seek method on the DataFileReader object? It allows random access to the file and will speed up your process. The tricky part is pointing to the correct sync point; I would recommend saving the sync points while writing the files.
UPDATE: The link has been changed to point to the most recent docs.
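To make the sync-point idea concrete in Python, here is a minimal sketch. It assumes you can record block-boundary byte offsets while writing (flush, then tell() on the underlying file object) and later seek a fresh reader's file object to one of those offsets; the exact internals differ between Avro releases, so treat it as an illustration of the approach rather than a verified recipe (`records` below is a placeholder for your own data).
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.Parse(open("hmd.avsc", "rb").read())

# Writing: flush every N records so a block boundary (sync marker) is emitted,
# and remember the byte offset right after it. `records` is a placeholder for
# your own iterable of dicts matching the schema.
offsets = []
with open("data", "wb") as out:
    writer = DataFileWriter(out, DatumWriter(), schema)
    for i, record in enumerate(records):
        if i and i % 1000 == 0:
            writer.flush()              # force the current block out
            offsets.append(out.tell())  # byte offset of the next block
        writer.append(record)
    writer.close()

# Reading: each worker opens its own reader, lets it consume the header and
# schema, then seeks the raw file object to its assigned block offset and
# iterates from there.
def read_from(path, offset):
    fo = open(path, "rb")
    reader = DataFileReader(fo, DatumReader())  # reads the file header
    fo.seek(offset)                             # jump to a block boundary
    for rec in reader:
        yield rec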

Related

Python: Do not remove the data when I stop the program. Loading a very big database only once

So, I have this database with thousands of rows and columns. At the start of the program I load the data and assign a variable to it:
data=np.loadtxt('database1.txt',delimiter=',')
Since this database contains many elements, it takes minutes to start the program. Is there a way in Python (similar to .mat files in MATLAB) that lets me load the data only once, even when I stop the program and then run it again? Currently my time is wasted waiting for the program to load the data whenever I just change a small thing for testing.
Firstly, the NumPy package isn't well suited to reading large text files; the Pandas package is much stronger here.
So just stop using np.loadtxt and start using pd.read_csv instead.
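For example, a minimal sketch, assuming database1.txt is comma-separated with no header row:
import pandas as pd

# pandas' CSV parser is much faster than np.loadtxt; hand the values back to
# NumPy so the rest of the program stays the same.
data = pd.read_csv('database1.txt', delimiter=',', header=None).to_numpy()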
But if you still want to use NumPy, I think that the np.fromfile() function is more efficient and faster than np.loadtxt(). So my advice is to try:
data = np.fromfile('database1.txt', sep=',')
instead of:
data = np.loadtxt('database1.txt',delimiter=',')
You could use pickle to cache your data.
import pickle
import os
import numpy as np

if os.path.isfile("cache.p"):
    # load the cached copy if it exists
    with open("cache.p", "rb") as f:
        data = pickle.load(f)
else:
    # first run: parse the text file, then cache the result
    data = np.loadtxt('database1.txt', delimiter=',')
    with open("cache.p", "wb") as f:
        pickle.dump(data, f)
The first time it will be very slow, but later executions will be pretty fast.
I just tested this with a file containing 1 million rows and 20 columns of random floats; it took ~30 s the first time and ~0.4 s on the following runs.
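If the data is a plain numeric array, NumPy's own binary format is another option for the cache and avoids pickle entirely; a minimal sketch of the same pattern:
import os
import numpy as np

# Same caching idea as above, but with NumPy's .npy format instead of pickle.
if os.path.isfile("cache.npy"):
    data = np.load("cache.npy")
else:
    data = np.loadtxt('database1.txt', delimiter=',')
    np.save("cache.npy", data)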

BigQuery Storage. Python. Reading multiple streams in parallel issue (multiprocessing)

I am trying to use the BALANCED ShardingStrategy to get more than one stream, and the Python multiprocessing library to read the streams in parallel.
However, when reading the streams in parallel, the same number of rows and the same data are returned. As far as I understand, no data is assigned to any stream before it starts reading and is finalized, so two parallel streams try to read the same data, and as a result part of the data is never read.
With the LIQUID strategy we can read all the data from one stream, which cannot be split.
According to the documentation it is possible to read multiple streams in parallel with the BALANCED one. However, I cannot figure out how to read in parallel and have different data assigned to each stream.
I have the following toy code:
import pandas as pd
from google.cloud import bigquery_storage_v1beta1
import os
import google.auth
from multiprocessing import Pool
import multiprocessing
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='key.json'
credentials, your_project_id = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
bq_storage_client = bigquery_storage_v1beta1.BigQueryStorageClient(credentials=credentials)
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "bigquery-public-data"
table_ref.dataset_id = "ethereum_blockchain"
table_ref.table_id = "contracts"
parent = "projects/{}".format(your_project_id)
session = bq_storage_client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)

def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]),
        timeout=100000,
    ).to_arrow(session).to_pandas()
    return reader

if __name__ == '__main__':
    p = Pool(2)
    output = p.map(read_rows, [i for i in range(0, 2)])
    print(output)
I need assistance in getting multiple streams to be read in parallel.
Probably there is a way to assign data to a stream before the reading starts. Any examples of code, explanations, or tips would be appreciated.
I apologize for the partial answer, but it didn't fit in a comment.
LIQUID or BALANCED just affect how data is allocated to streams, not the fact that data arrives in multiple streams (see here).
When I ran a variant of your code with this read_rows function, I saw different data for the first row of both streams, so I was unable to replicate your problem of seeing the same data on this dataset with either sharding strategy.
def read_rows(stream_position, session=session):
    reader = bq_storage_client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[stream_position]),
        timeout=100000)
    for row in reader.rows(session):
        print(row)
        break
I was running this code on a Linux compute engine instance.
I do worry, however, that the output you are collecting in the map call is going to be quite large.
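One thing worth double-checking on your side (an assumption on my part, not something I verified above) is how many streams the session actually contains: if len(session.streams) is 1 there is nothing to parallelize, and you can ask for more when creating the session via requested_streams, for example:
# Ask for (up to) two streams when creating the read session. The service may
# still return fewer, so always check len(session.streams) before mapping
# stream positions to worker processes.
session = bq_storage_client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED,
    requested_streams=2,
)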

Python Dask Running Bag operations in parallel

I am trying to run a series of operations on a JSON file using Dask and read_text, but I find that when I check the Linux system monitor, only one core is ever used, at 100%. How do I know whether the operations I am performing on a Dask Bag can be parallelized? Here is the basic layout of what I am doing:
import dask.bag as db
import json
js = db.read_text('path/to/json').map(json.loads).filter(lambda d: d['field'] == 'value')
result = js.pluck('field')
result = result.map(cleantext, tbl=tbl).str.lower().remove(exclusion).str.split()
result.map(stopwords,stop=stop).compute()
The basic premise is to extract the text entries from the JSON file and then perform some cleaning operations. This seems like something that can be parallelized, since each piece of text can be handed off to a processor, and the cleaning of each text is independent of all the others. Is this an incorrect thought? Is there something I should be doing differently?
Thanks.
The read_text function breaks up a file into chunks based on byte ranges. My guess is that your file is small enough to fit into one chunk. You can check this by looking at the .npartitions attribute.
>>> js.npartitions
1
If so, then you might consider reducing the blocksize to increase the number of partitions
>>> js = db.read_text(..., blocksize=1e6)  # 1MB chunks
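As a rough sketch of how this fits together once the bag has more than one partition (the blocksize value here is just an illustrative choice):
import json
import dask.bag as db

# Smaller blocks -> more partitions -> more cores can work at once.
js = db.read_text('path/to/json', blocksize=2**20).map(json.loads)
print(js.npartitions)  # should now be greater than 1 for a multi-megabyte file

result = js.filter(lambda d: d['field'] == 'value').pluck('field')
print(result.count().compute())  # the work is spread across the partitions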

Spark reading python3 pickle as input

My data are available as sets of Python 3 pickle files. Most of them are serializations of Pandas DataFrames.
I'd like to start using Spark because I need more memory and CPU than one computer can have. Also, I'll use HDFS for distributed storage.
As a beginner, I haven't found relevant information explaining how to use pickle files as input files.
Does this exist? If not, is there any workaround?
Thanks a lot
A lot depends on the data itself. Generally speaking, Spark doesn't perform particularly well when it has to read large, non-splittable files. Nevertheless, you can try the binaryFiles method and combine it with the standard Python tools. Let's start with some dummy data:
import tempfile
import pandas as pd
import numpy as np

outdir = tempfile.mkdtemp()

for i in range(5):
    pd.DataFrame(
        np.random.randn(10, 2), columns=['foo', 'bar']
    ).to_pickle(tempfile.mkstemp(dir=outdir)[1])
Next we can read it using the binaryFiles method:
rdd = sc.binaryFiles(outdir)
and deserialize individual objects:
import pickle
from io import BytesIO
dfs = rdd.values().map(lambda p: pickle.load(BytesIO(p)))
dfs.first()[:3]
## foo bar
## 0 -0.162584 -2.179106
## 1 0.269399 -0.433037
## 2 -0.295244 0.119195
One important note is that this typically requires significantly more memory than simpler methods like textFile.
Another approach is to parallelize only the paths and use libraries which can read directly from a distributed file system like hdfs3. This typically means lower memory requirements at the price of a significantly worse data locality.
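A minimal sketch of that second approach, shown with local paths and plain open (for HDFS you would swap in a client from a library such as hdfs3, and every executor has to be able to reach the files):
import os
import pickle

# Distribute only the file paths; each task opens and unpickles its own file.
paths = [os.path.join(outdir, name) for name in os.listdir(outdir)]

def load_pickle(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)

dfs_by_path = sc.parallelize(paths).map(load_pickle)
dfs_by_path.first()[:3]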
Considering these two facts it is typically better to serialize your data in a format which can be loaded with a higher granularity.
Note:
SparkContext provides a pickleFile method, but the name can be misleading. It can be used to read SequenceFiles containing pickled objects, not plain Python pickle files.

Is this a good way to export all entities of a type to a csv file?

I have millions of entities of a particular type that I would like to export to a CSV file. The following code writes entities in batches of 1000 to a blob while keeping the blob open and deferring the next batch to the task queue. When there are no more entities to be fetched, the blob is finalized. This seems to work in most of my local testing, but I wanted to know:
Whether I am missing any gotchas or corner cases before running it on my production data and incurring charges for datastore reads.
If the deadline is exceeded or memory runs out while a batch is being written to the blob, this code falls back to the start of the current batch when the task is retried, which may cause a lot of duplication. Any suggestions to fix that?
def entities_to_csv(entity_type, blob_file_name='', cursor='', batch_size=1000):
    more = True
    next_curs = None
    q = entity_type.query()
    results, next_curs, more = q.fetch_page(batch_size, start_cursor=Cursor.from_websafe_string(cursor))
    if results:
        try:
            if not blob_file_name:
                blob_file_name = files.blobstore.create(
                    mime_type='text/csv',
                    _blob_uploaded_filename='%s.csv' % entity_type.__name__)
            rows = [e.to_dict() for e in results]
            with files.open(blob_file_name, 'a') as f:
                writer = csv.DictWriter(f, restval='', extrasaction='ignore',
                                        fieldnames=results[0].keys())
                writer.writerows(rows)
            if more:
                # defer this same function (not the entity type) for the next batch
                deferred.defer(entities_to_csv, entity_type, blob_file_name,
                               next_curs.to_websafe_string())
            else:
                files.finalize(blob_file_name)
        except DeadlineExceededError:
            # retry the current batch from its starting cursor
            deferred.defer(entities_to_csv, entity_type, blob_file_name, cursor)
Later in the code, something like:
deferred.defer(entities_to_csv,Song)
The problem with your current solution is that your memory usage will increase with every write you perform to the blobstore. The blobstore is immutable and writes all the data at once from memory.
You need to run the job on a backend that can hold all the records in memory: define a backend in your application and call defer with _target='<backend name>'.
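For example (a sketch only; the backend name 'exporter' is a placeholder and must match a backend defined in your backends.yaml):
# Route the deferred export to a named backend so it is not bound by the
# normal frontend instance limits. 'exporter' is a placeholder name.
deferred.defer(entities_to_csv, Song, _target='exporter')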
Check out this Google I/O video; it pretty much describes what you want to do using MapReduce, starting at around the 23:15 mark. The code you want is at 27:19:
https://developers.google.com/events/io/sessions/gooio2012/307/
