I have a csv file on client . To make it's data accessible to worker nodes on different machines ,I am using client.scatter with reference to
Loading local file from client onto dask distributed cluster .
This is my code .
import pandas as pd
import time
import dask as dask
import dask.distributed as distributed
import dask.dataframe as dd
import dask.delayed as delayed
from dask.distributed import Client,progress
client = Client('scheduler:port')
df = pd.read_csv('file.csv')
[f]= client.scatter([df]) # send dataframe to one worker
ddf = dd.from_delayed([f], meta=df).repartition(npartitions=100).persist()
ddf = ddf.groupby(['col1'])[['col2']].sum()
future=client.compute(ddf)
print future
progress(future)
result = client.gather(future)
print result
This code is running fine for a csv file with 1 million records but for a csv file with 60 million records ,it is taking too long .Am I making any mistake ? Any help would be appreciated.
Related
I would like to benefit dask repartition feature, but the requested size is not fulfilled, and smaller files are produced.
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import dask.dataframe as dd
file = 'example.parquet'
file_res_dd = 'example_res'
# Generate a random df and write it down as an input data file.
df = pd.DataFrame(np.random.randint(100,size=(100000, 20)),columns=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T'])
table = pa.Table.from_pandas(df)
pq.write_table(table, file, version='2.0')
# Read back with dask, repartition, and write it down.
dd_df = dd.read_parquet(file, engine='pyarrow')
dd_df = dd_df.repartition(partition_size='1MB')
dd_df.to_parquet(file_res_dd, engine='pyarrow')
With this example, I am expecting files with size about 1MB.
Input file that is written 1st is about 1,7MB, so I am expecting 2 files at most.
But in the example_res folder that is created, I get 9 files being ~270kB.
Why is that so?
Thanks for your help! Bests,
The "partition size" is of the in-memory representation, and only an approximation.
Parquet offers various encoding and compression options that generally result in a file that is a good deal smaller - but how much smaller will depend greatly on the specific data in question.
I am trying to start using dask for handling large data sets in some ML projects. Loading singular CSV files into a dask dataframe works fine. When i try to use multiple CSV files, any "compute" like operation causes the program to hang indefinitely.
This runs fine
import dask.dataframe as dd
import pandas as pd
import dask
from dask.distributed import Client
client = Client(processes=False)
df = dd.read_csv('sftp://somestuff//4120109.csv')
shape = dask.delayed(print)(df.shape)
shape.compute()
Output: (3600, 3723)
The following code hangs indefinitely
import dask.dataframe as dd
import pandas as pd
import dask
from dask.distributed import Client
client = Client(processes=False)
df = dd.read_csv('sftp://somestuff//412010*.csv')
shape = dask.delayed(print)(df.shape)
shape.compute()
It should load the 10 files that match and give a shape of (36000, 3273)
I know it hangs specifically on the shape.compute() line after putting in some choice print lines. Any help would be greatly appreciated!!!
You should not mix dask.delayed and dask.dataframe. Probably you just wanted to call dask.compute(df.shape)
https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections
I'm trying to read a bunch of large csv files (multiple files) from google storage. I use the Dask distribution library for parallel computation, but the problem I'm facing here is, though I mention the blocksize (100mb), I'm not sure how to read partition by partition and save it to my postgres database so that I don't want overload my memory.
from dask.distributed import Client
from dask.diagnostics import ProgressBar
client = Client(processes=False)
import dask.dataframe as dd
def read_csv_gcs():
with ProgressBar():
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
pd = df.compute(scheduler='threads')
return pd
def write_df_to_db(df):
try:
from sqlalchemy import create_engine
engine = create_engine('postgresql://usr:pass#localhost:5432/sampledb')
df.to_sql('sampletable', engine, if_exists='replace',index=False)
except Exception as e:
print(e)
pass
pd = read_csv_gcs()
write_df_to_db(pd)
The above code is my basic implementation, but as said I would like to read it in chunk and update the db. Something like
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
for chunk in df:
write_it_to_db(chunk)
Is it possible to do it in Dask? or should I go for pandas's chunksize and iterate, then save it to DB (But I miss parallel computation here)?
Can someone shed some light?
This line
df.compute(scheduler='threads')
says: load the data in chunks in worker threads, and concatenate them all into a single in-memory dataframe, df. This is not what you wanted. You wanted to insert the chunks as they come and then drop them from memory.
You probably wanted to use map_partitions
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
df.map_partitions(write_it_to_db).compute()
or use df.to_delayed().
Note that, depending on your SQL driver, you might not be able to get parallelism this way, and if not, the pandas iter-chunk method would have worked just as well.
I have the below code. It uses dask distributed to read 100 json files:(Workers: 5
Cores: 5
Memory: 50.00 GB)
from dask.distributed import Client
import dask.dataframe as dd
client = Client('xxxxxxxx:8786')
df = dd.read_json('gs://xxxxxx/2018-04-18/data-*.json')
df = client.persist(df)
When I run the code, I only see one worker takes up the read_json() task, and then I got memory error and got WorkerKilled error.
Should I manually read each file and concat them? or is dask supposed to do it under-the-hood?
You may want to use dask.bag instead of dask.dataframe
import json
import dask.bag as db
mybag = db.read_text('gs://xxxxxx/2018-04-18/data-*.json').map(json.loads)
After that you can convert the bag into a dask dataframe with
mybag.to_dataframe()
This may require some additional uses of dask.map to get the structure right.
If your data is hadoop style json (aka one object per line), the bag trick will still work but you may need to operate on individual lines.
The following operation works, but takes nearly 2h:
from dask import dataframe as ddf
ddf.read_csv('data.csv').to_parquet('data.pq')
Is there a way to parallelize this?
The file data.csv is ~2G uncompressed with 16 million rows by 22 columns.
I'm not sure if it is a problem with data or not. I made a toy example on my machine and the same command takes ~9 seconds
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask.distributed import Client
import dask
client = Client()
# if you wish to connect to the dashboard
client
# fake df size ~2.1 GB
# takes ~180 seconds
N = int(5e6)
df = pd.DataFrame({i: np.random.rand(N)
for i in range(22)})
df.to_csv("data.csv", index=False)
# the following takes ~9 seconds on my machine
dd.read_csv("data.csv").to_parquet("data_pq")
read_csv has a blocksize parameter (docs) that you can use to control the size of the resulting partition and hence, the number of partitions. That, from what I understand, will result in reading the partitions in parallel - each worker will read block size at relevant offset.
You can set blocksize so that it yields the required number of partition to take advantage of the cores you have. For example
cores = 8
size = os.path.getsize('data.csv')
ddf = dd.read_csv("data.csv", blocksize=np.rint(size/cores))
print(ddf.npartitions)
Will output:
8
Better yet, you can try to modify the size so that the resulting parquet has partitions of recommended size (which I have seen different numbers in different places :-|).
Writing out multiple CSV files in parallel is also easy with the repartition method:
df = dd.read_csv('data.csv')
df = df.repartition(npartitions=20)
df.to_parquet('./data_pq', write_index=False, compression='snappy')
Dask likes working with partitions that are around 100 MB, so 20 partitions should be good for a 2GB dataset.
You could also speed this up, by splitting the CSV before reading it, so Dask can read the CSV files in parallel. You could use these tactics to break up the 2GB CSV into 20 different CSVs and then write them out without repartitioning:
df = dd.read_csv('folder_with_small_csvs/*.csv')
df.to_parquet('./data_pq', write_index=False, compression='snappy')