Lazily create dask dataframe from generator - python

I want to lazily create a Dask dataframe from a generator, which looks something like:
[parser.read(local_file_name) for local_file_name in repo.download_files()]
Where both parser.read and repo.download_files return generators (using yield). parser.read yields a dictionary of key-value pairs which, if I were just using plain pandas, I would collect into a list and then use:
df = pd.DataFrame(parsed_rows)
What's the best way to create a Dask dataframe from this? The reasons are that a) I don't necessarily know the number of results returned, and b) I don't know the memory capacity of the machine it will be deployed on.
Alternatively, what should I be doing differently (e.g. maybe create a bunch of dataframes and then put those into Dask instead)?
Thanks.

If you want to use the single-machine Dask scheduler then you'll need to know how many files you have to begin with. This might be something like the following:
from dask import delayed
import dask.dataframe as dd
filenames = repo.download_files()
dataframes = [delayed(load)(filename) for filename in filenames]  # `load` turns one file into a pandas DataFrame
df = dd.from_delayed(dataframes)
If you use the distributed scheduler you can add new computations on the fly, but this is a bit more advanced.
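As a rough sketch of that distributed approach (hedged: `load` and `repo.download_files` are the same placeholder names as above, and the final combination step may vary), tasks can be submitted as soon as the generator yields each filename:
import pandas as pd
from dask.distributed import Client, as_completed
client = Client()  # start or connect to a distributed cluster
# submit one task per file as soon as the generator yields its name
futures = [client.submit(load, filename) for filename in repo.download_files()]
# gather the resulting pandas DataFrames as they finish and combine them
parts = [future.result() for future in as_completed(futures)]
df = pd.concat(parts, ignore_index=True)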

Related

Slow Pandas Series initialization from list of DataFrames

I found it to be extremely slow if we initialize a pandas Series object from a list of DataFrames. E.g. the following code:
import pandas as pd
import numpy as np
# creating a large (~8GB) list of DataFrames.
l = [pd.DataFrame(np.zeros((1000, 1000))) for i in range(1000)]
# This line executes extremely slow and takes almost extra ~10GB memory. Why?
# It is even much, much slower than the original list `l` construction.
s = pd.Series(l)
Initially I thought the Series initialization accidentally deep-copied the DataFrames, which would make it slow, but it turned out to be just a copy by reference, as the usual = in Python does.
On the other hand, if I just create a series and manually shallow copy elements over (in a for loop), it will be fast:
# This for loop is faster. Why?
s1 = pd.Series(data=None, index=range(1000), dtype=object)
for i in range(1000):
    s1[i] = l[i]
What is happening here?
Real-life usage: I have a table loader which reads something from disk and returns a pandas DataFrame (a table). To speed up the reading, I use a parallel tool (from this answer) to execute multiple reads (each read is for one date, for example), and it returns a list (of tables). Now I want to transform this list into a pandas Series object with a proper index (e.g. the date or file location used in the read), but the Series construction takes a ridiculous amount of time (as the sample code above shows). I can of course write it as a for loop to solve the issue, but that'll be ugly. Besides, I want to know what is really taking the time here. Any insights?
This is not a direct answer to the OP's question (what's causing the slow-down when constructing a series from a list of dataframes):
I might be missing an important advantage of using pd.Series to store a list of dataframes; however, if that's not critical for downstream processes, a better option might be to store this as a dictionary of dataframes or to concatenate them into a single dataframe.
For the dictionary of dataframes, one could use something like:
d = {n: df for n, df in enumerate(l)}
# can change the key to something more useful in downstream processes
For concatenation:
w = pd.concat(l, axis=1)
# Note: with the snippet from this question the column names will be
# duplicated (all the frames share the same names), but if your actual
# list of dataframes has unique column names, the concatenated dataframe
# will behave as a normal dataframe with unique column names.
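As a small usage sketch of the dictionary option for the real-life case above (the dates and variable names here are made up, not from the question):
import pandas as pd
# hypothetical: key each table by the date used for its read
dates = pd.date_range("2021-01-01", periods=len(l))
tables = {date: frame for date, frame in zip(dates, l)}
# individual tables can then be looked up directly by date
first_table = tables[dates[0]]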

Best way to perform arbitrary operations on groups with Dask DataFrames

I want to use Dask for operations of the form
df.groupby(some_columns).apply(some_function)
where some_function() may compute some summary statistics, perform timeseries forecasting, or even just save the group to a single file in AWS S3.
Dask documentation states (and several other StackOverflow answers cite) that groupby-apply is not appropriate for aggregations:
Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
It is not clear whether Aggregation supports operations on multiple columns. However, this DataFrames tutorial seems to do exactly what I'm suggesting, with roughly some_function = lambda x: LinearRegression().fit(...). The example seems to work as intended, and I've similarly had no problems so far with e.g. some_function = lambda x: x.to_csv(...).
Under what conditions can I expect that some_function will be passed all rows of the group? If this is never guaranteed, is there a way to break the LinearRegression example? And most importantly, what is the best way to handle these use cases?
It appears that the current version of the documentation and the source code are not in sync. Specifically, the source code for dask.groupby contains this message:
Dask groupby supports reductions, i.e., mean, sum and alike, and apply. The former do not shuffle the data and are efficiently implemented as tree reductions. The latter is implemented by shuffling the underlying partitions such that all items of a group can be found in the same partition.
This is not consistent with the warning in the docs about partition-group pairs. The snippet below and its task graph visualization also show that data is shuffled so that partitions contain all members of the same group:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({'group': [0,0,1,1,2,0,1,2], 'npeople': [1,2,3,4,5,6,7,8]})
ddf = dd.from_pandas(df, npartitions=2)
def myfunc(df):
    return df['npeople'].sum()
results_pandas = df.groupby('group').apply(myfunc).sort_index()
results_dask = ddf.groupby('group').apply(myfunc).compute().sort_index()
print(sum(results_dask != results_pandas))
# will print 0, so there are no differences
# between dask and pandas groupby
This is speculative, but maybe one way to work around the scenario where rows from a single group end up split across partitions is to explicitly re-partition the data so that each group is associated with a unique partition.
One way to achieve that is to create an index that matches the group identifier column. This is in general not a cheap operation, but it can be helped by pre-processing the data so that the group identifier is already sorted.
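For instance, building on the toy frame above (speculative; the group_key name is just illustrative):
# keep 'group' as a column and index on a copy of it, so set_index
# shuffles whole groups into the same partition
ddf_by_group = ddf.assign(group_key=ddf['group']).set_index('group_key')
results_repartitioned = (
    ddf_by_group
    .groupby('group')
    .apply(myfunc, meta=('npeople', 'int64'))
    .compute()
    .sort_index()
)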

Outputting pyspark dataframe in capnp (cap'n proto) format

I have been tasked with outputting a Pyspark Dataframe into cap'n proto (.capnp) format. Does anyone have a suggestion for the best way to do this?
I have a capnp schema, and I have seen the python wrapper for capnp (http://capnproto.github.io/pycapnp/), but I'm still not sure what's the best way to go from dataframe to capnp.
The easiest way is to go to an RDD, then use mapPartitions to collect partitions as serialized byte arrays and join them in collect(), or use toLocalIterator with saving to disk if the dataframe is big. See the example pseudocode:
create = your_serialization_method
serialize_partition = lambda partition: [b''.join([create(object).to_bytes() for object in partition])] # creates one-element partition
output = b''.join(df.rdd.mapPartitions(serialize_partition).collect())
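To make the serialization step concrete, here is a hedged sketch using pycapnp; 'record.capnp', the Record struct and its name/value fields are stand-ins for your actual schema, not anything given in the question:
import capnp
record_capnp = capnp.load('record.capnp')    # hypothetical schema file
def serialize_partition(rows):
    chunks = []
    for row in rows:                          # rows are pyspark Row objects
        msg = record_capnp.Record.new_message()
        msg.name = row['name']                # map dataframe columns to schema fields
        msg.value = float(row['value'])
        chunks.append(msg.to_bytes())
    yield b''.join(chunks)                    # one serialized blob per partition
output = b''.join(df.rdd.mapPartitions(serialize_partition).collect())
with open('output.bin', 'wb') as f:
    f.write(output)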

How to clone RDD object [Pyspark]

1) How can I clone an RDD object to another?
2) For reading a CSV file, I'm using pandas to read it and then using sc.parallelize to convert the list to an RDD object. Is that okay, or should I use some RDD method to read directly from CSV?
3) I understand that I need to convert huge data to RDDs, but do I also need to convert single int values into RDDs? If I just declare an int variable, will it be distributed across nodes?
You can use spark-csv to read a CSV file in Spark.
Here is how you read it in Spark 2.X:
df = (spark.read
      .schema(my_schema)
      .option("header", "true")
      .csv("data.csv"))
In Spark < 2.0:
df = (sqlContext
      .read.format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(my_schema)
      .load("data.csv"))
For more options to fit your requirements, please refer here.
You can just assign an RDD or dataframe to another variable to get a clone.
Hope this helps.
I am a bit confused by your question 1). An RDD is an immutable object, so you can load your data into one RDD and then define two different transformations based on your initial RDD. Each transformation will use the original RDD and generate new RDDs.
Something like this:
# load your CSV
loaded_csv_into_rdd = sc.textFile('data.csv').map(lambda x: x.split(','))
# You could even .cache or .persist the data
# Here two new RDDs will be created based on the data that you loaded
one_rdd = loaded_csv_into_rdd.<apply one transformation>
two_rdd = loaded_csv_into_rdd.<apply another transformation>
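Purely for illustration (these transformations and column positions are made up, not from the question), the two placeholders might look like:
one_rdd = loaded_csv_into_rdd.filter(lambda row: row[0] != 'id')   # e.g. drop header-like rows
two_rdd = loaded_csv_into_rdd.map(lambda row: row[1])              # e.g. project the second field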
This is using the low-level API, RDDs. It is probably better to do this with the DataFrame API (Dataset[Row]), as it can infer the schema and will in general be easier to use.
On 2), what you are looking for, if you want to use RDDs, is sc.textFile; you then apply a split on commas to generate lists that you can then manage.
On 3), in Spark you do not declare variables. You are working with functional programming, so state is not something you need to keep. Accumulators are a special case, but in general you define functions that are applied to the entire dataset, something that is called coarse-grained transformations.
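As a small illustration of the accumulator special case (the names here are made up), one could count rows across the cluster like this:
row_count = sc.accumulator(0)
loaded_csv_into_rdd.foreach(lambda row: row_count.add(1))
print(row_count.value)  # total rows seen across all partitions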

basic groupby operations in Dask

I am attempting to use Dask to handle a large file (50 GB). Typically, I would load it into memory and use Pandas. I want to group by two columns "A" and "B", and whenever column "C" starts with a value, I want to repeat that value within that column for that particular group.
In pandas, I would do the following:
df['C'] = df.groupby(['A','B'])['C'].fillna(method = 'ffill')
What would be the equivalent in Dask?
Also, I am a little bit lost as to how to structure problems in Dask as opposed to in Pandas.
Thank you.
My progress so far:
First set index:
df1 = df.set_index(['A','B'])
Then groupby:
df1.groupby(['A','B']).apply(lambda x: x.fillna(method='ffill')).compute()
It appears dask does not currently implement the fillna method for GroupBy objects. I tried PRing it some time ago but gave up fairly quickly.
Also, dask doesn't support the method parameter (as it isn't always trivial to implement with delayed algorithms).
A workaround for this could be using fillna before grouping, like so:
df['C'] = df.fillna(0).groupby(['A','B'])['C']
Although this wasn't tested.
You can find my (failed) attempt here: https://github.com/nirizr/dask/tree/groupy_fillna
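Separately from the answer above, a hedged sketch of another workaround that is sometimes used: shuffle the data so that every (A, B) group ends up in a single partition, then apply the plain pandas groupby-ffill per partition with map_partitions (the file path here is a placeholder):
import dask.dataframe as dd
ddf = dd.read_csv('large_file.csv')   # placeholder path; load lazily
def ffill_per_group(pdf):
    # pdf is a plain pandas DataFrame holding one partition
    pdf = pdf.reset_index()
    pdf['C'] = pdf.groupby(['A', 'B'])['C'].ffill()
    return pdf
result = (
    ddf
    .set_index('A')                   # shuffle so all rows with the same 'A' share a partition
    .map_partitions(ffill_per_group)
)
result.to_parquet('filled/')          # write out lazily rather than pulling 50 GB into memory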
