I am working with vaex and dask together for some analysis. In the first part of the analysis I do some processing with dask.dataframe, and my intention is to export the dataframe I computed into something vaex can read. I want to export the data into a memory-mappable format, like HDF5 or Arrow.
Dask can export to HDF5 and Parquet files, and vaex can import HDF5 and Arrow. Both can export and import CSV files, but I want to avoid that.
So far I have the following options (and problems):
If I export to an HDF5 file, the file cannot be imported by vaex, because dask writes it in a row-based layout while vaex expects a column-based one (https://vaex.readthedocs.io/en/latest/faq.html).
I can export the data into Parquet files, but I don't know how to read them from vaex. I've seen an answer on SO that converts the files into an Arrow table, but that requires the table to be loaded into memory, which I can't do because the table is too large to fit in memory.
I could of course export to CSV, load it in chunks into vaex, and then export it to a column-based HDF5 file, but I don't think that should be necessary with two modules designed for big objects.
Is there any option I am missing that would "bridge" the two modules without either loading the full table into memory or having to read/write the dataset twice?
To open Parquet with vaex you should use vaex.open, and the extension of your file must be parquet.
Generate Data
fldr = "test"
os.makedirs(fldr, exist_ok=True)
n = 1_000
for i in range(10):
fn = f"{fldr}/file{i}.parquet"
df = pd.DataFrame(np.random.randn(n, 2), columns=["a", "b"])
df["key"] = np.random.randint(0, high=100, size=n)
df.to_parquet(fn, index=False)
Example: aggregation and save with dask
import dask.dataframe as dd

df = dd.read_parquet(fldr)
grp = df.groupby("key").sum()
grp.to_parquet("output")
Read with vaex
import vaex

df = vaex.open("output/part.0.parquet")
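Dask typically writes one Parquet file per partition, so the output folder will usually contain several part files. As a usage note, assuming your vaex version accepts glob patterns in vaex.open (recent releases do), you can open all parts as a single lazy dataframe:

df = vaex.open("output/part.*.parquet")  # glob over all the parts written by dask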
Related
I have Python code like this; it converts a CSV file to a Parquet file.
import pandas as pd
import pyarrow.parquet as pq
df = pd.read_csv('27.csv')
print(df)
df.to_parquet('27.parquet')
da = pd.read_parquet('27.parquet')
metadata = pq.read_metadata('27.parquet')
print(metadata)
print(da.head(10))
The result (https://imgur.com/a/Mhw3sot) shows that the Parquet file version is too high.
I want to change the version to parquet-cpp version 1.5.1-SNAPSHOT (a lower version).
How do I do that? Where do I set the file version?
27.csv is below:
Temp,Flow,site_no,datetime,Conductance,Precipitation,GageHeight
11.0,16200,09380000,2018-06-27 00:00,669,0.00,9.97
10.9,16000,09380000,2018-06-27 00:15,668,0.00,9.93
10.9,15700,09380000,2018-06-27 00:30,668,0.00,9.88
10.8,15400,09380000,2018-06-27 00:45,672,0.00,9.82
10.8,15100,09380000,2018-06-27 01:00,672,0.00,9.77
10.8,14700,09380000,2018-06-27 01:15,672,0.00,9.68
10.7,14300,09380000,2018-06-27 01:30,673,0.00,9.61
10.7,13900,09380000,2018-06-27 01:45,672,0.00,9.53
10.7,13600,09380000,2018-06-27 02:00,671,0.00,9.46
10.6,13200,09380000,2018-06-27 02:15,672,0.00,9.38
11.0,16200,09380000,2018-06-27 00:00,669,0.00,9.97
10.9,16000,09380000,2018-06-27 00:15,668,0.00,9.93
10.9,15700,09380000,2018-06-27 00:30,668,0.00,9.88
10.8,15400,09380000,2018-06-27 00:45,672,0.00,9.82
10.8,15100,09380000,2018-06-27 01:00,672,0.00,9.77
10.8,14700,09380000,2018-06-27 01:15,672,0.00,9.68
10.7,14300,09380000,2018-06-27 01:30,673,0.00,9.61
10.7,13900,09380000,2018-06-27 01:45,672,0.00,9.53
10.7,14300,09380000,2018-06-27 01:30,673,0.00,9.61
As far as I know, the Parquet format in Teradata has some limits, so writing directly with pyarrow.parquet may cause problems.
How do I write a Parquet file that Teradata can read (the file format limits seem relevant)? Has anyone done this before?
Parquet Format Limitations in Teradata
1. The READ_NOS table operator does not support Parquet. However, READ_NOS can be used to view the Parquet schema, using RETURNTYPE('NOSREAD_PARQUET_SCHEMA'). This is helpful in creating the foreign table when you do not know the schema of your Parquet data beforehand.
2. Certain complex data types are not supported, including STRUCT, MAP, LIST, and ENUM.
3. Because support for the STRUCT data type is not available, nested Parquet object stores cannot be processed by Native Object Store.
The files I tried:
https://ufile.io/f/wi1k9
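For what it's worth, pyarrow lets you choose the Parquet format version on the writer. Below is a minimal sketch; it pins the format version written into the file (not the parquet-cpp writer string), and whether this satisfies Teradata's reader is an assumption you would need to verify:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv('27.csv')
table = pa.Table.from_pandas(df)
# version='1.0' writes the older Parquet format; flavor='spark' additionally
# avoids some newer features, for broader reader compatibility.
pq.write_table(table, '27.parquet', version='1.0', flavor='spark')
print(pq.read_metadata('27.parquet'))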
I have a very large dataset that I write to HDF5 in chunks via append, like so:
with pd.HDFStore(self.train_store_path) as train_store:
    for filepath in tqdm(filepaths):
        with open(filepath, 'rb') as file:
            frame = pickle.load(file)

        if frame.empty:
            os.remove(filepath)
            continue

        try:
            train_store.append(
                key='dataset', value=frame,
                min_itemsize=itemsize_dict)
            os.remove(filepath)
        except KeyError as e:
            print(e)
        except ValueError as e:
            print(frame)
            print(e)
        except Exception as e:
            print(e)
The data is far too large to load into one DataFrame, so I would like to try out vaex for further processing. There are a few things I don't really understand, though.
Since vaex uses a different representation in hdf5 than pandas/pytables (VOTable), I'm wondering how to convert between those two formats. I tried loading the data in chunks into pandas, converting it to a vaex DataFrame, and then storing it, but there seems to be no way to append data to an existing vaex hdf5 file, at least none that I could find.
Is there really no way to create a large hdf5 dataset from within vaex? Is the only option to convert an existing dataset to vaex's representation (constructing the file via a Python script or TOPCAT)?
Related to my previous question: if I work with a large dataset in vaex out-of-core, is it possible to then persist the results of any transformations I apply in vaex into the hdf5 file?
The problem with this storage format is that it is not column-based, which does not play well with datasets that have a large number of rows: if you only work with one column, for instance, the OS will probably also read large portions of the other columns, and the CPU cache gets polluted with them. It would be better to store the data in a column-based format such as vaex's hdf5 format or Arrow.
Converting to a vaex dataframe can be done using:
import vaex
vaex_df = vaex.from_pandas(pandas_df, copy_index=False)
You can do this for each dataframe, and store them on disk as hdf5 or arrow:
vaex_df.export('batch_1.hdf5') # or 'batch_1.arrow'
If you do this for many files, you can lazily (i.e. no memory copies will be made) concatenate them, or use the vaex.open function:
df1 = vaex.open('batch_1.hdf5')
df2 = vaex.open('batch_2.hdf5')
df = vaex.concat([df1, df2]) # will be seen as 1 dataframe without mem copy
df_alternative = vaex.open('batch*.hdf5') # same effect, but only needs 1 line
Regarding your question about the transformations:
If you do transformations to a dataframe, you can write out the computed values, or get the 'state', which includes the transformations:
import vaex
df = vaex.example()
df['difference'] = df.x - df.y
# df.export('materialized.hdf5', column_names=['difference']) # do this if IO is fast, and memory abundant
# state = df.state_get() # get state in memory
df.state_write('mystate.json') # or write as json
import vaex
df = vaex.example()
# df.join(vaex.open('materialized.hdf5')) # join on row number (super fast, 0 memory use!)
# df.state_set(state) # or apply the state from memory
df.state_load('mystate.json') # or from disk
df
I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a Spark dataframe like this:
dataframe = spark.createDataFrame(pandas_dataframe)
I do that transformation because with Spark, writing dataframes to HDFS is very easy:
dataframe.write.parquet(output_uri, mode="overwrite", compression="snappy")
But the transformation fails for dataframes that are bigger than 2 GB.
If I transform a spark dataframe to pandas I can use pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

# temporarily write the spark dataframe to hdfs
dataframe.write.parquet(path, mode="overwrite", compression="snappy")
# open an hdfs connection using pyarrow
hdfs = pa.hdfs.connect("default", 0)
# read the parquet files
parquet = pq.ParquetDataset(path_hdfs, filesystem=hdfs)
table = parquet.read(nthreads=4)
# convert the table to pandas
pandas = table.to_pandas(nthreads=4)
# delete the temp files
hdfs.delete(path, recursive=True)
This is a fast conversion from Spark to pandas, and it also works for dataframes bigger than 2 GB. I have not yet found a way to do it the other way around, meaning having a pandas dataframe which I transform to Spark with the help of pyarrow. The problem is that I really can't find how to write a pandas dataframe to HDFS.
My pandas version: 0.19.0
Meaning having a pandas dataframe which I transform to spark with the help of pyarrow.
pyarrow.Table.from_pandas is the function you're looking for:
Table.from_pandas(type cls, df, bool timestamps_to_ms=False, Schema schema=None, bool preserve_index=True)
Convert pandas.DataFrame to an Arrow Table
import pyarrow as pa
pdf = ... # type: pandas.core.frame.DataFrame
adf = pa.Table.from_pandas(pdf) # type: pyarrow.lib.Table
The result can be written directly to Parquet / HDFS without passing data via Spark:
import pyarrow.parquet as pq
fs = pa.hdfs.connect()
with fs.open(path, "wb") as fw:
pq.write_table(adf, fw)
See also
@WesMcKinney's answer on how to read Parquet files from HDFS using PyArrow.
Reading and Writing the Apache Parquet Format in the pyarrow documentation.
Native Hadoop file system (HDFS) connectivity in Python
Spark notes:
Furthermore, since Spark 2.3 (current master at the time of writing), Arrow is supported directly in createDataFrame (SPARK-20791 - Use Apache Arrow to improve Spark createDataFrame from Pandas.DataFrame). It uses SparkContext.defaultParallelism to compute the number of chunks, so you can easily control the size of individual batches.
Finally, defaultParallelism can be used to control the number of partitions generated with the standard _convert_from_pandas, effectively reducing the size of the slices to something more manageable.
Unfortunately, these are unlikely to resolve your current memory problems. Both depend on parallelize, and therefore store all the data in the memory of the driver node. Switching to Arrow or adjusting the configuration can only speed up the process or address block-size limitations.
In practice I don't see any reason to switch to Spark here, as long as you use a local pandas DataFrame as the input. The most severe bottleneck in this scenario is the driver's network I/O, and distributing the data won't address that.
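If you do end up on the Spark >= 2.3 path mentioned above, here is a hedged sketch of enabling the Arrow-based createDataFrame. The config key shown is the Spark 2.3 name; newer releases call it spark.sql.execution.arrow.pyspark.enabled, and the app name is a placeholder:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pandas-to-spark")  # placeholder app name
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())

sdf = spark.createDataFrame(pandas_dataframe)  # pandas_dataframe as in the question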
From https://issues.apache.org/jira/browse/SPARK-6235 ("Support for parallelizing R data.frame larger than 2GB"), that issue is resolved.
From https://pandas.pydata.org/pandas-docs/stable/r_interface.html ("Converting DataFrames into R objects"), you can convert a pandas dataframe to an R data.frame.
So perhaps the transformation pandas -> R -> Spark -> HDFS?
One other way is to convert your pandas dataframe to a Spark dataframe (using pyspark) and save it to HDFS with the write command.
Example:
df = pd.read_csv("data/as/foo.csv")
df[['Col1', 'Col2']] = df[['Col2', 'Col2']].astype(str)
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(df)
Here astype changes the type of your columns from object to string. This saves you from an otherwise raised exception, since Spark can't figure out the pandas type object. But make sure these columns really are of type string.
Now to save your df to HDFS:
sdf.write.csv('mycsv.csv')
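Since the question was ultimately about Parquet on HDFS rather than CSV, here is a hedged variant of that last step (the hdfs:// URI is a placeholder for your namenode):

sdf.write.parquet("hdfs://namenode:8020/user/me/mydata",  # placeholder HDFS path
                  mode="overwrite", compression="snappy")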
A hack could be to create N pandas dataframes (each less than 2 GB, a horizontal partitioning) from the big one, create N different Spark dataframes from them, and then merge (union) them into a final one to write into HDFS. I am assuming that your master machine is powerful, but that you also have a cluster available in which you are running Spark.
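A minimal sketch of that chunking idea, assuming the big pandas dataframe still fits in driver memory and using numpy.array_split to cut it into pieces that each stay under the 2 GB limit (the chunk count and output path are placeholders):

from functools import reduce

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chunked-upload").getOrCreate()

big_pdf = pd.read_csv("data/as/foo.csv")  # the large pandas dataframe
n_chunks = 10                             # choose so each slice stays well under 2 GB

# Convert each horizontal slice separately, then union the pieces back together.
chunks = [spark.createDataFrame(piece) for piece in np.array_split(big_pdf, n_chunks)]
sdf = reduce(lambda a, b: a.union(b), chunks)

sdf.write.parquet("hdfs://namenode:8020/user/me/big_table", mode="overwrite")  # placeholder path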
I was just wondering if there is a way to improve the performance of reading large CSV files into a pandas dataframe. I have 3 large (3.5 million records each) pipe-delimited files which I want to load into a dataframe and perform some tasks on. Currently I am using pandas.read_csv(), defining the columns and their datatypes in the parameters like below. I did see some improvement by defining the datatypes of the columns, but it still takes more than 3 minutes to load.
import pandas as pd

df = pd.read_csv(
    file_, index_col=None, usecols=sourceFields, sep='|', header=0,
    dtype={'date': 'str', 'gwTimeUtc': 'str', 'asset': '|str',
           'instrumentId': '|str', 'askPrice': 'float64', 'bidPrice': 'float64',
           'askQuantity': 'float64', 'bidQuantity': 'float64', 'currency': '|str',
           'venue': '|str', 'owner': '|str', 'status': '|str',
           'priceNotation': '|str', 'nominalQuantity': 'float64'})
Depending on what you wish to do with the data, a good option is dask.dataframe. This library works out-of-core and allows you to perform a subset of pandas operations lazily. You can then bring the results into memory as a pandas dataframe. Below is example code you can try:
import dask.dataframe as dd, pandas as pd
# point to all files beginning with "file"
dask_df = dd.read_csv('file*.csv')
# define your calculations as you would in pandas
dask_df['col2'] = dask_df['col1'] * 2
# compute results & return to pandas
df = dask_df.compute()
Crucially, nothing significant is computed until the very last line.
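As a usage note, dd.read_csv forwards most keyword arguments to pandas.read_csv, so the pipe delimiter and explicit dtypes from the question carry over largely unchanged (the file pattern and the column subset below are placeholders):

import dask.dataframe as dd

# sep and dtype are passed through to pandas.read_csv under the hood
dask_df = dd.read_csv('trades_*.csv', sep='|', header=0,
                      dtype={'askPrice': 'float64', 'bidPrice': 'float64',
                             'askQuantity': 'float64', 'bidQuantity': 'float64'})
# nothing is read until an actual computation is requested
mean_ask = dask_df['askPrice'].mean().compute()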
The .feather file is significantly faster than .csv. Pandas has built-in support for feather files.
Read the CSV in using pd.read_csv(path) and then export it to a feather file with df.to_feather(path). From then on, read the feather file (pd.read_feather) instead of the CSV.
In my case, a 950 MB csv file was compressed to a 180 MB feather file. Instead of taking 30 seconds to read, it takes about 1 second. I know I am a bit late to the party, but feather files are seriously underrated.
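A minimal sketch of that one-off conversion (the paths are placeholders; DataFrame.to_feather requires pyarrow to be installed):

import pandas as pd

# pay the slow CSV parse once
df = pd.read_csv('large_file.csv', sep='|')
df.to_feather('large_file.feather')

# subsequent loads read the (much faster) feather file instead
df = pd.read_feather('large_file.feather')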