I have data in Azure Blob Storage that was written with PySpark. At write time, a WeekStartDate field (always a Monday) was used for partitioning, i.e. df.write.partitionBy("WeekStartDate"). This gives rise to a folder/file structure in Azure Blob Storage like this:
container/MyTable/WeekStartDate=2020-09-28/
hash0.gz.parquet
...
hash(n-1).gz.parquet
container/MyTable/WeekStartDate=2020-10-05/
hash0.gz.parquet
...
hash(n-1).gz.parquet
container/MyTable/WeekStartDate=2020-10-12/
hash0.gz.parquet
...
hash(n-1).gz.parquet
container/MyTable/WeekStartDate=2020-10-19/
etc
Now, say I want to read data for just the WeekStartDates 2020-09-28 and 2020-10-05.
Is there any way to do this in pandas without having to build up the two paths and then stitch the results together, like this?
df1 = pd.read_parquet("MyTable/WeekStartDate=2020-09-28", storage_options=xxx)
df2 = pd.read_parquet("MyTable/WeekStartDate=2020-10-05", storage_options=xxx)
df = pd.concat((df1, df2))
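One way that may avoid the manual stitching (a minimal sketch, assuming pandas with the pyarrow engine and an fsspec-compatible filesystem such as adlfs behind storage_options, and reusing the path/credentials placeholders from the question): point read_parquet at the table root and push a partition filter down.
import pandas as pd

# Read from the table root and filter on the Hive-style partition column;
# only the matching WeekStartDate=... folders should be scanned.
df = pd.read_parquet(
    "MyTable",
    engine="pyarrow",
    filters=[("WeekStartDate", "in", ["2020-09-28", "2020-10-05"])],
    storage_options=xxx,  # same credentials as in the question
)
# Note: depending on how pyarrow infers the partition column's type,
# the filter values may need to be datetime.date objects instead of strings.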
Related
I have a few parquet files stored in my storage account, which I am trying to read using the code below. However, it fails with an "incorrect syntax" error. Can someone suggest the correct way to read parquet files using Azure Databricks?
val data = spark.read.parquet("abfss://containername#storagename.dfs.core.windows.net/TestFolder/XYZ/part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet")
display(data)
abfss://containername#storagename.dfs.core.windows.net/TestFolder/XYZ/part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet
Going by the abfss URL above, the data in the storage account can be in either Delta or plain Parquet format.
Note: if you created a Delta table, part files like part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet are created automatically. With the code above it is not possible to read a Delta table's part file as plain Parquet.
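If the folder does turn out to be a Delta table, a minimal sketch of reading it (assuming the Delta Lake libraries are available on the cluster; the container, account and folder names below are placeholders) is to point the reader at the table folder rather than an individual part file:
# Read the whole Delta table folder, not a single part file.
data = spark.read.format("delta").load(
    "abfss://<container>@<storage_account>.dfs.core.windows.net/TestFolder/XYZ"
)
display(data)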
To demonstrate, I wrote the dataframe df1 to a storage account in Parquet format with overwrite mode:
df1.coalesce(1).write.format('parquet').mode("overwrite").save("abfss://<container>@<storage_account>.dfs.core.windows.net/<folder>/<sub_folder>")
Scala
val df11 = spark.read.format("parquet").load("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/part-00000-tid-2397072542034942773-def47888-c000.snappy.parquet")
display(df11)
Python
df11 = spark.read.format("parquet").load("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/part-00000-tid-2397072542034942773-def47888-c000.snappy.parquet")
display(df11)
Output: (screenshot of the resulting dataframe omitted)
Suppose you have two S3 buckets that you want to read into a single Spark data frame. For one file, reading in a Spark data frame would look like this:
file_1 = ("s3://loc1/")
df = spark.read.option("MergeSchema","True").load(file_1)
If we have two files:
file_1 = ("s3://loc1/")
file_2 = ("s3://loc2/")
how would we read them into one Spark data frame? Is there a way to merge those two file locations?
As the previous comment states, you could read each location in individually and then union the resulting data frames.
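For example (a minimal sketch, assuming both locations hold files with compatible schemas and reusing the file_1/file_2 paths from the question):
# Read each location separately, then union the two data frames by column name.
df1 = spark.read.option("mergeSchema", "true").load(file_1)
df2 = spark.read.option("mergeSchema", "true").load(file_2)
df = df1.unionByName(df2)

# Alternatively, DataFrameReader.load accepts a list of paths directly:
df = spark.read.option("mergeSchema", "true").load([file_1, file_2])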
Another option could be to try the Spark RDD API and then convert that into a data frame. So for example:
sc = spark.sparkContext
raw_data_RDD = sc.textFile("<dir1>,<dir2>,...")
For nested directories, you can use the wildcard symbol (*). One thing you have to consider is whether the schemas for both locations are the same; you may have to do some pre-processing before converting to the dataframe. Once your schema is set up, you can just do:
raw_df = spark.createDataFrame(raw_data_RDD, schema=<schema>)
I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a Spark dataframe like this:
dataframe = spark.createDataFrame(pandas_dataframe)
I do that transformation because with Spark, writing dataframes to HDFS is very easy:
dataframe.write.parquet(output_uri, mode="overwrite", compression="snappy")
But the transformation is failing for dataframes which are bigger than 2 GB.
If I transform a Spark dataframe to pandas I can use pyarrow:
# temporarily write the spark dataframe to hdfs
dataframe.write.parquet(path, mode="overwrite", compression="snappy")
# open an hdfs connection using pyarrow (pa)
hdfs = pa.hdfs.connect("default", 0)
# read the parquet files (pyarrow.parquet as pq)
parquet = pq.ParquetDataset(path_hdfs, filesystem=hdfs)
table = parquet.read(nthreads=4)
# convert the arrow table to pandas
pandas = table.to_pandas(nthreads=4)
# delete the temp files
hdfs.delete(path, recursive=True)
This is a fast conversion from Spark to pandas and it also works for dataframes bigger than 2 GB. I could not yet find a way to do it the other way around, i.e. having a pandas dataframe which I transform to Spark with the help of pyarrow. The problem is that I really can't find how to write a pandas dataframe to HDFS.
My pandas version: 0.19.0
Having a pandas dataframe which I transform to Spark with the help of pyarrow.
pyarrow.Table.from_pandas is the function you're looking for:
Table.from_pandas(type cls, df, bool timestamps_to_ms=False, Schema schema=None, bool preserve_index=True)
Convert pandas.DataFrame to an Arrow Table
import pyarrow as pa
pdf = ... # type: pandas.core.frame.DataFrame
adf = pa.Table.from_pandas(pdf) # type: pyarrow.lib.Table
The result can be written directly to Parquet / HDFS without passing data via Spark:
import pyarrow.parquet as pq
fs = pa.hdfs.connect()
with fs.open(path, "wb") as fw:
pq.write_table(adf, fw)
See also
@WesMcKinney's answer on reading parquet files from HDFS using PyArrow.
Reading and Writing the Apache Parquet Format in the pyarrow documentation.
Native Hadoop file system (HDFS) connectivity in Python
Spark notes:
Furthermore, since Spark 2.3 (current master) Arrow is supported directly in createDataFrame (SPARK-20791 - Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame). It uses SparkContext.defaultParallelism to compute the number of chunks, so you can easily control the size of individual batches.
Finally, defaultParallelism can be used to control the number of partitions generated by the standard _convert_from_pandas path, effectively reducing the size of the slices to something more manageable.
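As a concrete illustration (a minimal sketch, assuming Spark 2.3+ with pyarrow installed, and reusing the variable names from the question; the configuration key shown is the Spark 2.x name):
# Enable the Arrow-based conversion path for createDataFrame
# (Spark 2.3/2.4 key; newer versions use spark.sql.execution.arrow.pyspark.enabled).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# The conversion itself is unchanged; only the serialization path differs.
dataframe = spark.createDataFrame(pandas_dataframe)
dataframe.write.parquet(output_uri, mode="overwrite", compression="snappy")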
Unfortunately, these are unlikely to resolve your current memory problems. Both depend on parallelize and therefore store all data in the memory of the driver node. Switching to Arrow or adjusting the configuration can only speed up the process or address block size limitations.
In practice I don't see any reason to switch to Spark here, as long as you use a local pandas DataFrame as the input. The most severe bottleneck in this scenario is the driver's network I/O, and distributing the data won't address that.
From https://issues.apache.org/jira/browse/SPARK-6235
Support for parallelizing R data.frame larger than 2GB
is resolved.
From https://pandas.pydata.org/pandas-docs/stable/r_interface.html
Converting DataFrames into R objects
you can convert a pandas dataframe to an R data.frame
So perhaps the transformation pandas -> R -> Spark -> hdfs?
Another way is to convert your pandas dataframe to a Spark dataframe (using PySpark) and save it to HDFS with a write command.
Example:
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

df = pd.read_csv("data/as/foo.csv")
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].astype(str)
sc = SparkContext(conf=SparkConf())
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(df)
Here astype changes the type of those columns from object to string. This saves you from the exception that would otherwise be raised because Spark cannot figure out the pandas type object. Just make sure these columns really should be strings.
Now, to save your dataframe to HDFS:
sdf.write.csv('mycsv.csv')
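Since the original goal was Parquet on HDFS, the same Spark dataframe can be written in that format instead (a one-line sketch, reusing output_uri from the question):
# Write the converted dataframe as snappy-compressed Parquet instead of CSV.
sdf.write.parquet(output_uri, mode="overwrite", compression="snappy")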
A hack could be to create N pandas dataframes (each less than 2 GB, i.e. horizontal partitioning) from the big one, create N different Spark dataframes, and then merge (union) them into a final one to write to HDFS. I am assuming that your master machine is powerful, but that you also have a cluster available on which you are running Spark.
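For what it's worth, a minimal sketch of that idea (reusing pandas_dataframe and output_uri from the question; the chunk count and the reduce-based union are illustrative choices, not part of the original answer):
from functools import reduce

import numpy as np

# Split the big pandas dataframe into chunks small enough to convert,
# build a Spark dataframe from each chunk, and union them all.
n_chunks = 16  # pick so that each chunk stays well under 2 GB
chunks = np.array_split(pandas_dataframe, n_chunks)
spark_dfs = [spark.createDataFrame(chunk) for chunk in chunks]
dataframe = reduce(lambda a, b: a.unionByName(b), spark_dfs)

dataframe.write.parquet(output_uri, mode="overwrite", compression="snappy")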
Hello, I am completely new to handling big data and comfortable in Python.
I have 150 CSVs, each of size 70 MB, which I have to integrate into one source to get basic stats like unique counts, unique names and so on.
Could anyone suggest how I should go about it?
I came across the package 'pyelasticsearch' in Python; how feasible is it for me to use it in Enthought Canopy?
Suggestions needed!
Try the pandas package.
Reading a single CSV would be:
import pandas as pd
df = pd.read_csv('filelocation.csv')
In the case of multiple files, just concat them. Let's say ls is a list of file locations; then:
df = pd.concat([pd.read_csv(f) for f in ls])
Then, to write them out as a single file, do:
df.to_csv('output.csv')
Of course, all this is only valid for in-memory operations (150 x 70 MB ≈ 10.5 GB of RAM). If that's not possible, consider building an incremental process or using dask dataframes.
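If memory becomes the limiting factor, a minimal dask sketch (assuming the files share a layout and live under a single folder; the glob pattern and column name are placeholders):
import dask.dataframe as dd

# Lazily read all 150 CSVs as one dask dataframe; nothing is loaded yet.
df = dd.read_csv("data/*.csv")

# Stats are computed out-of-core, chunk by chunk, when .compute() is called.
unique_names = df["name"].nunique().compute()
row_count = len(df)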
One option, if you are in AWS:
Step 1 - move the data to S3 (AWS's native file storage)
Step 2 - create a table for each data structure in Redshift
Step 3 - run the COPY command to move the data from S3 to Redshift (AWS's native data warehouse)
The COPY command loads data in bulk and picks up files by name pattern/prefix.
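For illustration, a minimal sketch of step 3 (all names here are hypothetical: it assumes a table my_table already exists, the files sit under an s3://my-bucket/csvs/ prefix, an IAM role with S3 read access is attached to the cluster, and psycopg2 is used for the connection):
import psycopg2

# Connect to the Redshift cluster (connection details are placeholders).
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="...",
)

copy_sql = """
    COPY my_table
    FROM 's3://my-bucket/csvs/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

with conn.cursor() as cur:
    cur.execute(copy_sql)  # loads every file under the csvs/ prefix in bulk
conn.commit()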