Ingesting 150 CSVs into one data source - Python

Hello, I am completely new to handling big data but comfortable in Python.
I have 150 CSVs, each about 70 MB in size, which I have to combine into one source to derive basic stats like unique counts, unique names and so on.
Could anyone suggest how I should go about it?
I came across the package 'pyelasticsearch' in Python; how feasible is it for me to use it in Enthought Canopy?
Suggestions needed!

Try the pandas package.
Reading a single CSV would be:
import pandas as pd
df = pd.read_csv('filelocation.csv')
In the case of multiple files, just concat them. Let's say ls is a list of file locations; then:
df = pd.concat([pd.read_csv(f) for f in ls])
And then to write them out as a single file, do:
df.to_csv('output.csv')
Of course, all of this is valid for in-memory operations (150 x 70 MB ≈ 10.5 GB of RAM). If that's not possible, consider building an incremental process or using dask dataframes.
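If the data does not fit in memory, here is a minimal sketch of such an incremental process, assuming the goal from the question is unique counts/names; the folder 'data/' and the column name 'name' are placeholders, not from the question:
import glob
import pandas as pd

unique_names = set()
row_count = 0
for path in glob.glob('data/*.csv'):  # placeholder folder holding the 150 CSVs
    # read each 70 MB file in chunks so only a small piece is in memory at a time
    for chunk in pd.read_csv(path, chunksize=100_000):
        row_count += len(chunk)
        unique_names.update(chunk['name'].dropna().unique())
print(row_count, len(unique_names))
With dask the same idea is roughly dd.read_csv('data/*.csv') followed by .nunique(), which handles the chunking for you.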

One option if you are in AWS:
Step 1 - move the data to S3 (AWS native file storage)
Step 2 - create a table for each data structure in Redshift
Step 3 - run the COPY command to move the data from S3 to Redshift (AWS native DW)
The COPY command loads data in bulk and can pick up files by name prefix/pattern.
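A rough sketch of step 3 issued from Python (the table name, bucket, IAM role and connection details below are placeholders, not from the question):
import psycopg2

conn = psycopg2.connect(host='my-cluster.xxxxxx.redshift.amazonaws.com', port=5439,
                        dbname='dev', user='awsuser', password='...')
with conn, conn.cursor() as cur:
    # one COPY picks up every object under the prefix, i.e. all 150 CSVs
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/csv-prefix/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """)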

Related

Pyspark read and combine many parquet files efficiently

I have ~ 4000 parquet files that are each 3mb. I would like to read all of the files from an S3 bucket, do some aggregations, combine the files into one dataframe, and do some more aggregations. The parquet dataframes all have the same schema. The files are not all in the same folder in the S3 bucket but rather are spread across 5 different folders. I need some files in each folder but not all. I am working on my local machine which has 16gb RAM and 8 processors that I hope to leverage to make this faster. Here is my code:
# imports needed by this snippet (spark, ids, ids_df and dataset_path are defined elsewhere)
from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as f

ts_dfs = []
# For loop to read parquet files and append to empty list
for id in ids:
    # Make the path to the timeseries data
    timeseries_path = f'{dataset_path}/timeseries_individual_buildings/by_county/upgrade=0/county={id[0]}'
    ts_data_df = spark.read.parquet(timeseries_path).select(
        'bldg_id',
        '`out.natural_gas.heating.energy_consumption_intensity`',
        '`out.electricity.heating.energy_consumption_intensity`',
        'timestamp')
    # Aggregate
    ts_data_df = ts_data_df \
        .groupBy(f.month('timestamp').alias('month'), 'bldg_id') \
        .agg(f.sum('`out.electricity.heating.energy_consumption_intensity`').alias('eui_elec'),
             f.sum('`out.natural_gas.heating.energy_consumption_intensity`').alias('eui_gas'))
    # Append
    ts_dfs.append(ts_data_df)

# Combine all of the dfs into one
ts_sdf = reduce(DataFrame.unionAll, ts_dfs)
# Merge with ids_df
ts = ts_sdf.join(ids_df, on=['bldg_id'])
# Mean and Standard Deviation by month
stats_df = ts_sdf.groupBy('month', '`in.hvac_heating_type_and_fuel`') \
    .agg(f.mean('eui_elec').alias('mean_eui_elec'),
         f.stddev('eui_elec').alias('std_eui_elec'),
         f.mean('eui_gas').alias('mean_eui_gas'),
         f.stddev('eui_gas').alias('std_eui_gas'))
Reading all of the files through a for loop does not leverage the multiple cores, defeating the purpose of using Spark. As such, this process takes 90 minutes on its own (though that may be more a function of my internet connection). Additionally, the next step, ts_sdf = reduce(DataFrame.unionAll, ts_dfs), which combines the dataframes using unionAll, is taking even longer. I think my questions are:
How do I avoid having to read the files in a forloop?
Why is the unionAll taking so long? Does PySpark not parallelize this process somehow?
In general, how do I do this better? Perhaps PySpark is not the ideal tool for this?
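Not part of the original post, but one way to address the first question: spark.read.parquet accepts multiple paths, so the selected county folders can be read in a single call and Spark can parallelize the scan itself. A sketch, reusing ids and dataset_path from the question:
paths = [
    f'{dataset_path}/timeseries_individual_buildings/by_county/upgrade=0/county={id[0]}'
    for id in ids
]
# one read over all the needed folders instead of a Python loop of reads
ts_sdf = (spark.read.parquet(*paths)
          .select('bldg_id',
                  '`out.natural_gas.heating.energy_consumption_intensity`',
                  '`out.electricity.heating.energy_consumption_intensity`',
                  'timestamp'))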

handling zip files using python library pandas

We have a big [file_name.tar.gz] file - big in the sense that our machine cannot handle it in one go. It has three types of files inside it, let us say [first_file.unl, second_file.unl, third_file.unl].
Background about the unl extension: pd.read_csv is able to read these files successfully without giving any kind of errors.
I am trying the steps below in order to accomplish the task.
Step 1:
all_files = glob.glob(path + "/*.gz")
The step above is able to list all three types of files; now using the code below to process further.
Step 2:
li = []
for filename in all_files:
    df_a = pd.read_csv(filename, index_col=False, header=0, names=header_name,
                       low_memory=False, sep="|")
    li.append(df_a)
Step 3:
frame = pd.concat(li, axis=0, ignore_index=True)
All three steps will work perfectly if:
we have small data that could fit in our machine's memory
we have only one type of file inside the zip file
How do we overcome this problem? Please help.
We are expecting code that has the ability to read a particular file type in chunks and create a dataframe for it.
Also, please advise: apart from the pandas library, is there any other approach or library that could handle this more efficiently, considering our data resides on a Linux server?
You can refer to this link:
How do I read a large csv file with pandas?
In general, you can try reading in chunks.
For better performance, I suggest using Dask or PySpark.
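A minimal sketch of the chunked route, reusing header_name and the pipe separator from the question (the chunk size is an arbitrary placeholder):
import pandas as pd

row_count = 0
for chunk in pd.read_csv('first_file.unl', sep='|', header=0,
                         names=header_name, chunksize=500_000):
    # work on each chunk here instead of holding the whole file in memory
    row_count += len(chunk)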
Use tarfile's open, next, and extractfile to get the entries, where extractfile returns a file object with which you can read that entry. You can provide that object to read_csv.
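A sketch of that tarfile approach, assuming the archive and member names from the question and that each entry is pipe-separated like the original read_csv call:
import tarfile
import pandas as pd

with tarfile.open('file_name.tar.gz', 'r:gz') as tar:
    member = tar.next()
    while member is not None:
        if member.isfile() and member.name.endswith('.unl'):
            fileobj = tar.extractfile(member)  # file-like object for this entry
            # read the entry in chunks so memory use stays bounded
            for chunk in pd.read_csv(fileobj, sep='|', header=0,
                                     names=header_name, chunksize=500_000):
                pass  # aggregate or append the chunk here
        member = tar.next()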

Writing a big Spark Dataframe into a csv file

I'm using Spark 2.3 and I need to save a Spark DataFrame into a csv file, and I'm looking for a better way to do it. Looking over related/similar questions, I found this one, but I need something more specific:
If the DataFrame is too big, how can I avoid using Pandas? I used the toCSV() function (code below) and it produced an Out Of Memory error (could not allocate memory).
Is directly writing to a csv using file I/O a better way? Can it preserve the separators?
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file, and when the files are merged, it will have headers in the middle. Am I wrong?
Is using Spark write and then hadoop getmerge better than using coalesce from the point of view of performance?
def toCSV(spark_df, n=None, save_csv=None, csv_sep=',', csv_quote='"'):
    """get spark_df from hadoop and save to a csv file

    Parameters
    ----------
    spark_df: incoming dataframe
    n: number of rows to get
    save_csv=None: filename for exported csv

    Returns
    -------
    """
    # use the more robust method
    # set temp names
    tmpfilename = save_csv or (wfu.random_filename() + '.csv')
    tmpfoldername = wfu.random_filename()
    print(n)
    # write sparkdf to hadoop, get n rows if specified
    if n:
        spark_df.limit(n).write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    else:
        spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    # get merged file from hadoop
    HDFSUtil.getmerge(tmpfoldername, tmpfilename)
    HDFSUtil.rmdir(tmpfoldername)
    # read into pandas df, remove tmp csv file
    pd_df = pd.read_csv(tmpfilename, names=spark_df.columns, sep=csv_sep, quotechar=csv_quote)
    os.remove(tmpfilename)
    # re-write the csv file with header!
    if save_csv is not None:
        pd_df.to_csv(save_csv, sep=csv_sep, quotechar=csv_quote)
If the DataFrame is too big, how can I avoid using Pandas?
You can just save the file to HDFS or S3 or whichever distributed storage you have.
Is directly writing to a csv using file I/O a better way? Can it preserve the separators?
If you mean saving the file to local storage - it will still cause an OOM exception, since you would need to move all the data into memory on the local machine to do it.
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file, and when the files are merged, it will have headers in the middle. Am I wrong?
In this case you will have only one file (since you do coalesce(1)), so you don't need to care about headers. Instead, you should care about memory on the executors - you might get an OOM on the executor, since all the data will be moved to it.
Is using Spark write and then hadoop getmerge better than using coalesce from the point of view of performance?
Definitely better (but don't use coalesce()). Spark will efficiently write the data to storage in parallel, HDFS will replicate it, and after that getmerge will be able to efficiently read the data from the nodes and merge it.
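A sketch of that write-then-getmerge approach (the paths are placeholders; the getmerge step can equally be run from a shell):
import subprocess

# let every executor write its own part file in parallel (no header, so the
# merged file has no stray header lines in the middle)
spark_df.write.csv("/tmp/mycsv_parts", sep=",")

# merge the part files from HDFS into one local file
subprocess.run(["hadoop", "fs", "-getmerge", "/tmp/mycsv_parts", "mycsv.csv"], check=True)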
We used the Databricks spark-csv library. It works fine:
df.save("com.databricks.spark.csv", SaveMode.Overwrite, Map("delimiter" -> delim, "nullValue" -> "-", "path" -> tempFPath))
Library:
<!-- spark df to csv -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.3.0</version>
</dependency>

pandas write dataframe to parquet format with append

I am trying to write a pandas dataframe to the parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing?
The write syntax is:
df.to_parquet(path, mode='append')
The read syntax is:
pd.read_parquet(path)
Looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't have this implementation.
Below is from the pandas docs:
DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)
We have to pass in both engine and **kwargs:
engine: {'auto', 'pyarrow', 'fastparquet'}
**kwargs - additional arguments passed to the parquet library; here we need to pass append=True (from fastparquet).
import pandas as pd
import os.path

file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
if not os.path.isfile(file_path):
    df.to_parquet(file_path, engine='fastparquet')
else:
    df.to_parquet(file_path, engine='fastparquet', append=True)
If append is set to True and the file does not exist, then you will see the error below:
AttributeError: 'ParquetFile' object has no attribute 'fmd'
Running the above script 3 times, the parquet file ends up containing the two rows three times over. If I inspect the metadata, I can see that this resulted in 3 row groups.
Note:
Append could be inefficient if you write too many small row groups. Typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups. Compression will work better, since compression operates within a row group only. There will also be less overhead spent on storing statistics, since each row group stores its own statistics.
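Not from the answer above, but one way to avoid tiny row groups when data arrives as many small dataframes is to buffer them and only append once a reasonably sized batch has accumulated; a sketch:
import os.path
import pandas as pd

buffer = []            # small incoming dataframes
BATCH_ROWS = 100_000   # roughly the recommended row-group size

def add(df, file_path):
    buffer.append(df)
    if sum(len(b) for b in buffer) < BATCH_ROWS:
        return
    batch = pd.concat(buffer, ignore_index=True)
    # append only when the file already exists, as in the script above
    batch.to_parquet(file_path, engine='fastparquet',
                     append=os.path.isfile(file_path))
    buffer.clear()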
To append, do this:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"
# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)
# Write direct to your parquet file
pq.write_to_dataset(table, root_path=output)
This writes a new file into the dataset directory on every call, so reading the directory back gives you the old data plus the appended data.
I used the AWS Wrangler library. It works like a charm.
Below are the reference docs:
https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream and used the kinesis-python library to consume the messages and write to S3. The JSON processing logic is not included, as this post deals with the problem of being unable to append data to S3. This was executed in an AWS SageMaker Jupyter notebook.
Below is the sample code I used:
!pip install awswrangler
import awswrangler as wr
import pandas as pd

evet_data = pd.DataFrame({'a': [a], 'b': [b], 'c': [c], 'd': [d], 'e': [e], 'f': [f], 'g': [g]},
                         columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# print(evet_data)
s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)
try:
    wr.s3.to_parquet(
        df=evet_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # Optional
    )
    print("write successful")
except Exception as e:
    print(str(e))
Happy to help with any clarifications. In a few other posts I have read the advice to read the data back and overwrite it again, but as the data gets larger that slows the process down; it is inefficient.
There is no append mode in pandas' to_parquet(). What you can do instead is read the existing file, add the new data, and write it back, overwriting the old file.
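A minimal sketch of that read-then-rewrite approach (the file path and new_df are placeholders):
import pandas as pd

path = 'output.parquet'
existing = pd.read_parquet(path)
combined = pd.concat([existing, new_df], ignore_index=True)  # new_df holds the rows to add
combined.to_parquet(path)  # overwrites the old file with old + new rows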
Use the fastparquet write function:
from fastparquet import write
write(file_name, df, append=True)
The file must already exist, as I understand it.
The API is documented here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
Pandas to_parquet() can handle both single files as well as directories with multiple files in them. Pandas will silently overwrite the file if the file is already there. To append, just add a new file to the same parquet directory:
import datetime
import os

import pandas as pd

os.makedirs(path, exist_ok=True)
# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))
# read back the whole directory as one dataframe
pd.read_parquet(path)

Spark coalesce vs collect, which one is faster?

I am using pyspark to process 50 GB of data on AWS EMR with ~15 m4.large cores.
Each row of the data contains some information at a specific time on a day. I am using the following for loop to extract and aggregate information for every hour. Finally I union the data, as I want my result to be saved in one csv file.
# daily_df is an empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)
To my knowledge, I have to perform the following to force the pyspark.sql.DataFrame object to save to one csv file (approx 1 MB) instead of 100+ files:
daily_df.coalesce(1).write.csv("some_local.csv")
It seems it took about 70 min to finish this process, and I am wondering if I can make it faster by using the collect() method, like:
daily_df_pandas = daily_df.collect()
daily_df_pandas.to_csv("some_local.csv")
Both coalesce(1) and collect are pretty bad in general, but with an expected output size of around 1 MB it doesn't really matter. It simply shouldn't be a bottleneck here.
One simple improvement is to drop loop -> filter -> union and perform a single aggregation:
df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
If that's not enough, then most likely the issue here is configuration (a good place to start could be adjusting spark.sql.shuffle.partitions, if you don't do that already).
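Putting those two points together, a sketch of a loop-free version; hourFilter from the question is replaced here by the built-in hour('Time'), which is an assumption about what it does:
from pyspark.sql.functions import hour, col, mean, sum as spark_sum

daily_df = (df
            .groupBy(hour("Time").alias("hour"), col("Animal"))
            .agg(mean("weights").alias("mean_weights"),
                 spark_sum("is_male").alias("sum_is_male")))

# output is ~1 MB, so a single file is fine here
daily_df.coalesce(1).write.option("header", "true").csv("some_local_csv_dir")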
To save as a single file, these are the options:
Option 1:
coalesce(1) (minimal shuffling of data over the network) or repartition(1) or collect may work for small datasets, but for large datasets it may not perform as expected, since all the data will be moved to one partition on one node.
Option 1 would be fine if a single executor has more RAM for use than the driver.
Option 2:
Another option would be FileUtil.copyMerge() - to merge the outputs into a single file, as in the code snippet below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
Option 3:
After getting the part files, you can use the hdfs getmerge command like this:
hadoop fs -getmerge /tmp/demo.csv /localmachine/tmp/demo.csv
Now you have to decide based on your requirements which one is safer/faster.
Also, you can have a look at Dataframe save after join is creating numerous part files.
