Stripping whitespace from a dask dataframe column - python

I am relatively new to Dask and have a large (12 GB) file that I wish to process. This file was imported from a SQL BCP file, and I want to wrangle it with Dask prior to uploading it back to SQL. As part of this, I need to remove some leading whitespace, e.g. ' SQL Tutorial' changed to 'SQL Tutorial'. I would do this using pandas as follows:
df_train['column1'] = pd.core.strings.str_strip(df_train['column1'])
Dask doesn't seem to have this feature, as I get the error:
AttributeError: module 'dask.dataframe.core' has no attribute 'strings'
Is there a memory-efficient way to do this using dask?

After a long search I found it in the Dask API:
str
Namespace for string methods
So you can use:
df_train['column1'] = df_train['column1'].str.strip()
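For example, a minimal end-to-end sketch (the file names here are placeholders, and the column is assumed to be called column1 as above):

import dask.dataframe as dd

df_train = dd.read_csv('train_*.csv', dtype=str)        # lazy, chunked read
df_train['column1'] = df_train['column1'].str.strip()   # same .str accessor as pandas
df_train.to_csv('train_clean_*.csv', index=False)       # triggers the computation

Because Dask works on partitions, this stays memory-efficient: each chunk is read, stripped, and written out without loading the whole 12 GB file at once.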

Related

Using Pyarrow to read parquet files written by Spark increases memory significantly

I am running into this issue where the memory usage of my container increases drastically when I try to read a parquet file. The parquet file was written using Spark and I am trying to read it using Pyarrow.
In this specific scenario, a 10 MB parquet file takes over 3 GB of memory once read. I am guessing that the data types of some columns are given large memory reservations, but I do not know for sure.
Here is the code and some screenshots of what I am seeing:
The code that writes the file (Scala 2.11, Spark 2.2.1, AWS Glue):
def write(implicit spark: SparkSession, df: DataFrame, s3Bucket: String, s3Key: String, mode: String): Boolean = {
  df.write
    .mode(mode)
    .option("header", true)
    .parquet(s3Bucket + "/" + s3Key)
  true
}
The schema of the file once it's written:
The code that reads the file (pandas==0.25.2, pyarrow==0.15.1):
import pyarrow.parquet as pq

@profile
def readParquet():
    srDf = pq.ParquetDataset('test.parquet').read().to_pandas()
    print(srDf.info(verbose=True))

readParquet()
The output of memory_profiler:
The file itself:
The schema of the file after it is read:
I am confused as to why the string columns are being inferred as object types by pyarrow. I believe that is what is causing all the memory increase. This is a small amount of data and I was hoping to process it in AWS Lambda, but Lambda has a 3 GB memory limitation, so I am unable to read the file there. I looked into explicitly defining a schema for pyarrow to use when reading the parquet file, but I couldn't get that to work; it doesn't look like it's supported unless I have overlooked something. Please let me know if you see an obvious error in my implementation. Thank you.
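Not from the original post, but two things that may help with the blow-up during conversion to pandas are reading only the columns that are actually needed and asking pyarrow to map string columns to pandas categoricals rather than plain Python objects. A rough sketch (the column names are placeholders):

import pyarrow.parquet as pq

# read only the columns that are actually needed
table = pq.ParquetDataset('test.parquet').read(columns=['id', 'name'])

# convert repeated strings to pandas categoricals instead of Python objects,
# which is usually where the memory goes
srDf = table.to_pandas(strings_to_categorical=True)
print(srDf.info(verbose=True))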

Deleting files from the Hadoop with pyspark (Query)

I'm using Hadoop for storing my data: for some of the data I'm using partitions, for some I'm not.
I'm saving the data in parquet format using the pyspark DataFrame class, like this:
df = sql_context.read.parquet('/some_path')
df.write.mode("append").parquet(parquet_path)
I want to write a script that deletes old data in a similar way (I need to query this old data by filtering the data frame) with pyspark. I haven't found anything in the pyspark documentation...
Is there a way to achieve this?
Pyspark is predominantly a processing engine. The deletion can be handled by the subprocess module of plain Python itself.
import subprocess
some_path = ...
subprocess.call(["hadoop", "fs", "-rm", "-f", some_path])
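If the old data first has to be found by filtering the data frame, one possible sketch is to collect the matching partition values with pyspark and then remove the corresponding directories with subprocess. The partition column load_date and the cutoff below are hypothetical:

import subprocess

parquet_path = '/some_path'
cutoff = '2019-01-01'

df = sql_context.read.parquet(parquet_path)
old_values = (df.select('load_date')
                .distinct()
                .filter(df.load_date < cutoff)   # query the old data via the data frame
                .collect())

for row in old_values:
    # build the partition directory path and delete it from HDFS
    partition_dir = '{}/load_date={}'.format(parquet_path, row.load_date)
    subprocess.call(["hadoop", "fs", "-rm", "-r", "-f", partition_dir])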

Writing a big Spark Dataframe into a csv file

I'm using Spark 2.3 and I need to save a Spark DataFrame to a CSV file, and I'm looking for a better way to do it. Looking over related/similar questions, I found this one, but I need something more specific:
If the DataFrame is too big, how can I avoid using Pandas? Because when I used the toCSV() function (code below), it produced an Out Of Memory error (could not allocate memory).
Is directly writing to a csv using file I/O a better way? Can it preserve the separators?
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file and when the files are merged, it will have headers in the middle. Am I wrong?
Is using spark write and then hadoop getmerge better than using coalesce from a performance point of view?
def toCSV(spark_df, n=None, save_csv=None, csv_sep=',', csv_quote='"'):
    """get spark_df from hadoop and save to a csv file

    Parameters
    ----------
    spark_df: incoming dataframe
    n: number of rows to get
    save_csv=None: filename for exported csv

    Returns
    -------
    """
    # use the more robust method
    # set temp names
    tmpfilename = save_csv or (wfu.random_filename() + '.csv')
    tmpfoldername = wfu.random_filename()
    print(n)
    # write sparkdf to hadoop, get n rows if specified
    if n:
        spark_df.limit(n).write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    else:
        spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    # get merged file from hadoop
    HDFSUtil.getmerge(tmpfoldername, tmpfilename)
    HDFSUtil.rmdir(tmpfoldername)
    # read into pandas df, remove tmp csv file
    pd_df = pd.read_csv(tmpfilename, names=spark_df.columns, sep=csv_sep, quotechar=csv_quote)
    os.remove(tmpfilename)
    # re-write the csv file with header!
    if save_csv is not None:
        pd_df.to_csv(save_csv, sep=csv_sep, quotechar=csv_quote)
If the DataFrame is too big, how can I avoid using Pandas?
You can just save the file to HDFS or S3 or whichever distributed storage you have.
Is directly writing to a csv using file I/O a better way? Can it preserve the separators?
If you mean saving the file to local storage - it will still cause an OOM exception, since you will need to move all the data into memory on the local machine to do it.
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file and when the files are merged, it will have headers in the middle. Am I wrong?
In this case you will have only 1 file (since you do coalesce(1)). So you don't need to care about headers. Instead - you should care about memory on the executors - you might get OOM on the executor since all the data will be moved to that executor.
Is using spark write and then hadoop getmerge better than using coalesce from a performance point of view?
Definitely better (but don't use coalesce()). Spark will efficiently write the data to storage, HDFS will replicate it, and after that getmerge will be able to efficiently read the data from the nodes and merge it into a single local file.
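A rough sketch of that flow (not from the original answer; the paths are placeholders, and the header is prepended by hand since the part files are written without one):

import subprocess

hdfs_dir = '/tmp/mycsv_parts'

# let every executor write its own part file in parallel, without headers,
# so no header ends up in the middle of the merged file
spark_df.write.option("header", "false").csv(hdfs_dir, sep=',', quote='"')

# write the header once locally, then append the merged part files to it
with open('mycsv.csv', 'w') as f:
    f.write(','.join(spark_df.columns) + '\n')
subprocess.call(["hadoop", "fs", "-getmerge", hdfs_dir, "parts.csv"])
subprocess.call(["bash", "-c", "cat parts.csv >> mycsv.csv && rm parts.csv"])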
We used the Databricks spark-csv library. It works fine:
df.save("com.databricks.spark.csv", SaveMode.Overwrite, Map("delimiter" -> delim, "nullValue" -> "-", "path" -> tempFPath))
Library :
<!-- spark df to csv -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.3.0</version>
</dependency>

Converting JSON file to SQLITE or CSV

I'm attempting to convert a JSON file to an SQLite or CSV file so that I can manipulate the data with python. Here is where the data is housed: JSON File.
I found a few converters online, but those couldn't handle the quite large JSON file I was working with. I tried using a python module called sqlbiter but again, like the others, was never really able to output or convert the file.
I'm not sure where to go now. If anyone has any recommendations or insights on how to get this data into a database, I'd really appreciate it.
Thanks in advance!
EDIT: I'm not looking for anyone to do it for me, I just need to be pointed in the right direction. Are there other methods I haven't tried that I could learn?
You can utilize the pandas module for this data processing task as follows:
First, you need to read the JSON file using with, open and json.load.
Second, you need to change the format of your file a bit by changing the large dictionary that has a main key for every airport into a list of dictionaries instead.
Third, you can now utilize some pandas magic to convert your list of dictionaries into a DataFrame using pd.DataFrame(data=list_of_dicts).
Finally, you can utilize pandas' to_csv function to write your DataFrame to disk as a CSV file.
It would look something like this:
import pandas as pd
import json

with open('./airports.json.txt', 'r') as f:
    j = json.load(f)

l = list(j.values())
df = pd.DataFrame(data=l)
df.to_csv('./airports.csv', index=False)
You need to load your json file and parse it so that all the fields are available, or load the contents into a dictionary; then you could use pyodbc to write these fields to the database, or write them to the csv if you use import csv first.
But this is just a general idea. You need to study python and how to do every step.
For instance, for writing to the database you could do something like:
for i in range(0, max_len):
    sql_order = "UPDATE MYTABLE SET MYTABLE.MYFIELD ...."
    cursor1.execute(sql_order)
    cursor1.commit()
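Since the question also mentions SQLite, one possible route (not part of either answer) is to reuse the list-of-dicts conversion above and let pandas write it with to_sql. The table name 'airports' here is hypothetical:

import json
import sqlite3
import pandas as pd

with open('./airports.json.txt', 'r') as f:
    records = list(json.load(f).values())

df = pd.DataFrame(data=records)

# write the frame into an SQLite database instead of (or in addition to) a CSV
with sqlite3.connect('./airports.db') as conn:
    df.to_sql('airports', conn, if_exists='replace', index=False)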

pandas transform a csv into an h5 file avoiding memory error

I have this simple code
data = pd.read_csv(file_path + 'PSI_TS_clean.csv', nrows=None,
                   names=None, usecols=None)
data.to_hdf(file_path + 'PSI_TS_clean.h5', 'table')
but my data is too big and I run into memory issues.
What is a clean way to do this chunk by chunk?
If the csv is really big, split the file using a method such as the one detailed here: chunking-data-from-a-large-file-for-multiprocessing.
Then iterate through the files, use pd.read_csv on each, and then use the DataFrame.to_hdf method.
For to_hdf, check the parameters here: DataFrame.to_hdf - you need to ensure mode 'a' and consider using append.
Without knowing further detail about the dataframe structure, it's difficult to comment further.
Also, for read_csv there is the parameter low_memory=False.
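A minimal sketch of the chunked approach, assuming a chunk size of 100000 rows (tune it to your memory budget):

import pandas as pd

csv_path = file_path + 'PSI_TS_clean.csv'
h5_path = file_path + 'PSI_TS_clean.h5'

# read the csv in chunks and append each chunk to the same HDF5 store
for chunk in pd.read_csv(csv_path, chunksize=100000, low_memory=False):
    # format='table' makes the HDF5 store appendable
    chunk.to_hdf(h5_path, 'table', mode='a', append=True, format='table')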
