fastparquet export for Redshift - python

I had a very simple idea: use Python Pandas (for convenience) to do some simple database operations on moderate amounts of data and write the data back to S3 in Parquet format.
Then the data should be exposed to Redshift as an external table, so that it does not take up storage space in the actual Redshift cluster.
I found two ways to do that.
Given the data:
from datetime import date, datetime

import pandas as pd

data = {
    'int': [1, 2, 3, 4, None],
    'float': [1.1, None, 3.4, 4.0, 5.5],
    'str': [None, 'two', 'three', 'four', 'five'],
    'boolean': [True, None, True, False, False],
    'date': [
        date(2000, 1, 1),
        date(2000, 1, 2),
        date(2000, 1, 3),
        date(2000, 1, 4),
        None,
    ],
    'timestamp': [
        datetime(2000, 1, 1, 1, 1, 1),
        datetime(2000, 1, 1, 1, 1, 2),
        None,
        datetime(2000, 1, 1, 1, 1, 4),
        datetime(2000, 1, 1, 1, 1, 5),
    ]
}

df = pd.DataFrame(data)
df['int'] = df['int'].astype(pd.Int64Dtype())
df['date'] = df['date'].astype('datetime64[D]')
df['timestamp'] = df['timestamp'].astype('datetime64[s]')
The type casts at the end are necessary in both cases to ensure that Pandas' type inference does not interfere.
With PyArrow:
Using PyArrow, you do it like this:
import pyarrow as pa

pyarrow_schema = pa.schema([
    ('int', pa.int64()),
    ('float', pa.float64()),
    ('str', pa.string()),
    ('boolean', pa.bool_()),   # field names must match the DataFrame columns
    ('date', pa.date64()),
    ('timestamp', pa.timestamp(unit='s'))
])

df.to_parquet(
    path='pyarrow.parquet',
    schema=pyarrow_schema,
    engine='pyarrow'
)
Why use PyArrow: Pandas' default engine for Parquet export is PyArrow, so you can expect good integration. Also, PyArrow provides extensive features and caters for many datatypes.
With fastparquet:
First you need to write out the data with these additional steps:
from fastparquet import write
write('fast.parquet', df, has_nulls=True, times='int96')
The important bit here is the 'times' parameter. See this post, where I found a remedy for the 'date' column.
Why use fastparquet: Fastparquet is much more limited than PyArrow, especially when it comes to accepted datatypes. On the other hand, the package is much smaller.
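Either way, the resulting Parquet file still has to land in S3 before Redshift can see it. A minimal sketch using boto3 (bucket and key are placeholders matching the ones in the SQL below); alternatively, df.to_parquet should also accept an s3:// path directly if s3fs is installed:
import boto3

s3 = boto3.client('s3')
# Upload the locally written file; replace bucket and prefix with your own values.
s3.upload_file(
    Filename='fast.parquet',   # or 'pyarrow.parquet'
    Bucket='<your_bucket>',
    Key='<your_prefix>/fast.parquet',
)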
The external table:
Given that you have exported your data to Parquet and stored it in S3, you can then expose it to Redshift like this:
CREATE EXTERNAL TABLE "<your_external_schema>"."<your_table_name>" (
    "int" bigint,
    "float" float,
    "str" varchar(255),
    "boolean" bool,
    "date" date,
    "timestamp" timestamp)
ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
    's3://<your_bucket>/<your_prefix>/';
Final story and note:
When I started working with Pandas, Parquet and external Redshift tables in the context of AWS Lambda functions, everything was fine for a while, until the bundle for my Lambda package exceeded the allowed deployment package size. Checking which of my dependencies accounted for all that, I found PyArrow, Pandas and Numpy (a dependency of Pandas) to be the culprits. While I definitely could not drop Numpy (for efficiency) and did not want to lose Pandas (convenience, again), I looked to replace PyArrow with something more lightweight. Et voilà: fastparquet. After some research and a lot of experimentation, I could make this work as well.
I hope some other people find this explanation and these resources helpful.

The question already holds the answer. :)

Related

Define partition for window operation using Pyspark.pandas

I am trying to learn how to use pyspark.pandas and I am coming across an issue that I don't know how to solve. I have a df of about 700k rows and 7 columns. Here is a sample of my data:
import pyspark.pandas as ps
import pandas as pd

data = {'Region': ['Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Asia', 'Asia', 'Asia'],
        'Country': ['South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'Japan', 'Japan', 'Japan'],
        'Product': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'DEF', 'DEF', 'DEF'],
        'Year': [2016, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2019],
        'Price': [500, 0, 450, 750, 0, 0, 890, 19, 120, 3],
        'Quantity': [1200, 0, 330, 500, 190, 70, 120, 300, 50, 80],
        'Value': [600000, 0, 148500, 350000, 0, 29100, 106800, 74300, 5500, 20750]}

df = ps.DataFrame(data)
Even when I run the simplest of operations like df.head(), I get the following warning and I'm not sure how to fix it:
WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
I know how to work around this with PySpark dataframes, but I'm not sure how to define a partition for the window operation using the Pandas API for PySpark.
Does anyone have any suggestions?
For Koalas, the repartition seems to only take in a number of partitions here: https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.repartition.html
I think the goal here is to run Pandas functions on a Spark DataFrame. One option you can use is Fugue. Fugue can take a Python function and apply it on Spark per partition. Example code below.
from typing import List, Dict, Any

import pandas as pd

df = pd.DataFrame({"date": ["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
                   "id": (["A"] * 3 + ["B"] * 3 + ["C"] * 3),
                   "value": [3, 4, 2, 1, 2, 5, 3, 2, 3]})

def count(df: pd.DataFrame) -> pd.DataFrame:
    # this assumes the data is already partitioned
    id = df.iloc[0]["id"]
    count = df.shape[0]
    return pd.DataFrame({"id": [id], "count": [count]})

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)

from fugue import transform

# Pandas
pdf = transform(df.copy(),
                count,
                schema="id:str, count:int",
                partition={"by": "id"})
print(pdf.head())

# Spark
transform(sdf,
          count,
          schema="id:str, count:int",
          partition={"by": "id"},
          engine=spark).show()
You just need to annotate your function with input and output types and then you can use it with the Fugue transform function. Schema is a requirement for Spark so you need to pass it. If you supply spark as the engine, then the execution will happen on Spark. Otherwise, it will run on Pandas by default.

How to convert SQLAlchemy data into JSON format?

I am working with SQLAlchemy and want to fetch data from the database and convert it into JSON format.
I have the code below:
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base

db_string = "postgres://user:pwd@10.**.**.***:####/demo_db"
Base = declarative_base()
db = create_engine(db_string)

result = []
record = db.execute("SELECT name, columndata, gridname, ownerid, issystem, ispublic, isactive FROM col.layout WHERE (ispublic=1 AND isactive=1) OR (isactive=1 AND ispublic=1 AND ownerid=ownerid);")
for row in record:
    result.append(row)
print(result)
Data is coming in this format:
[('layout-1', {'theme': 'blue', 'sorting': 'price_down', 'filtering': ['Sub Strategye', 'PM Strategy']}, 'RealTimeGrid', 1, 0, 1, 1), ('layout-2', {'theme': 'orange', 'sorting': 'price_up', 'filtering': ['FX Rate', 'Start Price']}, 'RealBalancing Grid', 2, 0, 1, 1), ('layout-3', {'theme': 'red', 'sorting': 'mv_price', 'filtering': ['Sub Strategye', 'PM Strategy']}, 'RT', 3, 0, 1, 1)]
But I am facing a lot of issues converting the above result into JSON format. Please suggest.
Your data is basically a list of tuples.
For example, the first tuple looks like this:
('layout-3',
 {'filtering': ['Sub Strategye', 'PM Strategy'],
  'sorting': 'mv_price',
  'theme': 'red'},
 'RT',
 3,
 0,
 1,
 1)
If you want to convert the whole data as-is to JSON, you can use the json module's dumps function:
import json
jsn_data = json.dumps(data)
Your list of tuples is converted to JSON:
[["layout-1", {"theme": "blue", "sorting": "price_down", "filtering": ["Sub Strategye", "PM Strategy"]}, "RealTimeGrid", 1, 0, 1, 1], ["layout-2", {"theme": "orange", "sorting": "price_up", "filtering": ["FX Rate", "Start Price"]}, "RealBalancing Grid", 2, 0, 1, 1], ["layout-3", {"theme": "red", "sorting": "mv_price", "filtering": ["Sub Strategye", "PM Strategy"]}, "RT", 3, 0, 1, 1]]
But if you need the JSON as key-value pairs, you first need to convert the result into a Python dictionary and then use json.dumps(dictionary_var).
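A minimal sketch of that conversion, assuming the column order from the SELECT in the question (the variable names here are only illustrative):
import json

columns = ['name', 'columndata', 'gridname', 'ownerid', 'issystem', 'ispublic', 'isactive']

# Pair each column name with the corresponding element of every result tuple.
dict_rows = [dict(zip(columns, row)) for row in result]

json_data = json.dumps(dict_rows)
print(json_data)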
What you want to accomplish is called "serialization".
You can follow Sudhanshu Patel's answer if you just want to dump json into response.
However, if you intend to produce a more sophisticated application, consider using a serialization library. You'll be able to load data from a request into the db, check whether the input data is in the right format, and send responses in a standardised format.
Check these libraries:
Marshmallow
Python's own Pickle
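As a rough illustration of the Marshmallow route (the schema fields below merely mirror the columns from the question and are an assumption about your data, not a verified model):
from marshmallow import Schema, fields

class LayoutSchema(Schema):
    name = fields.Str()
    columndata = fields.Dict()
    gridname = fields.Str()
    ownerid = fields.Int()
    issystem = fields.Int()
    ispublic = fields.Int()
    isactive = fields.Int()

columns = ['name', 'columndata', 'gridname', 'ownerid', 'issystem', 'ispublic', 'isactive']
rows = [dict(zip(columns, row)) for row in result]  # result from the question's query
json_string = LayoutSchema(many=True).dumps(rows)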

numpy.set_printoptions(nanstr='') is not working for numpy.savetxt()

I am using Python 3.6.4. I am running an experiment to find a faster alternative to pandas.to_csv() and am using numpy.savetxt() to store the dataframe as CSV in a file. By default, this function writes the string nan for np.nan values in the dataframe. I want to write '' (an empty string) for np.nan values in the CSV output.
I tried np.set_printoptions(nanstr=''), which doesn't seem to make any difference. I tried changing nanstr to various string values, but it looks like the option is not honored at all. I do see the option being set correctly in np.get_printoptions().
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'id': ['1_node', '2_node', '3_node', '4_node'],
                    'prop1': [np.nan, np.nan, 'ABC', 'DEF'],
                    'prop2': [1, np.nan, 2, np.nan]})

print("Numpy version: {}".format(np.__version__))
np.set_printoptions(nanstr='')
print(np.get_printoptions())

np.savetxt('temp.op', df0.values, fmt="%s", comments='', delimiter=",")
Output:
Numpy version: 1.14.0
{'edgeitems': 3, 'threshold': 1000, 'floatmode': 'maxprec', 'precision': 8, 'suppress': False, 'linewidth': 75, 'nanstr': '', 'infstr': 'inf', 'sign': '-', 'formatter': None, 'legacy': False}
temp.op
1_node,nan,1.0
2_node,nan,nan
3_node,ABC,2.0
4_node,DEF,nan
Expected output:
1_node,,1.0
2_node,,
3_node,ABC,2.0
4_node,DEF,
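As an aside: np.set_printoptions only affects how arrays are printed (repr/str), while np.savetxt formats each element with fmt directly, so nanstr is never consulted. One possible workaround, an assumption rather than anything from the original post, is to replace the NaN values before writing, continuing from the snippet above:
# Replace NaN with empty strings before handing the values to savetxt.
np.savetxt('temp.op', df0.fillna('').values, fmt="%s", comments='', delimiter=",")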

Is an SQL database more memory/performance efficient than a large Pandas dataframe?

I have more than 6000 XML files that I want to parse and save as CSV (or anything else for storage).
I need to perform a JOIN for each XML file to join them into one big dataframe.
The problem is that the process takes very long and uses too much memory.
I am wondering whether SQL could solve the problem: would it be faster and consume less memory?
def get_data(lst):
    results = pd.DataFrame()
    errors = []
    for data in lst:
        try:
            df = parseXML_Annual(data)
            try:
                results = results.join(df, how="outer")
            except:
                results = df
        except:
            errors.append(data)
    return results, errors

results, errors = get_data(lst_result)
As I can see from your sample, the entire XML file is related to the same company. To me it sounds like you need to add this as a new row, not join it in as a table. In my understanding, you want to have some list of metrics for each company. If so, you can probably just stick with key-value storage. If Python is your primary tool, use a dictionary and then save it as a JSON file.
In your for loop, just fill a blank dictionary with data from the XML, like this:
report = {
    "apple": {
        'metricSet1': {"m11": 5, "m12": 2, "m13": 3},
        'metricSet2': {"m21": 4, "m22": 5, "m23": 6}
    },
    "google": {
        'metricSet1': {"m11": 1, "m12": 13, "m13": 3},
        'metricSet2': {"m21": 9, "m22": 0, "m23": 11}
    },
    "facebook": {
        'metricSet1': {"m11": 1, "m12": 9, "m13": 9},
        'metricSet2': {"m21": 7, "m22": 2, "m23": 4}
    }
}
When you need to query it or fill some table with data, do something like this:
for k in report.keys():
    row = [
        k,
        report[k]["metricSet1"]["m12"],
        report[k]["metricSet2"]["m22"],
        report[k]["metricSet2"]["m23"]
    ]
    print(row)
If the data structure is not changing (say, all these XML files are the same), it would make sense to store it in an SQL database, creating a table for each metric set. If the XML structure may vary, then just keep it as a JSON file, or perhaps in some key-value based database like MongoDB.
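The "save it as a JSON file" step, as a minimal sketch (the file name is only a placeholder):
import json

# Persist the filled dictionary so later runs can load it instead of re-parsing the XML.
with open('report.json', 'w') as f:
    json.dump(report, f, indent=2)

with open('report.json') as f:
    report = json.load(f)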

Read Values from .csv file and convert them to float arrays

I stumbled upon a little coding problem. I have to basically read data from a .csv file which looks a lot like this:
2011-06-19 17:29:00.000,72,44,56,0.4772,0.3286,0.8497,31.3587,0.3235,0.9147,28.5751,0.3872,0.2803,0,0.2601,0.2073,0.1172,0,0.0,0,5.8922,1,0,0,0,1.2759
Now, I need to basically read an entire file consisting of rows like this and parse them into numpy arrays. Till now, I have been able to get them into a big string-type object using code similar to this:
order_hist = np.loadtxt(filename_input,delimiter=',',dtype={'names': ('Year', 'Mon', 'Day', 'Stock', 'Action', 'Amount'), 'formats': ('i4', 'i4', 'i4', 'S10', 'S10', 'i4')})
The format for this file consists of a set of S20 data types as of now. I need to basically extract all of the data in the big order_hist array into a set of arrays, one for each column. I do not know how to save the datetime column (I've kept it as a string for now). I need to convert the rest to float, but the code below is giving me an error:
temparr = float[:len(order_hist)]
for x in range(len(order_hist['Stock'])):
    temparr[x] = float(order_hist['Stock'][x]);
Can someone show me just how I can convert all the columns to the arrays that I need??? Or possibly direct me to some link to do so?
Boy, have I got a treat for you. numpy.genfromtxt has a converters parameter, which allows you to specify a function for each column as the file is parsed. The function is fed the CSV string value. Its return value becomes the corresponding value in the numpy array.
Moreover, the dtype=None parameter tells genfromtxt to make an intelligent guess as to the type of each column. In particular, numeric columns are automatically cast to an appropriate dtype.
For example, suppose your data file contains
2011-06-19 17:29:00.000,72,44,56
Then
import numpy as np
import datetime as DT

def make_date(datestr):
    return DT.datetime.strptime(datestr, '%Y-%m-%d %H:%M:%S.%f')

arr = np.genfromtxt(filename, delimiter=',',
                    converters={'Date': make_date},
                    names=('Date', 'Stock', 'Action', 'Amount'),
                    dtype=None)
print(arr)
print(arr.dtype)
yields
(datetime.datetime(2011, 6, 19, 17, 29), 72, 44, 56)
[('Date', '|O4'), ('Stock', '<i4'), ('Action', '<i4'), ('Amount', '<i4')]
Your real csv file has more columns, so you'd want to add more items to names, but otherwise, the example should still stand.
If you don't really care about the extra columns, you can assign a fluff-name like this:
arr = np.genfromtxt(filename, delimiter=',',
                    converters={'Date': make_date},
                    names=('Date', 'Stock', 'Action', 'Amount') +
                          tuple('col{i}'.format(i=i) for i in range(22)),
                    dtype=None)
yields
(datetime.datetime(2011, 6, 19, 17, 29), 72, 44, 56, 0.4772, 0.3286, 0.8497, 31.3587, 0.3235, 0.9147, 28.5751, 0.3872, 0.2803, 0, 0.2601, 0.2073, 0.1172, 0, 0.0, 0, 5.8922, 1, 0, 0, 0, 1.2759)
You might also be interested in checking out the pandas module which is built on top of numpy, and which takes parsing CSV to an even higher level of luxury: It has a pandas.read_csv function whose parse_dates = True parameter will automatically parse date strings (using dateutil).
Using pandas, your csv could be parsed with
df = pd.read_csv(filename, parse_dates=[0, 1], header=None,
                 names=('Date', 'Stock', 'Action', 'Amount') +
                       tuple('col{i}'.format(i=i) for i in range(22)))
Note there is no need to specify the make_date function. Just to be clear: pandas.read_csv returns a DataFrame, not a numpy array. The DataFrame may actually be more useful for your purpose, but you should be aware that it is a different object with a whole new world of methods to exploit and explore.
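If you still want one plain NumPy array per column, as originally asked, you can pull them out of the DataFrame afterwards; a small sketch using the column names defined above:
# One float array per numeric column; the Date column keeps its parsed datetimes.
stock = df['Stock'].values.astype(float)
action = df['Action'].values.astype(float)
amount = df['Amount'].values.astype(float)
dates = df['Date'].values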
