Using Python to insert multiple rows into a Hive table

Hive is a data warehouse designed for querying and aggregating large datasets that reside on HDFS.
The standard INSERT INTO syntax performs poorly because:
Each statement requires a MapReduce job to be executed.
Each statement results in a new file being added to HDFS; over time this leads to very poor performance when reading from the table.
With that said, there is now a Streaming API for Hive / HCatalog, as detailed here.
I am faced with the need to insert data at velocity into Hive, using Python. I am aware of the pyhive and pyhs2 libraries, but neither of them appears to make use of the Streaming API.
Has anyone successfully managed to get Python to insert many rows into Hive using the Streaming API, and how was this done?
I look forward to your insights!
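For reference, the kind of row-at-a-time insert I am trying to get away from looks like this with PyHive (a rough sketch; the host and table names are placeholders):

from pyhive import hive

# Placeholder connection details, for illustration only.
conn = hive.connect(host='hive-server.example.com', port=10000)
cursor = conn.cursor()
for i, name in [(1, 'alice'), (2, 'bob')]:
    # Each execute() triggers its own MapReduce job and adds a new file to HDFS.
    cursor.execute("INSERT INTO TABLE my_table VALUES (%d, '%s')" % (i, name))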

A Hive user can stream a table through a script to transform the data:
ADD FILE replace-nan-with-zeros.py;
SELECT
TRANSFORM (...)
USING 'python replace-nan-with-zeros.py'
AS (...)
FROM some_table;
Here is a simple Python script:
#!/usr/bin/env python
import sys

kFirstColumns = 7

def main(argv):
    for line in sys.stdin:
        line = line.strip()
        inputs = line.split('\t')
        # replace NaNs with zeros
        outputs = []
        columnIndex = 1
        for value in inputs:
            newValue = value
            if columnIndex > kFirstColumns:
                newValue = value.replace('NaN', '0.0')
            outputs.append(newValue)
            columnIndex = columnIndex + 1
        print('\t'.join(outputs))

if __name__ == "__main__":
    main(sys.argv[1:])
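Before registering the script with Hive, it can be smoke-tested locally by piping a sample row through it (a hypothetical check; the sample line below is made up and assumes the script sits in the current directory):

import subprocess

# Eight tab-separated columns; only columns past the first 7 are scrubbed.
sample = 'a\tb\tc\td\te\tf\tg\tNaN\n'
result = subprocess.run(
    ['python', 'replace-nan-with-zeros.py'],
    input=sample, capture_output=True, text=True,
)
print(result.stdout)  # the last column should now read 0.0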
Hive and Python
Python can be used as a UDF from Hive through the HiveQL TRANSFORM statement. For example, the following HiveQL invokes a Python script stored in the streaming.py file.
Linux-based HDInsight
add file wasb:///streaming.py;
SELECT TRANSFORM (clientid, devicemake, devicemodel)
USING 'streaming.py' AS
(clientid string, phoneLabel string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;
Windows-based HDInsight
add file wasb:///streaming.py;
SELECT TRANSFORM (clientid, devicemake, devicemodel)
USING 'D:\Python27\python.exe streaming.py' AS
(clientid string, phoneLabel string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;
Here's what this example does:
1. The add file statement at the beginning of the file adds the streaming.py file to the distributed cache, so it's accessible by all nodes in the cluster.
2. The SELECT TRANSFORM ... USING statement selects data from the hivesampletable and passes clientid, devicemake, and devicemodel to the streaming.py script.
3. The AS clause describes the fields returned from streaming.py.
Here's the streaming.py file used by the HiveQL example.
#!/usr/bin/env python
# Note: this example is Python 2; the string module functions and the bare
# print statement below were removed in Python 3.
import sys
import string
import hashlib

while True:
    line = sys.stdin.readline()
    if not line:
        break

    line = string.strip(line, "\n ")
    clientid, devicemake, devicemodel = string.split(line, "\t")

    phone_label = devicemake + ' ' + devicemodel
    print "\t".join([clientid, phone_label, hashlib.md5(phone_label).hexdigest()])
Since we are using streaming, this script has to do the following:
1. Read data from STDIN. This is accomplished by using sys.stdin.readline() in this example.
2. The trailing newline character is removed using string.strip(line, "\n "), since we just want the text data and not the end-of-line indicator.
3. When doing stream processing, a single line contains all the values with a tab character between each value. So string.split(line, "\t") can be used to split the input at each tab, returning just the fields.
4. When processing is complete, the output must be written to STDOUT as a single line, with a tab between each field. This is accomplished by using print "\t".join([clientid, phone_label, hashlib.md5(phone_label).hexdigest()]).
5. This all occurs within a while loop that repeats until no line is read, at which point break exits the loop and the script terminates.
Beyond that, the script just concatenates the input values for devicemake and devicemodel, and calculates a hash of the concatenated value. Pretty simple, but it describes the basics of how any Python script invoked from Hive should function: loop, read input until there is no more, break each line of input apart at the tabs, process, and write a single line of tab-delimited output.
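Distilled to a skeleton, then, a streaming script for Hive looks roughly like this (a sketch in Python 3 syntax, independent of any particular table):

#!/usr/bin/env python
import sys

# Hive delivers one tab-delimited row per line on STDIN and expects
# one tab-delimited row per line on STDOUT.
for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    # ... transform fields here ...
    print('\t'.join(fields))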

Related

Register python custom UDF in hive

My problem is registering a Python UDF in Hive.
I created encryption and decryption Python code to use in Hive queries. These are working as expected.
However, I do not want to add the files every time I use them; instead, I would like to register them permanently as UDFs.
I see that a Java jar file can be added to the UDF list using CREATE TEMPORARY FUNCTION. I do not see such a thing for Python.
I would appreciate any quick help.
Below are the scripts that are working for me.
vi decry.py
import base64
import sys

try:
    for line in sys.stdin:
        line = line.strip()
        (a, b) = line.split('\t')
        print('\t'.join([a, b, base64.b64decode(bytes(a.replace('\n', ''), encoding='utf-8')).decode()]))
except:
    print(sys.exc_info())
vi encry.py
import base64
import sys

try:
    for line in sys.stdin:
        line = line.strip()
        (a, b) = line.split('\t')
        print('\t'.join([a, b, base64.b64encode(bytes(a.replace('\n', ''), encoding='utf-8')).decode()]))
except:
    print(sys.exc_info())
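As a quick sanity check of the round trip the two scripts perform (plain Python, no Hive involved):

import base64

original = 'vijay'
encoded = base64.b64encode(original.encode('utf-8')).decode()
decoded = base64.b64decode(encoded.encode('utf-8')).decode()
print(encoded)                # dmlqYXk=
assert decoded == original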
Testing on Database
create database test;
use test;
create table abc (id integer,name string);
insert into abc values(1,'vijay');
insert into abc values(2,'vaasista');
add FILE /hadoop/encry.py;
add FILE /hadoop/decry.py;
select transform(id) using 'python3 encry.py' as id from abc;
select transform(id,name) using 'python3 encry.py' as id,name,ct from abc;
First of all, base64 is not encryption; it is encoding in a different character set (radix-64), and anyone can reverse it, because no secret key is used.
Second: unfortunately, the only way of using a Python UDF in Hive is TRANSFORM; CREATE [TEMPORARY] FUNCTION does not support Python.
Consider using native functions instead for better performance and flexibility, like this:
--encode
base64(encode(regexp_replace(text_col,'\\n',''), 'UTF-8'))
--decode
decode(unbase64(str),'UTF-8')
Demo:
select base64(encode(regexp_replace('My test \n String','\\n',''), 'UTF-8'))
Result:
TXkgdGVzdCAgU3RyaW5n
Decode:
select decode(unbase64('TXkgdGVzdCAgU3RyaW5n'), 'UTF-8')
Result:
My test  String
Also there are aes_encrypt and aes_decrypt functions available in Hive if you need encryption.

Extract only additions from diff in python

I am trying to solve a problem:
I receive an auto-generated email from the government with no tags in the HTML. It's one table nested upon another; an abomination of a template. I get it every few days and I want to extract some fields from it. My idea was this:
Use the HTML in the email as a template. Remove all fields that change with every mail, like the name of my client, their unique ID, and the issue explained in the mail.
Diff this HTML template (with the fields removed) against new emails. That will give me all the new info in one shot without having to parse the email.
The problem is, I can't find any way of loading only these additions. I am trying to use difflib in Python, and it returns byte streams of additions and subtractions in each line that I am not able to process properly. I want a way to return only the additions and nothing else. I am open to using other libraries or methods. I do not want to write a huge regex with tons of HTML.
When I got the stdout from calling diff via Popen, it also returned bytes.
You can convert the bytes to chars, then continue with your processing.
You could do something similar to what I do below to convert your bytes to a string.
The script below calls diff on two files and prints only the lines beginning with the '>' symbol (lines new in the right-hand file):
#!/usr/bin/env python
import os
import sys
import subprocess

file1 = 'test1'
file2 = 'test2'
if len(sys.argv) == 3:
    file1 = sys.argv[1]
    file2 = sys.argv[2]

if not os.access(file1, os.R_OK):
    print(f'Unable to read: \'{file1}\'')
    sys.exit(1)
if not os.access(file2, os.R_OK):
    print(f'Unable to read: \'{file2}\'')
    sys.exit(1)

# Run diff and capture its stdout (returned as bytes).
argv = ['diff', file1, file2]
runproc = subprocess.Popen(args=argv, stdout=subprocess.PIPE)
out, err = runproc.communicate()

# Convert the bytes to a string, one character at a time.
outstr = ''
for c in out:
    outstr += chr(c)

# Keep only the addition lines ('>' marks lines new in file2).
for line in outstr.split('\n'):
    if len(line) == 0:
        continue
    if line[0] == '>':
        print(line)
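Alternatively, since the question mentions difflib, the same additions-only filter can be done in pure Python without shelling out to diff (a sketch; test1 and test2 stand in for the template and the new email):

import difflib

with open('test1') as f1, open('test2') as f2:
    old, new = f1.readlines(), f2.readlines()

# ndiff prefixes added lines with '+ '; keep only those.
additions = [line[2:] for line in difflib.ndiff(old, new)
             if line.startswith('+ ')]
print(''.join(additions), end='')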

File parsing using Unix Shell Scripting

I am trying to do some transformation and I’m stuck. Here goes the problem description.
Below is the pipe delimited file. I have masked data!
AccountancyNumber|AccountancyNumberExtra|Amount|ApprovedBy|BranchCurrency|BranchGuid|BranchId|BranchName|CalculatedCurrency|CalculatedCurrencyAmount|CalculatedCurrencyVatAmount|ControllerBy|Country|Currency|CustomFieldEnabled|CustomFieldGuid|CustomFieldName|CustomFieldRequired|CustomFieldValue|Date|DateApproved|DateControlled|Email|EnterpriseNumber|ExpenseAccountGuid|ExpenseAccountName|ExpenseAccountStatus|ExpenseGuid|ExpenseReason|ExpenseStatus|ExternalId|GroupGuid|GroupId|GroupName|IBAN|Image|IsInvoice|MatchStatus|Merchant|MerchantEnterpriseNumber|Note|OwnerShip|PaymentMethod|PaymentMethodGuid|PaymentMethodName|ProjectGuid|ProjectId|ProjectName|Reimbursable|TravellerId|UserGUID|VatAmount|VatPercentage|XpdReference|VatCode|FileName|CreateTstamp
61470003||30.00|null|EUR|168fcea9-17d4-45a1-8b6f-bfb249cdbea6|BEL|BEL|USD,INR,EUR|35.20,2420.11,30.00|null,null,null|null|BE|EUR|true|0d4b767b-0988-47e8-9144-05e607169284|careertitle|false|FE|2018-07-24T00:00:00|null|null|abc_def#xyz.com||c32f03c6-31df-4fd8-8cc2-1c5f3a580aad|Meals - In Office|true|781d10d2-2f3b-43bc-866e-a653fefacbbe||Approved|70926|40ac7117-c7e2-42ea-b34f-96330c9380b6|BEL-FSP-Users|BEL-FSP-Users|||false|None|in office meal #1|||Personal|Cash|1ee44666-f4c7-44b3-acd3-8ecd7127480a|Cash|2cb4ccb7-634d-4386-af43-b4572ec72098|00AA06|00AA06|true||6c5a835f-5152-46db-923a-3ebd08c7dad3|null|null|XPD012245802||1820711.xml|2018-08-07 05:42:10.46
In this file we have a CalculatedCurrency field containing multiple values delimited by commas. The file also has a CalculatedCurrencyAmount field, which likewise holds multiple comma-delimited values. But I need to pick up only the value from the CalculatedCurrency field that matches BranchCurrency (another field in the file), and of course the corresponding CalculatedCurrencyAmount for that currency.
Required output : -
AccountancyNumber|AccountancyNumberExtra|Amount|ApprovedBy|BranchCurrency|BranchGuid|BranchId|BranchName|CalculatedCurrency|CalculatedCurrencyAmount|CalculatedCurrencyVatAmount|ControllerBy|Country|Currency|CustomFieldEnabled|CustomFieldGuid|CustomFieldName|CustomFieldRequired|CustomFieldValue|Date|DateApproved|DateControlled|Email|EnterpriseNumber|ExpenseAccountGuid|ExpenseAccountName|ExpenseAccountStatus|ExpenseGuid|ExpenseReason|ExpenseStatus|ExternalId|GroupGuid|GroupId|GroupName|IBAN|Image|IsInvoice|MatchStatus|Merchant|MerchantEnterpriseNumber|Note|OwnerShip|PaymentMethod|PaymentMethodGuid|PaymentMethodName|ProjectGuid|ProjectId|ProjectName|Reimbursable|TravellerId|UserGUID|VatAmount|VatPercentage|XpdReference|VatCode|FileName|CreateTstamp|ActualCurrency|ActualAmount
61470003||30.00|null|EUR|168fcea9-17d4-45a1-8b6f-bfb249cdbea6|BEL|BEL|USD,INR,EUR|35.20,2420.11,30.00|null,null,null|null|BE|EUR|true|0d4b767b-0988-47e8-9144-05e607169284|careertitle|false|FE|2018-07-24T00:00:00|null|null|abc_def#xyz.com||c32f03c6-31df-4fd8-8cc2-1c5f3a580aad|Meals - In Office|true|781d10d2-2f3b-43bc-866e-a653fefacbbe||Approved|70926|40ac7117-c7e2-42ea-b34f-96330c9380b6|BEL-FSP-Users|BEL-FSP-Users|||false|None|in office meal #1|||Personal|Cash|1ee44666-f4c7-44b3-acd3-8ecd7127480a|Cash|2cb4ccb7-634d-4386-af43-b4572ec72098|00AA06|00AA06|true||6c5a835f-5152-46db-923a-3ebd08c7dad3|null|null|XPD012245802||1820711.xml|2018-08-07 05:42:10.46|EUR|30.00
Please help.
SnapLogic Python script
from com.snaplogic.scripting.language import ScriptHook
from com.snaplogic.scripting.language.ScriptHook import *
import csv

class TransformScript(ScriptHook):
    def __init__(self, input, output, error, log):
        self.input = input
        self.output = output
        self.error = error
        self.log = log

    def execute(self):
        self.log.info("Executing Transform script")
        while self.input.hasNext():
            data = self.input.next()
            branch_currency = data['BranchCurrency']
            calc_currency = data['CalculatedCurrency'].split(',')
            calc_currency_amount = data['CalculatedCurrencyAmount'].split(',')
            # Keep the currency/amount pair whose currency matches BranchCurrency.
            result = None
            result1 = None
            for i, name in enumerate(calc_currency):
                result = calc_currency_amount[i] if name == branch_currency else result
                result1 = calc_currency[i] if name == branch_currency else result1
            data["CalculatedCurrencyAmount"] = result
            data["CalculatedCurrency"] = result1
            try:
                data["mathTryCatch"] = data["counter2"].longValue() + 33
                self.output.write(data)
            except Exception as e:
                data["errorMessage"] = e.message
                self.error.write(data)
        self.log.info("Finished executing the Transform script")

hook = TransformScript(input, output, error, log)
Using bash with some arrays:
arr_find() {
  echo $(( $(printf "%s\0" "${@:2}" | grep -Fnxz "$1" | cut -d: -f1) - 1 ))
}

IFS='|' read -r -a headers
while IFS='|' read -r "${headers[@]}"; do
  IFS=',' read -r -a CalculatedCurrency <<<"$CalculatedCurrency"
  IFS=',' read -r -a CalculatedCurrencyAmount <<<"$CalculatedCurrencyAmount"
  idx=$(arr_find "$BranchCurrency" "${CalculatedCurrency[@]}")
  echo "BranchCurrency is $BranchCurrency. Hence CalculatedCurrency will be ${CalculatedCurrency[$idx]} and CalculatedCurrencyAmount will have to be ${CalculatedCurrencyAmount[$idx]}."
done
First I read all the header names. Then I read each row's values into variables named after those headers. The CalculatedCurrency* fields are then re-read as arrays, because they are separated by ','. Then I find the index of the element inside CalculatedCurrency that equals BranchCurrency. With that index and the arrays, I can just print the output.
I know the OP asked for a Unix shell solution, but as an alternative option I'll show some code to do it in Python. (Obviously this code can be heavily improved as well.) The great advantage is readability: for instance, you can address the data by field name, and the code reads much more clearly than doing this with awk et al.
Save your data in data.psv and write the following script into a file main.py. I've tested it using python3 and python2; both work. Run the script using python main.py.
Update: I've extended the script to parse all lines. In the example data, I've set BranchCurrency to EUR in the first line and USD in the second line as a dummy test.
File: main.py
import csv

def parse_line(row):
    branch_currency = row['BranchCurrency']
    calc_currency = row['CalculatedCurrency'].split(',')
    calc_currency_amount = row['CalculatedCurrencyAmount'].split(',')
    result = None
    for i, name in enumerate(calc_currency):
        result = calc_currency_amount[i] if name == branch_currency else result
    return result

def main():
    with open('data.psv') as f:
        reader = csv.DictReader(f, delimiter='|')
        for row in reader:
            print(parse_line(row))

if __name__ == '__main__':
    main()
Example Data:
[:~] $ cat data.psv
AccountancyNumber|AccountancyNumberExtra|Amount|ApprovedBy|BranchCurrency|BranchGuid|BranchId|BranchName|CalculatedCurrency|CalculatedCurrencyAmount|CalculatedCurrencyVatAmount|ControllerBy|Country|Currency|CustomFieldEnabled|CustomFieldGuid|CustomFieldName|CustomFieldRequired|CustomFieldValue|Date|DateApproved|DateControlled|Email|EnterpriseNumber|ExpenseAccountGuid|ExpenseAccountName|ExpenseAccountStatus|ExpenseGuid|ExpenseReason|ExpenseStatus|ExternalId|GroupGuid|GroupId|GroupName|IBAN|Image|IsInvoice|MatchStatus|Merchant|MerchantEnterpriseNumber|Note|OwnerShip|PaymentMethod|PaymentMethodGuid|PaymentMethodName|ProjectGuid|ProjectId|ProjectName|Reimbursable|TravellerId|UserGUID|VatAmount|VatPercentage|XpdReference|VatCode|FileName|CreateTstamp
61470003||35.00|null|EUR|168fcea9-17d4-45a1-8b6f-bfb249cdbea6|BEL|BEL|USD,INR,EUR|35.20,2420.11,30.00|null,null,null|null|BE|EUR|true|0d4b767b-0988-47e8-9144-05e607169284|careertitle|false|FE|2018-07-24T00:00:00|null|null|abc_def#xyz.com||c32f03c6-31df-4fd8-8cc2-1c5f3a580aad|Meals - In Office|true|781d10d2-2f3b-43bc-866e-a653fefacbbe||Approved|70926|40ac7117-c7e2-42ea-b34f-96330c9380b6|BEL-FSP-Users|BEL-FSP-Users|||false|None|in office meal #1|||Personal|Cash|1ee44666-f4c7-44b3-acd3-8ecd7127480a|Cash|2cb4ccb7-634d-4386-af43-b4572ec72098|00AA06|00AA06|true||6c5a835f-5152-46db-923a-3ebd08c7dad3|null|null|XPD012245802||1820711.xml|2018-08-07 05:42:10.46
61470003||35.00|null|USD|168fcea9-17d4-45a1-8b6f-bfb249cdbea6|BEL|BEL|USD,INR,EUR|35.20,2420.11,30.00|null,null,null|null|BE|EUR|true|0d4b767b-0988-47e8-9144-05e607169284|careertitle|false|FE|2018-07-24T00:00:00|null|null|abc_def#xyz.com||c32f03c6-31df-4fd8-8cc2-1c5f3a580aad|Meals - In Office|true|781d10d2-2f3b-43bc-866e-a653fefacbbe||Approved|70926|40ac7117-c7e2-42ea-b34f-96330c9380b6|BEL-FSP-Users|BEL-FSP-Users|||false|None|in office meal #1|||Personal|Cash|1ee44666-f4c7-44b3-acd3-8ecd7127480a|Cash|2cb4ccb7-634d-4386-af43-b4572ec72098|00AA06|00AA06|true||6c5a835f-5152-46db-923a-3ebd08c7dad3|null|null|XPD012245802||1820711.xml|2018-08-07 05:42:10.46
Example Run:
[:~] $ python main.py
30.00
35.20
Using awk:
awk 'BEGIN{FS=OFS="|"}
NR==1{print $0,"ActualCurrency","ActualAmount";next}
{n=split($9,a,",");split($10,b,",");for(i=1;i<=n;i++) if(a[i]==$5) print $0,$5,b[i]}' file
BEGIN{FS=OFS="|"} sets the input and output delimiters to |.
The NR==1 statement takes care of the header by appending the 2 new column names.
The 9th and 10th fields are split on the , separator, and the values are stored in the arrays a and b.
The for loop looks for the value in the a array corresponding to the 5th field. If found, the corresponding value of b is printed.
You don't need to use the Script snap here at all. Writing scripts for transformations all the time hampers performance and defeats the purpose of an iPaaS tool altogether. A Mapper should suffice.
I created a test pipeline for this problem: I saved the data provided in this question in a file, uploaded it to SnapLogic for the test, and parsed it in the pipeline using a CSV parser.
Then I used a Mapper for the required transformation. The following is the expression for getting the actual amount.
$CalculatedCurrency.split(',').indexOf($BranchCurrency) >= 0 ? $CalculatedCurrencyAmount.split(',')[$CalculatedCurrency.split(',').indexOf($BranchCurrency)] : null
This produces the required output.
Avoid writing scripts for problems that can be solved using mappers.

Python splitlines result changes based on terminal screen size?

I have some commands in a file that I'm reading in Python and executing using subprocess.Popen. When I run the splitlines() method, the number of lines produced depends on the width of my terminal screen. The command in the file is:
CREATE EXTERNAL TABLE tableforview (name string, dob string) STORED AS PARQUET LOCATION 'location';
There are no newlines in the file; it is all typed on one line.
hivep = subprocess.Popen("beeline -u 'connectionstring' --force=true --outputformat=csv2 --showWarnings=false -f hivetest", stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
(output, err) = hivep.communicate()
hivequeryList = []
hivequery = ""
for line in output.splitlines():
    print(line)
However, when I print the output of splitlines() line by line, I get two or three lines depending on how large my terminal screen is.
I'm reading the split lines into a dictionary. What I would expect is for the key of the dictionary to contain the entire CREATE query, and the value to contain the result. But what is happening is that the query is getting cut off, like so:
{"CREATE EXTERNAL TABLE tableforview (name string, dob string) STORED AS PARQUE" : "T LOCATION 'location';"}
And the location at which that cutoff occurs changes depending on how wide my terminal screen is.
I don't know why my terminal screen width should matter when reading from a file. Would appreciate any insight.
The moment you use print, your screen width is used to lay out the printed content; if you resize the terminal later, the print has already done its calculations.
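If the wrapping cannot be avoided at the source, one workaround is to re-join the physically wrapped lines before building the dictionary, using the ';' statement terminator as the boundary (a sketch; it assumes ';' only ever appears at the end of a statement in this output):

statements = []
buf = ''
for line in output.splitlines():
    # communicate() returns bytes unless the process was opened in text mode.
    buf += line.decode() if isinstance(line, bytes) else line
    if buf.rstrip().endswith(';'):
        statements.append(buf)
        buf = ''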

subprocess, Popen, and stdin: Seeking practical advice on automating user input to .exe

Despite my obviously beginner-level Python skills, I've got a script that pulls a line of data from a 2,000-row CSV file, reads key parameters, outputs a buffer CSV file organized as an N-by-2 rectangle, and uses the subprocess module to call the external program POVCALLC.EXE, which takes a CSV file organized that way as input. The relevant portion of the code is shown below. I think that subprocess or one of its methods should allow me to interact with the external program, but I am not quite sure how, or indeed whether this is the module I need.
In particular, when POVCALLC.EXE starts, it first asks for the input file, which in this case is buffer.csv. It then asks for several additional parameters, including the name of an output file, which come from outside the snippet below. It then starts computing results and asks for further user input, including several carriage returns. Obviously, I would prefer to automate this interaction for the 2,000 rows in the original CSV.
Am I on the right track with subprocess, or should I be looking elsewhere to automate this interaction with the external executable?
Many thanks in advance!
# Begin inner loop to fetch Lorenz curve data for each survey
for i in range(int(L_points_number)):
    index = 3 * i
    line = []
    P = L_points[index]
    line.append(P)
    L = L_points[index + 1]
    line.append(L)
    with open('buffer.csv', 'a', newline='') as buffer:
        writer = csv.writer(buffer, delimiter=',')
        P = 1
        line.append(P)
        L = 1
        line.append(L)
        writer.writerow(line)
    subprocess.call('povcallc.exe')
    # TODO: CALL povcallc and compute results
    # TODO: USE Regex to interpret results and append them to
    # the output file
If your program expects these arguments on the standard input (e.g. after running POVCALLC you type the CSV filenames into the console), you could use subprocess.Popen() [see https://docs.python.org/3/library/subprocess.html#subprocess.Popen ] with stdin redirection (stdin=PIPE), and use the returned object to send data to stdin.
It would look something like this:
my_proc = subprocess.Popen('povcallc.exe', stdin=subprocess.PIPE, text=True)
my_proc.communicate(input="my_filename_which_is_expected_by_the_program.csv\n")
Note the text=True (or universal_newlines=True in older versions), so that a str can be passed to communicate() in Python 3.
You can also use the tuple returned by communicate() to check the program's stdout and stderr (see the link to the docs for more).
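Since POVCALLC apparently prompts several times (input file, further parameters, then some carriage returns), all of the answers can be fed in one shot by joining them with newlines (a sketch; the prompt order and the results.csv name are assumptions to adapt to the real program):

import subprocess

proc = subprocess.Popen(
    ['povcallc.exe'],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,  # lets communicate() accept and return str instead of bytes
)
# One entry per prompt; the empty strings become the bare carriage returns.
answers = '\n'.join(['buffer.csv', 'results.csv', '', '', '']) + '\n'
out, err = proc.communicate(input=answers)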
