Python: Summing a Pig tuple containing float values

I'm fairly new to Pig/Python and in need of help. Trying to write a Pig Script that reconciles financial data. The parameters used follow a syntax like (grand_tot, x1, x2,... xn), meaning that the first value should equal the sum of remaining values.
I don't know of a way to accomplish this using Pig alone, so I've been trying to write a Python UDF. Pig passes a tuple to Python; if the sum of x1:xn equals grand_tot, then Python should return a "1" to Pig to show that the numbers match, otherwise it returns a "0".
Here is what I have so far:
register 'myudf.py' using jython as myfuncs;
A = LOAD '$file_nm' USING PigStorage(',') AS (grand_tot,west_region,east_region,prod_line_a,prod_line_b, prod_line_c, prod_line_d);
A1 = GROUP A ALL;
B = FOREACH A1 GENERATE TOTUPLE($recon1) as flds;
C = FOREACH B GENERATE myfuncs.isReconciled(flds) AS res;
DUMP C;
$recon1 is passed as a parameter, and defined as:
grand_tot, west_region, east_region
I will later pass $recon2 as:
grand_tot, prod_line_a, prod_line_b, prod_line_c, prod_line_d
Sample row of data (in $file_nm) looks like:
grand_tot,west_region,east_region,prod_line_a,prod_line_b, prod_line_c, prod_line_d
10000,4500,5500,900,2200,450,3700,2750
12500,7500,5000,3180,2770,300,3950,2300
9900,7425,2475,1320,460,3070,4630,1740
Lastly... here is what I'm trying to do with Python UDF code:
#outputSchema("result")
def isReconciled(arrTuple):
    arrTemp = []
    arrNew = []
    string1 = ""
    result = 0
    ## the first element of the Tuple should be the sum of remaining values
    varGrandTot = arrTuple[0]
    ## create a new array with the remaining Tuple values
    arrTemp = arrTuple[1:]
    for item in arrTuple:
        arrNew.append(item)
    ## sum the second to the nth values
    varSum = sum(arrNew)
    ## if the first value in the tuple equals the sum of all remaining values
    if varGrandTot = varSum then:
        #reconciled to the penny
        result = 1
    else:
        result = 0
    return result
The error message I receive is:
unsupported operand type(s) for +: 'int' and 'array.array'
I've tried numerous things attempting to convert the array values into numeric and convert to float so that I can sum, but with no success.
Any ideas??? Thanks for looking!

You can do this in PIG itself.
First, specify the data types in the schema. PigStorage uses bytearray as the default data type, which is why your Python script throws the error. Note that your sample data contains ints, although your question mentions floats.
Second, add the fields starting from the second field or the fields of your choice.
Third, use the bincond operator to check the first field value with the sum.
A = LOAD '$file_nm' USING PigStorage(',') AS (grand_tot:float,west_region:float,east_region:float,prod_line_a:float,prod_line_b:float, prod_line_c:float, prod_line_d:float);
A1 = FOREACH A GENERATE grand_tot,SUM(TOBAG(prod_line_a,prod_line_b,prod_line_c,prod_line_d)) as SUM_ALL;
B = FOREACH A1 GENERATE (grand_tot == SUM_ALL ? 1 : 0);
DUMP B;

It is very likely that your arrTuple is not a sequence of numbers and that at least one item is itself an array.
To check this, add some assertions to your code:
#outputSchema("result")
def isReconciled(arrTuple):
    # some checks: every item must be a plain number
    tmpl = "Item # {i} shall be a number (has value {itm} of type {tp})"
    for i, itm in enumerate(arrTuple):
        msg = tmpl.format(i=i, itm=itm, tp=type(itm))
        assert isinstance(itm, (int, long, float)), msg
    # end of checks
    arrTemp = []
    arrNew = []
    string1 = ""
    result = 0
    ## the first element of the Tuple should be the sum of remaining values
    varGrandTot = arrTuple[0]
    ## create a new array with the remaining Tuple values
    arrTemp = arrTuple[1:]
    for item in arrTemp:
        arrNew.append(item)
    ## sum the second to the nth values
    varSum = sum(arrNew)
    ## if the first value in the tuple equals the sum of all remaining values
    if varGrandTot == varSum:
        #reconciled to the penny
        result = 1
    else:
        result = 0
    return result
It will very likely raise an AssertionError on one of the items. Read the assertion message to learn which item is causing the trouble.
In any case, if you want to return 1 when the first number equals the sum of the rest of the tuple and 0 otherwise, the following would work too:
#outputSchema("result")
def isReconciled(arrTuple):
    if arrTuple[0] == sum(arrTuple[1:]):
        return 1
    else:
        return 0
And if you are happy to get True instead of 1 and False instead of 0:
#outputSchema("result")
def isReconciled(arrTuple):
    return arrTuple[0] == sum(arrTuple[1:])
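Since the question mentions float values and reconciling "to the penny", exact equality can be fragile once the fields are floats. The following is a minimal sketch in plain Python (run outside Pig), assuming each item can be passed to float(). The name is_reconciled and the half-cent tolerance are illustrative choices, not part of the original UDF.

```python
def is_reconciled(values, tolerance=0.005):
    """Return 1 if the first value matches the sum of the rest, else 0.

    Assumptions (hypothetical, not from the original post): every item can
    be converted with float(), and half a cent is an acceptable tolerance.
    """
    numbers = [float(v) for v in values]
    grand_total = numbers[0]
    rest_sum = sum(numbers[1:])
    # Compare within a tolerance instead of using == on floats.
    return 1 if abs(grand_total - rest_sum) < tolerance else 0


print(is_reconciled((10000, 4500, 5500)))        # first sample row, $recon1 fields
print(is_reconciled(("12500", "7500", "5000")))  # string inputs are converted too
print(is_reconciled((9900, 7425, 2470)))         # deliberately off by 5
```

The tolerance matters for inputs such as (0.3, 0.1, 0.2), where 0.1 + 0.2 is not exactly 0.3 in binary floating point.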

Related

Calling arguments from 2 separate lists into one function

I'm trying to calculate the alcohol by volume (abv) of some beer by using variables from 2 separate lists (which I took from a dictionary entry). I'm having trouble getting the values from both lists to be applied to the equation that I have for abv (and it's probably not possible to have a for loop with an and statement like the one I have below). Is it possible to get variables from two separate lists to be subbed into the same equation in one for loop?
Right now it's telling me that I have a type error where 'bool' object is not iterable. Here's what I've tried so far in terms of coding:
beers = {"SG": [1.050, 1.031, 1.077, 1.032, 1.042, 1.055, 1.019, 1.089, 1.100, 1.032],
"FG": [1.010, 1.001, 1.044, 1.003, 1.003, 1.013, 1.002, 1.020, 1.056, 1.000],
"grad student 1": [5.264, 3.983, 4.101, 7.216, 2.313, 4.876, 2.255, 8.991, 5.537, 4.251],
"grad student 2": [5.211, 3.008, 4.117, 3.843, 5.168, 5.511, 3.110, 8.903, 5.538, 4.255]}
#separating the SG and FG values from the dictionary entry
SG_val = beers["SG"]
FG_val = beers['FG']
def find_abv(SG = SG_val, FG = FG_val):
    abv_list = []
    i = 0.0
    j = 0.0
    for i in SG_val and j in FG_val:
        abv = (((1.05/0.79)*((i - j)/j))*100)
        abv_list.append(abv)
        return abv_list
find_abv()
print(abv_list)
You cannot use and to iterate two variables in a single for loop. You can use the zip function to do that:
def find_abv(SG = SG_val, FG = FG_val):
    abv_list = []
    i = 0.0
    j = 0.0
    for i, j in zip(SG,FG):
        abv = (((1.05/0.79)*((i - j)/j))*100)
        abv_list.append(abv)
    return abv_list
abv_list = find_abv()
print(abv_list)
You also need to assign the result of find_abv() to a variable in order to print it, which your code does not appear to do.
Another thing is that using SG_val and FG_val inside the loop of find_abv is pointless, since the function already has the SG and FG parameters.
You can't use a for loop to directly iterate through multiple lists. Currently, your function is trying to iterate through (SG_val and j in FG_val), which itself is a boolean and can therefore not be iterated through.
If the two lists will always have the same number of items, then you could simply iterate through the indexes:
# len(SG_val) returns the length of SG_val
for i in range(len(SG_val)):
    abv = (((1.05/0.79)*((SG_val[i] - FG_val[i])/FG_val[i]))*100)
    abv_list.append(abv)
# put the return outside of the for loop so that it can finish iterating before returning the value
return abv_list
If the lists aren't always going to be the same length, then you can write for i in range(len(SG_val) if len(SG_val) <= len(FG_val) else len(FG_val)): instead of for i in range(len(SG_val)): so that it iterates only until it reaches the end of the shorter list.
Also, to output the value returned by the function you have to assign it to something and then print it or just print it directly:
abv_list = find_abv()
print(abv_list)
# or
print(find_abv())
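The points made in both answers can be combined into one short sketch: pair the values with zip, build the list with a comprehension, and print the returned value. The parameter names sg_values and fg_values are illustrative, and the second list is shortened on purpose to show what zip does with unequal lengths.

```python
def find_abv(sg_values, fg_values):
    """Return the ABV for each (SG, FG) pair using the formula from the question."""
    # zip stops at the end of the shorter list, so unequal lengths do not raise.
    return [(1.05 / 0.79) * ((sg - fg) / fg) * 100
            for sg, fg in zip(sg_values, fg_values)]


sg = [1.050, 1.031, 1.077]
fg = [1.010, 1.001]  # deliberately one item shorter

abv_list = find_abv(sg, fg)
print(len(abv_list))  # 2: only complete pairs are used
print(abv_list)
```

Passing the lists as arguments, rather than relying on default values bound to module-level variables, keeps the function reusable for other data.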

Nested for loop producing more number of values than expected-Python

Background: I have two catalogues containing the positions of spatial objects. My aim is to find the objects that appear in both catalogues, allowing a maximum angular distance between them. One catalogue is called bss and the other is called super.
Here is the full code I wrote:
import numpy as np
def crossmatch(bss_cat, super_cat, max_dist):
matches=[]
no_matches=[]
def find_closest(bss_cat,super_cat):
dist_list=[]
def angular_dist(ra1, dec1, ra2, dec2):
r1 = np.radians(ra1)
d1 = np.radians(dec1)
r2 = np.radians(ra2)
d2 = np.radians(dec2)
a = np.sin(np.abs(d1-d2)/2)**2
b = np.cos(d1)*np.cos(d2)*np.sin(np.abs(r1 - r2)/2)**2
rad = 2*np.arcsin(np.sqrt(a + b))
d = np.degrees(rad)
return d
for i in range(len(bss_cat)): #The problem arises here
for j in range(len(super_cat)):
distance = angular_dist(bss_cat[i][1], bss_cat[i][2], super_cat[j][1], super_cat[j][2]) #While this is supposed to produce single floating point values, it produces numpy.ndarray consisting of three entries
dist_list.append(distance) #This list now contains numpy.ndarrays instead of numpy.float values
for k in range(len(dist_list)):
if dist_list[k] < max_dist:
element = (bss_cat[i], super_cat[j], dist_list[k])
matches.append(element)
else:
element = bss_cat[i]
no_matches.append(element)
return (matches,no_matches)
When called separately, the function angular_dist(ra1, dec1, ra2, dec2) produces a single numpy.float value as expected. But when used inside the for loop in crossmatch(bss_cat, super_cat, max_dist), it produces numpy.ndarrays instead of numpy.float. I've noted this in the code comments as well. I don't know where the code goes wrong. Please help.

How can I take the lowest value in this code?

I'm trying to get the entry with the lowest price from the list below. Each entry has the form country, price, date.
I'm using Python for the code.
valores= ["al[8075]['2019-05-27']", "de[2177]['2019-05-27']", "at[3946]['2019-05-27']", "be[3019]['2019-05-26']", "by[5741]['2019-05-27']", "ba[0]['2019-05-26', '2019-05-27']", "bg[3223]['2019-05-26']", "hr[4358]['2019-05-26']", "dk[5006]['2019-05-27']", "sk[4964]['2019-05-27']", "si[5253]['2019-05-26']", "es[3813]['2019-05-27']", "ee[4699]['2019-05-27']", "ru[4889]['2019-05-27']", "fi[5410]['2019-05-26']", "fr[2506]['2019-05-26']", "gi[0]['2019-05-26', '2019-05-27']", "gr[1468]['2019-05-26']", "hu[3475]['2019-05-27']", "ie[5360]['2019-05-26']", "is[0]['2019-05-26']", "it[2970]['2019-05-26']", "lv[2482]['2019-05-27']", "lt[1276]['2019-05-27']", "lu[0]['2019-05-26']", "mk[5417]['2019-05-26']", "mt[3532]['2019-05-26']", "md[6158]['2019-05-27']", "me[11080]['2019-05-26']", "no[2967]['2019-05-27']", "nl[3640]['2019-05-27']", "pl[2596]['2019-05-27']", "pt[5409]['2019-05-27']", "uk[5010]['2019-05-27']", "cz[5493]['2019-05-26']", "ro[1017]['2019-05-27']", "rs[6535]['2019-05-27']", "se[3971]['2019-05-26']", "ch[5112]['2019-05-26']", "tr[3761]['2019-05-26']", "ua[5187]['2019-05-26']"]
The expected result for this example is shown below.
As you can see, country (ro), price (1017), date ('2019-05-27') is the lowest:
valores= "ro[1017]['2019-05-27']"
Python's max() and min() functions take a key argument. So, whenever you need a minimum or maximum, you can often leverage these built-ins. The only code you have to write is a function that converts each value to the representation used for comparison. Here, the or float('inf') maps a price of 0 to infinity, so zero prices are never picked as the minimum.
def f(s):
    return int(s.split('[')[1].split(']')[0]) or float('inf')
lowest = min(valores, key = f) # ro[1017]['2019-05-27']
There is more than one way to code this. The following will do it:
lowest = 1000000
target = " "
for i in valores:
    ix = i.find("[") + 1
    iy = i.find("]")
    value = int(i[ix:iy])
    if value < lowest and value != 0:
        lowest = value
        target = i
print(target)
It will output
ro[1017]['2019-05-27']
However, here I am assuming you do not want 0 values; otherwise the answer would be
ba[0]['2019-05-26', '2019-05-27']
If you want to include 0, just modify the if block.
This should work for you. I assume you want the lowest non-zero price.
I split each string in the list on [ and strip the leftover brackets [ and ] from each piece, so each sublist has the form [country, price, dates].
I then sort on the price, which is the second item of each sublist, and filter out the 0 prices.
The result is then the first element of the filtered list.
import re
valores= ["al[8075]['2019-05-27']", "de[2177]['2019-05-27']", "at[3946]['2019-05-27']", "be[3019]['2019-05-26']", "by[5741]['2019-05-27']", "ba[0]['2019-05-26', '2019-05-27']", "bg[3223]['2019-05-26']", "hr[4358]['2019-05-26']", "dk[5006]['2019-05-27']", "sk[4964]['2019-05-27']", "si[5253]['2019-05-26']", "es[3813]['2019-05-27']", "ee[4699]['2019-05-27']", "ru[4889]['2019-05-27']", "fi[5410]['2019-05-26']", "fr[2506]['2019-05-26']", "gi[0]['2019-05-26', '2019-05-27']", "gr[1468]['2019-05-26']", "hu[3475]['2019-05-27']", "ie[5360]['2019-05-26']", "is[0]['2019-05-26']", "it[2970]['2019-05-26']", "lv[2482]['2019-05-27']", "lt[1276]['2019-05-27']", "lu[0]['2019-05-26']", "mk[5417]['2019-05-26']", "mt[3532]['2019-05-26']", "md[6158]['2019-05-27']", "me[11080]['2019-05-26']", "no[2967]['2019-05-27']", "nl[3640]['2019-05-27']", "pl[2596]['2019-05-27']", "pt[5409]['2019-05-27']", "uk[5010]['2019-05-27']", "cz[5493]['2019-05-26']", "ro[1017]['2019-05-27']", "rs[6535]['2019-05-27']", "se[3971]['2019-05-26']", "ch[5112]['2019-05-26']", "tr[3761]['2019-05-26']", "ua[5187]['2019-05-26']"]
results = []
#Iterate through valores
for item in valores:
    #Extract elements from each string by splitting on [ and then stripping extra square brackets
    items = [it.strip('][') for it in item.split('[')]
    results.append(items)
#Sort on the second element, which is the price, and filter out prices which are 0
res = list(
    filter(lambda x: int(x[1]) > 0,
           sorted(results, key=lambda x:int(x[1])))
)
#This is your lowest non-zero price
print(res[0])
The output will be
['ro', '1017', "'2019-05-27'"]
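If a structured result is preferred over string slicing, the entries can also be parsed with a regular expression. This is only a sketch: the pattern is inferred from the sample data, the names ENTRY and parse are illustrative, and a short subset of valores is used to keep it readable.

```python
import re

# Pattern for entries shaped like country[price][dates]; assumed from the sample data.
ENTRY = re.compile(r"^(\w+)\[(\d+)\]\[(.*)\]$")


def parse(entry):
    """Split one entry into (country, price, dates)."""
    match = ENTRY.match(entry)
    if match is None:
        raise ValueError("unexpected entry format: %r" % entry)
    country, price, raw_dates = match.groups()
    # Pull each quoted date out of the last bracketed group.
    dates = re.findall(r"'([^']*)'", raw_dates)
    return country, int(price), dates


sample = ["gr[1468]['2019-05-26']", "ba[0]['2019-05-26', '2019-05-27']",
          "ro[1017]['2019-05-27']", "lt[1276]['2019-05-27']"]

parsed = [parse(s) for s in sample]
# Ignore zero prices, as the answers above do, then take the cheapest entry.
lowest = min((p for p in parsed if p[1] > 0), key=lambda p: p[1])
print(lowest)  # ('ro', 1017, ['2019-05-27'])
```

Parsing once into tuples keeps the price as an int and the dates as a list, so later steps do not need to strip quotes or brackets again.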

Permutation of pandas series with all elements in it (itertools)

I'm trying to get back a series (or data frame) with the permutations of the elements in that series:
stmt_stream_ticker = ("SELECT * FROM ticker;")
ticker = pd.read_sql(stmt_stream_ticker, eng)
which gives me my ticker series
ticker
0 ALJ
1 ALDW
2 BP
3 BPT
Then I'd like to process it with a function:
def permus_maker(x):
    global i # I tried nonlocal i but this gives me: nonbinding error
    permus = itertools.permutations(x, 2)
    permus_pairs = []
    for i in permus:
        permus_pairs.append(i)
    return permus_pairs.append(i)
test = permus_maker(ticker)
print(test)
This gives me 'None' back. Any idea what I'm doing wrong?
edit1:
I tested user defined function vs integrated function (%timeit): it takes 5x as long.
list.append() returns None, so return permus_pairs instead of permus_pairs.append(i)
Demo:
In [126]: [0].append(1) is None
Out[126]: True
but you don't really need that function:
In [124]: list(permutations(ticker.ticker, 2))
Out[124]:
[('ALJ', 'ALDW'),
('ALJ', 'BP'),
('ALJ', 'BPT'),
('ALDW', 'ALJ'),
('ALDW', 'BP'),
('ALDW', 'BPT'),
('BP', 'ALJ'),
('BP', 'ALDW'),
('BP', 'BPT'),
('BPT', 'ALJ'),
('BPT', 'ALDW'),
('BPT', 'BP')]
Be aware: if you are working with huge lists/Series/DataFrames, it makes sense to consume the permutations iterator one pair at a time instead of expanding it in memory:
for pair in permutations(ticker.ticker, 2):
    pass  # process pair of tickers here ...
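For reference, a corrected version of the function from the question might look like the sketch below. It uses a plain list as a stand-in for ticker.ticker so that it runs without a database or pandas; that stand-in is an assumption for illustration only.

```python
from itertools import permutations


def permus_maker(items):
    """Return all ordered pairs of distinct positions in items as a list."""
    # Return the list itself; list.append() would return None.
    return list(permutations(items, 2))


tickers = ["ALJ", "ALDW", "BP", "BPT"]  # stand-in for ticker.ticker

pairs = permus_maker(tickers)
print(len(pairs))  # 12
print(pairs[0])    # ('ALJ', 'ALDW')

# Lazy alternative: consume the iterator one pair at a time.
count = 0
for first, second in permutations(tickers, 2):
    count += 1
print(count)       # 12
```

No global or nonlocal declaration is needed, because the loop variable never has to outlive the function.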

Fuzzy Match List with Column in a data frame

I have a list of strings that I am trying to match to values in a column. If the best match scores low (below 95), I want to return the current column value; if it scores 95 or above, I want to return the best fuzzy match from the list. I am trying to put all returned values into a new column. I keep getting the error "tuple index out of range". I think this may be because it wants to return a tuple with the score and name, but I only want the name. Here is my current code:
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
L = [ducks, frogs, doggies]
df
FOO PETS
a duckz
b frags
c doggies
def fuzz_m(column, pet_list, score_t):
    for c in column:
        new_name, score = process.extractOne(c, pet_list, score_t)
        if score<95:
            return c
        else:
            return new_name
df['NEW_PETS'] = fuzz_m(df,L, fuzz.ratio)
Desired output:
FOO PETS NEW_PETS
a duckz ducks
b frags frogs
c doggies doggies
Several corrections.
Change
df['NEW_PETS'] = fuzz_m(df,L, fuzz.ratio)
to
df['NEW_PETS'] = fuzz_m(df['PETS'], L, fuzz.ratio)
Make your list elements strings.
Fuzzywuzzy's extractOne method accepts both a processor and a scorer, in that order. Your positional argument of fuzz.ratio is mistakenly interpreted as a processor, when it's really a scorer. Change process.extractOne(c, pet_list, score_t) to process.extractOne(c, pet_list, scorer=score_t).
This loop-based code will not work as expected. fuzz_m is only called once, and its return value will be broadcast into all entries of the series df['NEW_PETS'].
A more pandas-friendly way:
L = ['ducks', 'frogs', 'doggies']
def fuzz_m(col, pet_list, score_t):
    new_name, score = process.extractOne(col, pet_list, scorer=score_t)
    if score<95:
        return col
    else:
        return new_name
df['NEW_PETS'] = df['PETS'].apply(fuzz_m, pet_list=L, score_t=fuzz.ratio)
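To see how the threshold interacts with the sample data, the sketch below reproduces the same match-or-keep logic using the standard library's difflib.SequenceMatcher in place of fuzzywuzzy. This is a swapped-in technique chosen so the example runs without extra packages; its scores may differ from those of fuzz.ratio, and the function name best_match is illustrative.

```python
from difflib import SequenceMatcher


def best_match(name, choices, threshold):
    """Return the closest choice if its score reaches threshold, else name itself."""
    # Score each candidate on a 0-100 scale, similar in spirit to a fuzzy ratio.
    scored = [(SequenceMatcher(None, name, c).ratio() * 100, c) for c in choices]
    score, choice = max(scored)
    return choice if score >= threshold else name


pets = ["duckz", "frags", "doggies"]
choices = ["ducks", "frogs", "doggies"]

print([best_match(p, choices, 95) for p in pets])
print([best_match(p, choices, 75) for p in pets])
```

With these inputs, a threshold of 95 leaves duckz and frags unchanged, because one differing letter in a five-letter word scores about 80 under this measure. If the real scorer behaves similarly, the threshold would need to be lowered to produce the desired output.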
