Permutation of pandas series with all elements in it (itertools) - python

I trying to get a back a series (or data frame) with permutations of the elements in that list:
stmt_stream_ticker = ("SELECT * FROM ticker;")
ticker = pd.read_sql(stmt_stream_ticker, eng)
which gives me my ticker series
ticker
0 ALJ
1 ALDW
2 BP
3 BPT
then via function I'd like to work my list:
def permus_maker(x):
global i # I tried nonlocal i but this gives me: nonbinding error
permus = itertools.permutations(x, 2)
permus_pairs = []
for i in permus:
permus_pairs.append(i)
return permus_pairs.append(i)
test = permus_maker(ticker)
print(test)
This gives me 'None' back. Any idea what I do wrong?
edit1:
I tested user defined function vs integrated function (%timeit): it takes 5x as long.

list.append() returns None, so return permus_pairs instead of permus_pairs.append(i)
Demo:
In [126]: [0].append(1) is None
Out[126]: True
but you don't really need that function:
In [124]: list(permutations(ticker.ticker, 2))
Out[124]:
[('ALJ', 'ALDW'),
('ALJ', 'BP'),
('ALJ', 'BPT'),
('ALDW', 'ALJ'),
('ALDW', 'BP'),
('ALDW', 'BPT'),
('BP', 'ALJ'),
('BP', 'ALDW'),
('BP', 'BPT'),
('BPT', 'ALJ'),
('BPT', 'ALDW'),
('BPT', 'BP')]
Be aware - if you are working with huge lists/Series/DataFrames, then it would make sense to use permutations iterator iteratively instead of exploding it in memory:
for pair in permutations(ticker.ticker, 2):
# process pair of tickers here ...

Related

Nested for loop producing more number of values than expected-Python

Background:I have two catalogues consisting of positions of spatial objects. My aim is to find the similar ones in both catalogues with a maximum difference in angular distance of certain value. One of them is called bss and another one is called super.
Here is the full code I wrote
import numpy as np
def crossmatch(bss_cat, super_cat, max_dist):
matches=[]
no_matches=[]
def find_closest(bss_cat,super_cat):
dist_list=[]
def angular_dist(ra1, dec1, ra2, dec2):
r1 = np.radians(ra1)
d1 = np.radians(dec1)
r2 = np.radians(ra2)
d2 = np.radians(dec2)
a = np.sin(np.abs(d1-d2)/2)**2
b = np.cos(d1)*np.cos(d2)*np.sin(np.abs(r1 - r2)/2)**2
rad = 2*np.arcsin(np.sqrt(a + b))
d = np.degrees(rad)
return d
for i in range(len(bss_cat)): #The problem arises here
for j in range(len(super_cat)):
distance = angular_dist(bss_cat[i][1], bss_cat[i][2], super_cat[j][1], super_cat[j][2]) #While this is supposed to produce single floating point values, it produces numpy.ndarray consisting of three entries
dist_list.append(distance) #This list now contains numpy.ndarrays instead of numpy.float values
for k in range(len(dist_list)):
if dist_list[k] < max_dist:
element = (bss_cat[i], super_cat[j], dist_list[k])
matches.append(element)
else:
element = bss_cat[i]
no_matches.append(element)
return (matches,no_matches)
When put seperately, the function angular_dist(ra1, dec1, ra2, dec2) produces a single numpy.float value as expected. But when used inside the for loop in this crossmatch(bss_cat, super_cat, max_dist) function, it produces numpy.ndarrays instead of numpy.float. I've stated this inside the code also. I don't know where the code goes wrong. Please help

How to count elements in an array withtin a given increasing interval?

I have an array of time values. I want to know how many values are in each 0.05 seconds window.
For example, some values of my array are: -1.9493, -1.9433, -1.911 , -1.8977, -1.8671,..
In the first interval of 0.050 seconds (from -1.9493 to -1.893) I´m expecting to have 3 elements
I already create another array with the 0.050 seconds steps.
a=max(array)
b=min(array)
ventanalinea1=np.arange(b,a,0.05)
v1=np.array(ventanalinea1)
In other words, I would like to compare my original array with this one.
I would like to know if there is a way to ask python to evaluate my array within a given dynamic range.
One of the variants:
import numpy as np
# original array
a = [-1.9493, -1.9433, -1.911 , -1.8977, -1.8671]
step = 0.05
bounds = np.arange(min(a), max(a) + step, step)
result = [
list(filter(lambda x: b[i] <= x <= b[i+1], a))
for i in range(len(b)-1)
]
I have found a cool python library python-intervals that simplify your problem a lot:
Install it with pip install python-intervals and try the code below.
import intervals as I
# This is a recursive function
def counter(timevalues, w=0.050):
if not timevalues:
return "" # stops recursion when timevalues is empty
# Make an interval object that provides convenient interval operations like 'contains'
window = I.closed(
timevalues[0], timevalues[0] + w)
interval = list(
filter(window.contains, timevalues))
count = len(interval)
timevalues = timevalues[count:]
print(f"[{interval[0]} : {interval[-1]}] : {count}")
return counter(timevalues)
if __name__ == "__main__":
times = [-1.9493, -1.9433, -1.911, -1.8977, -1.8671]
print(counter(times))
Adapt it as you wish, for example you might want to return a dictionary rather that a string.
You could still get around this without using the python-intervals library here but i have introduced it here because it will be very likely that you would need other complex operations along the way on your code.

openpyxl possible lambda usage

I was trying to assign value of a cell equal to sum of few other cell. I could achieve this using two line like
dailycell=45
while sheet['E' + str(dailycell)].value is not None:
mysum+=sheet['E'+str(dailycell)].value
sheet['B13']=mysum
dailycell+=1
With my limited knowledge, I tried lambda. Then I get all sort of error. I tried few more iteration but none got me the same result.
sheet['B13']=lambda weeklybudget,sheet['E'+str(dailycell)].value:weeklybudget+=sheet['E'+str(dailycell)]
Is there an alternative to this?
Why don't you just iterate over the whole E column and sum up the values:
sheet["B13"] = sum(c.value or 0 for c in sheet['E'])
If you have a restriction on what row to start from, then just grab the appropriate slice:
sheet["B13"] = sum(c.value or 0 for c in sheet['E'][46:])
If you need to stop summing at the first occurrence of None, you'll need a bit of external help - itertools.takewhile() comes to mind:
from itertools import takewhile
sheet["B13"] = sum(c.value for c in takewhile(lambda x: x.value is not None, sheet['E'][46:]))

Python: Summing a pig tuple containing float values

I'm fairly new to Pig/Python and in need of help. Trying to write a Pig Script that reconciles financial data. The parameters used follow a syntax like (grand_tot, x1, x2,... xn), meaning that the first value should equal the sum of remaining values.
I don't know of a way to accomplish this using Pig alone, so I've been trying to write a Python UDF. Pig passes a tuple to Python; if the sum of x1:xn equals grand_tot, then Python should return a "1" to Pig to show that the numbers match, otherwise it returns a "0".
Here is what I have so far:
register 'myudf.py' using jython as myfuncs;
A = LOAD '$file_nm' USING PigStorage(',') AS (grand_tot,west_region,east_region,prod_line_a,prod_line_b, prod_line_c, prod_line_d);
A1 = GROUP A ALL;
B = FOREACH A1 GENERATE TOTUPLE($recon1) as flds;
C = FOREACH B GENERATE myfuncs.isReconciled(flds) AS res;
DUMP C;
$recon1 is passed as a parameter, and defined as:
grand_tot, west_region, east_region
I will later pass $recon2 as:
grand_tot, prod_line_a, prod_line_b, prod_line_c, prod_line_d
Sample row of data (in $file_nm) looks like:
grand_tot,west_region,east_region,prod_line_a,prod_line_b, prod_line_c, prod_line_d
10000,4500,5500,900,2200,450,3700,2750
12500,7500,5000,3180,2770,300,3950,2300
9900,7425,2475,1320,460,3070,4630,1740
Lastly... here is what I'm trying to do with Python UDF code:
#outputSchema("result")
def isReconciled(arrTuple):
arrTemp = []
arrNew = []
string1 = ""
result = 0
## the first element of the Tuple should be the sum of remaining values
varGrandTot = arrTuple[0]
## create a new array with the remaining Tuple values
arrTemp = arrTuple[1:]
for item in arrTuple:
arrNew.append(item)
## sum the second to the nth values
varSum = sum(arrNew)
## if the first value in the tuple equals the sum of all remaining values
if varGrandTot = varSum then:
#reconciled to the penny
result = 1
else:
result = 0
return result
The error message I receive is:
unsupported operand type(s) for +: 'int' and 'array.array'
I've tried numerous things attempting to convert the array values into numeric and convert to float so that I can sum, but with no success.
Any ideas??? Thanks for looking!
You can do this in PIG itself.
First, specify the datatype in the schema. PigStorage will use bytearray as default data type.Hence your python script is throwing the error.Looks like your sample data has int but in your question you have mentioned float.
Second, add the fields starting from the second field or the fields of your choice.
Third, use the bincond operator to check the first field value with the sum.
A = LOAD '$file_nm' USING PigStorage(',') AS (grand_tot:float,west_region:float,east_region:float,prod_line_a:float,prod_line_b:float, prod_line_c:float, prod_line_d:float);
A1 = FOREACH A GENERATE grand_tot,SUM(TOBAG(prod_line_a,prod_line_b,prod_line_c,prod_line_d)) as SUM_ALL;
B = FOREACH A1 GENERATE (grand_tot == SUM_ALL ? 1 : 0);
DUMP B;
It is very likely, that your arrTuple is not an array of numbers, but some item is an array.
To check it, modify your code by adding some checks:
#outputSchema("result")
def isReconciled(arrTuple):
# some checks
tmpl = "Item # {i} shall be a number (has value {itm} of type {tp})"
for i, num in enumerate(arrTuple):
msg = templ.format(i=i, itm=itm, tp=type(itm))
assert isinstance(arrTuple[0], (int, long, float)), msg
# end of checks
arrTemp = []
arrNew = []
string1 = ""
result = 0
## the first element of the Tuple should be the sum of remaining values
varGrandTot = arrTuple[0]
## create a new array with the remaining Tuple values
arrTemp = arrTuple[1:]
for item in arrTuple:
arrNew.append(item)
## sum the second to the nth values
varSum = sum(arrNew)
## if the first value in the tuple equals the sum of all remaining values
if varGrandTot = varSum then:
#reconciled to the penny
result = 1
else:
result = 0
return result
It is very likely, that it will throw an AssertionFailed exception on one of the items. Read the
assertion message to learn, which item is making the troubles.
Anyway, if you want to return 0 or 1 if first number equals sum of the rest of the array, following
would work too:
#outputSchema("result")
def isReconciled(arrTuple):
if arrTuple[0] == sum(arrTuple[1:]):
return 1
else:
return 0
and in case, you would live happy with getting True instead of 1 and False instead of 0:
#outputSchema("result")
def isReconciled(arrTuple):
return arrTuple[0] == sum(arrTuple[1:])

TypeError: 'filter' object is not subscriptable

I am receiving the error
TypeError: 'filter' object is not subscriptable
When trying to run the following block of code
bonds_unique = {}
for bond in bonds_new:
if bond[0] < 0:
ghost_atom = -(bond[0]) - 1
bond_index = 0
elif bond[1] < 0:
ghost_atom = -(bond[1]) - 1
bond_index = 1
else:
bonds_unique[repr(bond)] = bond
continue
if sheet[ghost_atom][1] > r_length or sheet[ghost_atom][1] < 0:
ghost_x = sheet[ghost_atom][0]
ghost_y = sheet[ghost_atom][1] % r_length
image = filter(lambda i: abs(i[0] - ghost_x) < 1e-2 and
abs(i[1] - ghost_y) < 1e-2, sheet)
bond[bond_index] = old_to_new[sheet.index(image[0]) + 1 ]
bond.sort()
#print >> stderr, ghost_atom +1, bond[bond_index], image
bonds_unique[repr(bond)] = bond
# Removing duplicate bonds
bonds_unique = sorted(bonds_unique.values())
And
sheet_new = []
bonds_new = []
old_to_new = {}
sheet=[]
bonds=[]
The error occurs at the line
bond[bond_index] = old_to_new[sheet.index(image[0]) + 1 ]
I apologise that this type of question has been posted on SO many times, but I am fairly new to Python and do not fully understand dictionaries. Am I trying to use a dictionary in a way in which it should not be used, or should I be using a dictionary where I am not using it?
I know that the fix is probably very simple (albeit not to me), and I will be very grateful if someone could point me in the right direction.
Once again, I apologise if this question has been answered already
Thanks,
Chris.
I am using Python IDLE 3.3.1 on Windows 7 64-bit.
filter() in python 3 does not return a list, but an iterable filter object. Use the next() function on it to get the first filtered item:
bond[bond_index] = old_to_new[sheet.index(next(image)) + 1 ]
There is no need to convert it to a list, as you only use the first value.
Iterable objects like filter() produce results on demand rather than all in one go. If your sheet list is very large, it might take a long time and a lot of memory to put all the filtered results into a list, but filter() only needs to evaluate your lambda condition until one of the values from sheet produces a True result to produce one output. You tell the filter() object to scan through sheet for that first value by passing it to the next() function. You could do so multiple times to get multiple values, or use other tools that take iterables to do more complex things; the itertools library is full of such tools. The Python for loop is another such a tool, it too takes values from an iterable one by one.
If you must have access to all filtered results together, because you have to, say, index into the results at will (e.g. because this time your algorithm needed to access index 223, index 17 then index 42) only then convert the iterable object to a list, by using list():
image = list(filter(lambda i: ..., sheet))
The ability to access any of the values of an ordered sequence of values is called random access; a list is such a sequence, and so is a tuple or a numpy array. Iterables do not provide random access.
Use list before filter condtion then it works fine. For me it resolved the issue.
For example
list(filter(lambda x: x%2!=0, mylist))
instead of
filter(lambda x: x%2!=0, mylist)
image = list(filter(lambda i: abs(i[0] - ghost_x) < 1e-2 and abs(i[1] - ghost_y) < 1e-2, sheet))

Categories