Pandas: extract number from calculation within loop - python

I'm trying to do calculations within a loop from multiple columns within a pandas dataframe. I want the output to be just a number, but it is in the form [index number dtype: int64]. It seems like it should be easy to get just that number, but I can't figure it out. Here is a simple example of some data and a basic calculation
import pandas as pd
# create a little dataframe
df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})
# create a list to hold results
l1 = []
# run a loop to do a simple example calculation
for i, _ in enumerate(df.A):
    val = df.A[[i]] + df.B[[i]]
    l1.append(val)
This is what I get for l1:
[0 4
dtype: int64,
1 6
dtype: int64]
My desired output is
[4, 6]
I can take the second value from each element in the list, but I want to do something faster, because my dataset is large, and it seems like I should be able to return a calculation without the index and dtype. Thank you in advance.

Change the last line inside the for loop; the original returns a Series, which causes the 'issue' you mentioned:
l1 = []
# run a loop to do a simple example calculation
for i, _ in enumerate(df.A):
    val = df.A[[i]] + df.B[[i]]
    l1.append(val.iloc[0])
l1
Out[154]: [4, 6]
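Since speed on a large dataset is the stated goal, it's worth noting that the loop isn't needed at all here; a vectorized sketch of the same calculation:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

# Add the two columns element-wise in one vectorized operation,
# then convert the resulting Series to a plain Python list.
l1 = (df.A + df.B).tolist()
print(l1)  # [4, 6]
```

This avoids per-element indexing entirely and will be much faster than any row-by-row loop.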

Related

Pandas transform: assign result to each element of group

I am currently using pandas groupby and transform to calculate something for each group (once) and then assign the result to each row of the group.
If the result of calculations is scalar it can be obtained like:
df['some_col'] = df.groupby('id')['some_col'].transform(lambda x:process(x))
The problem is that the result of my calculation is a vector, and pandas tries to make an element-wise assignment of the result vector to the group (quote from the pandas docs):
The transform function must:
Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
I could hardcode an external function that creates a group-sized list containing copies of the result (I'm currently on Python 3.6, so it's not possible to use assignment inside a lambda):
def return_group(x):
    result = process(x)
    return [result for item in x]
But I think it's possible to solve this somehow "smarter". Remember that the calculation must be done only once for each group.
Is it possible to force pd.transform to work with an array-like result of the lambda function the way it does with scalars (just copy it n times)?
I would be grateful for any advice.
P.S. I understand that it's possible to use a combination of apply and join to solve the original requirement, but a solution with transform has higher priority in my case.
Sometimes transform is a pain to work with. If that's not a problem for you, I'd suggest using groupby plus a left pd.merge, as in this example:
import pandas as pd
df = pd.DataFrame({"id": [1, 1, 2, 2, 2],
                   "col": [1, 2, 3, 4, 5]})
# this return a list for every group
grp = df.groupby("id")["col"]\
    .apply(lambda x: list(x))\
    .reset_index(name="out")
# Then you merge it to the original df
df = pd.merge(df, grp, how="left")
And print(df) returns
id col out
0 1 1 [1, 2]
1 1 2 [1, 2]
2 2 3 [3, 4, 5]
3 2 4 [3, 4, 5]
4 2 5 [3, 4, 5]
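Another sketch that computes each group's result exactly once, without the merge, is to map the per-group result back through the key column (same mock data as above):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2, 2],
                   "col": [1, 2, 3, 4, 5]})

# Compute the list for each group exactly once...
per_group = df.groupby("id")["col"].apply(list)

# ...then broadcast it back onto every row of the group
# by looking up each row's id in the per-group Series.
df["out"] = df["id"].map(per_group)
print(df)
```

The result is the same as the merge-based version; which one is faster depends on the number of groups.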

Retrieve data in Pandas

I am using pandas and uproot to read data from a .root file, and I get a table like the following one:
The aforementioned table is made with the following code:
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
branches = ['Z1Flav', 'Z2Flav', 'nCleanedJetsPt30', 'LepPt', 'LepLepId']
df = ttree.pandas.df(branches, flatten=False)
I need to find the maximum value in LepPt, and, once found the maximum, I also need to retrieve the LepLepId of that maximum value.
I have no problem in finding the maximum values:
Pt_l1 = [max(i) for i in df.LepPt]
In this way I get an array with all the maximum values. However, I have to separate such values according to the LepLepId. So I need an array with the maximum LepPt and |LepLepId|=11 and one with the maximum LepPt and |LepLepId|=13.
If someone could give me any hint, advice and/or suggestion, I would be very grateful.
I made some mock data since you didn't provide yours in any easy format. I think this is what you are looking for.
import pandas as pd
df = pd.DataFrame.from_records(
    [[[1, 2, 3], [4, 5, 6]],
     [[4, 6, 5], [7, 8, 9]]],
    columns=['LepPt', 'LepLepId']
)
df['max_LepPt'] = [max(i) for i in df.LepPt]
def f(row):
    # get the index position of the maximum within the list
    pos = row['LepPt'].index(row['max_LepPt'])
    return row['LepLepId'][pos]
df['same_index_LepLepId'] = df.apply(f, axis=1)
returns:
       LepPt   LepLepId  max_LepPt  same_index_LepLepId
0  [1, 2, 3]  [4, 5, 6]          3                    6
1  [4, 6, 5]  [7, 8, 9]          6                    8
You could use the awkward.JaggedArray interface for this (one of the dependencies of uproot), which allows you to have irregularly sized arrays.
For this you would need to slightly change the way you load the data, but it allows you to use the same methods you would use with a normal numpy array, namely argmax:
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
# branches = ['Z1Flav', 'Z2Flav', 'nCleanedJetsPt30', 'LepPt', 'LepLepId']
branches = ['LepPt', 'LepLepId'] # to save memory, only load what you need
# df = ttree.pandas.df(branches, flatten=False)
a = ttree.arrays(branches) # use awkward array interface
max_pt_idx = a[b'LepPt'].argmax()
max_pt_lepton_id = a[b'LepLepId'][max_pt_idx].flatten()
This is then just a normal numpy array, which you can assign to a column of a pandas dataframe if you want to. It should have the right dimensionality and order. It should also be faster than using the built-in Python functions.
Note that the keys are bytestrings, instead of normal strings and that you will have to take some extra steps if there are events with no leptons (in which case the flatten will ignore those empty events, destroying the alignment).
Alternatively, you can also convert the columns afterwards:
import awkward
df = ttree.pandas.df(branches, flatten=False)
max_pt_idx = awkward.fromiter(df["LepPt"]).argmax()
lepton_id = awkward.fromiter(df["LepLepId"])
df["max_pt_lepton_id"] = lepton_id[max_pt_idx].flatten()
The former will be faster if you don't need the columns again afterwards, otherwise the latter might be better.

Get index of elements in first Series within the second series

I want to get the index of all values in the smaller series for the larger series. The answer is in the code snippet below stored in the ans variable.
import pandas as pd
smaller = pd.Series(["a","g","b","k"])
larger = pd.Series(["a","b","c","d","e","f","g","h","i","j","k","l","m"])
# ans to be generated by some unknown combination of functions
ans = [0,6,1,10]
print(larger.iloc[ans,])
print(smaller)
assert(smaller.tolist() == larger.iloc[ans,].tolist())
Context: Series larger serves as an index for the columns in a numpy matrix, and series smaller serves as an index for the columns in a numpy vector. I need indexes for the matrix and vector to match.
You can reverse your larger series, then index this with smaller:
larger_rev = pd.Series(larger.index, larger.values)
res = larger_rev[smaller].values
print(res)
array([ 0, 6, 1, 10], dtype=int64)
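As one more option, pandas can do this lookup in a single call with Index.get_indexer (a sketch; it returns -1 for values of smaller that are missing from larger):

```python
import pandas as pd

smaller = pd.Series(["a", "g", "b", "k"])
larger = pd.Series(["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m"])

# For each element of smaller, get_indexer returns its
# integer position in larger (or -1 if it is missing).
ans = pd.Index(larger).get_indexer(smaller).tolist()
print(ans)  # [0, 6, 1, 10]
```

Note that get_indexer requires the values of larger to be unique, which holds in this example.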
for i in list(smaller):
    if i in list(larger):
        print(list(larger).index(i))
This will get you the desired output.
Using Series get
pd.Series(larger.index, larger.values).get(smaller)
Out[8]:
a 0
g 6
b 1
k 10
dtype: int64
try this :)
import pandas as pd
larger = pd.Series(["a","b","c","d","e","f","g","h","i","j","k","l","m"])
smaller = pd.Series(["a","g","b","k"])
res = pd.Series(larger.index, larger.values).reindex(smaller.values, copy=True)
print(res)

finding the max of a column in an array

def maxvalues():
    for n in range(1, 15):
        dummy = []
        for k in range(len(MotionsAndMoorings)):
            dummy.append(MotionsAndMoorings[k][n])
        max(dummy)
        L = [x + [max(dummy)]]  ## to be corrected (adding columns with value max(dummy))
        ## suggest code to add a new row to L; on the next function call it should save values here.
I have an array of size (k x n) and I need to pick the max values of the first column in that array. Please suggest if there is a simpler way other than what I tried. My main aim is to append it to L in columns rather than rows; if I just append, it adds values at the end. I would like this to be done in columns for row 0 in L, because I'll call this function again and add a new row to L and do the same. Please suggest.
General suggestions for your code
First of all it's not very handy to access globals in a function. It works but it's not considered good style. So instead of using:
def maxvalues():
    do_something_with(MotionsAndMoorings)
you should do it with an argument:
def maxvalues(array):
    do_something_with(array)

MotionsAndMoorings = something
maxvalues(MotionsAndMoorings)  # pass it to the function
The next strange thing is that you seem to exclude the first column of your array:
for n in range(1,15):
I think that's unintended. The first element of a list has the index 0 and not 1. So I guess you wanted to write:
for n in range(0,15):
or even better, for arbitrary lengths:
for n in range(len(array[0])): # I chose the first row length here not the number of columns
Alternatives to your iterations
But this would not be very intuitive, because the max function already accepts a very useful keyword argument (key), so you don't need to iterate over the whole array:
import operator
column = 2
max(array, key=operator.itemgetter(column))[column]
This will return the row where the i-th element is maximal (you just pick your wanted column as that element). But max returns the whole row, so you need to extract just the i-th element.
So to get a list of all your maximums for each column you could do:
[max(array, key=operator.itemgetter(column))[column] for column in range(len(array[0]))]
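A concrete run of the comprehension above, on a small made-up sample array for illustration:

```python
import operator

array = [[1, 2, 3], [4, 5, 6], [0, 2, 10]]

# For each column index, find the row whose entry in that column
# is largest, then pull out just that entry.
maxima = [max(array, key=operator.itemgetter(col))[col]
          for col in range(len(array[0]))]
print(maxima)  # [4, 5, 10]
```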
For your L, I'm not sure what it is, but for that you should probably also pass it as an argument to the function:
def maxvalues(array, L):  # another argument here
Since I don't know what x and L are supposed to be, I'll not go further into that. But it looks like you want to turn the columns of MotionsAndMoorings into rows and the rows into columns. If so, you can do it with:
dummy = [[MotionsAndMoorings[j][i] for j in range(len(MotionsAndMoorings))] for i in range(len(MotionsAndMoorings[0]))]
that's a list comprehension that converts a list like:
[[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]
to an "inverted" column/row list:
[[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]
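The same inversion can be written more compactly with the built-in zip, which groups the i-th element of every row together:

```python
# zip(*rows) transposes a list of equal-length lists:
# each output row collects the i-th element of every input row.
data = [[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]
transposed = [list(col) for col in zip(*data)]
print(transposed)  # [[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]
```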
Alternative packages
But like roadrunner66 already said sometimes it's easiest to use a library like numpy or pandas that already has very advanced and fast functions that do exactly what you want and are very easy to use.
For example, you convert a Python list to a numpy array simply by:
import numpy as np
Motions_numpy = np.array(MotionsAndMoorings)
you get the maximum of the columns by using:
maximums_columns = np.max(Motions_numpy, axis=0)
You don't even need to convert it to an np.array to use np.max or to transpose it (turn rows into columns and columns into rows):
transposed = np.transpose(MotionsAndMoorings)
I hope this answer is not too unstructured. Some parts are suggestions for your function and some are alternatives. Pick the parts that you need, and if you have any trouble, just leave a comment or ask another question. :-)
An example with a random input array, showing that you can take the max in either axis easily with one command.
import numpy as np
aa = np.random.random([4, 3])
print(aa)
print()
print(np.max(aa, axis=0))
print()
print(np.max(aa, axis=1))
Output:
[[ 0.51972266 0.35930957 0.60381998]
[ 0.34577217 0.27908173 0.52146593]
[ 0.12101346 0.52268843 0.41704152]
[ 0.24181773 0.40747905 0.14980534]]
[ 0.51972266 0.52268843 0.60381998]
[ 0.60381998 0.52146593 0.52268843 0.40747905]

making multiple pandas data frames using a loop or list comprehension

I have a Python data frame that I want to subdivide by row BUT in 32 different slices (think of a large data set chopped by row into 32 smaller data sets). I can manually divide the data frames in this way:
df_a = df[df['Type']=='BROKEN PELVIS']
df_b = df[df['Type']=='ABDOMINAL STRAIN']
I'm assuming there is a much more Pythonic expression someone might like to share. I'm looking for something along the lines of:
for i in new1:
    df_%s = df[df['#RIC']=='%s'] , %i
Hope that makes sense.
In this kind of situation I think it's more pythonic to store the DataFrames in a Python dictionary:
injuries = {injury: df[df['Type'] == injury] for injury in df['Type'].unique()}
injuries['BROKEN PELVIS'] # is the same as df_a above
Most of the time you don't need to create a new DataFrame but can use a groupby (it depends on what you're doing next); see http://pandas.pydata.org/pandas-docs/stable/groupby.html:
g = df.groupby('Type')
Update: in fact there is a method get_group to access these:
In [21]: df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])
In [22]: g = df.groupby(0)
In [23]: g.get_group('A')
Out[23]:
0 1
0 A 2
1 A 4
Note: most of the time you don't need to do this, apply, aggregate and transform are your friends!
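If you really do want all the sub-frames at once, iterating a GroupBy yields (key, group) pairs, so a dict comprehension materializes every slice in one pass (a sketch using the same small frame as above):

```python
import pandas as pd

df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])

# Iterating a GroupBy yields (key, group) pairs;
# collect them into a dict of sub-DataFrames.
groups = {key: sub for key, sub in df.groupby(0)}
print(groups['A'])
```

This is equivalent to calling get_group for every key, but walks the data only once.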
