Retrieve data in Pandas - python

I am using pandas and uproot to read data from a .root file, and I get a table like the following one:
The aforementioned table is made with the following code:
import uproot

fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
branches = ['Z1Flav', 'Z2Flav', 'nCleanedJetsPt30', 'LepPt', 'LepLepId']
df = ttree.pandas.df(branches, flatten=False)
I need to find the maximum value in LepPt, and, once I have found the maximum, I also need to retrieve the LepLepId corresponding to that maximum value.
I have no problem in finding the maximum values:
Pt_l1 = [max(i) for i in df.LepPt]
In this way I get an array with all the maximum values. However, I have to separate these values according to LepLepId: I need one array with the maximum LepPt where |LepLepId| = 11 and another where |LepLepId| = 13.
If someone could give me any hint, advice and/or suggestion, I would be very grateful.

I made some mock data since you didn't provide yours in any easy format. I think this is what you are looking for.
import pandas as pd

df = pd.DataFrame.from_records(
    [[[1, 2, 3], [4, 5, 6]],
     [[4, 6, 5], [7, 8, 9]]],
    columns=['LepPt', 'LepLepId']
)
df['max_LepPt'] = [max(i) for i in df.LepPt]

def f(row):
    # get index position of the maximum within the LepPt list
    pos = row['LepPt'].index(row['max_LepPt'])
    return row['LepLepId'][pos]

df['same_index_LepLepId'] = df.apply(f, axis=1)
returns:
       LepPt   LepLepId  max_LepPt  same_index_LepLepId
0  [1, 2, 3]  [4, 5, 6]          3                    6
1  [4, 6, 5]  [7, 8, 9]          6                    8
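Not part of the original answer, but a hedged sketch of the final step the question asks about: with the max_LepPt and same_index_LepLepId columns in place, the split by |LepLepId| (11 for electrons, 13 for muons, in the real data) is a plain boolean mask:

mask_e = df['same_index_LepLepId'].abs() == 11    # electrons (in the real data)
mask_mu = df['same_index_LepLepId'].abs() == 13   # muons (in the real data)
Pt_e = df.loc[mask_e, 'max_LepPt'].to_numpy()
Pt_mu = df.loc[mask_mu, 'max_LepPt'].to_numpy()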

You could use the awkward.JaggedArray interface for this (one of the dependencies of uproot), which allows you to have irregularly sized arrays.
For this you would need to slightly change the way you load the data, but it allows you to use the same methods you would use with a normal numpy array, namely argmax:
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
# branches = ['Z1Flav', 'Z2Flav', 'nCleanedJetsPt30', 'LepPt', 'LepLepId']
branches = ['LepPt', 'LepLepId'] # to save memory, only load what you need
# df = ttree.pandas.df(branches, flatten=False)
a = ttree.arrays(branches) # use awkward array interface
max_pt_idx = a[b'LepPt'].argmax()
max_pt_lepton_id = a[b'LepLepId'][max_pt_idx].flatten()
This is then just a normal numpy array, which you can assign to a column of a pandas dataframe if you want to. It should have the right dimensionality and order. It should also be faster than using the built-in Python functions.
Note that the keys are bytestrings instead of normal strings, and that you will have to take some extra steps if there are events with no leptons (in which case flatten will ignore those empty events, destroying the alignment).
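As a hedged follow-up that is not part of the original answer: once max_pt_lepton_id exists as a flat numpy array (and assuming every event has at least one lepton, so the alignment caveat above does not apply), the electron/muon split asked for in the question is again a boolean mask:

import numpy as np

max_pt = a[b'LepPt'][max_pt_idx].flatten()      # maximum LepPt per event
is_electron = np.abs(max_pt_lepton_id) == 11    # |LepLepId| == 11
is_muon = np.abs(max_pt_lepton_id) == 13        # |LepLepId| == 13
Pt_e, Pt_mu = max_pt[is_electron], max_pt[is_muon]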
Alternatively, you can also convert the columns afterwards:
import awkward
df = ttree.pandas.df(branches, flatten=False)
max_pt_idx = awkward.fromiter(df["LepPt"]).argmax()
lepton_id = awkward.fromiter(df["LepLepId"])
df["max_pt_lepton_id"] = lepton_id[max_pt_idx].flatten()
The former will be faster if you don't need the columns again afterwards, otherwise the latter might be better.

Related

Pandas interval multiindex

I need to create a data structure allowing indexing via a tuple of floats. Each dimension of the tuple represents one parameter. Each parameter spans a continuous range, and to be able to perform my work, I binned the range into categories.
Then, I want to create a dataframe with a MultiIndex, each dimension of the index referring to a parameter with the defined categories:
import pandas as pd
import numpy as np
index = pd.interval_range(start=0, end=10, periods = 5, closed='both')
index2 = pd.interval_range(start=20, end=30, periods = 3, closed='both')
index3 = pd.MultiIndex.from_product([index,index2])
dataStructure = pd.DataFrame(np.zeros((5*3,1)), index = index3)
print(dataStructure)
I checked that the interval_range provides me with the necessary methods, e.g.
index.get_loc(2.5)
would give me the right answer. However, I can't extend this to the dataframe or the MultiIndex:
index3.get_loc((2.5,21))
does not work. Any idea? I managed to get this working yesterday somehow, so I am 99% convinced there is a simple way to make it work. But my Jupyter notebook was in the cloud, the server crashed, and the notebook has been lost. I became dumber overnight, apparently.
I think selecting by tuple is not implemented yet. A possible solution is to get the positions for each level separately with Index.get_level_values, take their intersection with intersect1d, and finally select by iloc:
idx1 = df.index.get_level_values(0).get_loc(2.5)
idx2 = df.index.get_level_values(1).get_loc(21)
df1 = df.iloc[np.intersect1d(idx1, idx2)]
print(df1)
                                     0
[2, 4] [20.0, 23.333333333333332]  0.0
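If get_loc on the MultiIndex keeps failing, here is a hedged alternative not taken from the answer above: it only relies on the fact that a scalar can be tested against a pandas Interval with the in operator, rebuilding the question's frame and selecting the row whose two intervals contain the point (2.5, 21). It scans all rows, so it is a fallback rather than an indexed lookup:

import numpy as np
import pandas as pd

index = pd.interval_range(start=0, end=10, periods=5, closed='both')
index2 = pd.interval_range(start=20, end=30, periods=3, closed='both')
index3 = pd.MultiIndex.from_product([index, index2])
dataStructure = pd.DataFrame(np.zeros((5 * 3, 1)), index=index3)

# keep the rows where the first interval contains 2.5 and the second contains 21
mask = np.array([(2.5 in a) and (21 in b) for a, b in dataStructure.index])
print(dataStructure[mask])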

How to insert a multidimensional numpy array to pandas column?

I have some numpy array, whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.
I want to create a new column in the dataframe, for which each entry would be a numpy array of a lesser dimension.
Code:
some_df = pd.DataFrame(columns=['A'])

for i in range(10):
    some_df.loc[i] = [np.random.rand(4, 6, 8)]

data = np.stack(some_df['A'].values)  # shape (10, 4, 6, 8)
processed = np.max(data, axis=1)      # shape (10, 6, 8)

some_df['B'] = processed  # This fails
I want the new column 'B' to contain numpy arrays of shape (6, 8)
How can this be done?
This is not recommended; it is painful and slow, and later processing is not easy.
One possible solution is use list comprehension:
some_df['B'] = [x for x in processed]
Or convert to list and assign:
some_df['B'] = processed.tolist()
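For completeness, a hedged, runnable sketch of the question's setup (with the missing bracket fixed) showing the list-comprehension assignment in context:

import numpy as np
import pandas as pd

some_df = pd.DataFrame(columns=['A'])
for i in range(10):
    some_df.loc[i] = [np.random.rand(4, 6, 8)]

data = np.stack(some_df['A'].values)   # shape (10, 4, 6, 8)
processed = np.max(data, axis=1)       # shape (10, 6, 8)

some_df['B'] = [x for x in processed]  # each cell now holds a (6, 8) array
print(some_df['B'].iloc[0].shape)      # (6, 8)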
Coming back to this after 2 years, here is a much better practice:
from itertools import product, chain
import pandas as pd
import numpy as np
from typing import Dict

def calc_col_names(named_shape):
    *prefix, shape = named_shape
    names = [map(str, range(i)) for i in shape]
    return map('_'.join, product(prefix, *names))

def create_flat_columns_df_from_dict_of_numpy(
    named_np: Dict[str, np.ndarray],
    n_samples_per_np: int,
):
    named_np_correct_length = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
    flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_length.values()]
    stacked_nps = np.column_stack(flat_nps)
    named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_length.items()]
    col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
    df = pd.DataFrame(stacked_nps, columns=col_names)
    df = df.convert_dtypes()
    return df

def parse_series_into_np(df, col_name, shp):
    # the shape could also be parsed from the column names
    n_samples = len(df)
    col_names = sorted(c for c in df.columns if col_name in c)
    col_names = list(filter(lambda c: c.startswith(col_name + "_") or len(col_names) == 1, col_names))
    col_as_np = df[col_names].astype(float).values.reshape((n_samples, *shp))
    return col_as_np
Usage, to put an ndarray into a DataFrame:
full_rate_df = create_flat_columns_df_from_dict_of_numpy(
    named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
    n_samples_per_np=d["name1"].shape[0]
)
where d is a dict of ndarrays with the same shape[0], keyed by ["name1", "name2"].
The reverse operation can be obtained by parse_series_into_np.
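A hedged usage sketch of the reverse direction, reusing the hypothetical names d and full_rate_df from above:

# recover the original (n_samples, ...) array for "name1" from the flat columns
name1_back = parse_series_into_np(full_rate_df, "name1", d["name1"].shape[1:])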
The accepted answer remains, as it answers the original question, but this one is a much better practice.
I know this question already has an answer, but I would like to add a much more scalable way of doing this. As mentioned in the comments above, it is in general not recommended to store arrays as "field" values in a pandas DataFrame column (I actually do not know why). Nevertheless, in my day-to-day work this is an extremely important functionality when working with time-series data and a bunch of related meta-data.
In general I organize my experimental time-series in the form of pandas dataframes, with one column holding same-length numpy arrays and the other columns containing information on meta-data with respect to certain measurement conditions etc.
The proposed solution by jezrael works very well, and I have used it on a regular basis for the last 4 years. But this method can run into huge memory problems. In my case I came across these problems working with dataframes beyond 5 million rows and time-series with approx. 100 data points each.
The solution to these problems is extremely simple; since I did not find it anywhere else, I just wanted to share it here: simply transform your 2D array into a pandas Series object and assign it to a column of your dataframe:
df["new_list_column"] = pd.Series(list(numpy_array_2D))

All indices of each unique element in a list python

I'm working with a very large data set (about 75 million entries) and I'm trying to shorten the time my code takes to run by a significant margin (with a loop, right now it would take a couple of days) while keeping memory usage extremely low.
I have two numpy arrays (clients and units) of the same length. My goal is to get a list of every index at which a value occurs in my first array (clients) and then sum the entries of my second array at those indices.
This is what I've tried (np is the previously imported numpy library)
# create a list of each value that appears in clients
unq = np.unique(clients)
arr = np.zeros(len(unq))
tmp = np.arange(len(clients))

# for each unique value i in clients
for i in range(len(unq)):
    # create a list inds of all the indices where unq[i] occurs in clients
    inds = tmp[clients == unq[i]]
    # add the sum of all the elements in units at the indices inds
    arr[i] = sum(units[inds])
Does anyone know a method that will allow me to find these sums without looping through each element in unq?
With Pandas, this can easily be done using the groupby() function:
import pandas as pd
# some fake data
df = pd.DataFrame({'clients': ['a', 'b', 'a', 'a'], 'units': [1, 1, 1, 1]})
print(df.groupby(['clients'], sort=False).sum())
which gives you the desired output:
         units
clients
a            3
b            1
I use the sort=False option since that might lead to a speed-up (by default the entries will be sorted, which can take some time for huge datasets).
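A hedged sketch applying the same groupby idea directly to numpy arrays named like those in the question (the values here are made up):

import numpy as np
import pandas as pd

clients = np.array(['a', 'b', 'a', 'a'])
units = np.array([1, 1, 1, 1])

sums = (pd.DataFrame({'clients': clients, 'units': units})
          .groupby('clients', sort=False)['units'].sum())
print(sums.to_numpy())   # per-client sums, in order of first appearance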
This is a typical group-by type operation, which can be performed elegantly and efficiently using the numpy-indexed package (disclaimer: I am its author):
import numpy_indexed as npi
unique_clients, units_per_client = npi.group_by(clients).sum(units)
Note that unlike the pandas approach, there is no need to create a temporary data structure just to perform this kind of elementary operation.

finding the max of a column in an array

def maxvalues():
    for n in range(1, 15):
        dummy = []
        for k in range(len(MotionsAndMoorings)):
            dummy.append(MotionsAndMoorings[k][n])
        max(dummy)
        L = [x + [max(dummy)]]  ## to be corrected (adding columns with value max(dummy))
        ## suggest code to add new row to L and for next function call, it should save values here.
I have an array of size (k x n) and I need to pick the max values of the first column in that array. Please suggest if there is a simpler way other than what I tried. My main aim is to append it to L in columns rather than rows; if I just append, it adds values at the end. I would like this to be done in columns for row 0 in L, because I'll call this function again and add a new row to L and do the same. Please suggest.
General suggestions for your code
First of all, it's not very handy to access globals in a function. It works, but it's not considered good style. So instead of using:
def maxvalues():
    do_something_with(MotionsAndMoorings)

you should do it with an argument:

def maxvalues(array):
    do_something_with(array)

MotionsAndMoorings = something
maxvalues(MotionsAndMoorings)  # pass it to the function
The next strange thing is that you seem to exclude the first column of your array:
for n in range(1,15):
I think that's unintended. The first element of a list has the index 0 and not 1. So I guess you wanted to write:
for n in range(0,15):
or even better for arbitrary lengths:
for n in range(len(array[0])):  # length of the first row, i.e. the number of columns
Alternatives to your iterations
But this would not be very intuitive, because the max function already accepts a very useful keyword argument (key), so you don't need to iterate over the whole array yourself:
import operator
column = 2
max(array, key=operator.itemgetter(column))[column]
This will return the row where the i-th element is maximal (you just pick your wanted column as that element). But max returns the whole row, so you need to extract just the i-th element.
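A small concrete example of the key/itemgetter approach (the array here is made up):

import operator

array = [[1, 2, 3], [4, 5, 6], [0, 2, 10]]
column = 2
row_with_max = max(array, key=operator.itemgetter(column))
print(row_with_max)          # [0, 2, 10] -> the row whose third element is largest
print(row_with_max[column])  # 10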
So to get a list of all your maximums for each column you could do:
[max(array, key=operator.itemgetter(column))[column] for column in range(len(array[0]))]
As for your L, I'm not sure what it is, but you should probably also pass it as an argument to the function:
def maxvalues(array, L): # another argument here
but since I don't know what x and L are supposed to be, I'll not go further into that. But it looks like you want to turn the columns of MotionsAndMoorings into rows and the rows into columns. If so, you can just do it with:
dummy = [[MotionsAndMoorings[j][i] for j in range(len(MotionsAndMoorings))] for i in range(len(MotionsAndMoorings[0]))]
that's a list comprehension that converts a list like:
[[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]
to an "inverted" column/row list:
[[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]
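Just as an aside (not from the original answer), the same transposition can be written more compactly with zip:

matrix = [[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]
transposed = [list(col) for col in zip(*matrix)]
print(transposed)   # [[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]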
Alternative packages
But like roadrunner66 already said sometimes it's easiest to use a library like numpy or pandas that already has very advanced and fast functions that do exactly what you want and are very easy to use.
For example, you convert a python list to a numpy array simply by:
import numpy as np
Motions_numpy = np.array(MotionsAndMoorings)
you get the maximum of the columns by using:
maximums_columns = np.max(Motions_numpy, axis=0)
you don't even need to convert it to an np.array to use np.max, or to transpose it (turn rows into columns and columns into rows):
transposed = np.transpose(MotionsAndMoorings)
I hope this answer is not too unstructured. Some parts are suggestions for your function and some are alternatives. You should pick the parts that you need, and if you have any trouble with them, just leave a comment or ask another question. :-)
An example with a random input array, showing that you can take the max in either axis easily with one command.
import numpy as np

aa = np.random.random([4, 3])
print(aa)
print()
print(np.max(aa, axis=0))
print()
print(np.max(aa, axis=1))
Output:
[[ 0.51972266  0.35930957  0.60381998]
 [ 0.34577217  0.27908173  0.52146593]
 [ 0.12101346  0.52268843  0.41704152]
 [ 0.24181773  0.40747905  0.14980534]]

[ 0.51972266  0.52268843  0.60381998]

[ 0.60381998  0.52146593  0.52268843  0.40747905]

Should pandas dataframes be nested?

I am creating a Python script that drives an old Fortran code to locate earthquakes. I want to vary the input parameters to the Fortran code in the Python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (i.e., a dataframe assigned to an element of another dataframe). So for example:
import pandas as pd
import numpy as np

def some_operation(row):
    results = np.random.rand(50, 3) * row['p1'] / row['p2']
    res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
    return res

# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object)  # make sure generic types can be used

# loop over each row, call some_operation and store the results DataFrame
for ind, row in df_master.iterrows():
    df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe as a convenient container, because the column/index notation is quite slick. If DataFrames should not be used in this way, is there a similar alternative? I was looking at the Panel class, but I am not sure if it is the proper solution for my application. I would hate to forge ahead and apply the hack shown above to some code and then have it not supported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
>>> df2 = pd.DataFrame({'a':[100], 'b':[200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
     a    b
0  100  200
1    2    5
2    3    6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do) I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store the trained scikit-learn estimators in cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...)
