In Pandas, I can use .apply to apply functions to two columns. For example,
df = pd.DataFrame({'A':['a', 'a', 'a', 'b'], 'B':[3, 3, 2, 5], 'C':[2, 2, 2, 8]})
formula = lambda x: (x.B + x.C)**2
df.apply(formula, axis=1)
But notice that the results on the first two rows are the same, since all the inputs are the same. In a large dataset with complicated operations, these repeated calculations are likely to slow down my program. Is there a way I can program it so that I save time on these repeated calculations?
You can use a technique called memoization. For functions which accept hashable arguments, you can use the built-in functools.lru_cache.
from functools import lru_cache
@lru_cache(maxsize=None)
def cached_function(B, C):
    return (B + C)**2

def formula(x):
    # forward plain scalars so lru_cache can hash the arguments
    return cached_function(x.B, x.C)
Notice that I had to pass the values through to the cached function for lru_cache to work correctly because Series objects aren't hashable.
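With these definitions in place, the call from the question works unchanged, and each distinct (B, C) pair is only computed once; a quick check, assuming the df and functions defined above:
df.apply(formula, axis=1)
print(cached_function.cache_info())  # for the example df: hits=1, misses=3, i.e. the duplicate row was served from the cache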
You could use np.unique to create a copy of the dataframe consisting of only the unique rows, then do the calculation on those, and construct the full results.
For example:
import numpy as np
# convert to records for use with numpy
rec = df.to_records(index=False)
arr, ind = np.unique(rec, return_inverse=True)
# find dataframe of unique rows
df_small = pd.DataFrame(arr)
# Apply the formula & construct the full result
df_small.apply(formula, axis=1).iloc[ind].reset_index(drop=True)
Even faster than using apply here would be to use broadcasting: for example, simply compute
(df.B + df.C) ** 2
If this is still too slow, you can use this method on the de-duplicated dataframe, as above.
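A rough sketch of combining the two ideas, i.e. vectorized arithmetic on the de-duplicated rows followed by re-expansion with the inverse index (reusing the example df from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b'], 'B': [3, 3, 2, 5], 'C': [2, 2, 2, 8]})
rec = df.to_records(index=False)
arr, ind = np.unique(rec, return_inverse=True)
df_small = pd.DataFrame(arr)
# vectorized computation on the unique rows only, then expanded back to the full length
result = ((df_small.B + df_small.C) ** 2).iloc[ind].reset_index(drop=True)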
I am using pandas and uproot to read data from a .root file, and I get a table like the following one:
The aforementioned table is made with the following code:
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
branches = ['Z1Flav', 'Z2Flav', 'nCleanedJetsPt30', 'LepPt', 'LepLepId']
df = ttree.pandas.df(branches, flatten=False)
I need to find the maximum value in LepPt, and, once the maximum is found, I also need to retrieve the LepLepId of that maximum value.
I have no problem in finding the maximum values:
Pt_l1 = [max(i) for i in df.LepPt]
In this way I get an array with all the maximum values. However, I have to separate these values according to the LepLepId. So I need one array with the maximum LepPt where |LepLepId| = 11 and one with the maximum LepPt where |LepLepId| = 13.
If someone could give me any hint, advice and/or suggestion, I would be very grateful.
I made some mock data since you didn't provide yours in any easy format. I think this is what you are looking for.
import pandas as pd
df = pd.DataFrame.from_records(
    [[[1, 2, 3], [4, 5, 6]],
     [[4, 6, 5], [7, 8, 9]]],
    columns=['LepPt', 'LepLepId']
)
df['max_LepPt'] = [max(i) for i in df.LepPt]
def f(row):
    # get the index position of the maximum within the list
    pos = row['LepPt'].index(row['max_LepPt'])
    return row['LepLepId'][pos]

df['same_index_LepLepId'] = df.apply(f, axis=1)
returns:
       LepPt   LepLepId  max_LepPt  same_index_LepLepId
0  [1, 2, 3]  [4, 5, 6]          3                    6
1  [4, 6, 5]  [7, 8, 9]          6                    8
You could use the awkward.JaggedArray interface for this (one of the dependencies of uproot), which allows you to have irregularly sized arrays.
For this you would need to slightly change the way you load the data, but it allows you to use the same methods you would use with a normal numpy array, namely argmax:
fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
# branches = ['Z1Flav', 'Z2Flav', 'nCleanedJetsPt30', 'LepPt', 'LepLepId']
branches = ['LepPt', 'LepLepId'] # to save memory, only load what you need
# df = ttree.pandas.df(branches, flatten=False)
a = ttree.arrays(branches) # use awkward array interface
max_pt_idx = a[b'LepPt'].argmax()
max_pt_lepton_id = a[b'LepLepId'][max_pt_idx].flatten()
This is then just a normal numpy array, which you can assign to a column of a pandas dataframe if you want to. It should have the right dimensionality and order. It should also be faster than using the built-in Python functions.
Note that the keys are bytestrings instead of normal strings, and that you will have to take some extra steps if there are events with no leptons (in which case flatten will ignore those empty events, destroying the alignment).
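If empty events are possible, one way to keep the alignment (a hedged sketch, assuming the uproot 3 / awkward 0.x interface used above) is to select events with at least one lepton before taking the argmax:
# sketch assuming the uproot 3 / awkward 0.x API from the code above
pt = a[b'LepPt']
lep_id = a[b'LepLepId']
has_leptons = pt.counts > 0                 # boolean mask over events
max_pt_idx = pt[has_leptons].argmax()       # per-event index of the maximum pT
max_pt_lepton_id = lep_id[has_leptons][max_pt_idx].flatten()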
Alternatively, you can also convert the columns afterwards:
import awkward
df = ttree.pandas.df(branches, flatten=False)
max_pt_idx = awkward.fromiter(df["LepPt"]).argmax()
lepton_id = awkward.fromiter(df["LepLepId"])
df["max_pt_lepton_id"] = lepton_id[max_pt_idx].flatten()
The former will be faster if you don't need the columns again afterwards, otherwise the latter might be better.
I have some numpy array, whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.
I want to create a new column in the dataframe, for which each entry would be a numpy array of a lesser dimension.
Code:
import numpy as np
import pandas as pd

some_df = pd.DataFrame(columns=['A'])
for i in range(10):
    some_df.loc[i] = [np.random.rand(4, 6, 8)]

data = np.stack(some_df['A'].values)  # shape (10, 4, 6, 8)
processed = np.max(data, axis=1)      # shape (10, 6, 8)
some_df['B'] = processed              # This fails
I want the new column 'B' to contain numpy arrays of shape (6, 8)
How can this be done?
This is not recommended: it is painful, slow, and later processing is not easy.
One possible solution is to use a list comprehension:
some_df['B'] = [x for x in processed]
Or convert to list and assign:
some_df['B'] = processed.tolist()
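One small difference worth noting: processed.tolist() turns each cell into a nested Python list, while the list comprehension keeps the cells as numpy arrays; a quick check on the example above:
some_df['B'] = [x for x in processed]
print(type(some_df['B'].iloc[0]), some_df['B'].iloc[0].shape)  # <class 'numpy.ndarray'> (6, 8)

some_df['B'] = processed.tolist()
print(type(some_df['B'].iloc[0]))  # <class 'list'>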
Coming back to this after 2 years, here is a much better practice:
from itertools import product, chain
import pandas as pd
import numpy as np
from typing import Dict
def calc_col_names(named_shape):
    *prefix, shape = named_shape
    names = [map(str, range(i)) for i in shape]
    return map('_'.join, product(prefix, *names))

def create_flat_columns_df_from_dict_of_numpy(
    named_np: Dict[str, np.array],
    n_samples_per_np: int,
):
    named_np_correct_length = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
    flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_length.values()]
    stacked_nps = np.column_stack(flat_nps)
    named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_length.items()]
    col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
    df = pd.DataFrame(stacked_nps, columns=col_names)
    df = df.convert_dtypes()
    return df

def parse_series_into_np(df, col_name, shp):
    # the shape could also be parsed from the column names
    n_samples = len(df)
    col_names = sorted(c for c in df.columns if col_name in c)
    col_names = list(filter(lambda c: c.startswith(col_name + "_") or len(col_names) == 1, col_names))
    col_as_np = df[col_names].astype(float).values.reshape((n_samples, *shp))
    return col_as_np
Usage to put an ndarray into a DataFrame:
full_rate_df = create_flat_columns_df_from_dict_of_numpy(
named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
n_samples_per_np=d["name1"].shape[0]
)
where d is a dict of ndarrays with the same shape[0], keyed by ["name1", "name2"].
The reverse operation can be obtained by parse_series_into_np.
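A round-trip sketch of that reverse operation, assuming the same hypothetical d as above:
name1_back = parse_series_into_np(full_rate_df, "name1", d["name1"].shape[1:])
assert name1_back.shape == d["name1"].shape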
The accepted answer remains, as it answers the original question, but this one is a much better practice.
I know this question already has an answer, but I would like to add a much more scalable way of doing this. As mentioned in the comments above, it is in general not recommended to store arrays as "field" values in a pandas DataFrame column (I actually do not know why). Nevertheless, in my day-to-day work this is an extremely important functionality when working with time-series data and a bunch of related metadata.
In general I organize my experimental time-series in the form of pandas dataframes, with one column holding same-length numpy arrays and the other columns containing information on metadata with respect to certain measurement conditions etc.
The proposed solution by jezrael works very well, and I used it on a regular basis for the last 4 years. But this method can potentially run into huge memory problems. In my case I came across these problems when working with dataframes beyond 5 million rows and time-series with approx. 100 data points.
The solution to these problems is extremely simple, and since I did not find it anywhere I just wanted to share it here: simply transform your 2D array into a pandas Series object and assign it to a column of your dataframe:
df["new_list_column"] = pd.Series(list(numpy_array_2D))
Lists or numpy arrays can be unpacked into multiple variables if the dimensions match. For a 2xN array, the following will work:
import numpy as np
a,b = [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3], b=[4,5,6]
How can I achieve a similar behaviour for the columns of a pandas DataFrame? Extending the above example:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C'] # Rename cols and
df.index = ['i', 'ii'] # rows for clarity
The following does not work as expected:
a,b = df.T
# result: a='i', b='ii'
a,b,c = df
# result: a='A', b='B', c='C'
However, what I would like to get is the following:
a,b,c = unpack(df)
# result: a=df['A'], b=df['B'], c=df['C']
Is the function unpack already available in pandas? Or can it be mimicked in an easy way?
I just figured out that the following works, which is already close to what I'm trying to achieve:
a,b,c = df.T.values # Common
a,b,c = df.T.to_numpy() # Recommended
# a,b,c = df.T.as_matrix() # Deprecated
Details: As always, things are a little more complicated than one thinks. Note that a pd.DataFrame stores columns separately in Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns in a single ndarray, which likely involves copying actions and type conversions. Also, the resulting container has a single dtype able to accommodate all data in the data frame.
In summary, the above approach loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate the columns in one of the following ways (there are more options):
# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items()) # returns pd.Series
a,b,c = (df[c] for c in df) # returns pd.Series
Note that the above creates views! Modifying the data likely will trigger a SettingWithCopyWarning.
a.iloc[0] = "blabla" # raises SettingWithCopyWarning
If you want to modify the unpacked variables, you have to copy the columns.
# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items()) # returns pd.Series
a,b,c = (df[c].copy() for c in df) # returns pd.Series
a,b,c = (df[c].to_numpy() for c in df) # returns np.ndarray
While this is cleaner, it requires more characters. I personally do not recommend the above approach for production code. But to avoid typing (e.g., in interactive shell sessions), it is still a fair option...
# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]
The df.values method shown above is indeed a good solution, but it involves building a numpy array.
In case you want to access pandas Series methods after unpacking, I personally use a different approach.
For people like me who use a lot of chained methods, I have a solution: adding a custom unpacking method to pandas. Note that this may not be very good for production pipelines, but it is very handy in ad-hoc data analyses.
df = pd.DataFrame({
    "lat": [30, 40],
    "lon": [0, 1],
})
This approach involves returning a generator on a .unpack() call.
from typing import Tuple
def unpack(self: pd.DataFrame) -> Tuple[pd.Series]:
    return (
        self[col]
        for col in self.columns
    )
pd.DataFrame.unpack = unpack
This can be used in two major ways.
Either directly as a solution to your problem:
lat, lon = df.unpack()
Or, can be used in a method chaining.
Imagine a geo function that has to take a latitude series as its first argument and a longitude series as its second, named do_something_geographical(lat, lon):
df_result = (
    df
    .(...some method chaining...)
    .assign(
        geographic_result=lambda dataframe: do_something_geographical(*dataframe[["lat", "lon"]].unpack())
    )
    .(...some method chaining...)
)
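For completeness, a self-contained sketch of the pattern, with a dummy do_something_geographical standing in for the real function (and the generator unpacked with *):
import pandas as pd

def do_something_geographical(lat, lon):
    # dummy stand-in: any function taking a latitude series and a longitude series
    return lat + lon

def unpack(self):
    return (self[col] for col in self.columns)

pd.DataFrame.unpack = unpack

df = pd.DataFrame({"lat": [30, 40], "lon": [0, 1]})
df_result = df.assign(
    geographic_result=lambda dataframe: do_something_geographical(*dataframe[["lat", "lon"]].unpack())
)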
I'm working with a very large data set (about 75 million entries) and I'm trying to shorten the time my code takes to run by a significant margin (with a loop it currently takes a couple of days) while keeping memory usage extremely low.
I have two numpy arrays (clients and units) of the same length. My goal is to get a list of every index at which a value occurs in my first array (clients), and then find the sum of the entries in my second array (units) at each of those indices.
This is what I've tried (np is the previously imported numpy library)
# create a list of each value that appears in clients
unq = np.unique(clients)
arr = np.zeros(len(unq))
tmp = np.arange(len(clients))
# for each unique value unq[i] in clients
for i in range(len(unq)):
    # create a list inds of all the indices at which unq[i] occurs in clients
    inds = tmp[clients == unq[i]]
    # add the sum of all the elements in units at the indices inds to the result array
    arr[i] = sum(units[inds])
Does anyone know a method that will allow me to find these sums without looping through each element in unq?
With Pandas, this can easily be done using the groupby() function:
import pandas as pd
# some fake data
df = pd.DataFrame({'clients': ['a', 'b', 'a', 'a'], 'units': [1, 1, 1, 1]})
print(df.groupby(['clients'], sort=False).sum())
which gives you the desired output:
units
clients
a 3
b 1
I use the sort=False option since that might lead to a speed-up (by default the entries will be sorted, which can take some time for huge datasets).
This is a typical group-by type operation, which can be performed elegantly and efficiently using the numpy-indexed package (disclaimer: I am its author):
import numpy_indexed as npi
unique_clients, units_per_client = npi.group_by(clients).sum(units)
Note that unlike the pandas approach, there is no need to create a temporary data structure just to perform this kind of elementary operation.
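For reference, the same group-by sum can also be written in plain numpy, using the inverse index from np.unique as bin labels for np.bincount; a small sketch:
import numpy as np

clients = np.array(['a', 'b', 'a', 'a'])
units = np.array([1, 1, 1, 1])

unq, inv = np.unique(clients, return_inverse=True)
sums = np.bincount(inv, weights=units)  # one sum per unique client, ordered as in unq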
I have a large (about 12M rows) DataFrame df:
df.columns = ['word','documents','frequency']
The following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
When you want to count the frequency of categorical data in a column of a pandas DataFrame, use df['Column_Name'].value_counts().
Just an addition to the previous answers. Let's not forget that when dealing with real data there might be null values, so it's useful to also include those in the counting by using the option dropna=False (the default is True).
An example:
>>> df['Embarked'].value_counts(dropna=False)
S 644
C 168
Q 77
NaN 2
Other possible approaches to count occurrences could be to use (i) Counter from collections module, (ii) unique from numpy library and (iii) groupby + size in pandas.
To use collections.Counter:
from collections import Counter
out = pd.Series(Counter(df['word']))
To use numpy.unique:
import numpy as np
i, c = np.unique(df['word'], return_counts = True)
out = pd.Series(c, index = i)
To use groupby + size:
out = pd.Series(df.index, index=df['word']).groupby(level=0).size()
One very nice feature of value_counts that's missing in the above methods is that it sorts the counts. If having the counts sorted is absolutely necessary, then value_counts is the best method given its simplicity and performance (even though it still gets marginally outperformed by other methods especially for very large Series).
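If you do need the counts sorted from one of the alternatives, a single sort_values call on the resulting Series recovers that order, for example:
out = pd.Series(Counter(df['word'])).sort_values(ascending=False)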
Benchmarks
(if having the counts sorted is not important):
If we look at runtimes, it depends on the data stored in the DataFrame columns/Series.
If the Series is dtype object, then the fastest method for very large Series is collections.Counter, but in general value_counts is very competitive.
However, if it is dtype int, then the fastest method is numpy.unique:
Code used to produce the plots:
import perfplot
import numpy as np
import pandas as pd
from collections import Counter
def creator(n, dt='obj'):
    s = pd.Series(np.random.randint(2*n, size=n))
    return s.astype(str) if dt == 'obj' else s

def plot_perfplot(datatype):
    perfplot.show(
        setup = lambda n: creator(n, datatype),
        kernels = [lambda s: s.value_counts(),
                   lambda s: pd.Series(Counter(s)),
                   lambda s: pd.Series((ic := np.unique(s, return_counts=True))[1], index = ic[0]),
                   lambda s: pd.Series(s.index, index=s).groupby(level=0).size()
                   ],
        labels = ['value_counts', 'Counter', 'np_unique', 'groupby_size'],
        n_range = [2 ** k for k in range(5, 25)],
        equality_check = lambda *x: (d := pd.concat(x, axis=1)).eq(d[0], axis=0).all().all(),
        xlabel = '~len(s)',
        title = f'dtype {datatype}'
    )
plot_perfplot('obj')
plot_perfplot('int')