I have a dictionary that is filled with multiple dataframes. Now I am searching for an efficient way for changing the key structure, but the solution I have found is rather slow when more dataframes / bigger dataframes are involved. Thats why I wanted to ask if anyone might know a more convenient / efficient / faster way or approach than mine. So first, I created this example to show where I initially started:
import pandas as pd
import numpy as np
# assign keys to dic
teams = ["Arsenal", "Chelsea", "Manchester United"]
dic_teams = {}
# fill dic with random entries
for t1 in teams:
dic_teams[t1] = pd.DataFrame({'date': pd.date_range("20180101", periods=30),
'Goals': pd.Series(np.random.randint(0,5, size = 30)),
'Chances': pd.Series(np.random.randint(0,15, size = 30)),
'Fouls': pd.Series(np.random.randint(0, 20, size = 30)),
'Offside': pd.Series(np.random.randint(0, 10, size = 30))})
dic_teams[t1] = dic_teams[t1].set_index('date')
dic_teams[t1].index.name = None
Now I basically have a dictionary where every key is a team, which means I have a dataframe for every team with information on their game performance over time. Now I would prefer to change this particular dictionary so I get a structure where the key is the date, instead of a team. This would mean that I have a dataframe for every date, which is filled with the performance of each team on that date. I managed to do that using the following code, which works but is really slow once I add more teams and performance factors:
# prepare lists for looping
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}
# new structure where key = date
for d in dates:
dic_dates[d] = pd.DataFrame(index = teams, columns = perf)
for t2 in teams:
dic_dates[d].loc[t2] = dic_teams[t2].loc[d]
Because I am using a nested loop, the restructuring of my dictionary is slow. Does anyone have an idea how I could improve the second piece of code? I'm not necessarily searching just for a solution, also for a logic or idea how to do better.
Thanks in advance, any help is highly appreciated
Creating a Pandas dataframes the way you do is (strangely) awfully slow, as well as direct indexing.
Copying a dataframe is surprisingly quite fast. Thus you can use an empty reference dataframe copied multiple times. Here is the code:
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
zygote = pd.DataFrame(index = teams, columns = perf)
dic_dates = {}
# new structure where key = date
for d in dates:
dic_dates[d] = zygote.copy()
for t2 in teams:
dic_dates[d].loc[t2] = dic_teams[t2].loc[d]
This is about 2 times faster than the reference on my machine.
Overcoming the slow dataframe direct indexing is tricky. We can use numpy to do that. Indeed, we can convert the dataframe to a 3D numpy array, use numpy to perform the transposition, and finally convert the slices into dataframes again. Note that this approach assumes that all values are integers and that the input dataframe are well structured.
Here is the final implementation:
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}
# Create a numpy array from Pandas dataframes
# Assume the order of the `dates` and `perf` indices are the same in all dataframe (and their order)
full = np.empty(shape=(len(teams), len(dates), len(perf)), dtype=int)
for tId,tName in enumerate(teams):
full[tId,:,:] = dic_teams[tName].to_numpy()
# New structure where key = date, created from the numpy array
for dId,dName in enumerate(dates):
dic_dates[dName] = pd.DataFrame({pName: full[:,dId,pId] for pId,pName in enumerate(perf)}, index = teams)
This implementation is 6.4 times faster than the reference on my machine. Note that about 75% of the time is sadly spent in the pd.DataFrame calls. Thus, if you want a faster code, use a basic 3D numpy array!
Related
I have a large pandas dataframe that countains data similar to the image attached.
I want to get a count of how many unique TN exist within each 2 second window of the data. I've done this with a simple loop, but it is incredibly slow. Is there a better technique I can use to get this?
My original code is:
uniqueTN = []
tmstart = 5400; tmstop = 86400
for tm in range(int(tmstart), int(tmstop), 2):
df = rundf[(rundf['time']>=(tm-2))&rundf['time']<tm)]
uniqueTN.append(df['TN'].unique())
This solution would be fine it the set of data was not so large.
Here is how you can implement groupby() method and nunique().
rundf['time'] = (rundf['time'] // 2) * 2
grouped = rundf.groupby('time')['TN'].nunique()
Another alternative is to use the resample() method of pandas and then the nunique() method.
grouped = rundf.resample('2S', on='time')['TN'].nunique()
I have some numpy array, whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.
I want to create a new column in the dataframe, for which each entry would be a numpy array of a lesser dimension.
Code:
some_df = pd.DataFrame(columns=['A'])
for i in range(10):
some_df.loc[i] = [np.random.rand(4, 6, 8)
data = np.stack(some_df['A'].values) #shape (10, 4, 6, 8)
processed = np.max(data, axis=1) # shape (10, 6, 8)
some_df['B'] = processed # This fails
I want the new column 'B' to contain numpy arrays of shape (6, 8)
How can this be done?
This is not recommended, it is pain, slow and later processing is not easy.
One possible solution is use list comprehension:
some_df['B'] = [x for x in processed]
Or convert to list and assign:
some_df['B'] = processed.tolist()
Coming back to this after 2 years, here is a much better practice:
from itertools import product, chain
import pandas as pd
import numpy as np
from typing import Dict
def calc_col_names(named_shape):
*prefix, shape = named_shape
names = [map(str, range(i)) for i in shape]
return map('_'.join, product(prefix, *names))
def create_flat_columns_df_from_dict_of_numpy(
named_np: Dict[str, np.array],
n_samples_per_np: int,
):
named_np_correct_lenth = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_lenth.values()]
stacked_nps = np.column_stack(flat_nps)
named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_lenth.items()]
col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
df = pd.DataFrame(stacked_nps, columns=col_names)
df = df.convert_dtypes()
return df
def parse_series_into_np(df, col_name, shp):
# can parse the shape from the col names
n_samples = len(df)
col_names = sorted(c for c in df.columns if col_name in c)
col_names = list(filter(lambda c: c.startswith(col_name + "_") or len(col_names) == 1, col_names))
col_as_np = df[col_names].astype(np.float).values.reshape((n_samples, *shp))
return col_as_np
usage to put a ndarray into a Dataframe:
full_rate_df = create_flat_columns_df_from_dict_of_numpy(
named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
n_samples_per_np=d["name1"].shape[0]
)
where d is a dict of nd arrays of the same shape[0], hashed by ["name1", "name2"].
The reverse operation can be obtained by parse_series_into_np.
The accepted answer remains, as it answers the original question, but this one is a much better practice.
I know this question already has an answer to it, but I would like to add a much more scalable way of doing this. As mentioned in the comments above it is in general not recommended to store arrays as "field"-values in a pandas-Dataframe column (I actually do not know why?). Nevertheless, in my day to day work this is an extermely important functionality when working with time-series data and a bunch of related meta-data.
In general I organize my experimantal time-series in form of pandas dataframes with one column holding same-length numpy arrays and the other columns containing information on meta-data with respect to certain measurement conditions etc.
The proposed solution by jezrael works very well, and I used this for the last 4 years on a regular basis. But this method potentially encounters huge memory problems. In my case I came across these problems working with dataframes beyond 5 Million rows and time-series with approx. 100 data points.
The solution to these problems is extremely simple, since I did not find it anywhere I just wanted to share it here: Simply transform your 2D array to a pandas-Series object and assign this to a column of your dataframe:
df["new_list_column"] = pd.Series(list(numpy_array_2D))
I have a 10 GB csv file with 170,000,000 rows and 23 columns that I read in to a dataframe as follows:
import pandas as pd
d = pd.read_csv(f, dtype = {'tax_id': str})
I also have a list of strings with nearly 20,000 unique elements:
h = ['1123787', '3345634442', '2342345234', .... ]
I want to create a new column called class in the dataframe d. I want to assign d['class'] = 'A' whenever d['tax_id'] has a value that is found in the list of strings h. Otherwise, I want d['class'] = 'B'.
The following code works very quickly on a 1% sample of my dataframe d:
d['class'] = 'B'
d.loc[d['tax_num'].isin(h), 'class'] = 'A'
However, on the complete dataframe d, this code takes over 48 hours (and counting) to run on a 32 core server in batch mode. I suspect that indexing with loc is slowing down the code, but I'm not sure what it could really be.
In sum: Is there a more efficient way of creating the class column?
If your tax numbers are unique, I would recommend setting tax_num to the index and then indexing on that. As it stands, you call isin which is a linear operation. However fast your machine is, it can't do a linear search on 170 million records in a reasonable amount of time.
df.set_index('tax_num', inplace=True) # df = df.set_index('tax_num')
df['class'] = 'B'
df.loc[h, 'class'] = 'A'
If you're still suffering from performance issues, I'd recommend switching to distributed processing with dask.
"I also have a list of strings with nearly 20,000 unique elements"
Well, for starters, you should make that list a set if you are going to be using it for membership testing. list objects have linear time membership testing, set objects have very optimized constant-time performance for membership testing. That is the lowest hanging fruit here. So use
h = set(h) # convert list to set
d['class'] = 'B'
d.loc[d['tax_num'].isin(h), 'class'] = 'A'
I have a dataframe of ~20M lines
I have a column called A that gives me an id (there are ~10K ids in total).
The value of this id defines a random distribution's parameters.
Now I want to generate a column B, that is randomly drawn from the distribution that is defined by the value in the column A
What is the fastest way to do this? Doing something with iterrows or apply is extremely slow. Another possiblity is to group by A, and generate all my data for each value of A (so I only draw from one distribution). But then I don't end up with a Dataframe but with a "groupBy" object, and I don't know how to go back to having the initial dataframe, plus my new column.
I think this approach is similar to what you were describing, where you generate the samples for each id. On my machine, it appears this would take around 5 minutes to run. I assume you can trivially get the ids.
import numpy as np
num_ids = 10000
num_rows = 20000000
ids = np.arange(num_ids)
loc_params = np.random.random(num_ids)
A = np.random.randint(0, num_ids, num_rows)
B = np.zeros(A.shape)
for idx in ids:
A_idxs = A == idx
B[A_idxs] = np.random.normal(np.sum(A_idxs), loc_params[idx])
This question is pretty vague, but how would this work for you?
df['B'] = df.apply(lambda row: distribution(row.A), axis=1)
Editing from question edits (apply is too slow):
You could create a mapping dictionary for the 10k ids to their generated value, then do something like
df['B'] = df['A'].map(dictionary)
I'm unsure if this will be faster than apply, but it will require fewer calls to your random distribution generator
This operation needs to be applied as fast as possible as the actual arrays which contain millions of elements. This is a simple version of the problem.
So, I have a random array of unique integers (normally millions of elements).
totalIDs = [5,4,3,1,2,9,7,6,8 ...]
I have another array (normally a tens of thousands) of unique integers which I can create a mask.
subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...
I can use numpy to do
mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
I can then extract the information I want of another array using the mask (say column 0 contains the one I want).
variable = allvariables[mask][:,0]
Now given that the IDs are unique in both arrays, is there any way to speed this up significantly. It takes a long time to construct the mask for a few thousand points (subsampleIDs) matching against millions of IDs (totalIDs).
I thought of going through it once and writing out a binary file of an index (to speed up future searches).
for i in range(0,3):
mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
index[mask] = i
where X is in subsampleIDsX. Then I can just do:
for i in range(0,3):
if index[i] == i:
rowmatch = i
break
variable = allvariables[rowmatch:len(subsampleIDs),0]
right? But this is also slow because there is a conditional in the loop to find when it first matches. Is there a faster way to find when a number first appears in an ordered array so the conditional doesn't slow the loop?
I suggest you use DataFrame in Pandas. the index of the DataFrame is the totalIDs, and you can select subsampleIDs by: df.ix[subsampleIDs].
Create some test data first:
import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))
subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)
Then convert you data in to a DataFrame:
import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs)
df.ix[subsampleIDs]
DataFrame use a hashtable to map the index to it's location, it's very fast.
Often this kind of indexing is best performed using a DB (with proper column-indexing).
Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d, which avoids sorting everything. The numpy implementation of in1d (at least in the version that I have installed) is fairly simple, and should be easy to copy and modify.
EDIT:
Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs, and M the size of sampleIDs (times a constant you can play with by changing the number of buckets). Here too, you can split totalIDs to buckets only once, which gives you a nifty O(N+M1+M2+...).
Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python