Speeding up the insertion of null rows into a large Pandas DataFrame? - python

I have a Pandas data frame with hundreds of millions of rows that looks like this:
Date Attribute A Attribute B Value
01/01/16 A 1 50
01/05/16 A 1 60
01/02/16 B 1 59
01/04/16 B 1 90
01/10/16 B 1 84
For each unique combination (call it b) of Attribute A x Attribute B, I need to fill in empty dates starting from the oldest date for that unique group b to the maximum date in the entire dataframe df. That is, so it looks like this:
Date Attribute A Attribute B Value
01/01/16 A 1 50
01/02/16 A 1 0
01/03/16 A 1 0
01/04/16 A 1 0
01/05/16 A 1 60
01/02/16 B 1 59
01/03/16 B 1 0
01/04/16 B 1 90
01/05/16 B 1 0
01/06/16 B 1 0
01/07/16 B 1 0
01/08/16 B 1 84
and then calculate the coefficient of variation (standard deviation/mean) for each unique combination's values (after inserting 0s). My code is this:
final = pd.DataFrame()
max_date = df['Date'].max()
for name, group in df.groupby(['Attribute_A','Attribute_B']):
idx = pd.date_range(group['Date'].min(),
max_date)
temp = group.set_index('Date').reindex(idx, fill_value=0)
coeff_var = temp['Value'].std()/temp['Value'].mean()
final = pd.concat([final, pd.DataFrame({'Attribute_A':[name[0]], 'Attribute_B':[name[1]],'Coeff_Var':[coeff_var]})])
This runs insanely slow, and I'm looking for a way to speed it up.
Suggestions?

This runs insanely slow, and I'm looking for a way to speed it up.
Suggestions?
I don't have a ready solution, however this is how I suggest you approach the problem:
Understand what makes this slow
Find ways to make the critical parts faster
Or, alternatively, find a new approach
Here's the analysis of your code using line profiler:
Timer unit: 1e-06 s
Total time: 0.028074 s
File: <ipython-input-54-ad49822d490b>
Function: foo at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def foo():
2 1 875 875.0 3.1 final = pd.DataFrame()
3 1 302 302.0 1.1 max_date = df['Date'].max()
4 3 3343 1114.3 11.9 for name, group in df.groupby(['Attribute_A','Attribute_B']):
5 2 836 418.0 3.0 idx = pd.date_range(group['Date'].min(),
6 2 3601 1800.5 12.8 max_date)
7
8 2 6713 3356.5 23.9 temp = group.set_index('Date').reindex(idx, fill_value=0)
9 2 1961 980.5 7.0 coeff_var = temp['Value'].std()/temp['Value'].mean()
10 2 10443 5221.5 37.2 final = pd.concat([final, pd.DataFrame({'Attribute_A':[name[0]], 'Attribute_B':[name[1]],'Coeff_Var':[coeff_var]})])
In conclusion, the .reindex and concat statements take 60% of the time.
A first approach that saves 42% of time in my measurement is to collect the data for the final data frame as a list of rows, and create the dataframe as the very last step. Like so:
newdata = []
max_date = df['Date'].max()
for name, group in df.groupby(['Attribute_A','Attribute_B']):
idx = pd.date_range(group['Date'].min(),
max_date)
temp = group.set_index('Date').reindex(idx, fill_value=0)
coeff_var = temp['Value'].std()/temp['Value'].mean()
newdata.append({'Attribute_A': name[0], 'Attribute_B': name[1],'Coeff_Var':coeff_var})
final = pd.DataFrame.from_records(newdata)
Using timeit to measure best execution times I get
your solution: 100 loops, best of 3: 11.5 ms per loop
improved concat: 100 loops, best of 3: 6.67 ms per loop
Details see this ipython notebook
Note: Your mileage may vary - I used the sample data provided in the original post. You should run the line profiler on a subset of your real data - the dominating factor in regards to time use may well be something else then.

I am not sure if my way is faster than the way that you set up, but here goes:
df = pd.DataFrame({'Date': ['1/1/2016', '1/5/2016', '1/2/2016', '1/4/2016', '1/10/2016'],
'Attribute A': ['A', 'A', 'B', 'B', 'B'],
'Attribute B': [1, 1, 1, 1, 1],
'Value': [50, 60, 59, 90, 84]})
unique_attributes = df['Attribute A'].unique()
groups = []
for i in unique_attributes:
subset = df[df['Attribute A'] ==i]
dates = subset['Date'].tolist()
Dates = pd.date_range(dates[0], dates[-1])
subset.set_index('Date', inplace=True)
subset.index = pd.DatetimeIndex(subset.index)
subset = subset.reindex(Dates)
subset['Attribute A'].fillna(method='ffill', inplace=True)
subset['Attribute B'].fillna(method='ffill', inplace=True)
subset['Value'].fillna(0, inplace=True)
groups.append(subset)
result = pd.concat(groups)

Related

Obtaining timeframe for ID groups based on state change

First off, my apologies, I'm a complete novice when it comes to Python. I use it extremely infrequently but require it for this problem.
I have a set of data which looks like the below:
id
state
dt
101
0
2022-15
101
1
2022-22
101
0
2022-26
102
0
2022-01
102
1
2022-41
103
1
2022-03
103
0
2022-12
I need to provide an output which displays the amount of time each ID was state = "1". E.G for ID 101 - state1_start_dt = "2022_22", state1_end_dt = "2022_25".
The data is in .CSV format. I've attempted to bring this in via Pandas, utilise groupby on the df and then loop over this - however this seems extremely slow.
I've come across Finite State Machines which seem to link to my requirements, however I'm in way over my head attempting to create a Finite State Machine in Python which accepts .CSV inputs, provides output per each ID group as well as incorporates logic to account for scenarios where the last entry for an ID is state = "1" - therefore we'd assume the time frame was until the end of 2022.
If anyone can provide some sources or sample code which I can break down to get a better understanding - that would be great.
EDIT
Some examples to be clearer on what I'd like to achieve:
-For IDs that have no ending 0 in the state sequence, the state1_end_dt should be entered as '2022-52' (the final week in 2022)
-For IDs which have alternating states, we can incorporate a second, third, forth etc.. set of columns (E.G state1_start_dt_2, state1_end_dt_2). This will allow each window to be accounted for. For any entries that only have one window, these extra columns can be NULL.
-For IDs which have no "1" present in the state column, these can be skipped.
-For IDs which do not have any 0 states present, the minimum dt value should be taken as the state1_start_dt and '2022-52' can be entered for state1_end_dt
IIUC, here are some functions to perform the aggregation you are looking for.
First, we convert the strings '%Y-%W' (e.g. '2022-15') into a DateTime (the Monday of that week), e.g. '2022-04-11', as it is easier to deal with actual dates than these strings. This makes this solution generic in that it can have arbitrary dates in it, not just for a single year.
Second, we augment the df with a "sentinel": a row for each id that is on the first week of the next year (next year being max year of all dates, plus 1) with state = 0. That allows us to not worry whether a sequence ends with 0 or not.
Then, we essentially group by id and apply the following logic: keep only transitions, so, e.g., [1,1,1,0,0,1,0] becomes [1,.,.,0,.,1,0] (where '.' indicates dropped values). That gives us the spans we are looking for (after subtracting one week for the 0 states).
Edit: speedup: instead of applying the masking logic to each group, we detect transitions globally (on the sentinel-augmented df, sorted by ['id', 'dt', 'state']). Since each id sequence in the augmented df ends with the sentinel (0), we are guaranteed to catch the first 1 of the next id.
Putting it all together, including a postproc() to convert dates back into strings of year-week:
def preproc(df):
df = df.assign(dt=pd.to_datetime(df['dt'] + '-Mon', format='%Y-%W-%a'))
max_year = df['dt'].max().year
# first week next year:
tmax = pd.Timestamp(f'{max_year}-12-31') + pd.offsets.Week(1)
sentinel = pd.DataFrame(
pd.unique(df['id']),
columns=['id']).assign(state=0, dt=tmax)
df = pd.concat([df, sentinel])
df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
return df
# speed up
def proc(df):
mask = df['state'] != df['state'].shift(fill_value=0)
df = df[mask]
z = df.assign(c=df.groupby('id').cumcount()).set_index(['c', 'id'])['dt'].unstack('c')
z[z.columns[1::2]] -= pd.offsets.Week(1)
cols = [
f'{x}_{i}'
for i in range(len(z.columns) // 2)
for x in ['start', 'end']
]
return z.set_axis(cols, axis=1)
def asweeks_str(t, nat='--'):
return f'{t:%Y-%W}' if t and t == t else nat
def postproc(df):
# convert dates into strings '%Y-%W'
return df.applymap(asweeks_str)
Examples
First, let's use the example that is in the original question. Note that this doesn't exemplifies some of the corner cases we are able to handle (more on that in a minute).
df = pd.DataFrame({
'id': [101, 101, 101, 102, 102, 103, 103],
'state': [0, 1, 0, 0, 1, 1, 0],
'dt': ['2022-15', '2022-22', '2022-26', '2022-01', '2022-41', '2022-03', '2022-12'],
})
>>> postproc(proc(preproc(df)))
start_0 end_0
id
101 2022-22 2022-25
102 2022-41 2022-52
103 2022-03 2022-11
But let's generate some random data, to observe some corner cases:
def gen(n, nids=2):
wk = np.random.randint(1, 53, n*nids)
st = np.random.choice([0, 1], n*nids)
ids = np.repeat(np.arange(nids) + 101, n)
df = pd.DataFrame({
'id': ids,
'state': st,
'dt': [f'2022-{w:02d}' for w in wk],
})
df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
return df
Now:
np.random.seed(0) # reproducible example
df = gen(6, 3)
>>> df
id state dt
0 101 0 2022-01
1 101 0 2022-04
2 101 1 2022-04
3 101 1 2022-40
4 101 1 2022-45
5 101 1 2022-48
6 102 1 2022-10
7 102 1 2022-20
8 102 0 2022-22
9 102 1 2022-24
10 102 0 2022-37
11 102 1 2022-51
12 103 1 2022-02
13 103 0 2022-07
14 103 0 2022-13
15 103 1 2022-25
16 103 1 2022-25
17 103 1 2022-39
There are several interesting things here. First, 101 starts with a 0 state, whereas 102 and 103 both start with 1. Then, there are repeated ones for all ids. There are also repeated weeks: '2022-04' for 101 and '2022-25' for 103.
Nevertheless, the aggregation works just fine and produces:
>>> postproc(proc(preproc(df)))
start_0 end_0 start_1 end_1 start_2 end_2
id
101 2022-04 2022-52 -- -- -- --
102 2022-10 2022-21 2022-24 2022-36 2022-51 2022-52
103 2022-02 2022-06 2022-25 2022-52 -- --
Speed
np.random.seed(0)
n = 10
k = 100_000
df = gen(n, k)
%timeit preproc(df)
483 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The processing itself takes less than 200ms for 1 million rows:
a = preproc(df)
%timeit proc(a)
185 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As for the post-processing (converting dates back to year-week strings), it is the slowest thing of all:
b = proc(a)
%timeit postproc(b)
1.63 s ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For a speed-up of that post-processing, we can rely on the fact that there are only a small number of distinct dates that are week-starts (52 per year, plus NaT for the blank cells):
def postproc2(df, nat='--'):
dct = {
t: f'{t:%Y-%W}' if t and t == t else nat
for t in df.stack().reset_index(drop=True).drop_duplicates()
}
return df.applymap(dct.get)
%timeit postproc2(b)
542 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
We could of course do something similar for preproc().
If the csv file called one_zero.csv is this
id,state,dt
100,0,2022-15
100,1,2022-22
100,0,2022-26
101,0,2022-01
101,1,2022-41
102,1,2022-03
102,0,2022-12
102,1,2022-33
(I've added one additional item to the end.)
Then this code gives you what you want.
import pandas as pd
df = pd.read_csv("one_zero.csv")
result = {}
for id_, sub_df in df.groupby('id'):
sub_df = sub_df.sort_values("dt")
intervals = []
start_dt = None
for state, dt in zip(sub_df["state"], sub_df["dt"]):
if state == 1:
start_dt = dt
if state == 0 and start_dt is not None:
week = int(dt.split("-", maxsplit=1)[1])
intervals.append((start_dt, f"2022-{week-1:02d}"))
start_dt = None
if start_dt is not None:
intervals.append((start_dt, "2022-52"))
result[id_] = intervals
At the end the result dictionary will contain this:
{
100: [('2022-22', '2022-25')],
101: [('2022-41', '2022-52')],
102: [('2022-03', '2022-11'), ('2022-33', '2022-52')]
}
With this groupby and sort_values it works even if you shuffle the lines in the csv file. I've used formatted string to fix the week number. 02d there means there, that the week will be always two digits, starting with 0 for the first 9 weeks.
I guess you need less memory if you iterate on the rows like this, but for me the zip version is more familiar.
for _, row in sub_df.iterrows():
state = row["state"]
dt = row["dt"]
Another alternative:
res = (
df.drop(columns="dt")
.assign(week=df["dt"].str.split("-").str[1].astype("int"))
.sort_values(["id", "week"])
.assign(group=lambda df:
df.groupby("id")["state"].diff().fillna(1).ne(0).cumsum()
)
.drop_duplicates(subset="group", keep="first")
.loc[lambda df: df["state"].eq(1) | df["id"].eq(df["id"].shift())]
.assign(no=lambda df: df.groupby("id")["state"].cumsum())
.pivot(index=["id", "no"], columns="state", values="week")
.rename(columns={0: "end", 1: "start"}).fillna("52").astype("int")
)[["start", "end"]]
First add new column week and sort along id and week. (The sorting might not be necessary if the data already come sorted.)
Then look id-group-wise for blocks of consecutive 0 or 1 and based on the result (stored in the new column group) drop all resp. duplicates while keeping the firsts (the others aren't relevant according to the logic you've layed out).
Afterwards also remove the 0-states at the start of an id-group.
On the result identify id-group-wise the connected start-end groups (store in new group no).
Then .pivot the thing: pull id and no in the index and state into the columns.
Afterwards fill the NaN with 52 and do some casting, renaminig, and sorting to get the result in better shape.
If you really want to move the various start-end-combinations into columns then replace below the pivot line as follows:
res = (
...
.pivot(index=["id", "no"], columns="state", values="week")
.rename(columns={0: 1, 1: 0}).fillna("52").astype("int")
.unstack().sort_index(level=1, axis=1)
)
res.columns = [f"{'start' if s == 0 else 'end'}_{n}" for s, n in res.columns]
Results with the dataframe from #Pierre's answer:
state start end
id no
101 1 4 52
102 1 10 22
2 24 37
3 51 52
103 1 2 7
2 25 52
or
start_1 end_1 start_2 end_2 start_3 end_3
id
101 4.0 52.0 NaN NaN NaN NaN
102 10.0 22.0 24.0 37.0 51.0 52.0
103 2.0 7.0 25.0 52.0 NaN NaN

How to create N dataframes during for [duplicate]

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).
I would like to split the dataframe into 60 dataframes (a dataframe for each participant).
In the dataframe, data, there is a variable called 'name', which is the unique code for each participant.
I have tried the following, but nothing happens (or execution does not stop within an hour). What I intend to do is to split the data into smaller dataframes, and append these to a list (datalist):
import pandas as pd
def splitframe(data, name='name'):
n = data[name][0]
df = pd.DataFrame(columns=data.columns)
datalist = []
for i in range(len(data)):
if data[name][i] == n:
df = df.append(data.iloc[i])
else:
datalist.append(df)
df = pd.DataFrame(columns=data.columns)
n = data[name][i]
df = df.append(data.iloc[i])
return datalist
I do not get an error message, the script just seems to run forever!
Is there a smart way to do it?
Can I ask why not just do it by slicing the data frame. Something like
#create some data with Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] *4, 'Ob1' : np.random.rand(16), 'Ob2' : np.random.rand(16)})
#create unique list of names
UniqueNames = data.Names.unique()
#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame() for elem in UniqueNames}
for key in DataFrameDict.keys():
DataFrameDict[key] = data[:][data.Names == key]
Hey presto you have a dictionary of data frames just as (I think) you want them. Need to access one? Just enter
DataFrameDict['Joe']
Firstly your approach is inefficient because the appending to the list on a row by basis will be slow as it has to periodically grow the list when there is insufficient space for the new entry, list comprehensions are better in this respect as the size is determined up front and allocated once.
However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?
I would sort the dataframe by column 'name', set the index to be this and if required not drop the column.
Then generate a list of all the unique entries and then you can perform a lookup using these entries and crucially if you only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.
Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:
# sort the dataframe
df.sort_values(by='name', axis=1, inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
You can convert groupby object to tuples and then to dict:
df = pd.DataFrame({'Name':list('aabbef'),
'A':[4,5,4,5,5,4],
'B':[7,8,9,4,2,3],
'C':[1,3,5,7,1,0]}, columns = ['Name','A','B','C'])
print (df)
Name A B C
0 a 4 7 1
1 a 5 8 3
2 b 4 9 5
3 b 5 4 7
4 e 5 2 1
5 f 4 3 0
d = dict(tuple(df.groupby('Name')))
print (d)
{'b': Name A B C
2 b 4 9 5
3 b 5 4 7, 'e': Name A B C
4 e 5 2 1, 'a': Name A B C
0 a 4 7 1
1 a 5 8 3, 'f': Name A B C
5 f 4 3 0}
print (d['a'])
Name A B C
0 a 4 7 1
1 a 5 8 3
It is not recommended, but possible create DataFrames by groups:
for i, g in df.groupby('Name'):
globals()['df_' + str(i)] = g
print (df_a)
Name A B C
0 a 4 7 1
1 a 5 8 3
Easy:
[v for k, v in df.groupby('name')]
Groupby can helps you:
grouped = data.groupby(['name'])
Then you can work with each group like with a dataframe for each participant. And DataFrameGroupBy object methods such as (apply, transform, aggregate, head, first, last) return a DataFrame object.
Or you can make list from grouped and get all DataFrame's by index:
l_grouped = list(grouped)
l_grouped[0][1] - DataFrame for first group with first name.
In addition to Gusev Slava's answer, you might want to use groupby's groups:
{key: df.loc[value] for key, value in df.groupby("name").groups.items()}
This will yield a dictionary with the keys you have grouped by, pointing to the corresponding partitions. The advantage is that the keys are maintained and don't vanish in the list index.
The method in the OP works, but isn't efficient. It may have seemed to run forever, because the dataset was long.
Use .groupby on the 'method' column, and create a dict of DataFrames with unique 'method' values as the keys, with a dict-comprehension.
.groupby returns a groupby object, that contains information about the groups, where g is the unique value in 'method' for each group, and d is the DataFrame for that group.
The value of each key in df_dict, will be a DataFrame, which can be accessed in the standard way, df_dict['key'].
The original question wanted a list of DataFrames, which can be done with a list-comprehension
df_list = [d for _, d in df.groupby('method')]
import pandas as pd
import seaborn as sns # for test dataset
# load data for example
df = sns.load_dataset('planets')
# display(df.head())
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
# Using a dict-comprehension, the unique 'method' value will be the key
df_dict = {g: d for g, d in df.groupby('method')}
print(df_dict.keys())
[out]:
dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])
# or a specific name for the key, using enumerate (e.g. df1, df2, etc.)
df_dict = {f'df{i}': d for i, (g, d) in enumerate(df.groupby('method'))}
print(df_dict.keys())
[out]:
dict_keys(['df0', 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9'])
df_dict['df1].head(3) or df_dict['Astrometry'].head(3)
There are only 2 in this group
method number orbital_period mass distance year
113 Astrometry 1 246.36 NaN 20.77 2013
537 Astrometry 1 1016.00 NaN 14.98 2010
df_dict['df2].head(3) or df_dict['Eclipse Timing Variations'].head(3)
method number orbital_period mass distance year
32 Eclipse Timing Variations 1 10220.0 6.05 NaN 2009
37 Eclipse Timing Variations 2 5767.0 NaN 130.72 2008
38 Eclipse Timing Variations 2 3321.0 NaN 130.72 2008
df_dict['df3].head(3) or df_dict['Imaging'].head(3)
method number orbital_period mass distance year
29 Imaging 1 NaN NaN 45.52 2005
30 Imaging 1 NaN NaN 165.00 2007
31 Imaging 1 NaN NaN 140.00 2004
For more information about the seaborn datasets
NASA Exoplanets
Alternatively
This is a manual method to create separate DataFrames using pandas: Boolean Indexing
This is similar to the accepted answer, but .loc is not required.
This is an acceptable method for creating a couple extra DataFrames.
The pythonic way to create multiple objects, is by placing them in a container (e.g. dict, list, generator, etc.), as shown above.
df1 = df[df.method == 'Astrometry']
df2 = df[df.method == 'Eclipse Timing Variations']
In [28]: df = DataFrame(np.random.randn(1000000,10))
In [29]: df
Out[29]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0 1000000 non-null values
1 1000000 non-null values
2 1000000 non-null values
3 1000000 non-null values
4 1000000 non-null values
5 1000000 non-null values
6 1000000 non-null values
7 1000000 non-null values
8 1000000 non-null values
9 1000000 non-null values
dtypes: float64(10)
In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
1 loops, best of 3: 849 ms per loop
In [32]: len(frames)
Out[32]: 16667
Here's a groupby way (and you could do an arbitrary apply rather than sum)
In [9]: g = df.groupby(lambda x: x/60)
In [8]: g.sum()
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16667 entries, 0 to 16666
Data columns (total 10 columns):
0 16667 non-null values
1 16667 non-null values
2 16667 non-null values
3 16667 non-null values
4 16667 non-null values
5 16667 non-null values
6 16667 non-null values
7 16667 non-null values
8 16667 non-null values
9 16667 non-null values
dtypes: float64(10)
Sum is cythonized that's why this is so fast
In [10]: %timeit g.sum()
10 loops, best of 3: 27.5 ms per loop
In [11]: %timeit df.groupby(lambda x: x/60)
1 loops, best of 3: 231 ms per loop
The method based on list comprehension and groupby- Which stores all the split dataframe in list variable and can be accessed using the index.
Example
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
You can use the groupby command, if you already have some labels for your data.
out_list = [group[1] for group in in_series.groupby(label_series.values)]
Here's a detailed example:
Let's say we want to partition a pd series using some labels into a list of chunks
For example, in_series is:
2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 5, dtype: float64
And its corresponding label_series is:
2019-07-01 08:00:00 1
2019-07-01 08:02:00 1
2019-07-01 08:04:00 2
2019-07-01 08:06:00 2
2019-07-01 08:08:00 2
Length: 5, dtype: float64
Run
out_list = [group[1] for group in in_series.groupby(label_series.values)]
which returns out_list a list of two pd.Series:
[2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
Length: 2, dtype: float64,
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 3, dtype: float64]
Note that you can use some parameters from in_series itself to group the series, e.g., in_series.index.day
here's a small function which might help some (efficiency not perfect probably, but compact + more or less easy to understand):
def get_splited_df_dict(df: 'pd.DataFrame', split_column: 'str'):
"""
splits a pandas.DataFrame on split_column and returns it as a dict
"""
df_dict = {value: df[df[split_column] == value].drop(split_column, axis=1) for value in df[split_column].unique()}
return df_dict
it converts a DataFrame to multiple DataFrames, by selecting each unique value in the given column and putting all those entries into a separate DataFrame.
the .drop(split_column, axis=1) is just for removing the column which was used to split the DataFrame. the removal is not necessary, but can help a little to cut down on memory usage after the operation.
the result of get_splited_df_dict is a dict, meaning one can access each DataFrame like this:
splitted = get_splited_df_dict(some_df, some_column)
# accessing the DataFrame with 'some_column_value'
splitted[some_column_value]
The existing answers cover all good cases and explains fairly well how the groupby object is like a dictionary with keys and values that can be accessed via .groups. Yet more methods to do the same job as the existing answers are:
Create a list by unpacking the groupby object and casting it to a dictionary:
dict([*df.groupby('Name')]) # same as dict(list(df.groupby('Name')))
Create a tuple + dict (this is the same as #jezrael's answer):
dict((*df.groupby('Name'),))
If we only want the DataFrames, we could get the values of the dictionary (created above):
[*dict([*df.groupby('Name')]).values()]
I had similar problem. I had a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe in 500 dataframes (10stores*50stores) to apply Machine Learning models to each of them and I couldn't do it manually.
This is the head of the dataframe:
I have created two lists;
one for the names of dataframes
and one for the couple of array [item_number, store_number].
list=[]
for i in range(1,len(items)*len(stores)+1):
global list
list.append('df'+str(i))
list_couple_s_i =[]
for item in items:
for store in stores:
global list_couple_s_i
list_couple_s_i.append([item,store])
And once the two lists are ready you can loop on them to create the dataframes you want:
for name, it_st in zip(list,list_couple_s_i):
globals()[name] = df.where((df['item']==it_st[0]) &
(df['store']==(it_st[1])))
globals()[name].dropna(inplace=True)
In this way I have created 500 dataframes.
Hope this will be helpful!

How to get the datetime index corresponding to the n smallest values of a column in python

I am creating a variable 'spike' as an indicator variable that is 1 for the date corresponding to old column, Cost, n smallest values and a 0 otherwise. The code illustrated below is apart of a larger for loop.
I can only get results using the idxmin() function. I would like help in getting the index for the n smallest values.
import pandas as pd
import numpy as np
df3 = pd.DataFrame({'Dept':['A', 'A', 'B', 'B'],
'Benefit':[2000,25,55,400],
'Cost':[1000, 500, 1500, 2000]})
# Let's create an index using Timestamps
index_ = [pd.Timestamp('01-06-2018'), pd.Timestamp('04-06-2018'),
pd.Timestamp('07-06-2018'), pd.Timestamp('10-06-2018')]
df3.index = index_
print(df3)
df3.index = index_
print(df3)
df3['spike'] = np.where(df3.index.isin(lookup), 1, 0)
If you sort, then you can get the top-3 with standard Python / numpy array slicing.
low_cost = df3.sort_values('Cost')[:3]
low_cost
# Dept Benefit Cost
# 2018-04-06 A 25 500
# 2018-01-06 A 2000 1000
# 2018-07-06 B 55 1500
To get the spike column, for efficiency I would recommend a join.
spikes = low_cost.assign(spike=1)[['spike']]
spikes
# spike
# 2018-04-06 1
# 2018-01-06 1
# 2018-07-06 1
df3.join(spikes, how='left').fillna(0)
# Dept Benefit Cost spike
# 2018-01-06 A 2000 1000 1.0
# 2018-04-06 A 25 500 1.0
# 2018-07-06 B 55 1500 1.0
# 2018-10-06 B 400 2000 0.0

Find when value repeats in pandas

I have a data frame of temperatures (Temp) with two columns:
Time
Temperature
And I have another data frame (Activated) with a single column that describes times when some equipment activated:
Time
Both of these data frames are over the same period, and each "TimeStamp" exist in both.
I want to know, for each time in the Activated data frame how long it passed till the temperature risen.
I know how to do it "C style", with two for loops (one iterates over Activated finds the time in Temp and then loops forward till finds higher temperature).
But I'm sure there is a much better cleaner and efficient way using pandas. Example data:
Temp
1 25.6
2 25.4
3 24
4 25.8
5 23.2
6 24.3
7 28.1
Activated
2
4
For the first row in Activated the time is 2 - the temperature in time 2 was 25.4, the result should be 2 because only on time 4 it risen.
For the second row in Activated the time is 4 - the temperature in time 4 was 25.8, the result should be 3 because only on time 7 it risen.
output for 2 is 2 and for 4 is 3
Here's a partially vectorised solution. Note pd.Series.apply is just a Python-level loop.
temp = pd.DataFrame({'Time': list(range(1, 8)),
'Temperature': [25.6, 25.4, 24, 25.8, 23.2, 24.3, 28.1]})
activated = pd.DataFrame({'Time': [2, 4]})
def time_until_higher(x):
current_temp = temp_s.at[x]
diffs = temp_s[temp_s.index > x] - current_temp
return (diffs < 0).values.argmin() + 1
temp_s = temp.set_index('Time')['Temperature'] # assume times are sorted
activated['Next'] = activated['Time'].apply(time_until_higher)
print(activated)
Time Next
0 2 2
1 4 3

Create different dataframe from one given condition [duplicate]

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).
I would like to split the dataframe into 60 dataframes (a dataframe for each participant).
In the dataframe, data, there is a variable called 'name', which is the unique code for each participant.
I have tried the following, but nothing happens (or execution does not stop within an hour). What I intend to do is to split the data into smaller dataframes, and append these to a list (datalist):
import pandas as pd
def splitframe(data, name='name'):
n = data[name][0]
df = pd.DataFrame(columns=data.columns)
datalist = []
for i in range(len(data)):
if data[name][i] == n:
df = df.append(data.iloc[i])
else:
datalist.append(df)
df = pd.DataFrame(columns=data.columns)
n = data[name][i]
df = df.append(data.iloc[i])
return datalist
I do not get an error message, the script just seems to run forever!
Is there a smart way to do it?
Can I ask why not just do it by slicing the data frame. Something like
#create some data with Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] *4, 'Ob1' : np.random.rand(16), 'Ob2' : np.random.rand(16)})
#create unique list of names
UniqueNames = data.Names.unique()
#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame() for elem in UniqueNames}
for key in DataFrameDict.keys():
DataFrameDict[key] = data[:][data.Names == key]
Hey presto you have a dictionary of data frames just as (I think) you want them. Need to access one? Just enter
DataFrameDict['Joe']
Firstly your approach is inefficient because the appending to the list on a row by basis will be slow as it has to periodically grow the list when there is insufficient space for the new entry, list comprehensions are better in this respect as the size is determined up front and allocated once.
However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?
I would sort the dataframe by column 'name', set the index to be this and if required not drop the column.
Then generate a list of all the unique entries and then you can perform a lookup using these entries and crucially if you only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.
Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:
# sort the dataframe
df.sort_values(by='name', axis=1, inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
You can convert groupby object to tuples and then to dict:
df = pd.DataFrame({'Name':list('aabbef'),
'A':[4,5,4,5,5,4],
'B':[7,8,9,4,2,3],
'C':[1,3,5,7,1,0]}, columns = ['Name','A','B','C'])
print (df)
Name A B C
0 a 4 7 1
1 a 5 8 3
2 b 4 9 5
3 b 5 4 7
4 e 5 2 1
5 f 4 3 0
d = dict(tuple(df.groupby('Name')))
print (d)
{'b': Name A B C
2 b 4 9 5
3 b 5 4 7, 'e': Name A B C
4 e 5 2 1, 'a': Name A B C
0 a 4 7 1
1 a 5 8 3, 'f': Name A B C
5 f 4 3 0}
print (d['a'])
Name A B C
0 a 4 7 1
1 a 5 8 3
It is not recommended, but possible create DataFrames by groups:
for i, g in df.groupby('Name'):
globals()['df_' + str(i)] = g
print (df_a)
Name A B C
0 a 4 7 1
1 a 5 8 3
Easy:
[v for k, v in df.groupby('name')]
Groupby can helps you:
grouped = data.groupby(['name'])
Then you can work with each group like with a dataframe for each participant. And DataFrameGroupBy object methods such as (apply, transform, aggregate, head, first, last) return a DataFrame object.
Or you can make list from grouped and get all DataFrame's by index:
l_grouped = list(grouped)
l_grouped[0][1] - DataFrame for first group with first name.
In addition to Gusev Slava's answer, you might want to use groupby's groups:
{key: df.loc[value] for key, value in df.groupby("name").groups.items()}
This will yield a dictionary with the keys you have grouped by, pointing to the corresponding partitions. The advantage is that the keys are maintained and don't vanish in the list index.
The method in the OP works, but isn't efficient. It may have seemed to run forever, because the dataset was long.
Use .groupby on the 'method' column, and create a dict of DataFrames with unique 'method' values as the keys, with a dict-comprehension.
.groupby returns a groupby object, that contains information about the groups, where g is the unique value in 'method' for each group, and d is the DataFrame for that group.
The value of each key in df_dict, will be a DataFrame, which can be accessed in the standard way, df_dict['key'].
The original question wanted a list of DataFrames, which can be done with a list-comprehension
df_list = [d for _, d in df.groupby('method')]
import pandas as pd
import seaborn as sns # for test dataset
# load data for example
df = sns.load_dataset('planets')
# display(df.head())
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
# Using a dict-comprehension, the unique 'method' value will be the key
df_dict = {g: d for g, d in df.groupby('method')}
print(df_dict.keys())
[out]:
dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])
# or a specific name for the key, using enumerate (e.g. df1, df2, etc.)
df_dict = {f'df{i}': d for i, (g, d) in enumerate(df.groupby('method'))}
print(df_dict.keys())
[out]:
dict_keys(['df0', 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9'])
df_dict['df1].head(3) or df_dict['Astrometry'].head(3)
There are only 2 in this group
method number orbital_period mass distance year
113 Astrometry 1 246.36 NaN 20.77 2013
537 Astrometry 1 1016.00 NaN 14.98 2010
df_dict['df2].head(3) or df_dict['Eclipse Timing Variations'].head(3)
method number orbital_period mass distance year
32 Eclipse Timing Variations 1 10220.0 6.05 NaN 2009
37 Eclipse Timing Variations 2 5767.0 NaN 130.72 2008
38 Eclipse Timing Variations 2 3321.0 NaN 130.72 2008
df_dict['df3].head(3) or df_dict['Imaging'].head(3)
method number orbital_period mass distance year
29 Imaging 1 NaN NaN 45.52 2005
30 Imaging 1 NaN NaN 165.00 2007
31 Imaging 1 NaN NaN 140.00 2004
For more information about the seaborn datasets
NASA Exoplanets
Alternatively
This is a manual method to create separate DataFrames using pandas: Boolean Indexing
This is similar to the accepted answer, but .loc is not required.
This is an acceptable method for creating a couple extra DataFrames.
The pythonic way to create multiple objects, is by placing them in a container (e.g. dict, list, generator, etc.), as shown above.
df1 = df[df.method == 'Astrometry']
df2 = df[df.method == 'Eclipse Timing Variations']
In [28]: df = DataFrame(np.random.randn(1000000,10))
In [29]: df
Out[29]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0 1000000 non-null values
1 1000000 non-null values
2 1000000 non-null values
3 1000000 non-null values
4 1000000 non-null values
5 1000000 non-null values
6 1000000 non-null values
7 1000000 non-null values
8 1000000 non-null values
9 1000000 non-null values
dtypes: float64(10)
In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
1 loops, best of 3: 849 ms per loop
In [32]: len(frames)
Out[32]: 16667
Here's a groupby way (and you could do an arbitrary apply rather than sum)
In [9]: g = df.groupby(lambda x: x/60)
In [8]: g.sum()
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16667 entries, 0 to 16666
Data columns (total 10 columns):
0 16667 non-null values
1 16667 non-null values
2 16667 non-null values
3 16667 non-null values
4 16667 non-null values
5 16667 non-null values
6 16667 non-null values
7 16667 non-null values
8 16667 non-null values
9 16667 non-null values
dtypes: float64(10)
Sum is cythonized that's why this is so fast
In [10]: %timeit g.sum()
10 loops, best of 3: 27.5 ms per loop
In [11]: %timeit df.groupby(lambda x: x/60)
1 loops, best of 3: 231 ms per loop
The method based on list comprehension and groupby- Which stores all the split dataframe in list variable and can be accessed using the index.
Example
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
You can use the groupby command, if you already have some labels for your data.
out_list = [group[1] for group in in_series.groupby(label_series.values)]
Here's a detailed example:
Let's say we want to partition a pd series using some labels into a list of chunks
For example, in_series is:
2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 5, dtype: float64
And its corresponding label_series is:
2019-07-01 08:00:00 1
2019-07-01 08:02:00 1
2019-07-01 08:04:00 2
2019-07-01 08:06:00 2
2019-07-01 08:08:00 2
Length: 5, dtype: float64
Run
out_list = [group[1] for group in in_series.groupby(label_series.values)]
which returns out_list a list of two pd.Series:
[2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
Length: 2, dtype: float64,
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 3, dtype: float64]
Note that you can use some parameters from in_series itself to group the series, e.g., in_series.index.day
here's a small function which might help some (efficiency not perfect probably, but compact + more or less easy to understand):
def get_splited_df_dict(df: 'pd.DataFrame', split_column: 'str'):
"""
splits a pandas.DataFrame on split_column and returns it as a dict
"""
df_dict = {value: df[df[split_column] == value].drop(split_column, axis=1) for value in df[split_column].unique()}
return df_dict
it converts a DataFrame to multiple DataFrames, by selecting each unique value in the given column and putting all those entries into a separate DataFrame.
the .drop(split_column, axis=1) is just for removing the column which was used to split the DataFrame. the removal is not necessary, but can help a little to cut down on memory usage after the operation.
the result of get_splited_df_dict is a dict, meaning one can access each DataFrame like this:
splitted = get_splited_df_dict(some_df, some_column)
# accessing the DataFrame with 'some_column_value'
splitted[some_column_value]
The existing answers cover all good cases and explains fairly well how the groupby object is like a dictionary with keys and values that can be accessed via .groups. Yet more methods to do the same job as the existing answers are:
Create a list by unpacking the groupby object and casting it to a dictionary:
dict([*df.groupby('Name')]) # same as dict(list(df.groupby('Name')))
Create a tuple + dict (this is the same as #jezrael's answer):
dict((*df.groupby('Name'),))
If we only want the DataFrames, we could get the values of the dictionary (created above):
[*dict([*df.groupby('Name')]).values()]
I had similar problem. I had a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe in 500 dataframes (10stores*50stores) to apply Machine Learning models to each of them and I couldn't do it manually.
This is the head of the dataframe:
I have created two lists;
one for the names of dataframes
and one for the couple of array [item_number, store_number].
list=[]
for i in range(1,len(items)*len(stores)+1):
global list
list.append('df'+str(i))
list_couple_s_i =[]
for item in items:
for store in stores:
global list_couple_s_i
list_couple_s_i.append([item,store])
And once the two lists are ready you can loop on them to create the dataframes you want:
for name, it_st in zip(list,list_couple_s_i):
globals()[name] = df.where((df['item']==it_st[0]) &
(df['store']==(it_st[1])))
globals()[name].dropna(inplace=True)
In this way I have created 500 dataframes.
Hope this will be helpful!

Categories