Index Slicing with Float64Index not working in pandas - python

I have the following dataframe:
                p12Diff
Pump Time
3    -2.90   -0.000919
     -2.89   -0.000795
     -2.88   -0.000814
     -2.87   -0.000700
     -2.86   -0.000847
     -2.85   -0.000769
     -2.84   -0.000681
     -2.83   -0.000888
     -2.82   -0.000815
     -2.81   -0.000764
     -2.80   -0.000879
     -2.70   -0.000757
     -2.60   -0.000758
     -2.50   -0.000707
Oddly enough, when I slice with idx=IndexSlice for certain ranges, I get a KeyError, whereas for others it simply works. For example, df.loc[idx[:,-2.90:-2.52],:] cuts at -2.60, whereas df.loc[idx[:,-2.90:-2.62],:] raises KeyError: -2.62.
Might this be a bug?

This was fixed in 0.15.0 (RC1 is out now), see here: http://pandas.pydata.org/. 0.14.1 was a bit buggy with this type of indexing.
In [13]: df = DataFrame({'value': np.arange(11)}, index=pd.MultiIndex.from_product([[1], np.linspace(-2.9, -2.3, 11)]))

In [14]: df
Out[14]:
         value
1 -2.90      0
  -2.84      1
  -2.78      2
  -2.72      3
  -2.66      4
  -2.60      5
  -2.54      6
  -2.48      7
  -2.42      8
  -2.36      9
  -2.30     10

In [15]: idx = pd.IndexSlice

In [16]: df.loc[idx[:, -2.9:-2.42], ]
Out[16]:
         value
1 -2.90      0
  -2.84      1
  -2.78      2
  -2.72      3
  -2.66      4
  -2.60      5
  -2.54      6
  -2.48      7
  -2.42      8

In [17]: df.loc[idx[:, -2.9:-2.52], ]
Out[17]:
         value
1 -2.90      0
  -2.84      1
  -2.78      2
  -2.72      3
  -2.66      4
  -2.60      5
  -2.54      6

In [18]: df.loc[idx[:, -2.84:-2.52], ]
Out[18]:
         value
1 -2.84      1
  -2.78      2
  -2.72      3
  -2.66      4
  -2.60      5
  -2.54      6

In [19]: df.loc[idx[:, -2.85:-2.52], ]
Out[19]:
         value
1 -2.84      1
  -2.78      2
  -2.72      3
  -2.66      4
  -2.60      5
  -2.54      6
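If you cannot upgrade, one version-independent workaround (a sketch, not the canonical fix) is to skip float slicing entirely and build a boolean mask from the level values:

# filter on the Float64Index level directly instead of slicing with floats;
# no float lookup into the index is performed, so the 0.14.x bug is avoided
lvl = df.index.get_level_values(1)
df[(lvl >= -2.9) & (lvl <= -2.52)]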

Related

Better way to create modified copies of pandas rows based on condition [duplicate]

I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple
values in a cell, I'd like to expand the dataframe so that each item in the list gets its own row (with the same values in all other columns). So if I have:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'trial_num': [1, 2, 3, 1, 2, 3],
     'subject': [1, 1, 1, 2, 2, 2],
     'samples': [list(np.random.randn(3).round(2)) for i in range(6)]
    }
)
df
Out[10]:
samples subject trial_num
0 [0.57, -0.83, 1.44] 1 1
1 [-0.01, 1.13, 0.36] 1 2
2 [1.18, -1.46, -0.94] 1 3
3 [-0.08, -4.22, -2.05] 2 1
4 [0.72, 0.79, 0.53] 2 2
5 [0.4, -0.32, -0.13] 2 3
How do I convert to long form, e.g.:
subject trial_num sample sample_num
0 1 1 0.57 0
1 1 1 -0.83 1
2 1 1 1.44 2
3 1 2 -0.01 0
4 1 2 1.13 1
5 1 2 0.36 2
6 1 3 1.18 0
# etc.
The index is not important, it's OK to set existing columns as the index, and the final ordering isn't important.
Pandas >= 0.25
Series and DataFrame define an .explode() method that expands lists into separate rows. See the docs section on Exploding a list-like column.
df = pd.DataFrame({
    'var1': [['a', 'b', 'c'], ['d', 'e'], [], np.nan],
    'var2': [1, 2, 3, 4]
})
df
var1 var2
0 [a, b, c] 1
1 [d, e] 2
2 [] 3
3 NaN 4
df.explode('var1')
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
2 NaN 3 # empty list converted to NaN
3 NaN 4 # NaN entry preserved as-is
# to reset the index to be monotonically increasing...
df.explode('var1').reset_index(drop=True)
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 NaN 3
6 NaN 4
Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs appropriately (this is a drawback of repeat-based solutions).
However, you should note that explode only works on a single column (for now).
P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.
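Since this answer was written, pandas 1.3 added support for exploding several columns at once, provided the lists in each row have matching lengths. A minimal sketch (the column names here are made up for illustration):

df2 = pd.DataFrame({
    'samples': [[1, 2], [3, 4]],
    'weights': [[0.5, 0.5], [0.9, 0.1]],
    'subject': [1, 2],
})
# both list columns are unpacked in lockstep; unequal per-row lengths raise ValueError
df2.explode(['samples', 'weights'])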
A bit longer than I expected:
>>> df
samples subject trial_num
0 [-0.07, -2.9, -2.44] 1 1
1 [-1.52, -0.35, 0.1] 1 2
2 [-0.17, 0.57, -0.65] 1 3
3 [-0.82, -1.06, 0.47] 2 1
4 [0.79, 1.35, -0.09] 2 2
5 [1.17, 1.14, -1.79] 2 3
>>>
>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'sample'
>>>
>>> df.drop('samples', axis=1).join(s)
subject trial_num sample
0 1 1 -0.07
0 1 1 -2.90
0 1 1 -2.44
1 1 2 -1.52
1 1 2 -0.35
1 1 2 0.10
2 1 3 -0.17
2 1 3 0.57
2 1 3 -0.65
3 2 1 -0.82
3 2 1 -1.06
3 2 1 0.47
4 2 2 0.79
4 2 2 1.35
4 2 2 -0.09
5 2 3 1.17
5 2 3 1.14
5 2 3 -1.79
If you want sequential index, you can apply reset_index(drop=True) to the result.
update:
>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
>>> res = res.reset_index()
>>> res.columns = ['subject','trial_num','sample_num','sample']
>>> res
subject trial_num sample_num sample
0 1 1 0 1.89
1 1 1 1 -2.92
2 1 1 2 0.34
3 1 2 0 0.85
4 1 2 1 0.24
5 1 2 2 0.72
6 1 3 0 -0.96
7 1 3 1 -2.72
8 1 3 2 -0.11
9 2 1 0 -1.33
10 2 1 1 3.13
11 2 1 2 -0.65
12 2 2 0 0.10
13 2 2 1 0.65
14 2 2 2 0.15
15 2 3 0 0.64
16 2 3 1 -0.10
17 2 3 2 -0.76
UPDATE: the solution below was helpful for older Pandas versions, because the DataFrame.explode() wasn’t available. Starting from Pandas 0.25.0 you can simply use DataFrame.explode().
lst_col = 'samples'

r = pd.DataFrame({
    col: np.repeat(df[col].values, df[lst_col].str.len())
    for col in df.columns.drop(lst_col)}
).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]
Result:
In [103]: r
Out[103]:
samples subject trial_num
0 0.10 1 1
1 -0.20 1 1
2 0.05 1 1
3 0.25 1 2
4 1.32 1 2
5 -0.17 1 2
6 0.64 1 3
7 -0.22 1 3
8 -0.71 1 3
9 -0.03 2 1
10 -0.65 2 1
11 0.76 2 1
12 1.77 2 2
13 0.89 2 2
14 0.65 2 2
15 -0.98 2 3
16 0.65 2 3
17 -0.30 2 3
P.S. Here you may find a slightly more generic solution.
UPDATE: some explanations: IMO the easiest way to understand this code is to execute it step by step:
in the following line we repeat the values in one column N times, where N is the length of the corresponding list:
In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)
this can be generalized to all columns containing scalar values:
In [11]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: )
Out[11]:
trial_num subject
0 1 1
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 3 1
.. ... ...
11 1 2
12 2 2
13 2 2
14 2 2
15 3 2
16 3 2
17 3 2
[18 rows x 2 columns]
using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:
In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32, 0.82, -0.59, -0.34, 0.25, 2.09, 0.12, 0.83, -0.88, 0.68, 0.55, -0.56, 0.65, -0.04, 0.36, -0.31])
putting all this together:
In [13]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
trial_num subject samples
0 1 1 -1.04
1 1 1 -0.58
2 1 1 -1.32
3 2 1 0.82
4 2 1 -0.59
5 2 1 -0.34
6 3 1 0.25
.. ... ... ...
11 1 2 0.68
12 2 2 0.55
13 2 2 -0.56
14 2 2 0.65
15 3 2 -0.04
16 3 2 0.36
17 3 2 -0.31
[18 rows x 3 columns]
indexing the result with [df.columns] at the end guarantees that we select the columns in the original order...
you can also use pd.concat and pd.melt for this:
>>> objs = [df, pd.DataFrame(df['samples'].tolist())]
>>> pd.concat(objs, axis=1).drop('samples', axis=1)
subject trial_num 0 1 2
0 1 1 -0.49 -1.00 0.44
1 1 2 -0.28 1.48 2.01
2 1 3 -0.52 -1.84 0.02
3 2 1 1.23 -1.36 -1.06
4 2 2 0.54 0.18 0.51
5 2 3 -2.18 -0.13 -1.35
>>> pd.melt(_, var_name='sample_num', value_name='sample',
... value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
subject trial_num sample_num sample
0 1 1 0 -0.49
1 1 2 0 -0.28
2 1 3 0 -0.52
3 2 1 0 1.23
4 2 2 0 0.54
5 2 3 0 -2.18
6 1 1 1 -1.00
7 1 2 1 1.48
8 1 3 1 -1.84
9 2 1 1 -1.36
10 2 2 1 0.18
11 2 3 1 -0.13
12 1 1 2 0.44
13 1 2 2 2.01
14 1 3 2 0.02
15 2 1 2 -1.06
16 2 2 2 0.51
17 2 3 2 -1.35
last, if you need to, you can sort based on the first three columns.
Trying to work through Roman Pekar's solution step-by-step to understand it better, I came up with my own solution, which uses melt to avoid some of the confusing stacking and index resetting. I can't say that it's obviously a clearer solution though:
items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
# Keep original df index as a column so it's retained after melt
items_as_cols['orig_index'] = items_as_cols.index

melted_items = pd.melt(items_as_cols, id_vars='orig_index',
                       var_name='sample_num', value_name='sample')
melted_items.set_index('orig_index', inplace=True)
df.merge(melted_items, left_index=True, right_index=True)
Output (obviously we can drop the original samples column now):
samples subject trial_num sample_num sample
0 [1.84, 1.05, -0.66] 1 1 0 1.84
0 [1.84, 1.05, -0.66] 1 1 1 1.05
0 [1.84, 1.05, -0.66] 1 1 2 -0.66
1 [-0.24, -0.9, 0.65] 1 2 0 -0.24
1 [-0.24, -0.9, 0.65] 1 2 1 -0.90
1 [-0.24, -0.9, 0.65] 1 2 2 0.65
2 [1.15, -0.87, -1.1] 1 3 0 1.15
2 [1.15, -0.87, -1.1] 1 3 1 -0.87
2 [1.15, -0.87, -1.1] 1 3 2 -1.10
3 [-0.8, -0.62, -0.68] 2 1 0 -0.80
3 [-0.8, -0.62, -0.68] 2 1 1 -0.62
3 [-0.8, -0.62, -0.68] 2 1 2 -0.68
4 [0.91, -0.47, 1.43] 2 2 0 0.91
4 [0.91, -0.47, 1.43] 2 2 1 -0.47
4 [0.91, -0.47, 1.43] 2 2 2 1.43
5 [-1.14, -0.24, -0.91] 2 3 0 -1.14
5 [-1.14, -0.24, -0.91] 2 3 1 -0.24
5 [-1.14, -0.24, -0.91] 2 3 2 -0.91
For those looking for a version of Roman Pekar's answer that avoids manual column naming:
column_to_explode = 'samples'
res = (df
       .set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
       .apply(pd.Series)
       .stack()
       .reset_index())
res = res.rename(columns={
    res.columns[-2]: 'exploded_{}_index'.format(column_to_explode),
    res.columns[-1]: '{}_exploded'.format(column_to_explode)})
I found the easiest way was to convert the samples column into a DataFrame, join it with the original df, and then melt. Shown here:
df.samples.apply(lambda x: pd.Series(x)).join(df).\
    melt(['subject', 'trial_num'], [0, 1, 2], var_name='sample')
subject trial_num sample value
0 1 1 0 -0.24
1 1 2 0 0.14
2 1 3 0 -0.67
3 2 1 0 -1.52
4 2 2 0 -0.00
5 2 3 0 -1.73
6 1 1 1 -0.70
7 1 2 1 -0.70
8 1 3 1 -0.29
9 2 1 1 -0.70
10 2 2 1 -0.72
11 2 3 1 1.30
12 1 1 2 -0.55
13 1 2 2 0.10
14 1 3 2 -0.44
15 2 1 2 0.13
16 2 2 2 -1.44
17 2 3 2 0.73
It's worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.
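If the lists do differ in length, one hedged variation is to let pd.DataFrame pad the short rows with NaN and drop the padding after melting (a sketch, assuming the same df as above):

wide = pd.DataFrame(df['samples'].tolist(), index=df.index)   # ragged rows are padded with NaN
long_df = (df.drop(columns='samples')
             .join(wide)
             .melt(id_vars=['subject', 'trial_num'],
                   var_name='sample_num', value_name='sample')
             .dropna(subset=['sample']))                      # remove the NaN padding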
Try this in pandas >= 0.25:
import pandas as pd

df = pd.DataFrame([{'Product': 'Coke', 'Prices': [100, 123, 101, 105, 99, 94, 98]},
                   {'Product': 'Pepsi', 'Prices': [101, 104, 104, 101, 99, 99, 99]}])
print(df)

# 'Prices' already holds lists, so it can be exploded directly;
# str.split would only be needed if it held delimited strings
df = df.explode('Prices')
print(df)
Very late answer but I want to add this:
A fast solution using vanilla Python that also takes care of the sample_num column in the OP's example. On my own large dataset with over 10 million rows (and a result of 28 million rows) this takes only about 38 seconds. The accepted solution completely breaks down with that amount of data and leads to a memory error on my system, which has 128 GB of RAM.
df = df.reset_index(drop=True)
lstcol = df.lstcol.values
lstcollist = []
indexlist = []
countlist = []
for ii in range(len(lstcol)):
    lstcollist.extend(lstcol[ii])             # flattened list values
    indexlist.extend([ii] * len(lstcol[ii]))  # original row index, repeated
    countlist.extend([jj for jj in range(len(lstcol[ii]))])  # position within each list
df = pd.merge(df.drop("lstcol", axis=1),
              pd.DataFrame({"lstcol": lstcollist, "lstcol_num": countlist},
                           index=indexlist),
              left_index=True, right_index=True).reset_index(drop=True)
Also very late, but here is an answer from Karvy1 that worked well for me if you don't have pandas >= 0.25: https://stackoverflow.com/a/52511166/10740287
For the example above you may write:
data = [(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples]
data = pd.DataFrame(data, columns=['subject', 'trial_num', 'samples'])
Speed test:
%timeit data = pd.DataFrame([(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples], columns=['subject', 'trial_num', 'samples'])
1.33 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit data = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack().reset_index()
4.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data = pd.DataFrame({col: np.repeat(df[col].values, df['samples'].str.len()) for col in df.columns.drop('samples')}).assign(**{'samples': np.concatenate(df['samples'].values)})
1.38 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Creating new pandas dataframe from pivot table condition

I have a dataframe that looks like this:
df
Out[42]:
       Unnamed: 0  Unnamed: 0.1                 Region    GeneID  DistanceValue
0           25520         25520        Olfactory areas  69835573      -1.000000
1           25521         25521        Olfactory areas    583846      -1.000000
2           25522         25522        Olfactory areas  68667661      -1.000000
3           25523         25523        Olfactory areas  70474965      -1.000000
4           25524         25524        Olfactory areas  68341920      -1.000000
...           ...           ...                    ...       ...            ...
15662     1072369       1072369  Cerebellum unspecific  74743327      -0.960186
15663     1072370       1072370  Cerebellum unspecific  69530983      -0.960139
15664     1072371       1072371  Cerebellum unspecific  68442853      -0.960129
15665     1072372       1072372  Cerebellum unspecific  74514339      -0.960038
15666     1072373       1072373  Cerebellum unspecific  70724637      -0.960003

[15667 rows x 5 columns]
I want to count GeneIDs and create a new df that contains only the rows whose GeneID appears more than 5 times, so I did:
genelist = df.pivot_table(index=['GeneID'], aggfunc='size')
sort_genelist = genelist.sort_values(axis=0,ascending=False)
sort_genelist
Out[44]:
GeneID
631707 11
68269286 10
633269 10
70302366 9
74357905 9
..
70784714 1
70784824 1
70784898 1
70784916 1
70528527 1
Length: 7875, dtype: int64
So now I want my df dataframe to contain just the rows with the IDs that were counted more than 5 times.
Use Series.isin with the index values of sort_genelist where the count is greater than 5, and filter by boolean indexing:
df = df[df['GeneID'].isin(sort_genelist.index[sort_genelist > 5])]
I think that the best way to do what you have asked is:
df['gene_id_count'] = df.groupby('GeneID').transform(len)
df.loc[df['gene_id_count'] > 5, :]
Let's take this tiny example:
>>> df = pd.DataFrame({'GeneID': [1,1,1,3,4,5,5,4], 'ID': range(8)})
>>> df
GeneID ID
0 1 0
1 1 1
2 1 2
3 3 3
4 4 4
5 5 5
6 5 6
7 4 7
And consider 2 occurrences (instead of 5)
min_gene_id_count = 2
>>> df['gene_id_count'] = df.groupby('GeneID').transform(len)
>>> df
GeneID ID gene_id_count
0 1 0 3
1 1 1 3
2 1 2 3
3 3 3 1
4 4 4 2
5 5 5 2
6 5 6 2
7 4 7 2
>>> df.loc[df['gene_id_count'] > min_gene_id_count , :]
GeneID ID gene_id_count
0 1 0 3
1 1 1 3
2 1 2 3
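A couple of equivalent formulations, for reference (a sketch against the same df; note that df.groupby('GeneID')['GeneID'].transform('size') avoids the ambiguity of transform(len) when the frame has several value columns):

# mask built from value counts
counts = df['GeneID'].value_counts()
df[df['GeneID'].isin(counts.index[counts > min_gene_id_count])]

# or, more compact but slower on large frames
df.groupby('GeneID').filter(lambda g: len(g) > min_gene_id_count)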

Python Pandas Memory Error during Drop

I have a df of 825468 rows.
I am performing this over it.
frame = frame.drop(
    frame.loc[frame['RR'].str.contains(r"^([23])[^-]*-\1[^-]*$"), 'RR']
    .str.replace("[23]([^-]*)-[23]([^-]*)", r"\1-\2")
    .isin(series1.str.replace("1([^-]*)-1([^-]*)", r"\1-\2"))
    [lambda d: d].index)
where
series1 = frame.loc[frame['RR'].str.contains("^1[^-]*-1"), 'RR']
What it does is prepare a series of RR values of the form 1abc-1bcd; then, if the frame contains an RR like 2abc-2bcd that becomes abc-bcd after the replacement and is also present in the series after its replacement, that row is dropped.
But it gives a MemoryError. Is there a more efficient way to perform the same operation?
For example, given this df:
RR
0 2abc-2abc
1 1abc-1abc
2 3abc-3abc
3 2def-2def
4 3def-3def
5 def-dfd
6 sdsd-sdsd
7 1def-1def
Then from this frame, 2abc-2abc and 3abc-3abc should be dropped: after removing the 2 or 3 they become abc-abc, and removing the 1 from 1abc-1abc also gives abc-abc. Likewise 2def-2def and 3def-3def are dropped because 1def-1def is present (if there were no 1def-1def, 2def-2def would not be dropped).
Output:
RR
0 1abc-1abc
1 def-dfd
2 sdsd-sdsd
3 1def-1def
UPDATE2:
In [176]: df
Out[176]:
RR
0 2abc-2abc
1 3abc-3abc
2 2def-2def
3 3def-3def
4 def-dfd
5 sdsd-sdsd
6 1def-1def
7 abc-abc
8 def-def
In [177]: df[['d1','s','s2']] = df.RR.str.extract(r'^(?P<d1>\d+)(?P<s1>[^-]*)-\1(?P<s2>[^-]*)', expand=True)
In [178]: df
Out[178]:
RR d1 s s2
0 2abc-2abc 2 abc abc
1 3abc-3abc 3 abc abc
2 2def-2def 2 def def
3 3def-3def 3 def def
4 def-dfd NaN NaN NaN
5 sdsd-sdsd NaN NaN NaN
6 1def-1def 1 def def
7 abc-abc NaN NaN NaN
8 def-def NaN NaN NaN
In [179]: df.s += df.pop('s2')
In [180]: df
Out[180]:
RR d1 s
0 2abc-2abc 2 abcabc
1 3abc-3abc 3 abcabc
2 2def-2def 2 defdef
3 3def-3def 3 defdef
4 def-dfd NaN NaN
5 sdsd-sdsd NaN NaN
6 1def-1def 1 defdef
7 abc-abc NaN NaN
8 def-def NaN NaN
In [181]: result = df.loc[~df.s.isin(df.loc[df.d1 == '1', 's']) | (~df.d1.isin(['2','3'])), 'RR']
In [182]: result
Out[182]:
0 2abc-2abc
1 3abc-3abc
4 def-dfd
5 sdsd-sdsd
6 1def-1def
7 abc-abc
8 def-def
Name: RR, dtype: object
UPDATE:
In [171]: df
Out[171]:
RR
0 2abc-2abc
1 1abc-1abc
2 3abc-3abc
3 2def-2def
4 3def-3def
5 def-dfd
6 sdsd-sdsd
7 1def-1def
8 abc-abc
NOTE: I have intentionally added an 8th row, abc-abc, which should NOT be dropped (if I understood your question correctly)
Solution 1: using .str.replace() and drop_duplicates() methods:
In [178]: (df.sort_values('RR')
...: .RR
...: .str.replace("[23]([^-]*)-[23]([^-]*)", r"1\1-1\2")
...: .drop_duplicates()
...: )
...:
Out[178]:
1 1abc-1abc
7 1def-1def
8 abc-abc
5 def-dfd
6 sdsd-sdsd
Name: RR, dtype: object
Solution 2: using .str.replace() and .str.contains() methods and boolean indexing:
In [172]: df.loc[~df.sort_values('RR')
...: .RR
...: .str.replace("[23]([^-]*)-[23]([^-]*)", r"_\1-_\2")
...: .str.contains(r"^_[^-]*-_")]
...:
Out[172]:
RR
1 1abc-1abc
5 def-dfd
6 sdsd-sdsd
7 1def-1def
8 abc-abc
NOTE: you may want to replace '_' with another symbol (or symbols) that will never occur in the RR column

Most efficient way to pass data from one pandas DataFrame to another

I'm trying to find a more efficient way of transferring information from one DataFrame to another by iterating rows. I have 2 DataFrames, one containing unique values called 'id' in a column and a value called 'region' in another column:
dfkey = DataFrame({'id': [1122, 3344, 3467, 1289, 7397, 1209, 5678, 1792, 1928, 4262, 9242],
                   'region': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
id region
0 1122 1
1 3344 2
2 3467 3
3 1289 4
4 7397 5
5 1209 6
6 5678 7
7 1792 8
8 1928 9
9 4262 10
10 9242 11
...the other DataFrame contains these same ids, but now sometimes repeated and without any order:
df2 = DataFrame({'id': [1792, 1122, 3344, 1122, 3467, 1289, 7397, 1209, 5678],
                 'other': [3, 2, 3, 4, 3, 5, 7, 3, 1]})
id other
0 1792 3
1 1122 2
2 3344 3
3 1122 4
4 3467 3
5 1289 5
6 7397 7
7 1209 3
8 5678 1
I want to use the dfkey DataFrame as a key to input the region of each id in the df2 DataFrame. I already found a way to do this with iterrows(), but it involves nested loops:
df2['region'] = 0
for i, rowk in dfkey.iterrows():
    for j, rowd in df2.iterrows():
        if rowk['id'] == rowd['id']:
            rowd['region'] = rowk['region']
id other region
0 1792 3 8
1 1122 2 1
2 3344 3 2
3 1122 4 1
4 3467 3 3
5 1289 5 4
6 7397 7 5
7 1209 3 6
8 5678 1 7
The actual dfkey I have has 43K rows and the df2 600K rows. The code has been running for an hour now so I'm wondering if there's a more efficient way of doing this...
pandas.merge could be another solution.
newdf = pandas.merge(df2, dfkey, on='id')
In [22]: newdf
Out[22]:
id other region
0 1792 3 8
1 1122 2 1
2 1122 4 1
3 3344 3 2
4 3467 3 3
5 1289 5 4
6 7397 7 5
7 1209 3 6
8 5678 1 7
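Note that pd.merge defaults to an inner join, so any id in df2 that is missing from dfkey would silently drop rows; if that can happen, a left merge keeps them (with NaN region) and also preserves df2's row order:

newdf = pandas.merge(df2, dfkey, on='id', how='left')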
I would use the map() method:
In [268]: df2['region'] = df2['id'].map(dfkey.set_index('id').region)
In [269]: df2
Out[269]:
id other region
0 1792 3 8
1 1122 2 1
2 3344 3 2
3 1122 4 1
4 3467 3 3
5 1289 5 4
6 7397 7 5
7 1209 3 6
8 5678 1 7
Timing for 900K rows df2 DF:
In [272]: df2 = pd.concat([df2] * 10**5, ignore_index=True)
In [273]: df2.shape
Out[273]: (900000, 3)
In [274]: dfkey.shape
Out[274]: (11, 2)
In [275]: %timeit df2['id'].map(dfkey.set_index('id').region)
10 loops, best of 3: 176 ms per loop
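One caveat: map() needs the lookup index to be unique, so this assumes each id appears once in dfkey. If duplicates are possible, deduplicate first (a sketch):

# set_index('id') with duplicate ids would make .map() raise InvalidIndexError
mapping = dfkey.drop_duplicates('id').set_index('id')['region']
df2['region'] = df2['id'].map(mapping)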

Python / pandas: Fastest way to set and retrieve data, without chained assignment

I am doing some routines that access scalars and vectors from a pandas dataframe and then set the results after some calculations.
Initially I used the form df[var][index] to do this, but encountered problems with chained assignment (http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-view-versus-copy).
So I changed it to use df.loc[index, var], which solved the view/copy problem but is very slow. For arrays I convert to a pandas Series and use the built-in df.update(). I am now searching for the fastest/best way of doing this without having to worry about chained assignment. The documentation says that, for example, df.at[] is the quickest way to access scalars. Does anyone have any experience with this, or can point at some literature that can help?
Thanks
Edit: Code looks like this, which I think is pretty standard.
def set_var(self, name, periode, value):
    try:
        if name.upper() not in self.data:
            self.data[name.upper()] = num.NaN
        self.data.loc[periode, name.upper()] = value
    except Exception:
        print('Failed to set ' + name)

def get_var(self, name, periode):
    ''' Get value '''
    try:
        value = self.data.loc[periode, name.upper()]
        return value
    except KeyError:
        return num.NaN

def set_series(self, data, index):
    outputserie = pd.Series(data, index)
    self.data.update(outputserie)
dataframe looks like this:
SC0.data
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 148 entries, 1980Q1 to 2016Q4
Columns: 3111 entries, CAP1 to CHH_DRD
dtypes: float64(3106), int64(2), object(3)
edit2:
a df could look like
var var1
2012Q4 0.462015 0.01585
2013Q1 0.535161 0.01577
2013Q2 0.735432 0.01401
2013Q3 0.845959 0.01638
2013Q4 0.776809 0.01657
2014Q1 0.000000 0.01517
2014Q2 0.000000 0.01593
and I basically want to perform two operations:
1) perhaps update var1 with the same scalar over all periods
2) solve var in 2014Q1 from the lagged values, along the lines of var(2014Q1) = var1(2013Q3) / var1(2013Q4) * var(2013Q4)
This is done as part of a bigger model setup, which is read from a txt file. Since I'm doing loads of these calculations, the speed of setting and reading data matters.
The example you gave above can be vectorized.
In [3]: df = DataFrame(dict(A = np.arange(10), B = np.arange(10)),index=pd.period_range('2012',freq='Q',periods=10))
In [4]: df
Out[4]:
A B
2012Q1 0 0
2012Q2 1 1
2012Q3 2 2
2012Q4 3 3
2013Q1 4 4
2013Q2 5 5
2013Q3 6 6
2013Q4 7 7
2014Q1 8 8
2014Q2 9 9
Assign a scalar
In [5]: df['A'] = 5
In [6]: df
Out[6]:
A B
2012Q1 5 0
2012Q2 5 1
2012Q3 5 2
2012Q4 5 3
2013Q1 5 4
2013Q2 5 5
2013Q3 5 6
2013Q4 5 7
2014Q1 5 8
2014Q2 5 9
Perform a shifted operation
In [8]: df['C'] = df['B'].shift()/df['B'].shift(2)
In [9]: df
Out[9]:
A B C
2012Q1 5 0 NaN
2012Q2 5 1 NaN
2012Q3 5 2 inf
2012Q4 5 3 2.000000
2013Q1 5 4 1.500000
2013Q2 5 5 1.333333
2013Q3 5 6 1.250000
2013Q4 5 7 1.200000
2014Q1 5 8 1.166667
2014Q2 5 9 1.142857
Using a vectorized assignment
In [10]: df.loc[df['B']>5,'D'] = 'foo'
In [11]: df
Out[11]:
A B C D
2012Q1 5 0 NaN NaN
2012Q2 5 1 NaN NaN
2012Q3 5 2 inf NaN
2012Q4 5 3 2.000000 NaN
2013Q1 5 4 1.500000 NaN
2013Q2 5 5 1.333333 NaN
2013Q3 5 6 1.250000 foo
2013Q4 5 7 1.200000 foo
2014Q1 5 8 1.166667 foo
2014Q2 5 9 1.142857 foo
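To answer the scalar-access part of the question directly: .at (label-based) and .iat (position-based) are the fast scalar accessors, and because each get or set is a single indexing operation there is no chained assignment to worry about. A minimal sketch against the df above:

p = pd.Period('2013Q4', freq='Q')
df.at[p, 'A'] = 7        # set one scalar by label
x = df.at[p, 'A']        # read it back
y = df.iat[0, 1]         # positional equivalent (row 0, column 'B')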
