Python / pandas: Fastest way to set and retrieve data, without chained assignment

I am writing some routines that access scalars and vectors in a pandas DataFrame, and then set the results after some calculations.
Initially I used the form df[var][index] to do this, but ran into problems with chained assignment (http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-view-versus-copy).
So I changed it to df.loc[index, var], which solved the view/copy problem, but it is very slow. For arrays I convert the data to a pandas Series and use the built-in df.update(). I am now searching for the fastest/best way of doing this, without having to worry about chained assignment. The documentation says that, for example, df.at[] is the quickest way to access scalars. Does anyone have any experience with this? Or can you point to some literature that can help?
Thanks
Edit: Code looks like this, which I think is pretty standard.
# assumes: import pandas as pd; import numpy as num
def set_var(self, name, periode, value):
    '''Set a scalar value for one variable in one period.'''
    try:
        if name.upper() not in self.data:
            self.data[name.upper()] = num.NaN  # create the column first
        self.data.loc[periode, name.upper()] = value
    except Exception:
        print('Failed to set ' + name)

def get_var(self, navn, periode):
    '''Get a scalar value.'''
    try:
        return self.data.loc[periode, navn.upper()]
    except Exception:
        print('Failed to get ' + navn)

def set_series(self, data, index):
    '''Set a whole series of values at once via update().'''
    outputserie = pd.Series(data, index)
    self.data.update(outputserie)
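Since the question asks specifically about df.at[]: below is a minimal sketch (method names like set_var_at are hypothetical, same assumed self.data frame) of the setter/getter using the fast scalar accessor. Note that .at requires the column to already exist:
def set_var_at(self, name, periode, value):
    '''Scalar set via .at -- faster than .loc for a single cell.'''
    col = name.upper()
    if col not in self.data:
        self.data[col] = num.NaN  # .at cannot create a new column
    self.data.at[periode, col] = value

def get_var_at(self, navn, periode):
    '''Scalar get via .at.'''
    return self.data.at[periode, navn.upper()]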
The DataFrame looks like this:
SC0.data
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 148 entries, 1980Q1 to 2016Q4
Columns: 3111 entries, CAP1 to CHH_DRD
dtypes: float64(3106), int64(2), object(3)
Edit 2: A df could look like:
var var1
2012Q4 0.462015 0.01585
2013Q1 0.535161 0.01577
2013Q2 0.735432 0.01401
2013Q3 0.845959 0.01638
2013Q4 0.776809 0.01657
2014Q1 0.000000 0.01517
2014Q2 0.000000 0.01593
and I basically want to perform two operations:
1) perhaps update var1 with the same scalar over all periods
2) solve var in 2014Q1 as var(2014Q1) = var1(2013Q3)/var1(2013Q4) * var(2013Q4)
This is done as part of a bigger model setup, which is read from a txt file. Since I am doing loads of these calculations, the speed of setting and reading data matters.

The example you gave above can be vectorized.
In [3]: df = DataFrame(dict(A = np.arange(10), B = np.arange(10)),index=pd.period_range('2012',freq='Q',periods=10))
In [4]: df
Out[4]:
A B
2012Q1 0 0
2012Q2 1 1
2012Q3 2 2
2012Q4 3 3
2013Q1 4 4
2013Q2 5 5
2013Q3 6 6
2013Q4 7 7
2014Q1 8 8
2014Q2 9 9
Assign a scalar
In [5]: df['A'] = 5
In [6]: df
Out[6]:
A B
2012Q1 5 0
2012Q2 5 1
2012Q3 5 2
2012Q4 5 3
2013Q1 5 4
2013Q2 5 5
2013Q3 5 6
2013Q4 5 7
2014Q1 5 8
2014Q2 5 9
Perform a shifted operation
In [8]: df['C'] = df['B'].shift()/df['B'].shift(2)
In [9]: df
Out[9]:
A B C
2012Q1 5 0 NaN
2012Q2 5 1 NaN
2012Q3 5 2 inf
2012Q4 5 3 2.000000
2013Q1 5 4 1.500000
2013Q2 5 5 1.333333
2013Q3 5 6 1.250000
2013Q4 5 7 1.200000
2014Q1 5 8 1.166667
2014Q2 5 9 1.142857
Using a vectorized assignment
In [10]: df.loc[df['B']>5,'D'] = 'foo'
In [11]: df
Out[11]:
A B C D
2012Q1 5 0 NaN NaN
2012Q2 5 1 NaN NaN
2012Q3 5 2 inf NaN
2012Q4 5 3 2.000000 NaN
2013Q1 5 4 1.500000 NaN
2013Q2 5 5 1.333333 NaN
2013Q3 5 6 1.250000 foo
2013Q4 5 7 1.200000 foo
2014Q1 5 8 1.166667 foo
2014Q2 5 9 1.142857 foo

Related

Divide several columns in a python dataframe where both the numerator and denominator columns will vary based on a picklist

I'm creating a dataframe by paring down a very large dataframe (approximately 400 columns) based on choices an end user makes on a picklist. One of the picklist choices is the type of denominator that the end user would like. Here is one example table with all the information before the final calculation is made.
county _tcount _tvote _f_npb_18_count _f_npb_18_vote
countycode
35 San Benito 28194 22335 2677 1741
36 San Bernardino 912653 661838 108724 61832
countycode _f_npb_30_count _f_npb_30_vote
35 384 288
36 76749 53013
However, I am having trouble creating code that will automatically divide every column starting with the 5th (not including the index) by the column before it (skipping every other column). I've seen examples (Divide multiple columns by another column in pandas), but they all use fixed column names, which is not achievable for this aspect. I've been able to divide variable columns (chosen by position) by fixed columns, but not variable columns by other variable columns chosen by position. I've tried modifying the code in the above link based on the column positions:
calculated_frame = [county_select_frame[county_select_frame.columns[5::2]].div(county_select_frame[4::2], axis=0)]
output:
[ county _tcount _tvote _f_npb_18_count _f_npb_18_vote \
countycode
35 NaN NaN NaN NaN NaN
36 NaN NaN NaN NaN NaN]
RuntimeWarning: invalid value encountered in greater
(abs_vals > 0)).any()
The use of [5::2] does work when the divisor is a fixed field. If I can't get this to work, it's not a big deal (but it would be great to have all the options I wanted).
My preference would be to set the index and use filter to split out separate counts and votes DataFrames, then use join:
d1 = df.set_index('county', append=True)
counts = d1.filter(regex='.*_\d+_count$').rename(columns=lambda x: x.replace('_count', ''))
votes = d1.filter(regex='.*_\d+_vote$').rename(columns=lambda x: x.replace('_vote', ''))
d1[['_tcount', '_tvote']].join(votes / counts)
_tcount _tvote _f_npb_18 _f_npb_30
countycode county
35 San Benito 28194 22335 0.650355 0.750000
36 San Bernardino 912653 661838 0.568706 0.690732
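Why this works: the regex filters pick out the *_count and *_vote column families, and stripping the suffixes gives the two frames identical column names, so votes / counts divides matching columns by label rather than by position.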
I think you can divide the underlying numpy arrays, obtained with values, because then pandas does not align column names. Finally, create a new DataFrame with the constructor:
arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])
Sample:
np.random.seed(10)
county_select_frame = pd.DataFrame(np.random.randint(10, size=(10,10)),
columns=list('abcdefghij'))
print (county_select_frame)
a b c d e f g h i j
0 9 4 0 1 9 0 1 8 9 0
1 8 6 4 3 0 4 6 8 1 8
2 4 1 3 6 5 3 9 6 9 1
3 9 4 2 6 7 8 8 9 2 0
4 6 7 8 1 7 1 4 0 8 5
5 4 7 8 8 2 6 2 8 8 6
6 6 5 6 0 0 6 9 1 8 9
7 1 2 8 9 9 5 0 2 7 3
8 0 4 2 0 3 3 1 2 5 9
9 0 1 0 1 9 0 9 2 1 1
arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])
print (df1)
f h j
0 0.000000 8.000000 0.000000
1 inf 1.333333 8.000000
2 0.600000 0.666667 0.111111
3 1.142857 1.125000 0.000000
4 0.142857 0.000000 0.625000
5 3.000000 4.000000 0.750000
6 inf 0.111111 1.125000
7 0.555556 inf 0.428571
8 1.000000 2.000000 1.800000
9 0.000000 0.222222 1.000000
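Note the design choice here: dividing raw arrays sidesteps pandas' label alignment entirely, which is also why the .div attempt in the question returned all NaN -- the numerator and denominator column names never matched.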
How about something like:
cols = my_df.columns
for i in range(2, 6):
    print('Creating new col %s' % cols[i])
    my_df['new_{0}'.format(cols[i])] = my_df[cols[i]] / my_df[cols[i - 1]]
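This loops in plain Python, which is fine for a handful of columns; the vectorized answers above will scale better across many columns.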

Python Pandas Memory Error during Drop

I have a df of 825468 rows, and I am performing this operation on it:
frame = frame.drop(frame.loc[(
frame['RR'].str.contains(r"^([23])[^-]*-\1[^-]*$")), 'RR'].str.replace("[23]([^-]*)-[23]([^-]*)", r"\1-\2").isin(
series1.str.replace("1([^-]*)-1([^-]*)", r"\1-\2"))[lambda d: d].index)
where
series1 = frame.loc[frame['RR'].str.contains("^1[^-]*-1"), 'RR']
So what it does is:
it prepares a Series of rows where RR has a value like 1abc-1bcd; then, if the frame contains an RR like 2abc-2bcd, which after replacement becomes abc-bcd and is also present in the Series after replacement, that row is dropped.
But it gives a MemoryError. Is there a more efficient way to perform the same operation?
For example, in a df like:
RR
0 2abc-2abc
1 1abc-1abc
2 3abc-3abc
3 2def-2def
4 3def-3def
5 def-dfd
6 sdsd-sdsd
7 1def-1def
Then from this frame, 2abc-2abc and 3abc-3abc should be dropped, as after removing the 2 and 3 they become abc-abc, and when we remove the 1 from 1abc-1abc it is also abc-abc. 2def-2def should not be dropped, as there is no 1def-1def.
Output:
RR
0 1abc-1abc
1 def-dfd
2 sdsd-sdsd
3 1def-1def
UPDATE2:
In [176]: df
Out[176]:
RR
0 2abc-2abc
1 3abc-3abc
2 2def-2def
3 3def-3def
4 def-dfd
5 sdsd-sdsd
6 1def-1def
7 abc-abc
8 def-def
In [177]: df[['d1','s','s2']] = df.RR.str.extract(r'^(?P<d1>\d+)(?P<s1>[^-]*)-\1(?P<s2>[^-]*)', expand=True)
In [178]: df
Out[178]:
RR d1 s s2
0 2abc-2abc 2 abc abc
1 3abc-3abc 3 abc abc
2 2def-2def 2 def def
3 3def-3def 3 def def
4 def-dfd NaN NaN NaN
5 sdsd-sdsd NaN NaN NaN
6 1def-1def 1 def def
7 abc-abc NaN NaN NaN
8 def-def NaN NaN NaN
In [179]: df.s += df.pop('s2')
In [180]: df
Out[180]:
RR d1 s
0 2abc-2abc 2 abcabc
1 3abc-3abc 3 abcabc
2 2def-2def 2 defdef
3 3def-3def 3 defdef
4 def-dfd NaN NaN
5 sdsd-sdsd NaN NaN
6 1def-1def 1 defdef
7 abc-abc NaN NaN
8 def-def NaN NaN
In [181]: result = df.loc[~df.s.isin(df.loc[df.d1 == '1', 's']) | (~df.d1.isin(['2','3'])), 'RR']
In [182]: result
Out[182]:
0 2abc-2abc
1 3abc-3abc
4 def-dfd
5 sdsd-sdsd
6 1def-1def
7 abc-abc
8 def-def
Name: RR, dtype: object
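To materialize the kept rows as a frame (a hedged usage note, not from the original answer): since result is the RR Series carrying the original index, something like
df_kept = result.to_frame()
should do.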
UPDATE:
In [171]: df
Out[171]:
RR
0 2abc-2abc
1 1abc-1abc
2 3abc-3abc
3 2def-2def
4 3def-3def
5 def-dfd
6 sdsd-sdsd
7 1def-1def
8 abc-abc
NOTE: I have intentionally added an 8th row, abc-abc, which should NOT be dropped (if I understood your question correctly).
Solution 1: using .str.replace() and drop_duplicates() methods:
In [178]: (df.sort_values('RR')
...: .RR
...: .str.replace("[23]([^-]*)-[23]([^-]*)", r"1\1-1\2")
...: .drop_duplicates()
...: )
...:
Out[178]:
1 1abc-1abc
7 1def-1def
8 abc-abc
5 def-dfd
6 sdsd-sdsd
Name: RR, dtype: object
Solution 2: using .str.replace() and .str.contains() methods and boolean indexing:
In [172]: df.loc[~df.sort_values('RR')
...: .RR
...: .str.replace("[23]([^-]*)-[23]([^-]*)", r"_\1-_\2")
...: .str.contains(r"^_[^-]*-_")]
...:
Out[172]:
RR
1 1abc-1abc
5 def-dfd
6 sdsd-sdsd
7 1def-1def
8 abc-abc
NOTE: you may want to replace '_' with another symbol (or symbols) that will never occur in the RR column.

Rolling sum in subgroups of a dataframe (pandas)

I have sessions dataframe that contains E-mail and Sessions (int) columns.
I need to calculate rolling sum of sessions per email (i.e. not globally).
Now, the following works, but it's painfully slow:
emails = set(list(sessions['E-mail']))
ses_sums = []
for em in emails:
    email_sessions = sessions[sessions['E-mail'] == em]
    email_sessions.is_copy = False
    email_sessions['Session_Rolling_Sum'] = pd.rolling_sum(email_sessions['Sessions'], window=self.window).fillna(0)
    ses_sums.append(email_sessions)
df = pd.concat(ses_sums, ignore_index=True)
Is there a way of achieving the same in pandas, but using pandas operators on a dataframe instead of creating separate dataframes for each email and then concatenating them?
(either that or some other way of making this faster)
Setup
np.random.seed([3,1415])
df = pd.DataFrame({'E-Mail': np.random.choice(list('AB'), 20),
                   'Session': np.random.randint(1, 10, 20)})
Solution
The current and proper way to do this is with rolling().sum(), which can be used on the result of a pd.Series groupby object.
# Series Group By
# /------------------------\
df.groupby('E-Mail').Session.rolling(3).sum()
# \--------------/
# Method you want
E-Mail
A 0 NaN
2 NaN
4 11.0
5 7.0
7 10.0
12 16.0
15 16.0
17 16.0
18 17.0
19 18.0
B 1 NaN
3 NaN
6 18.0
8 14.0
9 16.0
10 12.0
11 13.0
13 16.0
14 20.0
16 22.0
Name: Session, dtype: float64
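If you want this back as a column of the original frame, the group level of the resulting MultiIndex has to be dropped first so the values align on the original index; a minimal sketch:
df['Session_Rolling_Sum'] = (df.groupby('E-Mail').Session
                               .rolling(3).sum()
                               .reset_index(level=0, drop=True))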
Details
df
E-Mail Session
0 A 9
1 B 7
2 A 1
3 B 3
4 A 1
5 A 5
6 B 8
7 A 4
8 B 3
9 B 5
10 B 4
11 B 4
12 A 7
13 B 8
14 B 8
15 A 5
16 B 6
17 A 4
18 A 8
19 A 6
Say you start with
In [58]: df = pd.DataFrame({'E-Mail': ['foo'] * 3 + ['bar'] * 3 + ['foo'] * 3, 'Session': range(9)})
In [59]: df
Out[59]:
E-Mail Session
0 foo 0
1 foo 1
2 foo 2
3 bar 3
4 bar 4
5 bar 5
6 foo 6
7 foo 7
8 foo 8
In [60]: df[['Session']].groupby(df['E-Mail']).apply(pd.rolling_sum, 3)
Out[60]:
Session
E-Mail
bar 3 NaN
4 NaN
5 12.0
foo 0 NaN
1 NaN
2 3.0
6 9.0
7 15.0
8 21.0
Incidentally, note that I just rearranged your rolling_sum, but it has been deprecated - you should now use rolling:
df[['Session']].groupby(df['E-Mail']).apply(lambda g: g.rolling(3).sum())

pandas.DataFrame set all string values to nan

I have a pandas.DataFrame that contains string, float and int types.
Is there a way to set all strings that cannot be converted to float to NaN ?
For example:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 "wajdi"
to:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 NaN
You can use pd.to_numeric and set errors='coerce'
pandas.to_numeric
df['D'] = pd.to_numeric(df.D, errors='coerce')
Which will give you:
A B C D
0 1 2 5.0 7.0
1 0 4 NaN 15.0
2 4 8 9.0 10.0
3 11 5 8.0 0.0
4 11 5 8.0 NaN
Deprecated solution (pandas <= 0.20 only):
df.convert_objects(convert_numeric=True)
pandas.DataFrame.convert_objects
Here's the dev note in the convert_objects source code: # TODO: Remove in 0.18 or 2017, whichever is sooner. So don't make this a long-term solution if you use it.
Here is a way:
df['E'] = pd.to_numeric(df.D, errors='coerce')
And then you have:
A B C D E
0 1 2 5.0 7 7.0
1 0 4 NaN 15 15.0
2 4 8 9.0 10 10.0
3 11 5 8.0 0 0.0
4 11 5 8.0 wajdi NaN
You can use pd.to_numeric with errors='coerce'.
In [30]: df = pd.DataFrame({'a': [1, 2, 'NaN', 'bob', 3.2]})
In [31]: pd.to_numeric(df.a, errors='coerce')
Out[31]:
0 1.0
1 2.0
2 NaN
3 NaN
4 3.2
Name: a, dtype: float64
Here is one way to apply it to all columns:
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors='coerce')
(See comment by NinjaPuppy for a better way.)
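That better way is presumably the column-wise apply, which coerces every column in one call:
df = df.apply(pd.to_numeric, errors='coerce')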

How to get log rate of change between rows in Pandas DataFrame effectively?

Let's say I have some DataFrame (with about 10000 rows in my case, this is just a minimal example)
>>> import pandas as pd
>>> sample_df = pd.DataFrame(
...     {'col1': list(range(1, 10)), 'col2': list(range(10, 19))})
>>> sample_df
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 6 15
6 7 16
7 8 17
8 9 18
For my purposes, I need to calculate the series represented by ln(col_i(n+1) / col_i(n)) for each col_i in my DataFrame, where n represents a row number.
How can I calculate this?
Background knowledge
I know that I can get the difference between each column in a very simple way using
>>> sample_df.diff()
col1 col2
0 NaN NaN
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
Or the percentage change, which is (col_i(n+1) - col_i(n))/col_i(n), using
>>> sample_df.pct_change()
col1 col2
0 NaN NaN
1 1.000000 0.100000
2 0.500000 0.090909
3 0.333333 0.083333
4 0.250000 0.076923
5 0.200000 0.071429
6 0.166667 0.066667
7 0.142857 0.062500
8 0.125000 0.058824
I have just been struggling to find a straightforward way to get the direct division of each row by the previous row within each column. If I knew how to do that, I could simply apply the natural logarithm to every element in the series after the fact.
Currently, to solve my problem, I'm creating, for each column, another column whose rows are shifted down by 1, and then applying the formula between the two columns. It seems messy and sub-optimal to me, though.
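For reference, that workaround presumably looks something like this (hypothetical helper column name):
import numpy as np
sample_df['col1_shifted'] = sample_df['col1'].shift()  # previous row's value
log_rate = np.log(sample_df['col1'] / sample_df['col1_shifted'])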
Any help would be greatly appreciated!
IIUC:
log of a ratio is the difference of logs:
sample_df.apply(np.log).diff()
Or better still:
np.log(sample_df).diff()
just use np.log:
np.log(df.col1 / df.col1.shift())
you can also use apply as suggested by #nikita but that will be slower.
in addition, if you wanted to do it for the entire dataframe, you could just do:
np.log(df / df.shift())
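A quick sanity check (using the sample_df from the question) that the log-difference and log-of-ratio formulations agree:
import numpy as np
np.allclose(np.log(sample_df).diff(),
            np.log(sample_df / sample_df.shift()),
            equal_nan=True)  # True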
You can use shift for that, which does what you have proposed.
>>> sample_df['col1'].shift()
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
Name: col1, dtype: float64
The final answer would be:
import math
(sample_df['col1'] / sample_df['col1'].shift()).apply(lambda row: math.log(row))
0 NaN
1 0.693147
2 0.405465
3 0.287682
4 0.223144
5 0.182322
6 0.154151
7 0.133531
8 0.117783
Name: col1, dtype: float64
