Python Pandas Memory Error during Drop

I have a df of 825,468 rows, and I am performing this operation over it:
frame = frame.drop(
    frame.loc[frame['RR'].str.contains(r"^([23])[^-]*-\1[^-]*$"), 'RR']
         .str.replace(r"[23]([^-]*)-[23]([^-]*)", r"\1-\2")
         .isin(series1.str.replace(r"1([^-]*)-1([^-]*)", r"\1-\2"))
         [lambda d: d].index
)
where
series1 = frame.loc[frame['RR'].str.contains("^1[^-]*-1"), 'RR']
What it does: it prepares a series of the rows where RR has a value like 1abc-1bcd; then, if the frame contains an RR like 2abc-2bcd that becomes abc-bcd after replacement and is also present in the series after its replacement, that row is dropped.
But it raises a MemoryError. Is there a more efficient way to perform the same operation?
For example, given this df:
RR
0 2abc-2abc
1 1abc-1abc
2 3abc-3abc
3 2def-2def
4 3def-3def
5 def-dfd
6 sdsd-sdsd
7 1def-1def
Then from this frame, 2abc-2abc and 3abc-3abc should be dropped: after removing the 2/3 they become abc-abc, and removing the 1 from 1abc-1abc also gives abc-abc. Likewise, 2def-2def and 3def-3def are dropped because 1def-1def is present; a 2/3 value with no matching 1 value would be kept.
Output:
RR
0 1abc-1abc
1 def-dfd
2 sdsd-sdsd
3 1def-1def

UPDATE2:
In [176]: df
Out[176]:
RR
0 2abc-2abc
1 3abc-3abc
2 2def-2def
3 3def-3def
4 def-dfd
5 sdsd-sdsd
6 1def-1def
7 abc-abc
8 def-def
In [177]: df[['d1','s','s2']] = df.RR.str.extract(r'^(?P<d1>\d+)(?P<s1>[^-]*)-\1(?P<s2>[^-]*)', expand=True)
In [178]: df
Out[178]:
RR d1 s s2
0 2abc-2abc 2 abc abc
1 3abc-3abc 3 abc abc
2 2def-2def 2 def def
3 3def-3def 3 def def
4 def-dfd NaN NaN NaN
5 sdsd-sdsd NaN NaN NaN
6 1def-1def 1 def def
7 abc-abc NaN NaN NaN
8 def-def NaN NaN NaN
In [179]: df.s += df.pop('s2')
In [180]: df
Out[180]:
RR d1 s
0 2abc-2abc 2 abcabc
1 3abc-3abc 3 abcabc
2 2def-2def 2 defdef
3 3def-3def 3 defdef
4 def-dfd NaN NaN
5 sdsd-sdsd NaN NaN
6 1def-1def 1 defdef
7 abc-abc NaN NaN
8 def-def NaN NaN
In [181]: result = df.loc[~df.s.isin(df.loc[df.d1 == '1', 's']) | (~df.d1.isin(['2','3'])), 'RR']
In [182]: result
Out[182]:
0 2abc-2abc
1 3abc-3abc
4 def-dfd
5 sdsd-sdsd
6 1def-1def
7 abc-abc
8 def-def
Name: RR, dtype: object
UPDATE:
In [171]: df
Out[171]:
RR
0 2abc-2abc
1 1abc-1abc
2 3abc-3abc
3 2def-2def
4 3def-3def
5 def-dfd
6 sdsd-sdsd
7 1def-1def
8 abc-abc
NOTE: I have intentionally added an 8th row, abc-abc, which should NOT be dropped (if I understood your question correctly).
Solution 1: using .str.replace() and drop_duplicates() methods:
In [178]: (df.sort_values('RR')
...: .RR
...: .str.replace("[23]([^-]*)-[23]([^-]*)", r"1\1-1\2")
...: .drop_duplicates()
...: )
...:
Out[178]:
1 1abc-1abc
7 1def-1def
8 abc-abc
5 def-dfd
6 sdsd-sdsd
Name: RR, dtype: object
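One caveat with Solution 1 (my note, not part of the original answer): the returned series holds the replaced strings, so a 2/3 row that survives drop_duplicates because it has no 1 counterpart shows up with a rewritten 1 prefix. If you need the original values, reselect from df by the surviving index:
kept = (df.sort_values('RR')
          .RR
          .str.replace(r"[23]([^-]*)-[23]([^-]*)", r"1\1-1\2", regex=True)
          .drop_duplicates())
result = df.loc[kept.index.sort_values(), 'RR']   # original strings, original order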
Solution 2: using .str.replace() and .str.contains() methods and boolean indexing:
In [172]: df.loc[~df.sort_values('RR')
...: .RR
...: .str.replace("[23]([^-]*)-[23]([^-]*)", r"_\1-_\2")
...: .str.contains(r"^_[^-]*-_")]
...:
Out[172]:
RR
1 1abc-1abc
5 def-dfd
6 sdsd-sdsd
7 1def-1def
8 abc-abc
NOTE: you may want to replace '_' with another symbol (or symbols) that will never occur in the RR column.
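Both solutions above rewrite the 2/3 rows unconditionally. If you need the strict rule from the question (drop a 2/3 row only when a matching 1 row exists), here is a minimal boolean-mask sketch on the question's sample frame (my sketch, not from the answers):
import pandas as pd

df = pd.DataFrame({'RR': ['2abc-2abc', '1abc-1abc', '3abc-3abc', '2def-2def',
                          '3def-3def', 'def-dfd', 'sdsd-sdsd', '1def-1def']})

# Rows of the form "2x-2y" / "3x-3y" (same digit on both sides).
is23 = df['RR'].str.contains(r'^([23])[^-]*-\1[^-]*$')
# Strip the leading digits to get the comparison key "x-y".
key = df['RR'].str.replace(r'^[23]([^-]*)-[23]([^-]*)$', r'\1-\2', regex=True)
# Keys of the "1x-1y" rows, stripped the same way.
keys1 = (df.loc[df['RR'].str.contains(r'^1[^-]*-1[^-]*$'), 'RR']
           .str.replace(r'^1([^-]*)-1([^-]*)$', r'\1-\2', regex=True))

result = df[~(is23 & key.isin(keys1))]
# keeps 1abc-1abc, def-dfd, sdsd-sdsd, 1def-1def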

Related

Copy row values of Data Frame along rows till not null and replicate the consecutive not null value further

I have a DataFrame as shown below:
A B C D
0 1 2 3.3 4
1 NaT NaN NaN NaN
2 NaT NaN NaN NaN
3 5 6 7 8
4 NaT NaN NaN NaN
5 NaT NaN NaN NaN
6 9 1 2 3
7 NaT NaN NaN NaN
8 NaT NaN NaN NaN
I need to copy the first row's values (1, 2, 3.3, 4) down to the row before the next non-null row (index 2), then copy the row values (5, 6, 7, 8) down to index 5, and (9, 1, 2, 3) down to index 8, and so on. Is there any way to do this in Python or pandas? Quick help appreciated! Also, it is necessary not to fill column D.
(A plain ffill on column C gave me 3.3456 as the value for the next row.)
Expected Output:
A B C D
0 1 2 3.3 4
1 1 2 3.3 NaN
2 1 2 3.3 NaN
3 5 6 7 8
4 5 6 7 NaN
5 5 6 7 NaN
6 9 1 2 3
7 9 1 2 NaN
8 9 1 2 NaN
The question was changed, so for forward filling all columns except D, use Index.difference to get the list of column names and call ffill on them:
cols = df.columns.difference(['D'])
df[cols] = df[cols].ffill()
Or create a boolean mask for all column names except D:
mask = df.columns != 'D'
df.loc[:, mask] = df.loc[:, mask].ffill()
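A minimal end-to-end sketch on the sample frame from the question (my reconstruction; column A is assumed numeric even though the question's display shows NaT):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, 5, np.nan, np.nan, 9, np.nan, np.nan],
                   'B': [2, np.nan, np.nan, 6, np.nan, np.nan, 1, np.nan, np.nan],
                   'C': [3.3, np.nan, np.nan, 7, np.nan, np.nan, 2, np.nan, np.nan],
                   'D': [4, np.nan, np.nan, 8, np.nan, np.nan, 3, np.nan, np.nan]})

cols = df.columns.difference(['D'])   # every column except D
df[cols] = df[cols].ffill()           # matches the expected output above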
EDIT: I cannot replicate your problem:
df = pd.DataFrame({'a':[2114.201789, np.nan, np.nan, 1]})
print (df)
a
0 2114.201789
1 NaN
2 NaN
3 1.000000
print (df.ffill())
a
0 2114.201789
1 2114.201789
2 2114.201789
3 1.000000

Creating new pandas dataframe from pivottable condition

I have a dataframe that looks like that:
df
Out[42]:
Unnamed: 0 Unnamed: 0.1 Region GeneID DistanceValue
0 25520 25520 Olfactory areas 69835573 -1.000000
1 25521 25521 Olfactory areas 583846 -1.000000
2 25522 25522 Olfactory areas 68667661 -1.000000
3 25523 25523 Olfactory areas 70474965 -1.000000
4 25524 25524 Olfactory areas 68341920 -1.000000
... ... ... ... ...
15662 1072369 1072369 Cerebellum unspecific 74743327 -0.960186
15663 1072370 1072370 Cerebellum unspecific 69530983 -0.960139
15664 1072371 1072371 Cerebellum unspecific 68442853 -0.960129
15665 1072372 1072372 Cerebellum unspecific 74514339 -0.960038
15666 1072373 1072373 Cerebellum unspecific 70724637 -0.960003
[15667 rows x 5 columns]
I want to count the GeneIDs and create a new df that contains only the rows whose GeneID occurs more than 5 times, so I did:
genelist = df.pivot_table(index=['GeneID'], aggfunc='size')
sort_genelist = genelist.sort_values(axis=0,ascending=False)
sort_genelist
Out[44]:
GeneID
631707 11
68269286 10
633269 10
70302366 9
74357905 9
..
70784714 1
70784824 1
70784898 1
70784916 1
70528527 1
Length: 7875, dtype: int64
So now I want my df to contain just the rows whose IDs were counted more than 5 times.
Use Series.isin to build a mask from the index values of sort_genelist where the count is greater than 5, and filter by boolean indexing:
df = df[df['GeneID'].isin(sort_genelist.index[sort_genelist > 5])]
I think that the best way to do what you have asked is:
df['gene_id_count'] = df.groupby('GeneID').transform(len)
df.loc[df['gene_id_count'] > 5, :]
Let's take this tiny example:
>>> df = pd.DataFrame({'GeneID': [1,1,1,3,4,5,5,4], 'ID': range(8)})
>>> df
GeneID ID
0 1 0
1 1 1
2 1 2
3 3 3
4 4 4
5 5 5
6 5 6
7 4 7
And consider a threshold of 2 occurrences (instead of 5):
min_gene_id_count = 2
>>> df['gene_id_count'] = df.groupby('GeneID').transform(len)
>>> df
GeneID ID gene_id_count
0 1 0 3
1 1 1 3
2 1 2 3
3 3 3 1
4 4 4 2
5 5 5 2
6 5 6 2
7 4 7 2
>>> df.loc[df['gene_id_count'] > min_gene_id_count, :]
GeneID ID gene_id_count
0 1 0 3
1 1 1 3
2 1 2 3
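For completeness, GroupBy.filter expresses the same idea in one step (my sketch, not from the answers above); it is often clearer, though the transform-based mask avoids materializing each group:
# Keep only the rows whose GeneID group has more than min_gene_id_count members.
result = df.groupby('GeneID').filter(lambda g: len(g) > min_gene_id_count)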

pandas.DataFrame set all string values to nan

I have a pandas.DataFrame that contains string, float and int types.
Is there a way to set all strings that cannot be converted to float to NaN?
For example:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 "wajdi"
to:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 NaN
You can use pd.to_numeric and set errors='coerce'
pandas.to_numeric
df['D'] = pd.to_numeric(df.D, errors='coerce')
Which will give you:
A B C D
0 1 2 5.0 7.0
1 0 4 NaN 15.0
2 4 8 9.0 10.0
3 11 5 8.0 0.0
4 11 5 8.0 NaN
Deprecated solution (pandas <= 0.20 only):
df.convert_objects(convert_numeric=True)
pandas.DataFrame.convert_objects
Here's the dev note in the convert_objects source code: # TODO: Remove in 0.18 or 2017, which ever is sooner. So don't make this a long-term solution if you use it.
Here is a way:
df['E'] = pd.to_numeric(df.D, errors='coerce')
And then you have:
A B C D E
0 1 2 5.0 7 7.0
1 0 4 NaN 15 15.0
2 4 8 9.0 10 10.0
3 11 5 8.0 0 0.0
4 11 5 8.0 wajdi NaN
You can use pd.to_numeric with errors='coerce'.
In [30]: df = pd.DataFrame({'a': [1, 2, 'NaN', 'bob', 3.2]})
In [31]: pd.to_numeric(df.a, errors='coerce')
Out[31]:
0 1.0
1 2.0
2 NaN
3 NaN
4 3.2
Name: a, dtype: float64
Here is one way to apply it to all columns:
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors='coerce')
(See comment by NinjaPuppy for a better way.)
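That comment is not reproduced here, but the usual whole-frame one-liner is apply with to_numeric (my assumption of what it suggests):
# Coerces every column; the errors kwarg is passed through to pd.to_numeric.
df = df.apply(pd.to_numeric, errors='coerce')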

Index Slicing with Float64Index not working in pandas

I have the following dataframe
p12Diff
Pump Time
3 -2.90 -0.000919
-2.89 -0.000795
-2.88 -0.000814
-2.87 -0.000700
-2.86 -0.000847
-2.85 -0.000769
-2.84 -0.000681
-2.83 -0.000888
-2.82 -0.000815
-2.81 -0.000764
-2.80 -0.000879
-2.70 -0.000757
-2.60 -0.000758
-2.50 -0.000707
Oddly enough, when I slice with idx=IndexSlice for certain ranges, I get a KeyError, whereas for others it simply works. For example, df.loc[idx[:,-2.90:-2.52],:] cuts at -2.60, whereas df.loc[idx[:,-2.90:-2.62],:] raises KeyError: -2.62.
Might this be a bug?
This was fixed for 0.15.0 (RC1 is out now), see here: http://pandas.pydata.org/. 0.14.1 was a bit buggy with this type of indexing.
In [13]: df = DataFrame({'value' : np.arange(11)},index=pd.MultiIndex.from_product([[1],np.linspace(-2.9,-2.3,11)]))
In [14]: df
Out[14]:
value
1 -2.90 0
-2.84 1
-2.78 2
-2.72 3
-2.66 4
-2.60 5
-2.54 6
-2.48 7
-2.42 8
-2.36 9
-2.30 10
In [15]: idx = pd.IndexSlice
In [16]: df.loc[idx[:,-2.9:-2.42],]
Out[16]:
value
1 -2.90 0
-2.84 1
-2.78 2
-2.72 3
-2.66 4
-2.60 5
-2.54 6
-2.48 7
-2.42 8
In [17]: df.loc[idx[:,-2.9:-2.52],]
Out[17]:
value
1 -2.90 0
-2.84 1
-2.78 2
-2.72 3
-2.66 4
-2.60 5
-2.54 6
In [18]: df.loc[idx[:,-2.84:-2.52],]
Out[18]:
value
1 -2.84 1
-2.78 2
-2.72 3
-2.66 4
-2.60 5
-2.54 6
In [19]: df.loc[idx[:,-2.85:-2.52],]
Out[19]:
value
1 -2.84 1
-2.78 2
-2.72 3
-2.66 4
-2.60 5
-2.54 6

Python / pandas: Fastest way to set and retrieve data, without chained assignment

I am writing some routines that access scalars and vectors in a pandas DataFrame and then set the results after some calculations.
Initially I used the form df[var][index] to do this, but encountered problems with chained assignment (http://pandas.pydata.org/pandas-docs/dev/indexing.html%23indexing-view-versus-copy), so I changed it to df.loc[index, var], which solved the view/copy problem but is very slow. For arrays I convert the data to a pandas Series and use the built-in df.update(). I am now searching for the fastest/best way of doing this without having to worry about chained assignment. The documentation says that, for example, df.at[] is the quickest way to access scalars. Does anyone have experience with this, or can anyone point to some literature that could help?
Thanks
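A minimal, self-contained timing sketch (my construction; the 148-row shape mirrors the frame described below, and absolute numbers will vary by machine) for comparing .loc and .at scalar access:
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(148, 5), columns=list('ABCDE'),
                  index=pd.period_range('1980Q1', periods=148, freq='Q'))
row = df.index[10]

print(timeit.timeit(lambda: df.loc[row, 'C'], number=100000))  # label-based, general-purpose
print(timeit.timeit(lambda: df.at[row, 'C'], number=100000))   # scalar-optimized, typically much faster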
Edit: The code looks like this, which I think is pretty standard:
import numpy as np
import pandas as pd

def set_var(self, name, periode, value):
    try:
        if name.upper() not in self.data:
            self.data[name.upper()] = np.nan   # create the column on first use
        self.data.loc[periode, name.upper()] = value
    except Exception:
        print('Failed to set ' + name)

def get_var(self, name, periode):
    '''Get value'''
    return self.data.loc[periode, name.upper()]

def set_series(self, data, index):
    outputserie = pd.Series(data, index)
    self.data.update(outputserie)
dataframe looks like this:
SC0.data
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 148 entries, 1980Q1 to 2016Q4
Columns: 3111 entries, CAP1 to CHH_DRD
dtypes: float64(3106), int64(2), object(3)
Edit 2: a df could look like this:
var var1
2012Q4 0.462015 0.01585
2013Q1 0.535161 0.01577
2013Q2 0.735432 0.01401
2013Q3 0.845959 0.01638
2013Q4 0.776809 0.01657
2014Q1 0.000000 0.01517
2014Q2 0.000000 0.01593
and I basically want to perform two operations:
1) perhaps update var1 with the same scalar over all periods
2) solve var in 2014Q1, something like var[2014Q1] = var1[2013Q3] / var1[2013Q4] * var[2013Q4]
This is done as part of a bigger model setup, which is read from a txt file. Since I'm doing loads of these calculations, the speed of setting and reading data matters.
The example you gave above can be vectorized.
In [3]: df = DataFrame(dict(A = np.arange(10), B = np.arange(10)),index=pd.period_range('2012',freq='Q',periods=10))
In [4]: df
Out[4]:
A B
2012Q1 0 0
2012Q2 1 1
2012Q3 2 2
2012Q4 3 3
2013Q1 4 4
2013Q2 5 5
2013Q3 6 6
2013Q4 7 7
2014Q1 8 8
2014Q2 9 9
Assign a scalar
In [5]: df['A'] = 5
In [6]: df
Out[6]:
A B
2012Q1 5 0
2012Q2 5 1
2012Q3 5 2
2012Q4 5 3
2013Q1 5 4
2013Q2 5 5
2013Q3 5 6
2013Q4 5 7
2014Q1 5 8
2014Q2 5 9
Perform a shifted operation
In [8]: df['C'] = df['B'].shift()/df['B'].shift(2)
In [9]: df
Out[9]:
A B C
2012Q1 5 0 NaN
2012Q2 5 1 NaN
2012Q3 5 2 inf
2012Q4 5 3 2.000000
2013Q1 5 4 1.500000
2013Q2 5 5 1.333333
2013Q3 5 6 1.250000
2013Q4 5 7 1.200000
2014Q1 5 8 1.166667
2014Q2 5 9 1.142857
Using a vectorized assignment
In [10]: df.loc[df['B']>5,'D'] = 'foo'
In [11]: df
Out[11]:
A B C D
2012Q1 5 0 NaN NaN
2012Q2 5 1 NaN NaN
2012Q3 5 2 inf NaN
2012Q4 5 3 2.000000 NaN
2013Q1 5 4 1.500000 NaN
2013Q2 5 5 1.333333 NaN
2013Q3 5 6 1.250000 foo
2013Q4 5 7 1.200000 foo
2014Q1 5 8 1.166667 foo
2014Q2 5 9 1.142857 foo
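When a genuinely scalar read or write is unavoidable, .at (and its positional sibling .iat) is the documented fast path; a minimal sketch on the frame above (my addition, not part of the original answer):
df.at[df.index[3], 'A'] = 7.0        # set a single cell by label
value = df.at[df.index[3], 'A']      # read it back; avoids .loc's general-case overhead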
