I have a df of 825,468 rows, and I am performing the following operation on it:
frame = frame.drop(frame.loc[(
frame['RR'].str.contains(r"^([23])[^-]*-\1[^-]*$")), 'RR'].str.replace("[23]([^-]*)-[23]([^-]*)", r"\1-\2").isin(
series1.str.replace("1([^-]*)-1([^-]*)", r"\1-\2"))[lambda d: d].index)
where
series1 = frame.loc[frame['RR'].str.contains("^1[^-]*-1"), 'RR']
What it does is this:
it prepares a series of RR values that look like 1abc-1bcd; then, if the frame contains an RR like 2abc-2bcd that becomes abc-bcd after the replacement, and that value is also present in the series after its own replacement, the row is dropped.
But it gives a MemoryError. Is there a more efficient way to perform the same operation?
For example, if a df looks like this:
RR
0 2abc-2abc
1 1abc-1abc
2 3abc-3abc
3 2def-2def
4 3def-3def
5 def-dfd
6 sdsd-sdsd
7 1def-1def
Then from this frame 2abc-2abc and 3abc-3abc should be dropped, because after removing the 2 and 3 they become abc-abc, and removing the 1 from 1abc-1abc also gives abc-abc. If there were no 1def-1def in the frame, then 2def-2def would not be dropped.
Output:
RR
0 1abc-1abc
1 def-dfd
2 sdsd-sdsd
3 1def-1def
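For reference, the rule above can be restated in smaller steps, which is easier to check against the example. This is only a sketch: it expresses the same logic as the original expression (not necessarily a fix for the memory problem) and assumes pandas is imported as pd.
import pandas as pd

frame = pd.DataFrame({'RR': ['2abc-2abc', '1abc-1abc', '3abc-3abc', '2def-2def',
                             '3def-3def', 'def-dfd', 'sdsd-sdsd', '1def-1def']})

# the 1-prefixed values with the leading 1 stripped from both halves, e.g. {'abc-abc', 'def-def'}
ones = set(frame.loc[frame['RR'].str.contains(r'^1[^-]*-1'), 'RR']
                .str.replace(r'1([^-]*)-1([^-]*)', r'\1-\2', regex=True))

# strip a leading 2 or 3 from both halves and test membership in that set
stripped = frame['RR'].str.replace(r'^[23]([^-]*)-[23]([^-]*)$', r'\1-\2', regex=True)
to_drop = frame['RR'].str.match(r'([23])[^-]*-\1[^-]*$') & stripped.isin(ones)

print(frame.loc[~to_drop])   # keeps 1abc-1abc, def-dfd, sdsd-sdsd, 1def-1def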
UPDATE2:
In [176]: df
Out[176]:
RR
0 2abc-2abc
1 3abc-3abc
2 2def-2def
3 3def-3def
4 def-dfd
5 sdsd-sdsd
6 1def-1def
7 abc-abc
8 def-def
In [177]: df[['d1','s','s2']] = df.RR.str.extract(r'^(?P<d1>\d+)(?P<s1>[^-]*)-\1(?P<s2>[^-]*)', expand=True)
In [178]: df
Out[178]:
RR d1 s s2
0 2abc-2abc 2 abc abc
1 3abc-3abc 3 abc abc
2 2def-2def 2 def def
3 3def-3def 3 def def
4 def-dfd NaN NaN NaN
5 sdsd-sdsd NaN NaN NaN
6 1def-1def 1 def def
7 abc-abc NaN NaN NaN
8 def-def NaN NaN NaN
In [179]: df.s += df.pop('s2')
In [180]: df
Out[180]:
RR d1 s
0 2abc-2abc 2 abcabc
1 3abc-3abc 3 abcabc
2 2def-2def 2 defdef
3 3def-3def 3 defdef
4 def-dfd NaN NaN
5 sdsd-sdsd NaN NaN
6 1def-1def 1 defdef
7 abc-abc NaN NaN
8 def-def NaN NaN
In [181]: result = df.loc[~df.s.isin(df.loc[df.d1 == '1', 's']) | (~df.d1.isin(['2','3'])), 'RR']
In [182]: result
Out[182]:
0 2abc-2abc
1 3abc-3abc
4 def-dfd
5 sdsd-sdsd
6 1def-1def
7 abc-abc
8 def-def
Name: RR, dtype: object
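The same extract-and-compare pipeline can be wrapped in a small helper; this is only a sketch (the function name is mine, and it assumes RR is a string column):
def drop_23_with_matching_1(frame, col='RR'):
    # split '2abc-2abc' into the leading digit and the two stripped halves
    parts = frame[col].str.extract(r'^(?P<d1>\d+)(?P<s1>[^-]*)-\1(?P<s2>[^-]*)', expand=True)
    stripped = parts['s1'] + parts['s2']            # e.g. 'abc' + 'abc' -> 'abcabc'
    ones = stripped[parts['d1'] == '1']             # stripped forms of the 1-prefixed rows
    keep = ~stripped.isin(ones) | ~parts['d1'].isin(['2', '3'])
    return frame.loc[keep]
On the frame above, drop_23_with_matching_1(df) returns the same rows as result.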
UPDATE:
In [171]: df
Out[171]:
RR
0 2abc-2abc
1 1abc-1abc
2 3abc-3abc
3 2def-2def
4 3def-3def
5 def-dfd
6 sdsd-sdsd
7 1def-1def
8 abc-abc
NOTE: I have intentionally added an 8th row, abc-abc, which should NOT be dropped (if I understood your question correctly).
Solution 1: using .str.replace() and drop_duplicates() methods:
In [178]: (df.sort_values('RR')
...: .RR
...: .str.replace("[23]([^-]*)-[23]([^-]*)", r"1\1-1\2")
...: .drop_duplicates()
...: )
...:
Out[178]:
1 1abc-1abc
7 1def-1def
8 abc-abc
5 def-dfd
6 sdsd-sdsd
Name: RR, dtype: object
Solution 2: using .str.replace() and .str.contains() methods and boolean indexing:
In [172]: df.loc[~df.sort_values('RR')
...: .RR
...: .str.replace("[23]([^-]*)-[23]([^-]*)", r"_\1-_\2")
...: .str.contains(r"^_[^-]*-_")]
...:
Out[172]:
RR
1 1abc-1abc
5 def-dfd
6 sdsd-sdsd
7 1def-1def
8 abc-abc
NOTE: you may want to replace '_' with another symbol (or symbols) that will never occur in the RR column.
I have a Dataframe as shown below
A B C D
0 1 2 3.3 4
1 NaT NaN NaN NaN
2 NaT NaN NaN NaN
3 5 6 7 8
4 NaT NaN NaN NaN
5 NaT NaN NaN NaN
6 9 1 2 3
7 NaT NaN NaN NaN
8 NaT NaN NaN NaN
I need to copy the first row's values (1, 2, 3.3, 4) down to the non-null row with index 2, then copy the row values (5, 6, 7, 8) down to the non-null row with index 5, and copy (9, 1, 2, 3) down to the row with index 8, and so on. Is there any way to do this in Python or pandas? Quick help appreciated! It is also necessary not to replace column D.
A plain ffill on column C gives 3.3456 as the value for the next row.
Expected Output:
A B C D
0 1 2 3.3 4
1 1 2 3.3 NaN
2 1 2 3.3 NaN
3 5 6 7 8
4 5 6 7 NaN
5 5 6 7 NaN
6 9 1 2 3
7 9 1 2 NaN
8 9 1 2 NaN
The question was changed, so for forward filling all columns except D, use Index.difference with ffill on the resulting list of column names:
cols = df.columns.difference(['D'])
df[cols] = df[cols].ffill()
Or create a boolean mask for all column names except D:
mask = df.columns != 'D'
df.loc[:, mask] = df.loc[:, mask].ffill()
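A quick end-to-end check of the first variant on the sample frame (a sketch; I have used plain NaN everywhere for the gaps, whereas your column A shows NaT):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, 5, np.nan, np.nan, 9, np.nan, np.nan],
                   'B': [2, np.nan, np.nan, 6, np.nan, np.nan, 1, np.nan, np.nan],
                   'C': [3.3, np.nan, np.nan, 7, np.nan, np.nan, 2, np.nan, np.nan],
                   'D': [4, np.nan, np.nan, 8, np.nan, np.nan, 3, np.nan, np.nan]})

cols = df.columns.difference(['D'])   # every column except D
df[cols] = df[cols].ffill()           # forward fill only those columns
print(df)                             # A-C are filled down, D keeps its NaN gaps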
EDIT: I cannot replicate your problem:
df = pd.DataFrame({'a':[2114.201789, np.nan, np.nan, 1]})
print (df)
a
0 2114.201789
1 NaN
2 NaN
3 1.000000
print (df.ffill())
a
0 2114.201789
1 2114.201789
2 2114.201789
3 1.000000
I have a dataframe that looks like this:
df
Out[42]:
Unnamed: 0 Unnamed: 0.1 Region GeneID DistanceValue
0 25520 25520 Olfactory areas 69835573 -1.000000
1 25521 25521 Olfactory areas 583846 -1.000000
2 25522 25522 Olfactory areas 68667661 -1.000000
3 25523 25523 Olfactory areas 70474965 -1.000000
4 25524 25524 Olfactory areas 68341920 -1.000000
... ... ... ... ...
15662 1072369 1072369 Cerebellum unspecific 74743327 -0.960186
15663 1072370 1072370 Cerebellum unspecific 69530983 -0.960139
15664 1072371 1072371 Cerebellum unspecific 68442853 -0.960129
15665 1072372 1072372 Cerebellum unspecific 74514339 -0.960038
15666 1072373 1072373 Cerebellum unspecific 70724637 -0.960003
[15667 rows x 5 columns]
I want to count the 'GeneID' values and create a new df that only contains the rows whose GeneID occurs more than 5 times, so I did:
genelist = df.pivot_table(index=['GeneID'], aggfunc='size')
sort_genelist = genelist.sort_values(axis=0,ascending=False)
sort_genelist
Out[44]:
GeneID
631707 11
68269286 10
633269 10
70302366 9
74357905 9
..
70784714 1
70784824 1
70784898 1
70784916 1
70528527 1
Length: 7875, dtype: int64
So now I want my df dataframe to contain only the rows with the IDs that were counted more than 5 times.
Use Series.isin to build a mask from the index values of sort_genelist where the count is greater than 5, and filter with boolean indexing:
df = df[df['GeneID'].isin(sort_genelist.index[sort_genelist > 5])]
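A small self-contained check of that one-liner, reusing the tiny frame from the next answer with a threshold of 2 instead of 5 (a sketch):
import pandas as pd

df = pd.DataFrame({'GeneID': [1, 1, 1, 3, 4, 5, 5, 4], 'ID': range(8)})
sort_genelist = df.pivot_table(index=['GeneID'], aggfunc='size').sort_values(ascending=False)

# keep only rows whose GeneID was counted more than twice -> only the GeneID == 1 rows remain
print(df[df['GeneID'].isin(sort_genelist.index[sort_genelist > 2])])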
I think that the best way to do what you have asked is:
df['gene_id_count'] = df.groupby('GeneID').transform(len)
df.loc[df['gene_id_count'] > 5, :]
Let's take this tiny example:
>>> df = pd.DataFrame({'GeneID': [1,1,1,3,4,5,5,4], 'ID': range(8)})
>>> df
GeneID ID
0 1 0
1 1 1
2 1 2
3 3 3
4 4 4
5 5 5
6 5 6
7 4 7
And consider a threshold of 2 occurrences (instead of 5):
min_gene_id_count = 2
>>> df['gene_id_count'] = df.groupby('GeneID').transform(len)
>>> df
GeneID ID gene_id_count
0 1 0 3
1 1 1 3
2 1 2 3
3 3 3 1
4 4 4 2
5 5 5 2
6 5 6 2
7 4 7 2
>>> df.loc[df['gene_id_count'] > min_gene_id_count , :]
GeneID ID gene_id_count
0 1 0 3
1 1 1 3
2 1 2 3
I have a pandas.DataFrame that contains string, float and int types.
Is there a way to set all strings that cannot be converted to float to NaN?
For example:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 "wajdi"
to:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 NaN
You can use pd.to_numeric and set errors='coerce'
pandas.to_numeric
df['D'] = pd.to_numeric(df.D, errors='coerce')
Which will give you:
A B C D
0 1 2 5.0 7.0
1 0 4 NaN 15.0
2 4 8 9.0 10.0
3 11 5 8.0 0.0
4 11 5 8.0 NaN
Deprecated solution (pandas <= 0.20 only):
df.convert_objects(convert_numeric=True)
pandas.DataFrame.convert_objects
Here's the dev note in the convert_objects source code: # TODO: Remove in 0.18 or 2017, whichever is sooner. So don't make this a long-term solution if you use it.
Here is a way:
df['E'] = pd.to_numeric(df.D, errors='coerce')
And then you have:
A B C D E
0 1 2 5.0 7 7.0
1 0 4 NaN 15 15.0
2 4 8 9.0 10 10.0
3 11 5 8.0 0 0.0
4 11 5 8.0 wajdi NaN
You can use pd.to_numeric with errors='coerce'.
In [30]: df = pd.DataFrame({'a': [1, 2, 'NaN', 'bob', 3.2]})
In [31]: pd.to_numeric(df.a, errors='coerce')
Out[31]:
0 1.0
1 2.0
2 NaN
3 NaN
4 3.2
Name: a, dtype: float64
Here is one way to apply it to all columns:
for c in df.columns:
df[c] = pd.to_numeric(df[c], errors='coerce')
(See comment by NinjaPuppy for a better way.)
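For reference, here is a whole-DataFrame variant without the explicit loop; this is only a sketch of one common alternative and not necessarily what the comment above suggested:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 'bob', 3.2], 'b': ['x', 4, 5, 6]})
df = df.apply(pd.to_numeric, errors='coerce')   # every unparseable value becomes NaN
print(df)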
I have the following dataframe
p12Diff
Pump Time
3 -2.90 -0.000919
-2.89 -0.000795
-2.88 -0.000814
-2.87 -0.000700
-2.86 -0.000847
-2.85 -0.000769
-2.84 -0.000681
-2.83 -0.000888
-2.82 -0.000815
-2.81 -0.000764
-2.80 -0.000879
-2.70 -0.000757
-2.60 -0.000758
-2.50 -0.000707
Oddly enough, when I slice with idx=IndexSlice for certain ranges, I get a KeyError, whereas for others it simply works. For example, df.loc[idx[:,-2.90:-2.52],:] cuts at -2.60, whereas df.loc[idx[:,-2.90:-2.62],:] raises KeyError: -2.62.
Might this be a bug?
This was fixed for 0.15.0 (RC1 is out now), see here: http://pandas.pydata.org/. 0.14.1 was a bit buggy with this type of indexing.
In [13]: df = DataFrame({'value' : np.arange(11)},index=pd.MultiIndex.from_product([[1],np.linspace(-2.9,-2.3,11)]))
In [14]: df
Out[14]:
value
1 -2.90 0
-2.84 1
-2.78 2
-2.72 3
-2.66 4
-2.60 5
-2.54 6
-2.48 7
-2.42 8
-2.36 9
-2.30 10
In [15]: idx = pd.IndexSlice
In [16]: df.loc[idx[:,-2.9:-2.42],]
Out[16]:
value
1 -2.90 0
-2.84 1
-2.78 2
-2.72 3
-2.66 4
-2.60 5
-2.54 6
-2.48 7
-2.42 8
In [17]: df.loc[idx[:,-2.9:-2.52],]
Out[17]:
value
1 -2.90 0
-2.84 1
-2.78 2
-2.72 3
-2.66 4
-2.60 5
-2.54 6
In [18]: df.loc[idx[:,-2.84:-2.52],]
Out[18]:
value
1 -2.84 1
-2.78 2
-2.72 3
-2.66 4
-2.60 5
-2.54 6
In [19]: df.loc[idx[:,-2.85:-2.52],]
Out[19]:
value
1 -2.84 1
-2.78 2
-2.72 3
-2.66 4
-2.60 5
-2.54 6
I am doing some routines that access scalars and vectors from a pandas DataFrame and then set the results after some calculations.
Initially I used the form df[var][index] to do this, but encountered problems with chained assignment (http://pandas.pydata.org/pandas-docs/dev/indexing.html%23indexing-view-versus-copy).
So I changed it to use df.loc[index, var], which solved the view/copy problem, but it is very slow. For arrays I convert the result to a pandas Series and use the built-in df.update(). I am now searching for the fastest/best way of doing this without having to worry about chained assignment. The documentation says that, for example, df.at[] is the quickest way to access scalars. Does anyone have experience with this, or can anyone point to some literature that can help?
Thanks
Edit: Code looks like this, which I think is pretty standard.
def set_var(self, name, periode, value):
    try:
        if name.upper() not in self.data:
            self.data[name.upper()] = num.NaN   # assumes: import numpy as num
        self.data.loc[periode, name.upper()] = value
    except Exception:
        print('Fail to set ' + name)

def get_var(self, navn, periode):
    ''' Get value '''
    try:
        value = self.data.loc[periode, navn.upper()]
        return value
    except Exception:
        print('Fail to get ' + navn)

def set_series(self, data, index):
    outputserie = pd.Series(data, index)
    self.data.update(outputserie)
dataframe looks like this:
SC0.data
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 148 entries, 1980Q1 to 2016Q4
Columns: 3111 entries, CAP1 to CHH_DRD
dtypes: float64(3106), int64(2), object(3)
Edit 2:
a df could look like this:
var var1
2012Q4 0.462015 0.01585
2013Q1 0.535161 0.01577
2013Q2 0.735432 0.01401
2013Q3 0.845959 0.01638
2013Q4 0.776809 0.01657
2014Q1 0.000000 0.01517
2014Q2 0.000000 0.01593
and I basically want to perform two operations:
1) perhaps update var1 with the same scalar over all periods
2) solve var in 2014Q1 as var,2013Q4 = var1,2013Q3/var2013Q4*var,2013Q4
This is done as part of a bigger model setup, which is read from a txt file. Since I am doing loads of these calculations, the speed of setting and reading data matters.
The example you gave above can be vectorized.
In [3]: df = DataFrame(dict(A = np.arange(10), B = np.arange(10)),index=pd.period_range('2012',freq='Q',periods=10))
In [4]: df
Out[4]:
A B
2012Q1 0 0
2012Q2 1 1
2012Q3 2 2
2012Q4 3 3
2013Q1 4 4
2013Q2 5 5
2013Q3 6 6
2013Q4 7 7
2014Q1 8 8
2014Q2 9 9
Assign a scalar
In [5]: df['A'] = 5
In [6]: df
Out[6]:
A B
2012Q1 5 0
2012Q2 5 1
2012Q3 5 2
2012Q4 5 3
2013Q1 5 4
2013Q2 5 5
2013Q3 5 6
2013Q4 5 7
2014Q1 5 8
2014Q2 5 9
Perform a shifted operation
In [8]: df['C'] = df['B'].shift()/df['B'].shift(2)
In [9]: df
Out[9]:
A B C
2012Q1 5 0 NaN
2012Q2 5 1 NaN
2012Q3 5 2 inf
2012Q4 5 3 2.000000
2013Q1 5 4 1.500000
2013Q2 5 5 1.333333
2013Q3 5 6 1.250000
2013Q4 5 7 1.200000
2014Q1 5 8 1.166667
2014Q2 5 9 1.142857
Using a vectorized assignment
In [10]: df.loc[df['B']>5,'D'] = 'foo'
In [11]: df
Out[11]:
A B C D
2012Q1 5 0 NaN NaN
2012Q2 5 1 NaN NaN
2012Q3 5 2 inf NaN
2012Q4 5 3 2.000000 NaN
2013Q1 5 4 1.500000 NaN
2013Q2 5 5 1.333333 NaN
2013Q3 5 6 1.250000 foo
2013Q4 5 7 1.200000 foo
2014Q1 5 8 1.166667 foo
2014Q2 5 9 1.142857 foo
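For the parts of the model that really do need single-cell reads and writes, label-based scalar access can use .at. Here is a minimal sketch on a frame like the one above; treat it as an illustration rather than a benchmark, since actual timings depend on your data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(10), 'B': np.arange(10)},
                  index=pd.period_range('2012', freq='Q', periods=10))

val = df.at[pd.Period('2013Q4', freq='Q'), 'B']        # scalar read by label
df.at[pd.Period('2014Q1', freq='Q'), 'A'] = val * 2    # scalar write, no chained assignment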