I have a pandas DataFrame in which some columns refer to repeated samples, as below:
In [3]: df = pd.DataFrame(
...: [[89, 89, 12, 34, 32],
...: [788, 25, 55, 65, 55],
...: [588, 23, 58, 8, 55],
...: [25, 14, 45, 123, 58]],
...: columns = ['sample1','sample2.1','sample2.2','sample3','sample4'],
...: )
In [4]: df
sample1 sample2.1 sample2.2 sample3 sample4
0 89 89 12 34 32
1 788 25 55 65 55
2 588 23 58 8 55
3 25 14 45 123 58
For the repeated samples, sample2.1 and sample2.2, I want to keep only the average of the two, i.e.:
sample1 sample2_averaged sample3 sample4
0 89 50.5 34 32
1 788 40.0 65 55
2 588 40.5 8 55
3 25 29.5 123 58
I am thinking of using a regex, but I have never used one on pandas DataFrames.
You can group by columns if you provide axis=1, e.g.:
>>> df.groupby(df.columns.str.replace(r'\..+', '', regex=True), axis=1).mean()
sample1 sample2 sample3 sample4
0 89.0 50.5 34.0 32.0
1 788.0 40.0 65.0 55.0
2 588.0 40.5 8.0 55.0
3 25.0 29.5 123.0 58.0
Pandas columns and indices can use the pandas.Series.str string accessor methods, including regex.
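Note that groupby(..., axis=1) is deprecated in recent pandas versions; a sketch of the same idea that transposes instead of passing axis=1:
out = df.T.groupby(df.columns.str.replace(r'\..+', '', regex=True)).mean().T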
I would do:
(df.T.groupby(df.columns.str.extract(r'^([^.]+)')[0].values)
.mean().T
)
Output:
sample1 sample2 sample3 sample4
0 89.0 50.5 34.0 32.0
1 788.0 40.0 65.0 55.0
2 588.0 40.5 8.0 55.0
3 25.0 29.5 123.0 58.0
Try:
import re
import pandas as pd
from itertools import groupby

res = pd.DataFrame(index=df.index, columns=[])
# Group adjacent columns that share the same prefix before the final ".<suffix>".
for k, v in groupby(df.columns, key=lambda el: re.sub(r"\.[^.]+$", "", el)):
    v = list(v)
    if len(v) == 1:
        res[k] = df[v[0]]            # unique sample: keep as-is
    else:
        res[k] = df[v].mean(axis=1)  # repeated samples: average them
Outputs:
>>> res
sample1 sample2 sample3 sample4
0 89 50.5 34 32
1 788 40.0 65 55
2 588 40.5 8 55
3 25 29.5 123 58
I am using some data where I need to find the time difference between each row and all previous rows, i.e. in row 3 I need to know the time between row 3 and row 2, row 3 and row 1, and row 3 and row 0; in row 5 I need to know the time between row 5 and row 4, row 5 and row 3, ..., row 5 and row 0. I then want to have one big dataframe with all these differences in it (as well as the other columns).
I have made a test dataframe for this
data = {'random': [1, 3, 9, 3, 4, 7, 8, 10],
        'timestamp': [2, 138, 157, 232, 245, 302, 323, 379]}
df = pd.DataFrame(data)
I then tried to do
for i in range(len(df)):
    difference = df.timestamp.diff(periods=i+1)
    print(difference)
to iterate through the rows and subtract the previous row on the first iteration, the row two back on the second iteration, etc.
I am stuck on how to combine this into one large dataframe after all the iterations AND how to make sure the loop uses the original dataframe at the start of each iteration (not the dataframe from the previous iteration).
This is what is being outputted
0 NaN
1 136.0
2 19.0
3 75.0
4 13.0
5 57.0
6 21.0
7 56.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 155.0
3 94.0
4 88.0
5 70.0
6 78.0
7 77.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 230.0
4 107.0
5 145.0
6 91.0
7 134.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 243.0
5 164.0
6 166.0
7 147.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 300.0
6 185.0
7 222.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 321.0
7 241.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 377.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
Name: timestamp, dtype: float64
If anyone knows how to solve this that would be great :)
Here is one way of solving the problem with Series.expanding:
df['diff'] = [list(s.iat[-1] - s[-2::-1]) for s in df['timestamp'].expanding(1)]
random timestamp diff
0 1 2 []
1 3 138 [136]
2 9 157 [19, 155] #--> 157-138, 157-2
3 3 232 [75, 94, 230] #--> 232-157, 232-138, 232-2
4 4 245 [13, 88, 107, 243]
5 7 302 [57, 70, 145, 164, 300]
6 8 323 [21, 78, 91, 166, 185, 321]
7 10 379 [56, 77, 134, 147, 222, 241, 377]
I may be misunderstanding what you mean but if you're asking how to collect these differences together:
differences = [df.timestamp.diff(periods=i+1) for i in range(len(df))]
differences = pd.concat(differences)
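If you'd rather have the differences side by side instead of stacked, concatenating along columns gives one column per lag (a sketch; the diff_1, diff_2, ... labels are made-up names):
differences = pd.concat(
    [df.timestamp.diff(periods=i+1) for i in range(len(df))],
    axis=1,
    keys=[f'diff_{i+1}' for i in range(len(df))],
)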
I also may be misunderstanding, but this is the best representation I could think of from what you described:
>>> df2 = df.copy()
>>> for i in df2.timestamp:
...     df2[i] = df2['timestamp'] - i
>>> df2
random timestamp 2 138 157 232 245 302 323 379
0 1 2 0 -136 -155 -230 -243 -300 -321 -377
1 3 138 136 0 -19 -94 -107 -164 -185 -241
2 9 157 155 19 0 -75 -88 -145 -166 -222
3 3 232 230 94 75 0 -13 -70 -91 -147
4 4 245 243 107 88 13 0 -57 -78 -134
5 7 302 300 164 145 70 57 0 -21 -77
6 8 323 321 185 166 91 78 21 0 -56
7 10 379 377 241 222 147 134 77 56 0
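The same all-pairs table can also be built in one shot with NumPy broadcasting (a sketch that should be equivalent to the loop above):
import numpy as np
t = df['timestamp'].to_numpy()
# Element (r, c) holds timestamp[r] - timestamp[c], matching the extra columns of df2.
pairwise = pd.DataFrame(t[:, None] - t[None, :], index=df.index, columns=t)
df2 = pd.concat([df, pairwise], axis=1)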
I would like to apply a function to one pandas DataFrame column which does the following task:
I have a cycle counter that starts from a value but sometimes restarts.
I would like the counter to continue and keep increasing its value instead.
The function I use at the moment is the following one:
Code
import pandas as pd
d = {'Cycle':[100,100,100,100,101,101,101,102,102,102,102,102,102,103,103,103,100,100,100,100,101,101,101,101]}
df = pd.DataFrame(data=d)
df.loc[:,'counter'] = df['Cycle'].to_numpy()
df.loc[:,'counter'] = df['counter'].rolling(2).apply(lambda x: x[0] if (x[0] == x[1]) else x[0]+1, raw=True)
print(df)
Output
Cycle counter
0 100 NaN
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 100.0
18 100 100.0
19 100 100.0
20 101 101.0
21 101 101.0
22 101 101.0
23 101 101.0
My goal is to get a dataframe similar to this one:
Cycle counter
0 100 NaN
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 104.0
18 100 104.0
19 100 104.0
20 101 105.0
21 101 105.0
22 101 105.0
23 101 105.0
How do I use the rolling function with one overlap?
Do you have any recommendation to reach my goal?
Best regards,
Matteo
Another approach would be to identify the points in the Cycle column where the value changes using .diff(). Then, at those points, increment from the original initial cycle value and merge onto the original dataframe, forward-filling the new values.
df2 = df[df['Cycle'].diff().ne(0)].reset_index()
df2['Target Count'] = df2.reset_index().apply(lambda x: df.iloc[0, 0] + x['level_0'], axis=1)
df = df.merge(df2.drop('Cycle', axis=1), right_on='index', left_index=True, how='left').ffill().set_index('index', drop=True)
del df.index.name
df
Cycle Target Count
0 100 100.0
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 104.0
18 100 104.0
19 100 104.0
20 101 105.0
21 101 105.0
22 101 105.0
23 101 105.0
We can use shift and ne (same as !=) to check where the Cycle column changes.
Then we use cumsum to make a counter which changes each time Cycle changes.
We add the first value of Cycle, minus 1, to the counter so that it starts at 100:
groups = df['Cycle'].ne(df['Cycle'].shift()).cumsum()
df['counter'] = groups + df['Cycle'].iat[0] - 1
Cycle counter
0 100 100
1 100 100
2 100 100
3 100 100
4 101 101
5 101 101
6 101 101
7 102 102
8 102 102
9 102 102
10 102 102
11 102 102
12 102 102
13 103 103
14 103 103
15 103 103
16 100 104
17 100 104
18 100 104
19 100 104
20 101 105
21 101 105
22 101 105
23 101 105
Details: groups gives us a counter starting at 1:
print(groups)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 3
8 3
9 3
10 3
11 3
12 3
13 4
14 4
15 4
16 5
17 5
18 5
19 5
20 6
21 6
22 6
23 6
Name: Cycle, dtype: int64
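The grouper and the offset can also be combined into a single line:
df['counter'] = df['Cycle'].ne(df['Cycle'].shift()).cumsum() + df['Cycle'].iat[0] - 1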
I have a dataframe with depth and other value columns:
data = {'Depth': [1.0, 1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0, 4.0, 5.0, 5.5, 6.0],
'Value1':[44, 46, 221, 12, 47, 44, 67, 90, 100, 111, 112, 120, 122],
'Value2': [55, 65, 76, 45, 55, 58, 23, 12, 32, 20, 22, 26, 36]}
df = pd.DataFrame(data)
As you can see, sometimes there are repetitions in Depth.
I'd like to be able to somehow group by intervals and average over them.
For example an output I desire would be:
intervals = [1.0, 2.0]
Taking a list of intervals and breaking up the data set on those intervals to average per value (Value1, Value2) to get:
Depth Value1 Value2 Avg1_1 Avg2_1 Avg1_2 Avg2_2
0 1.0 44 55 80.75 60.25 78.2 .
1 1.0 46 65 80.75 60.25 78.2 .
2 1.5 221 76 80.75 60.25 78.2 .
3 2.0 12 45 80.75 60.25 78.2
4 2.5 47 55 52.67 . 78.2
5 2.5 44 58 52.67 . 78.2
6 3.0 67 23 52.67 . 78.2
7 3.5 90 12 100.33 78.2
8 4.0 100 32 100.33 78.2
9 4.0 111 20 100.33 78.2
10 5.0 112 22 112 .
11 5.5 120 26 121 .
12 6.0 122 36 121 .
where Avg1_1 is the average of Value1 over every interval of 1.0 (which includes 1.0-2.0, 2.5-3.0, ... etc.).
Is there an easy way to do this using groupby in a loop?
You can accomplish this with the DataFrame's apply method, selecting via a boolean mask the rows (and associated values) that meet a condition such as Depth <= depth + 1.0 or Depth <= depth + 2.0.
df['avg1_1'] = df.apply(lambda x: df.loc[df['Depth'] <= x['Depth'] + 1.0, 'Value1'].mean(), axis=1)
df['avg2_1'] = df.apply(lambda x: df.loc[df['Depth'] <= x['Depth'] + 1.0, 'Value2'].mean(), axis=1)
df['avg1_2'] = df.apply(lambda x: df.loc[df['Depth'] <= x['Depth'] + 2.0, 'Value1'].mean(), axis=1)
df['avg2_2'] = df.apply(lambda x: df.loc[df['Depth'] <= x['Depth'] + 2.0, 'Value2'].mean(), axis=1)
This would return:
Depth Value1 Value2 avg1_1 avg2_1 avg1_2 avg2_2
0 1.0 44 55 80.750000 60.250000 68.714286 53.857143
1 1.0 46 65 80.750000 60.250000 68.714286 53.857143
2 1.5 221 76 69.000000 59.000000 71.375000 48.625000
3 2.0 12 45 68.714286 53.857143 78.200000 44.100000
4 2.5 47 55 71.375000 48.625000 78.200000 44.100000
5 2.5 44 58 71.375000 48.625000 78.200000 44.100000
6 3.0 67 23 78.200000 44.100000 81.272727 42.090909
7 3.5 90 12 78.200000 44.100000 84.500000 40.750000
8 4.0 100 32 81.272727 42.090909 87.384615 40.384615
9 4.0 111 20 81.272727 42.090909 87.384615 40.384615
10 5.0 112 22 87.384615 40.384615 87.384615 40.384615
11 5.5 120 26 87.384615 40.384615 87.384615 40.384615
12 6.0 122 36 87.384615 40.384615 87.384615 40.384615
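If the goal is fixed, contiguous depth bins rather than the trailing windows above, a groupby-transform over pd.cut bins may be closer to what was asked (a sketch; the bin edges are assumptions):
# Hypothetical 1.0-wide bins covering the depth range; adjust the edges as needed.
bins = pd.cut(df['Depth'], bins=[0, 1, 2, 3, 4, 5, 6])
df['Avg1_1'] = df.groupby(bins)['Value1'].transform('mean')
df['Avg2_1'] = df.groupby(bins)['Value2'].transform('mean')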
I have the following MultiIndex dataframe.
Close ATR
Date Symbol
1990-01-01 A 24 2
1990-01-01 B 72 7
1990-01-01 C 40 3.4
1990-01-02 A 21 1.5
1990-01-02 B 65 6
1990-01-02 C 45 4.2
1990-01-03 A 19 2.5
1990-01-03 B 70 6.3
1990-01-03 C 51 5
I want to calculate three columns:
Shares = previous day's Equity * 0.02 / ATR, rounded down to whole number
Profit = Shares * Close
Equity = previous day's Equity + sum of Profit for each Symbol
Equity has an initial value of 10,000.
The expected output is:
Close ATR Shares Profit Equity
Date Symbol
1990-01-01 A 24 2 0 0 10000
1990-01-01 B 72 7 0 0 10000
1990-01-01 C 40 3.4 0 0 10000
1990-01-02 A 21 1.5 133 2793 17053
1990-01-02 B 65 6 33 2145 17053
1990-01-02 C 45 4.2 47 2115 17053
1990-01-03 A 19 2.5 136 2584 26885
1990-01-03 B 70 6.3 54 3780 26885
1990-01-03 C 51 5 68 3468 26885
I suppose I need a for loop or a function to be applied to each row. With these I have two issues. One is that I'm not sure how I can create a for loop for this logic in case of a MultiIndex dataframe. The second is that my dataframe is pretty large (something like 10 million rows) so I'm not sure if a for loop would be a good idea. But then how can I create these columns?
This solution can surely be cleaned up, but will produce your desired output. I've included your initial conditions in the construction of your sample dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['1990-01-01','1990-01-01','1990-01-01','1990-01-02','1990-01-02','1990-01-02','1990-01-03','1990-01-03','1990-01-03'],
'Symbol': ['A','B','C','A','B','C','A','B','C'],
'Close': [24, 72, 40, 21, 65, 45, 19, 70, 51],
'ATR': [2, 7, 3.4, 1.5, 6, 4.2, 2.5, 6.3, 5],
'Shares': [0, 0, 0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Profit': [0, 0, 0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
Gives:
Date Symbol Close ATR Shares Profit
0 1990-01-01 A 24 2.0 0.0 0.0
1 1990-01-01 B 72 7.0 0.0 0.0
2 1990-01-01 C 40 3.4 0.0 0.0
3 1990-01-02 A 21 1.5 NaN NaN
4 1990-01-02 B 65 6.0 NaN NaN
5 1990-01-02 C 45 4.2 NaN NaN
6 1990-01-03 A 19 2.5 NaN NaN
7 1990-01-03 B 70 6.3 NaN NaN
8 1990-01-03 C 51 5.0 NaN NaN
Then use groupby() with apply() and track your Equity globally. Took me a second to realize that the nature of this problem requires you to group on two separate columns individually (Symbol and Date):
start = 10000
Equity = 10000

def calcs(x):
    global Equity
    if x.index[0] == 0:
        return x  # skip the first date group; it keeps the initial values
    x['Shares'] = np.floor(Equity * 0.02 / x['ATR'])
    x['Profit'] = x['Shares'] * x['Close']
    Equity += x['Profit'].sum()
    return x
df = df.groupby('Date').apply(calcs)
df['Equity'] = df.groupby('Date')['Profit'].transform('sum')
df['Equity'] = df.groupby('Symbol')['Equity'].cumsum()+start
This yields:
Date Symbol Close ATR Shares Profit Equity
0 1990-01-01 A 24 2.0 0.0 0.0 10000.0
1 1990-01-01 B 72 7.0 0.0 0.0 10000.0
2 1990-01-01 C 40 3.4 0.0 0.0 10000.0
3 1990-01-02 A 21 1.5 133.0 2793.0 17053.0
4 1990-01-02 B 65 6.0 33.0 2145.0 17053.0
5 1990-01-02 C 45 4.2 47.0 2115.0 17053.0
6 1990-01-03 A 19 2.5 136.0 2584.0 26885.0
7 1990-01-03 B 70 6.3 54.0 3780.0 26885.0
8 1990-01-03 C 51 5.0 68.0 3468.0 26885.0
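Since each day's Shares depend on the Equity accumulated through the previous day, the same recursion can also be written as a plain loop over dates, which avoids the global (a sketch of the same logic against the sample frame above; the variable names are assumptions):
equity = 10000
for date, idx in df.groupby('Date').groups.items():
    if date == df['Date'].iloc[0]:
        continue  # the first date keeps its initial Shares/Profit of 0
    shares = np.floor(equity * 0.02 / df.loc[idx, 'ATR'])
    df.loc[idx, 'Shares'] = shares
    df.loc[idx, 'Profit'] = shares * df.loc[idx, 'Close']
    equity += df.loc[idx, 'Profit'].sum()
# The Equity column can then be derived per Date/Symbol as in the two transform lines above.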
Can you try using shift and groupby? Once you have the value of the previous row, all column operations are straightforward.
table2['previous'] = table2['close'].groupby(level='symbol').shift(1)
table2
date symbol close atr previous
1990-01-01 A 24 2 NaN
B 72 7 NaN
C 40 3.4 NaN
1990-01-02 A 21 1.5 24
B 65 6 72
C 45 4.2 40
1990-01-03 A 19 2.5 21
B 70 6.3 65
C 51 5 45
I have the following DataFrame:
daysago line_race rating rw wrating
line_date
2007-03-31 62 11 56 1.000000 56.000000
2007-03-10 83 11 67 1.000000 67.000000
2007-02-10 111 9 66 1.000000 66.000000
2007-01-13 139 10 83 0.880678 73.096278
2006-12-23 160 10 88 0.793033 69.786942
2006-11-09 204 9 52 0.636655 33.106077
2006-10-22 222 8 66 0.581946 38.408408
2006-09-29 245 9 70 0.518825 36.317752
2006-09-16 258 11 68 0.486226 33.063381
2006-08-30 275 8 72 0.446667 32.160051
2006-02-11 475 5 65 0.164591 10.698423
2006-01-13 504 0 70 0.142409 9.968634
2006-01-02 515 0 64 0.134800 8.627219
2005-12-06 542 0 70 0.117803 8.246238
2005-11-29 549 0 70 0.113758 7.963072
2005-11-22 556 0 -1 0.109852 -0.109852
2005-11-01 577 0 -1 0.098919 -0.098919
2005-10-20 589 0 -1 0.093168 -0.093168
2005-09-27 612 0 -1 0.083063 -0.083063
2005-09-07 632 0 -1 0.075171 -0.075171
2005-06-12 719 0 69 0.048690 3.359623
2005-05-29 733 0 -1 0.045404 -0.045404
2005-05-02 760 0 -1 0.039679 -0.039679
2005-04-02 790 0 -1 0.034160 -0.034160
2005-03-13 810 0 -1 0.030915 -0.030915
2004-11-09 934 0 -1 0.016647 -0.016647
I need to remove the rows where line_race is equal to 0. What's the most efficient way to do this?
If I'm understanding correctly, it should be as simple as:
df = df[df.line_race != 0]
But for any future readers, note that df = df[df.line_race != 0] doesn't do anything when trying to filter out None/missing values, because an elementwise comparison with None evaluates True for every row, so nothing is removed.
Does work:
df = df[df.line_race != 0]
Doesn't do anything:
df = df[df.line_race != None]
Does work:
df = df[df.line_race.notnull()]
Just to add another solution, particularly useful if you are using the new pandas accessors: other solutions will replace the original DataFrame and lose the accessors.
df.drop(df.loc[df['line_race']==0].index, inplace=True)
If you want to delete rows based on multiple values of the column, you could use:
df[(df.line_race != 0) & (df.line_race != 10)]
To drop all rows with values 0 and 10 for line_race.
In the case of multiple values and str dtype, I used the following to filter out given values in a column:
def filter_rows_by_values(df, col, values):
return df[~df[col].isin(values)]
Example:
In a DataFrame I want to remove rows which have values "b" and "c" in column "str"
df = pd.DataFrame({"str": ["a","a","a","a","b","b","c"], "other": [1,2,3,4,5,6,7]})
df
str other
0 a 1
1 a 2
2 a 3
3 a 4
4 b 5
5 b 6
6 c 7
filter_rows_by_values(df, "str", ["b","c"])
str other
0 a 1
1 a 2
2 a 3
3 a 4
Though the previous answers are similar to what I am going to do, using the index attribute does not require another indexing method like .loc. It can be done in a similar but more precise manner as
df.drop(df.index[df['line_race'] == 0], inplace = True)
The best way to do this is with boolean masking:
In [56]: df
Out[56]:
line_date daysago line_race rating raw wrating
0 2007-03-31 62 11 56 1.000 56.000
1 2007-03-10 83 11 67 1.000 67.000
2 2007-02-10 111 9 66 1.000 66.000
3 2007-01-13 139 10 83 0.881 73.096
4 2006-12-23 160 10 88 0.793 69.787
5 2006-11-09 204 9 52 0.637 33.106
6 2006-10-22 222 8 66 0.582 38.408
7 2006-09-29 245 9 70 0.519 36.318
8 2006-09-16 258 11 68 0.486 33.063
9 2006-08-30 275 8 72 0.447 32.160
10 2006-02-11 475 5 65 0.165 10.698
11 2006-01-13 504 0 70 0.142 9.969
12 2006-01-02 515 0 64 0.135 8.627
13 2005-12-06 542 0 70 0.118 8.246
14 2005-11-29 549 0 70 0.114 7.963
15 2005-11-22 556 0 -1 0.110 -0.110
16 2005-11-01 577 0 -1 0.099 -0.099
17 2005-10-20 589 0 -1 0.093 -0.093
18 2005-09-27 612 0 -1 0.083 -0.083
19 2005-09-07 632 0 -1 0.075 -0.075
20 2005-06-12 719 0 69 0.049 3.360
21 2005-05-29 733 0 -1 0.045 -0.045
22 2005-05-02 760 0 -1 0.040 -0.040
23 2005-04-02 790 0 -1 0.034 -0.034
24 2005-03-13 810 0 -1 0.031 -0.031
25 2004-11-09 934 0 -1 0.017 -0.017
In [57]: df[df.line_race != 0]
Out[57]:
line_date daysago line_race rating raw wrating
0 2007-03-31 62 11 56 1.000 56.000
1 2007-03-10 83 11 67 1.000 67.000
2 2007-02-10 111 9 66 1.000 66.000
3 2007-01-13 139 10 83 0.881 73.096
4 2006-12-23 160 10 88 0.793 69.787
5 2006-11-09 204 9 52 0.637 33.106
6 2006-10-22 222 8 66 0.582 38.408
7 2006-09-29 245 9 70 0.519 36.318
8 2006-09-16 258 11 68 0.486 33.063
9 2006-08-30 275 8 72 0.447 32.160
10 2006-02-11 475 5 65 0.165 10.698
UPDATE: Now that pandas 0.13 is out, another way to do this is df.query('line_race != 0').
The given answer is correct. Nonetheless, as someone above said, you can use df.query('line_race != 0'), which depending on your problem can be much faster. Highly recommended.
Another way of doing it. It may not be the most efficient way, as the code looks a bit more complex than the code mentioned in other answers, but it is still an alternative way of doing the same thing.
df = df.drop(df[df['line_race']==0].index)
One efficient and idiomatic way is to use the eq() method:
df[~df.line_race.eq(0)]
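The complementary ne() method expresses the same filter without the inversion operator:
df[df.line_race.ne(0)]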
I compiled and ran my code; it works, and you can try it yourself.
data = pd.read_excel('file.xlsx')
If there is any special character or space in the column name, you can write it in quotes and use bracket access, as in the given code:
data = data[data['expire/t'].notnull()]
print(data)
If the column name has no spaces or special characters, you can access it directly as an attribute:
data = data[data.expire != 0]
print(data)
Adding one more way to do this.
df = df.query("line_race!=0")
There are various ways to achieve this. Below are several options one can use, depending on the specifics of the use case.
Assume that the OP's dataframe is stored in the variable df.
Option 1
For the OP's case, considering that the only column with zero values is line_race, the following will do the job
df_new = df[df != 0].dropna()
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
However, as that is not always the case, I would recommend checking the following options, where one specifies the column name.
Option 2
tshauck's approach ends up being better than Option 1, because one is able to specify the column. There are, however, additional variations depending on how one wants to refer to the column:
For example, using the position in the dataframe
df_new = df[df[df.columns[2]] != 0]
Or by explicitly indicating the column as follows
df_new = df[df['line_race'] != 0]
One can also follow the same logic but using a custom lambda function, such as
df_new = df[df.apply(lambda x: x['line_race'] != 0, axis=1)]
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 3
Using pandas.Series.map and a custom lambda function
df_new = df[df['line_race'].map(lambda x: x != 0)]
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 4
Using pandas.DataFrame.drop as follows
df_new = df.drop(df[df['line_race'] == 0].index)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 5
Using pandas.DataFrame.query as follows
df_new = df.query('line_race != 0')
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 6
Using pandas.DataFrame.drop and pandas.DataFrame.query as follows
df_new = df.drop(df.query('line_race == 0').index)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.000000 56.000000
1 2007-03-10 83 11.0 67 1.000000 67.000000
2 2007-02-10 111 9.0 66 1.000000 66.000000
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
Option 7
If one doesn't have strong opinions on the output format, one can use a vectorized approach over the underlying NumPy array (note that this returns an array, not a DataFrame):
df_new = df.to_numpy()[df['line_race'].to_numpy() != 0]
[Out]:
[['2007-03-31' 62 11.0 56 1.0 56.0]
['2007-03-10' 83 11.0 67 1.0 67.0]
['2007-02-10' 111 9.0 66 1.0 66.0]
['2007-01-13' 139 10.0 83 0.880678 73.096278]
['2006-12-23' 160 10.0 88 0.793033 69.786942]
['2006-11-09' 204 9.0 52 0.636655 33.106077]
['2006-10-22' 222 8.0 66 0.581946 38.408408]
['2006-09-29' 245 9.0 70 0.518825 36.317752]
['2006-09-16' 258 11.0 68 0.486226 33.063381]
['2006-08-30' 275 8.0 72 0.446667 32.160051]
['2006-02-11' 475 5.0 65 0.164591 10.698423]]
This can also be converted to a dataframe with
df_new = pd.DataFrame(df_new, columns=df.columns)
[Out]:
line_date daysago line_race rating rw wrating
0 2007-03-31 62 11.0 56 1.0 56.0
1 2007-03-10 83 11.0 67 1.0 67.0
2 2007-02-10 111 9.0 66 1.0 66.0
3 2007-01-13 139 10.0 83 0.880678 73.096278
4 2006-12-23 160 10.0 88 0.793033 69.786942
5 2006-11-09 204 9.0 52 0.636655 33.106077
6 2006-10-22 222 8.0 66 0.581946 38.408408
7 2006-09-29 245 9.0 70 0.518825 36.317752
8 2006-09-16 258 11.0 68 0.486226 33.063381
9 2006-08-30 275 8.0 72 0.446667 32.160051
10 2006-02-11 475 5.0 65 0.164591 10.698423
With regards to the most efficient solution, that would depend on how one wants to measure efficiency. Assuming that one wants to measure the time of execution, one way that one can go about doing it is with time.perf_counter().
If one measures the time of execution for all the options above, one gets the following
method time
0 Option 1 0.00000110000837594271
1 Option 2.1 0.00000139995245262980
2 Option 2.2 0.00000369996996596456
3 Option 2.3 0.00000160001218318939
4 Option 3 0.00000110000837594271
5 Option 4 0.00000120000913739204
6 Option 5 0.00000140001066029072
7 Option 6 0.00000159995397552848
8 Option 7 0.00000150001142174006
However, this might change depending on the dataframe one uses, on the requirements (such as hardware), and more.
Notes:
There are various suggestions on using inplace=True. Would suggest reading this: https://stackoverflow.com/a/59242208/7109869
There are also some people with strong opinions on .apply(). Would suggest reading this: When should I (not) want to use pandas apply() in my code?
If one has missing values, one might also want to consider pandas.DataFrame.dropna. Using option 2, it would be something like
df = df[df['line_race'] != 0].dropna()
There are additional ways to measure the time of execution, so I would recommend this thread: How do I get time of a Python program's execution?
Just adding another way, expanding the filter over all columns of the DataFrame:
for column in df.columns:
df = df[df[column]!=0]
Example:
def z_score(data, count):
    threshold = 3
    for column in data.columns:
        mean = np.mean(data[column])
        std = np.std(data[column])
        for i in data[column]:
            zscore = (i - mean) / std
            if np.abs(zscore) > threshold:
                count = count + 1
                data = data[data[column] != i]
    return data, count
Just in case you need to delete rows where the value can be in different columns: in my case I was using percentages, so I wanted to delete the rows which have a value of 1 in any column, since that means 100%.
for x in df:
    df.drop(df.loc[df[x] == 1].index, inplace=True)
This is not optimal if your df has too many columns.
So many options have been provided (or maybe I didn't pay enough attention, sorry if that's the case), but no one mentioned this: we can use the ~ notation in pandas (it gives us the inverse of the condition):
df = df[~(df["line_race"] == 0)]
It doesn't make much difference for a simple example like this, but for complicated logic I prefer to use drop() when deleting rows, because it is more straightforward than using inverse logic. For example, delete rows where A=1 AND (B=2 OR C=3).
Here's a scalable syntax that is easy to understand and can handle complicated logic:
df.drop(df.query("`line_race` == 0").index)
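For the compound condition mentioned above, the same pattern stays readable (A, B and C being hypothetical columns):
df.drop(df.query("A == 1 and (B == 2 or C == 3)").index)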
You can try using this:
df.drop(df[df.line_race == 0].index, inplace=True)