I have a large dataframe with 25-30 columns and thousands of rows.
I need to analyse the trend of some columns, and the trend of the ratios of some columns against each other.
Now I have 3 choices:
1) iterate row by row with a simple for i in range(len(df)), and build a series of if/else conditions, comparing each value with the previous and following ones, for different columns;
for i in range(len(df)):
    if df.iloc[i]['col1'] < df.iloc[i]['col2']:
        print('error type 1')
    if df.iloc[i]['col2'] < df.iloc[i]['col3']:
        print('error type 2')
    if df.iloc[i + 1]['col1'] > 2 * df.iloc[i]['col1']:
        print('error trend 1')
    if df.iloc[i + 1]['col2'] > 2 * df.iloc[i]['col2']:
        print('error trend 2')
    if df.iloc[i - 1]['col2'] > 2 * df.iloc[i]['col2']:
        print('error trend 2')
    # and so on, with around 40-50 if statements per row
    # (the i+1 / i-1 lookups also need guards at the first and last rows)
2) Iterate with iterrows or itertuples, but I am not sure I can access previous and next rows easily;
3) create shifted columns and use vectorized operations, but that means creating around 100 more columns in the dataframe (shift +1, +2, -1, -2 for each of ~20 columns):
df['ratio12'] = df['col1'] / df['col2']
df['ratio12up1'] = df['ratio12'].shift(-1)
df['ratio12up2'] = df['ratio12'].shift(-2)
df['ratio12dn1'] = df['ratio12'].shift(1)
df['ratio12dn2'] = df['ratio12'].shift(2)
df['ratio23'] = df['col2'] / df['col3']
df['ratio23up1'] = df['ratio23'].shift(-1)
df['ratio23up2'] = df['ratio23'].shift(-2)
df['ratio23dn1'] = df['ratio23'].shift(1)
df['ratio23dn2'] = df['ratio23'].shift(2)
df['ratio34'] = df['col3'] / df['col4']
# ... and so on for the other 10 ratios
# and then do the checks on all these new columns, like:
peak_ratio12 = ((df['ratio12'] / df['ratio12up1']) > 1.5) & ((df['ratio12'] / df['ratio12dn1']) > 1.5)
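A variant of option 3 that avoids persisting all the shifted columns would be to compute the ratio on the fly and call shift() inline; a minimal sketch on the col1/col2 pair (the 1.5 threshold is the same arbitrary one as above):
ratio12 = df['col1'] / df['col2']
# peak: the ratio is at least 1.5x both its previous and its next value
peak_ratio12 = (((ratio12 / ratio12.shift(-1)) > 1.5)
                & ((ratio12 / ratio12.shift(1)) > 1.5))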
EDIT: EXAMPLE: I have this table:
Index  col1  col2  col3  col4  col5  col6  col7
0       732    58    18    10     6     3     3
1       754    60    18    10     6     3     3
2      3964   365    98    34    34    17    13
3      4286   417   110    36    35    19    15
4      5807   545   155    54    53    27    21
5      1681   132    46    16    13     9     8
6       542   620    13    11     4     3     2
7       319    38    30    20     4     2     2
8       286    22    17    10     3     2     2
9       324    25    18    10     3     2     2
10      370    29    10     0     4     2     2
11      299    28    19    10     3     2     2
12      350    36    14    11     6     3     4
13      309    34    14    11     7     3     4
On this infinitely small part of the data I would like to find these errors:
when a value in a column becomes 2x or half the previous and next values (so in lines 2, 5 and 6, for example);
when a later column is higher than an earlier one (like col2 > col1 in line 6, col7 > col6 in lines 12 and 13, etc.);
and a huge number of other checks on these columns (e.g. col1/col2 must be fairly constant, col2/col3 the same, col6 + col7 must be smaller than col3, etc.).
My main problem is on checking column ratios with previous and next values. All other checks are easily vectorized.
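For the first check (a value that doubles or halves versus its neighbours), the furthest I got is shifting the whole frame at once, which at least avoids the per-column shifted copies; a rough sketch (I still need to decide whether a jump should flag the row before it, the row after it, or both):
cols = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']
prev, nxt = df[cols].shift(1), df[cols].shift(-1)
# True where a value doubles or halves versus its previous or next neighbour;
# the first and last rows compare against NaN and come out False
spike = ((df[cols] > 2 * prev) | (df[cols] < prev / 2) |
         (df[cols] > 2 * nxt) | (df[cols] < nxt / 2))
bad_rows = spike.any(axis=1)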
Any advice on how to proceed?
Thanks!
Related
Consider the following table. The first column, Data1, contains data values that are clustered in groups: there are values around 100 and values around 200. I am wondering how I can apply a function that deals with each data grouping separately, perhaps with a condition that excludes data points whose values are too far apart to be considered neighbouring points.
Data1 Value1
99 1
100 2
101 3
102 4
199 5
200 6
201 7
... ...
For example, I want to generate a third column called "Result1" that adds every Data1 cluster's corresponding Value1 values together. The result would look something like this, where 1+2+3+4=10 and 5+6+7=18:
Data1 Value1 Result1
99 1 10
100 2 10
101 3 10
102 4 10
199 5 18
200 6 18
201 7 18
... ... ...
Try merge_asof:
import pandas as pd

data = [100, 200]
# note: merge_asof requires both frames to be sorted on the merge keys
labels = pd.merge_asof(df, pd.DataFrame({'label': data}),
                       left_on='Data1', right_on='label',
                       direction='nearest')['label']
df['Result1'] = df.groupby(labels)['Value1'].transform('sum')
Output:
Data1 Value1 Result1
0 99 1 10
1 100 2 10
2 101 3 10
3 102 4 10
4 199 5 18
5 200 6 18
6 201 7 18
In your case, a simple mask ought to do.
mask = df["Data1"] < 150
df.loc[mask, "Result1"] = df.loc[mask, "Value1"].sum()
df.loc[~mask, "Result1"] = df.loc[~mask, "Value1"].sum()
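If the cluster centres are not known up front, a gap-based grouping is another option: start a new group whenever the jump from the previous Data1 value exceeds a threshold. A sketch, assuming Data1 is sorted and clusters are separated by gaps larger than the (arbitrary) 50-unit threshold:
import pandas as pd

df = pd.DataFrame({'Data1': [99, 100, 101, 102, 199, 200, 201],
                   'Value1': [1, 2, 3, 4, 5, 6, 7]})
# a new cluster starts wherever the gap to the previous row exceeds the threshold
cluster = (df['Data1'].diff().abs() > 50).cumsum()
df['Result1'] = df.groupby(cluster)['Value1'].transform('sum')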
For example, if I wanted:
df = pd.DataFrame(index=range(5000))
df['A'] = 0
df['A'][0] = 1
for i in range(len(df)):
    if i != 0:
        df['A'][i] = df['A'][i-1] * 3
Is there a way to do this without a loop?
Your code sample had missing close brackets, and the quotes were not valid; I fixed these.
If I understand what you are trying to achieve: multiply the previous value by 3, where the zeroth number is 1.
Initialize the series to 3, then set the zeroth item to 1;
then it is a simple use of cumprod.
I have shortened your series, as this calculation rapidly overflows.
df = pd.DataFrame(index=range(20))
df["A"]= 3
df.loc[0,"A"] = 1
df["A"] = df["A"].cumprod()
              A
0             1
1             3
2             9
3            27
4            81
5           243
6           729
7          2187
8          6561
9         19683
10        59049
11       177147
12       531441
13  1.59432e+06
14  4.78297e+06
15  1.43489e+07
16  4.30467e+07
17   1.2914e+08
18   3.8742e+08
19  1.16226e+09
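Since the values are just powers of 3, there is also a closed form that avoids cumprod entirely; a sketch of the numpy equivalent (same sequence as above):
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(20))
# 3**0, 3**1, ..., 3**19 -- the same sequence the cumprod produces
df['A'] = 3 ** np.arange(len(df))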
I have one dataframe. I want to add one column which counts how many values differ between two adjacent rows (the order of the values within a row does not matter).
For instance, row A is 12, 22, 5, 7 and row B is 22, 7, 3, 6, so the number for row B is 2: rows A and B share the same 22 and 7 (although in a different order), and row B has two new numbers, 3 and 6. So we append one number to row B that records the difference between row A and row B.
df = pd.DataFrame({'X': [22, 7, 43, 44, 56, 67, 7, 38, 29, 130],
                   'Y': [5, 3, 330, 140, 250, 10, 207, 320, 420, 50],
                   'Z': [7, 6, 136, 144, 312, 10, 82, 63, 42, 12],
                   'T': [12, 22, 4, 424, 256, 167, 27, 38, 229, 30]},
                  index=list('ABCDEFGHIJ'))
Thanks.
John Galt in his (now, unfortunately deleted) answer was on the right track with set operations.
In addition, accounting for duplicates will involve:
# view each row as the set of its values
s = df.apply(set, axis=1)
# the set difference with the previous row gives the values new to that row;
# (4 - s.str.len()) corrects for duplicate values within a row
df['diffs'] = s.diff().fillna('').str.len() + (4 - s.str.len())
df
T X Y Z diffs
A 12 22 5 7 0
B 22 7 3 6 2
C 4 43 330 136 4
D 424 44 140 144 4
E 256 56 250 312 4
F 167 67 10 10 4
G 27 7 207 82 4
H 38 38 320 63 4
I 229 29 420 42 4
J 30 130 50 12 4
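If the diff() on a Series of sets feels too magic (it relies on set subtraction working through Series.diff, which may not hold across pandas versions), here is a more explicit sketch with plain Python set operations, shown on a two-row excerpt of the data:
import pandas as pd

df = pd.DataFrame({'X': [22, 7], 'Y': [5, 3], 'Z': [7, 6], 'T': [12, 22]},
                  index=list('AB'))
rows = [set(r) for r in df.itertuples(index=False)]
# the first row has no predecessor; afterwards count the values new to each
# row, plus a correction for duplicate values within the row itself
diffs = [0] + [len(cur - prev) + (df.shape[1] - len(cur))
               for prev, cur in zip(rows, rows[1:])]
df['diffs'] = diffs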
I'm quite new to bash scripting and Python programming; at the moment I have 2 columns which contain numeric sequences, as follows:
Col 1:
1
2
3
5
7
8
Col 2:
101
102
103
105
107
108
I need to extract the numeric ranges from both columns and print them according to where a sequence break occurs in either of those 2 columns; the result should be as follows:
1,3,101,103
5,5,105,105
7,8,107,108
I already received useful information on how to extract numeric ranges from one column using awk:
$ awk 'NR==1||sqrt(($0-p)*($0-p))>1{print p; printf "%s", $0 ", "} {p=$0} END{print $0}' file
But now the problem has become a bit more complex: I have to include a second column with another numeric sequence, and the result must be the ranges from both columns wherever a sequence break occurs in either of the 2 columns.
To add a bit more complexity, the sequences can be ascending and/or descending.
I am trying to find a solution using the pandas (dataframes) and numpy libraries for Python.
Thanks in advance.
Hello MaxU, thanks for your reply. Unfortunately I'm hitting an issue with the following case:
Col 1:
7
8
9
10
11
Col 2:
52
51
47
46
45
where the numeric sequence in the second column is descending from the beginning; it generates as a result:
7,11,45,52
instead of:
7,8,51,52
8,11,45,47
Cheers.
UPDATE:
In [103]: df
Out[103]:
Col1 Col2
0 7 52
1 8 51
2 9 47
3 10 46
4 11 45
In [104]: (df.groupby((df.diff().abs() != 1).any(axis=1).cumsum()).agg(['min','max']))
Out[104]:
Col1 Col2
min max min max
1 7 8 51 52
2 9 11 45 47
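Assigning the In [104] result to rslt, the comma-separated lines can then be printed the same way as in the old answer below:
print(rslt.to_csv(index=False, header=None))
7,8,51,52
9,11,45,47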
OLD answer:
Here is one way (among many) to do it in Pandas:
Data:
In [314]: df
Out[314]:
Col1 Col2
0 1 101
1 2 102
2 3 103
3 5 105
4 8 108
5 7 107
6 6 106
7 9 109
NOTE: pay attention - rows with indexes (4,5,6) form a descending sequence
Solution:
In [350]: rslt = (df.groupby((df.diff().abs() != 1).all(axis=1).cumsum())
     ...:            .agg(['min','max']))
     ...:
In [351]: rslt
Out[351]:
Col1 Col2
min max min max
1 1 3 101 103
2 5 5 105 105
3 6 8 106 108
4 9 9 109 109
now you can easily save it to a CSV file:
rslt.to_csv(r'/path/to/file_name.csv', index=False, header=None)
or just print it:
In [333]: print(rslt.to_csv(index=False, header=None))
1,3,101,103
5,5,105,105
6,8,106,108
9,9,109,109
I have a DataFrame with 4 columns and 251 rows, and an index that is a progression of numbers, e.g. 1000 to 1250. The index was initially necessary to help join data from 4 different dataframes. However, once I get the 4 columns together, I would like to change the index to a number progression from 250 to 0. This is because I will be performing the same operation on different sets of data (in groups of 4) that have different indices, e.g. 2000 to 2250 or 500 to 750, but all have the same number of rows. 250 to 0 is a way of unifying these data sets, but I can't figure out how to do this, i.e. I'm looking for something that replaces any existing index with range(250, 0, -1).
I've tried using set_index below, and a whole bunch of other attempts that invariably return errors:
df.set_index(range(250, 0, -1), inplace=True)
In the one instance where I was able to set the index of the df to the range, the data in the 4 columns changed to NaN, since they had no data matching the new index. I apologize if this is rudimentary, but I'm a week old in the world of python/pandas, haven't programmed in 10+ years, and have taken 2 days to try to figure this out for myself as an exercise, but it's time to cry... Uncle!!
Try introducing the 250:0 indices as a column first, then setting them as the index:
df = pd.DataFrame({'col1': list('abcdefghij'), 'col2': range(0, 50, 5)})
df['new_index'] = range(30, 20, -1)
df = df.set_index('new_index')
Before:
col1 col2 new_index
0 a 0 30
1 b 5 29
2 c 10 28
3 d 15 27
4 e 20 26
5 f 25 25
6 g 30 24
7 h 35 23
8 i 40 22
9 j 45 21
After:
col1 col2
new_index
30 a 0
29 b 5
28 c 10
27 d 15
26 e 20
25 f 25
24 g 30
23 h 35
22 i 40
21 j 45
You can just do
df.index = range(250, 0, -1)
or am I missing something?
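That direct assignment does work, provided the length of the new index matches the number of rows; a minimal sketch (the single column is made up for illustration):
import pandas as pd

# hypothetical 250-row frame standing in for the joined data
df = pd.DataFrame({'col1': range(250)}, index=range(1000, 1250))
df.index = range(250, 0, -1)  # lengths must match, or pandas raises a ValueError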