Determine the number of unique values between consecutive rows - python

I have a dataframe. I want to add a column that counts, for each row, how many values differ from the adjacent row (the order within a row doesn't matter).
For instance, if row A is 12, 22, 5, 7 and row B is 22, 7, 3, 6, then the number for row B is 2: rows A and B share 22 and 7 (although in a different order), and row B has two new numbers, 3 and 6. So we append one number to row B that records the difference between row A and row B.
import pandas as pd

df = pd.DataFrame({'X': [22, 7, 43, 44, 56, 67, 7, 38, 29, 130],
                   'Y': [5, 3, 330, 140, 250, 10, 207, 320, 420, 50],
                   'Z': [7, 6, 136, 144, 312, 10, 82, 63, 42, 12],
                   'T': [12, 22, 4, 424, 256, 167, 27, 38, 229, 30]},
                  index=list('ABCDEFGHIJ'))
Thanks.

John Galt in his (now, unfortunately deleted) answer was on the right track with set operations.
In addition, accounting for duplicates will involve:
# one set per row
s = df.apply(set, axis=1)
# diff() on a Series of sets falls back to Python's set subtraction, giving
# the values new to each row; (4 - s.str.len()) accounts for duplicates
# within a row
df['diffs'] = s.diff().fillna('').str.len() + (4 - s.str.len())
df
df
T X Y Z diffs
A 12 22 5 7 0
B 22 7 3 6 2
C 4 43 330 136 4
D 424 44 140 144 4
E 256 56 250 312 4
F 167 67 10 10 4
G 27 7 207 82 4
H 38 38 320 63 4
I 229 29 420 42 4
J 30 130 50 12 4
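As a sanity check, the set trick above can be run in isolation on the first two rows (a minimal sketch; diff() on an object Series of sets falls back to Python's set subtraction):

```python
import pandas as pd

df = pd.DataFrame({'X': [22, 7], 'Y': [5, 3], 'Z': [7, 6], 'T': [12, 22]},
                  index=list('AB'))

# One set per row: A -> {12, 22, 5, 7}, B -> {22, 7, 3, 6}.
s = df.apply(set, axis=1)

# diff() subtracts consecutive sets, leaving the values new to each row;
# (4 - s.str.len()) compensates for duplicates within a row.
diffs = s.diff().fillna('').str.len() + (4 - s.str.len())
print(diffs.tolist())  # [0, 2]
```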


How to select the local min and max out of a list in a pandas DataFrame

I'm struggling to figure out how to do the following:
I have a dataframe that looks like this (it's a little more complicated; this is just an example):
df = pd.DataFrame({'id' : ['id1','id2'], 'coverage' : ['1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 17 18 19 20 40 41 42 43 44 45 46 47 48 49 50','1 2 3 4 5 6 7 8 9 10 100 101 102 103 104 105 106 107 108 109 110']})
And I want to generate a new column that only holds the min and max of every contiguous segment; basically it should look like this:
id coverage
0 id1 1 11 13 20 40 50
1 id2 1 10 100 110
It's a simple problem, but I can't come up with any solution; I know that map(lambda x: ...) could work...
Thanks!
Let's try:
# split the values and convert to integers
s = df['coverage'].str.split().explode().astype(int)
# continuous blocks
blocks = s.diff().ne(1).groupby(level=0).cumsum()
df['coverage'] = (s.groupby([s.index, blocks])
                   .agg(['min', 'max'])
                   .astype(str).agg(' '.join, axis=1)
                   .groupby(level=0).agg(' '.join)
                  )
First, split the strings and explode into one long Series, keeping the original index. Then take the difference between successive values within each group and check where it is not equal to 1; each such position marks the start of a new segment.
Slice the exploded Series by this mask and its shift to get the start and end points, then groupby and agg(list) (or ' '.join) to get your output:
# Convert to numeric so the split strings become numbers.
s = pd.to_numeric(df.set_index('id')['coverage'].str.split().explode())
m = s.groupby(level=0).diff().ne(1)
result = s[m | m.shift(-1).fillna(True)].groupby(level=0).agg(list)
id
id1 [1, 11, 13, 20, 40, 50]
id2 [1, 10, 100, 110]
Name: coverage, dtype: object
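Putting the second approach together as a self-contained sketch on the question's data (same variable names as above):

```python
import pandas as pd

df = pd.DataFrame({'id': ['id1', 'id2'],
                   'coverage': ['1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 17 18 19 20 '
                                '40 41 42 43 44 45 46 47 48 49 50',
                                '1 2 3 4 5 6 7 8 9 10 100 101 102 103 104 105 '
                                '106 107 108 109 110']})

# One numeric value per row, indexed by 'id'.
s = pd.to_numeric(df.set_index('id')['coverage'].str.split().explode())

# A segment starts wherever the within-group step is not exactly 1.
m = s.groupby(level=0).diff().ne(1)

# Keep segment starts (m) and segment ends (the value just before the
# next start), then collect per id.
result = s[m | m.shift(-1).fillna(True)].groupby(level=0).agg(list)
print(result['id1'])  # [1, 11, 13, 20, 40, 50]
print(result['id2'])  # [1, 10, 100, 110]
```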

Multiply two dataframes with same column names but different index

I have two dataframes: one with data, and one with a list of forecasting assumptions. The column names correspond, but the index levels do not (by design). Please show me how to multiply columns A, B, and C in df1 by the relevant columns in df2, as in my example below, leaving the remainder of the original dataframe (i.e. column D) intact. Thanks!
import numpy as np
import pandas as pd

df1list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df2list = [2017]
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'), index=df1list)
df2 = pd.DataFrame(np.random.randint(1, 4, size=(1, 3)), columns=list('ABC'), index=df2list)
>>> df[['A','B','C']] * df2.values
A B C
1 81 168 116
2 21 8 6
3 147 108 52
4 54 64 114
5 48 16 20
6 72 116 12
7 36 188 178
8 90 96 162
9 63 166 156
10 120 22 10
So to overwrite you can do:
df.loc[:,['A','B','C']] = df[['A','B','C']] * df2.values
And I guess to be more programmatic you can do:
df[df2.columns] *= df2.values
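A runnable sketch of the in-place version (fixed seed so the toy frames are reproducible; only the shared columns A, B, C are scaled, and D is untouched):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(10, 4)),
                  columns=list('ABCD'), index=range(1, 11))
df2 = pd.DataFrame([[2, 3, 4]], columns=list('ABC'), index=[2017])

original = df.copy()
# .values drops df2's index, so the single row broadcasts down df
# purely by position; column D is never touched.
df[df2.columns] *= df2.values
print(df['A'].equals(original['A'] * 2))  # True
```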

Multiply each element of a column by each element of a different dataframe

I have two data frames with the same columns; the first has multiple rows, while the second has only one row. I need to multiply the entries of the first data frame by those of the second, matching by column name.
DF:1
A B C
0 34 54 56
1 12 87 78
2 78 35 0
3 84 25 14
4 26 82 13
DF:2
A B C
0 2 3 1
Result
A B C
68 162 56
24 261 78
156 105 0
168 75 14
52 246 13
This will work. Here we are working with the NumPy arrays underlying the DataFrames.
pd.DataFrame(df1.values*df2.values, columns=df1.columns, index=df1.index)
for col in df1.columns:
    df1[col] = df1[col] * df2[col].iloc[0]
There's probably an even simpler solution; play around with apply, map, and transform.
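For completeness, DataFrame.mul with the single row of df2 taken as a Series multiplies by matching column labels, which avoids both the raw .values trick and the Python loop (a sketch on the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [34, 12, 78, 84, 26],
                    'B': [54, 87, 35, 25, 82],
                    'C': [56, 78, 0, 14, 13]})
df2 = pd.DataFrame({'A': [2], 'B': [3], 'C': [1]})

# df2.iloc[0] is a Series indexed by column name, so mul(..., axis=1)
# pairs each df1 column with the matching df2 entry by label.
result = df1.mul(df2.iloc[0], axis=1)
print(result['B'].tolist())  # [162, 261, 105, 75, 246]
```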

What is the fastest way to check multiple trends in pandas dataframe?

I have a large dataframe with 25-30 columns and thousands of rows.
I need to analyse the trend of some columns, and the trend of some column ratios against each other.
Now I have 3 choices:
1) iterate row by row with a simple for i in range(len(df)) and build a series of if/else conditions, comparing each value with the previous and following ones, for different columns;
for i in range(len(df)):
    if df.iloc[i]['col1'] < df.iloc[i]['col2']:
        print('error type 1')
    if df.iloc[i]['col2'] < df.iloc[i]['col3']:
        print('error type 2')
    if df.iloc[i + 1]['col1'] > 2 * df.iloc[i]['col1']:
        print('error trend 1')
    if df.iloc[i + 1]['col2'] > 2 * df.iloc[i]['col2']:
        print('error trend 2')
    if df.iloc[i - 1]['col2'] > 2 * df.iloc[i]['col2']:
        print('error trend 2')
    # and so on, with around 40-50 if statements per row
2) Iterate with iterrows or itertuples, but I am not sure I can easily access the previous and next rows that way;
3) create shifted columns and use vectorized operations, but that means adding around 100 extra columns to the dataframe (shifts of +1, +2, -1, -2 for about 20 columns):
df['ratio12'] = df['col1'] / df['col2']
df['ratio12up1'] = df['ratio12'].shift(-1)
df['ratio12up2'] = df['ratio12'].shift(-2)
df['ratio12dn1'] = df['ratio12'].shift(1)
df['ratio12dn2'] = df['ratio12'].shift(2)
df['ratio23'] = df['col2'] / df['col3']
df['ratio23up1'] = df['ratio23'].shift(-1)
df['ratio23up2'] = df['ratio23'].shift(-2)
df['ratio23dn1'] = df['ratio23'].shift(1)
df['ratio23dn2'] = df['ratio23'].shift(2)
df['ratio34'] = df['col3'] / df['col4']
#..... for other 10 ratios
# and then do the checks on all these new columns, like:
peak_ratio12 = ((df['ratio12'] / df['ratio12up1'] > 1.5) & (df['ratio12'] / df['ratio12dn1'] > 1.5))
EDIT: EXAMPLE: I have this table:
Index col1 col2 col3 col4 col5 col6 col7
0 732 58 18 10 6 3 3
1 754 60 18 10 6 3 3
2 3964 365 98 34 34 17 13
3 4286 417 110 36 35 19 15
4 5807 545 155 54 53 27 21
5 1681 132 46 16 13 9 8
6 542 620 13 11 4 3 2
7 319 38 30 20 4 2 2
8 286 22 17 10 3 2 2
9 324 25 18 10 3 2 2
10 370 29 10 0 4 2 2
11 299 28 19 10 3 2 2
12 350 36 14 11 6 3 4
13 309 34 14 11 7 3 4
On this tiny slice of the data I would like to find errors such as:
when a value in a column becomes double or half of the previous and next values (as in lines 2, 5, and 6, for example);
when a later column is higher than an earlier one (like col2 > col1 in line 6, or col7 > col6 in lines 12 and 13);
and a huge number of other checks on these columns (e.g. col1/col2 must be roughly constant, likewise col2/col3; col6 + col7 must be smaller than col3; etc.).
My main problem is checking column ratios against previous and next values. All the other checks are easily vectorized.
Any advice on how to proceed?
Thanks!
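For the ratio checks against neighbours (the hard part of option 3), the shifted series can be built on the fly instead of stored as extra columns, keeping only boolean masks. A minimal sketch on the first rows of the example table (the 1.5 threshold and the specific checks are illustrative, not the asker's full rule set):

```python
import pandas as pd

df = pd.DataFrame({'col1': [732, 754, 3964, 4286, 5807, 1681, 542],
                   'col2': [58, 60, 365, 417, 545, 132, 620]})

# Ratio check: flag rows where col1/col2 jumps by more than 1.5x
# relative to both the previous and the next row.
ratio = df['col1'] / df['col2']
spike = (ratio / ratio.shift(1) > 1.5) & (ratio / ratio.shift(-1) > 1.5)

# Column-order check: col2 must stay below col1.
order_error = df['col2'] > df['col1']
print(df.index[order_error].tolist())  # [6]
```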

How can I extract numeric ranges from 2 columns and print the range from both columns as tuples?

I'm quite new to bash scripting and Python programming; at the moment I have 2 columns which contain numeric sequences, as follows:
Col 1:
1
2
3
5
7
8
Col 2:
101
102
103
105
107
108
I need to extract the numeric ranges from both columns and print them according to where a sequence break occurs in either of those 2 columns; the result should be as follows:
1,3,101,103
5,5,105,105
7,8,107,108
I already received useful information on how to extract numeric ranges from one column using awk:
awk 'NR==1||sqrt(($0-p)*($0-p))>1{print p; printf "%s", $0 ", "} {p=$0} END{print $0}' file
but now the problem has become a bit more complex, as I have to include a second column with another numeric sequence, and the result must contain the ranges from both columns wherever a sequence break occurs in either of the two columns.
To add a bit more complexity, the sequences can be ascending and/or descending.
I'm trying to find a solution using the pandas and numpy libraries for Python.
Thanks in advance.
Hello MaxU, thanks for your reply; unfortunately I'm hitting an issue in the following case:
Col 1:
7
8
9
10
11
Col 2:
52
51
47
46
45
Here the numeric sequence in the second column is descending from the beginning; it generates as a result:
7,11,45,52
instead of:
7,8,51,52
8,11,45,47
Cheers.
UPDATE:
In [103]: df
Out[103]:
Col1 Col2
0 7 52
1 8 51
2 9 47
3 10 46
4 11 45
In [104]: (df.groupby((df.diff().abs() != 1).any(axis=1).cumsum()).agg(['min','max']))
Out[104]:
Col1 Col2
min max min max
1 7 8 51 52
2 9 11 45 47
OLD answer:
Here is one way (among many) to do it in Pandas:
Data:
In [314]: df
Out[314]:
Col1 Col2
0 1 101
1 2 102
2 3 103
3 5 105
4 8 108
5 7 107
6 6 106
7 9 109
NOTE: pay attention that the rows with indexes (4, 5, 6) form a descending sequence
Solution:
In [350]: rslt = (df.groupby((df.diff().abs() != 1).all(axis=1).cumsum())
     ...:            .agg(['min','max']))
...:
In [351]: rslt
Out[351]:
Col1 Col2
min max min max
1 1 3 101 103
2 5 5 105 105
3 6 8 106 108
4 9 9 109 109
now you can easily save it to a CSV file:
rslt.to_csv(r'/path/to/file_name.csv', index=False, header=False)
or just print it:
In [333]: print(rslt.to_csv(index=False, header=False))
1,3,101,103
5,5,105,105
6,8,106,108
9,9,109,109
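The updated grouping from In [104] as a self-contained sketch on the descending case (with axis written as a keyword, as newer pandas requires):

```python
import pandas as pd

df = pd.DataFrame({'Col1': [7, 8, 9, 10, 11],
                   'Col2': [52, 51, 47, 46, 45]})

# A new group starts whenever either column's step is not +1 or -1.
breaks = (df.diff().abs() != 1).any(axis=1).cumsum()
rslt = df.groupby(breaks).agg(['min', 'max'])
print(rslt.to_csv(index=False, header=False), end='')
```

This prints the two expected ranges, 7,8,51,52 and 9,11,45,47.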
