Retrieving future value in Python using offset variable from another column

I'm trying to figure out how to retrieve values from future dates using an offset variable stored in a separate column in Python. For instance, I have the dataframe df below, and I'd like to find a way to produce Column C, where each row takes the value of Orig A from Orig B rows further down:
Orig A Orig B Desired Column C
54 1 76
76 4 46
14 3 46
35 1 -3
-3 0 -3
46 0 46
64 0 64
93 0 93
72 0 72
Any help is much appreciated, thank you!

You can use NumPy for a vectorised solution:
import numpy as np
idx = np.arange(df.shape[0]) + df['OrigB'].values
df['C'] = df['OrigA'].iloc[idx].values
print(df)
OrigA OrigB C
0 54 1 76
1 76 4 46
2 14 3 46
3 35 1 -3
4 -3 0 -3
5 46 0 46
6 64 0 64
7 93 0 93
8 72 0 72
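The NumPy approach above raises an IndexError as soon as a row's position plus its offset points past the last row. Below is a minimal sketch of one way to guard against that. Filling unreachable rows with NaN is an assumption, not something the question specifies, and the column names follow this answer's OrigA/OrigB spelling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "OrigA": [54, 76, 14, 35, -3, 46, 64, 93, 72],
    "OrigB": [1, 4, 3, 1, 0, 0, 0, 0, 0],
})

values = df["OrigA"].to_numpy()
idx = np.arange(len(df)) + df["OrigB"].to_numpy()

# Assumption: a target position outside the frame has no future value,
# so it is reported as NaN instead of raising IndexError.
in_range = (idx >= 0) & (idx < len(df))
safe_idx = np.where(in_range, idx, 0)  # placeholder position, masked out below
df["C"] = np.where(in_range, values[safe_idx], np.nan)
print(df)
```

Because NaN is a float, column C becomes float64 here; if every offset is known to be valid, the shorter version above keeps the integer dtype.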

import pandas as pd
# Named "data" rather than "dict" to avoid shadowing the built-in dict type
data = {"Orig A": [54,76,14,35,-3,46,64,93,72],
        "Orig B": [1,4,3,1,0,0,0,0,0],
        "Desired Column C": [76,46,46,-3,-3,46,64,93,72]}
df = pd.DataFrame(data)
df["desired_test"] = [df["Orig A"].values[i+j] for i,j in enumerate(df["Orig B"].values)]
df
Orig A Orig B Desired Column C desired_test
0 54 1 76 76
1 76 4 46 46
2 14 3 46 46
3 35 1 -3 -3
4 -3 0 -3 -3
5 46 0 46 46
6 64 0 64 64
7 93 0 93 93
8 72 0 72 72

Related

Venn Diagram for each row in DataFrame

I have a set of data that looks like this:
Exp # ID Q1 Q2 All IDs Q1 unique Q2 unique Overlap Unnamed: 8
0 1 58 32 58 58 14 40 18 18
1 2 55 38 44 55 28 34 10 10
2 4 95 69 83 95 37 51 32 32
3 5 92 68 84 92 31 47 37 37
4 6 0 0 0 0 0 0 0 0
5 7 71 52 65 71 27 40 25 25
6 8 84 69 69 84 39 39 30 30
7 10 65 35 63 65 17 45 18 18
8 11 90 72 72 90 39 39 33 33
9 14 88 84 80 88 52 48 32 32
10 17 89 56 75 89 30 49 26 26
11 19 83 56 70 83 32 46 24 24
12 20 94 72 83 93 35 46 37 37
13 21 73 57 56 73 38 37 19 19
For each exp #, I want to make a Venn diagram with the values Q1 Unique, Q2 Unique, and Overlap.
I have tried a couple of things, the below code has gotten me the closest:
from matplotlib import pyplot as plt
import numpy as np
from matplotlib_venn import venn2, venn2_circles
import csv
import pandas as pd
import numpy as np
val_path = r"C:\Users\lawashburn\Documents\DIA\DSD First Pass\20220202_Acquisition\Overlap_Values.csv"
val_tab = pd.read_csv(val_path)
exp_num = val_tab['Exp #']
cols = ['Q1 unique','Q2 unique', 'Overlap']
df = pd.DataFrame()
df ['Exp #'] = exp_num
df['combined'] = val_tab[cols].apply(lambda row: ','.join(row.values.astype(str)), axis=1)
print(df)
exp_no = df['Exp #'].tolist()
combined = df['combined'].tolist()
#combined = [int(i) for i in combined]
print(combined)
for a in exp_no:
plt.figure(figsize=(4,4))
plt.title(a)
for b in combined:
v = venn2(subsets=(b), set_labels = ('Q1', 'Q2'), set_colors=('purple','skyblue'), alpha=0.7)
v.get_label_by_id('A').set_text('Q1')
c = venn2_circles(subsets=(b))
plt.show()
plt.savefig(a + 'output.png')
This generates a DataFrame:
Exp # combined
0 1 14,40,18
1 2 28,34,10
2 4 37,51,32
3 5 31,47,37
4 6 0,0,0
5 7 27,40,25
6 8 39,39,30
7 10 17,45,18
8 11 39,39,33
9 14 52,48,32
10 17 30,49,26
11 19 32,46,24
12 20 35,46,37
13 21 38,37,19
However, I think I run into the issue when I export the combined column into a list:
['14,40,18', '28,34,10', '37,51,32', '31,47,37', '0,0,0', '27,40,25', '39,39,30', '17,45,18', '39,39,33', '52,48,32', '30,49,26', '32,46,24', '35,46,37', '38,37,19']
As after this I get the error:
numpy.core._exceptions.UFuncTypeError: ufunc 'absolute' did not contain a loop with signature matching types dtype('<U8') -> dtype('<U8')
How should I proceed from here? I would like 13 separate Venn Diagrams, and to export each of them into a separate .png file.
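The UFuncTypeError above is consistent with venn2 receiving a string such as '14,40,18' where it expects numbers. As a hedged sketch, the three columns can be turned into one integer tuple per experiment without joining them into strings at all. The small val_tab frame below is a hypothetical stand-in for the first rows of the CSV, and the plotting calls are left out so that the example depends only on pandas.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of Overlap_Values.csv
val_tab = pd.DataFrame({
    "Exp #": [1, 2, 4],
    "Q1 unique": [14, 28, 37],
    "Q2 unique": [40, 34, 51],
    "Overlap": [18, 10, 32],
})

cols = ["Q1 unique", "Q2 unique", "Overlap"]

# One (Q1 unique, Q2 unique, Overlap) tuple of plain ints per experiment
subsets_by_exp = {
    int(exp): tuple(int(v) for v in row)
    for exp, row in zip(val_tab["Exp #"], val_tab[cols].to_numpy())
}
print(subsets_by_exp)
```

Each tuple can then be passed as the subsets argument of venn2 inside a single loop over subsets_by_exp.items(), with one figure per experiment. Two further details in the original loop are likely to need attention: the file name should be built from str(a), since a is an integer, and plt.savefig is typically called before plt.show so that the figure is still populated when it is written.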

Find the difference between the max value and 2nd highest value within a subset of pandas columns

I have a fairly large dataframe:
A B C D
0 17 36 45 54
1 18 23 17 17
2 74 47 8 46
3 48 38 96 83
I am trying to create a new column that is ((max value of the columns) - (2nd highest value)) / (2nd highest value).
In this example it would look something like:
A B C D Diff
0 17 36 45 54 .20
1 18 23 17 17 .28
2 74 47 8 46 .57
3 48 38 96 83 .16
I've tried df['diff'] = df.loc[:, 'A': 'D'].max(axis=1) - df.iloc[:df.index.get_loc(df.loc[:, 'A': 'D'].idxmax(axis=1))] / ...
but even that part of the formula returns an error, never mind including the final division. I'm sure there must be an easier way of going about this.
Edit: Additionally, I am also trying to get the difference between the max value and the column that immediately precedes the max value. I know this is a somewhat different question, but I would appreciate any insight. Thank you!

One way using pandas.Series.nlargest with pct_change:
df["Diff"] = df.apply(lambda x: x.nlargest(2).pct_change(-1).iloc[0], axis=1)  # .iloc[0]: positional access, since the index holds column labels
Output:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627

One way is to apply a udf:
def get_pct(x):
    # The last two values after sorting are the 2nd highest and the max
    xmax2, xmax = x.sort_values().tail(2)
    return (xmax-xmax2)/xmax2
df['Diff'] = df.apply(get_pct, axis=1)
Output:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627

We can also make use of numpy sort and np.diff :
arr = np.sort(df,axis=1)[:,-2:]
df['Diff'] = np.diff(arr,axis=1)[:,0]/arr[:,0]
print(df)
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627

Let us try getting the second max value with mask:
Max = df.max(axis=1)
secMax = df.mask(df.eq(Max, axis=0)).max(axis=1)
df['Diff'] = (Max - secMax)/secMax
df
Out[69]:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
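The sort-based and mask-based answers agree on the sample data, but they can differ when the row maximum occurs more than once. The sketch below uses a made-up second row (5, 5, 1, 2) to show the difference. Which result is wanted depends on how "2nd highest" is defined for ties, which the question does not say.

```python
import numpy as np
import pandas as pd

# Row 0 comes from the question; row 1 is a made-up row with a tied maximum
df = pd.DataFrame({"A": [17, 5], "B": [36, 5], "C": [45, 1], "D": [54, 2]})

# Sort-based: the two largest entries, duplicates included
top2 = np.sort(df.to_numpy(), axis=1)[:, -2:]
diff_sort = (top2[:, 1] - top2[:, 0]) / top2[:, 0]

# Mask-based: every occurrence of the maximum is hidden before taking max again
row_max = df.max(axis=1)
second = df.mask(df.eq(row_max, axis=0)).max(axis=1)
diff_mask = (row_max - second) / second

print(diff_sort)
print(diff_mask.to_numpy())
```

For the tied row, the sort-based version treats the second 5 as the runner-up and returns 0, while the mask-based version skips both 5s and compares against 2.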

get only previous three values from the dataframe

I am new to Python and pandas. I have a dataframe that looks like this:
Id Offset feature
0 0 2
0 5 2
0 11 0
0 21 22
0 28 22
1 32 0
1 38 21
1 42 21
1 52 21
1 55 0
1 58 0
1 62 1
1 66 1
1 70 1
2 73 0
2 78 1
2 79 1
From this, wherever the feature is 0, I am trying to get the previous three feature values together with their offsets.
So the output would look like:
offset Feature
11 2
21 22
28 22
// These are the three values preceding the 0 at offset 32
Likewise, for the next places in the same dataframe where the feature is 0:
38 21
42 21
52 21
58 0
62 1
66 1
Is there any way I can get this?
Thanks.
This should be done on the basis of the document Id.

I am also quite new to pandas, but I have attempted to answer your question.
I populated your data as comma-separated values in data.csv and then used slicing to get the previous 3 rows.
import pandas as pd
df = pd.read_csv('./data.csv')
for index in (df.loc[df['Feature'] == 0]).index:
    # Label-based slice: the three rows before each zero row
    print(df.loc[index-3:index-1])
The output looks like this. The leftmost column is the index, which you can discard if you don't want it. Is this what you were looking for?
Offset Feature
2 11 2
3 21 22
4 28 22
Offset Feature
6 38 21
7 42 21
8 52 21
Offset Feature
7 42 21
8 52 21
9 55 0
Offset Feature
11 62 1
12 66 1
13 70 1
Note : There might be a more pythonic way to do this.

You can take 3 previous rows of your current 0 value in the column using loc.
Follow the code:
import pandas as pd
df = pd.read_csv("<path_of_the_file>")
zero_indexes = list(df[df['Feature'] == 0].index)
for each_zero_index in zero_indexes:
    df1 = df.loc[each_zero_index - 3: each_zero_index]
    print(df1) # This dataframe has 4 records: the previous three plus the zero record.
Output:
Offset Feature
2 11 2
3 21 22
4 28 22
5 32 0
Offset Feature
6 38 21
7 42 21
8 52 21
9 55 0
Offset Feature
7 42 21
8 52 21
9 55 0
10 58 0
Offset Feature
11 62 1
12 66 1
13 70 1
14 73 0
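Both answers do arithmetic on index labels (index - 3), which relies on the DataFrame having the default 0..n-1 index. As a hedged alternative, the sketch below works with row positions and iloc instead, and clamps the window at the top of the frame. The six-row frame is a shortened copy of the question's data with the Id column left out.

```python
import pandas as pd

# Shortened copy of the question's data (Id column omitted)
df = pd.DataFrame({
    "Offset": [0, 5, 11, 21, 28, 32],
    "Feature": [2, 2, 0, 22, 22, 0],
})

# Row positions (not labels) where Feature is 0
zero_positions = [pos for pos, value in enumerate(df["Feature"]) if value == 0]

# Up to three rows before each zero; max(0, ...) keeps the slice inside the frame
windows = [df.iloc[max(0, pos - 3):pos] for pos in zero_positions]

for window in windows:
    print(window)
```

The first zero sits at position 2, so its window holds only the two rows that exist before it. To respect the per-Id requirement from the question, the same logic could be applied inside df.groupby("Id"), which is not shown here.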

pandas set one column equal to 1 but both df changes

I'm trying to set column D of df b to 1. However, when I run this code, it also changes column D of df a to 1. Why is that? Why are the variables linked, and how do I change df b only?
import pandas as pd, os, numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
a=df
b=df
b['D']=1
output:
>>> a
A B C D
0 98 84 3 1
1 13 35 76 1
2 17 84 28 1
3 22 9 41 1
4 54 3 20 1
>>> b
A B C D
0 98 84 3 1
1 13 35 76 1
2 17 84 28 1
3 22 9 41 1
4 54 3 20 1
>>>

a, b and df are references to the same object. When you change b['D'], you are actually changing that column of the actual object. Instead, it looks like you want to copy the DataFrame:
import pandas as pd, os, numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
a=df.copy()
b=df.copy()
b['D']=1
which yields
b.head()
Out:
A B C D
0 63 52 92 1
1 98 35 43 1
2 24 87 70 1
3 38 4 7 1
4 71 30 25 1
a.head()
Out:
A B C D
0 63 52 92 80
1 98 35 43 78
2 24 87 70 26
3 38 4 7 48
4 71 30 25 61
There are also detailed responses here.
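The difference between assigning a name and copying can be checked directly with the is operator. This is a small illustrative sketch; the names alias and clone and the three-row frame are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "D": [10, 20, 30]})

alias = df         # a second name for the same DataFrame object
clone = df.copy()  # a separate DataFrame with its own data

alias["D"] = 1     # also visible through df, because alias is df

same_object = alias is df
is_copy_same = clone is df

print(same_object, is_copy_same)
print(df["D"].tolist(), clone["D"].tolist())
```

Plain assignment never copies in Python; it only binds another name to the same object, which is why the change made through b showed up in a.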

Don't use = when trying to copy a dataframe.
Use pd.DataFrame.copy(yourdataframe) instead:
a = pd.DataFrame.copy(df)
b = pd.DataFrame.copy(df)
b['D'] = 1
This should solve your problem

You should use copy. Change
a=df
b=df
to
a=df.copy()
b=df.copy()
Check out this reference where this issue is discussed a bit more in depth. I also had this confusion when I started using Pandas.

Matlab diff(F,var,n) vs Python numpy diff(a, n=1, axis=-1)

I am trying to reproduce a MATLAB function call in Python:
y = diff(x,1,2)
x is a grayscale image.
I tried the numpy diff function, but I get a different answer.
Please help.

There are two problems here.
First, you swapped the order of arguments in np.diff. MATLAB and Python use the same argument order. Python supports named arguments, so it is often better to use the argument name to avoid this sort of problem.
Second, python indexing starts with 0, while MATLAB indexing starts with 1. This applies to axes as well, so MATLAB's axis 2 is Python's axis 1.
So the correct function call in Python is np.diff(fimg, 1, 1), but np.diff(fimg, axis=1) is better IMO.
MATLAB:
>> a = reshape(1:100, 10, [])'
a =
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
>> diff(a,1, 2)
ans =
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
Python:
>>> a = np.arange(100).reshape(10, -1)
>>> print(a)
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]]
>>> print(np.diff(a, axis=1))
[[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]]
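To make the axis mapping concrete, the sketch below checks that np.diff along axis=1 equals subtracting each column from the one to its right. It also shows one further possible source of mismatches that the question does not confirm: if the image is stored as uint8 (an assumption), NumPy's subtraction wraps around instead of going negative, so casting to a signed type first may be needed.

```python
import numpy as np

a = np.arange(1, 13).reshape(3, 4)

# Named arguments avoid mixing up n and axis
d_cols = np.diff(a, n=1, axis=1)   # MATLAB: diff(a, 1, 2)
d_rows = np.diff(a, n=1, axis=0)   # MATLAB: diff(a, 1, 1)
manual = a[:, 1:] - a[:, :-1]      # what axis=1 computes

# Assumption: the grayscale image is uint8; 5 - 10 wraps around to 251
img = np.array([[10, 5]], dtype=np.uint8)
wrapped = np.diff(img, axis=1)
signed = np.diff(img.astype(np.int16), axis=1)

print(d_cols.shape, d_rows.shape)
print(wrapped, signed)
```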

In the comments to your question, it appears you swapped the arguments to the diff function. However, the documentation states that in both MATLAB and NumPy, the order of the arguments is:
array
n
dimension
