Pandas new column with constant increments - python

I need a new column that increases in fixed increments, in this case 0.02.
DF before:
x y x2 y2
0 1.022467 1.817298 1.045440 3.302572
1 1.026426 1.821669 1.053549 3.318476
2 1.018198 1.818419 1.036728 3.306648
3 1.013077 1.813290 1.026325 3.288020
4 1.017878 1.811058 1.036076 3.279930
DF after:
x y x2 y2 t
0 1.022467 1.817298 1.045440 3.302572 0.000000
1 1.026426 1.821669 1.053549 3.318476 0.020000
2 1.018198 1.818419 1.036728 3.306648 0.040000
3 1.013077 1.813290 1.026325 3.288020 0.060000
4 1.017878 1.811058 1.036076 3.279930 0.080000
5 1.016983 1.814031 1.034254 3.290708 0.100000
I have looked around for a while and cannot find a good solution. The only way that comes to mind is to build a standard Python list and bring it in. There has to be a better way. Thanks

Because your index is the perfect range for this (i.e. 0...n), just multiply it by your constant:
df['t'] = .02 * df.index.values
>>> df
x y x2 y2 t
0 1.022467 1.817298 1.045440 3.302572 0.00
1 1.026426 1.821669 1.053549 3.318476 0.02
2 1.018198 1.818419 1.036728 3.306648 0.04
3 1.013077 1.813290 1.026325 3.288020 0.06
4 1.017878 1.811058 1.036076 3.279930 0.08
You could also use a list comprehension:
df['t'] = [0.02 * i for i in range(len(df))]
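If the index is not a clean 0...n range, one hedged alternative is to base the increment on row position instead (a small sketch, assuming numpy is imported as np):
import numpy as np

# increments follow row position, not the index labels
df['t'] = np.arange(len(df)) * 0.02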

Related

Length of list: len(list) returning wrong value in Python

It might sound trivial, but I am surprised by the output. Basically, I am calculating y = a*x + b for given a, b & x. With the code below I am able to get the desired result for y, which is a list of 20 values.
But when I check the length of the list, I get 1 in return, and the range is (0, 1), which is weird as I was expecting it to be 20.
Am I making any mistake here?
a = 10
b = 0
x = df['x']
print(x)
0 0.000000
1 0.052632
2 0.105263
3 0.157895
4 0.210526
5 0.263158
6 0.315789
7 0.368421
8 0.421053
9 0.473684
10 0.526316
11 0.578947
12 0.631579
13 0.684211
14 0.736842
15 0.789474
16 0.842105
17 0.894737
18 0.947368
19 1.000000
y_new = []
for i in x:
    y = a*x + b
y_new.append(y)
len(y_new)
Output: 1
print(y_new)
[0 0.000000
1 0.526316
2 1.052632
3 1.578947
4 2.105263
5 2.631579
6 3.157895
7 3.684211
8 4.210526
9 4.736842
10 5.263158
11 5.789474
12 6.315789
13 6.842105
14 7.368421
15 7.894737
16 8.421053
17 8.947368
18 9.473684
19 10.000000
Name: x, dtype: float64]
I would propose two solutions:
The first solution: convert your column df['x'] into a list with df['x'].tolist(), re-run your code, and replace a*x + b with a*i + b (a sketch follows below).
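A minimal sketch of that first fix, assuming a, b, and df are defined as in the question:
x = df['x'].tolist()   # plain Python list instead of a Series

y_new = []
for i in x:
    y = a*i + b        # use the loop variable i, not the whole column x
    y_new.append(y)

print(len(y_new))      # 20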
The second solution (which I would do): convert df['x'] into an array with x = np.array(df['x']). By doing this you can use array broadcasting.
So, your code will simply be :
import numpy as np

x = np.array(df['x'])
y = a*x + b
This should give you the desired output.
I hope this is helpful.
With the code below, I get a length of 20 for y_new. Are you sure you are printing the right value? According to this post, df['x'] returns a pandas Series, so df['x'] is equivalent to pd.Series(...).
df['x'] — indexes the column named 'x'; returns a pd.Series
import pandas as pd

a = 10
b = 0
x = pd.Series(data=[0.000000, 0.052632, 0.105263, 0.157895, 0.210526, 0.263158, 0.315789, 0.368421, 0.421053, 0.473684,
                    0.526316, 0.578947, 0.631579, 0.684211, 0.736842, 0.789474, 0.842105, 0.894737, 0.947368, 1.000000])
y_new = []
for i in x:
    y = a*x + b
    y_new.append(y)
print("y_new length: " + str(len(y_new)))
Output:
y_new length: 20

Pandas - Euclidean Distance Between Columns

I have a dataframe as follows:
uuid x_1 y_1 x_2 y_2
0 di-ab5 82.31 184.20 148.06 142.54
1 di-de6 92.35 185.21 24.12 16.45
2 di-gh7 123.45 0.01 NaN NaN
...
I am trying to calculate the euclidean distance between [x_1, y_1] and [x_2, y_2] in a new column (not real values in this example).
uuid dist
0 di-ab5 12.31
1 di-de6 62.35
2 di-gh7 NaN
Caveats:
some rows have NaN on some of the datapoints
it is okay to represent the data in the original dataframe as points (e.g. [1.23, 4.56]) instead of splitting up the x and y coordinates
I am currently using the following script:
df['dist'] = np.sqrt((df['x_1'] - df['x_2'])**2 + (df['y_1'] - df['y_2'])**2)
But it seems verbose and often fails.
Is there a better way to do this using pandas, numpy, or scipy?
You can use np.linalg.norm, i.e.:
df['dist'] = np.linalg.norm(df.iloc[:, [1, 2]].values - df.iloc[:, [3, 4]].values, axis=1)
Output:
uuid x_1 y_1 x_2 y_2 dist
0 di-ab5 82.31 184.20 148.06 142.54 77.837125
1 di-de6 92.35 185.21 24.12 16.45 182.030960
2 di-gh7 123.45 0.01 NaN NaN NaN
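As a hedged variant of the same idea, selecting the coordinate columns by name rather than by position may be less fragile if the column order ever changes (not part of the original answer):
# select coordinate columns by name instead of iloc positions
df['dist'] = np.linalg.norm(df[['x_1', 'y_1']].to_numpy() - df[['x_2', 'y_2']].to_numpy(), axis=1)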
def getDist(df, a, b):
    return np.sqrt((df[f'x_{a}'] - df[f'x_{b}'])**2 + (df[f'y_{a}'] - df[f'y_{b}'])**2)
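The answer does not show how getDist is meant to be called; presumably a and b are the point suffixes, so usage might look like this (an assumed example, not from the original answer):
# hypothetical usage: distance between point 1 (x_1, y_1) and point 2 (x_2, y_2)
df['dist'] = getDist(df, 1, 2)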
np.sqrt((df.filter(like='x').agg('diff',1).sum(1)**2)+(df.filter(like='y').agg('diff',1).sum(1)**2))
How it works
Filter x and y respectively
df.filter(like='x')
Find the cross column difference and square it.
df.filter(like='x').agg('diff',1).sum(1)**2
Add the two outcomes together and find the square root.
np.sqrt((df.filter(like='x').agg('diff',1).sum(1)**2)+(df.filter(like='y').agg('diff',1).sum(1)**2))
Another solution using numpy:
diff = (df[['x_1','y_1']].to_numpy()-df[['x_2','y_2']].to_numpy())
df['dist'] = np.sqrt((diff*diff).sum(-1))
output:
uuid x_1 y_1 x_2 y_2 dist
0 di-ab5 82.31 184.20 148.06 142.54 77.837125
1 di-de6 92.35 185.21 24.12 16.45 182.030960
2 di-gh7 123.45 0.01 NaN NaN NaN

How to loop over a dataframe with an increment factor based on a particular column value

The dataframe I am working with looks like this:
vid2 COS fsim FWeight
0 -_aaMGK6GGw_57_61 2 0.253792 0.750000
1 -_aaMGK6GGw_57_61 2 0.192565 0.250000
2 -_hbPLsZvvo_5_8 2 0.562707 0.333333
3 -_hbPLsZvvo_5_8 2 0.179969 0.666667
4 -_hbPLsZvvo_18_25 1 0.275962 0.714286
Here,
the features have the following meanings:
FWeight - weight of each fragment (or row)
fsim - similarity score between the two columns cap1 and cap2
The weighted formula is: vid_score = sum(fsim * FWeight) / sum(FWeight), taken over the COS rows of a video.
For example,
For vid2 "-_aaMGK6GGw_57_61", COS = 2
Thus, the two rows with this vid2 come under it:
fsim FWeight
0 0.253792 0.750000
1 0.192565 0.250000
The calculated value vid_score needs to be
vid_score(1st video) = (fsim[0] * FWeight[0] + fsim[1] * FWeight[1])/(FWeight[0] + FWeight[1])
The expected output value vid_score for vid2 = -_aaMGK6GGw_57_61 is
(0.750000) * (0.253792) + (0.250000) * (0.192565)
= 0.238485 (final value; the weights here already sum to 1, so dividing by their sum changes nothing)
For some videos, this COS = 1, 2, 3, 4, 5, ...
Thus this needs to be dynamic
I am trying to calculate the weighted similarity score for each video ID that is vid2 here. However, there are a number of captions and weights respectively for each video. It varies, some have 2, some 1, some 3, etc. This number of segments and captions has been stored in the feature COS (that is, count of segments).
I want to iterate through the dataframe where score for each video is stored as a weighted average score of the fsim (fragment similarity score). But the number of iteration is not regular.
I have tried the code below, but I am not able to iterate dynamically with the increment factor being COS instead of just a constant value:
vems_score = 0.0
video_scores = []
for i, row in merged.iterrows():
    vid_score = 0.0
    total_weight = 0.0
    for j in range(row['COS']):
        total_weight = total_weight + row['FWeight']
        vid_score = vid_score + (row['FWeight'] * row['fsim'])
    i = i + row['COS']
    vid_score = vid_score / total_weight
    video_scores.append(vid_score)
print(video_scores)
Here is my solution, which you can modify/optimize to your needs.
import pandas as pd, numpy as np

def computeSim():
    vid = [1, 1, 2, 2, 3]
    cos = [2, 2, 2, 2, 1]
    fsim = [0.25, .19, .56, .17, .27]
    weight = [.75, .25, .33, .66, .71]
    df = pd.DataFrame({'vid': vid, 'cos': cos, 'fsim': fsim, 'fw': weight})
    print(df)
    df2 = df.groupby('vid')
    similarity = []
    for group in df2:
        similarity.append(np.sum(group[1]['fsim'] * group[1]['fw']) / np.sum(group[1]['fw']))
    return similarity
output:
0.235
0.30000000000000004
0.27
Solution
Try this with your data. I assume that you stored the dataframe as df.
df['Prod'] = df['fsim']*df['FWeight']
grp = df.groupby(['vid2', 'COS'])
result = grp['Prod'].sum()/grp['FWeight'].sum()
print(result)
Output with your data (Dummy Data B, below):
vid2 COS
-_aaMGK6GGw_57_61 2 0.238485
-_hbPLsZvvo_18_25 1 0.275962
-_hbPLsZvvo_5_8 2 0.307548
dtype: float64
Dummy Data: A
I made the following dummy data to test a few aspects of the logic.
df = pd.DataFrame({'vid2': [1,1,2,5,2,6,7,4,8,7,6,2],
'COS': [2,2,3,1,3,2,2,1,1,2,2,3],
'fsim': np.random.rand(12),
'FWeight': np.random.rand(12)})
df['Prod'] = df['fsim']*df['FWeight']
print(df)
# Groupby and apply formula
grp = df.groupby(['vid2', 'COS'])
result = grp['Prod'].sum()/grp['FWeight'].sum()
print(result)
Output:
vid2 COS
1 2 0.405734
2 3 0.535873
4 1 0.534456
5 1 0.346937
6 2 0.369810
7 2 0.479250
8 1 0.065854
dtype: float64
Dummy Data: B (OP Provided)
This is your dummy data. I made this script so anyone could easily run it and load the data as a dataframe.
import pandas as pd
from io import StringIO
s = """
vid2 COS fsim FWeight
0 -_aaMGK6GGw_57_61 2 0.253792 0.750000
1 -_aaMGK6GGw_57_61 2 0.192565 0.250000
2 -_hbPLsZvvo_5_8 2 0.562707 0.333333
3 -_hbPLsZvvo_5_8 2 0.179969 0.666667
4 -_hbPLsZvvo_18_25 1 0.275962 0.714286
"""
df = pd.read_csv(StringIO(s), sep='\s+')
#print(df)
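For what it's worth, the same weighted average can also be computed in a single groupby step without the helper Prod column (a sketch, not from the original answer, assuming the dataframe above is named df):
# weighted average of fsim per video, weighted by FWeight
result = df.groupby('vid2').apply(lambda g: (g['fsim'] * g['FWeight']).sum() / g['FWeight'].sum())
print(result)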

Python - Creating an array for a series within a loop?

I'd like to collect values calculated in a for loop into a series so that it can be its own column in a dataframe. So far I've got this (the y values come from a dataframe named block):
N = 12250
for i in range(0, N-1):
    y1 = block.iloc[i]['y']
    y2 = block.iloc[i+1]['y']
    diffy[i] = y2 - y1
I'd like to make diffy its own series instead of just overwriting the diffy value on each loop iteration.
Some sample data (assume N = 5):
N = 5
np.random.seed(42)
block = pd.DataFrame({
'y': np.random.randint(0, 10, N)
})
y
0 6
1 3
2 7
3 4
4 6
You can calculate diffy as follow:
diffy = block['y'].diff().shift(-1)[:-1]
0 -3.0
1 4.0
2 -3.0
3 2.0
Name: y, dtype: float64
diffy is a pandas.Series. If you want a list, add .to_list(). If you want a numpy array, add .values.
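Since the original goal was to have this as its own column in the dataframe, one option is to keep the trailing NaN so the lengths line up (a sketch on the same block dataframe, not from the original answer):
# forward difference as a new column; the last row has no next value, so it stays NaN
block['diffy'] = block['y'].diff().shift(-1)
print(block)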

Can we use a pandas data frame to calculate the next value using a previous value? A good example would be the Fibonacci numbers

So I understand we can use a pandas dataframe to do vector operations on cells, like
df = pd.DataFrame([a, b, c])
df*3
would equal something like:
0 a*3
1 b*3
2 c*3
but could we use a pandas dataframe to, say, calculate the Fibonacci sequence?
I am asking because, for the Fibonacci sequence, the next number depends on the previous two numbers (F_n = F_(n-1) + F_(n-2)). I am not really interested in the Fibonacci sequence itself; I am more interested in knowing whether we can do something like:
df = pd.DataFrame([a,b,c])
df.apply( some_func )
0 x1 a
1 x2 b
2 x3 c
where x1 would be calculated from a, b, c (I know this is possible), x2 would be calculated from x1, and x3 would be calculated from x2.
The Fibonacci example would just be something like:
df = pd.DataFrame()
df.apply(fib(n, df))
0 0
1 1
2 1
3 2
4 3
5 5
.
.
.
n-1 F(n-1) + F(n-2)
You need to iterate through the rows and access the previous rows' data via DataFrame.loc. For example, with n = 6:
df = pd.DataFrame()
for i in range(0, 6):
    df.loc[i, 'f'] = i if i in [0, 1] else df.loc[i - 1, 'f'] + df.loc[i - 2, 'f']
df
f
0 0.0
1 1.0
2 1.0
3 2.0
4 3.0
5 5.0
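Growing a DataFrame row by row with .loc works but can get slow for large n; a hedged alternative (not from the original answer) is to build a plain Python list first and create the column once:
import pandas as pd

n = 6
fib = [0, 1]
for _ in range(2, n):
    # each new value depends on the previous two, same recurrence as above
    fib.append(fib[-1] + fib[-2])

df = pd.DataFrame({'f': fib[:n]})
print(df)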
