Making pair by 2 rows and slicing from each row - python

I have a dataframe like:
x1 y1 x2 y2
0 149 2653 2152 2656
1 149 2465 2152 2468
2 149 1403 2152 1406
3 149 1215 2152 1218
4 170 2692 2170 2695
5 170 2475 2170 2478
6 170 1413 2170 1416
7 170 1285 2170 1288
I need to pair every two consecutive rows by DataFrame index, i.e. [0,1], [2,3], [4,5], [6,7], etc.,
and extract x1,y1 from the first row of each pair and x2,y2 from the second row of the same pair.
Sample Output:
[[149,2653,2152,2468],[149,1403,2152,1218],[170,2692,2170,2478],[170,1413,2170,1288]]
Please feel free to ask if it's not clear.
So far I have tried grouping by pairs and a shift operation,
but I didn't manage to build the paired records.

Python solution:
Select values of columns by positions to lists:
a = df[['x2', 'y2']].iloc[1::2].values.tolist()
b = df[['x1', 'y1']].iloc[0::2].values.tolist()
And then zip and join together in list comprehension:
L = [y + x for x, y in zip(a, b)]
print (L)
[[149, 2653, 2152, 2468], [149, 1403, 2152, 1218],
[170, 2692, 2170, 2478], [170, 1413, 2170, 1288]]
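For reference, here is the same idea as a self-contained sketch that rebuilds the sample frame from the question. The names `first`, `second`, and `pairs` are only illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(
    [[149, 2653, 2152, 2656], [149, 2465, 2152, 2468],
     [149, 1403, 2152, 1406], [149, 1215, 2152, 1218],
     [170, 2692, 2170, 2695], [170, 2475, 2170, 2478],
     [170, 1413, 2170, 1416], [170, 1285, 2170, 1288]],
    columns=["x1", "y1", "x2", "y2"],
)

# x1, y1 come from the rows at even positions (0, 2, 4, ...)
first = df[["x1", "y1"]].iloc[0::2].values.tolist()
# x2, y2 come from the rows at odd positions (1, 3, 5, ...)
second = df[["x2", "y2"]].iloc[1::2].values.tolist()

# Concatenate the two halves of each pair into one 4-element record
pairs = [p + q for p, q in zip(first, second)]
print(pairs)
```

Because `iloc` slices by position, this does not depend on the index labels. With an odd number of rows, `zip` silently drops the unpaired last row.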
Thank you, @user2285236, for another solution:
L = np.concatenate([df.loc[::2, ['x1', 'y1']], df.loc[1::2, ['x2', 'y2']]], axis=1).tolist()
Pure pandas solution:
First DataFrameGroupBy.shift by each 2 rows:
df[['x2', 'y2']] = df.groupby(np.arange(len(df)) // 2)[['x2', 'y2']].shift(-1)
print (df)
x1 y1 x2 y2
0 149 2653 2152.0 2468.0
1 149 2465 NaN NaN
2 149 1403 2152.0 1218.0
3 149 1215 NaN NaN
4 170 2692 2170.0 2478.0
5 170 2475 NaN NaN
6 170 1413 2170.0 1288.0
7 170 1285 NaN NaN
Then remove NaNs rows, convert to int and then to list:
print (df.dropna().astype(int).values.tolist())
[[149, 2653, 2152, 2468], [149, 1403, 2152, 1218],
[170, 2692, 2170, 2478], [170, 1413, 2170, 1288]]

Here's one solution via numpy.hstack. Note it is natural to feed numpy arrays directly to pd.DataFrame, since this is how Pandas stores data internally.
import numpy as np
import pandas as pd
arr = np.hstack((df[['x1', 'y1']].values[::2],
                 df[['x2', 'y2']].values[1::2]))
res = pd.DataFrame(arr)
print(res)
0 1 2 3
0 149 2653 2152 2468
1 149 1403 2152 1218
2 170 2692 2170 2478
3 170 1413 2170 1288

Here's a solution using a custom iterator based on iterrows(), but it's a bit clunky:
import pandas as pd
df = pd.DataFrame( columns=['x1','y1','x2','y2'], data=
[[149, 2653, 2152, 2656], [149, 2465, 2152, 2468], [149, 1403, 2152, 1406], [149, 1215, 2152, 1218],
[170, 2692, 2170, 2695], [170, 2475, 2170, 2478], [170, 1413, 2170, 1416], [170, 1285, 2170, 1288]] )
def iter_oddeven_pairs(df):
    row_it = df.iterrows()
    try:
        while True:
            _, row = next(row_it)
            yield row.iloc[0:2]  # x1, y1 from the first row of the pair
            _, row = next(row_it)
            yield row.iloc[2:4]  # x2, y2 from the second row of the pair
    except StopIteration:
        pass
print(pd.concat([pair for pair in iter_oddeven_pairs(df)]))

Related

Dynamically differencing columns in a pandas dataframe using similar column names

The following is the first couple of columns of a data frame, and I calculate V1_x - V1_y, V2_x - V2_y, V3_x - V3_y etc. The difference variable names differ only by the last character (either x or y)
import pandas as pd
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Address': ['xx', 'yy', 'zz','ww'], 'V1_x': [20, 21, 19, 18], 'V2_x': [233, 142, 643, 254], 'V3_x': [343, 543, 254, 543], 'V1_y': [20, 21, 19, 18], 'V2_y': [233, 142, 643, 254], 'V3_y': [343, 543, 254, 543]}
df = pd.DataFrame(data)
df
Name Address V1_x V2_x V3_x V1_y V2_y V3_y
0 Tom xx 20 233 343 20 233 343
1 Joseph yy 21 142 543 21 142 543
2 Krish zz 19 643 254 19 643 254
3 John ww 18 254 543 18 254 543
I currently do the calculation by manually defining the column names:
new_df = pd.DataFrame()
new_df['Name'] = df['Name']
new_df['Address'] = df['Address']
new_df['Col1'] = df['V1_x']-df['V1_y']
new_df['Col2'] = df['V2_x']-df['V2_y']
new_df['Col3'] = df['V3_x']-df['V3_y']
Is there an approach that I can use to check if the last column names only differ by the last character and difference them if so?
Try creating a MultiIndex header using .str.split, then reshape the DataFrame and use pd.DataFrame.eval for the calculation, then reshape back to the original form with the additional columns. Lastly, flatten the MultiIndex header using a list comprehension with f-string formatting:
dfi = df.set_index(['Name', 'Address'])
dfi.columns = dfi.columns.str.split('_', expand=True)
dfs = dfi.stack(0).eval('diff=x-y').unstack()
dfs.columns = [f'{j}_{i}' for i, j in dfs.columns]
dfs
Output:
V1_x V2_x V3_x V1_y V2_y V3_y V1_diff V2_diff V3_diff
Name Address
John ww 18 254 543 18 254 543 0 0 0
Joseph yy 21 142 543 21 142 543 0 0 0
Krish zz 19 643 254 19 643 254 0 0 0
Tom xx 20 233 343 20 233 343 0 0 0
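As an alternative sketch (a swapped-in technique, not the stack/eval approach above), the matching columns can also be found by checking suffixes directly. The names `bases`, `diffs`, and the `_diff` suffix are assumptions chosen for illustration:

```python
import pandas as pd

data = {
    "Name": ["Tom", "Joseph", "Krish", "John"],
    "Address": ["xx", "yy", "zz", "ww"],
    "V1_x": [20, 21, 19, 18], "V2_x": [233, 142, 643, 254], "V3_x": [343, 543, 254, 543],
    "V1_y": [20, 21, 19, 18], "V2_y": [233, 142, 643, 254], "V3_y": [343, 543, 254, 543],
}
df = pd.DataFrame(data)

# Base names that have both an "_x" and a "_y" column
bases = [c[:-2] for c in df.columns
         if c.endswith("_x") and c[:-2] + "_y" in df.columns]

# One difference column per matched pair
diffs = pd.DataFrame({f"{b}_diff": df[f"{b}_x"] - df[f"{b}_y"] for b in bases})
new_df = pd.concat([df[["Name", "Address"]], diffs], axis=1)
print(new_df)
```

This keeps the original row order and only differences columns whose names differ solely in the trailing `x`/`y`.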

How do I create a new column based on existing columns in pandas?

I am new to Python/pandas. I want to compute continuous returns based on the "GOOG" price. If the price is in column (a), how should I calculate the return in column (b) according to the following formula?
continuous returns = ln(a_t / a_{t-1})
I want to do this like the image below (calculating continuous returns in Excel) in Pandas DataFrame.
import pandas as pd
x = pd.DataFrame([2340, 2304, 2238, 2260, 2315, 2318, 2300, 2310, 2353, 2350],
columns=['a'])
Try:
import numpy as np
x['b'] = np.log(x['a']/x['a'].shift())
Output:
a b
0 2340 NaN
1 2304 -0.015504
2 2238 -0.029064
3 2260 0.009782
4 2315 0.024045
5 2318 0.001295
6 2300 -0.007796
7 2310 0.004338
8 2353 0.018444
9 2350 -0.001276
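As a quick check on the arithmetic, the log of a ratio equals the difference of logs, so the same column can be written with `.diff()`. This is only a sketch; the names `ratio_form` and `diff_form` are illustrative:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([2340, 2304, 2238, 2260, 2315, 2318, 2300, 2310, 2353, 2350],
                 columns=["a"])

# ln(a_t / a_{t-1}), written as the log of a ratio
ratio_form = np.log(x["a"] / x["a"].shift())
# The same quantity, written as a difference of logs
diff_form = np.log(x["a"]).diff()

# The first row has no previous price, so it is NaN in both forms
print(np.allclose(ratio_form.iloc[1:], diff_form.iloc[1:]))
print(round(ratio_form.iloc[1], 6))
```

The second printed value corresponds to row 1 of the output above, ln(2304 / 2340).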
You can use a generator function with .apply:
import numpy as np
import pandas as pd
x = pd.DataFrame(
[2340, 2304, 2238, 2260, 2315, 2318, 2300, 2310, 2353, 2350], columns=["a"]
)
def fn():
    old_a = np.nan
    a = yield
    while True:
        new_a = yield np.log(a / old_a)
        a, old_a = new_a, a
s = fn()
next(s)
x["b"] = x["a"].apply(lambda v: s.send(v))
print(x)
Prints:
a b
0 2340 NaN
1 2304 -0.015504
2 2238 -0.029064
3 2260 0.009782
4 2315 0.024045
5 2318 0.001295
6 2300 -0.007796
7 2310 0.004338
8 2353 0.018444
9 2350 -0.001276

one to one column-value comparison between 2 dataframes - pandas

I have two DataFrames:
print(d)
Year Salary Amount Amount1 Amount2
0 2019 1200 53 53 53
1 2020 3443 455 455 455
2 2021 6777 123 123 123
3 2019 5466 313 313 313
4 2020 4656 545 545 545
5 2021 4565 775 775 775
6 2019 4654 567 567 567
7 2020 7867 657 657 657
8 2021 6766 567 567 567
print(d1)
Year Salary Amount Amount1 Amount2
0 2019 1200 53 73 63
import pandas as pd
d = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Salary': [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    'Amount': [53, 455, 123, 313, 545, 775, 567, 657, 567],
    'Amount1': [53, 455, 123, 313, 545, 775, 567, 657, 567],
    'Amount2': [53, 455, 123, 313, 545, 775, 567, 657, 567]
})
d1 = pd.DataFrame({
    'Year': [2019],
    'Salary': [1200],
    'Amount': [53],
    'Amount1': [73],
    'Amount2': [63]
})
I want to compare the 'Salary' value of d1 (1200) with every value of 'Salary' in d and count how many rows satisfy a Boolean comparison (>= or <). The same should be done for all the other columns (Amount, Amount1, Amount2, etc.). If the value in any column of d1 is NaN/None, no comparison is needed for that column. The column names are always the same in both DataFrames, so it is basically a one-to-one column comparison.
My approach and thoughts -
I can get the values of d1 in a list by doing -
l = []
for i in range(len(d1.columns.values)):
    if i == 0:
        continue
    else:
        num = d1.iloc[0, i]
        l.append(num)
print(l)
# list comprehension equivalent
lst = [d1.iloc[0, i] for i in range(len(d1.columns.values)) if i != 0]
[1200, 53, 73, 63]
and then use iterrows to iterate over all the columns and rows in dataframe d OR
I can iterate over d and then perform a similar comparison by looping over d1 - but these would be time consuming for a high dimensional dataframe(d in this case).
What would be the more efficient or pythonic way of doing it?
IIUC, you can do:
(d >= d1.values).sum()
Output:
Year 9
Salary 9
Amount 9
Amount1 8
Amount2 8
dtype: int64
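The question also asks that columns holding NaN in d1 be skipped. A minimal sketch of one way to do that follows, assuming a hypothetical variant of d1 in which Amount2 is missing. The names `ref`, `counts_ge`, and `counts_lt` are illustrative:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({
    "Year": [2019, 2020, 2021] * 3,
    "Salary": [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    "Amount": [53, 455, 123, 313, 545, 775, 567, 657, 567],
    "Amount1": [53, 455, 123, 313, 545, 775, 567, 657, 567],
    "Amount2": [53, 455, 123, 313, 545, 775, 567, 657, 567],
})
# Hypothetical d1 with a missing Amount2, to demonstrate the skipping
d1 = pd.DataFrame({"Year": [2019], "Salary": [1200], "Amount": [53],
                   "Amount1": [73], "Amount2": [np.nan]})

# Reference values from the single row of d1, with NaN columns removed
ref = d1.iloc[0].dropna()

# Compare each remaining column of d against its reference value and count
counts_ge = d[ref.index].ge(ref, axis="columns").sum()
counts_lt = d[ref.index].lt(ref, axis="columns").sum()
print(counts_ge)
print(counts_lt)
```

Dropping the NaN columns first matters because a comparison against NaN is always False, which would otherwise show up as a count of 0 rather than as "not compared".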

How to transform a 3d arrays into a dataframe in python

I have a 3d array as follows:
ThreeD_Arrays = np.random.randint(0, 1000, (5, 4, 3))
array([[[715, 226, 632],
[305, 97, 534],
[ 88, 592, 902],
[172, 932, 263]],
[[895, 837, 431],
[649, 717, 39],
[363, 121, 274],
[334, 359, 816]],
[[520, 692, 230],
[452, 816, 887],
[688, 509, 770],
[290, 856, 584]],
[[286, 358, 462],
[831, 26, 332],
[424, 178, 642],
[955, 42, 938]],
[[ 44, 119, 757],
[908, 937, 728],
[809, 28, 442],
[832, 220, 348]]])
Now I would like to have it into a DataFrame like this:
Add a Date column like indicated and the column names A, B, C.
How to do this transformation? Thanks!
Based on the answer to this question, we can use a MultiIndex. First, create the MultiIndex and a flattened DataFrame.
A = np.random.randint(0, 1000, (5, 4, 3))
names = ['x', 'y', 'z']
index = pd.MultiIndex.from_product([range(s)for s in A.shape], names=names)
df = pd.DataFrame({'A': A.flatten()}, index=index)['A']
Now we can reshape it however we like:
df = df.unstack(level='z').swaplevel().sort_index()
df.columns = ['A', 'B', 'C']
df.index.names = ['DATE', 'i']
This is the result:
A B C
DATE i
0 0 715 226 632
1 895 837 431
2 520 692 230
3 286 358 462
4 44 119 757
1 0 305 97 534
1 649 717 39
2 452 816 887
3 831 26 332
4 908 937 728
2 0 88 592 902
1 363 121 274
2 688 509 770
3 424 178 642
4 809 28 442
3 0 172 932 263
1 334 359 816
2 290 856 584
3 955 42 938
4 832 220 348
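To trace which array axis ends up where, here is a sketch of the same reshaping on a deterministic array (`np.arange` instead of random values, which is an assumption made purely so positions are easy to follow). The last axis, of length 3, must be the one moved into the columns:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random array: A[i, j, k] == 12*i + 3*j + k
A = np.arange(60).reshape(5, 4, 3)

index = pd.MultiIndex.from_product(
    [range(s) for s in A.shape], names=["x", "y", "z"]
)
s = pd.Series(A.flatten(), index=index)

# Move the last axis (length 3) into the columns, then order rows by (y, x)
df = s.unstack(level="z").swaplevel().sort_index()
df.columns = ["A", "B", "C"]
df.index.names = ["DATE", "i"]

print(df.shape)
# Row (DATE=0, i=1) holds A[1, 0, :]
print(df.loc[(0, 1)].tolist())
```

So DATE corresponds to the second array axis and i to the first, which matches the layout of the result table above.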
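You could convert your 3D array to a Pandas Panel, then flatten it to a 2D DataFrame (using .to_frame()). Note that pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so this only runs on older versions: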
import numpy as np
import pandas as pd
np.random.seed(2016)
arr = np.random.randint(0, 1000, (5, 4, 3))
pan = pd.Panel(arr)
df = pan.swapaxes(0, 2).to_frame()
df.index = df.index.droplevel('minor')
df.index.name = 'Date'
df.index = df.index+1
df.columns = list('ABC')
yields
A B C
Date
1 875 702 266
1 940 180 971
1 254 649 353
1 824 677 745
...
4 675 488 939
4 382 238 225
4 923 926 633
4 664 639 616
4 770 274 378
Alternatively, you could reshape the array to shape (20, 3), form the DataFrame as usual, and then fix the index:
import numpy as np
import pandas as pd
np.random.seed(2016)
arr = np.random.randint(0, 1000, (5, 4, 3))
df = pd.DataFrame(arr.reshape(-1, 3), columns=list('ABC'))
df.index = np.repeat(np.arange(arr.shape[0]), arr.shape[1]) + 1
df.index.name = 'Date'
print(df)
yields the same result.
ThreeD_Arrays = np.random.randint(0, 1000, (5, 4, 3))
df = pd.DataFrame([list(l) for l in ThreeD_Arrays]).stack().apply(pd.Series).reset_index(1, drop=True)
df.index.name = 'Date'
df.columns = list('ABC')

How to (efficiently) check if any two elements differ by 10

Suppose I have the following column in a -- pandas -- dataframe:
x
1 589
2 354
3 692
4 474
5 739
6 731
7 259
8 723
9 497
10 48
Note: I've changed the indexing to start at 1 (see test data).
I simply wish to test whether the difference between any two of the items in this column is less than 10.
Final result: No two elements should have an absolute difference less than 10.
Goal:
x
1 589
2 354
3 692
4 474
5 749 #
6 731
7 259
8 713 #
9 497
10 48
Perhaps this could be done using:
for index, row in df.iterrows():
However, that has not been successful thus far...
Given I'm looking to perform element-wise comparisons, I don't expect staggering speed...
Test Data:
import pandas as pd
stim_numb = 10  # number of rows in the test data
df = pd.DataFrame(index = range(1,stim_numb+1), columns= ['x'])
df['x'] = [589, 354, 692, 474, 739, 731, 259, 723, 497, 48]
One solution might be to sort the list, then compare consecutive items, adding 10 whenever the difference is too small, and then sorting the list back to the original order (if necessary).
from operator import itemgetter
lst = [589, 354, 692, 474, 739, 731, 259, 723, 497, 48]
# temp is list as pairs of original index and value, sorted by value
temp = [[i, e] for i, e in sorted(enumerate(lst), key=itemgetter(1))]
last = None
for item in temp:
    while last is not None and item[1] < last + 10:
        item[1] += 10
    last = item[1]
# sort the list back to original order using the index from each pair
lst_new = [e for i, e in sorted(temp, key=itemgetter(0))]
Result is [589, 354, 692, 474, 759, 741, 259, 723, 497, 48]
This is using plain Python lists; maybe it can be done more elegantly in Pandas or Numpy.
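For the test itself (whether any two elements are closer than 10), a NumPy sketch along the same sort-then-compare-neighbours idea might look like this. The function name `has_close_pair` and its `min_gap` parameter are hypothetical:

```python
import numpy as np

def has_close_pair(values, min_gap=10):
    """Return True if any two values differ by less than min_gap."""
    # After sorting, the closest pair is always a pair of neighbours
    ordered = np.sort(np.asarray(values))
    return bool((np.diff(ordered) < min_gap).any())

before = [589, 354, 692, 474, 739, 731, 259, 723, 497, 48]
after = [589, 354, 692, 474, 759, 741, 259, 723, 497, 48]

print(has_close_pair(before))  # True: 723 and 731 differ by 8
print(has_close_pair(after))   # False: the smallest gap is now 18
```

Sorting makes this O(n log n) instead of comparing every pair, and the same check can be applied to a DataFrame column by passing `df['x']`.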
