Pandas get nearest index with equivalent positive number - python

I have data similar to this:
import pandas as pd

data = {'A': [10, 20, 30, 10, -10, 20, -20, 10],
        'B': [100, 200, 300, 100, -100, 30, -30, 100],
        'C': [1000, 2000, 3000, 1000, -1000, 40, -40, 1000]}
df = pd.DataFrame(data)
df
     A    B     C
0   10  100  1000
1   20  200  2000
2   30  300  3000
3   10  100  1000
4  -10 -100 -1000
5   20   30    40
6  -20  -30   -40
7   10  100  1000
Here the sums across all columns for indices 0, 3 and 7 equal 1110 while index 4 equals -1110, and indices 5 and 6 sum to 90 and -90. These are exact opposites, so for such scenarios I want a fourth column D populated with the value 'Exact opposite' for the nearest matching pairs: indices 3 and 4, and indices 5 and 6.
Similar to:
     A    B     C               D
0   10  100  1000
1   20  200  2000
2   30  300  3000
3   10  100  1000  Exact opposite
4  -10 -100 -1000  Exact opposite
5   20   30    40  Exact opposite
6  -20  -30   -40  Exact opposite
7   10  100  1000
One approach I can think of is adding a column which sums the values of all the columns:
column_names = ['A', 'B', 'C']
df['Sum Val'] = df[column_names].sum(axis=1)
     A    B     C  Sum Val
0   10  100  1000     1110
1   20  200  2000     2220
2   30  300  3000     3330
3   10  100  1000     1110
4  -10 -100 -1000    -1110
5   20   30    40       90
6  -20  -30   -40      -90
7   10  100  1000     1110
and then check if there are any negative values and try to find the corresponding equal positive value, but I could not proceed from there.
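A sketch of one way to proceed from this point (my reading of the requirement, not a definitive implementation): for each negative row total, find the nearest index carrying the equal positive total and flag both rows.

import pandas as pd

data = {'A': [10, 20, 30, 10, -10, 20, -20, 10],
        'B': [100, 200, 300, 100, -100, 30, -30, 100],
        'C': [1000, 2000, 3000, 1000, -1000, 40, -40, 1000]}
df = pd.DataFrame(data)
df['Sum Val'] = df[['A', 'B', 'C']].sum(axis=1)
df['D'] = ''

# For every negative total, pair it with the *nearest* row whose total
# is the exact positive counterpart.
for i in df.index[df['Sum Val'] < 0]:
    matches = df.index[df['Sum Val'] == -df.at[i, 'Sum Val']]
    if len(matches):
        nearest = min(matches, key=lambda j: abs(j - i))
        df.loc[[i, nearest], 'D'] = 'Exact opposite'

On the sample data this flags indices 3 and 4 (totals ±1110) and 5 and 6 (totals ±90), matching the expected output.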

This should do the trick:
import pandas as pd

data = {'A': [10, 20, 30, 10, -10, 30, 10],
        'B': [100, 200, 300, 100, -100, 300, 100],
        'C': [1000, 2000, 3000, 1000, -1000, 3000, 1000]}
df = pd.DataFrame(data)
print(df)

cols = list(df.columns)  # snapshot so the loop ignores 'D' once it is added
for i in df.index[:-1]:
    for col in cols:
        if df.at[i, col] != -df.at[i + 1, col]:
            break
    else:  # every column of row i is the exact negation of row i+1
        df.at[i, 'D'] = 'Exact opposite'
        df.at[i + 1, 'D'] = 'Exact opposite'
print(df)
This solution only considers two adjacent rows.
The following code compares all pairs of rows, so it also detects rows 0 and 6:
import pandas as pd

data = {'A': [10, 20, 30, 10, -10, 30, 10],
        'B': [100, 200, 300, 100, -100, 300, 100],
        'C': [1000, 2000, 3000, 1000, -1000, 3000, 1000]}
df = pd.DataFrame(data)
print(df)

cols = list(df.columns)
for i in df.index:
    for j in df.index[i + 1:]:
        for col in cols:
            if df.at[i, col] != -df.at[j, col]:
                break
        else:
            df.at[i, 'D'] = 'Exact opposite'
            df.at[j, 'D'] = 'Exact opposite'
print(df)
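For larger frames, a vectorized sketch of the same all-pairs idea (assuming, as the loops above do, that "exact opposite" means every column is exactly negated): merge the frame with its own negation so each row pairs with every opposite row, with no Python-level double loop.

import pandas as pd

data = {'A': [10, 20, 30, 10, -10, 30, 10],
        'B': [100, 200, 300, 100, -100, 300, 100],
        'C': [1000, 2000, 3000, 1000, -1000, 3000, 1000]}
df = pd.DataFrame(data)

cols = ['A', 'B', 'C']
# Negate the frame, keeping the row labels in an 'index' column.
neg = (-df[cols]).reset_index()
# Rows of df that equal a negated row are exact opposites of that row.
pairs = df.reset_index().merge(neg, on=cols, suffixes=('', '_partner'))

flagged = pd.unique(pairs[['index', 'index_partner']].values.ravel())
df.loc[flagged, 'D'] = 'Exact opposite'
print(df)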

Related

How to efficiently reorder rows based on condition?

My dataframe:
df = pd.DataFrame({'col_1': [10, 20, 10, 20, 10, 10, 20, 20],
                   'col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
5 10 f
6 20 g
7 20 h
I don't want consecutive rows with col_1 = 10; instead, a row below a repeated 10 should jump up by one (in this case, index 6 should become index 5 and vice versa), so the order is always 10, 20, 10, 20, ...
My current solution:
for idx, row in df.iterrows():
    if row['col_1'] == 10 and df.iloc[idx + 1]['col_1'] != 20:
        df = df.rename({idx + 1: idx + 2, idx + 2: idx + 1})
df = df.sort_index()
df
gives me:
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
5 20 g
6 10 f
7 20 h
which is what I want, but it is very slow (2.34 s for a dataframe with just over 8000 rows).
Is there a way to avoid the loop here?
Thanks
You can use a custom key in sort_values with groupby.cumcount:
df.sort_values(by='col_1', kind='stable', key=lambda s: df.groupby(s).cumcount())
Output:
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
6 20 g
5 10 f
7 20 h
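To see why this works, it may help to print the key the lambda produces: groupby(s).cumcount() numbers the occurrences of each col_1 value, so the k-th 10 and the k-th 20 share a key and the stable sort interleaves them while keeping the original order within each group. A small sketch:

import pandas as pd

df = pd.DataFrame({'col_1': [10, 20, 10, 20, 10, 10, 20, 20],
                   'col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})

# Occurrence number of each col_1 value: the first 10 and the first 20
# get 0, the second 10 and the second 20 get 1, and so on.
key = df.groupby('col_1').cumcount()
print(key.tolist())  # [0, 0, 1, 1, 2, 3, 2, 3]

# Stable sort on that key yields 10, 20, 10, 20, ...
print(df.sort_values(by='col_1', kind='stable',
                     key=lambda s: df.groupby(s).cumcount()))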

Python Pandas - Date comparison works one statement at a time, but not together when using the & [duplicate]

For example, I have a simple DF:
import pandas as pd
from random import randint

df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                   'B': [randint(1, 9) * 10 for x in range(10)],
                   'C': [randint(1, 9) * 100 for x in range(10)]})
Can I select values from 'A' for which the corresponding values in 'B' are greater than 50 and the values in 'C' are not equal to 900, using Pandas methods and idioms?
Sure! Setup:
>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
...                    'B': [randint(1, 9)*10 for x in range(10)],
...                    'C': [randint(1, 9)*100 for x in range(10)]})
>>> df
A B C
0 9 40 300
1 9 70 700
2 5 70 900
3 8 80 900
4 7 50 200
5 9 30 900
6 2 80 700
7 2 80 400
8 5 80 300
9 7 70 800
We can apply column operations and get boolean Series objects:
>>> df["B"] > 50
0 False
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 True
9 True
Name: B
>>> (df["B"] > 50) & (df["C"] == 900)
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9 False
[Update, to switch to new-style .loc]:
And then we can use these to index into the object. For read access, you can chain indices:
>>> df["A"][(df["B"] > 50) & (df["C"] == 900)]
2 5
3 8
Name: A, dtype: int64
but you can get yourself into trouble because of the difference between a view and a copy when doing this for write access. You can use .loc instead:
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"]
2 5
3 8
Name: A, dtype: int64
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"].values
array([5, 8], dtype=int64)
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"] *= 1000
>>> df
A B C
0 9 40 300
1 9 70 700
2 5000 70 900
3 8000 80 900
4 7 50 200
5 9 30 900
6 2 80 700
7 2 80 400
8 5 80 300
9 7 70 800
Note that I accidentally typed == 900 and not != 900, or ~(df["C"] == 900), but I'm too lazy to fix it. Exercise for the reader. :^)
Another solution is to use the query method:
import pandas as pd
from random import randint

df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                   'B': [randint(1, 9) * 10 for x in range(10)],
                   'C': [randint(1, 9) * 100 for x in range(10)]})
print(df)
A B C
0 7 20 300
1 7 80 700
2 4 90 100
3 4 30 900
4 7 80 200
5 7 60 800
6 3 80 900
7 9 40 100
8 6 40 100
9 3 10 600
print(df.query('B > 50 and C != 900'))
A B C
1 7 80 700
2 4 90 100
4 7 80 200
5 7 60 800
Now if you want to change the returned values in column A, you can save their index:
my_query_index = df.query('B > 50 & C != 900').index
...and use .loc to change them, i.e.:
df.loc[my_query_index, 'A'] = 5000
print(df)
A B C
0 7 20 300
1 5000 80 700
2 5000 90 100
3 4 30 900
4 5000 80 200
5 5000 60 800
6 3 80 900
7 9 40 100
8 6 40 100
9 3 10 600
And remember to use parentheses!
Keep in mind that the & operator takes precedence over comparison operators such as > or <. That is why
4 < 5 & 6 > 4
evaluates to False: 5 & 6 is 4, so the expression becomes the chained comparison 4 < 4 > 4. Therefore, if you're using df.loc, you need to put parentheses around your logical statements, otherwise you get an error. That's why you should write:
df.loc[(df['A'] > 10) & (df['B'] < 15)]
instead of
df.loc[df['A'] > 10 & df['B'] < 15]
which would result in
TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
You can use pandas' built-in comparison methods. If you want to select values of "A" that meet the conditions on "B" and "C" (assuming you want back a DataFrame pandas object):
df[['A']][df.B.gt(50) & df.C.ne(900)]
df[['A']] will give you back column A in DataFrame format.
The pandas gt method returns a boolean mask where column B is greater than 50, and ne returns a mask where column C is not equal to 900.
It may be more readable to assign each condition to a variable, especially if there are a lot of them (ideally with descriptive names), and chain them together using bitwise operators (& or |). As a bonus, you don't need to worry about parentheses, because each condition is evaluated independently.
m1 = df['B'] > 50
m2 = df['C'] != 900
m3 = df['C'].pow(2) > 1000
m4 = df['B'].mul(4).between(50, 500)
# filter rows where all of the conditions are True
df[m1 & m2 & m3 & m4]
# filter rows of column A where all of the conditions are True
df.loc[m1 & m2 & m3 & m4, 'A']
or put the conditions in a list and reduce them via np.bitwise_and.reduce (a wrapper for &).
import numpy as np

conditions = [
    df['B'] > 50,
    df['C'] != 900,
    df['C'].pow(2) > 1000,
    df['B'].mul(4).between(50, 500)
]
# filter rows of A where all of the conditions are True
df.loc[np.bitwise_and.reduce(conditions), 'A']
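If the ufunc reduce feels opaque, an equivalent spelling (a sketch; for boolean masks it behaves the same as the reduce above) is to concatenate the masks column-wise and require all of them per row:

import pandas as pd
from random import randint

df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                   'B': [randint(1, 9) * 10 for x in range(10)],
                   'C': [randint(1, 9) * 100 for x in range(10)]})

conditions = [df['B'] > 50, df['C'] != 900]

# Stack the masks side by side and keep rows where every mask is True.
mask = pd.concat(conditions, axis=1).all(axis=1)
print(df.loc[mask, 'A'])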


Pandas get equivalent positive number

I have data similar to this:
import pandas as pd

data = {'A': [10, 20, 30, 10, -10],
        'B': [100, 200, 300, 100, -100],
        'C': [1000, 2000, 3000, 1000, -1000]}
df = pd.DataFrame(data)
df
     A    B     C
0   10  100  1000
1   20  200  2000
2   30  300  3000
3   10  100  1000
4  -10 -100 -1000
Here rows 0, 3 and 4 have exactly equal values, except that row 4 is negative. For such scenarios I want a fourth column D populated with the value 'Exact opposite' for indices 3 and 4 (any one matching pair).
Similar to:
     A    B     C               D
0   10  100  1000
1   20  200  2000
2   30  300  3000
3   10  100  1000  Exact opposite
4  -10 -100 -1000  Exact opposite
One approach I can think of is adding a column which sums the values of all the columns:
column_names = ['A', 'B', 'C']
df['Sum Val'] = df[column_names].sum(axis=1)
     A    B     C  Sum Val
0   10  100  1000     1110
1   20  200  2000     2220
2   30  300  3000     3330
3   10  100  1000     1110
4  -10 -100 -1000    -1110
and then check if there are any negative values and try to find the corresponding equal positive value, but I could not proceed from there.
Please pardon any mistakes.
Like this maybe (shown on the four-row variant of the data, where row 3 is the negative row):
In [67]: import numpy as np

# Mark rows that duplicate another row once signs are stripped with abs()
In [68]: df['D'] = np.where(df.abs().duplicated(keep=False), 'Duplicate', '')

# If the duplicated rows sum to 0, they are exact opposites
In [77]: ix = df.index[df.D.eq('Duplicate')]

In [78]: if df.loc[ix, ['A', 'B', 'C']].sum(1).sum() == 0:
    ...:     df.loc[ix, 'D'] = 'Exact Opposite'
    ...:

In [79]: df
Out[79]:
     A    B     C               D
0   10  100  1000  Exact Opposite
1   20  200  2000
2   30  300  3000
3  -10 -100 -1000  Exact Opposite
To follow your logic, let us just add abs with groupby, so the output returns the paired indices as lists:
df.reset_index().groupby(df['Sum Val'].abs())['index'].agg(list)
Out[367]:
Sum Val
1110    [0, 3]
2220    [1]
3330    [2]
Name: index, dtype: object
import pandas as pd

data = {'A': [10, 20, 30, -10], 'B': [100, 200, 300, -100], 'C': [1000, 2000, 3000, -1000]}
df = pd.DataFrame(data)
print(df)

df['total'] = df.sum(axis=1)
df['total'] = df['total'].apply(lambda x: "Exact opposite" if sum(df['total'] == -1 * x) else "")
print(df)

Pandas modify column values based on another DataFrame

I am trying to add values to a column based on a couple of conditions. Here is the code example:
import pandas as pd

df1 = pd.DataFrame({'Type': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
                    'Val': [20, -10, 20, -10, 30, -20, 40, -30]})
df2 = pd.DataFrame({'Type': ['A', 'A', 'B', 'B', 'C', 'C'],
                    'Cat': ['p', 'n', 'p', 'n', 'p', 'n'],
                    'Val': [30, -40, 20, -30, 10, -20]})

for index, _ in df1.iterrows():
    if df1.loc[index, 'Val'] >= 0:
        df1.loc[index, 'Val'] = df1.loc[index, 'Val'] + float(
            df2.loc[(df2['Type'] == df1.loc[index, 'Type']) & (df2['Cat'] == 'p'), 'Val'])
    else:
        df1.loc[index, 'Val'] = df1.loc[index, 'Val'] + float(
            df2.loc[(df2['Type'] == df1.loc[index, 'Type']) & (df2['Cat'] == 'n'), 'Val'])
For each value in the 'Val' column of df1, I want to add values from df2, based on the type and whether the original value was positive or negative.
The expected output for this example would be alternating 50 and -50 in df1. The above code does the job, but it is too slow to be usable for a large data set. Is there a better way to do this?
Try adding a Cat column to df1, merge, then sum the Val columns across axis 1, then drop the extra columns:
import numpy as np

df1['Cat'] = np.where(df1['Val'].lt(0), 'n', 'p')
df1 = df1.merge(df2, on=['Type', 'Cat'], how='left')
df1['Val'] = df1[['Val_x', 'Val_y']].sum(axis=1)
df1 = df1.drop(columns=['Cat', 'Val_x', 'Val_y'])
  Type  Val
0    A   50
1    A  -50
2    A   50
3    A  -50
4    B   50
5    B  -50
6    C   50
7    C  -50
Add a new column with np.where:
df1['Cat'] = np.where(df1['Val'].lt(0), 'n', 'p')
Type Val Cat
0 A 20 p
1 A -10 n
2 A 20 p
3 A -10 n
4 B 30 p
5 B -20 n
6 C 40 p
7 C -30 n
merge on Type and Cat:
df1 = df1.merge(df2, on=['Type', 'Cat'], how='left')
Type Val_x Cat Val_y
0 A 20 p 30
1 A -10 n -40
2 A 20 p 30
3 A -10 n -40
4 B 30 p 20
5 B -20 n -30
6 C 40 p 10
7 C -30 n -20
sum Val columns:
df1['Val'] = df1[['Val_x', 'Val_y']].sum(axis=1)
Type Val_x Cat Val_y Val
0 A 20 p 30 50
1 A -10 n -40 -50
2 A 20 p 30 50
3 A -10 n -40 -50
4 B 30 p 20 50
5 B -20 n -30 -50
6 C 40 p 10 50
7 C -30 n -20 -50
drop extra columns:
df1 = df1.drop(columns=['Cat', 'Val_x', 'Val_y'])
Type Val
0 A 50
1 A -50
2 A 50
3 A -50
4 B 50
5 B -50
6 C 50
7 C -50
import numpy as np

# Match rows on Type and the sign of Val, then add the paired values
df1['sign'] = np.sign(df1.Val)
df2['sign'] = np.sign(df2.Val)
df = pd.merge(df1, df2, on=['Type', 'sign'], suffixes=('_df1', '_df2'))
df['Val'] = df.Val_df1 + df.Val_df2
df = df.drop(columns=['Val_df1', 'sign', 'Val_df2'])
df
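One caveat worth hedging on this last version: pd.merge defaults to an inner join, so a Type/sign combination present in df1 but missing from df2 would silently drop rows. A left join keeps them (a sketch):

df = pd.merge(df1, df2, on=['Type', 'sign'], how='left', suffixes=('_df1', '_df2'))
# Missing partners become NaN; treat them as 0 so Val passes through unchanged.
df['Val'] = df.Val_df1 + df.Val_df2.fillna(0)
df = df.drop(columns=['Val_df1', 'sign', 'Val_df2'])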
