I have data similar to this
data = {'A': [10,20,30,10,-10], 'B': [100,200,300,100,-100], 'C':[1000,2000,3000,1000, -1000]}
df = pd.DataFrame(data)
df
    A    B     C
0  10  100  1000
1  20  200  2000
2  30  300  3000
3  10  100  1000
4 -10 -100 -1000
Here rows 0, 3 and 4 are exactly equal in absolute value, but one of them is negative. For such scenarios I want a fourth column D populated with the value 'Exact opposite' for index values 3 and 4 (any one of the equal positive rows is fine).
Similar to
    A    B     C               D
0  10  100  1000
1  20  200  2000
2  30  300  3000
3  10  100  1000  Exact opposite
4 -10 -100 -1000  Exact opposite
One approach I can think of is adding a column that sums the values of all the columns:
column_names=['A','B','C']
df['Sum Val'] = df[column_names].sum(axis=1)
    A    B     C  Sum Val
0  10  100  1000     1110
1  20  200  2000     2220
2  30  300  3000     3330
3  10  100  1000     1110
4 -10 -100 -1000    -1110
and then checking if there are any negative values and trying to find the corresponding equal positive value, but I could not proceed from there.
Please pardon any mistakes.
Like this maybe:
In [68]: import numpy as np

# Create column 'D' marking rows that duplicate another row by absolute value
In [69]: df['D'] = np.where(df.abs().duplicated(keep=False), 'Duplicate', '')

# If the duplicated rows sum to 0, they are 'exact opposite'
In [70]: ix = df.index[df.D.eq('Duplicate')]

In [71]: if df.loc[ix, ['A', 'B', 'C']].sum(1).sum() == 0:
    ...:     df.loc[ix, 'D'] = 'Exact Opposite'
    ...:

In [72]: df
Out[72]:
    A    B     C               D
0  10  100  1000  Exact Opposite
1  20  200  2000
2  30  300  3000
3 -10 -100 -1000  Exact Opposite
To follow your logic, let us just add abs with groupby; the output returns each pair of indices as a list:
df.reset_index().groupby(df['Sum Val'].abs())['index'].agg(list)
Out[367]:
Sum Val
1110    [0, 3]
2220       [1]
3330       [2]
Name: index, dtype: object
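To get from those index pairs back to the requested column D, here is a minimal sketch (assuming the Sum Val column from the question and numpy imported as np); note it flags every row whose total has an exact negative counterpart, so choosing only one of several equal positive rows would still need extra logic:

# Sketch only: flag rows whose absolute row total occurs with both signs
grp = df.groupby(df['Sum Val'].abs())['Sum Val']
has_pair = grp.transform(lambda s: (s > 0).any() and (s < 0).any())
df['D'] = np.where(has_pair, 'Exact opposite', '')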
import pandas as pd
data = {'A': [10, 20, 30, -10], 'B': [100, 200,300, -100], 'C': [1000, 2000, 3000,-1000]}
df = pd.DataFrame(data)
print(df)
# row-wise total of all columns
df['total'] = df.sum(axis=1)
# flag rows whose total has an exact negative counterpart elsewhere
df['total'] = df['total'].apply(lambda x: "Exact opposite" if sum(df['total'] == -1*x) else "")
print(df)
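Note that this overwrites the total column with the flag text. A small variation (just a sketch of the same idea) keeps the numeric sums and writes the flag into a separate column instead:

# Sketch: keep 'total' numeric and put the flag in a new column 'D'
df['D'] = df['total'].apply(lambda x: "Exact opposite" if (df['total'] == -x).any() else "")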
I have data similar to this
data = {'A': [10,20,30,10,-10, 20,-20, 10], 'B': [100,200,300,100,-100, 30,-30,100], 'C':[1000,2000,3000,1000, -1000, 40,-40, 1000]}
df = pd.DataFrame(data)
df
    A    B     C
0  10  100  1000
1  20  200  2000
2  30  300  3000
3  10  100  1000
4 -10 -100 -1000
5  20   30    40
6 -20  -30   -40
7  10  100  1000
Here the column sums for indices 0, 3 and 7 equal 1110 and index 4 equals -1110, while the sums for indices 5 and 6 equal 90 and -90; these are exact opposites. For such scenarios I want a fourth column D populated with the value 'Exact opposite' for index values 3, 4 and 5, 6 (pairing each negative row with the nearest index).
Similar to
    A    B     C               D
0  10  100  1000
1  20  200  2000
2  30  300  3000
3  10  100  1000  Exact opposite
4 -10 -100 -1000  Exact opposite
5  20   30    40  Exact opposite
6 -20  -30   -40  Exact opposite
7  10  100  1000
One approach I can think of is adding a column that sums the values of all the columns:
column_names=['A','B','C']
df['Sum Val'] = df[column_names].sum(axis=1)
    A    B     C  Sum Val
0  10  100  1000     1110
1  20  200  2000     2220
2  30  300  3000     3330
3  10  100  1000     1110
4 -10 -100 -1000    -1110
5  20   30    40       90
6 -20  -30   -40      -90
7  10  100  1000     1110
and then checking if there are any negative values and trying to find the corresponding equal positive value, but I could not proceed from there.
This should do the trick:
import pandas as pd

data = {'A': [10,20,30,10,-10, 30, 10], 'B': [100,200,300,100,-100, 300, 100], 'C':[1000,2000,3000,1000, -1000, 3000, 1000]}
df = pd.DataFrame(data)
print(df)

for i in df.index[:-1]:                      # compare each row with the next one
    for col in ['A', 'B', 'C']:              # check every value column
        if not df[col][i] == -df[col][i + 1]:
            break
    else:                                    # no break: every column is the exact negative
        df.at[i, 'D'] = 'Exact opposite'
        df.at[i + 1, 'D'] = 'Exact opposite'
        continue

print(df)
This solution only considers two adjacent rows.
The following code compares all rows, so it also detects rows 0 and 6:
data = {'A': [10,20,30,10,-10, 30, 10], 'B': [100,200,300,100,-100, 300, 100], 'C':[1000,2000,3000,1000, -1000, 3000, 1000]}
df = pd.DataFrame(data)
print(df)

for i in df.index:
    for j in df.index[i + 1:]:               # compare row i with every later row j
        for col in ['A', 'B', 'C']:
            if not df[col][i] == -df[col][j]:
                break
        else:                                # no break: row j is the exact negative of row i
            df.at[i, 'D'] = 'Exact opposite'
            df.at[j, 'D'] = 'Exact opposite'
            continue

print(df)
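As an aside, the row-sum idea from the question can flag candidate rows without the explicit double loop. The sketch below (assuming numpy is imported as np) marks every row whose column sum has an exact negative somewhere in the frame; it is not equivalent to the nearest-index pairing above, since two different rows could happen to sum to opposite totals without being element-wise opposites:

# Sketch: mark rows whose row total has an exact negative counterpart anywhere
totals = df[['A', 'B', 'C']].sum(axis=1)
df['D'] = np.where(totals.isin(-totals), 'Exact opposite', '')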
Say I have a DataFrame like the following:
df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'C'],
    'amount': [100, -100, 50, 30, -30, 40]
})
I would like to add another column to check whether the amount in each row can be paired (i.e. same amount, but one positive and one negative) within its group.
For example, in group A, 100 and -100 can be paired, so they will be True, while 50 cannot find a pair, so it is False (as in the following table).
group  amount   pair
A         100   True
A        -100   True
A          50  False
B          30   True
B         -30   True
C          40  False
What would be the most efficient way to do this?
We can take the abs of the amount column, then create the pair column based on DataFrame.duplicated:
df['pair'] = df.assign(amount=df['amount'].abs()).duplicated(keep=False)
keep=False means both rows of a duplicated pair get True. The right-hand side can also be subset to just these two columns if the DataFrame has more than them.
df:
group amount pair
0 A 100 True
1 A -100 True
2 A 50 False
3 B 30 True
4 B -30 True
5 C 40 False
Update to handle duplicate values, ensuring that only positive/negative pairs get matched, using pivot_table:
Updated DataFrame:
df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'amount': [100, -100, 50, 50, 30, -30, 40]
})
Pivot to wide form and check for pairs:
df['abs_amount'] = df['amount'].abs()

df = df.join(
    df.pivot_table(index=['group', 'abs_amount'],
                   columns=df['amount'].gt(0),
                   values='amount',
                   aggfunc='first')
      .notnull().all(axis=1)
      .rename('pair'),
    on=['group', 'abs_amount']
).drop('abs_amount', axis=1)
df:
group amount pair
0 A 100 True
1 A -100 True
2 A 50 False
3 A 50 False
4 B 30 True
5 B -30 True
6 C 40 False
The pivot_table:
df['abs_amount'] = df['amount'].abs()

df.pivot_table(index=['group', 'abs_amount'],
               columns=df['amount'].gt(0),
               values='amount',
               aggfunc='first')

amount            False   True
group abs_amount
A     50            NaN   50.0   # Multiple 50s but no -50
      100        -100.0  100.0
B     30          -30.0   30.0
C     40            NaN   40.0
Ensure all values in the row are present:

df.pivot_table(index=['group', 'abs_amount'],
               columns=df['amount'].gt(0),
               values='amount',
               aggfunc='first').notnull().all(axis=1)

group  abs_amount
A      50            False
       100            True
B      30             True
C      40            False
dtype: bool
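The same both-signs-present check can also be sketched without the pivot, by grouping on group together with the absolute amount and testing each bucket for a positive and a negative value (a rough equivalent, not a drop-in from the answer above):

# Sketch: True where the (group, |amount|) bucket contains both a positive and a negative value
df['pair'] = (df.groupby(['group', df['amount'].abs()])['amount']
                .transform(lambda s: s.gt(0).any() and s.lt(0).any()))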
I'm working with a Pandas DataFrame having the following structure:
import pandas as pd

df = pd.DataFrame({'brand' : ['A', 'A', 'B', 'B', 'C', 'C'],
                   'target' : [0, 1, 0, 1, 0, 1],
                   'freq' : [5600, 220, 5700, 90, 5000, 100]})
print(df)
brand target freq
0 A 0 5600
1 A 1 220
2 B 0 5700
3 B 1 90
4 C 0 5000
5 C 1 100
For each brand, I would like to calculate the ratio of positive targets, e.g. for brand A, the percentage of positive target is 220/(220+5600) = 0.0378.
My resulting DataFrame should look like the following:
brand target freq ratio
0 A 0 5600 0.0378
1 A 1 220 0.0378
2 B 0 5700 0.0156
3 B 1 90 0.0156
4 C 0 5000 0.0196
5 C 1 100 0.0196
I know that I should group my DataFrame by brand and then apply some function to each group (since I want to keep all rows in my final result I think I should use transform here). I tested a couple of things but without any success. Any help is appreciated.
First sort the rows by brand and target so the positive-target row is last in each group, then divide inside GroupBy.transform with a lambda function:
df = df.sort_values(['brand','target'])
df['ratio'] = df.groupby('brand')['freq'].transform(lambda x: x.iat[-1] / x.sum())
print (df)
brand target freq ratio
0 A 0 5600 0.037801
1 A 1 220 0.037801
2 B 0 5700 0.015544
3 B 1 90 0.015544
4 C 0 5000 0.019608
5 C 1 100 0.019608
Or divide the Series created by transform('last') and transform('sum'):
df = df.sort_values(['brand','target'])
g = df.groupby('brand')['freq']
df['ratio'] = g.transform('last').div(g.transform('sum'))
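If you prefer to avoid the sort, a sketch of the same ratio (assuming target is strictly 0/1) sums only the target == 1 frequencies per brand:

# Sketch: per-brand positive-target freq divided by per-brand total freq
pos = df['freq'].where(df['target'].eq(1), 0)
df['ratio'] = pos.groupby(df['brand']).transform('sum') / df.groupby('brand')['freq'].transform('sum')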
I have a dataframe like this
>>> df = pd.DataFrame([100, 150, 150, 103])
>>> df
     0
0  100
1  150
2  150
3  103
I want to check whether the next value is within +10% / -10% of the previous value; if not, replace the next value with the previous value.
This is the desired result
0
0 100
1 100
2 100
3 103
I tried using "where" but it doesn't work properly
>>> df.where(abs(df / df.shift()-1) < 0.1, df.shift().fillna(method='bfill'), inplace=True)
>>> df
0
0 100
1 100
2 150
3 150
How can I solve this?
This is the manual loop method using pd.Series.iteritems.
import numpy as np
import pandas as pd

df = pd.DataFrame([100, 150, 150, 103])

res = np.zeros(len(df[0]))
res[0] = df[0].iloc[0]

for idx, val in df[0].iloc[1:].iteritems():
    if abs(val / res[idx - 1] - 1) < 0.1:    # within ±10% of the previous accepted value
        res[idx] = val
    else:                                    # otherwise carry the previous accepted value forward
        res[idx] = res[idx - 1]

df[1] = res.astype(int)
print(df)
     0    1
0  100  100
1  150  100
2  150  100
3  103  103
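Note that Series.iteritems was deprecated and then removed in pandas 2.0; on recent versions, replacing it with df[0].iloc[1:].items() keeps the rest of the loop unchanged.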
In a DataFrame I would like to compare the elements of a column with a value and put the elements that pass the comparison into a new column.
df = pandas.DataFrame([{'A':3,'B':10},
                       {'A':2, 'B':30},
                       {'A':1,'B':20},
                       {'A':2,'B':15},
                       {'A':2,'B':100}])

df['C'] = [x for x in df['B'] if x > 18]
I can't find out what's wrong and why I get:
ValueError: Length of values does not match length of index
I think you can use loc with boolean indexing:
print (df)
A B
0 3 10
1 2 30
2 1 20
3 2 15
4 2 100
print (df['B'] > 18)
0 False
1 True
2 True
3 False
4 True
Name: B, dtype: bool
df.loc[df['B'] > 18, 'C'] = df['B']
print (df)
A B C
0 3 10 NaN
1 2 30 30.0
2 1 20 20.0
3 2 15 NaN
4 2 100 100.0
If you need to select rows by condition, use boolean indexing:
print (df[df['B'] > 18])
A B
1 2 30
2 1 20
4 2 100
If you need something faster, use where:
df['C'] = df.B.where(df['B'] > 18)
Timings (len(df)=50k):
In [1367]: %timeit (a(df))
The slowest run took 8.34 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.14 ms per loop
In [1368]: %timeit (b(df1))
100 loops, best of 3: 15.5 ms per loop
In [1369]: %timeit (c(df2))
100 loops, best of 3: 2.93 ms per loop
Code for timings:
import pandas as pd

df = pd.DataFrame([{'A':3,'B':10},
                   {'A':2, 'B':30},
                   {'A':1,'B':20},
                   {'A':2,'B':15},
                   {'A':2,'B':100}])
print (df)

df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()

def a(df):
    df['C'] = df.B.where(df['B'] > 18)
    return df

def b(df1):
    df1['C'] = [x if x > 18 else None for x in df1['B']]
    return df1

def c(df2):
    df2.loc[df2['B'] > 18, 'C'] = df2['B']
    return df2

print (a(df))
print (b(df1))
print (c(df2))
As Darren mentioned, all columns in a DataFrame should have the same length.
When you print [x for x in df['B'] if x > 18], you get only three values: [30, 20, 100]. But the DataFrame has five rows. That's why you get the Length of values does not match length of index error.
You can change your code as follows:
df['C'] = [x if x > 18 else None for x in df['B']]
print(df)
You will get:
A B C
0 3 10 NaN
1 2 30 30.0
2 1 20 20.0
3 2 15 NaN
4 2 100 100.0
All columns in a DataFrame have to be the same length. Because you are filtering away some values, you are trying to insert fewer values into column C than are in columns A and B.
So, your two options are to start a new DataFrame for C:
dfC = pandas.DataFrame([x for x in df['B'] if x > 18])
or put some dummy value in the column when x is not greater than 18. E.g.:
df['C'] = np.where(df['B'] > 18, True, False)
Or even:
df['C'] = np.where(df['B'] > 18, 'Yay', 'Nay')
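If you want column C to keep the actual B values where the condition holds (rather than a flag), a variant of the same np.where pattern, assuming numpy is imported as np, would be:

# Sketch: keep B where it exceeds 18, otherwise NaN
df['C'] = np.where(df['B'] > 18, df['B'], np.nan)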
P.S. Also take a look at: Pandas conditional creation of a series/dataframe column for other ways to do this.