Python dataframe check increment of consecutive values

I have a dataframe like this
>>> df = pd.DataFrame([100, 150, 150, 103])
>>> df
     0
0  100
1  150
2  150
3  103
>>>
I want to check whether the next value is within +10% / -10% of the previous value; if not, replace the next value with the previous value.
This is the desired result
     0
0  100
1  100
2  100
3  103
I tried using "where" but it doesn't work properly
>>> df.where(abs(df / df.shift()-1) < 0.1, df.shift().fillna(method='bfill'), inplace=True)
>>> df
     0
0  100
1  100
2  150
3  150
How can I solve this?

This is the manual loop method using pd.Series.iteritems.
import numpy as np
import pandas as pd

df = pd.DataFrame([100, 150, 150, 103])

res = np.zeros(len(df[0]))
res[0] = df[0].iloc[0]
# compare each value against the already-corrected previous value
for idx, val in df[0].iloc[1:].iteritems():
    if abs(val / res[idx-1] - 1) < 0.1:
        res[idx] = val
    else:
        res[idx] = res[idx-1]
df[1] = res.astype(int)
print(df)
     0    1
0  100  100
1  150  100
2  150  100
3  103  103
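Note that Series.iteritems was deprecated and later removed in recent pandas releases; the same loop still works with Series.items(). A sketch wrapping it in a small helper (the function name clip_to_prev and the tol parameter are invented here for illustration; the code assumes a default RangeIndex so that positions match labels):
import numpy as np
import pandas as pd

def clip_to_prev(s, tol=0.1):
    # replace each value that deviates by more than tol from the previous
    # *corrected* value with that corrected value
    res = np.zeros(len(s))
    res[0] = s.iloc[0]
    for idx, val in s.iloc[1:].items():
        if abs(val / res[idx - 1] - 1) < tol:
            res[idx] = val
        else:
            res[idx] = res[idx - 1]
    return pd.Series(res.astype(int), index=s.index)

df = pd.DataFrame([100, 150, 150, 103])
df[1] = clip_to_prev(df[0])
print(df)
A fully vectorized version is awkward here because each corrected value depends on the previous corrected value, so a sequential loop is a reasonable choice.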

Related

Create new dataframe that contains the average value from some of the columns in the old dataframe

I have a dataframe extracted from a csv file. I want to iterate over the data so that some columns hold the mean of every n rows, while the remaining columns keep the first row of each block.
For example, the data extracted from the csv consisted of 100 rows and 6 columns.
I have a variable n_AVE = 6, which tells the code to average the data per 6 rows.
rawDf = pd.read_csv(outputFilePath / 'Raw_data.csv', encoding='CP932')
OUT:
TIME A B C D E
0 2021/3/4 148 0 142 0 1 [0]
1 2021/3/5 148 0 142 0 1
2 2021/3/6 150 0 148 0 1
3 2021/3/7 150 0 148 0 1
4 2021/3/8 151 0 148 0 1
5 2021/3/9 151 0 148 0 1
....
91 2021/4/30 195 5 180 0 1 [5]
92 2021/5/1 195 5 180 0 1
93 2021/5/2 195 5 180 0 1
94 2021/5/3 200 5 180 0 1
95 2021/5/4 200 0 200 0 1
96 2021/5/5 200 5 200 0 1 [6]
97 2021/5/6 200 5 200 1 1
98 2021/5/7 200 5 200 1 1
99 2021/5/8 205 5 210 1 1
100 2021/5/9 205 5 210 1 1
Take only the first row of [TIME, D, E] columns
Average the data per n_AVE (6) from [A, B, C] columns.
I want to create a new dataframe which looks like this
OUT:
TIME A B C D E
0 2021/3/4 149.66 0 146 0 1
....
5 2021/4/30 197.5 4.166 186.66 0 1
6 2021/5/5 168.33 5 170 0 1
The code is like this:
for x in range(0, len(rawDf.index), n_AVE):
    df = pd.DataFrame([rawDf.iloc[[x], 0], rawDf.iloc[x:(x + n_AVE), 1:3].mean(), rawDf.iloc[x, 4:5]])
But the code is not working, because when I use pandas.mean() the dataframe's format changes into this:
df2 = rawDf.iloc[0:6,1:3].mean()
print(df2)
OUT:
index 0
0 A 149.66
1 B 0.0
2 C 146.0
[3 rows x 2 columns]
How to use pandas.mean() without losing the old format?
Or should I not use pandas.mean() and just create my own averaging code?
You can group the dataframe with the grouper np.arange(len(df)) // 6, which groups the dataframe every six rows, then aggregate the columns with the desired aggregation functions to get the result. Optionally reindex along axis=1 to restore the original column order.
d = {
    'A': 'mean', 'B': 'mean', 'C': 'mean',
    'TIME': 'first', 'D': 'first', 'E': 'first'
}

df.groupby(np.arange(len(df)) // 6).agg(d).reindex(df.columns, axis=1)
Alternatively, define the aggregation functions using the column index:
d = {
    **dict.fromkeys(df.columns[[0, 4, 5]], 'first'),
    **dict.fromkeys(df.columns[[1, 2, 3]], 'mean'),
}

df.groupby(np.arange(len(df)) // 6).agg(d).reindex(df.columns, axis=1)
Result
        TIME           A         B           C  D  E
0   2021/3/4  149.666667  0.000000  146.000000  0  1
1  2021/4/30  197.500000  4.166667  186.666667  0  1
2   2021/5/6  202.500000  5.000000  205.000000  1  1
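For reference, a self-contained sketch of the grouper approach on a small made-up frame (six rows averaged in blocks of three so the output stays short; the column names follow the question, the values are invented):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2021/3/4', '2021/3/5', '2021/3/6', '2021/3/7', '2021/3/8', '2021/3/9'],
    'A': [148, 148, 150, 150, 151, 151],
    'B': [0, 0, 0, 0, 0, 0],
    'C': [142, 142, 148, 148, 148, 148],
    'D': [0, 0, 0, 0, 0, 0],
    'E': [1, 1, 1, 1, 1, 1],
})

n_AVE = 3  # block size for this small example
d = {
    **dict.fromkeys(['A', 'B', 'C'], 'mean'),
    **dict.fromkeys(['TIME', 'D', 'E'], 'first'),
}
# np.arange(len(df)) // n_AVE labels the rows 0,0,0,1,1,1 so each block of
# n_AVE consecutive rows becomes one group
out = df.groupby(np.arange(len(df)) // n_AVE).agg(d).reindex(df.columns, axis=1)
print(out)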

Pandas get equivalent positive number

I have data similar to this
data = {'A': [10,20,30,10,-10], 'B': [100,200,300,100,-100], 'C':[1000,2000,3000,1000, -1000]}
df = pd.DataFrame(data)
df
Index    A     B      C
0       10   100   1000
1       20   200   2000
2       30   300   3000
3       10   100   1000
4      -10  -100  -1000
Here the rows at index 0, 3 and 4 are exactly equal except that one is negative, so for such scenarios I want a fourth column D populated with the value 'Exact opposite' for index values 3 and 4 (any one such pair).
Similar to
Index    A     B      C   D
0       10   100   1000
1       20   200   2000
2       30   300   3000
3       10   100   1000   Exact opposite
4      -10  -100  -1000   Exact opposite
One approach I can think of is adding a column which sums the values of all the columns
column_names=['A','B','C']
df['Sum Val'] = df[column_names].sum(axis=1)
Index    A     B      C   Sum Val
0       10   100   1000      1110
1       20   200   2000      2220
2       30   300   3000      3330
3       10   100   1000      1110
4      -10  -100  -1000     -1110
and then check whether there are any negative values and try to find the corresponding equal positive value, but I could not proceed from there.
Please pardon any mistakes.
Like this maybe:
In [69]: import numpy as np

# Create column 'D' marking rows that are exact duplicates once 'abs' removes the sign
In [68]: df['D'] = np.where(df.abs().duplicated(keep=False), 'Duplicate', '')

In [70]: ix = df.index[df['D'].eq('Duplicate')]

# If the duplicated rows sum to 0, they are exact opposites of each other
In [78]: if df.loc[ix, ['A', 'B', 'C']].sum(1).sum() == 0:
    ...:     df.loc[ix, 'D'] = 'Exact Opposite'
    ...:

In [79]: df
Out[79]:
    A    B     C               D
0  10  100  1000  Exact Opposite
1  20  200  2000
2  30  300  3000
3 -10 -100 -1000  Exact Opposite
To follow your logic, let us just add abs with groupby, so the output returns the paired indices as a list:
df.reset_index().groupby(df['Sum Val'].abs())['index'].agg(list)
Out[367]:
Sum Val
1110    [0, 3, 4]
2220          [1]
3330          [2]
Name: index, dtype: object
import pandas as pd

data = {'A': [10, 20, 30, -10], 'B': [100, 200, 300, -100], 'C': [1000, 2000, 3000, -1000]}
df = pd.DataFrame(data)
print(df)

# row-wise sum, then mark rows whose sum has an exact negative counterpart
df['total'] = df.sum(axis=1)
df['total'] = df['total'].apply(lambda x: "Exact opposite" if sum(df['total'] == -1 * x) else "")
print(df)
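A compact variant of the same idea, sketched on the five-row sample from the question: group on the absolute row sum and label every row whose group contains both a positive and a negative sum. Note that this marks all rows of such a group (0, 3 and 4 here), which is slightly broader than the expected output, and that it relies on the row sums alone, as the approaches above do.
import numpy as np
import pandas as pd

data = {'A': [10, 20, 30, 10, -10],
        'B': [100, 200, 300, 100, -100],
        'C': [1000, 2000, 3000, 1000, -1000]}
df = pd.DataFrame(data)

s = df[['A', 'B', 'C']].sum(axis=1)
# a group of rows sharing the same |sum| contains an "exact opposite" pair
# when both signs occur inside the group
both_signs = s.groupby(s.abs()).transform(lambda g: (g > 0).any() and (g < 0).any())
df['D'] = np.where(both_signs, 'Exact opposite', '')
print(df)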

Pandas get nearest index with equivalent positive number

I have data similar to this
data = {'A': [10,20,30,10,-10, 20,-20, 10], 'B': [100,200,300,100,-100, 30,-30,100], 'C':[1000,2000,3000,1000, -1000, 40,-40, 1000]}
df = pd.DataFrame(data)
df
Index    A     B      C
0       10   100   1000
1       20   200   2000
2       30   300   3000
3       10   100   1000
4      -10  -100  -1000
5       20    30     40
6      -20   -30    -40
7       10   100   1000
Here the sum of all the columns equals 1110 for indices 0, 3 and 7, -1110 for index 4, and 90 and -90 for indices 5 and 6, so those sums are exact opposites. For such scenarios I want a fourth column D populated with the value 'Exact opposite' for indices 3, 4 and 5, 6 (the nearest index).
Similar to
Index    A     B      C   D
0       10   100   1000
1       20   200   2000
2       30   300   3000
3       10   100   1000   Exact opposite
4      -10  -100  -1000   Exact opposite
5       20    30     40   Exact opposite
6      -20   -30    -40   Exact opposite
7       10   100   1000
One approach I can think of is adding a column which sums the values of all the columns
column_names=['A','B','C']
df['Sum Val'] = df[column_names].sum(axis=1)
Index    A     B      C   Sum Val
0       10   100   1000      1110
1       20   200   2000      2220
2       30   300   3000      3330
3       10   100   1000      1110
4      -10  -100  -1000     -1110
5       20    30     40        90
6      -20   -30    -40       -90
7       10   100   1000      1110
and then check whether there are any negative values and try to find the corresponding equal positive value, but I could not proceed from there.
This should do the trick:
import pandas as pd

data = {'A': [10, 20, 30, 10, -10, 30, 10],
        'B': [100, 200, 300, 100, -100, 300, 100],
        'C': [1000, 2000, 3000, 1000, -1000, 3000, 1000]}
df = pd.DataFrame(data)
print(df)

for i in df.index[:-1]:
    for col in df.columns[:3]:          # compare the three numeric columns
        if not df[col][i] == -df[col][i + 1]:
            break
    else:                               # every column is the exact negation
        df.at[i, 'D'] = 'Exact opposite'
        df.at[i + 1, 'D'] = 'Exact opposite'
print(df)
This solution only considers two adjacent lines.
The following code compares all pairs of lines, so it also detects lines 0 and 6:
import pandas as pd

data = {'A': [10, 20, 30, 10, -10, 30, 10],
        'B': [100, 200, 300, 100, -100, 300, 100],
        'C': [1000, 2000, 3000, 1000, -1000, 3000, 1000]}
df = pd.DataFrame(data)
print(df)

for i in df.index:
    for j in df.index[i:]:
        for col in df.columns[:3]:      # compare the three numeric columns
            if not df[col][i] == -df[col][j]:
                break
        else:                           # every column is the exact negation
            df.at[i, 'D'] = 'Exact opposite'
            df.at[j, 'D'] = 'Exact opposite'
print(df)
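If only adjacent rows need to be compared, which matches the "nearest index" pairs in the sample, a vectorized sketch of the same element-wise check (assuming the three numeric columns A, B, C and the data from the question):
import numpy as np
import pandas as pd

data = {'A': [10, 20, 30, 10, -10, 20, -20, 10],
        'B': [100, 200, 300, 100, -100, 30, -30, 100],
        'C': [1000, 2000, 3000, 1000, -1000, 40, -40, 1000]}
df = pd.DataFrame(data)

cols = ['A', 'B', 'C']
# True where a row is the exact element-wise negation of the row below it
opp_next = (df[cols].values[:-1] == -df[cols].values[1:]).all(axis=1)

mask = np.zeros(len(df), dtype=bool)
mask[:-1] |= opp_next   # upper row of each opposite pair
mask[1:] |= opp_next    # lower row of each opposite pair
df['D'] = np.where(mask, 'Exact opposite', '')
print(df)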

Calculate rate of positive values by group

I'm working with a Pandas DataFrame having the following structure:
import pandas as pd

df = pd.DataFrame({'brand': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'target': [0, 1, 0, 1, 0, 1],
                   'freq': [5600, 220, 5700, 90, 5000, 100]})

print(df)
brand target freq
0 A 0 5600
1 A 1 220
2 B 0 5700
3 B 1 90
4 C 0 5000
5 C 1 100
For each brand, I would like to calculate the ratio of positive targets, e.g. for brand A, the percentage of positive target is 220/(220+5600) = 0.0378.
My resulting DataFrame should look like the following:
brand target freq ratio
0 A 0 5600 0.0378
1 A 1 220 0.0378
2 B 0 5700 0.0156
3 B 1 90 0.0156
4 C 0 5000 0.0196
5 C 1 100 0.0196
I know that I should group my DataFrame by brand and then apply some function to each group (since I want to keep all rows in my final result, I think I should use transform here). I tested a couple of things, but without any success. Any help is appreciated.
First sort by brand and target so that the positive-target row is the last row of each group, then divide inside GroupBy.transform with a lambda function:
df = df.sort_values(['brand','target'])
df['ratio'] = df.groupby('brand')['freq'].transform(lambda x: x.iat[-1] / x.sum())
print (df)
brand target freq ratio
0 A 0 5600 0.037801
1 A 1 220 0.037801
2 B 0 5700 0.015544
3 B 1 90 0.015544
4 C 0 5000 0.019608
5 C 1 100 0.019608
Or divide the Series created by GroupBy.last and GroupBy.sum transforms:
df = df.sort_values(['brand','target'])
g = df.groupby('brand')['freq']
df['ratio'] = g.transform('last').div(g.transform('sum'))
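An alternative sketch that skips the sorting step, assuming target only takes the values 0 and 1 as in the sample: sum the frequencies of the positive-target rows per brand and divide by the brand totals.
import pandas as pd

df = pd.DataFrame({'brand': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'target': [0, 1, 0, 1, 0, 1],
                   'freq': [5600, 220, 5700, 90, 5000, 100]})

# per-brand sum of positive-target frequencies, broadcast back to every row
pos = (df['freq'] * df['target']).groupby(df['brand']).transform('sum')
total = df.groupby('brand')['freq'].transform('sum')
df['ratio'] = pos / total
print(df)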

adding value to prior row in dataframe within a function

I am looking for a solution that produces column C (first row = 1000) within a function framework, i.e. without iterations but with vectorization:
index    A    B     C
1        0    0  1000
2      100    0   900
3        0    0   900
4        0  200  1100
5        0    0  1100
The function should look similar to this:
def calculate(self):
    df = pd.DataFrame()
    df['A'] = self.some_value
    df['B'] = self.some_other_value
    df['C'] = df['C'].shift(1) - df['A'] + df['B']........
but the reference to the prior row does not work. What can be done to accomplish the task outlined above?
This should work:
df['C'] = 1000 + (-df['A'] + df['B']).cumsum()
df
Out[80]:
A B C
0 0 0 1000
1 100 0 900
2 0 0 900
3 0 200 1100
4 0 0 1100
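Adapted to the function framework from the question, a minimal sketch (self.some_value, self.some_other_value and the starting balance of 1000 are placeholders taken from the question):
import pandas as pd

def calculate(self):
    df = pd.DataFrame()
    df['A'] = self.some_value
    df['B'] = self.some_other_value
    # running balance: start at 1000 and accumulate B - A, which is
    # equivalent to C = C.shift(1) - A + B applied row by row
    df['C'] = 1000 + (df['B'] - df['A']).cumsum()
    return df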
