I have been working on this code forever. I would like to accomplish the following: if Pout > 3, then drop/delete the next 3 rows.
df = pd.read_csv(file, sep=',', usecols=['Iin', 'Iout', 'Pout'])
print(df['Pout'])
for i in df['Pout']:
    if i > 3:
        df.drop(df[3:])  # drop/delete the next 3 rows regardless of the value
print(df)
Any help will be greatly appreciated.
Thanks
I came up with this code based on your first code, but the updated version that you have just posted is more efficient. I am now dropping the next five rows after the condition has been met.
import pandas as pd

df = pd.DataFrame({'a': [1, 5.0, 1, 2.3, 2.1, 2, 1, 3, 4, 7],
                   'b': [1, 4, 0.2, 4.5, 8.2, 1, 2, 3, 4, 7],
                   'c': [1, 4.5, 5.4, 6, 2, 4, 2, 3, 4, 7]})

for index in range(len(df['c'])):
    if df['c'][index] > 3:
        # null out 'c' in the next five rows after the first match, then stop
        df.at[index + 1, 'c'] = None
        df.at[index + 2, 'c'] = None
        df.at[index + 3, 'c'] = None
        df.at[index + 4, 'c'] = None
        df.at[index + 5, 'c'] = None
        print(df['c'])
        break
Try this:
import pandas as pd

df = pd.DataFrame({'a': [1, 5, 1, 2, 2, 2, 1], 'b': [1, 4, 2, 4, 8, 1, 2], 'c': [1, 2, 6, 6, 2, 1, 2]})

for i in df['c']:
    if i > 3:
        try:
            idx = df['c'].tolist().index(i)  # index of the first occurrence of this value
            print(idx)
        except ValueError:
            pass
        # null out 'c' in the next 3 rows regardless of the value
        for j in range(idx, idx + 3):
            df.at[j, 'c'] = None
print(df)
Output:
a b c
0 1 1 1.0
1 5 4 2.0
2 1 2 NaN
3 2 4 NaN
4 2 8 NaN
5 2 1 1.0
6 1 2 2.0
My solution uses a dummy data frame. I get the index of the item; if the item is bigger than 3, I iterate through the range from that item's index to its index plus 3 and use the at function to set each value to NaN. In my edit I just added try/except, and now it works.
For 5 rows, I think this code is what you want; I also think it is more efficient:
import pandas as pd

df = pd.DataFrame({'a': [1, 5.0, 1, 2.3, 2.1, 2, 1, 3, 4, 7],
                   'b': [1, 4, 0.2, 4.5, 8.2, 1, 2, 3, 4, 7],
                   'c': [1, 4.5, 5.4, 6, 2, 4, 2, 3, 4, 7]})

for index in range(len(df['c'])):
    if df['c'][index] > 3:
        # null out 'c' in the five rows after the first match, then stop
        for i in range(index + 1, index + 6):
            df.at[i, 'c'] = None
        print(df['c'])
        break
Output:
0 1.0
1 4.5
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 3.0
8 4.0
9 7.0
Name: c, dtype: float64
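If you actually want to drop the rows rather than set them to NaN, here is a minimal vectorized sketch (my assumptions: a default RangeIndex, and only the first match matters, mirroring the break above):

import pandas as pd

df = pd.DataFrame({'a': [1, 5.0, 1, 2.3, 2.1, 2, 1, 3, 4, 7],
                   'b': [1, 4, 0.2, 4.5, 8.2, 1, 2, 3, 4, 7],
                   'c': [1, 4.5, 5.4, 6, 2, 4, 2, 3, 4, 7]})

n = 5                            # how many rows to drop after the match
hits = df[df['c'] > 3].index     # rows where the condition holds
if len(hits) > 0:
    first = hits[0]              # only the first match, as with the break above
    df = df.drop(df.index[first + 1:first + 1 + n])
print(df)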
I have a data frame like this:
id  value
 a      2
 a      4
 a      3
 a      5
 b      1
 b      4
 b      3
 c      1
 c    nan
 c      5
The resulting data frame contains a new column 'average', whose values are computed as follows:
- group by id
- the first row of 'average' in each group equals the corresponding 'value'
- every other row of 'average' in a group equals the mean of all previous rows of 'value' (excluding the current value)
The resulting data frame must be:
id  value  average
 a      2        2
 a      4        2
 a      3        3
 a      5        3
 b      1        1
 b      4        1
 b      3      2.5
 c      1        1
 c    nan        1
 c      5        1
You can group the dataframe by id, calculate the expanding mean of the value column for each group, shift it, and assign it back to the original dataframe. Once you have that, you just need to ffill on axis=1 across the value and average columns so that the first row of each group picks up its value from value:
out = (df
       .assign(average=df
               .groupby(['id'])['value']
               .transform(lambda x: x.expanding().mean().shift(1))
               )
       )
out[['value', 'average']] = out[['value', 'average']].ffill(axis=1)
Output:
id value average
0 a 2.0 2.0
1 a 4.0 2.0
2 a 3.0 3.0
3 a 5.0 3.0
4 b 1.0 1.0
5 b 4.0 1.0
6 b 3.0 2.5
7 c 1.0 1.0
8 c NaN 1.0
9 c 5.0 1.0
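To see why the shift(1) is needed, here is a minimal sketch on the 'a' group alone:

import pandas as pd

g = pd.Series([2, 4, 3, 5])           # 'value' for id 'a'
print(g.expanding().mean())           # 2.0, 3.0, 3.0, 3.5 -- mean including the current row
print(g.expanding().mean().shift(1))  # NaN, 2.0, 3.0, 3.0 -- mean of previous rows only
# The leading NaN is then filled from 'value' by the ffill(axis=1) step, giving 2, 2, 3, 3.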
Here is a solution which, I think, satisfies the requirements. The first row in a group of ids simply passes its value to the average column; for every other row, we take the average of the rows with a smaller index.
You may want to specify how you want to handle the NaN values. Below, I drop them from the average computation so that they are ignored.
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['a', 2],
    ['a', 4],
    ['a', 3],
    ['a', 5],
    ['b', 1],
    ['b', 4],
    ['b', 3],
    ['c', 1],
    ['c', np.nan],
    ['c', 5]
], columns=['id', 'value'])

id_groups = df.groupby(['id'])
id_level_frames = []
for group, frame in id_groups:
    # Reset the index for each id-level frame
    frame = frame.reset_index()
    for index, row in frame.iterrows():
        if index == 0:
            # The first row passes its value straight through
            frame.at[index, 'average'] = row['value']
        else:
            # Average over all earlier rows, ignoring NaN values
            earlier_rows = frame[frame.index < index]
            frame.at[index, 'average'] = np.average(earlier_rows['value'].dropna())
    id_level_frames.append(frame)

final_df = pd.concat(id_level_frames)
print(final_df)
I hope you're doing well.
So I want to drop a specific number of duplicate rows. Let me explain with an example:
A B C
0 foo 2 3
1 foo nan 9
2 foo 1 4
3 bar 8 nan
4 xxx 9 10
5 xxx 4 4
6 xxx 9 6
So we have duplicated rows based on column A: for 'foo' I want to drop 2 duplicate rows, for example, and for 'xxx' I want to drop just one row.
The drop_duplicates method can only keep one occurrence per key (or none at all), so it didn't help me.
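For illustration, here is a quick sketch of what drop_duplicates offers and why it doesn't fit:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'xxx', 'xxx', 'xxx'],
    'B': [2, np.nan, 1, 8, 9, 4, 9],
    'C': [3, 9, 4, np.nan, 10, 4, 6]
})

print(df.drop_duplicates(subset=['A'], keep='first'))  # exactly one row per key
print(df.drop_duplicates(subset=['A'], keep=False))    # only never-duplicated keys, here just 'bar'
# Neither option lets me say "drop exactly 2 of the 'foo' rows".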
Thanks in advance.
Probably not the optimal solution, but this one works:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'xxx', 'xxx', 'xxx'],
    'B': [2, np.nan, 1, 8, 9, 4, 9],
    'C': [3, 9, 4, np.nan, 10, 4, 6]
})

nb_drops = {'foo': 2, 'xxx': 1}

# collect the rows to keep for each duplicated key
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df2 = pd.DataFrame()
for k, v in nb_drops.items():
    df2 = pd.concat([df2, df[df['A'] == k].head(v)])

df = df.drop_duplicates(subset=['A'])
df = df.merge(df2, how='outer')
df
Gives
A B C
0 foo 2.0 3.0
1 bar 8.0 NaN
2 xxx 9.0 10.0
3 foo NaN 9.0
I made this code and it works...
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'xxx', 'xxx', 'xxx'],
    'B': [2, np.nan, 1, 8, 9, 4, 9],
    'C': [3, 9, 4, np.nan, 10, 4, 6]
})

nb_drops = {'foo': 2, 'xxx': 1}

rows_to_delete = []
for item in nb_drops:
    # drop the last nb_drops[item] occurrences of each key
    indices_item = list(df[df['A'] == item].index)
    rows_to_delete += range(indices_item[-1] - nb_drops[item] + 1, indices_item[-1] + 1)

df.drop(rows_to_delete, inplace=True)
print(df)
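A more vectorized sketch of the same idea (my assumption: drop the last n occurrences of each key, whether or not they are contiguous) can use groupby().cumcount():

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'xxx', 'xxx', 'xxx'],
    'B': [2, np.nan, 1, 8, 9, 4, 9],
    'C': [3, 9, 4, np.nan, 10, 4, 6]
})
nb_drops = {'foo': 2, 'xxx': 1}

# number each key's occurrences from the end: the last occurrence gets 0
occ_from_end = df.groupby('A').cumcount(ascending=False)
# a row is dropped when its key's drop count exceeds its position from the end
drop_mask = df['A'].map(nb_drops).fillna(0) > occ_from_end
print(df[~drop_mask])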
I want to create a pandas data frame from multiple lists with different lengths. Below is my Python code.
import random

import pandas as pd

A = [1, 2]
B = [1, 2, 3]
C = [1, 2, 3, 4, 5, 6]

lenA = len(A)
lenB = len(B)
lenC = len(C)

df = pd.DataFrame(columns=['A', 'B', 'C'])
for i, v1 in enumerate(A):
    for j, v2 in enumerate(B):
        for k, v3 in enumerate(C):
            if i < random.randint(0, lenA):
                if j < random.randint(0, lenB):
                    if k < random.randint(0, lenC):
                        # note: DataFrame.append was removed in pandas 2.0
                        df = df.append({'A': v1, 'B': v2, 'C': v3}, ignore_index=True)
print(df)
My lists are as below:
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6,7]
Each run produces a different output, which is correct, but it does not cover all list items in each run. In one run I got the output below:
A B C
0 1 1 3
1 1 2 1
2 1 2 2
3 2 2 5
In the above output, all of list A's items (1, 2) are there, but list B has only items (1, 2); item 3 is missing. List C has only items (1, 2, 3, 5); items (4, 6, 7) are missing. My expectation is that each item of each list appears in the data frame at least once, and each item of list C appears exactly once. My expected sample output is:
A B C
0 1 1 3
1 1 2 1
2 1 2 2
3 2 2 5
4 2 3 4
5 1 1 7
6 2 3 6
Please guide me to get my expected output. Thanks in advance.
You can pad each list with random values chosen from itself up to the maximum length, and then use DataFrame.sample:
import numpy as np
import pandas as pd

A = [1, 2]
B = [1, 2, 3]
C = [1, 2, 3, 4, 5, 6]

L = [A, B, C]
m = max(len(x) for x in L)
print(m)
6

a = [np.hstack((np.random.choice(x, m - len(x)), x)) for x in L]
df = pd.DataFrame(a, index=['A', 'B', 'C']).T.sample(frac=1)
print(df)
A B C
2 2 2 3
0 2 1 1
3 1 1 4
4 1 2 5
5 2 3 6
1 2 2 2
You can use transpose to achieve the same.
EDIT: Used random to randomize the output as requested.
import pandas as pd
from random import shuffle, choice

A = [1, 2]
B = [1, 2, 3]
C = [1, 2, 3, 4, 5, 6]

shuffle(A)
shuffle(B)
shuffle(C)

data = [A, B, C]
df = pd.DataFrame(data)
df = df.transpose()
df.columns = ['A', 'B', 'C']

# assign the result back instead of calling fillna(inplace=True)
# on a .loc slice, which may operate on a copy
df['A'] = df['A'].fillna(choice(A))
df['B'] = df['B'].fillna(choice(B))
Before the fillna step this gives the output below (the fillna lines then replace the NaN values in A and B with a random choice from each list):
A B C
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 NaN NaN 5.0
5 NaN NaN 6.0
Given this trivial dataset
df = pd.DataFrame({'one': ['a', 'a', 'a', 'b', 'b', 'b'],
'two': ['c', 'c', 'c', 'c', 'd', 'd'],
'three': [1, 2, 3, 4, 5, 6]})
grouping on one / two and applying .max() returns me a Series indexed on the groupby vars, as expected...
df.groupby(['one', 'two'])['three'].max()
output:
one two
a c 3
b c 4
d 6
Name: three, dtype: int64
...in my case I want to shift() my records, by group. But for some reason, when I apply .shift() to the groupby object, my results don't include the groupby variables:
df.groupby(['one', 'two'])['three'].shift()
output:
0 NaN
1 1.0
2 2.0
3 NaN
4 NaN
5 5.0
Name: three, dtype: float64
Is there a way to preserve those groupby variables in the results, as either columns or a multi-indexed Series (as in .max())? Thanks!
It is the difference between an aggregating and a non-aggregating function: max aggregates values (it returns a reduced Series), while shift does not; it returns a Series of the same size.
So it is possible to append the output as a new column:
df['shifted'] = df.groupby(['one', 'two'])['three'].shift()
In theory it is possible to use agg, but it returns an error in pandas 0.20.3:
df1 = df.groupby(['one', 'two'])['three'].agg(['max', lambda x: x.shift()])
print (df1)
ValueError: Function does not reduce
One possible solution is transform, if you need max together with shift:
g = df.groupby(['one', 'two'])['three']
df['max'] = g.transform('max')
df['shifted'] = g.shift()
print (df)
one three two max shifted
0 a 1 c 3 NaN
1 a 2 c 3 1.0
2 a 3 c 3 2.0
3 b 4 c 4 NaN
4 b 5 d 6 NaN
5 b 6 d 6 5.0
As Jez explained, shift returns a Series that keeps the same length as the dataframe; if you try to aggregate it like max(), you will get the error
Function does not reduce
df.assign(shifted=df.groupby(['one', 'two'])['three'].shift()).set_index(['one','two'])
Out[57]:
three shifted
one two
a c 1 NaN
c 2 1.0
c 3 2.0
b c 4 NaN
d 5 NaN
d 6 5.0
Using max as the key, and slicing the shifted values at the rows where the max occurs:
df.groupby(['one', 'two'])['three'].apply(lambda x : x.shift()[x==x.max()])
Out[58]:
one two
a c 2 2.0
b c 3 NaN
d 5 5.0
Name: three, dtype: float64
Is there a way to remove NaN values from a pandas Series? I have a Series that may or may not contain some NaN values, and I'd like to return a copy of the Series with all the NaNs removed.
>>> s = pd.Series([1,2,3,4,np.NaN,5,np.NaN])
>>> s[~s.isnull()]
0 1
1 2
2 3
3 4
5 5
Update: or, an even better approach, as @DSM suggested in the comments, use pandas.Series.dropna():
>>> s.dropna()
0 1
1 2
2 3
3 4
5 5
A small trick exploiting the fact that np.nan != np.nan:
s[s==s]
Out[953]:
0 1.0
1 2.0
2 3.0
3 4.0
5 5.0
dtype: float64
More Info
np.nan == np.nan
Out[954]: False
If you have a pandas Series with NaN values and want to remove them (without losing the index):
serie = serie.dropna()
import numpy as np
import pandas as pd

# create data for the example
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
ser = ser.replace('e', np.nan)  # assign the result back: replace returns a copy
print(ser)
0 g
1 NaN
2 NaN
3 k
4 s
dtype: object
# the code
ser = ser.dropna()
print(ser)
0 g
3 k
4 s
dtype: object
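For completeness, a minor variant: the positive form of the mask with Series.notna gives the same result as dropna:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, np.nan, 5, np.nan])
print(s[s.notna()])  # same rows as s.dropna(), original index kept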