I am trying to apply the following df.apply command to a dataframe but want it to skip the first row. Any advice on how to do that without setting the first row as the column headers?
res = sheet1[sheet1.apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)]
You can select from index 1 onwards as follows:
res = sheet1[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
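Note that this expression returns the boolean mask itself rather than the filtered rows. A minimal sketch (the names rest and mask are mine; sheet1 is assumed to be your existing DataFrame) that skips the first row while keeping the original row-filtering behaviour:
rest = sheet1.iloc[1:]  # skip the first row by position
mask = rest.apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
res = rest[mask]        # rows (after the first) containing 'TRUE' in any column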
EDIT Version 3:
import pandas as pd
import random
df = pd.DataFrame({'a':range(1,11), 'b':range(2,21,2), 'c':range(1,20,2),
                   'd':['TRUE' if random.randint(0,1) else 'FALSE' for _ in range(10)]})
print (df)
res = df[df.apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)]
print (res.loc[1:])
If all you want is the filtered rows from index 1 onwards, you can slice the result as shown above:
The input DataFrame is:
a b c d
0 1 2 1 TRUE
1 2 4 3 FALSE
2 3 6 5 FALSE
3 4 8 7 TRUE
4 5 10 9 TRUE
5 6 12 11 TRUE
6 7 14 13 FALSE
7 8 16 15 FALSE
8 9 18 17 TRUE
9 10 20 19 TRUE
The output of res will be:
a b c d
0 1 2 1 TRUE
3 4 8 7 TRUE
4 5 10 9 TRUE
5 6 12 11 TRUE
8 9 18 17 TRUE
9 10 20 19 TRUE
The output of res.loc[1:] (excluding the first row) will be:
a b c d
3 4 8 7 TRUE
4 5 10 9 TRUE
5 6 12 11 TRUE
8 9 18 17 TRUE
9 10 20 19 TRUE
EDIT Version 2:
Here's an example with 'TRUE' and 'FALSE' in the column.
import pandas as pd
import random
df = pd.DataFrame({'a':['TRUE' if random.randint(0,1) else 'FALSE' for _ in range(10)]})
print (df)
res = df.iloc[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
print (res)
The output will be:
Original DataFrame:
a
0 TRUE
1 TRUE
2 FALSE
3 FALSE
4 FALSE
5 TRUE
6 TRUE
7 FALSE
8 FALSE
9 FALSE
Result from the DataFrame:
1 True
2 False
3 False
4 False
5 True
6 True
7 False
8 False
9 False
dtype: bool
You can also use loc instead of iloc:
res = df.loc[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
As you can see, it skipped the first row.
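A minimal sketch (with a hypothetical non-default index) of why iloc is the safer choice when the index is not guaranteed to be 0, 1, 2, ...: iloc slices by position, while loc slices by label.
import pandas as pd
df = pd.DataFrame({'a': ['TRUE', 'FALSE', 'TRUE']}, index=[10, 20, 30])
print(df.iloc[1:])  # always skips the first row, regardless of the labels
print(df.loc[1:])   # label-based: on this sorted index it keeps all rows with label >= 1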
Old answer
Here's an example:
import pandas as pd
df = pd.DataFrame({'a':range(1,11), 'b':range(2,21,2), 'c':range(1,20,2)})
print (df)
res = df.iloc[1:, :].apply(lambda x: x + 10, axis=1)
print (res)
Original DataFrame:
a b c
0 1 2 1
1 2 4 3
2 3 6 5
3 4 8 7
4 5 10 9
5 6 12 11
6 7 14 13
7 8 16 15
8 9 18 17
9 10 20 19
The result res contains rows 1 onwards, with 10 added to each value (row 0 is not included):
a b c
1 12 14 13
2 13 16 15
3 14 18 17
4 15 20 19
5 16 22 21
6 17 24 23
7 18 26 25
8 19 28 27
9 20 30 29
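If the first row should be kept unmodified rather than dropped from the result, one option (a sketch) is to concatenate it back:
res = pd.concat([df.iloc[:1], df.iloc[1:].apply(lambda x: x + 10, axis=1)])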
Related
I have a pandas df like this (df1):
0 1 2 3 4 5
0 a b c d e f
1 1 4 7 10 13 16
2 2 5 8 11 14 17
3 3 6 9 12 15 18
and I want to generate a DataFrame like this (df2):
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Additional information about the given df:
The shape of df1 is unknown: b = df1.shape -> b = [n, m].
It is a given fact that the width of df1 is divisible by 3.
I tried stack, melt and wide_to_long. With stack the order of the rows is lost; the rows should behave as shown in the exemplary df2. I would really appreciate any help.
Kind regards, Hans
Use np.vstack and np.hsplit:
>>> import numpy as np
>>> pd.DataFrame(np.vstack(np.hsplit(df, df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Another example:
>>> df
0 1 2 3 4 5 6 7 8
0 a b c d e f g h i
1 1 4 7 10 13 16 19 22 25
2 2 5 8 11 14 17 20 23 26
3 3 6 9 12 15 18 21 24 27
>>> pd.DataFrame(np.vstack(np.hsplit(df, df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
8 g h i
9 19 22 25
10 20 23 26
11 21 24 27
You can use DataFrame.append (removed in pandas 2.0; see the pd.concat sketch after the output):
a = df[df.columns[: len(df.columns) // 3 + 1]]
b = df[df.columns[len(df.columns) // 3 + 1 :]]
b.columns = a.columns
df_out = a.append(b).reset_index(drop=True)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
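Since DataFrame.append was removed in pandas 2.0, an equivalent with pd.concat (a sketch under the same 6-column assumption as above) would be:
a = df[df.columns[:3]]  # first block of 3 columns
b = df[df.columns[3:]]  # second block
b.columns = a.columns   # align the labels so the blocks stack
df_out = pd.concat([a, b]).reset_index(drop=True)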
EDIT: To handle unknown widths:
dfs = []
for i in range(0, len(df.columns), 3):
    dfs.append(df[df.columns[i : i + 3]])
    dfs[-1].columns = df.columns[:3]
df_out = pd.concat(dfs)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
0 d e f
1 10 13 16
2 11 14 17
3 12 15 18
0 g h i
1 19 22 25
2 20 23 26
3 21 24 27
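If a continuous 0..n index is wanted instead of the repeated block indices above, concat can regenerate it (a sketch using the documented ignore_index flag):
df_out = pd.concat(dfs, ignore_index=True)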
I have a df1 and df2 as follows:
df1:
a b c
0 1 2 4
1 6 12 24
2 7 14 28
3 4 8 16
4 3 6 12
df2:
a b c
0 7 8 9
1 10 11 12
How can I insert df2 into df1 after the second row? My desired output will look like this:
a b c
0 1 2 4
1 6 12 24
2 7 8 9
3 10 11 12
4 7 14 28
5 4 8 16
6 3 6 12
Thank you.
Use concat with the first DataFrame split by DataFrame.iloc; passing ignore_index=True regenerates a 0..n-1 index to match the desired output:
df = pd.concat([df1.iloc[:2], df2, df1.iloc[2:]], ignore_index=True)
print (df)
a b c
0 1 2 4
1 6 12 24
2 7 8 9
3 10 11 12
4 7 14 28
5 4 8 16
6 3 6 12
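A hypothetical helper generalizing the idea (the name insert_rows is mine, not a pandas API):
def insert_rows(df_top, df_new, pos):
    # Return df_top with df_new inserted before positional row `pos`.
    return pd.concat([df_top.iloc[:pos], df_new, df_top.iloc[pos:]], ignore_index=True)

# usage: result = insert_rows(df1, df2, 2)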
Here is another way using np.r_:
import numpy as np
df2.index = range(len(df1), len(df1) + len(df2))  # reindex df2 to start where df1 ends
final = pd.concat((df1, df2))                     # concat
final.iloc[np.r_[0, 1, df2.index, 2:len(df1)]]    # select the desired row order positionally
# np.r_[0, 1, [5, 6], 2:5] evaluates to [0 1 5 6 2 3 4]
# equivalently: final.iloc[np.r_[0:2, df2.index, 2:len(df1)]]
a b c
0 1 2 4
1 6 12 24
5 7 8 9
6 10 11 12
2 7 14 28
3 4 8 16
4 3 6 12
I have a pandas dataframe like this -
Time 1 A 2 A 3 A 4 A 5 A 6 A 100 A
5 10 4 6 6 4 6 4
3 7 19 2 7 7 9 18
6 3 6 3 3 8 10 56
2 5 9 1 1 9 12 13
The Time column gives me the number of A columns that I need to sum up, so that the output looks like this -
Time 1 A 2 A 3 A 4 A 5 A 6 A 100 A Total
5 10 4 6 6 4 6 4 30
3 7 19 2 7 7 9 18 28
6 3 6 3 3 8 10 56 33
2 5 9 1 1 9 12 13 14
In other words, when the value in the Time column is 3, it should sum up 1 A, 2 A and 3 A;
when the value in the Time column is 5, it should sum up 1 A, 2 A, 3 A, 4 A and 5 A.
Note: there are also other columns in between the As, so I can't sum using simple indexing.
I'd highly appreciate any help in finding a solution.
Use numpy: the idea is to compare an array created by np.arange over the number of columns against the Time values (moved to the index and reshaped for broadcasting) to build a 2D mask, pick the matched values with numpy.where, and finally sum:
import numpy as np
df1 = df.set_index('Time')
m = np.arange(len(df1.columns)) < df1.index.values[:, None]
df['new'] = np.where(m, df1.values, 0).sum(axis=1)
print (df)
Time 1 A 2 A 3 A 4 A 5 A 6 A 100 A new
0 5 10 4 6 6 4 6 4 30
1 3 7 19 2 7 7 9 18 28
2 6 3 6 3 3 8 10 56 33
3 2 5 9 1 1 9 12 13 14
Details:
print (df1)
1 A 2 A 3 A 4 A 5 A 6 A 100 A
Time
5 10 4 6 6 4 6 4
3 7 19 2 7 7 9 18
6 3 6 3 3 8 10 56
2 5 9 1 1 9 12 13
print (m)
[[ True True True True True False False]
[ True True True False False False False]
[ True True True True True True False]
[ True True False False False False False]]
print (np.where(m, df1.values, 0))
[[10 4 6 6 4 0 0]
[ 7 19 2 0 0 0 0]
[ 3 6 3 3 8 10 0]
[ 5 9 0 0 0 0 0]]
Try:
df['total'] = df.apply(lambda x: sum(x.iloc[i + 1] for i in range(x['Time'])), axis=1)
(x.iloc[i + 1] selects the A columns by position, assuming Time is the first column; plain x[i + 1] relies on deprecated positional fallback indexing.)
I have a dataframe like below:
A B C
1 8 23
2 8 22
3 9 45
4 9 45
5 6 12
6 4 10
7 11 12
I want to drop consecutive duplicates in column B, keeping the first occurrence, but only when the corresponding value in C is also the same.
E.g. here the value 9 in column B repeats consecutively and its corresponding occurrences in column C are also the repeated value 45. In this case I want to retain the first occurrence.
Expected Output:
A B C
1 8 23
2 8 22
3 9 45
5 6 12
6 4 10
7 11 12
I tried a groupby, but did not know how to drop.
Code:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
test = df.groupby('consecutive', as_index=False).apply(
    lambda x: (x['B'].head(1), x.shape[0], x['C'].iloc[-1] - x['C'].iloc[0]))
This groupby returns a Series, but I want to drop rows.
Add DataFrame.drop_duplicates on the 2 columns:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
df = df.drop_duplicates(['consecutive','C'])
print (df)
A B C consecutive
0 1 8 23 1
1 2 8 22 1
2 3 9 45 2
4 5 6 12 3
5 6 4 10 4
6 7 11 12 5
Or chain both conditions with | for bitwise OR:
df = df[(df['B'] != df['B'].shift()) | (df['C'] != df['C'].shift())]
print (df)
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
An easy way is to check the row-to-row difference of B and C and drop a row when both differences are 0 (i.e. duplicate consecutive values); note this assumes numeric columns, since diff is numeric. The code is:
df[~((df.B.diff() == 0) & (df.C.diff() == 0))]
A one-liner to filter out such records is:
df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
Here we thus check whether the columns ['B', 'C'] are the same as in the shifted rows; if they are not, we retain the row:
>>> df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
This is quite scalable, since we can define a function that easily operates on an arbitrary number of columns:
def drop_consecutive_duplicates(df, *colnames):
    dff = df[list(colnames)]
    return df[(dff.shift() != dff).any(axis=1)]
So you can then filter with:
drop_consecutive_duplicates(df, 'B', 'C')
Using diff, ne and any over axis=1:
Note: this method only works for numeric columns
m = df[['B', 'C']].diff().ne(0).any(axis=1)
print(df[m])
Output
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Details
df[['B', 'C']].diff()
B C
0 NaN NaN
1 0.0 -1.0
2 1.0 23.0
3 0.0 0.0
4 -3.0 -33.0
5 -2.0 -2.0
6 7.0 2.0
Then we check if any of the values in a row are not equal (ne) to 0:
df[['B', 'C']].diff().ne(0).any(axis=1)
0 True
1 True
2 True
3 False
4 True
5 True
6 True
dtype: bool
You can compute a boolean series marking the rows to drop, and then drop them:
to_drop = (df['B'] == df['B'].shift()) & (df['C'] == df['C'].shift())
df = df[~to_drop]
It gives as expected:
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Code
df1 = df.drop_duplicates(subset=['B', 'C'])
Note: this drops all duplicate (B, C) pairs, not only consecutive ones; it matches the expected output here only because the sample data has no non-adjacent repeats.
Result
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
If I understand your question correctly, given the following dataframe:
df = pd.DataFrame({'B': [8, 8, 9, 9, 6, 4, 11], 'C': [22, 23, 45, 45, 12, 10, 12]})
This one-line code solves your problem using the drop_duplicates method:
df.drop_duplicates(['B', 'C'])
It gives the expected result:
B C
0 8 22
1 8 23
2 9 45
4 6 12
5 4 10
6 11 12
I have a dataframe similar to
a b c d e
0 36 38 27 12 35
1 45 33 8 41 18
4 32 14 4 14 9
5 43 1 31 11 3
6 16 8 3 17 39
...
and I want, for each row, to count the occurrences of values in a given set.
I came up with the following code (Python 3) which seems to work, but I'm looking for efficiency, since my real dataframe is much more complex and big:
import pandas as pd
import numpy as np
def column():
    return [np.random.randint(0, 49) for _ in range(20)]
df = pd.DataFrame({'a': column(),'b': column(),'c': column(),'d': column(),'e': column()})
given_set = {3,8,11,18,22,24,35,36,42,47}
def count_occurrences(row):
    return sum(col in given_set for col in (row.a, row.b, row.c, row.d, row.e))
df['count'] = df.apply(count_occurrences, axis=1)
print(df)
Is there a way to obtain the same result with vectorized pandas operations instead of a plain Python function?
Thanks in advance.
IIUC you can use the DataFrame.isin() method:
Data:
In [41]: given_set = {3,8,11,18,22,24,35,36,42,47}
In [42]: df
Out[42]:
a b c d e
0 36 38 27 12 35
1 45 33 8 41 18
4 32 14 4 14 9
5 43 1 31 11 3
6 16 8 3 17 39
Solution:
In [44]: df['new'] = df.isin(given_set).sum(1)
In [45]: df
Out[45]:
a b c d e new
0 36 38 27 12 35 2
1 45 33 8 41 18 2
4 32 14 4 14 9 0
5 43 1 31 11 3 2
6 16 8 3 17 39 2
Explanation:
In [49]: df.isin(given_set)
Out[49]:
a b c d e
0 True False False False True
1 False False True False True
4 False False False False False
5 False False False True True
6 False True True False False
In [50]: df.isin(given_set).sum(1)
Out[50]:
0 2
1 2
4 0
5 2
6 2
dtype: int64
UPDATE: if you want to check for existence instead of counting, you can do it this way (thanks to @DSM):
In [6]: df.isin(given_set).any(1)
Out[6]:
0 True
1 True
4 False
5 True
6 True
dtype: bool
In [7]: df.isin(given_set).any(1).astype(np.uint8)
Out[7]:
0 1
1 1
4 0
5 1
6 1
dtype: uint8
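As an aside (not part of the original answer), the axis reads more clearly spelled out as a keyword, and recent pandas releases have been moving many APIs toward keyword-only arguments; a sketch:
df['new'] = df.isin(given_set).sum(axis=1)
exists = df.isin(given_set).any(axis=1)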