I'm trying to clean a dataset.
For the last 3 rows only, if column "B" is empty I want to drop the whole row.
I haven't managed to figure out how to apply dropna to only certain rows.
   A  B
1  1  3
2  5
3  6  5
4  2
5  3  6
Needs to become
   A  B
1  1  3
2  5
3  6  5
5  3  6
You slice the last three rows, apply your condition, and pass the resulting index to drop:
n = 3
# index of the tail rows where B is the empty string
df = df.drop(df.tail(n).B.eq('').loc[lambda x: x].index)
   A  B
1  1  3
2  5
3  6  5
5  3  6
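If "empty" actually means NaN rather than an empty string, a minimal sketch (my assumption, not taken from the answer above) applies dropna to the tail only and stitches the frame back together:
import pandas as pd
n = 3
# dropna only on the last n rows; the head is kept untouched
df = pd.concat([df.iloc[:-n], df.tail(n).dropna(subset=['B'])])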
Given a pandas data frame, how can I get the first row for each unique value in a column?
For example, given:
a b key
0 1 2 1
1 2 3 1
2 3 3 1
3 4 5 2
4 5 6 2
5 6 6 2
6 7 2 1
7 8 2 1
8 9 2 3
the result when analyzing by column key should be
a b key
0 1 2 1
3 4 5 2
8 9 2 3
p.s. df src:
pd.DataFrame([{'a': 1, 'b': 2, 'key': 1},
              {'a': 2, 'b': 3, 'key': 1},
              {'a': 3, 'b': 3, 'key': 1},
              {'a': 4, 'b': 5, 'key': 2},
              {'a': 5, 'b': 6, 'key': 2},
              {'a': 6, 'b': 6, 'key': 2},
              {'a': 7, 'b': 2, 'key': 1},
              {'a': 8, 'b': 2, 'key': 1},
              {'a': 9, 'b': 2, 'key': 3}])
drop_duplicates does this. By default it keeps the first row of each set, though that can be changed with the keep parameter.
df = df.drop_duplicates('key')
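For illustration, the keep parameter mentioned above controls which duplicate survives:
df.drop_duplicates('key')               # keep the first row per key (default)
df.drop_duplicates('key', keep='last')  # keep the last row per key instead
df.drop_duplicates('key', keep=False)   # drop every duplicated key entirely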
I'm trying to use pandas to identify sub-sections of a dataframe which are identical. So, for example, if I have a dataframe like:
id A B
0 1 1 2
1 1 2 3
2 1 5 6
3 2 1 2
4 2 2 3
5 2 5 6
6 3 8 9
7 3 4 0
8 3 9 7
I want to group by ID, so rows 0 - 2 would form Group 1, rows 3 - 5 would form Group 2, and rows 6 - 8 would form Group 3. I know I can use groupby() to group rows by ID. In the case here, Group 2 is a repetition of Group 1 (columns A and B are identical in both).
What I then want to do is to remove repeated groups, so in this case I would want to remove the second group. My final dataframe would then look like:
id A B
0 1 1 2
1 1 2 3
2 1 5 6
6 3 8 9
7 3 4 0
8 3 9 7
Every column in the duplicate groups is the same, except for the ID which is different for each group. I only want to remove a group if it is identical for every row in the group. Any help would be much appreciated!
This is one way using a helper column and pd.Series.drop_duplicates.
The idea is to first create a mapping from id to a tuple of values representing all rows for that id, then drop duplicates and extract the index of the remainder.
# hashable snapshot of each row's (A, B) values
df['C'] = list(zip(df['A'], df['B']))
# one tuple per id; duplicated tuples mark repeated groups
s = df.groupby('id')['C'].apply(tuple).drop_duplicates().index
# keep only rows whose id survived, dropping the helper column
res = df.loc[df['id'].isin(s), ['id', 'A', 'B']]
print(res)
id A B
0 1 1 2
1 1 2 3
2 1 5 6
6 3 8 9
7 3 4 0
8 3 9 7
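A variant without the helper column (my sketch of the same idea): hash each group's rows into a single tuple directly inside groupby.
# one tuple of (A, B) row-tuples per id
keys = df.groupby('id').apply(lambda g: tuple(map(tuple, g[['A', 'B']].values)))
res = df[df['id'].isin(keys.drop_duplicates().index)]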
Check pd.crosstab: identical groups produce identical rows in the id-by-(A, B) count table, so drop_duplicates can remove them. Note the result comes back as unique (A, B, id) combinations rather than the original rows.
s = pd.crosstab(df.id, [df.A, df.B]).drop_duplicates().unstack()
s[s != 0].reset_index().drop(columns=0)
Out[128]:
A B id
0 1 2 1
1 2 3 1
2 4 0 3
3 5 6 1
4 8 9 3
5 9 7 3
Given a column in a pandas DataFrame, I want to create a new column in which a specified number of rows form sub-levels of each row of the original column. I'm doing this to build a large data matrix of value ranges as input for a model later on.
As an example I have a small DataFrame as follows:
df:
A
1 1
2 2
3 3
. ..
To this DataFrame I would like to add 3 rows per row in the 'A' column of the DataFrame, forming a new column named 'B'. The result should be something like this:
df:
A B
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
. .. ..
I have tried various things; a list comprehension combined with an if statement, or iterating over the rows with something like iterrows() and then appending the new rows, seems most logical to me, but I cannot get it done. The duplication of the 'A' column's rows is what stumps me in particular.
Does anyone know how to do this?
Any suggestion is appreciated, many thanks in advance
I think you need numpy.repeat and numpy.tile with the DataFrame constructor:
df = pd.DataFrame({'A': np.repeat(df['A'].values, 3),  # each value 3 times in a row
                   'B': np.tile(df['A'].values, 3)})   # the whole column 3 times over
print (df)
A B
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 3 1
7 3 2
8 3 3
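If you prefer staying in pandas, here's a sketch of the same idea via MultiIndex.from_product (my addition, not from the answers above); the product of column A with itself yields exactly the repeat/tile pattern:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]})
# the first level repeats, the second level tiles, matching the desired output
out = (pd.MultiIndex.from_product([df['A'], df['A']], names=['A', 'B'])
         .to_frame(index=False))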
Here's another NumPy way with np.repeat: create one column, then re-use it for the other -
In [282]: df.A
Out[282]:
1 4
2 9
3 5
Name: A, dtype: int64
In [288]: r = np.repeat(df.A.values[:, None], 3, axis=1)
In [289]: pd.DataFrame(np.c_[r.ravel(), r.T.ravel()], columns=['A', 'B'])
Out[289]:
A B
0 4 4
1 4 9
2 4 5
3 9 4
4 9 9
5 9 5
6 5 4
7 5 9
8 5 5
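Note the trick: r.ravel() walks the repeated matrix row by row (giving the repeat pattern for A), while r.T.ravel() walks it column by column (giving the tile pattern for B), so a single array feeds both columns.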
I want to produce a column B in a dataframe that tracks the maximum value reached in column A since row Index 0.
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
I want to avoid iterating, so is there a vectorized solution, and if so, what could it look like?
You're looking for cummax:
In [257]:
df['B'] = df['A'].cummax()
df
Out[257]:
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
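For comparison (my addition, not part of the answer above), the same running maximum can be written with expanding, though cummax is the dedicated, faster path; note that expanding returns floats, hence the cast:
# same running maximum as cummax, computed via an expanding window
df['B'] = df['A'].expanding().max().astype(int)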
I have a pandas dataframe containing rows with numbered columns:
1 2 3 4 5
a 0 0 0 0 1
b 1 1 2 1 9
c 2 2 2 2 2
d 5 5 5 5 5
e 8 9 9 9 9
How can I filter out the rows where a subset of columns are all above or below a certain value?
So, for example: I want to remove all rows where the values in columns 1 to 3 are not all > 3. In the above, that would leave me with only rows d and e.
The columns I am filtering and the value I am checking against are both arguments.
I've tried a few things, this is the closest I've gotten:
df[df[range(1,3)]>3]
Any ideas?
I used loc and all in this function:
def filt(df, cols, thresh):
    return df.loc[(df[cols] > thresh).all(axis=1)]
filt(df, [1, 2, 3], 3)
1 2 3 4 5
d 5 5 5 5 5
e 8 9 9 9 9
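Since the question also asks about filtering below a value, the same shape works with the comparison flipped (filt_below is a hypothetical name, for illustration):
def filt_below(df, cols, thresh):
    # keep rows where every listed column is strictly below thresh
    return df.loc[(df[cols] < thresh).all(axis=1)]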
You can achieve this without using apply:
In [73]:
df[(df.iloc[:, 0:3] > 3).all(axis=1)]
Out[73]:
1 2 3 4 5
d 5 5 5 5 5
e 8 9 9 9 9
So this slices the df down to just the first 3 columns using iloc, compares against the scalar 3, and calls all(axis=1) to create a boolean series that masks the rows.