Remove lines in dataframe using a list in Pandas - python

It's a generic question about filtering a pandas dataframe using a list. The problem is the following:
I have a pandas dataframe df with a column field
I have a list of banned fields, for example ban_field=['field1','field2','field3']
All elements of ban_field appear in df.field
For the moment, to retrieve the dataframe without the banned field, I proceed as follows:
for f in ban_field:
    df = df[df.field != f]
Is there a more pythonic way to proceed (in one line?)?

Method #1: use isin and a boolean array selector:
In [47]: df = pd.DataFrame({"a": [2]*10, "field": range(10)})
In [48]: ban_field = [3,4,6,7,8]
In [49]: df[~df.field.isin(ban_field)]
Out[49]:
   a  field
0  2      0
1  2      1
2  2      2
5  2      5
9  2      9
[5 rows x 2 columns]
Method #2: use query:
In [51]: df.query("field not in @ban_field")
Out[51]:
   a  field
0  2      0
1  2      1
2  2      2
5  2      5
9  2      9
[5 rows x 2 columns]

You can remove them by using the isin method together with the negation operator (~).
df[~df.field.isin(ban_field)]
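As a self-contained sketch of both answers, with made-up example data (the original post never shows concrete values):

```python
import pandas as pd

# Hypothetical data: 'field' holds string labels, some of which are banned
df = pd.DataFrame({"field": ["field1", "field2", "field3", "field4", "field5"],
                   "value": [10, 20, 30, 40, 50]})
ban_field = ["field1", "field2", "field3"]

# isin builds a boolean mask; ~ negates it, keeping only non-banned rows
filtered = df[~df["field"].isin(ban_field)]
```

This replaces the whole loop with a single vectorized boolean selection.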

Related

Appending rows to existing pandas dataframe

I have a pandas dataframe df1
   a  b
0  1  2
1  3  4
I have another dataframe in the form of a dictionary
dictionary = {'2' : [5, 6], '3' : [7, 8]}
I want to append the dictionary values as rows in dataframe df1. I am using pandas.DataFrame.from_dict() to convert the dictionary into a dataframe; the constraint is that, when I do it, I cannot provide any value for the columns argument of from_dict().
So when I try to concatenate the two dataframes, pandas adds the contents of the new dataframe as new columns. I do not want that. The final output I want is in the format
   a  b
0  1  2
1  3  4
2  5  6
3  7  8
Can someone tell me the least painful way to do this?
Use concat with help of pd.DataFrame.from_dict, setting the columns of df1 during the conversion:
out = pd.concat([df1,
                 pd.DataFrame.from_dict(dictionary, orient='index',
                                        columns=df1.columns)])
Output:
   a  b
0  1  2
1  3  4
2  5  6
3  7  8
Another possible solution, which uses numpy.vstack:
pd.DataFrame(np.vstack([df1.values,
                        np.array(list(dictionary.values()))]),
             columns=df1.columns)
Output:
   a  b
0  1  2
1  3  4
2  5  6
3  7  8

Pivot table based on the first value of the group in Pandas

I have the following DataFrame:
I'm trying to pivot it in pandas to achieve the following format:
I tried the classical approach with pd.pivot_table(), but it does not work out:
pd.pivot_table(df, values='col2', index=[df.index], columns='col1')
Would appreciate some suggestions :) Thanks!
You can use pivot and then dropna for each column:
>>> df.pivot(columns='col1', values='col2').apply(lambda x: x.dropna().tolist()).astype(int)
col1  a  b  c
0     1  2  9
1     4  5  0
2     6  8  7
Another option is to create a Series of lists using groupby.agg; then construct a DataFrame:
out = (df.groupby('col1')['col2'].agg(list)
         .pipe(lambda x: pd.DataFrame(zip(*x), columns=x.index.tolist())))
Output:
   a  b  c
0  1  2  9
1  4  5  0
2  6  8  7
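Since the question's DataFrame was shown only as an image, here is a minimal runnable sketch of the groupby.agg approach, assuming long-format data that reproduces the answer's output:

```python
import pandas as pd

# Assumed input: one (col1, col2) pair per row, interleaved across groups
df = pd.DataFrame({
    "col1": ["a", "b", "c", "a", "b", "c", "a", "b", "c"],
    "col2": [1, 2, 9, 4, 5, 0, 6, 8, 7],
})

# Collect each group's values into a list, then zip the lists into columns
out = (df.groupby("col1")["col2"].agg(list)
         .pipe(lambda s: pd.DataFrame(list(zip(*s)), columns=s.index.tolist())))
```

Note that this (like the pivot/dropna answer) relies on every group having the same number of rows; otherwise zip silently truncates to the shortest group.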

In pandas Dataframe with multiindex how can I filter by order?

Assume the following dataframe
>>> import pandas as pd
>>> L = [(1,'A',9,9), (1,'C',8,8), (1,'D',4,5),(2,'H',7,7),(2,'L',5,5)]
>>> df = pd.DataFrame.from_records(L).set_index([0,1])
>>> df
     2  3
0 1
1 A  9  9
  C  8  8
  D  4  5
2 H  7  7
  L  5  5
I want to filter the rows in the nth position of level 1 of the multiindex, i.e. filtering the first
     2  3
0 1
1 A  9  9
2 H  7  7
or filtering the third
     2  3
0 1
1 D  4  5
How can I achieve this ?
You can filter the rows with the help of GroupBy.nth after grouping on the first level of the multi-index. Since n is 0-based, pass the position accordingly:
1) To select the first row grouped per level=0:
df.groupby(level=0, as_index=False).nth(0)
2) To select the third row grouped per level=0:
df.groupby(level=0, as_index=False).nth(2)
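A self-contained sketch using the question's own data. One caveat: the exact shape of nth's result (whether the group key stays in the index) has shifted across pandas versions, but the rows it selects are the same; groups shorter than n+1 rows simply drop out:

```python
import pandas as pd

L = [(1, "A", 9, 9), (1, "C", 8, 8), (1, "D", 4, 5), (2, "H", 7, 7), (2, "L", 5, 5)]
df = pd.DataFrame.from_records(L).set_index([0, 1])

# nth(0): first row of every level-0 group
first = df.groupby(level=0).nth(0)
# nth(2): third row; group 2 has only two rows, so it contributes nothing
third = df.groupby(level=0).nth(2)
```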

Rename Dataframe column based on column index

Is there a built in function to rename a pandas dataframe by index?
I thought I knew the name of my column headers, but it turns out the second column has some hexadecimal characters in it. I will likely come across this issue with column 2 in the future based on the way I receive my data, so I cannot hard code those specific hex characters into a dataframe.rename() call.
Is there a function that would be appropriately named rename_col_by_index() that I have not been able to find?
Ex:
>>> df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
>>> df.rename_col_by_index(1, 'new_name')
>>> df
   a  new_name
0  1         3
1  2         4
@MaxU's answer is better
df.rename(columns={"col1": "New name"})
More in docs
UPDATE: thanks to @Vincenzzzochi:
In [138]: df.rename(columns={df.columns[1]: 'new'})
Out[138]:
   a  new  c
0  1    3  5
1  2    4  6
In [140]: df
Out[140]:
   a  b  c
0  1  3  5
1  2  4  6
or bit more flexible:
In [141]: mapping = {df.columns[0]:'new0', df.columns[1]: 'new1'}
In [142]: df.rename(columns=mapping)
Out[142]:
   new0  new1  c
0     1     3  5
1     2     4  6
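The hypothetical rename_col_by_index() from the question is easy to wrap yourself around df.rename; a sketch (the helper name is the asker's invention, not a pandas API):

```python
import pandas as pd

def rename_col_by_index(df, idx, new_name):
    """Hypothetical helper: return a copy with the column at position idx renamed."""
    return df.rename(columns={df.columns[idx]: new_name})

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = rename_col_by_index(df, 1, "new_name")
```

Because rename returns a new frame, the original df keeps its columns unless you reassign.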

Retrieving groups by their index. Are Pandas groups sorted?

Say I group a Pandas dataframe around some column
df.groupby(cols)
Are groups sorted according to any criteria?
One way to retrieve a group is:
ix = 0
grouped.get_group(grouped.groups.keys()[ix])
but it is a bit verbose, and it's not clear that keys() above will give the groups in order.
Another way:
df = df.set_index(col)
df.loc[idx[df.index.levels[0][0],:],:]
but again, that's really verbose.
Is there another way to get a group by its integer index?
groupby has a sort parameter which is True by default, thus the groups are sorted. As for getting the nth group, it looks like you'd have to define a function, and use an internal API:
In [123]: df = DataFrame({'a': [1,1,1,1,2,2,3,3,3,3], 'b': randn(10)})
In [124]: df
Out[124]:
   a       b
0  1  1.5665
1  1 -0.2558
2  1  0.0756
3  1 -0.2821
4  2  0.8670
5  2 -2.0043
6  3 -1.3393
7  3  0.3898
8  3 -0.3392
9  3  1.2198
[10 rows x 2 columns]
In [125]: gb = df.groupby('a')
In [126]: def nth_group(gb, n):
   .....:     keys = gb.grouper._get_group_keys()
   .....:     return gb.get_group(keys[n])
   .....:
In [127]: nth_group(gb, 0)
Out[127]:
   a       b
0  1  1.5665
1  1 -0.2558
2  1  0.0756
3  1 -0.2821
[4 rows x 2 columns]
How about:
key, df2 = next(iter(grouped))
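Iterating the GroupBy object avoids the internal _get_group_keys() API entirely; because sort=True is the default, the (key, sub-frame) pairs come out in sorted key order. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": [10, 20, 30, 40, 50]})
gb = df.groupby("a")  # sort=True by default, so groups iterate in key order

# First (key, sub-frame) pair, Python 3 style
key, first_group = next(iter(gb))

# Materialize all pairs to fetch any group by integer position
groups = list(gb)
third_key, third_group = groups[2]
```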
