Say I have a data table
1 2 3 4 5 6 .. n
A x x x x x x .. x
B x x x x x x .. x
C x x x x x x .. x
And I want to slim it down so that I keep only, say, columns 3 and 5, deleting all the others while maintaining the structure. How could I do this with pandas? I think I understand how to delete a single column, but I don't know how to keep a select few and delete all the others.
If you have a list of columns you can just select those:
In [11]: df
Out[11]:
1 2 3 4 5 6
A x x x x x x
B x x x x x x
C x x x x x x
In [12]: col_list = [3, 5]
In [13]: df = df[col_list]
In [14]: df
Out[14]:
3 5
A x x
B x x
C x x
How do I keep certain columns in a pandas DataFrame, deleting everything else?
The answer to this question is the same as the answer to "How do I delete certain columns in a pandas DataFrame?" Here are some additional options to those mentioned so far, along with timings.
DataFrame.loc
One simple option is selection, as mentioned in other answers:
# Setup.
df
1 2 3 4 5 6
A x x x x x x
B x x x x x x
C x x x x x x
cols_to_keep = [3,5]
df[cols_to_keep]
3 5
A x x
B x x
C x x
Or,
df.loc[:, cols_to_keep]
3 5
A x x
B x x
C x x
DataFrame.reindex with axis=1 or 'columns' (0.21+)
However, we also have reindex. In recent versions you specify axis=1 to reindex along the columns, which drops every column not listed:
df.reindex(cols_to_keep, axis=1)
# df.reindex(cols_to_keep, axis='columns')
# for versions < 0.21, use
# df.reindex(columns=cols_to_keep)
3 5
A x x
B x x
C x x
On older versions, you can also use reindex_axis: df.reindex_axis(cols_to_keep, axis=1).
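One practical difference between plain selection and reindex is how they treat labels that do not exist. The sketch below uses a hypothetical 3x6 frame and a made-up missing label 99 to illustrate it: reindex adds the missing column filled with NaN, while df[...] raises a KeyError.

```python
import pandas as pd

# Hypothetical sample frame with integer column labels 1..6.
df = pd.DataFrame(
    [["x"] * 6 for _ in range(3)], index=list("ABC"), columns=range(1, 7)
)

cols_to_keep = [3, 5, 99]  # 99 is deliberately not a column of df

# reindex keeps the requested labels and fills the missing one with NaN.
reindexed = df.reindex(cols_to_keep, axis=1)

# Plain selection raises KeyError when any requested label is missing.
try:
    df[cols_to_keep]
    selection_failed = False
except KeyError:
    selection_failed = True

print(reindexed)
print("selection raised KeyError:", selection_failed)
```

If a silently created NaN column would hide a typo in your column list, plain selection or loc is probably the safer choice.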
DataFrame.drop
Another alternative is to use drop, passing the columns to remove as computed by pd.Index.difference:
# df.drop(cols_to_drop, axis=1)
df.drop(df.columns.difference(cols_to_keep), axis=1)
3 5
A x x
B x x
C x x
Performance
The methods are roughly the same in terms of performance; reindex is faster for smaller N, while drop is faster for larger N. The performance is relative as the Y-axis is logarithmic.
Setup and Code
import numpy as np
import pandas as pd
import perfplot

def make_sample(n):
    # Build an n x n frame of 'x' and pick a random quarter of its columns (at least 2).
    np.random.seed(0)
    df = pd.DataFrame(np.full((n, n), 'x'))
    cols_to_keep = np.random.choice(df.columns, max(2, n // 4), replace=False)
    return df, cols_to_keep

perfplot.show(
    setup=lambda n: make_sample(n),
    kernels=[
        lambda inp: inp[0][inp[1]],
        lambda inp: inp[0].loc[:, inp[1]],
        lambda inp: inp[0].reindex(inp[1], axis=1),
        lambda inp: inp[0].drop(inp[0].columns.difference(inp[1]), axis=1)
    ],
    labels=['__getitem__', 'loc', 'reindex', 'drop'],
    n_range=[2**k for k in range(2, 13)],
    xlabel='N',
    logy=True,
    # the kernels may return columns in a different order, so align before comparing
    equality_check=lambda x, y: (x.reindex_like(y) == y).values.all()
)
You could reassign a new value to your DataFrame, df:
df = df.loc[:,[3, 5]]
As long as there are no other references to the original DataFrame, the old DataFrame will get garbage collected.
Note that when using df.loc, the index is specified by labels. Thus above 3 and 5 are not ordinals, they represent the label names of the columns. If you wish to specify the columns by ordinal index, use df.iloc.
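To make the label versus position distinction concrete, here is a small sketch on a hypothetical frame whose column labels are the integers 1 to 6. With those labels, the columns named 3 and 5 sit at positions 2 and 4 (counting from 0).

```python
import pandas as pd

# Hypothetical sample frame with integer column labels 1..6.
df = pd.DataFrame(
    [["x"] * 6 for _ in range(3)], index=list("ABC"), columns=range(1, 7)
)

by_label = df.loc[:, [3, 5]]      # columns whose labels are 3 and 5
by_position = df.iloc[:, [2, 4]]  # third and fifth columns, counted from 0

print(by_label.equals(by_position))
```

Both selections pick the same two columns here only because of how this sample frame is labelled; with string labels, only iloc would accept integers.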
For those who are looking for a method to do this in place:
from pandas import DataFrame
from typing import Set, Any
def remove_others(df: DataFrame, columns: Set[Any]):
    # Drop, in place, every column of df that is not in `columns`.
    cols_total: Set[Any] = set(df.columns)
    diff: Set[Any] = cols_total - columns
    df.drop(diff, axis=1, inplace=True)
This computes the set difference between all the columns in the dataframe and the columns which should be kept. The columns in that difference can safely be removed. Drop works even on an empty set.
>>> df = DataFrame({"a":[1,2,3],"b":[2,3,4],"c":[3,4,5]})
>>> df
a b c
0 1 2 3
1 2 3 4
2 3 4 5
>>> remove_others(df, {"a","b","c"})
>>> df
a b c
0 1 2 3
1 2 3 4
2 3 4 5
>>> remove_others(df, {"a"})
>>> df
a
0 1
1 2
2 3
>>> remove_others(df, {"a","not","existent"})
>>> df
a
0 1
1 2
2 3
Another approach is to use filter:
In [5]: df.filter([3, 5])
Out[5]:
3 5
A x x
B x x
C x x
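filter also accepts like and regex arguments, and with items it ignores labels that are not present instead of raising. The sketch below uses a hypothetical frame with made-up column names to show the three modes.

```python
import pandas as pd

# Hypothetical frame; the column names are invented for illustration.
df = pd.DataFrame({"price_usd": [1, 2], "price_eur": [3, 4], "qty": [5, 6]})

exact = df.filter(items=["qty", "missing"])  # "missing" is silently ignored
by_substring = df.filter(like="price")       # labels containing "price"
by_regex = df.filter(regex=r"_eur$")         # labels matching the pattern

print(exact.columns.tolist())
print(by_substring.columns.tolist())
print(by_regex.columns.tolist())
```

Because missing labels are skipped quietly, filter is convenient for optional columns but will not warn you about a misspelled name.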
Related
For example, I have a dataframe (df) and the Target column is df['Z']. I have two other columns, df['X'] and df['Y']. I have received all this data from the real-world data collection.
How can I express Z as one of the following functions in Python (i.e. fit Z as a function of X and Y)?
> 1. Z = f(X)
> 2. Z = f(X,Y)
That's how you do that:
def function(x, y):
    return x+y+4 # Obviously the function can be more complex
df["Z"] = function(df["A"], df["B"])
Example
import pandas as pd

data = {'A': list(range(5)), 'B': list(range(6, 11))}
df = pd.DataFrame(data)

def function(x, y):
    return x + y + 4

df["Z"] = function(df["A"], df["B"])
print(df)
Output:
A B Z
0 0 6 10
1 1 7 12
2 2 8 14
3 3 9 16
4 4 10 18
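The answer above applies a function that is already known. If the goal is instead to estimate the function from measured data, one possible approach (an assumption on my part, not something the answer states) is an ordinary least-squares fit of a linear model Z = a*X + b*Y + c with numpy.linalg.lstsq. The data below is synthetic and generated from known coefficients so the result can be checked.

```python
import numpy as np
import pandas as pd

# Synthetic data generated from Z = 2*X + 3*Y + 1 (coefficients chosen for illustration).
df = pd.DataFrame({"X": [0.0, 1.0, 2.0, 3.0, 4.0], "Y": [1.0, 0.0, 2.0, 5.0, 3.0]})
df["Z"] = 2 * df["X"] + 3 * df["Y"] + 1

# Design matrix: one column per predictor plus a column of ones for the intercept.
design = np.column_stack(
    [df["X"].to_numpy(), df["Y"].to_numpy(), np.ones(len(df))]
)

# Least-squares solution of design @ [a, b, c] = Z.
coeffs, *_ = np.linalg.lstsq(design, df["Z"].to_numpy(), rcond=None)
a, b, c = coeffs

print(f"Z = {a:.3f}*X + {b:.3f}*Y + {c:.3f}")
```

For Z = f(X) alone, drop the Y column from the design matrix. A linear model is only a starting point; real-world data may need a different functional form.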
I have a data frame of the below format
variable val
0 'a','x','y' 10
I would like to unlist (explode) the data into the format below.
variable1 variable2 value
0 a x 10
1 a y 10
2 x y 10
I have tried using df.explode which does not give me the relation between x and y. My code is as below. Can anyone guide me as to how can I proceed further to get the x and y data. Thanks in advance.
import pandas as pd
from ast import literal_eval
data = {'name':["'a','x','y'"], 'val' : [10]}
df = pd.DataFrame(data)
df2 = (df['name'].str.split(',',expand = True, n = 1)
.rename(columns = {0 : 'variable 1', 1 : 'variable 2'})
.join(df.drop(columns = 'name')))
df2['variable 2']=df2['variable 2'].map(literal_eval)
df2=df2.explode('variable 2',ignore_index=True)
print(df2)
OUTPUT:
variable 1 variable 2 val
0 'a' x 10
1 'a' y 10
If you need every pairwise combination of the comma-separated values, use:
print (df)
variable val
0 'a','x','y' 10
1 'a','x','y','f' 80
2 's' 4
from itertools import combinations
df['variable'] = df['variable'].str.replace("'", "", regex=True)
s = [x.split(',') if ',' in x else (x,x) for x in df['variable']]
L = [(*y, z) for x, z in zip(s, df['val']) for y in combinations(x, 2)]
df = pd.DataFrame(L, columns=['variable 1','variable 2','val'])
print (df)
variable 1 variable 2 val
0 a x 10
1 a y 10
2 x y 10
3 a x 80
4 a y 80
5 a f 80
6 x y 80
7 x f 80
8 y f 80
9 s s 4
Quick question: how do I remove rows that have multiple blanks all at once, as opposed to one at a time?
for example:
(table with columns 1, 2, 3 and 4, where some cells contain X and the rest are blank)
My DataFrame has 20k entries, so manually parsing it is not possible. I would like to write a script that does something similar to an Excel rule that says:
If 2,3 and 4 are empty, drop row
Thanks!
df.dropna(thresh=2)
thresh=2 keeps only rows with at least 2 non-NA values. With the 4 columns shown, this drops all rows which have 3 or more NAs.
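If the rule should be exactly "drop the row when columns 2, 3 and 4 are all empty", regardless of column 1, dropna with subset and how="all" may be a closer match. The sketch below uses a hypothetical 3-row frame and assumes blanks are stored as NaN; if they are empty strings, they would need converting first.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is empty in columns 2, 3 and 4.
df = pd.DataFrame(
    {
        1: ["X", "X", "X"],
        2: ["X", np.nan, np.nan],
        3: [np.nan, np.nan, "X"],
        4: ["X", np.nan, np.nan],
    }
)

# Assumption: blanks are NaN. If they are empty strings, convert them first:
# df = df.replace("", np.nan)

# Drop a row only when every one of columns 2, 3 and 4 is missing.
kept = df.dropna(subset=[2, 3, 4], how="all")

print(kept)
```

Unlike thresh, this ignores column 1 entirely, so a row with a value only in column 1 is dropped as well.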
boolean mask
df[(~df.astype(bool)).sum(axis=1).le(2)]
1 2 3 4
0 X X X
1 X X X X
3 X X
4 X X X X
5 X X X X
Suppose I have this dataframe:
data = [[11, 10, 13], [16, 15, 45], [35, 14,9]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C'])
df
The data looks like:
A B C
0 11 10 13
1 16 15 45
2 35 14 9
The real data consists of a hundred columns and thousand rows.
I have a function whose aim is to count how many values in one column are higher than the minimum value of another column. The function looks like this:
def get_count_higher_than_min(df, column_name_string, df_col_based):
    seriesObj = df.apply(lambda x: True if x[column_name_string] > df_col_based.min(skipna=True) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numOfRows
Example output from the function like this:
get_count_higher_than_min(df, 'A', df['B'])
The output is 3. That is because the minimum value of df['B'] is 10 and three values from df['A'] are higher than 10, so the output is 3.
The problem is that I want to compute this for every pair of columns using that function.
I don't know an effective and efficient way to do this. I want the output in a form similar to a confusion matrix or a correlation matrix.
Example output:
A B C
A X 3 X
B X X X
C X X X
This is O(n^2 m), where n is the number of columns and m the number of rows.
minima = df.min()
m = pd.DataFrame({c: (df > minima[c]).sum()
for c in df.columns})
Result:
>>> m
A B C
A 2 3 3
B 2 2 3
C 2 2 2
In theory O(n log(n) m) is possible.
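The same O(n^2 m) computation can also be written with NumPy broadcasting, which avoids the Python-level loop over columns. This is a sketch on the question's sample data; it assumes the frame is fully numeric with no NaN values, and it builds an intermediate boolean array of shape (m, n, n), so memory grows with n^2 m.

```python
import pandas as pd

df = pd.DataFrame(
    [[11, 10, 13], [16, 15, 45], [35, 14, 9]], columns=["A", "B", "C"]
)

values = df.to_numpy()       # shape (m, n)
minima = values.min(axis=0)  # per-column minimum, shape (n,)

# Compare every value of every column against every column minimum:
# (m, n, 1) > (1, 1, n) broadcasts to (m, n, n); summing over rows gives (n, n).
# counts[i, j] = number of values in column i that exceed the minimum of column j.
counts = (values[:, :, None] > minima[None, None, :]).sum(axis=0)

result = pd.DataFrame(counts, index=df.columns, columns=df.columns)
print(result)
```

For the hundred columns and thousand rows mentioned in the question, the intermediate array holds about ten million booleans, which should be manageable.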
from itertools import product
pairs = product(df.columns, repeat=2)
min_value = {}
output = []
for row_col, min_col in pairs:
    # compute each column's minimum only once and cache it
    # (dict.get with a default would still evaluate .min() on every iteration)
    if min_col not in min_value:
        min_value[min_col] = df[min_col].min()
    min_ = min_value[min_col]
    # count the values in row_col that exceed the minimum of min_col
    count = (df[row_col] > min_).sum()
    output.append(count)
df_desired = pd.DataFrame(
    [output[i: i+len(df.columns)] for i in range(0, len(output), len(df.columns))],
    columns=df.columns, index=df.columns)
print(df_desired)
A B C
A 2 3 3
B 2 2 3
C 2 2 2
I have a dataframe with n records indexed (0 - n).
I want to remove a row at the 'x' index from the dataframe and store it elsewhere. I essentially am trying to do the equivalent to performing a pop() from a list in Python. Is there any function or easy way to do this using pandas dataframes?
I've tried using the drop() method but that will only return the same dataframe with the row removed.
# df is the dataframe
row_needed = df.drop([2], axis=0)
Given a dataframe df:
A B C D
0 x y z y
1 x y y y
2 y e r z
I would like the following returned and the df updated as such when I remove the row at index 1:
A B C D
0 x y z y
2 y e r z
Row returned:
1 x y y y
You could simply define your own function which performs the drop inplace but only after storing the desired result:
def drop_return(df, index):
    row = df.loc[index]
    df.drop(index, inplace=True)
    return row
With your given example:
In [16]: df
Out[16]:
A B C D
0 x y z y
1 x y y y
2 y e r z
In [17]: drop_return(df, 1)
Out[17]:
A x
B y
C y
D y
Name: 1, dtype: object
In [18]: df
Out[18]:
A B C D
0 x y z y
2 y e r z
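If you would rather not mutate the original frame, a variant of the same idea returns both the row and a new frame without it. pop_row is a hypothetical helper name, not a pandas function; the sample data mirrors the question's example.

```python
import pandas as pd

def pop_row(df, index):
    """Return (row, remaining) without modifying df. Hypothetical helper."""
    row = df.loc[index]         # the row as a Series, named after its index label
    remaining = df.drop(index)  # a new frame; df itself is left untouched
    return row, remaining

df = pd.DataFrame(
    {
        "A": ["x", "x", "y"],
        "B": ["y", "y", "e"],
        "C": ["z", "y", "r"],
        "D": ["y", "y", "z"],
    }
)

row, df = pop_row(df, 1)
print(row)
print(df)
```

Rebinding df to the returned frame gives the same end state as the in-place version, while other references to the original frame keep seeing all three rows.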