This question already has answers here:
Delete a column from a Pandas DataFrame
I would like to create views or dataframes from an existing dataframe based on column selections.
For example, I would like to create a dataframe df2 from a dataframe df1 that holds all columns from it except two of them. I tried doing the following, but it didn't work:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# Try to create a second dataframe df2 from df with all columns except 'B' and D
my_cols = set(df.columns)
my_cols.remove('B').remove('D')
# This returns an error ("unhashable type: set")
df2 = df[my_cols]
What am I doing wrong? More generally, what mechanisms does pandas offer for picking and excluding arbitrary sets of columns from a dataframe?
You can either drop the columns you do not need or select the ones you want:
# Using DataFrame.drop: drop by position (columns 1 and 2, i.e. B and C)
df.drop(df.columns[[1, 2]], axis=1, inplace=True)
# drop by name
df1 = df1.drop(['B', 'C'], axis=1)
# or select only the columns you want
df1 = df[['A', 'D']]
There is an Index method called difference. It returns the original columns with the columns passed as an argument removed.
Here, the result is used to remove columns B and D from df:
df2 = df[df.columns.difference(['B', 'D'])]
Note that it's a set-based method, so duplicate column names will cause issues, and the column order may be changed.
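For instance, a quick check on a throwaway frame (just to illustrate the reordering):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(3, 4), columns=list('DCBA'))
print(df.columns.difference(['B', 'D']).tolist())  # ['A', 'C'] -- sorted, not in the original D, C, B, A order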
Advantage over drop: you don't create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:
# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns
# does not create a copy of the dataframe
subset = df.columns.difference(['B', 'D'])
df = df.drop_duplicates(subset=subset)
Another option, without dropping or filtering in a loop:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]
# or more simply include columns:
df[['A', 'B']]
# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]
# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically
df[df.columns.difference(['C', 'D'])]
You don't really need to convert that into a set:
cols = [col for col in df.columns if col not in ['B', 'D']]
df2 = df[cols]
Also have a look into the built-in DataFrame.filter function.
Minimalistic but greedy approach (sufficient for the given df):
df.filter(regex="[^BD]")
Conservative/lazy approach (exact matches only):
df.filter(regex="^(?!(B|D)$).*$")
Conservative and generic:
exclude_cols = ['B','C']
df.filter(regex="^(?!({0})$).*$".format('|'.join(exclude_cols)))
You have 4 columns, A, B, C and D.
Here is a better way to select the columns you need for the new dataframe:
df2 = df1[['A','D']]
If you wish to select by column position instead, use:
df2 = df1.iloc[:, [0, 3]]
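If you need a contiguous block of positions rather than a hand-picked list, positional slicing with iloc works too (a small sketch):
df2 = df1.iloc[:, 0:2]  # first two columns by position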
You just need to convert your set to a list
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
my_cols = set(df.columns)
my_cols.remove('B')
my_cols.remove('D')
my_cols = list(my_cols)
df2 = df[my_cols]
Here's how to create a copy of a DataFrame excluding a list of columns:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df2 = df.drop(['B', 'D'], axis=1)
But be careful! You mention views in your question, suggesting that if you changed df, you'd want df2 to change too. (Like a view would in a database.)
This method doesn't achieve that:
>>> df.loc[0, 'A'] = 999 # Change the first value in df
>>> df.head(1)
A B C D
0 999 -0.742688 -1.980673 -0.920133
>>> df2.head(1) # df2 is unchanged. It's not a view, it's a copy!
A C
0 0.251262 -1.980673
Note that this is also true of #piggybox's method. (Although that method is nice and slick and Pythonic. I'm not doing it down!!)
For more on views vs. copies see this SO answer and this part of the Pandas docs which that answer refers to.
In a similar vein, when reading a file, one may wish to exclude columns upfront, rather than wastefully reading unwanted data into memory and later discarding them.
As of pandas 0.20.0, usecols accepts callables. This allows more flexible options for reading columns:
skipcols = [...]
read_csv(..., usecols=lambda x: x not in skipcols)
This pattern is essentially the inverse of the traditional usecols usage: instead of listing the columns to keep, you list the columns to skip.
Given
Data in a file
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
filename = "foo.csv"
df.to_csv(filename)
Code
skipcols = ["B", "D"]
df1 = pd.read_csv(filename, usecols=lambda x: x not in skipcols, index_col=0)
df1
Output
A C
0 0.062350 0.076924
1 -0.016872 1.091446
2 0.213050 1.646109
3 -1.196928 1.153497
4 -0.628839 -0.856529
...
Details
A DataFrame was written to a file. It was then read back as a separate DataFrame, now skipping unwanted columns (B and D).
Note that for the OP's situation, since data is already created, the better approach is the accepted answer, which drops unwanted columns from an extant object. However, the technique presented here is most useful when directly reading data from files into a DataFrame.
A request was raised for a "skipcols" option in this issue and was addressed in a later issue.
Related
Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1:
type_of_fruit  name_of_fruit  price
.....          .....          .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate checking df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But this seems problematic depending on column ordering, and also in cases where there are either more or fewer columns than expected. Where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit), I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and on how to handle the case where there are more columns than expected.
You can use set difference with -:
Assuming df1 has these columns:
In [542]: df1_cols = df1.columns  # ['type_of_fruit', 'name_of_fruit', 'price']
In [543]: expected_cols = ['name_of_fruit', 'price']
In [544]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [545]: df2 = df1[unwanted_cols]  # keep the "dropped" columns around
In [546]: df1.drop(unwanted_cols, axis=1, inplace=True)
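Note that set difference does not preserve the original column order; if that matters, a plain list comprehension does the same job while keeping order (a small sketch):
unwanted_cols = [c for c in df1.columns if c not in expected_cols]
df2 = df1[unwanted_cols]
df1 = df1.drop(unwanted_cols, axis=1)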
Use groupby along the columns axis to split the DataFrame succinctly. Here, whether each column is in your list forms the grouper, and you can store the results in a dict: the True key gets the DataFrame with the columns that are in the list, and the False key gets the columns that are not.
Sample Data
import pandas as pd
df = pd.DataFrame(data = [[1,2,3]],
columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
I'm new to Python and especially to pandas, so I don't really know what I'm doing. I have 10 columns and 100,000 rows of 4-letter strings. I need to filter out rows which don't contain 'DDD' in all of the columns.
I tried to do it with iloc and loc, but it doesn't work:
import pandas as pd
df = pd.read_csv("data_3.csv", delimiter = '!')
df.iloc[:,10:20].str.contains('DDD', regex= False, na = False)
df.head()
It returns me an error: 'DataFrame' object has no attribute 'str'
I suggest doing it without a for loop like this:
df[df.apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only string columns
df[df.select_dtypes(include='object').apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only some string columns
selected_cols = ['A','B']
df[df[selected_cols].apply(lambda x: x.str.contains('DDD')).all(axis=1)]
You can do this, but only if all of your columns are of string type:
for column in df.columns:
    df = df[df[column].str.contains('DDD')]
You can use str.contains, but only on a Series, not on a DataFrame. So to use it we look at each column (which is a Series) one by one, by looping over them:
>>> import pandas as pd
>>> df = pd.DataFrame([['DDDA', 'DDDB', 'DDDC', 'DDDD'],
...                    ['DDDE', 'DDDF', 'DDDG', 'DHDD'],
...                    ['DDDI', 'DDDJ', 'DDDK', 'DDDL'],
...                    ['DMDD', 'DNDN', 'DDOD', 'DDDP']],
...                   columns=['A', 'B', 'C', 'D'])
>>> for column in df.columns:
...     df = df[df[column].str.contains('DDD')]
In our for loop we're overwriting the DataFrame df with df where the column contains 'DDD'. By looping over each column we cut out rows that don't contain 'DDD' in that column until we've looked in all of our columns, leaving only rows that contain 'DDD' in every column.
This gives you:
>>> print(df)
A B C D
0 DDDA DDDB DDDC DDDD
2 DDDI DDDJ DDDK DDDL
As you're only looping over 10 columns this shouldn't be too slow.
Edit: You should probably do it without a for loop as explained by Christian Sloper as it's likely to be faster, but I'll leave this up as it's slightly easier to understand without knowledge of lambda functions.
I need to perform some operations on a pandas DataFrame in order to evaluate some measure, but leave my DataFrame as is. So I thought that I should start by duplicating it in memory:
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame(df1)
When printing
print(id(df1), id(df2))
I get two different memory addresses. So in my sense, these are two different instances of DataFrame().
However, if I do:
df2['b'] = [4,5,6]
print(df1)
df1 appears with a 'b' column, although I only added it in df2.
Why is this happening?
How can I really duplicate my DataFrame so that operations on one do not modify the other?
I am on Python 3.5 and pandas 0.24.2
You need to use pd.DataFrame.copy
df2 = df1.copy()
An assignment, even when you assign to a new variable, references the same data and indices in memory, which means a manipulation of df1 or df2 will change the same underlying data. With copy, however, df2 gets its own copy of the data that can be manipulated independently.
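A minimal sketch showing that the copy really is independent:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = df1.copy()          # deep copy by default: new data, not just a new wrapper
df2['b'] = [4, 5, 6]      # modify only the copy
print(list(df1.columns))  # ['a']       -- df1 is untouched
print(list(df2.columns))  # ['a', 'b']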
Explanation:
Why do you get two different memory addresses when calling the pd.DataFrame on a DataFrame?
Simply put, pandas.DataFrame is a wrapper around numpy.ndarray. When you call pd.DataFrame with the df1 dataframe as input, a new pd.DataFrame wrapper is created (hence the different memory address), but the underlying data is exactly the same. You can verify that with the following code:
In [2]: import pandas as pd
...: df1 = pd.DataFrame({'a':[1,2,3]})
...: df2 = pd.DataFrame(df1)
...:
In [3]: print(id(df1), id(df2))
(4665009296, 4665009360)
In [4]: df1._data
Out[4]:
BlockManager
Items: Index([u'a'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
IntBlock: slice(0, 1, 1), 1 x 3, dtype: int64
In [5]: id(df1._data)
Out[5]: 4522343248
In [6]: id(df2._data)
Out[6]: 4522343248
As you can see, the memory address for df1._data and df2._data is exactly the same.
This is also clear when you read the DataFrame source code on GitHub, where, at the beginning of the constructor, the same data is referenced by the new dataframe:
if isinstance(data, DataFrame):
data = data._data
include_cols_path = sys.argv[5]
with open(include_cols_path) as f:
include_cols = f.read().splitlines()
include_cols is a list of strings
df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True).toPandas()
df1 is a dataframe of a large file. I would like to only retain the columns with names that contain any of the strings in include_cols.
final_cols = [col for col in df.columns.values if col in include_cols]
df = df[final_cols]
Doing this in pandas is certainly a dupe. However, it seems that you are converting a spark DataFrame to a pandas DataFrame.
Instead of performing the (expensive) collect operation and then filtering the columns you want, it's better to just filter on the spark side using select():
df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True)
pandas_df = df1.select(include_cols).toPandas()
You should also think about whether or not converting to a pandas DataFrame is really what you want to do. Just about anything you can do in pandas can also be done in spark.
EDIT
I misunderstood your question originally. Based on your comments, I think this is what you're looking for:
selected_columns = [c for c in df1.columns if any([x in c for x in include_cols])]
pandas_df = df1.select(selected_columns).toPandas()
Explanation:
Iterate through the columns in df1 and keep only those for which at least one of the strings in include_cols is contained in the column name. The any() function returns True if at least one of the conditions is True.
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]
For example:
df1 = pd.DataFrame(data=np.random.random((5, 5)), columns=list('ABCDE'))
include_cols = ['A', 'C', 'Z']
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]
>>> A C
0 0.247271 0.761153
1 0.390240 0.050055
2 0.333401 0.823384
3 0.821196 0.929520
4 0.210226 0.406168
The '|'.join(include_cols) part creates an or-condition from all elements of the input list, in the above example A|C|Z. The condition is True if one of those elements is contained in a column name, checked with the .str.contains() method on the column names.
This question already has answers here:
How do I find numeric columns in Pandas?
In my application I load text files that are structured as follows:
A first non-numeric column (ID)
A number of non-numeric columns (strings)
A number of numeric columns (floats)
The number of the non-numeric columns is variable. Currently I load the data into a DataFrame like this:
source = pandas.read_table(inputfile, index_col=0)
I would like to drop all non-numeric columns in one fell swoop, without knowing their names or indices, since this should be doable by reading their dtype. Is this possible with pandas, or do I have to cook up something on my own?
To avoid using a private method you can also use select_dtypes, where you can either include or exclude the dtypes you want.
Ran into it on this post on the exact same thing.
Or in your case, specifically:
source.select_dtypes(['number']) or source.select_dtypes([np.number])
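For example, a small sketch with a made-up frame (the column names here are just for illustration):
import numpy as np
import pandas as pd
source = pd.DataFrame({'ID': ['x1', 'x2'],
                       'label': ['foo', 'bar'],
                       'v1': [1.5, 2.5],
                       'v2': [3, 4]})
source.select_dtypes(include=[np.number])   # keeps v1 (float) and v2 (int)
source.select_dtypes(exclude=['object'])    # drops the string columns ID and label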
It's a private method, but it will do the trick: source._get_numeric_data()
In [2]: import pandas as pd
In [3]: source = pd.DataFrame({'A': ['foo', 'bar'], 'B': [1, 2], 'C': [(1,2), (3,4)]})
In [4]: source
Out[4]:
A B C
0 foo 1 (1, 2)
1 bar 2 (3, 4)
In [5]: source._get_numeric_data()
Out[5]:
B
0 1
1 2
This removes every column whose dtype is not float64:
df = pd.read_csv('sample.csv', index_col=0)
non_floats = []
for col in df:
if df[col].dtypes != "float64":
non_floats.append(col)
df = df.drop(columns=non_floats)
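If you really only want float64 columns, the same result should also be achievable in one line with select_dtypes:
df = df.select_dtypes(include=['float64'])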
I also have another possible solution for dropping the columns with categorical values in two lines of code: the first line collects the categorical columns (using select_dtypes), and the second line drops them. df is our DataFrame:
to_be_dropped = df.select_dtypes(include=['category', 'object']).columns
df = df.drop(to_be_dropped, axis=1)
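A small sketch on a toy frame (the column names are made up for illustration):
import pandas as pd
df = pd.DataFrame({'fruit': pd.Categorical(['apple', 'pear']),
                   'color': ['red', 'green'],
                   'price': [1.2, 0.8]})
to_be_dropped = df.select_dtypes(include=['category', 'object']).columns
df = df.drop(to_be_dropped, axis=1)
print(df)  # only the numeric 'price' column remains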