Drop non-numeric columns from a pandas DataFrame [duplicate] - python

This question already has answers here:
How do I find numeric columns in Pandas?
(13 answers)
Closed 3 years ago.
In my application I load text files that are structured as follows:
First non numeric column (ID)
A number of non-numeric columns (strings)
A number of numeric columns (floats)
The number of the non-numeric columns is variable. Currently I load the data into a DataFrame like this:
source = pandas.read_table(inputfile, index_col=0)
I would like to drop all non-numeric columns in one fell swoop, without knowing their names or indices, since this could be doable reading their dtype. Is this possible with pandas or do I have to cook up something on my own?

To avoid using a private method you can also use select_dtypes, where you can either include or exclude the dtypes you want.
Ran into it on this post on the exact same thing.
Or in your case, specifically:
source.select_dtypes(['number']) or source.select_dtypes([np.number]

It`s a private method, but it will do the trick: source._get_numeric_data()
In [2]: import pandas as pd
In [3]: source = pd.DataFrame({'A': ['foo', 'bar'], 'B': [1, 2], 'C': [(1,2), (3,4)]})
In [4]: source
Out[4]:
A B C
0 foo 1 (1, 2)
1 bar 2 (3, 4)
In [5]: source._get_numeric_data()
Out[5]:
B
0 1
1 2

This would remove each column which doesn't include float64 numerics.
df = pd.read_csv('sample.csv', index_col=0)
non_floats = []
for col in df:
if df[col].dtypes != "float64":
non_floats.append(col)
df = df.drop(columns=non_floats)

I also have another possible solution for dropping the columns with categorical value with 2 lines of code, defining a list with columns of categorical values (1st line) and dropping them with the second line. df is our DataFrame
df before dropping:
to_be_dropped=pd.DataFrame(df.categorical).columns
df= df.drop(to_be_dropped,axis=1)
df after dropping:

Related

How do I filter out multiple columns witha certain string in Python

I'm new to python and especially to pandas so I don't really know what I'm doing. I have 10 columns with 100000 rows and 4 letter strings. I need to filter out rows which don't contain 'DDD' in all of the columns/rows.
I tried to do it with iloc and loc, but it doesn't work:
import pandas as pd
df = pd.read_csv("data_3.csv", delimiter = '!')
df.iloc[:,10:20].str.contains('DDD', regex= False, na = False)
df.head()
It returns me an error: 'DataFrame' object has no attribute 'str'
I suggest doing it without a for loop like this:
df[df.apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only string columns
df[df.select_dtypes(include='object').apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only some string columns
selected_cols = ['A','B']
df[df[selected_cols].apply(lambda x: x.str.contains('DDD')).all(axis=1)]
You can do this but if your all column type is StringType:
for column in foo.columns:
df = df[~df[c].str.contains('DDD')]
You can use str.contains, but only on Series not on DataFrames. So to use it we look at each column (which is a series) one by one by for looping over them:
>>> import pandas as pd
>>> df = pd.DataFrame([['DDDA', 'DDDB', 'DDDC', 'DDDD'],
['DDDE', 'DDDF', 'DDDG', 'DHDD'],
['DDDI', 'DDDJ', 'DDDK', 'DDDL'],
['DMDD', 'DNDN', 'DDOD', 'DDDP']],
columns=['A', 'B', 'C', 'D'])
>>> for column in df.columns:
df = df[df[column].str.contains('DDD')]
In our for loop we're overwriting the DataFrame df with df where the column contains 'DDD'. By looping over each column we cut out rows that don't contain 'DDD' in that column until we've looked in all of our columns, leaving only rows that contain 'DDD' in every column.
This gives you:
>>> print(df)
A B C D
0 DDDA DDDB DDDC DDDD
2 DDDI DDDJ DDDK DDDL
As you're only looping over 10 columns this shouldn't be too slow.
Edit: You should probably do it without a for loop as explained by Christian Sloper as it's likely to be faster, but I'll leave this up as it's slightly easier to understand without knowledge of lambda functions.

Convert pandas data frame to series

I'm somewhat new to pandas. I have a pandas data frame that is 1 row by 23 columns.
I want to convert this into a series? I'm wondering what the most pythonic way to do this is?
I've tried pd.Series(myResults) but it complains ValueError: cannot copy sequence with size 23 to array axis with dimension 1. It's not smart enough to realize it's still a "vector" in math terms.
Thanks!
You can transpose the single-row dataframe (which still results in a dataframe) and then squeeze the results into a series (the inverse of to_frame).
df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df.squeeze(axis=0)
a0 0
a1 1
a2 2
a3 3
a4 4
Name: 0, dtype: int64
Note: To accommodate the point raised by #IanS (even though it is not in the OP's question), test for the dataframe's size. I am assuming that df is a dataframe, but the edge cases are an empty dataframe, a dataframe of shape (1, 1), and a dataframe with more than one row in which case the use should implement their desired functionality.
if df.empty:
# Empty dataframe, so convert to empty Series.
result = pd.Series()
elif df.shape == (1, 1)
# DataFrame with one value, so convert to series with appropriate index.
result = pd.Series(df.iat[0, 0], index=df.columns)
elif len(df) == 1:
# Convert to series per OP's question.
result = df.T.squeeze()
else:
# Dataframe with multiple rows. Implement desired behavior.
pass
This can also be simplified along the lines of the answer provided by #themachinist.
if len(df) > 1:
# Dataframe with multiple rows. Implement desired behavior.
pass
else:
result = pd.Series() if df.empty else df.iloc[0, :]
It's not smart enough to realize it's still a "vector" in math terms.
Say rather that it's smart enough to recognize a difference in dimensionality. :-)
I think the simplest thing you can do is select that row positionally using iloc, which gives you a Series with the columns as the new index and the values as the values:
>>> df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df
a0 a1 a2 a3 a4
0 0 1 2 3 4
>>> df.iloc[0]
a0 0
a1 1
a2 2
a3 3
a4 4
Name: 0, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>
You can retrieve the series through slicing your dataframe using one of these two methods:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randn(1,8))
series1=df.iloc[0,:]
type(series1)
pandas.core.series.Series
You can also use stack()
df= DataFrame([list(range(5))], columns = [“a{}”.format(I) for I in range(5)])
After u run df, then run:
df.stack()
You obtain your dataframe in series
If you have a one column dataframe df, you can convert it to a series:
df.iloc[:,0] # pandas Series
Since you have a one row dataframe df, you can transpose it so you're in the previous case:
df.T.iloc[:,0]
Another way -
Suppose myResult is the dataFrame that contains your data in the form of 1 col and 23 rows
# label your columns by passing a list of names
myResult.columns = ['firstCol']
# fetch the column in this way, which will return you a series
myResult = myResult['firstCol']
print(type(myResult))
In similar fashion, you can get series from Dataframe with multiple columns.
data = pd.DataFrame({"a":[1,2,3,34],"b":[5,6,7,8]})
new_data = pd.melt(data)
new_data.set_index("variable", inplace=True)
This gives a dataframe with index as column name of data and all data are present in "values" column
Another way is very simple
df= df.iloc[3].reset_index(drop=True).squeeze()
Squeeze -> is the one that converts to Series.

How to drop rows from pandas data frame that contains a particular string in a particular column? [duplicate]

This question already has answers here:
Search for "does-not-contain" on a DataFrame in pandas
(9 answers)
Closed 4 years ago.
I have a very large data frame in python and I want to drop all rows that have a particular string inside a particular column.
For example, I want to drop all rows which have the string "XYZ" as a substring in the column C of the data frame.
Can this be implemented in an efficient way using .drop() method?
pandas has vectorized string operations, so you can just filter out the rows that contain the string you don't want:
In [91]: df = pd.DataFrame(dict(A=[5,3,5,6], C=["foo","bar","fooXYZbar", "bat"]))
In [92]: df
Out[92]:
A C
0 5 foo
1 3 bar
2 5 fooXYZbar
3 6 bat
In [93]: df[~df.C.str.contains("XYZ")]
Out[93]:
A C
0 5 foo
1 3 bar
3 6 bat
If your string constraint is not just one string you can drop those corresponding rows with:
df = df[~df['your column'].isin(['list of strings'])]
The above will drop all rows containing elements of your list
This will only work if you want to compare exact strings.
It will not work in case you want to check if the column string contains any of the strings in the list.
The right way to compare with a list would be :
searchfor = ['john', 'doe']
df = df[~df.col.str.contains('|'.join(searchfor))]
Slight modification to the code. Having na=False will skip empty values. Otherwise you can get an error TypeError: bad operand type for unary ~: float
df[~df.C.str.contains("XYZ", na=False)]
Source: TypeError: bad operand type for unary ~: float
new_df = df[df.C != 'XYZ']
Reference: https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/
The below code will give you list of all the rows:-
df[df['C'] != 'XYZ']
To store the values from the above code into a dataframe :-
newdf = df[df['C'] != 'XYZ']
if you do not want to delete all NaN, use
df[~df.C.str.contains("XYZ") == True]

"Expanding" pandas dataframe by using cell-contained list

I have a dataframe in which third column is a list:
import pandas as pd
pd.DataFrame([[1,2,['a','b','c']]])
I would like to separate that nest and create more rows with identical values of first and second column.
The end result should be something like:
pd.DataFrame([[[1,2,'a']],[[1,2,'b']],[[1,2,'c']]])
Note, this is simplified example. In reality I have multiple rows that I would like to "expand".
Regarding my progress, I have no idea how to solve this. Well, I imagine that I could take each member of nested list while having other column values in mind. Then I would use the list comprehension to make more list. I would continue so by and add many lists to create a new dataframe... But this seems just a bit too complex. What about simpler solution?
Create the dataframe with a single column, then add columns with constant values:
import pandas as pd
df = pd.DataFrame({"data": ['a', 'b', 'c']})
df['col1'] = 1
df['col2'] = 2
print df
This prints:
data col1 col2
0 a 1 2
1 b 1 2
2 c 1 2
Not exactly the same issue that the OR described, but related - and more pandas-like - is the situation where you have a dict of lists with lists of unequal lengths. In that case, you can create a DataFrame like this in long format.
import pandas as pd
my_dict = {'a': [1,2,3,4], 'b': [2,3]}
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.unstack() # to format it in long form
df = df.dropna() # to drop nan values which were generated by having lists of unequal length
df.index = df.index.droplevel(level=0) # if you don't want to store the index in the list
# NOTE this last step results duplicate indexes

Selecting/excluding sets of columns in pandas [duplicate]

This question already has answers here:
Delete a column from a Pandas DataFrame
(20 answers)
Closed 4 years ago.
I would like to create views or dataframes from an existing dataframe based on column selections.
For example, I would like to create a dataframe df2 from a dataframe df1 that holds all columns from it except two of them. I tried doing the following, but it didn't work:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# Try to create a second dataframe df2 from df with all columns except 'B' and D
my_cols = set(df.columns)
my_cols.remove('B').remove('D')
# This returns an error ("unhashable type: set")
df2 = df[my_cols]
What am I doing wrong? Perhaps more generally, what mechanisms does pandas have to support the picking and exclusions of arbitrary sets of columns from a dataframe?
You can either Drop the columns you do not need OR Select the ones you need
# Using DataFrame.drop
df.drop(df.columns[[1, 2]], axis=1, inplace=True)
# drop by Name
df1 = df1.drop(['B', 'C'], axis=1)
# Select the ones you want
df1 = df[['a','d']]
There is a new index method called difference. It returns the original columns, with the columns passed as argument removed.
Here, the result is used to remove columns B and D from df:
df2 = df[df.columns.difference(['B', 'D'])]
Note that it's a set-based method, so duplicate column names will cause issues, and the column order may be changed.
Advantage over drop: you don't create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:
# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns
# does not create a copy the dataframe
subset = df.columns.difference(['B', 'D'])
df = df.drop_duplicates(subset=subset)
Another option, without dropping or filtering in a loop:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]
# or more simply include columns:
df[['A', 'B']]
# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]
# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically
df[df.columns.difference(['C', 'D'])]
You don't really need to convert that into a set:
cols = [col for col in df.columns if col not in ['B', 'D']]
df2 = df[cols]
Also have a look into the built-in DataFrame.filter function.
Minimalistic but greedy approach (sufficient for the given df):
df.filter(regex="[^BD]")
Conservative/lazy approach (exact matches only):
df.filter(regex="^(?!(B|D)$).*$")
Conservative and generic:
exclude_cols = ['B','C']
df.filter(regex="^(?!({0})$).*$".format('|'.join(exclude_cols)))
You have 4 columns A,B,C,D
Here is a better way to select the columns you need for the new dataframe:-
df2 = df1[['A','D']]
if you wish to use column numbers instead, use:-
df2 = df1[[0,3]]
You just need to convert your set to a list
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
my_cols = set(df.columns)
my_cols.remove('B')
my_cols.remove('D')
my_cols = list(my_cols)
df2 = df[my_cols]
Here's how to create a copy of a DataFrame excluding a list of columns:
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df2 = df.drop(['B', 'D'], axis=1)
But be careful! You mention views in your question, suggesting that if you changed df, you'd want df2 to change too. (Like a view would in a database.)
This method doesn't achieve that:
>>> df.loc[0, 'A'] = 999 # Change the first value in df
>>> df.head(1)
A B C D
0 999 -0.742688 -1.980673 -0.920133
>>> df2.head(1) # df2 is unchanged. It's not a view, it's a copy!
A C
0 0.251262 -1.980673
Note also that this is also true of #piggybox's method. (Although that method is nice and slick and Pythonic. I'm not doing it down!!)
For more on views vs. copies see this SO answer and this part of the Pandas docs which that answer refers to.
In a similar vein, when reading a file, one may wish to exclude columns upfront, rather than wastefully reading unwanted data into memory and later discarding them.
As of pandas 0.20.0, usecols now accepts callables.1 This update allows more flexible options for reading columns:
skipcols = [...]
read_csv(..., usecols=lambda x: x not in skipcols)
The latter pattern is essentially the inverse of the traditional usecols method - only specified columns are skipped.
Given
Data in a file
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
filename = "foo.csv"
df.to_csv(filename)
Code
skipcols = ["B", "D"]
df1 = pd.read_csv(filename, usecols=lambda x: x not in skipcols, index_col=0)
df1
Output
A C
0 0.062350 0.076924
1 -0.016872 1.091446
2 0.213050 1.646109
3 -1.196928 1.153497
4 -0.628839 -0.856529
...
Details
A DataFrame was written to a file. It was then read back as a separate DataFrame, now skipping unwanted columns (B and D).
Note that for the OP's situation, since data is already created, the better approach is the accepted answer, which drops unwanted columns from an extant object. However, the technique presented here is most useful when directly reading data from files into a DataFrame.
A request was raised for a "skipcols" option in this issue and was addressed in a later issue.

Categories