I have a list of columns in a Pandas DataFrame and am looking to create a list of certain columns without entering them manually.
My issue is that I am still learning and not knowledgeable enough yet.
I have tried searching around the internet, but nothing quite matched my case. I apologize if this is a duplicate.
The list I am trying to cut from looks like this:
['model',
'displ',
'cyl',
'trans',
'drive',
'fuel',
'veh_class',
'air_pollution_score',
'city_mpg',
'hwy_mpg',
'cmb_mpg',
'greenhouse_gas_score',
'smartway']
Here is the code that I wrote on my own: dataframe.columns.tolist()[:6,8:10,11]
In this scenario I am trying to select everything but 'air_pollution_score' and 'greenhouse_gas_score'.
My ultimate goal is to understand the syntax and how to select pieces of a list.
You could do that, or you could just use drop to remove the columns you don't want:
dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns
Note that you need to specify axis=1 so that pandas knows you want to remove columns, not rows.
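For example, a minimal sketch (assuming dataframe holds the columns from your question) that turns the result into a plain Python list:
remaining = dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns.tolist()
# remaining contains every original column name except the two dropped ones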
Even if you wanted to use list syntax, I would say that it's better to use a list comprehension instead; something like this:
exclude_columns = ['air_pollution_score', 'greenhouse_gas_score']
[col for col in dataframe.columns if col not in exclude_columns]
This gets all the columns in the dataframe unless they are present in exclude_columns.
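To then select those columns from the DataFrame itself, you can pass the comprehension straight into the indexing operator (a sketch reusing the same exclude_columns as above):
dataframe[[col for col in dataframe.columns if col not in exclude_columns]]
# returns a new DataFrame without the excluded columns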
Let's say df is your dataframe. You can actually use filter and lambda, though it quickly becomes rather long. I present this as a "one-liner" alternative to the answer of @gmds.
df[
    list(filter(
        lambda x: ('air_pollution_score' not in x) and ('greenhouse_gas_score' not in x),
        df.columns.values
    ))
]
Here's what's happening:
filter applies a function to a list and keeps only the elements for which that function returns True.
We defined that function with lambda so it checks that neither 'air_pollution_score' nor 'greenhouse_gas_score' appears in the column name.
We're filtering the df.columns.values list, so the resulting list retains only the columns we didn't exclude.
We're using the df[['column1', 'column2']] syntax, which means "make a new dataframe containing only the columns I specify."
Simple solution with pandas
import pandas as pd
data = pd.read_csv('path to your csv file')
df = data[['column1', 'column2', 'column3']]  # add as many column names as you need
Note: data is the source you have already loaded using pandas; the selected columns are stored in a new DataFrame df.
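If you know the column names up front, a sketch of an alternative that also saves memory is to pass usecols to read_csv, so the unwanted columns are never loaded at all (the path and column names here are placeholders):
import pandas as pd
df = pd.read_csv('path to your csv file', usecols=['column1', 'column2', 'column3'])
# only the listed columns are parsed and loaded into the DataFrame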
Related
I've got some big CSVs. They can easily have over 300k rows and 500 columns, so obviously I'd like to get rid of some unneeded data in the resulting dataframe to save resources.
There are some fixed, labeled columns, and also a variable number of columns with similar labels that are numbered.
example=pd.DataFrame(columns=["fix","variable 1","variable 2","waste 1","waste 2"])
I want to get all these variable columns, which I can get via
example.filter(regex="var")
but I want to include "fix" as well. As df.loc doesn't allow regexes and df.filter only supports a single argument, is there a smooth way to do this? Or do I have to create a quite complex callable?
Thanks in advance.
Just modify your regex to do a full match for "fix":
df.filter(regex=r"var|(^fix$)")
Empty DataFrame
Columns: [fix, variable 1, variable 2]
Index: []
Another option is using Index.str.contains in the same fashion:
df.loc[:, df.columns.str.contains(r'var|(?:^fix$)')]
Empty DataFrame
Columns: [fix, variable 1, variable 2]
Index: []
I made the group non-capturing, otherwise pandas complains.
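If the set of exact names can grow, one sketch is to build the regex programmatically, escaping each fixed name in case it contains regex metacharacters (fixed here is a hypothetical list you would maintain):
import re
fixed = ['fix']
pattern = 'var|' + '|'.join(f'^{re.escape(name)}$' for name in fixed)
df.filter(regex=pattern)  # selects 'fix' plus every column containing 'var'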
I have a pandas data frame with different data types. I want to convert more than one column in the data frame to string type. I have done it individually for each column, but want to know if there is a more efficient way.
So at present I am doing something like this:
repair['SCENARIO'] = repair['SCENARIO'].astype(str)
repair['SERVICE_TYPE'] = repair['SERVICE_TYPE'].astype(str)
I want a function that would help me pass multiple columns and convert them to strings.
To convert multiple columns to string, pass a list of columns to the command you mentioned above:
df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(str)
# add as many column names as you like.
That means that one way to convert all columns is to construct the list of columns like this:
all_columns = list(df) # Creates list of all column headers
df[all_columns] = df[all_columns].astype(str)
Note that the latter can also be done directly, with df = df.astype(str).
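astype also accepts a dict mapping column names to dtypes, which lets you convert several columns in one call; a sketch using the columns from the question:
repair = repair.astype({'SCENARIO': str, 'SERVICE_TYPE': str})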
I know this is an old question, but I was looking for a way to turn all columns with an object dtype to strings as a workaround for a bug I discovered in rpy2. I'm working with large dataframes, so didn't want to list each column explicitly. This seemed to work well for me so I thought I'd share in case it helps someone else.
stringcols = df.select_dtypes(include='object').columns
df[stringcols] = df[stringcols].fillna('').astype(str)
The "fillna('')" prevents NaN entries from getting converted to the string 'nan' by replacing with an empty string instead.
You can also use a list comprehension, rebuilding the frame with pd.concat so the result remains a DataFrame:
df = pd.concat([df[col_name].astype(str) for col_name in df.columns], axis=1)
You can also insert a condition to test whether a column should be converted, for example:
cols_to_convert = [col_name for col_name in df.columns if 'to_str' in col_name]
df[cols_to_convert] = df[cols_to_convert].astype(str)
new to pandas here. I have a df:
inked = tracker[['A', 'B', 'C', 'D', 'AA', 'BB', 'CC', 'DD', 'E', 'F']]
Single-letter column names contain names, and double-letter column names contain numbers but also NaN.
I am converting all NaN to zeros using this:
inked.loc[:,'AA':'DD'].fillna(0)
and it works, but when I do
inked.head()
I get the original df with the NaN. How can I make the change permanent in the df?
By default, fillna() is not performed in place. If you were operating directly on the DataFrame, then you could use the inplace=True argument, like this:
inked.fillna(0, inplace=True)
However, if you first select a subset of the columns using loc, the fillna is applied to a temporary object and its result is discarded.
This was covered here. Basically, you need to re-assign the updated DataFrame back to the original DataFrame. For a list of columns (rather than a range, like you originally tried), you can do this:
inked[['AA','DD']] = inked[['AA','DD']].fillna(0)
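As an alternative sketch, fillna also accepts a dict of {column: fill_value}, so you can fill just the numeric block in one call (reusing the 'AA':'DD' range from the question):
cols = inked.loc[:, 'AA':'DD'].columns
inked = inked.fillna(dict.fromkeys(cols, 0))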
In general, when performing dataframe operations, if you want the changes to stick you need to re-assign the result, either back into the slice you selected or to a variable. (In my experience at least)
inked.loc[:, 'AA':'DD'] = inked.loc[:, 'AA':'DD'].fillna(0)
Apologies as I'm new to all this.
I'm playing around with pandas at the moment. I want to drop one particular column across two dataframes stored within a list. This is what I've written.
combine = [train, test]
for dataset in combine:
    dataset = dataset.drop('Id', axis=1)
However, this doesn't work. If I do this explicitly, such as train = train.drop('Id', axis=1), this works fine.
I appreciate in this case it's two lines either way, but is there some way I can use the list of dataframes to drop the column from both?
The reason your solution didn't work is that dataset is a name that points to an item in the list combine. You had the right idea to reassign it with dataset = dataset.drop('Id', axis=1), but all you did was overwrite the name dataset; you never placed a new dataframe into the list combine.
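The same thing happens with plain Python objects, which may make it easier to see (a minimal sketch with integers):
items = [1, 2, 3]
for x in items:
    x = x + 10  # rebinds the name x; the list is untouched
print(items)  # [1, 2, 3]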
Option 1
Create a new list
combine = [d.drop('Id', axis=1) for d in combine]
Option 2
Or alter each dataframe in place with inplace=True
for d in combine:
    d.drop('Id', axis=1, inplace=True)
Or maybe
combine = [df1, df2]
for i in range(len(combine)):
    combine[i] = combine[i].drop('Id', axis=1)
Two quick pandas questions for you.
I have a list of dataframes I would like to apply a filter to.
countries = [us, uk, france]
for df in countries:
    df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
When I run this, the df's don't change afterwards. Why is that?
If I loop through the dataframes to create a new column, as below, this works fine, and changes each df in the list.
for df in countries:
    df["Continent"] = "Europe"
As a follow up question, I noticed something strange when I created a list of dataframes for different countries. I defined the list then applied transformations to each df in the list. After I transformed these different dfs, I called the list again. I was surprised to see that the list still pointed to the unchanged dataframes, and I had to redefine the list to update the results. Could anybody shed any light on why that is?
Taking a look at this answer, you can see that for df in countries: is equivalent to something like
for idx in range(len(countries)):
    df = countries[idx]
    # do something with df
which obviously won't actually modify anything in your list. It is generally bad practice to modify a list while iterating over it in a loop like this.
A better approach would be a list comprehension, you can try something like
countries = [us, uk, france]
countries = [df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
             for df in countries]
Notice that with a list comprehension like this, we aren't actually modifying the original list - instead we are creating a new list, and assigning it to the variable which held our original list.
Also, you might consider placing all of your data in a single DataFrame with an additional country column or something along those lines - Python-level loops are generally slower and a list of DataFrames is often much less convenient to work with than a single DataFrame, which can fully leverage the vectorized pandas methods.
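For example, a sketch of that single-DataFrame approach (assuming us, uk, and france are defined as in the question):
import pandas as pd
combined = pd.concat(
    [us.assign(country='us'), uk.assign(country='uk'), france.assign(country='france')],
    ignore_index=True,
)
mask = (combined["Send Date"] > '2016-11-01') & (combined["Send Date"] < '2016-11-30')
november = combined[mask]  # one vectorized filter instead of a Python-level loop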
For why
for df in countries:
    df["Continent"] = "Europe"
modifies countries, while
for df in countries:
    df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
does not, see why should I make a copy of a data frame in pandas. df is a reference to the actual DataFrame in countries, not the DataFrame itself, and modifications made through that reference affect the original DataFrame. Declaring a new column is such a modification. Taking a subset, however, is not: df = df[...] simply rebinds the name df to a brand-new DataFrame, leaving the original element of countries untouched.
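A short sketch that makes the distinction concrete (using a throwaway DataFrame):
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2]})
countries = [df1]
for df in countries:
    df['b'] = 0           # mutates the object itself; countries[0] now has column 'b'
    df = df[df['a'] > 1]  # rebinds the name df; countries[0] is unchanged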