I have a pandas data frame with different data types. I want to convert more than one column in the data frame to string type. I have done it individually for each column, but is there a more efficient way?
So at present I am doing something like this:
repair['SCENARIO']=repair['SCENARIO'].astype(str)
repair['SERVICE_TYPE']= repair['SERVICE_TYPE'].astype(str)
I want a function that would help me pass multiple columns and convert them to strings.
To convert multiple columns to string, pass a list of columns to the same command:
df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(str)
# add as many column names as you like.
That means that one way to convert all columns is to construct the list of columns like this:
all_columns = list(df) # Creates list of all column headers
df[all_columns] = df[all_columns].astype(str)
Note that converting all columns can also be done directly with df = df.astype(str), without building the column list first.
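For example, a minimal sketch (column names are made up) showing the direct call, plus the dict form of astype if you only want to name certain columns:

import pandas as pd

# A toy frame (invented data) just to illustrate:
df = pd.DataFrame({'one': [1, 2], 'two': [3.5, 4.5], 'three': ['a', 'b']})

# Convert every column at once:
df_all_str = df.astype(str)

# Or pass a dict of dtypes to convert only the columns you name:
df = df.astype({'one': str, 'two': str})

print(df.dtypes)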
I know this is an old question, but I was looking for a way to turn all columns with an object dtype to strings as a workaround for a bug I discovered in rpy2. I'm working with large dataframes, so didn't want to list each column explicitly. This seemed to work well for me so I thought I'd share in case it helps someone else.
stringcols = df.select_dtypes(include='object').columns
df[stringcols] = df[stringcols].fillna('').astype(str)
The "fillna('')" prevents NaN entries from getting converted to the string 'nan' by replacing with an empty string instead.
You can also use a list comprehension, wrapping it in pd.concat so the result is a DataFrame again:
df = pd.concat([df[col_name].astype(str) for col_name in df.columns], axis=1)
You can also add a condition to pick which columns get converted (note that this keeps only the matching columns), for example:
df = pd.concat([df[col_name].astype(str) for col_name in df.columns if 'to_str' in col_name], axis=1)
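A quick usage sketch of the filtered version (column names invented):

import pandas as pd

df = pd.DataFrame({'id_to_str': [1, 2], 'amount_to_str': [3.5, 4.5], 'other': [5, 6]})

converted = pd.concat(
    [df[col].astype(str) for col in df.columns if 'to_str' in col],
    axis=1,
)
print(converted.dtypes)  # both 'to_str' columns are now object (string) dtype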
I have a rather messy dataframe in which I need to assign the first 3 rows as multilevel column names.
This is my dataframe and I need index 3, 4 and 5 to be my multiindex column names.
For example, 'MINERAL TOTAL' should be the level 0 until next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I actually need is to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas reads in the top row as the sole header row. You can pass a header argument to pandas.read_excel() indicating which rows are to be used as headers; this can be either an int or a list of ints. See the pandas.read_excel() documentation for more information.
As you mentioned, you are unable to use pandas.read_excel(); however, if you already have a DataFrame of the data, you can use pandas.MultiIndex.from_arrays(). First you need an array of the header rows, which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is that this includes the NaN values in the new MultiIndex header. To get around this, you could create a function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    return pd.Series(iterable).ffill().to_list()

zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())
array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1, and 2) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
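Putting the whole recipe together, here is a self-contained sketch (the values are invented; the labels are borrowed from the question), where rows 0-2 hold the patchy header and the remaining rows hold data:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['MINERAL TOTAL',    np.nan,  np.nan],
    ['TRATAMIENTO (ts)', np.nan,  'LEY Cu(%)'],
    ['t/h',              'm3/h',  '%'],
    [100, 200, 1.5],
    [110, 210, 1.6],
])

def forward_fill(iterable):
    return pd.Series(iterable).ffill().to_list()

# Forward-fill each of the three header rows, then build the MultiIndex.
array = [forward_fill(df.iloc[i].to_list()) for i in range(3)]
df.columns = pd.MultiIndex.from_arrays(array)

# Drop the header rows and reindex.
df.drop(index=[0, 1, 2], inplace=True)
df.reset_index(drop=True, inplace=True)

print(df.columns.tolist())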
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.ffill().set_index([3, 4, 5]).T
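A toy sketch of that one-liner (invented data), assuming, as in the question, that the header really does sit at the rows labelled 3, 4 and 5:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['MINERAL TOTAL',    np.nan,  np.nan],
        ['TRATAMIENTO (ts)', np.nan,  'LEY Cu(%)'],
        ['t/h',              'm3/h',  '%'],
        [100, 200, 1.5],
        [110, 210, 1.6],
    ],
    index=[3, 4, 5, 6, 7],  # mimic the question: header rows are labelled 3, 4, 5
)

# Transpose, forward-fill the (now vertical) header values, set them as the index,
# and transpose back so they become a column MultiIndex.
result = df.T.ffill().set_index([3, 4, 5]).T
print(result.columns.tolist())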
I have a list of columns in a Pandas DataFrame and am looking to create a list of certain columns without manual entry.
My issue is that I am learning and not knowledgeable enough yet.
I have tried searching around the internet but nothing was quite my case. I apologize if there is a duplicate.
The list I am trying to cut from looks like this:
['model',
'displ',
'cyl',
'trans',
'drive',
'fuel',
'veh_class',
'air_pollution_score',
'city_mpg',
'hwy_mpg',
'cmb_mpg',
'greenhouse_gas_score',
'smartway']
Here is the code that I wrote on my own: dataframe.columns.tolist()[:6,8:10,11]
In this scenario I am trying to select everything except 'air_pollution_score' and 'greenhouse_gas_score'.
My ultimate goal is to understand the syntax and how to select pieces of a list.
You could do that, or you could just use drop to remove the columns you don't want:
dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns
Note that you need to specify axis=1 so that pandas knows you want to remove columns, not rows.
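A runnable sketch with a dummy frame built from your column list, including the equivalent columns= spelling of drop (no axis argument needed):

import pandas as pd

cols = ['model', 'displ', 'cyl', 'trans', 'drive', 'fuel', 'veh_class',
        'air_pollution_score', 'city_mpg', 'hwy_mpg', 'cmb_mpg',
        'greenhouse_gas_score', 'smartway']
dataframe = pd.DataFrame(columns=cols)

kept = dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns
# Equivalent, using the columns= keyword instead of axis=1:
kept = dataframe.drop(columns=['air_pollution_score', 'greenhouse_gas_score']).columns

print(list(kept))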
Even if you wanted to use list syntax, I would say that it's better to use a list comprehension instead; something like this:
exclude_columns = ['air_pollution_score', 'greenhouse_gas_score']
[col for col in dataframe.columns if col not in exclude_columns]
This gets all the columns in the dataframe unless they are present in exclude_columns.
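And a short usage sketch showing the comprehension result being used to subset the frame (dummy data, only a few of your columns shown):

import pandas as pd

dataframe = pd.DataFrame(columns=['model', 'displ', 'air_pollution_score',
                                  'greenhouse_gas_score', 'smartway'])

exclude_columns = ['air_pollution_score', 'greenhouse_gas_score']
keep = [col for col in dataframe.columns if col not in exclude_columns]

subset = dataframe[keep]  # a new frame containing only the kept columns
print(subset.columns.tolist())  # ['model', 'displ', 'smartway']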
Let's say df is your dataframe. You can actually use filter and a lambda, though it quickly becomes rather long. I present this as a "one-liner" alternative to the answer by @gmds.
df[
    list(filter(
        lambda x: ('air_pollution_score' not in x) and ('greenhouse_gas_score' not in x),
        df.columns.values
    ))
]
What's happening here is:
filter applies a function to an iterable and keeps only the elements for which the function returns True.
We defined that function with a lambda that checks that neither 'air_pollution_score' nor 'greenhouse_gas_score' appears in the column name.
We're filtering the df.columns.values list, so the resulting list only retains the columns we didn't exclude.
We're using the df[['column1', 'column2']] syntax, which is "make a new dataframe but only containing the 2 columns I define."
Simple solution with pandas
import pandas as pd
data = pd.read_csv('path to your csv file')
df = data[['column1', 'column2', 'column3']]  # add as many column names as you need
Note: data is the DataFrame you have already loaded with pandas; the selected columns are stored in the new DataFrame df.
I've got some big csv's. They can easily have over 300k rows and 500 columns, so obviously I'd like to get rid of some unneeded data in the resulting dataframe to save resources.
There are some columns with fixed labels, and also a variable number of columns with similar labels that are numbered.
example=pd.DataFrame(columns=["fix","variable 1","variable 2","waste 1","waste 2"])
I want to get all these variable columns, which I can get via
example.filter(regex="var")
but I want to include "fix" as well. As df.loc doesn't allow regexes and df.filter only supports a single argument, is there a smooth way to do this? Or do I have to create a quite complex callable?
thanks in advance
Just modify your regex to do a full match for "fix":
df.filter(regex=r"var|(^fix$)")
Empty DataFrame
Columns: [fix, variable 1, variable 2]
Index: []
Another option is using Index.str.contains in the same fashion:
df.loc[:, df.columns.str.contains(r'var|(?:^fix$)')]
Empty DataFrame
Columns: [fix, variable 1, variable 2]
Index: []
I made the group non-capturing, otherwise pandas complains.
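A runnable sketch putting the two one-liners side by side, using the example frame from the question:

import pandas as pd

example = pd.DataFrame(columns=["fix", "variable 1", "variable 2", "waste 1", "waste 2"])

via_filter = example.filter(regex=r"var|(^fix$)")
via_loc = example.loc[:, example.columns.str.contains(r'var|(?:^fix$)')]

print(via_filter.columns.tolist())  # ['fix', 'variable 1', 'variable 2']
print(via_loc.columns.tolist())     # ['fix', 'variable 1', 'variable 2']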
New to pandas here. I have a df:
inked=tracker[['A','B','C','D','AA','BB','CC', 'DD', 'E', 'F']]
Single-letter column names contain names, and double-letter column names contain numbers but also NaN.
I am converting all NaN to zeros by using this:
inked.loc[:,'AA':'DD'].fillna(0)
and it works, but when I do
inked.head()
I get the original df with the NaN. How can I make the change permanently in the df?
By default, fillna() is not performed in place. If you were operating directly on the DataFrame, then you could use the inplace=True argument, like this:
inked.fillna(0, inplace=True)
However, if you first select a subset of the columns using loc, any in-place changes apply only to that temporary selection and are lost.
This has come up before: basically, you need to re-assign the updated DataFrame back to the original DataFrame. For a list of columns (rather than a range, like you originally tried), you can do this:
inked[['AA','DD']] = inked[['AA','DD']].fillna(0)
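A runnable sketch of that re-assignment with made-up data (only a few of your columns shown):

import numpy as np
import pandas as pd

# Made-up frame: single-letter columns hold names, double-letter columns hold numbers with NaN.
inked = pd.DataFrame({
    'A': ['x', 'y'],
    'AA': [1.0, np.nan],
    'BB': [np.nan, 2.0],
    'DD': [np.nan, 4.0],
})

inked[['AA', 'DD']] = inked[['AA', 'DD']].fillna(0)

print(inked)  # AA and DD no longer contain NaN; BB is untouched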
In general, when you want to alter a dataframe, you need to assign the result back: either to the whole dataframe, to a new variable, or, as here, to the slice of columns you selected (in my experience at least):
inked.loc[:, 'AA':'DD'] = inked.loc[:, 'AA':'DD'].fillna(0)
I've got a dataframe (haveleft) full of people who have left a service and their reason for leaving. The 'text' column is their reason, but some of them aren't strings. Not many, so I just want to remove those rows, either in place or to a new dataframe. Below code just gives me a dataframe populated with only NaN. Why doesn't it work?
cleanedleft = pd.DataFrame()
cleanedleft = haveleft[haveleft[haveleft['text'] == str]]
print(holder[0:10])
or if I remove one of the 'haveleft[ ]' I get an empty dataframe
cleanedleft = pd.DataFrame()
cleanedleft = haveleft[haveleft['text'] == str]
print(holder[0:10])
I've tried to add a type() but can't seem to figure out the way to do this.
It doesn't work because haveleft['text'] == str compares each value against the type object str itself, which is never True, so you get an all-False mask; wrapping that in another haveleft[...] then masks every value to NaN, which is why you see a frame full of NaN. The column's dtype stays object even if some of the values are numerical, so the comparison never looks at each element's type. You'll want to figure out how to characterize the unwanted data and drop those rows accordingly.
For instance, to drop rows where 'text' consists only of digits:
cleaned = haveleft[~haveleft['text'].str.match(r'^\d+$', na=True)]  # na=True also drops rows whose value isn't a string
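If you really do want to test the Python type of each value (which is what your type() attempt was reaching for), you can check each element with isinstance; a sketch with invented data:

import pandas as pd

# Invented example: an object column mixing strings and numbers.
haveleft = pd.DataFrame({'text': ['too expensive', 42, 'moved away', 3.14]})

# Keep only the rows whose 'text' value is actually a Python str.
cleanedleft = haveleft[haveleft['text'].apply(lambda v: isinstance(v, str))]

print(cleanedleft)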