new to pandas here. I have a df:
inked=tracker[['A','B','C','D','AA','BB','CC', 'DD', 'E', 'F']]
The single-letter columns contain names, and the double-letter columns contain numbers but also NaN values.
I am converting all NaN to zeros by using this:
inked.loc[:,'AA':'DD'].fillna(0)
and it works, but when I do
inked.head()
I get the original df with the NaN. How can I make the change permanent in the df?
By default, fillna() is not performed in place. If you were operating directly on the DataFrame, then you could use the inplace=True argument, like this:
inked.fillna(0, inplace=True)
However, if you first select a subset of the columns using loc, the filled result is a new DataFrame that is simply discarded, so the changes are lost.
This was covered here. Basically, you need to re-assign the updated DataFrame back to the original DataFrame. For a list of columns (rather than a range, like you originally tried), you can do this:
inked[['AA', 'BB', 'CC', 'DD']] = inked[['AA', 'BB', 'CC', 'DD']].fillna(0)
In general, when you want to alter a DataFrame, you need to assign the result of the operation back, either to the same variable or to a new one (in my experience at least). For a label range of columns, write the filled values back into that same slice so the other columns are kept:
inked.loc[:, 'AA':'DD'] = inked.loc[:, 'AA':'DD'].fillna(0)
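For example, a minimal sketch on made-up data (the toy values and shortened column list are just for illustration):

import numpy as np
import pandas as pd

# Toy stand-in for `inked`: a name column plus numeric columns with NaN.
inked = pd.DataFrame({
    "A": ["x", "y"],
    "AA": [1.0, np.nan],
    "BB": [np.nan, 2.0],
    "CC": [3.0, np.nan],
    "DD": [np.nan, 4.0],
})

# Write the filled values back into the same column slice...
inked.loc[:, "AA":"DD"] = inked.loc[:, "AA":"DD"].fillna(0)

# ...and the change persists: head() now shows the zeros.
print(inked.head())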
I have a rather messy dataframe in which I need to use the first 3 rows as multilevel column names.
This is my dataframe, and I need the rows at index 3, 4 and 5 to be my MultiIndex column names.
For example, 'MINERAL TOTAL' should be the level 0 until next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I actually need is to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas reads in the top row as the sole header row. You can pass the header argument to pandas.read_excel() to indicate which row (or rows) should be used as headers; it can be either an int or a list of ints. See the pandas.read_excel() documentation for more information.
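For example (the file name is the one from the question and the label comes from the question's data; adjust both to your file):

import pandas as pd

df = pd.read_excel('ExcelFile.xlsx', header=[0, 1, 2])

print(df.columns)            # a 3-level MultiIndex built from the first three rows
print(df['MINERAL TOTAL'])   # select everything under one level-0 label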
As you mentioned, you are unable to use pandas.read_excel(). However, if you do already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to build an array of the header rows, which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is that this includes the NaN values in the new MultiIndex header. To get around this, you could create some function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    return pd.Series(iterable).ffill().to_list()

zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())

array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1 and 2) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
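Put together on a small made-up frame (the values, including the unit row, are just for illustration), the whole approach looks something like this:

import numpy as np
import pandas as pd

# Made-up stand-in for the messy frame: the first three rows hold the headers.
df = pd.DataFrame([
    ['MINERAL TOTAL', np.nan, np.nan],
    ['TRATAMIENTO (ts)', np.nan, 'LEY Cu(%)'],
    ['t', '%', '%'],
    [100, 1.2, 0.8],
    [110, 1.3, 0.9],
])

def forward_fill(iterable):
    return pd.Series(iterable).ffill().to_list()

array = [forward_fill(df.iloc[i].to_list()) for i in range(3)]
df.columns = pd.MultiIndex.from_arrays(array)

df.drop(index=[0, 1, 2], inplace=True)
df.reset_index(drop=True, inplace=True)
print(df.columns)  # 3-level MultiIndex with the NaN labels forward-filled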
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.ffill().set_index([3, 4, 5]).T
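On a small made-up frame with the header values sitting in rows labelled 3, 4 and 5, that one-liner would look like this:

import numpy as np
import pandas as pd

# Made-up frame: rows labelled 3, 4 and 5 hold the (partially NaN) header values.
raw = pd.DataFrame(
    [
        ['MINERAL TOTAL', np.nan, np.nan],
        ['TRATAMIENTO (ts)', np.nan, 'LEY Cu(%)'],
        ['t', '%', '%'],
        [100, 1.2, 0.8],
        [110, 1.3, 0.9],
    ],
    index=[3, 4, 5, 6, 7],
)

tidy = raw.T.ffill().set_index([3, 4, 5]).T
print(tidy.columns)  # 3-level MultiIndex, NaN labels filled from the left

Note that the forward fill also runs through the data rows, so a NaN data value would be filled from the column to its left; if that matters, forward-fill only the header rows before transposing.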
So, I read CSV files that are generated using Excel.
Those can contain empty columns to the right of, and empty rows below, the actual data range/table.
Empty here means really empty: no column header, no data whatsoever, clearly an artifact.
In a first iteration I just used
pd.read_csv().dropna(axis=1, how='all', inplace=False).dropna(axis='index', how='all', inplace=False)
which seemed to work fine.
But it also removes legitimately empty columns: regular columns that do have a column name and are really supposed to be empty, because that is their data.
I want to keep all columns that have a proper column name OR contain data (someone might have just forgotten to give the column a name, but it is still a proper column).
So, per https://stackoverflow.com/a/43983654/2215053 I first used
unnamed_cols_mask = basedata_df2.columns.str.contains('^Unnamed')
basedata_df2.loc[:, ~unnamed_cols_mask] + basedata_df2.loc[:, unnamed_cols_mask].dropna(axis=1, how='all', inplace=False)
which looks and feels clean, but it scrambles the column order.
So now I go with:
df = pd.read_csv().dropna(axis='index', how='all', inplace=False)
df = df[[column_name for column_name in df.columns.array if not column_name.startswith('Unnamed: ') or not df[column_name].isnull().all()]]
Which works.
But there should be an obviously right way to accomplish this frequently occurring task?
So how could I do this better?
Specifically: Is there a way to make sure the column names starting with 'Unnamed: ' were created by the pd.read_csv() and not originally imported from the csv?
Unfortunately, I think there is no built-in function for this, not even in pandas.read_csv. But you can apply the following code:
# columns that contain only NaN values
ser_all_na = df.isna().all(axis='rows')
# columns that got a generic 'Unnamed: ...' name from read_csv
del_indexer = ser_all_na.index.str.startswith('Unnamed: ')
# delete only the columns that have no explicit name AND contain only NaN
del_indexer &= ser_all_na
df.drop(columns=ser_all_na[del_indexer].index, inplace=True)
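Wrapped up as a small helper (the function name is just illustrative), the same logic could look like this:

import pandas as pd

def drop_unnamed_empty_columns(df):
    # Drop columns that are completely empty AND whose name was
    # auto-generated by read_csv ('Unnamed: 0', 'Unnamed: 1', ...).
    all_na = df.isna().all(axis='rows').to_numpy()
    generic = df.columns.str.startswith('Unnamed: ')
    return df.drop(columns=df.columns[all_na & generic])

# usage (the file name is just an example):
# df = drop_unnamed_empty_columns(pd.read_csv('export.csv'))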
I have a pandas DataFrame, say df, and I'm trying to drop certain rows by an index. Specifically:
myindex = df[df.column2 != myvalue].index
df.drop(myindex, inplace = True)
This seems to work just fine for most DataFrames, but strange things seem to happen with one DataFrame where I get a non-unique index myindex (I am not quite sure why, since the DataFrame has no duplicate rows). To be more precise, a lot more rows get dropped than there are labels in the index (in the extreme case I actually drop all rows, even though there are several hundred rows where column2 has myvalue). Extracting only the unique values (myindex.unique()) and dropping the rows using that unique index doesn't help either. At the same time,
df = df[df.column2 != myvalue]
works just as I'd like it to. I'd rather use the in-place drop, however. More importantly, I would like to understand why the results are not the same with the direct assignment and with the drop method using the index.
Unfortunately, I cannot provide the data as those cannot be published and since I am not sure what is wrong exactly, I cannot simulate them either. However, I suspect it probably has something to do with myindex being nonunique (which also confuses me since there are no duplicate rows in df but it might very well be that I misunderstand the way the index is created).
If there are repeated values in your index, doing reset_index before might help. That will set your current index as a column and add a new sequential index (with unique values) instead.
df = df.reset_index()
The reason the two methods are not the same is that in one case you are passing a series of booleans that represents which rows to keep and which ones to drop (index values are not relevant here). In the case with drop, you are passing a list of index labels, each of which can map to several positions when the index has duplicates.
Finally, to check if your index has duplicates, you shouldn't check for duplicate rows. Simply do:
df.index.has_duplicates
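A small made-up example of what goes wrong (the values and the duplicated label are just for illustration):

import pandas as pd

df = pd.DataFrame({'column2': ['a', 'b', 'a']}, index=[0, 1, 1])

myindex = df[df.column2 != 'a'].index   # Index([1]) -- labels, not positions
print(df.drop(myindex))                 # drops BOTH rows labelled 1, not just the 'b' row
print(df[df.column2 == 'a'])            # the boolean mask keeps exactly the 'a' rows

print(df.index.has_duplicates)          # True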
I have a pandas data frame with different data types. I want to convert more than one column in the data frame to string type. I have done it individually for each column, but I want to know if there is a more efficient way.
So at present I am doing something like this:
repair['SCENARIO']=repair['SCENARIO'].astype(str)
repair['SERVICE_TYPE']= repair['SERVICE_TYPE'].astype(str)
I want a function that would help me pass multiple columns and convert them to strings.
To convert multiple columns to string, pass a list of columns to your above-mentioned command:
df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(str)
# add as many column names as you like.
That means that one way to convert all columns is to construct the list of columns like this:
all_columns = list(df) # Creates list of all column headers
df[all_columns] = df[all_columns].astype(str)
Note that the latter can also be done directly (see comments).
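Since the question asks for a function, a minimal helper along these lines (the name is just for illustration) would also do, applied here to the question's repair frame:

def to_string_columns(df, columns):
    # Cast the given columns of df to string and return the frame.
    df[columns] = df[columns].astype(str)
    return df

repair = to_string_columns(repair, ['SCENARIO', 'SERVICE_TYPE'])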
I know this is an old question, but I was looking for a way to turn all columns with an object dtype to strings as a workaround for a bug I discovered in rpy2. I'm working with large dataframes, so didn't want to list each column explicitly. This seemed to work well for me so I thought I'd share in case it helps someone else.
stringcols = df.select_dtypes(include='object').columns
df[stringcols] = df[stringcols].fillna('').astype(str)
The "fillna('')" prevents NaN entries from getting converted to the string 'nan' by replacing with an empty string instead.
You can also use list comprehension:
df = pd.concat([df[col_name].astype(str) for col_name in df.columns], axis=1)
You can also add a condition to convert only certain columns - for example, only the columns whose name contains 'to_str':
df = pd.concat([df[col_name].astype(str) if 'to_str' in col_name else df[col_name] for col_name in df.columns], axis=1)
Recently I have been developing some code to read a csv file and store key data columns in a dataframe. Afterwards I plan to have some mathematical functions performed on certain columns in the dataframe.
I've been fairly successful in storing the correct columns in the dataframe. I have been able to have it do whatever maths is necessary such as summations, additions of dataframe columns, averaging etc.
My problem lies in accessing specific columns once they are stored in the dataframe. I was working with a test file to get everything working and managed this no problem. The problems arise when I open a different csv file: it will store the data in the dataframe, but accessing the column I want no longer works and it stops at the calculation part.
From what I can tell the problem lies with how it reads the column name. The column names are all numbers. For example, df['300'], df['301'] etc. When accessing the column df['300'] works fine in the testfile, while the next file requires df['300.0']. If I switch to a different file it may require df['300'] again. All the data was obtained in the same way so I am not certain why some are read as 300 and the others 300.0.
Short of constantly changing the column labels each time I open a different file, is there any way to have it automatically distinguish between '300' and '300.0' when opening the file, or force '300.0' = '300'?
Thanks
In your dataframe df, one way to keep things consistent is to convert all column names to the same type. You can update all the column names from float-style strings to integer-style strings, i.e. '300.0' to '300', using .columns as below. Then using the integer-style string should work, i.e. df['300'], and likewise for any other column.
df.columns = [str(int(float(column))) for column in df.columns]
Or, if the integer value is not required, the extra int conversion can be removed and the float string value can be used:
df.columns = [str(float(column)) for column in df.columns]
Then, df['300.0'] can be used instead of df['300'].
If string type is not required, then I think converting them to float would work as well.
df.columns = [float(column) for column in df.columns]
Then, df[300.0] would work as well.
Another alternative for changing the column names is to use map:
Changing to float value for all columns, then as mentioned above use df[300.0]:
df.columns = map(float, df.columns)
Changing to string value of float, then df['300.0']:
df.columns = map(str, map(float, df.columns))
Changing to string value of int, then df['300']:
df.columns = map(str, map(int, map(float, df.columns)))
Some solutions:
Go through all the files, change the column names, then save the results in a new folder. Now when you read a file, you can go to the new folder and read it from there.
Wrap the normal file-read function in another function that automatically changes the column names, and call that new function when you read a file.
Wrap column selection in a function. Use a try/except block to have the function try to access the given column, and if it fails, fall back to the other form (see the sketch after this list).
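As a rough sketch of that last option (the helper name and the fallback format are assumptions, not something from the original code):

def get_column(df, name):
    # Try the plain name first, then fall back to the float-style spelling.
    try:
        return df[name]              # e.g. df['300']
    except KeyError:
        return df[str(float(name))]  # e.g. df['300.0']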
This answer assumes you want only the integer part to remain in the column name. It takes the column names and does a float->int->string conversion to strip the decimal places.
Be careful, if you have numbers like '300.5' as a column name, this will turn them into '300'.
cols = df.columns.tolist()
new_columns = dict([(c,str(int(float(c)))) for c in cols])
df = df.rename(columns = new_columns)
For clarity, most of the 'magic' happens on the middle line: I iterate over the existing column names and turn each one into an (old_name, new_name) pair; those pairs are collected into a dictionary, and df.rename takes that dictionary and does the renaming for you.
My thanks to user Nipun Batra for this answer that explained df.rename.
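On a small made-up frame, the effect looks like this:

import pandas as pd

df = pd.DataFrame([[1, 2]], columns=['300.0', '301.0'])

cols = df.columns.tolist()
new_columns = dict([(c, str(int(float(c)))) for c in cols])
df = df.rename(columns=new_columns)

print(df.columns.tolist())  # ['300', '301']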