This question has been asked multiple times and I have implemented the provided answers, but none of them worked. I have a dataframe containing an object column whose cells are all empty strings. I have tried to drop it using each of the following methods separately:
data.dropna()
data.dropna(axis=1, inplace=True)
data.dropna(axis='columns', how='all', inplace=True)
data.mask(data.astype(bool)).dropna(axis=1, how='all')
data.dropna(subset=['columnName'], inplace=True)
filter = data['columnName'] != ""
data = data[filter]
I also tried replacing the empty cells with NaN using:
data['columnName'].replace('', np.nan, inplace=True)
and then dropping the column, but strangely nothing was changed to NaN in the corresponding column. In the lines of code above I assigned the result back to data wherever that was required, but none of them worked. What should I use instead?
This is sample data:
BUILDING CATEGORY   MEANS OF ACCESS   ADDRESS   PRICE
rental                                UK        £700000
commercial                            UK        £5000000
I intend to drop MEANS OF ACCESS.
UPDATE
The code snippet is as follows:
# Remove the initial 2 rows
data = pd.read_csv(file, skiprows=2, low_memory=False)
# Remove the irrelevant columns
data = data.drop(['REGION', 'NUMBER'], axis=1)
# Remove the leading currency sign
data['PRICE'] = [x[1:] for x in data['PRICE']]
columns = ['WHOLE UNITS', 'AREA', 'PRICE']
# Remove comma
data[columns] = data[columns].apply(lambda x: x.str.replace(',', ''))
# Convert to numeric
data[columns] = data[columns].apply(pd.to_numeric)
# Remove duplicate rows
data.drop_duplicates(inplace=True)
print((data['MEANS OF ACCESS'] == "").sum()) #returns 0 but it shouldn't
If you want to drop the column 'column_name', then simply use
df = df.drop(labels=['column_name'], axis=1)
If you want to drop all columns which contain only empty strings, then use
df = df.replace('', pd.NA).dropna(axis=1, how='all')
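If the replace appears to do nothing (as in your update, where the count of empty strings is 0), the cells may contain whitespace rather than truly empty strings. A hedged variant that also treats whitespace-only cells as empty:
df = df.replace(r'^\s*$', pd.NA, regex=True).dropna(axis=1, how='all')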
I have the following code:
df1 = pd.read_excel(f, sheet_name=0, header=6)
# Drop Columns by position
df1 = df1.drop(df1.columns[[5, 8, 10, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25]], axis=1)
# rename cols
This is where I am struggling: each time I attempt to rename the columns by position, the result is None, a <class 'NoneType'> (when I use print(type(df1))). Note that df1 is the expected dataframe after dropping the columns.
I get this with everything I have tried below:
column_indices = [0,1,2,3,4,5,6,7,8,9,10,11]
new_names = ['AWG Item Code','Description','UPC','PK','Size','Regular Case Cost','Unit Scan','AMAP','Case Bill Back','Monday Start Date','Sunday End Date','Net Unit']
old_names = df1.columns[column_indices]
df1 = df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
And with:
df1 = df1.rename({df1.columns[0]: "AWG Item Code",
                  df1.columns[1]: "Description",
                  df1.columns[2]: "UPC",
                  df1.columns[3]: "PK",
                  df1.columns[4]: "Size",
                  df1.columns[5]: "Regular Case Cost",
                  df1.columns[6]: "Unit Scan",
                  df1.columns[7]: "AMAP",
                  df1.columns[8]: "Case Bill Back",
                  df1.columns[9]: "Monday Start Date",
                  df1.columns[10]: "Sunday End Date",
                  df1.columns[11]: "Net Unit"}, inplace=True)
When I remove the inplace=True (effectively setting it to False), it returns a dataframe, but without any of the changes I want.
The tricky part is that in this program my column headers will change each time, but the columns the data is in will not. Otherwise I would just use df = df.rename(columns={"a": "newname"}).
A simpler version of your code could be:
df1.columns = new_names
It should work as intended, i.e. renaming columns in index order, provided new_names contains exactly one name per remaining column.
Otherwise, in your own code: if you print df1.columns[column_indices], you do not get a list but a pandas.core.indexes.base.Index. Also note that rename(..., inplace=True) returns None, so assigning its result back to df1 is what leaves you with a NoneType. To correct your code, change the last two lines to:
old_names = df1.columns[column_indices].tolist()
df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
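For example, a minimal sketch of the corrected flow (using a hypothetical two-column frame):
import pandas as pd
df1 = pd.DataFrame([[1, 2]], columns=['old_a', 'old_b'])  # hypothetical frame
old_names = df1.columns[[0, 1]].tolist()
new_names = ['AWG Item Code', 'Description']
df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)  # mutates df1, returns None
print(df1.columns.tolist())  # ['AWG Item Code', 'Description']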
Have a nice day
I was dumb and missing columns=
df1.rename(columns={df1.columns[0]: "AWG Item Code",
                    df1.columns[1]: "Description",
                    df1.columns[2]: "UPC",
                    df1.columns[3]: "PK",
                    df1.columns[4]: "Size",
                    df1.columns[5]: "Regular Case Cost",
                    df1.columns[6]: "Unit Scan",
                    df1.columns[7]: "AMAP",
                    df1.columns[8]: "Case Bill Back",
                    df1.columns[9]: "Monday Start Date",
                    df1.columns[10]: "Sunday End Date",
                    df1.columns[11]: "Net Unit"}, inplace=True)
works fine
I am not sure whether this answers your question:
There is a simple way to rename the columns:
If I have a data frame, say df, I can see the column names using the following code:
df.columns.to_list()
which, suppose, gives the following column names:
['A', 'B', 'C','D']
And I want to keep the first three columns and rename them as 'E', 'F' and 'G' respectively. The following code gives me the desired outcome:
df = df[['A','B','C']]
df.columns = ['E','F','G']
New outcome:
df.columns.to_list()
output: ['E','F','G']
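Note that this also silently drops column 'D'. If instead you want to rename only the first three columns while keeping the rest, one hedged alternative is:
df = df.rename(columns=dict(zip(df.columns[:3], ['E', 'F', 'G'])))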
I have a dataset with many columns. I want to extract the numeric columns, fill their missing values with the column means, and then have these modified columns replace the ones in the original dataframe.
df1 = df.select_dtypes(include = ["number"]).apply(lambda x: x.fillna(x.mean()),axis=0)
df.loc[df.select_dtypes(include = ["number"])] = df1
I managed to extract the numeric columns, but I couldn't replace them. The idea is not to have to manually specify which columns are numeric.
It's probably easier to assign a new/changed DataFrame. This will only change the columns you altered.
new_df = df.assign(**df.select_dtypes('number').apply(lambda x: x.fillna(x.mean())))
If you want to preserve the original DataFrame, you can do it in steps:
cols = df.select_dtypes('number').columns
df[cols] = df[cols].apply(lambda x: x.fillna(x.mean()))
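For a quick check, a minimal sketch with made-up data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'name': ['x', 'y', 'z']})  # hypothetical data
cols = df.select_dtypes('number').columns
df[cols] = df[cols].apply(lambda x: x.fillna(x.mean()))
print(df)  # 'a' becomes [1.0, 2.0, 3.0]; 'name' is untouched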
Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1:
type_of_fruit  name_of_fruit  price
.....          .....          .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
and then drop any columns not in expected_cols into another df, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But this seems problematic depending on column ordering, and also in cases where the columns could have either more values than expected, or fewer. In cases where there are fewer values than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then how to handle the case where there are more columns than expected.
You can use set difference with -:
Assuming df1 has these columns:
df1_cols = df1.columns  # ['type_of_fruit', 'name_of_fruit', 'price']
expected_cols = ['name_of_fruit', 'price']
unwanted_cols = list(set(df1_cols) - set(expected_cols))
df2 = df1[unwanted_cols]
df1.drop(unwanted_cols, axis=1, inplace=True)
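Note that set() does not preserve order; if the ordering of the saved columns matters, a list comprehension keeps the original column order:
unwanted_cols = [c for c in df1.columns if c not in expected_cols]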
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
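If you are on a newer pandas where groupby(..., axis=1) is deprecated, a plain boolean mask gives the same split (a sketch of the equivalent logic, not the answer's original code):
mask = df.columns.isin(expected_cols)
d = {True: df.loc[:, mask], False: df.loc[:, ~mask]}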
When applying the code below, I am getting NaN values in the entire QSTS_ID column.
df['QSTS_ID'] = df['QSTS_ID'].str.split('.',expand=True)
df
I want to copy the entire QSTS_ID column and append it at the end. I also have to split it on the full stop and apply new headers.
The problem is that with the parameter expand=True, str.split returns a DataFrame with one or more columns, so assigning it back to a single column produces NaNs.
The solution is to add the new columns to the original DataFrame with join or concat; add_prefix changes the new column names:
df = df.join(df['QSTS_ID'].str.split('.',expand=True).add_prefix('QSTS_ID_'))
df = pd.concat([df, df['QSTS_ID'].str.split('.',expand=True).add_prefix('QSTS_ID_')], axis=1)
If you also want to remove the original column:
df = df.join(df.pop('QSTS_ID').str.split('.',expand=True).add_prefix('QSTS_ID_'))
df = pd.concat([df,
                df.pop('QSTS_ID').str.split('.', expand=True).add_prefix('QSTS_ID_')],
               axis=1)
Sample:
df = pd.DataFrame({
    'QSTS_ID': ['val_k.lo', 'val2.s', 'val3.t'],
    'F': list('abc')
})
df1 = df['QSTS_ID'].str.split('.',expand=True).add_prefix('QSTS_ID_')
df = df.join(df1)
print (df)
    QSTS_ID  F QSTS_ID_0 QSTS_ID_1
0  val_k.lo  a     val_k        lo
1    val2.s  b      val2         s
2    val3.t  c      val3         t
# check the column names of the new columns
print (df1.columns)
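which should print:
Index(['QSTS_ID_0', 'QSTS_ID_1'], dtype='object')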
I have sample data (Data_sample_truncated.txt) truncated from a larger dataset. It has 3 fields - "Index", "Time" and "RxIn.Density[x, ::]", where x stands for an integer that can vary over any range; in this data it is 0-15. The combination of the 3 column fields is unique. For different "Index" values, "Time" and "RxIn.Density[x, ::]" can be the same or different. For each new "Index" value the data has a blank line and a nearly identical header row, except that the x in "RxIn.Density[x, ::]" increases each time a new "Index" value is reached. This is the format I get when exporting data from ADS (circuit simulation software).
Now I want to format the data so that everything is merged under 3 unique column fields - "Index", "Time" and "RxIn.Density". As you can see, I want to remove the [x, ::] part from the third column header. Here is the sample final data file that I want after formatting (Data-format_I_want_after_formatting.txt). So I want the following:
The blank lines (or rows) to be removed
All the other header lines to be removed, keeping only the top header and changing the 3rd column header to "RxIn.Density"
Keeping all the data merged under the unique column fields - "Index", "Time" and "RxIn.Density", even if the data values are duplicate.
My Python code is below:
import pandas as pd
# create a DataFrame from the file with columns index, time and v
df = pd.read_csv('Data_sample_truncated.txt', sep=r"\s+", names=['index', 'time', 'v'])
# boolean mask identifying the header rows
m = df['v'].str.contains('RxIn')
# new column: forward-fill the header value over the data rows
df['g'] = df['v'].where(m).ffill()
#get original ordering for new columns
#cols = df['g'].unique()
#remove rows with same values in v and g columns
#df = df[df['v'] != df['g']]
df = df.drop_duplicates(subset=['index', 'time'], keep=False)
df.to_csv('target.txt', index=False, sep='\t')
The generated target.txt file is not what I wanted. You can check it here. Can anyone help me find what is wrong with my code and how to fix it so that I get my intended formatting?
I am using Spyder 3.2.6 (Anaconda) with Python 3.6.4 64-bit.
You can just filter out the rows that you do not want. Because the header text "RxIn.Density[x, ::]" contains a space, read_csv with a whitespace separator sees four tokens on header lines, which is why a throw-away fourth column appears:
import pandas as pd
df = pd.read_csv('Data_sample_truncated.txt', sep="\s+")
df.columns = ["index","time","RxIn.Density","1"]
del df["1"]
df = df[~df["RxIn.Density"].str.contains("Rx")].reset_index(drop=True)
df.to_csv('target.txt', index=False, sep='\t')
Try this: the extra 'mask' column is non-null only on the repeated header rows (again because "RxIn.Density[x, ::]" splits into two tokens), so keeping the rows where it is NaN leaves just the data:
df = pd.read_csv('Data_sample_truncated.txt', sep=r'\s+', names=['index', 'time', 'RxIn.Density', 'mask'], header=None)
df = df[df['mask'].isna()].drop(['mask'], axis=1)
df.to_csv('target.txt', index=False, sep='\t')