deleting pandas dataframe column - python

I am trying to delete the column of a pandas dataframe and I get the following error: ValueError: labels [' 5'] not contained in axis. However my print df.columns returnsInt64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64'). See bellow the code as well:
df = pd.read_csv(StringIO(data),skiprows=186,sep=";",header=None)
#df.drop(' 5', inplace=True)
b= df.columns.tolist()
print df.columns

df=df.drop(0,axis=1) #to delete/remove single columns
df=df.drop([0,2,3,4,5,6],axis=1) #To delete columns at 0 2,3,4,5,6
df.drop(columns=['B', 'C']) #to Delete columns by name

Related

Remove duplicate values in each row of the column

I have a Data Frame, which has a column that shows repeated values. It was the result of an inverse "explode" operation... trello_dataframe = trello_dataframe.groupby(['Card ID', 'ID List'], as_index=True).agg({'Member (Full Name)': lambda x: x.tolist()})
How do I remove duplicate values in each row of the column?
I attach more information: https://prnt.sc/RjGazPcMBX47
I would like to have the data frame like this: https://prnt.sc/y0VjKuewp872
Thanks in advance!
You will need to target the column and with a np.unique
import pandas as pd
import numpy as np
data = {
'Column1' : ['A', 'B', 'C'],
'Column2' : [[5, 0, 5, 0, 5], [5,0,5], [5]]
}
df = pd.DataFrame(data)
df['Column2'] = df['Column2'].apply(lambda x : np.unique(x))
df

Selecting a Range of Adjacent Columns for Dataframe

I am not understanding how to essentially say: columns= [0:6, 12:15])
When I try this I get invalid syntax at the :
import pandas as pd
data = pd.read_excel (rf'C:\Users\dusti\Desktop\bulk export.xlsx',
sheet_name=1,
header=None)
df = pd.DataFrame(data,
columns= [0,1,2,3,4,5,6,12,13,14,15])
df.to_csv(rf'C:\Users\dusti\Desktop\bulk export1.csv',
header=False,
index=False)
print (df)
this thing that you trying is slicing. it used for select a subset of a list
You can use the range function for create numbers and convert it to a list with the list function
list(range(0,6+1)) + list(range(12,15+1))
#output :
[0, 1, 2, 3, 4, 5, 6, 12, 13, 14, 15]

Merge columns into one while dropping nan values and duplicates

I am trying to merge multiple columns into a single column while dropping duplicates and dropping null values but keeping the rows.
What I have:
df= pd.DataFrame(np.array([['nan', 'nan', 'nan'], ['nan', 2, 2], ['nan', 'x', 'nan']]), columns=['a', 'b', 'c'])
What I need:
df= pd.DataFrame(np.array([[''], [ 2], [ 1]]), columns=['a'])
I have tried this but I get 1,nan for the last row:
df['a]=df[['a','b','c]].agg(', '.join, axis=1)
I have also tried the following but I cannot get this to work:
.stack().unstack()
and
.join
but I cannot get these to drop duplicates for each row
This will find the maximum value of a row and replace 'nan' with '':
new_df = pd.DataFrame(df.astype(float).max(axis=1).replace(np.nan, ''), columns=[df.columns[0]])
output:
a
0
1 2.0
2 1.0

Intepreting Pandas Column Referencing Syntax

I have a basic background in using R for data wrangling but am new to Python. I came across this code snippet from a tutorial on Coursera.
Can someone please explain to me what columns ={col:'Gold' + col[4:]}, inplace = True means?
(1) From my understanding, df.rename is to rename the existing column name to (in the case of first line, Gold) but why is there a need to +col[4:] after it?
(2) Does declaring the function inplace as True mean to assign the resulting df output to the original df?
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
if col[:2]=='01':
df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
if col[:2]=='02':
df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
if col[:2]=='03':
df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
if col[:1]=='№':
df.rename(columns={col:'#'+col[1:]}, inplace=True)
Thank you in advance.
It means:
#for each column name
for col in df.columns:
#check first 2 chars for 01
if col[:2]=='01':
#replace column name with text gold and all characters after 4th letter
df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
#similar like above
if col[:2]=='02':
df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
#similar like above
if col[:2]=='03':
df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
#check first letter
if col[:1]=='№':
#add # after first letter
df.rename(columns={col:'#'+col[1:]}, inplace=True)
Does declaring the function inplace as True mean to assign the resulting df output to the original dataframe
Yes, you are right. It replace inplace columns names.
if col[:2]=='01':
#replace column name with text gold and all characters after 4th letter
df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
(1). If col has a column name of '01xx1234',
1. col[:2] = 01 is True
2. 'Gold'+col[4:] => 'Gold'+col[4:] => 'Gold1234'
3. so, '01xx1234' is replaced by 'Gold1234'.
(2) inplace = True applies directly to a dataframe and does not return a result.
If you do not add this option, you have to do like this.
df = df.rename(columns={col:'Gold'+col[4:]})
inplace=True means: The columns will be renamed in your original dataframe (df)
Your case (inplace=True):
import pandas as pd
df = pd.DataFrame(columns={"A": [1, 2, 3], "B": [4, 5, 6]})
df.rename(columns={"A": "a", "B": "c"}, inplace=True)
print(df.columns)
# Index(['a', 'c'], dtype='object')
# df already has the renamed columns, because inplace=True.
If you wouldn't use inplace=True, then the rename method would generate a new dataframe, like this:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
new_frame = df.rename(columns={"A": "a", "B": "c"})
print(df.columns)
# Index(['A', 'B'], dtype='object')
# It contains the old column names
print(new_frame.columns)
# Index(['a', 'c'], dtype='object')
# It's a new dataframe and has renamed columns
NOTE: In this case, better approach to assign the new dataframe to the original dataframe (df)
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df = df.rename(columns={"A": "a", "B": "c"})

Selecting columns from dataframe based on the name of other dataframe

I have 3 dataframes,
df
df = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'AC007', 'AC007', 'AC007'],
'AA_ID': [22, 22, 2, 2, 2],
'BB_ID':[4, 5, 6, 8, 9],
'CC_ID' : [2, 2, 3, 3, 3],
'DD_RE': [4,7,8,9,0],
'EE_RE':[5,8,9,9,10]})
and df_ID,
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'CFV', 'SAP', 'SOS']})
and the other one isdf_RE, both of these data frames has the column Name, so I need to merge it to data frame df, then I need to select the columns based on the last part of the data frame's name. That is, for example, if the data frame is df_ID then I need all columns ending with "ID" + "Name" for all matching rows from Name from data frame df, and if the data frame id df_REL then I need I all columns ends with "RE" + "Name" from df and I wanted to save it separately.
I know I could call inside the loop as,
for dfs in dataframes:
ID=[col for col in df.columns if '_ID' in col]
df_ID=pd.merge(df,df_ID,on='Name')
df_ID=df_ID[ID]
But here the ID , has to change again when the data frames ends with RE and so on , I have a couple of file with different strings so any better solution would be great
So at the end I need for df_ID as having all the columns ending with ID
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16'],
'AA_ID': [22, 22'],
'BB_ID':[4, 5],
'CC_ID' : [2, 2]})
Any help would be great
Assuming your columns in df are Name and anything with a suffix such as the examples you have listed (e.g. _ID, _RE), then what you could do is parse through the column names to first extract all unique possible suffixes:
# since the suffixes follow a pattern of `_*`, then I can look for the `_` character
suffixes = list(set([col[-3:] for col in df.columns if '_' in col]))
Now, with the list of suffixes, you next want to create a dictionary of your existing dataframes, where the keys in the dictionary are suffixes, and the values are the dataframes with the suffix names (e.g. df_ID, df_RE):
dfs = {}
dfs['_ID'] = df_ID
dfs['_RE'] = df_RE
... # and so forth
Now you can loop through your suffixes list to extract the appropriate columns with each suffix in the list and do the merges and column extractions:
for suffix in suffixes:
cols = [col for col in df.columns if suffix in col]
dfs[suffix] = pd.merge(df, dfs[suffix], on='Name')
dfs[suffix] = dfs[suffix][cols]
Now you have your dictionary of suffixed dataframes. If you want your dataframes as separate variables instead of keeping them in your dictionary, you can now set them back as individual objects:
df_ID = dfs['_ID']
df_RE = dfs['_RE']
... # and so forth
Putting it all together in an example
import pandas as pd
df = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'AC007', 'AC007', 'AC007'],
'AA_ID': [22, 22, 2, 2, 2],
'BB_ID': [4, 5, 6, 8, 9],
'CC_ID': [2, 2, 3, 3, 3],
'DD_RE': [4, 7, 8, 9, 0],
'EE_RE': [5, 8, 9, 9, 10]})
# Get unique suffixes
suffixes = list(set([col[-3:] for col in df.columns if '_' in col]))
dfs = {} # dataframes dictionary
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'CFV', 'SAP', 'SOS']})
df_RE = pd.DataFrame({'Name': ['AC007']})
dfs['_ID'] = df_ID
dfs['_RE'] = df_RE
for suffix in suffixes:
cols = [col for col in df.columns if suffix in col]
dfs[suffix] = pd.merge(df, dfs[suffix], on='Name')
dfs[suffix] = dfs[suffix][cols]
df_ID = dfs['_ID']
df_RE = dfs['_RE']
print(df_ID)
print(df_RE)
Result:
AA_ID BB_ID CC_ID
0 22 4 2
1 22 5 2
DD_RE EE_RE
0 8 9
1 9 9
2 0 10
You can first merge df with df_ID and then take the columns end with ID.
pd.merge(df,df_ID,on='Name')[[e for e in df.columns if e.endswith('ID') or e=='Name']]
Out[121]:
AA_ID BB_ID CC_ID Name
0 22 4 2 CTA15
1 22 5 2 CTA16
Similarly, this can be done for the df_RE df as well.
pd.merge(df,df_RE,on='Name')[[e for e in df.columns if e.endswith('RE') or e=='Name']]

Categories