I have a basic background in using R for data wrangling but am new to Python. I came across this code snippet from a tutorial on Coursera.
Can someone please explain what columns={col: 'Gold' + col[4:]}, inplace=True means?
(1) From my understanding, df.rename renames an existing column (in the case of the first line, to Gold), but why is there a need to add + col[4:] after it?
(2) Does setting inplace to True mean that the resulting df output is assigned back to the original df?
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)
Thank you in advance.
It means:
# for each column name
for col in df.columns:
    # check whether the first 2 characters are '01'
    if col[:2]=='01':
        # rename the column to 'Gold' plus everything from index 4 onward
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    # similar to the above
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    # similar to the above
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    # check the first character
    if col[:1]=='№':
        # replace the leading '№' with '#'
        df.rename(columns={col:'#'+col[1:]}, inplace=True)
Does declaring the function inplace as True mean to assign the resulting df output to the original dataframe?
Yes, you are right. It renames the columns in place.
if col[:2]=='01':
    # rename the column to 'Gold' plus everything from index 4 onward
    df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
(1) If col holds the column name '01xx1234':
1. col[:2] == '01' is True
2. 'Gold'+col[4:] => 'Gold'+'1234' => 'Gold1234'
3. So '01xx1234' is replaced by 'Gold1234'.
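If slicing is new to you, here is a quick sketch of how those slices behave, using the hypothetical column name '01xx1234' from the steps above:
col = '01xx1234'      # hypothetical column name
col[:2]               # '01'   -> the first two characters
col[4:]               # '1234' -> everything from index 4 onward
'Gold' + col[4:]      # 'Gold1234'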
(2) inplace=True applies the change directly to the dataframe and returns None rather than a new dataframe.
If you do not add this option, you have to assign the result back, like this:
df = df.rename(columns={col:'Gold'+col[4:]})
inplace=True means: The columns will be renamed in your original dataframe (df)
Your case (inplace=True):
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.rename(columns={"A": "a", "B": "c"}, inplace=True)
print(df.columns)
# Index(['a', 'c'], dtype='object')
# df already has the renamed columns, because inplace=True.
If you don't use inplace=True, then the rename method generates a new dataframe, like this:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
new_frame = df.rename(columns={"A": "a", "B": "c"})
print(df.columns)
# Index(['A', 'B'], dtype='object')
# It contains the old column names
print(new_frame.columns)
# Index(['a', 'c'], dtype='object')
# It's a new dataframe and has renamed columns
NOTE: In this case, the better approach is to assign the new dataframe back to the original variable (df):
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df = df.rename(columns={"A": "a", "B": "c"})
I did a pandas merge and now have two columns, col_x and col_y. I'd like to fill values in col_x from col_y, but only for rows where col_y is not NaN. I'd like to keep the original values in col_x and only replace them from col_y where col_x is NaN.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i': [0, 1, 2, 3],
    'c': [np.nan, {'a': 'A'}, np.nan, {'b': 'B'}],
    'd': [{'c': 'C'}, np.nan, {'d': 'D'}, np.nan]
})
Expected output:
i          c          d
0  {'c':'C'}  {'c':'C'}
1  {'a':'A'}     np.nan
2  {'d':'D'}  {'d':'D'}
3  {'b':'B'}     np.nan
Are you just trying to fillna?
df.c.fillna(df.d, inplace=True)
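For reference, a minimal sketch on the sample frame above; in recent pandas versions, calling fillna with inplace=True on a selected column may not write back to the parent frame, so assigning the result back is the safer spelling:
df['c'] = df['c'].fillna(df['d'])   # keep existing values in c, fill NaNs from d
print(df)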
You can use np.where()
So something like
df['c'] = np.where(df['c'].isna(), df['d'], df['c'])
should do the trick! The first parameter is the condition to check, the second is what to return if the condition is true, and the third is what to return if the condition is false.
Try:
df["c"] = [y if str(x) == "nan" else x for x,y in zip(df.c,df.d)]
There is probably a cleaner way, but this is one line.
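One caveat: str(x) == "nan" relies on NaN's string form. The x != x check (NaN is the only common value that is unequal to itself) avoids that:
df["c"] = [y if x != x else x for x, y in zip(df.c, df.d)]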
I have a dataframe with a column that contains repeated values in each row. It was the result of an inverse "explode" operation:
trello_dataframe = trello_dataframe.groupby(['Card ID', 'ID List'], as_index=True).agg({'Member (Full Name)': lambda x: x.tolist()})
How do I remove duplicate values in each row of the column?
I attach more information: https://prnt.sc/RjGazPcMBX47
I would like to have the data frame like this: https://prnt.sc/y0VjKuewp872
Thanks in advance!
You will need to target the column and apply np.unique:
import pandas as pd
import numpy as np

data = {
    'Column1': ['A', 'B', 'C'],
    'Column2': [[5, 0, 5, 0, 5], [5, 0, 5], [5]]
}
df = pd.DataFrame(data)
df['Column2'] = df['Column2'].apply(np.unique)
df
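Note that np.unique also sorts the values. If you want to keep the first-seen order within each list instead, the dict.fromkeys trick works (a sketch, assuming the list elements are hashable, as member-name strings would be):
df['Column2'] = df['Column2'].apply(lambda x: list(dict.fromkeys(x)))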
After renaming a DataFrame's column(s), I get an error when merging on the new column(s):
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 1]})
df1.columns = [['b']]
df1.merge(df2, on='b')
TypeError: only integer scalar arrays can be converted to a scalar index
When renaming columns, use DataFrame.columns = [list], not DataFrame.columns = [[list]]:
df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 1]})
df1.columns = ['b']
df1.merge(df2, on='b')
# b
# 0 1
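As far as I can tell, the underlying reason is that assigning a nested list to columns creates a MultiIndex rather than a flat Index, which merge then chokes on:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df1.columns = [['b']]
print(type(df1.columns))  # <class 'pandas.core.indexes.multi.MultiIndex'>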
I replaced tmp.columns = [['POR','POR_PORT']] with tmp.rename(columns={'Locode':'POR', 'Port Name':'POR_PORT'}, inplace=True) and it worked.
I have a dataframe, which consists of summary statistics of another dataframe:
df = sample[['Place','Lifeexp']]
df = df.groupby('Place').agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values([('Lifeexp', 'count')], ascending=False)
When looking at the structure, the dataframe has a multi index, which makes plot creations difficult:
df.columns
MultiIndex(levels=[['Lifeexp', 'Place'], ['count', 'mean', 'max', 'min', '']],
           labels=[[1, 0, 0, 0, 0], [4, 0, 1, 2, 3]])
I tried the solutions from different questions here (e.g. this one), but somehow don't get the desired result. I want df to have Place, count, mean, max, min as column names, with the Lifeexp level removed, so that I can create simple plots, e.g. df.plot.bar(x="Place", y='count').
I think the solution is simpler: select the column right after groupby to prevent a MultiIndex in the columns:
df = df.groupby('Place')['Lifeexp'].agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values('count', ascending=False)
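If you'd rather keep your original aggregation and flatten the MultiIndex afterwards, you can drop the Lifeexp level by hand (a sketch; it assumes, as in your output above, that the statistic names sit on the second level and Place has an empty entry there):
df.columns = [lvl1 if lvl1 else lvl0 for lvl0, lvl1 in df.columns]
# -> ['Place', 'count', 'mean', 'max', 'min']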
Unlike creating an empty dataframe and populating rows later, I have many, many dataframes that need to be concatenated.
If there were only two data frames, I can do this:
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df1.append(df2, ignore_index=True)
Imagine I have millions of dataframes that need to be appended/concatenated, one each time I read a new file into a DataFrame object.
But when I tried to initialize an empty dataframe and then adding the new dataframes through a loop:
import os
import pandas as pd

alldf = pd.DataFrame(columns=list('AB'))
for filename in os.listdir(indir):
    df = pd.read_csv(indir+filename, delimiter=' ')
    alldf.append(df, ignore_index=True)
This would return an empty alldf with only the header row, e.g.
alldf = pd.DataFrame(columns=list('AB'))
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
for df in [df1, df2]:
    alldf.append(df, ignore_index=True)
pd.concat() over a list of dataframes is probably the way to go, especially for clean CSVs. But if you suspect your CSVs are either dirty or could get read by read_csv() with mixed types between files, you may want to explicitly create each dataframe in a loop.
You can initialize a dataframe for the first file, and then have each subsequent file start with an empty dataframe based on the first:
df2 = pd.DataFrame(data=None, columns=df1.columns,index=df1.index)
This takes the structure of dataframe df1 but no data, and creates df2. If you want to force data types on the columns, you can do it to df1 when it is created, before its structure is copied.
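For example, a hypothetical sketch of forcing dtypes when the first frame is read (the filenames variable, column names, and dtypes here are made up):
df1 = pd.read_csv(indir + filenames[0], delimiter=' ',
                  dtype={'A': 'float64', 'B': 'object'})  # hypothetical columns/dtypes
df2 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)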
From @DSM's comment, this works:
import os
import pandas as pd

dfs = []
for filename in os.listdir(indir):
    df = pd.read_csv(indir+filename, delimiter=' ')
    dfs.append(df)

alldf = pd.concat(dfs)
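The same thing can be written as a single concat over a generator expression; ignore_index=True matches the ignore_index behavior of the append calls above:
alldf = pd.concat(
    (pd.read_csv(indir + f, delimiter=' ') for f in os.listdir(indir)),
    ignore_index=True
)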