Change Column Names in Dataframe via list of Column Names - python

I have a dataframe with a ton of columns. I would like to change a list of a sub set of the column names to all uppercase.
The code below doesn't change the column names and the other code I've tried produces errors:
df[cols_to_cap].columns = df[cols_to_cap].columns.str.upper()
What am I missing?

Try the below code, this uses the rename function.
rename_dict = {}
for each_column in list_of_cols_in_lower_case:
rename_dict[each_column] = each_column.upper()
df.rename(columns = rename_dict , inplace = True ) #inplace to True if you want the change to be applied to the dataframe

Related

Selecting specific columns in where condition using Pandas

I have a below Dataframe with 3 columns:
df = DataFrame(query, columns=["Processid", "Processdate", "ISofficial"])
In Below code, I get Processdate based on Processid==204 (without Column Names):
result = df[df.Processid == 204].Processdate.to_string(index=False)
But I wan the same result for Two columns at once without column names, Something like below code:
result = df[df.Processid == 204].df["Processdate","ISofficial"].to_string(index=False)
I know how to get above result but I dont want Column names, Index and data type.
Can someone help?
I think you are looking for header argument in to_string parameters. Set it to False.
df[df.Processid==204][['Processdate', 'ISofficial']].to_string(index=False, header=False)

Assigning value to pandas column not working

I'm checking if a dataframe is empty and then assigning a value if it is. Dataframe has columns "NAME" and "ROLE"
df = pd.DataFrame(columns = ['NAME', 'ROLE'])
if df.empty:
df["NAME"] = "Jake"
After assigning "Jake" to "NAME". The dataframe is still empty like so:
NAME
ROLE
but I want the dataframe to look like this:
NAME
ROLE
Jake
Assigning a scalar to a pandas dataframe sets each value in that column to the scalar. Since you have zero rows, df["NAME"] = "Jake" doesn't assign anything. If you assign a list however, the dataframe is extended for that list. To get a single row in the dataframe
df["NAME"] = ["Jake"]
You could create more rows by adding additional values to the list being assigned.
As people are saying in the comments, there are no rows in your empty dataframe to assign the value "Jake" to the "Name" column. Showing that in the first example:
df = pd.DataFrame(columns=['Name','Role'])
df['Name'] = 'Jake'
print(df)
I'm guessing instead you want to add a row:
df = pd.DataFrame(columns=['Name','Role'])
df = df.append({'Name':'Jake','Role':None},ignore_index=True)
print(df)
If you want to change more than one row value you can use .loc method in Pandas module:
change_index = data_["sample_column"].sample(150).index # index of rows whose value will change
data_["sample_column"].loc[sample_index] = 1 # value at specified index and specified column is changed or assigned to 1
Note : If you write as data_.loc[sample_index]["sample_column"]. = 1 it is not working! Because ["sample_column"] condition write as before .loc methods.

call dataframe from list of dataframes python

I have a use case where i have an unknown list of dfs that are generated from a groupby. The groupby is contained in a list. Once the groupby gets done, a unique df is created for each iteration.
I can dynamically create a list and dictionary of the dataframe names, however, I cannot figure out how to use the position in the list/dictionary as the name of the dataframe. data and Code below:
Code so far:
data_list = [['A','F','B','existing'],['B','F','W','new'],['C','M','H','new'],['D','M','B','existing'],['E','F','A','existing']];
# Create the pandas DataFrame
data_long = pd.DataFrame(data_list, columns = ['PAT_ID', 'sex', 'race_ethnicity','existing_new_client'])
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']];
def sex_race_summary(groupbyColumns):
grouplist = groupbyColumns[grouping].copy();
# create new column to capture the variable names for aggregation and lowercase the fields
df = data_long.groupby((groupbyColumns[grouping])).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index();
# create new column to capture the variable names for aggregation and lowercase the fields
df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower();
return df;
for grouping in range(len(groupbyColumns)):
exec(f'df_{grouping} = sex_race_summary(groupbyColumns)');
print(df_0)
# create a dictionary
dict_of_dfs = dict();
# a list of the dataframes
df_names = [];
for i in range(len(groupbyColumns)):
df_names.append('df_'+ str(i))
print (df_names)
What I'd like to do next is:
# loop through every dataframe and transpose the dataframe
for i in df_names:
df_transposed = i.pivot_table(index='sex', columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum').reset_index();
print(i)
The index of the list matches the suffix of the dataframe.
But i is being passed as a string and thus throwing an error. The reason I need to build it this way and not hard code the dataframes is because I will not know how many df_x will be created
Thanks for your help!
Update based on R Y A N comment:
Thank you so much for this! you actually gave me another idea, so i made some tweeks: this is my final code that results in what I want: a summary and transposed table for each iteration. However, I am learning that globals() are bad practice and I should use a dictionary instead. How could I convert this to a dictionary based process? #create summary tables for each pair of group by columns
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']];
for grouping in range(len(groupbyColumns)):
grouplist = groupbyColumns[grouping].copy();
prefix = grouplist[0];
# create new column to capture the variable names for aggregation and lowercase the fields
df = data_long.groupby((groupbyColumns[grouping])).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0);
# create new column to capture the variable names for aggregation and lowercase the fields
df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower();
#transpose the dataframe from long to wide
df_transposed = df.pivot_table(index= df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum').reset_index().fillna(0);
# create new df with the index as suffix
globals()['%s_summary'%prefix] = df;
globals()['%s_summary_transposed' %prefix] = df_transposed;
You can use the globals() method to index the global variable (here a DataFrame) based on the string value i in the loop. See below:
# loop through every dataframe and transpose the dataframe
for i in df_names:
# call globals() to return a dictionary of global vars and index on i
i_df = globals()[i]
# changing the index= argument to indirectly reference the first column since it's 'sex' in df_0 but 'race_ethnicity' in df_1
df_transposed = i_df.pivot_table(index=i_df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum')
print(df_transposed)
However with just this line added, the .pivot_table() function gives an index error since the 'sex' column exists only in df_0 and not df_1 in df_names. To fix this, you can indirectly reference the first column (i.e. index=i_df.columns[0]) in the .pivot_table() method to handle the different column names of the dfs coming through the loop

Writing string to empty dataframe not working, but works in other dataframes, how to fix?

I have created a dataframe on my local machine like so:
df1 = pd.DataFrame()
Next, I have added to columns to dataframe plus I want to assign a string to only one column. I am attempting to do this like so:
df1['DF_Name'] = 'test'
df1['DF_Date'] = ''
The columns get added successfully, but the string 'test' does not. When I apply this to other dataframe it works perfectly fine. What am I doing wrong?
I am also trying to append dates into the same dataframe except using logic, and that is also not working. Logic is as so:
if max_created > max_updated:
df1['DF_Date'] = max_created
else:
df1['DF_Date'] = max_updated
Not sure what I am doing wrong.
Thank you in advance.
You need to add brackets, like:
df1 = pd.DataFrame()
df1['DF_Name'] = ['test']
df1['DF_Date'] = ['']
df1
you can't assign a single value to a pd.Series. With the brackets, it's a list instead.

How to add suffix and prefix to all columns in python/pyspark dataframe

I have a data frame in pyspark with more than 100 columns. What I want to do is for all the column names I would like to add back ticks(`) at the start of the column name and end of column name.
For example:
column name is testing user. I want `testing user`
Is there a method to do this in pyspark/python. when we apply the code it should return a data frame.
Use list comprehension in python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom python logic within the alias() function like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
To add prefix or suffix:
Refer df.columns for list of columns ([col_1, col_2...]). This is the dataframe, for which we want to suffix/prefix column.
df.columns
Iterate through above list and create another list of columns with alias that can used inside select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using inside select, do not forget to unpack list with asterisk(*). We can assign it back to same or different df for use.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return list of new columns(aliased).
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
for c in sdf.columns:
sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
return sdf
You can amend sdf.columns as you see fit.
You can use withColumnRenamed method of dataframe in combination with na to create new dataframe
df.na.withColumnRenamed('testing user', '`testing user`')
edit : suppose you have list of columns, you can do like -
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
output :
DataFrame[`First`: string, `Last`: string, `Age`: string]
here is how one can solve the similar problems:
df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df])
I had a dataframe that I duplicated twice then joined together. Since both had the same columns names I used :
df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
list(df.schema.names)[idx] + '_prec'),
range(len(list(df.schema.names))),
df)
Every columns in my dataframe then had the '_prec' suffix which allowed me to do sweet stuff

Categories