I am able to change the sequence of columns using the code below, which I found on Stack Overflow. Now I am trying to convert it into a function for regular use, but it doesn't seem to do anything. PyCharm says the value of the local variable df_name is not used on the last line of my function.
Working Code
columnsPosition = list(df.columns)
F, H = columnsPosition.index('F'), columnsPosition.index('H')
columnsPosition[F], columnsPosition[H] = columnsPosition[H], columnsPosition[F]
df = df[columnsPosition]
My Function - doesn't work, need to make this work
def change_col_seq(df_name, old_col_position, new_col_position):
    columnsPosition = list(df_name.columns)
    F, H = columnsPosition.index(old_col_position), columnsPosition.index(new_col_position)
    columnsPosition[F], columnsPosition[H] = columnsPosition[H], columnsPosition[F]
    df_name = df_name[columnsPosition]  # PyCharm flags this line
I have tried adding a return to the last statement of the function, but I am unable to make it work.
To re-order the Columns
To change the position of 2 columns:
def change_col_seq(df_name: pd.DataFrame, old_col_position: str, new_col_position: str):
    df_name[new_col_position], df_name[old_col_position] = df_name[old_col_position].copy(), df_name[new_col_position].copy()
    df = df_name.rename(columns={old_col_position: new_col_position, new_col_position: old_col_position})
    return df
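A quick usage sketch (the column names 'F' and 'H' are placeholders from the question):

df = change_col_seq(df, 'F', 'H')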
To Rename the Columns
You can use the rename method (Documentation)
If you want to change the name of just one column:
def change_col_name(df_name, old_col_name:str, new_col_name:str):
df = df_name.rename(columns={old_col_name: new_col_name})
return df
If you want to change the names of multiple columns:
def change_col_name(df_name, old_col_name: list, new_col_name: list):
    df = df_name.rename(columns=dict(zip(old_col_name, new_col_name)))
    return df
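A usage sketch, assuming df has columns 'a' and 'b' (hypothetical names):

df = change_col_name(df, ['a', 'b'], ['x', 'y'])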
Related
I am trying to read 3 CSV files into 3 pandas DataFrames, but after executing the function the variable seems not to be available. I tried to create a blank DataFrame outside the function, then read and set the frame inside the function, but the frame stays blank.
# Load data from the csv file
def LoadFiles():
    x = pd.read_csv('columns_description.csv', index_col=None)
    print("Columns Description")
    print(f"Number of rows/records: {x.shape[0]}")
    print(f"Number of columns/variables: {x.shape[1]}")

LoadFiles()
x.head()
Python Notebook for above code with Error
In the second approach, I am trying to create a new DataFrame with some consolidated information from the dataset. The issue reappears: the variable seems to be no longer available.
# Understand the variables
y = pd.read_csv('columns_description.csv', index_col=None)

def refresh_y():
    var_y = pd.DataFrame(columns=['Variable', 'Number of unique values'])
    for i, var in enumerate(y.columns):
        var_y.loc[i] = [var, y[var].nunique()]

refresh_y()
Screenshot with error code and solution restructuring in the function
I am a bit new to Python. The code is a sample and does not represent actual data, and the function example uses a single column. I have multiple columns to refresh in this derived dataset based on further changes, hence the function approach.
When defining a function, if you want to use a variable that is defined inside the function, you should end with return var. Check this: Function returns None without return statement, and some tutorials on defining a function (https://learnpython.com/blog/define-function-python/).
A basic example to help you start with defining functions:
def sum_product(arg1, arg2):  # your function takes 2 arguments
    var1 = arg1 + arg2
    var2 = arg1 * arg2
    return var1, var2  # returns two values

new_var1, new_var2 = sum_product(3, 4)
For the first example, try modifying it like this:
def LoadFiles():
    var = pd.read_csv('columns_description.csv', index_col=None)
    print("Columns Description")
    print(f"Number of rows/records: {var.shape[0]}")
    print(f"Number of columns/variables: {var.shape[1]}")
    return var

x = LoadFiles()
x.head()
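The second example can be fixed the same way; a sketch (reusing the question's y, and returning var_y instead of discarding it):

def refresh_y():
    var_y = pd.DataFrame(columns=['Variable', 'Number of unique values'])
    for i, var in enumerate(y.columns):
        var_y.loc[i] = [var, y[var].nunique()]
    return var_y

var_y = refresh_y()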
Try the following code:
# Load data from the csv file
def LoadFiles():
    x = pd.read_csv('columns_description.csv', index_col=None)
    print("Columns Description")
    print(f"Number of rows/records: {x.shape[0]}")
    print(f"Number of columns/variables: {x.shape[1]}")
    return x

x2 = LoadFiles()
x2.head()
Variables in a function are only available inside the function. You may need to study scope; I recommend the following simple site about scope in Python:
https://www.w3schools.com/python/python_scope.asp
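A minimal illustration of the rule (hypothetical names):

def f():
    a = 1  # 'a' is local to f and disappears when f returns

f()
print(a)  # NameError: name 'a' is not defined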
def unique_unit_split(df):
    df_unit_list = df_master.loc[df_master['type'] == 'unit']
    df_unit_list = df_unit_list.key.tolist()
    for i in range(len(df_unit_list)):
        df_unit_list[i] = int(df_unit_list[i])
    split_1 = df_units.units.str.split('[","]', expand=True).stack()
    df_units_update = df_units.join(pd.Series(index=split_1.index.droplevel(1), data=split_1.values, name='unit_split'))
    df_units_final = df_units_update[df_units_update['unit_split'].isin(df_unit_list)]
    return df
Updated script: still not working
df_unit_list = []
split_1 = pd.DataFrame()
df_units_update = pd.DataFrame()
df_units_final = pd.DataFrame()

def unique_unit_split(df):
    df_unit_list = df_master.loc[df_master['type'] == 'unit']
    df_unit_list = df_unit_list.key.tolist()
    for i in range(len(df_unit_list)):
        df_unit_list[i] = int(df_unit_list[i])
    split_1 = df_units.units.str.split('[","]', expand=True).stack()
    df_units_update = df_units.join(pd.Series(index=split_1.index.droplevel(1), data=split_1.values, name='unit_split'))
    df_units_final = df_units_update[df_units_update['unit_split'].isin(df_unit_list)]
    return df
The function above originally worked when I split the two actions (the code up to and including the for loop was in one function, and everything from split_1 down was in another). Now that I have condensed them, I am getting a NameError (image attached). Does anyone know how I can resolve this issue and ensure my final df (df_units_final) is defined?
For more insight on this function: I have a df with comma-separated values in one column, and I needed to split that column, drop the [], and keep only the rows with the numbers I need, which were defined in the list "df_unit_list".
NameError Details
The issue was as stated above (df_units_final was never defined outside the function), AND my for loop was forcing the list to ints when the values in the other df were actually strings.
Working Code
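A sketch of what the corrected function might look like after both fixes (df_master and df_units are assumed to already exist, as in the question):

def unique_unit_split(df_units, df_master):
    # keep the keys as strings so isin() matches the string values in unit_split
    df_unit_list = df_master.loc[df_master['type'] == 'unit'].key.tolist()
    split_1 = df_units.units.str.split('[","]', expand=True).stack()
    df_units_update = df_units.join(pd.Series(index=split_1.index.droplevel(1), data=split_1.values, name='unit_split'))
    df_units_final = df_units_update[df_units_update['unit_split'].isin(df_unit_list)]
    return df_units_final

df_units_final = unique_unit_split(df_units, df_master)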
I would like to read in an Excel file and, using method chaining, convert the column names to lower case and replace any whitespace with _. The following code runs fine:
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = (pd.read_excel(filename, skiprows=5)
          .rename(columns=str.lower))
    return df
But the code below does not
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = (pd.read_excel(filename, skiprows=5)
          .rename(columns=str.lower)
          .rename(columns=str.replace(old=" ", new="_")))
    return df
After adding the str.replace line I get the following error: No value for argument 'self' in unbound method call. Can someone shed some light on what I can do to fix this error and why the above does not work?
In addition, when I use str.lower() I get the same error. Why does str.lower work but not str.lower()?
Here's a different syntax which I frequently use:
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = pd.read_excel(filename, skiprows=5)
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    return df
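If you want to keep the question's method chaining, a lambda works where the unbound str.replace cannot, because rename calls the function on each column name; a sketch:

def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = (pd.read_excel(filename, skiprows=5)
          .rename(columns=str.lower)
          .rename(columns=lambda c: c.replace(" ", "_")))  # each column name is passed in as c
    return df

This also touches the last question: rename expects a callable it can invoke later, so the unbound method str.lower works, while str.lower() tries to call the method immediately with no string to act on, hence the "No value for argument 'self'" error.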
Inside a function, I can't use an argument to define the name of the df in df.to_csv().
I have a long script to pull apart and understand. To do so, I want to save the different dataframes it uses and store them in order. I created a function to do this, which adds the order number, e.g. 01 (number_of_interim_exports), to the name (from an argument).
My problem is that I need to use this for multiple dataframe names, but the df.to_csv part won't accept an argument in place of df...
def print_interim_results_any(name, num_exports, df_name):
    global number_of_interim_exports
    global print_interim_outputs
    if print_interim_outputs == 1:
        csvName = str(number_of_interim_exports).zfill(2) + "_" + name
        interimFileName = "interim_export_" + csvName + ".csv"
        df.to_csv(interimFileName, sep=';', encoding='utf-8', index=False)  # problem line: 'df' is not the function's argument
        number_of_interim_exports += 1
I think I just screwed something else up; this works fine:
import pandas as pd

df = pd.DataFrame({1: [1, 2, 3]})

def f(frame):
    frame.to_csv("interimFileName.csv")

f(df)
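Applying the same idea to the original function is then just a matter of using the df_name parameter on the problem line; a sketch:

def print_interim_results_any(name, num_exports, df_name):
    global number_of_interim_exports
    global print_interim_outputs
    if print_interim_outputs == 1:
        csvName = str(number_of_interim_exports).zfill(2) + "_" + name
        interimFileName = "interim_export_" + csvName + ".csv"
        df_name.to_csv(interimFileName, sep=';', encoding='utf-8', index=False)  # use the parameter, not a global 'df'
        number_of_interim_exports += 1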
I want to filter my Spark DataFrame. In this DataFrame, there is a column of URLs.
I have tried to use os.path.exists(col("url")) to filter my DataFrame, but I got an error like
"string is needed, but column has been found".
Here is part of my code; pandas was used in it, and now I want to use Spark to implement the following:
bob_ross = pd.DataFrame.from_csv("/dbfs/mnt/umsi-data-science/si618wn2017/bob_ross.csv")
bob_ross['image'] = ""
# create a column for each of the 85 colors (these will be c0...c84)
# we'll do this in a separate table for now and then merge
cols = ['c%s' % i for i in np.arange(0, 85)]
colors = pd.DataFrame(columns=cols)
colors['EPISODE'] = bob_ross.index.values
colors = colors.set_index('EPISODE')
# figure out if we have the image or not, we don't have a complete set
for s in bob_ross.index.values:
    b = bob_ross.loc[s]['TITLE']
    b = b.lower()
    b = re.sub(r'[^a-z0-9\s]', '', b)
    b = re.sub(r'\s', '_', b)
    img = b + ".png"
    if os.path.exists("/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img):
        bob_ross.set_value(s, "image", "/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img)
        t = getColors("/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img)
        colors.loc[s] = t
bob_ross = bob_ross.join(colors)
bob_ross = bob_ross[bob_ross.image != ""]
Here is how I try to implement it with Spark; I am stuck at the error line:
from pyspark.sql.functions import *

bob_ross = spark.read.csv('/mnt/umsi-data-science/si618wn2017/bob_ross.csv', header=True)
bob_ross = bob_ross.withColumn("image", concat(lit("/dbfs/mnt/umsi-data-science/si618wn2017/images/"),
                                               concat(regexp_replace(regexp_replace(lower(col('TITLE')), r'[^a-z0-9\s]', ''), r'\s', '_'),
                                                      lit(".png"))))
# error line ---filter----
bob_ross.filter(os.path.exists(col("image")))
print(bob_ross.head())
You should be using the filter function, not an OS function.
For example
df.filter("image is not NULL")
os.path.exists only operates on the local filesystem, while Spark is meant to run on many servers, so that should be a sign you're not using the correct function.
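As a sketch, the error line from the question would then become a Spark-side filter (this keeps rows where the image column is set; it does not check the filesystem):

bob_ross = bob_ross.filter("image is not NULL")
print(bob_ross.head())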