remove commas/quotation marks in column name in pandas or sql - python

I am trying to pull some columns from a Snowflake table using Python/SQLAlchemy into a pandas dataframe and subsequently do additional operations on them with Python/pandas.
However, the resulting dataframe has stray quotation marks in the column names.
Code follows below:
sql = '''SELECT 'concept_name', 'ndc'
FROM db.schema.tbl'''
df = pd.read_sql(sql, conn)
df.columns.to_list() #print out column names
This is the output I get for column names: ["'CONCEPT_NAME'", "'NDC'"]
How do I remove the special characters in each column name either in SQL itself or in pandas?

You can use the str.strip method to strip the quote characters from the column names:
df.rename(columns=lambda x: x.strip("'"), inplace=True)
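As for fixing it in SQL itself: single-quoted tokens in Snowflake SQL are string literals rather than identifiers, which is why the quotes end up in the column names (and the query likely returns the literal text rather than the column values). A sketch of the query-side fix, assuming the columns exist under those names (double quotes would only be needed for case-sensitive identifiers):
sql = '''SELECT concept_name, ndc
FROM db.schema.tbl'''
df = pd.read_sql(sql, conn)
df.columns.to_list()  # column names come back without stray quotes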

Related

Replacing strings in one column with the details of another column in a single dataframe

I need to be able to replace values in one column with values from another column in a single dataframe.
I imported this Excel file as a pandas dataframe; how do I replace the values in the left column (Freedom Town) with the part of the right column that comes after the hyphen (Fluent)?
You can use str.split to split on ' - ', then keep only the second part:
# replace colA & colB by real column names
df['colA'] = df['colB'].str.split(' - ').str[1]
Use Series.str.extract to capture the values after the hyphen (a raw string avoids invalid-escape warnings):
df['colA'] = df['colB'].str.extract(r'\s+-\s+(.*)', expand=False)
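A quick sketch showing that both approaches produce the same result (the right-column value here is made up from the question's example):
import pandas as pd

df = pd.DataFrame({'colB': ['Freedom Town - Fluent']})
df['colA'] = df['colB'].str.split(' - ').str[1]                    # 'Fluent'
df['colA'] = df['colB'].str.extract(r'\s+-\s+(.*)', expand=False)  # 'Fluent'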

PySpark Replace Characters using regex and remove column on Databricks

I am trying to remove a column and some special characters from the dataframe shown below.
The code below used to create the dataframe is as follows:
dt = pd.read_csv(StringIO(response.text), delimiter="|", encoding='utf-8-sig')
The above produces a dataframe whose first column name contains stray special characters.
I need help with a regex to remove those characters and with deleting the first column.
As regards regex, I have tried the following:
dt.withColumn('COUNTRY ID', regexp_replace('COUNTRY ID', #"[^0-9a-zA-Z_]+"_ ""))
However, I'm getting a syntax error.
Any help much appreciated.
If the position of the incoming column is fixed, you can use a regex to remove the extra characters from the column name, like below:
import re
colname = pdf.columns[0]
colt = re.sub(r"[^0-9a-zA-Z_\s]+", "", colname)  # keep only letters, digits, underscores and whitespace
print(colname, colt)
pdf.rename(columns={colname: colt}, inplace=True)
And for dropping the index column you can refer to this Stack Overflow answer.
You have read the data in as a pandas dataframe, but from what I can see you want a Spark dataframe. Convert from pandas to Spark and rename the columns; the conversion drops the pandas default index column, which in your case is the first column. Code below:
df = spark.createDataFrame(df).toDF('COUNTRY', 'COUNTRY NAME')
df.show()
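For completeness, the regexp_replace call the question attempted would be written like this once you have a Spark dataframe (a sketch; sdf stands in for the converted Spark frame, and note regexp_replace rewrites the column's values, not its name):
from pyspark.sql.functions import regexp_replace

sdf = sdf.withColumn('COUNTRY ID', regexp_replace('COUNTRY ID', "[^0-9a-zA-Z_]+", ""))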

Python Pandas: dropping all null columns

Importing a SQL table as a pandas dataframe and dropping all completely empty columns:
equip = %sql select * from [coswin].[dbo].[Work Order]
df = equip.DataFrame()
#dropping empty columns
df.dropna(axis=1, how="all", inplace=True)
The problem is that the empty columns are still there, and the code runs without any errors.
Are you sure the columns you want to remove are full of null values? You might check with df.isna().sum() if you haven't.
Also, you could use pd.read_sql() to read your data directly into a DataFrame.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
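A minimal sketch of that, assuming conn is an open connection to the same database. Note that dropna only removes columns whose values are all NaN, so a column full of empty strings will survive it:
import pandas as pd

df = pd.read_sql("SELECT * FROM [coswin].[dbo].[Work Order]", conn)
df = df.dropna(axis=1, how="all")  # drops columns where every value is NaN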

How can I mask a pandas dataframe column in logging output?

I have to log some pandas dataframe output that contains sensitive information, and I would rather not have this info in the logs or printed in the terminal.
I normally write a little function that takes a string and masks it with a regex, but I am having trouble doing that with a dataframe. Is there any way to mask a column (or columns) of sensitive info in a dataframe just for logging? The method I have tried below changes the dataframe itself, making the column unusable down the line.
def hide_by_pd_df_columns(dataframe, columns, replacement=None):
    '''hides/replaces a pandas dataframe column with a replacement'''
    for column in columns:
        replacement = '*****' if replacement is None else replacement
        dataframe[column] = replacement
    return dataframe
What I want to happen is the ***** mask to only exist in logging and not in the rest of the operations.
Make sure to copy the dataframe with df.copy if you want to leave the original df as is:
def hide_by_pd_df_columns(dataframe, columns, replacement=None):
    '''hides/replaces pandas dataframe columns with a replacement, on a copy'''
    df = dataframe.copy()
    replacement = '*****' if replacement is None else replacement
    for column in columns:
        df[column] = replacement
    return df
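A usage sketch (the column name 'ssn' is made up) showing the masked copy going to the log while the original frame stays usable:
import logging

logging.basicConfig(level=logging.INFO)
logging.info("\n%s", hide_by_pd_df_columns(df, ['ssn']))
# df['ssn'] is unchanged and still usable downstream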

How to get python dataframe column name using filter

I have imported a csv file as a pandas dataframe; it has some 250+ columns, and the last column's name starts with 'Unnamed:' with some digits attached to it, like "Unnamed: 1272".
I want to get the column name that starts with 'Unnamed'. The script below didn't help.
dfColumns = pd.DataFrame(data.columns, columns=['columnName'])
UnnamedColumnName = str(dfColumns.loc[dfColumns['columnName'].str.contains('Unnamed')])
Result: ' columnName\n1272 Unnamed: 1272'
I also tried the script below, but with no luck:
data.columns.str.contains('Unnamed')
Expected result: 'Unnamed: 1272' in the string variable "UnnamedColumnName"; I want to use this variable in a script that deletes columns.
If it's always the last column, you can just do
last_col = df.columns[-1]
You can also rename it using rename; note the mapping goes from old name to new name:
df = df.rename(columns={df.columns[-1]: 'new_name'})
Also, str.contains returns a boolean mask of the columns that match the string; you need to apply this mask to the columns index:
data.columns[data.columns.str.contains('Unnamed')]
which will return all the columns where the boolean condition is met.
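To then delete every matching column, a short sketch building on that mask:
unnamed_cols = data.columns[data.columns.str.contains('Unnamed')]
data = data.drop(columns=unnamed_cols)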
