Python Pandas: dropping all null columns

I'm importing a SQL table as a pandas DataFrame and trying to drop all completely empty columns:
equip = %sql select * from [coswin].[dbo].[Work Order]
df = equip.DataFrame()
#dropping empty columns
df.dropna(axis=1, how="all", inplace=True)
The problem is that the null columns are still there in the output, and no errors are raised.

Are you sure the columns you want to remove are full of null values? You might check with df.isna().sum() if you haven't.
Also, you could use pd.read_sql() to read your data directly into a DataFrame.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
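A minimal sketch of that approach (the connection string here is an assumption; adjust it to your SQL Server setup):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string for the coswin database.
engine = create_engine("mssql+pyodbc://user:password@server/coswin?driver=ODBC+Driver+17+for+SQL+Server")
df = pd.read_sql("SELECT * FROM [dbo].[Work Order]", engine)

print(df.isna().sum())             # how many nulls each column really has
df = df.dropna(axis=1, how="all")  # drops only columns where every value is NaN

Keep in mind that how="all" only drops a column if every value is NaN; if the "empty" columns actually contain empty strings, dropna won't touch them.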

Related

remove commas/quotation marks in column name in pandas or sql

I am trying to pull some columns from a Snowflake table using Python/SQLAlchemy into a pandas dataframe and subsequently do additional operations using Python/pandas.
However, it appears that the resulting dataframe has some quotation marks/commas in the column names.
Code follows below:
sql = '''SELECT 'concept_name', 'ndc'
FROM db.schema.tbl'''
df = pd.read_sql(sql, conn)
df.columns.to_list() #print out column names
This is the output I get for column names: ["'CONCEPT_NAME'", "'NDC'"]
How do I remove the special characters in each column name either in SQL itself or in pandas?
You can use the str.strip method to remove the special characters from the column names:
df.rename(columns=lambda x: x.strip("'"), inplace=True)
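A quick check of the fix, using the column names from your output:

import pandas as pd

df = pd.DataFrame(columns=["'CONCEPT_NAME'", "'NDC'"])
df = df.rename(columns=lambda x: x.strip("'"))
print(df.columns.to_list())  # ['CONCEPT_NAME', 'NDC']

On the SQL side, note that single-quoted tokens are string literals in Snowflake, so your query is likely returning the constant strings 'concept_name' and 'ndc' on every row rather than the actual column values; writing SELECT concept_name, ndc without quotes should fix both the data and the names.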

Deleting Rows in a Dataframe

I want to delete some specific rows in a DataFrame in Python. The dataframe consists of a series of tables, and I have to delete the rows where only the first cell has a value, for example the rows highlighted in yellow below.
If the unwanted rows have a specific substring in common, you can filter them out explicitly:
df_new = df[~df.columnName.str.contains("FINANCIAL SERVICES")]
and if the remaining cells in those rows are NULL, use dropna:
df.dropna(subset=df.columns[1:], how='all', inplace=True)
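A small end-to-end sketch on made-up data (the column names here are placeholders, not from the question):

import pandas as pd

df = pd.DataFrame({
    "Section": ["FINANCIAL SERVICES", "Acme Corp", "RETAIL"],
    "Amount":  [None, 100, None],
    "Region":  [None, "East", None],
})

# Drop rows containing the unwanted substring...
df = df[~df["Section"].str.contains("FINANCIAL SERVICES")]
# ...then drop rows where every cell after the first column is null.
df.dropna(subset=df.columns[1:], how="all", inplace=True)
print(df)  # only the "Acme Corp" row survives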

Drop the entire row in the dataframe based on column 'Amount.Requested' having missing values

Suppose I have a column Amount.Requested with some missing values. I want to drop every row where Amount.Requested is missing, because if that column is empty there is no point in keeping that client's data in my sample.
To remove just the rows where that column is null, try
df = df.loc[~df['Amount.Requested'].isna()]
or
df = df.loc[df['Amount.Requested'] > 0]
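Note that the second option also drops rows with zero or negative amounts, since NaN > 0 evaluates to False. If you only want to remove the nulls, dropna with subset is equivalent to the first option:

df = df.dropna(subset=['Amount.Requested'])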

appending in pandas - row wise

I'm trying to append two columns of my dataframe to an existing dataframe with this:
dataframe.append(df2, ignore_index=True)
and this does not seem to be working.
This is what I'm looking for (kind of): a dataframe with 2 columns and 6 rows.
The example isn't exact, and it uses two print statements to show the two dataframes, but I thought it would help to have the shape of the data in mind.
I tried to use concat(), but that leads to some issues as well.
dataframe = pd.concat([dataframe, df2])
but that appears to concat the second dataframe as new columns rather than new rows, in addition to giving NaN values:
any ideas on what I should do?
I assume this happened because your dataframes have different column names. Try assigning the first dataframe's column names to the second dataframe:
df2.columns = dataframe.columns
dataframe_new = pd.concat([dataframe, df2], ignore_index=True)
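For what it's worth, DataFrame.append was deprecated and then removed in pandas 2.0, so pd.concat is the right tool here anyway. A quick sketch with made-up data to show the resulting shape:

import pandas as pd

dataframe = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"x": [7, 8, 9], "y": [10, 11, 12]})

df2.columns = dataframe.columns  # align the column names first
dataframe_new = pd.concat([dataframe, df2], ignore_index=True)
print(dataframe_new.shape)       # (6, 2): six rows, two columns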

Spark DataFrame equivalent of pandas.DataFrame.set_index / drop_duplicates vs. dropDuplicates

The drop-duplicates method of Spark DataFrames is not working, and I think it is because the index column, which was part of my dataset, is being treated as a column of data. There definitely are duplicates in there; I checked by comparing COUNT() and COUNT(DISTINCT()) on all the columns except the index. I'm new to Spark DataFrames, but if I were using pandas, at this point I would call pandas.DataFrame.set_index on that column.
Does anyone know how to handle this situation?
Secondly, there appears to be 2 methods on a Spark DataFrame, drop_duplicates and dropDuplicates. Are they the same?
If you don't want the index column to be considered when checking for distinct records, you can drop the column with the command below, or select only the columns you need.
df = df.drop('p_index')        # pass the name of the column to be dropped
df = df.select('name', 'age')  # or select only the required columns
drop_duplicates() is an alias for dropDuplicates().
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates
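If you would rather keep the index column in the result, dropDuplicates also accepts a subset of columns to consider. A minimal sketch ('p_index' is a placeholder for your index column):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "alice", 30), (2, "alice", 30)],
    ["p_index", "name", "age"],
)

# Deduplicate on every column except the index.
deduped = df.dropDuplicates([c for c in df.columns if c != "p_index"])
deduped.show()  # one of the two duplicate rows is removed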
