How can I mask a pandas dataframe column in logging output?

I have to log some pandas dataframe output that contains sensitive information, and I would rather this info not end up in the logs or print to the terminal.
I normally write a little function that can take a string and mask it with a regex, but I am having trouble doing that with a dataframe. Is there any way to mask a column (or columns) of sensitive info in a dataframe just for logging? The method I have tried below changes the dataframe itself, making the column unusable down the line.
def hide_by_pd_df_columns(dataframe, columns, replacement=None):
    '''hides/replaces a pandas dataframe column with a replacement'''
    for column in columns:
        replacement = '*****' if replacement is None else replacement
        dataframe[column] = replacement
    return dataframe
What I want to happen is for the ***** mask to exist only in the logging output and not in the rest of the operations.

Make sure to copy the dataframe with df.copy() if you want to leave the original df as is:
def hide_by_pd_df_columns(dataframe, columns, replacement=None):
    '''hides/replaces a pandas dataframe column with a replacement'''
    df = dataframe.copy()
    for column in columns:
        replacement = '*****' if replacement is None else replacement
        df[column] = replacement
    return df
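A minimal usage sketch (the logger setup and the column names here are illustrative, not from the original post): the masked copy goes to the logger while the original dataframe stays usable downstream.
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# hypothetical data for illustration only
df = pd.DataFrame({'user': ['a', 'b'],
                   'ssn': ['111-11-1111', '222-22-2222']})

# log the masked copy; df itself is left untouched
logger.info("\n%s", hide_by_pd_df_columns(df, ['ssn']))

# the original column is still usable down the line
assert (df['ssn'] != '*****').all()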

Related

How to convert a dataframe to JSON without the index?

I have the following dataframe output that I would like to convert to JSON, but it includes a leading 0 (the row index), which gets added to the JSON. How do I remove it? Pandas numbers each row by default.
            id  version  ...                                 token  type_id
0  10927076529        0  ...  56599bb6-3b56-425b-8688-8fc0c73fbedc        3
{"0":{"id":10927076529,"version":0,"token":"56599bb6-3b56-425b-8688-8fc0c73fbedc","type_id":3}}
df = df.rename(columns={'id': 'version', 'token': 'type_id' })
df2 = df.to_json(orient="index")
print(df2)
Pandas has that 0 value as the row index for your single DataFrame entry. You can't remove it in the actual DataFrame as far as I know.
This is showing up in your JSON specifically because you're using the "index" option for the "orient" parameter.
If you want each row in your final dataframe to be a separate entry, you can try the "records" option instead of "index".
df2 = df.to_json(orient="records")
The pandas to_json documentation has a good illustration of the different orient options.
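For the sample row above, orient="records" would drop the positional index entirely; a sketch of the expected output, based on the four keys visible in the index-oriented JSON shown in the question:
df2 = df.to_json(orient="records")
print(df2)
# [{"id":10927076529,"version":0,"token":"56599bb6-3b56-425b-8688-8fc0c73fbedc","type_id":3}]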
Another option you have is to set one of your columns as the index, such as id or version. This keeps a meaningful key for each row instead of the default positional index provided by Pandas.
df = df.set_index('version')
df2 = df.to_json(orient="index")
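With the single sample row above, that would give something like (a sketch, assuming the frame shown in the question):
print(df2)
# {"0":{"id":10927076529,"token":"56599bb6-3b56-425b-8688-8fc0c73fbedc","type_id":3}}
Here the outer "0" is the version value for that row rather than the default positional index.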

PySpark Replace Characters using regex and remove column on Databricks

I am trying to remove a column and special characters from the dataframe shown below.
The code used to create the dataframe is as follows:
dt = pd.read_csv(StringIO(response.text), delimiter="|", encoding='utf-8-sig')
The above produces output where the first column header is polluted with the characters ï»¿ (which look like a UTF-8 byte-order mark rendered as mojibake).
I need help with a regex to remove those characters and with deleting the first column.
As regards regex, I have tried the following:
dt.withColumn('COUNTRY ID', regexp_replace('COUNTRY ID', #"[^0-9a-zA-Z_]+"_ ""))
However, I'm getting a syntax error.
Any help much appreciated.
If the position of the incoming column is fixed, you can use a regex to remove the extra characters from the column name, like below:
import re

colname = pdf.columns[0]
colt = re.sub(r"[^0-9a-zA-Z_\s]+", "", colname)
print(colname, colt)
pdf.rename(columns={colname: colt}, inplace=True)
And for dropping the index column, you can refer to this Stack Overflow answer.
You have read the data in as a pandas dataframe, but from what I see you want a Spark dataframe. Convert from pandas to Spark and rename the columns; the conversion drops the pandas default index column, which in your case is the first column you refer to. Code below:
df = spark.createDataFrame(dt).toDF('COUNTRY ID', 'COUNTRY NAME')
df.show()
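For reference, the regexp_replace attempt in the question fails because of the stray # and _ where a comma and plain string arguments belong; a syntactically valid version (a sketch, assuming a Spark dataframe df with a COUNTRY ID column) would be:
from pyspark.sql.functions import regexp_replace

# strip everything except letters, digits and underscores from the column values
df = df.withColumn('COUNTRY ID', regexp_replace('COUNTRY ID', r'[^0-9a-zA-Z_]+', ''))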

Retrieve a column name that has today date with respect to the specific row in dataframe

I have a pandas dataframe read in from Excel, as below. The column labels are dates.
Given that today's date is 2020-04-13, I want to retrieve the C row values for the next 7 days.
Currently, I set the index and retrieve the values of C; when I print rows, I get the output for all the dates and their values of C.
I know that I should use date.today(). Can someone let me know how to capture the column for today's date (2020-04-13) in the C row? I am a beginner with python/pandas and am just learning the concepts of dataframes.
input_path = "./data/input/data.xlsx"
pd_xls_obj = pd.ExcelFile(input_path)
data = pd.read_excel(pd_xls_obj,sheet_name="Sheet1",index_col='Names')
rows = data.loc["C"]
An easy way to do it is to load the data from the workbook with headers and then do (in words) something like: show me data[column: where the date is today][row: where data['Names'] equals 'C'].
I would not go for the version where you use a column (which anyway only has unique values) as the index.
Code example below; I needed to use try/except because one of your headers is a string, and calling .date() on it would throw an error.
import pandas as pd
import datetime

INPUT_PATH = "C:/temp/data.xlsx"

pd_xls_obj = pd.ExcelFile(INPUT_PATH)
data = pd.read_excel(pd_xls_obj, sheet_name="Sheet1")
column_headers = data.columns

# loop through headers and check if today's date is equal to the column header
for column_header in column_headers:
    # python would throw an AttributeError for all headers which are not
    # datetimes; in order to avoid that, we use a try - except
    try:
        # if the dates are equal, print an output
        if column_header.date() == datetime.date.today():
            # you can read the statement which puts together your result as
            # follows:
            # data[header_i_am_interested_in][where: data['Names'] equals 'C']
            result = data[column_header][data['Names'] == 'C']
            print(result)
    except AttributeError:
        pass
It's unorthodox in pandas to use dates as column labels instead of as the row index. Pandas dtypes go by column, not by row, so pandas won't detect the column labels as datetimes rather than strings/objects, and comparison and arithmetic operators on them won't work properly; you end up doing lots of unnecessary, avoidable manual conversion to and from datetime. Instead:
You should transpose the dataframe immediately at read-time:
data = pd.read_excel(...).T
Now your dates will all be in the index with a single dtype, and you can convert them with pd.to_datetime().
Then, make sure the dtypes are correct, i.e. the index's dtype should be 'datetime', not 'object', 'string' etc. (Please post your dataset or URL in the question to make this reproducible).
Now 'C' will be a column instead of a row.
You can access your entire 'C' column with:
rows = data.loc[:, 'C']
... and similarly you can write an expression for your subset of rows for your desired dates. Waiting for your data snippet to show the code.
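In the meantime, a minimal sketch of that approach, assuming the same Sheet1 layout with a 'Names' column (the path and the 7-day window come from the question; the rest is illustrative):
import pandas as pd

# transpose so dates become the row index and names become columns
data = pd.read_excel("./data/input/data.xlsx", sheet_name="Sheet1",
                     index_col="Names").T
data.index = pd.to_datetime(data.index)  # make sure the index is datetime
data = data.sort_index()                 # label slicing needs a sorted index

today = pd.Timestamp.today().normalize()
next_week = data.loc[today:today + pd.Timedelta(days=6), 'C']
print(next_week)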

Pandas - Merge DataFrame to Series when all column values are the same.

I just started using pandas, and I would like to reduce the amount of data that I get by merging my DataFrames in this way:
Load df
Check in which columns all values are the same
Delete other columns
Reduce df to single Series
Return
def merge_df(in_df):
    alist = []
    for col in in_df.columns:
        if len(in_df[col].unique()) == 1:
            alist.append(col)
    return in_df[alist].T.squeeze()[1]
Is there any more elegant way to do it? E.g. without looping through all columns?
Yes, you can remove duplicate data with a simple pandas function:
df.drop_duplicates()
You can refer to the drop_duplicates documentation.
For removing redundant data in a particular column, you can pass the column name via the "subset" parameter; it will remove the whole row for duplicated data.
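As a side note on the original loop: a vectorized alternative (a sketch, not part of the answer above) selects the constant columns with nunique and takes their first row as a Series:
def merge_df(in_df):
    # keep only the columns where every value is identical, then take row 0
    constant_cols = in_df.columns[in_df.nunique() == 1]
    return in_df[constant_cols].iloc[0]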

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)

for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')

# set the first column as the index
df = df.set_index([0])

# create a frame which we will build up
results = pd.DataFrame(index=df.index)

# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])

# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be a little daunting when first approached, even for someone who knows python well, so I will try to walk through this. I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume that you can get it into a frame; if you cannot, ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)

for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')

# set the first column as the index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
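Since the stated end goal is a presence/absence heat map, the boolean results frame can feed a plot directly; a minimal sketch, assuming matplotlib is available (the colormap is an arbitrary choice):
import matplotlib.pyplot as plt

# True/False -> 1/0 presence matrix; rows are the reference values
plt.imshow(results.astype(int).values, cmap='Greys', aspect='auto')
plt.yticks(range(len(results.index)), results.index)
plt.xlabel('column')
plt.ylabel('reference value')
plt.show()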
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice it, which allows me to take my cut at answering before they do. The pandas tag is very well covered for well-formed questions, so I am pretty sure that if this answer is not optimal, someone else will come by and improve it. In the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you are new to python, so I will just put in a plug for using a good IDE. I use PyCharm; it and other good IDEs can make working in python even more powerful, so I highly recommend them.
