How to update/apply validation to pandas columns - python

I am working on automating a process with Python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data, but PowerQuery is not as versatile as I need, so I am now using pandas. I have the process working up to the point where I can loop through files, select the columns that I need in the correct order depending on each workbook, and insert that into a dataframe. Once each dataframe is created, I concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 13 characters long. Sometimes, depending on the workbook, the data will be missing the leading zeros or will have more than 13 characters (and those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
    print(x.zfill(13)[:13])
I would like to know the best way to apply that format to the existing values, and only where values are present (i.e. leaving nulls untouched).
Also, I need to ensure that the date columns contain true date values. Sometimes the dates are formatted as MM-DD-YYYY and sometimes MM/DD/YY, etc., and any of those are fine; what is not fine is when the actual value in the date column is an Excel serial number that Excel can format as a date. Is there some way to apply validation logic to an entire dataframe column to ensure that there is a valid date instead of a serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!

Not an expert, but from things I could gather here and there, you could try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
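Putting both pieces together, a minimal runnable sketch (errors='coerce' is my addition, not part of the lines above: it turns unparseable values into NaT instead of raising, as an alternative to try/except):

import pandas as pd

dataset = [['51346812942315.01', '01-15-2018'],
           [None, '01-15-2018'],
           ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns=['STOCK_NUM', 'Date'])

# The .str accessor skips nulls, so NaN values pass through untouched.
df['STOCK_NUM'] = df['STOCK_NUM'].str.zfill(13).str.slice(0, 13)

# errors='coerce' yields NaT for values that cannot be parsed as dates.
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
print(df)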

For your STOCK_NUM question, you could apply a function to the column, but the way I approach this is with list comprehensions. The first thing I would do is replace all the NAs in your STOCK_NUM column with a unique string and then apply the list comprehension, as in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
# Replace NAs with a placeholder string
df['STOCK_NUM'] = df['STOCK_NUM'].fillna('IS_NA')
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
Then, for your question about converting Excel serial numbers to dates, I looked at an already-answered question. I am assuming that the serial numbers in your dataframe are integers:
import datetime
def xldate_to_datetime(xldate):
    # Excel's day 1 is 1900-01-01; subtracting 2 days corrects for the
    # one-based origin and Excel's phantom 1900-02-29 leap day.
    temp = datetime.datetime(1900, 1, 1)
    delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
    return pd.to_datetime(temp + delta)

df['Date'] = [xldate_to_datetime(i) if type(i) == int else pd.to_datetime(i) for i in df.Date]
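One caveat worth flagging: the type(i) == int test misses floats and numpy integer types, which Excel serials sometimes arrive as. A slightly more defensive variant (my tweak, not part of the answer above):

import numbers

# Treat any non-boolean numeric value as an Excel serial number;
# everything else goes through pd.to_datetime as before.
df['Date'] = [xldate_to_datetime(i) if isinstance(i, numbers.Number) and not isinstance(i, bool)
              else pd.to_datetime(i)
              for i in df.Date]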
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.

Related

How do I separate this dataframe column by month?

[Screenshot: a few rows of my dataframe]
The third column shows the completion time of my data. Ideally, I'd want that column to just show the date, removing the time portion of each element, but I'm not sure how to change the elements. I was able to change the (second) column of strings into a column of floats without the pound symbol in order to find the sum of costs. However, this column has no specific keyword I can select across all of the elements to remove.
The second part of my question is whether it is possible to easily create another dataframe that contains only 2021-05-xx or 2021-06-xx rows. I know there's a way to make another dataframe from certain rows, like the top 15 or bottom 7, but I don't know if there's a way to make a dataframe from what I described. I'm thinking it involves Series.str.contains(), but when I put '2021-05' in the parentheses, it returns an entire Series of Falses.
Extracting just the date and ignoring the time from the datetime column can be done by changing the formatting of the column.
df['date'] = pd.to_datetime(df['date']).dt.date
For the second part of the question, about creating a new dataframe filtered down to rows between 2021-05-xx and 2021-06-xx, we can use pandas filtering.
df_filtered = df[(df['date'] >= pd.to_datetime('2021-05-01')) & (df['date'] <= pd.to_datetime('2021-06-30'))]
Here we take advantage of two things: 1) pandas makes it easy to compare the chronology of dates using comparison operators, and 2) any date of the form 2021-05-xx or 2021-06-xx must fall on/after the first day of May and on/before the last day of June.
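An alternative to the explicit date bounds, once the column has been converted with pd.to_datetime, is to filter on the .dt accessor; the sample rows below are hypothetical stand-ins for the screenshot in the question:

import pandas as pd

df = pd.DataFrame({'date': ['2021-05-03 14:22:01',
                            '2021-06-11 09:05:44',
                            '2021-07-01 18:30:00']})
df['date'] = pd.to_datetime(df['date'])

# Keep only rows from May or June 2021.
df_filtered = df[(df['date'].dt.year == 2021) & (df['date'].dt.month.isin([5, 6]))]
print(df_filtered)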
There are also a few GUI's that make it easy to change the formatting of columns and to filter data without actually having to write the code yourself. I'm the creator of one of these tools, Mito. To filter dates in Mito, you can just enter the dates using our calendar input fields and Mito will generate the equivalent pandas code for you!

How to insert data into an existing dataframe, replacing values according to a conditional

I'm looking to insert information into an existing dataframe. This dataframe's shape is 2001 rows × 13 columns; however, only the first column has information.
I have 12 more columns, but these do not have the same dimensions as the main dataframe, so I'd like to insert these additional columns into the main one using a conditional.
Example dataframe:
This is an example. I want to insert the var column into the 2001 × 13 dataframe, using the date as the condition; where there is no matching date, it should skip the row or simply add a 0.
I'm really new to python and programming in general.
Without a minimal working example it is hard to give clear recommendations, but I think what you are looking for is the .loc indexer of a pd.DataFrame. What I would recommend doing is the following:
Selecting rows with .loc works better in your case if the dates are first converted to datetime, so the first step is to make this conversion:
# Pandas is quite smart about guessing date format. If this fails, please check the
# documentation https://docs.python.org/3/library/datetime.html to learn more about
# format strings.
df['date'] = pd.to_datetime(df['date'])
# Make this the index of your data frame.
df.set_index('date', inplace=True)
It is not clear how you intend to use conditionals or what the content of your other columns is. Using .loc, this is pretty straightforward:
# On Feb 1, 2020, add a value to column 'var'.
df.loc['2020-02-01', 'var'] = 0.727868
This could also be used for ranges:
# Assuming you have a second dataframe `df2` which has a datetime column 'date'
# with the data you wish to add to `df`. This will only work if every value in
# df2['date'] is found in df.index. Note the .values: it sidesteps index
# alignment, which would otherwise fill NaN. You can work out the logic for
# your case.
df.loc[df2['date'], 'var2'] = df2['vals'].values
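A minimal end-to-end sketch of that range assignment (the sizes and values here are made up):

import pandas as pd

# Main frame: a run of daily dates with a 'var' column defaulting to 0.
df = pd.DataFrame({'date': pd.date_range('2020-01-01', periods=10, freq='D')})
df.set_index('date', inplace=True)
df['var'] = 0

# Smaller frame holding values for only some of those dates.
df2 = pd.DataFrame({'date': pd.to_datetime(['2020-01-03', '2020-01-07']),
                    'vals': [0.727868, 0.514203]})

# .values sidesteps index alignment so the values land row for row.
df.loc[df2['date'], 'var'] = df2['vals'].values
print(df)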
If the logic is too complex and the dataframe is not too large, iterating with .iterrows could be easier, especially if you are new to Python.
for idx, row in df.iterrows():
    if idx in list_of_other_dates:
        df.loc[idx, 'var'] = ...  # some code here
Please clarify your problem a bit and you will get better answers. Do not forget to check the documentation.

Retrieve a column name that has today date with respect to the specific row in dataframe

I have a pandas dataframe read in from Excel, as below. The column labels are dates.
Given today's date is 2020-04-13, I want to retrieve the C row values for next 7 days.
Currently, I set the index and retrieve the values of C, when I print rows I get the output for all the dates and their values of C.
I know that I should use date.today(). Can someone let me know how to capture the column of today's date (2020-04-13) for the C row? I am beginner to python/pandas and am just learning the concepts of dataframes.
input_path = "./data/input/data.xlsx"
pd_xls_obj = pd.ExcelFile(input_path)
data = pd.read_excel(pd_xls_obj,sheet_name="Sheet1",index_col='Names')
rows = data.loc["C"]
An easy way to do it is to load the data from the workbook with headers and then do (in words) something like: show me data[column: where the date is today][row: where data['Names'] equals 'C'].
I would not go for the version where you use a column (which only has unique values anyway) as the index.
The code example is below; I needed to use try/except because one of your headers is a string and calling .date() on it would throw an error.
import pandas as pd
import datetime
INPUT_PATH = "C:/temp/data.xlsx"
pd_xls_obj = pd.ExcelFile(INPUT_PATH)
data = pd.read_excel(pd_xls_obj, sheet_name="Sheet1")
column_headers = data.columns
# Loop through the headers and check if today's date equals the column header.
for column_header in column_headers:
    # Python would throw an AttributeError for all headers which are not
    # datetimes. In order to avoid that, we use a try-except.
    try:
        # If the dates are equal, print an output.
        if column_header.date() == datetime.date.today():
            # You can read the statement which puts together your result as
            # follows:
            # data[header_i_am_interested_in][where: data['Names'] equals 'C']
            result = data[column_header][data['Names'] == 'C']
            print(result)
    except AttributeError:
        pass
It's unorthodox in pandas to use dates as column labels instead of as a row index. Pandas dtypes go by column, not by row, so pandas won't detect the column labels as 'datetime' rather than string/object, comparison and arithmetic operators on them won't work properly, and you end up doing lots of avoidable manual conversions to/from datetime. Instead:
You should transpose the dataframe immediately at read-time:
data = pd.read_excel(...).T
Now your dates will be in one single column with the same dtype, and you can convert it with pd.to_datetime().
Then, make sure the dtypes are correct, i.e. the index's dtype should be 'datetime', not 'object', 'string' etc. (Please post your dataset or URL in the question to make this reproducible).
Now 'C' will be a column instead of a row.
You can access your entire 'C' column with:
rows = data['C']
... and similarly you can write an expression to select the subset of rows for your desired dates. Waiting for your data snippet to show the code.
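In the meantime, a rough sketch of what that could look like, assuming the sheet layout described in the question (the path, sheet name, and index column are taken from the question's code):

import pandas as pd

# Transpose so the dates become a sortable datetime index and 'C' a column.
data = pd.read_excel("./data/input/data.xlsx", sheet_name="Sheet1",
                     index_col='Names').T
data.index = pd.to_datetime(data.index)
data = data.sort_index()

# Slice the 'C' column from today through the next 7 days.
today = pd.Timestamp.today().normalize()
print(data.loc[today:today + pd.Timedelta(days=7), 'C'])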

python distinguish between '300' and '300.0' for a dataframe column

Recently I have been developing some code to read a csv file and store key data columns in a dataframe. Afterwards I plan to have some mathematical functions performed on certain columns in the dataframe.
I've been fairly successful in storing the correct columns in the dataframe. I have been able to have it do whatever maths is necessary such as summations, additions of dataframe columns, averaging etc.
My problem lies in accessing specific columns once they are stored in the dataframe. I was working with a test file to get everything working and managed this no problem. The problems arise when I open a different csv file: it will store the data in the dataframe, but accessing the column I want no longer works and it stops at the calculation part.
From what I can tell, the problem lies in how the column names are read. The column names are all numbers, for example df['300'], df['301'], etc. Accessing the column as df['300'] works fine in the test file, while the next file requires df['300.0']; if I switch to a different file, it may require df['300'] again. All the data was obtained in the same way, so I am not certain why some are read as 300 and others as 300.0.
Short of constantly changing the column labels each time I open a different file, is there any way to have it automatically distinguish between '300' and '300.0' when opening the file, or to force '300.0' == '300'?
Thanks
In your dataframe df, one way to keep things consistent is to normalize all column names to a single type. You can convert every column name from a float string to an integer string, i.e. '300.0' to '300', via .columns as below. Then the integer-string form, e.g. df['300'], should work for every file.
df.columns = [str(int(float(column))) for column in df.columns]
Or, if the integer form is not required, the extra int conversion can be dropped and float strings used:
df.columns = [str(float(column)) for column in df.columns]
Then, df['300.0'] can be used instead of df['300'].
If string type is not required, then converting them to floats would work as well.
df.columns = [float(column) for column in df.columns]
Then, df[300.0] would work as well.
Another alternative for changing the column names is map (wrapped in list() so it also works on Python 3, where map returns an iterator):
Changing to float values for all columns, then, as mentioned above, use df[300.0]:
df.columns = list(map(float, df.columns))
Changing to string values of floats, then df['300.0']:
df.columns = list(map(str, map(float, df.columns)))
Changing to string values of ints, then df['300']:
df.columns = list(map(str, map(int, map(float, df.columns))))
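To make the behavior concrete, here is a small runnable demonstration of the integer-string normalization (the two frames are a hypothetical reproduction of headers arriving differently):

import pandas as pd

df_a = pd.DataFrame([[1, 2]], columns=['300', '301'])
df_b = pd.DataFrame([[3, 4]], columns=['300.0', '301.0'])

# Normalize both frames to integer-string labels.
for frame in (df_a, df_b):
    frame.columns = [str(int(float(c))) for c in frame.columns]

# Both frames can now be accessed with the same label.
print(df_a['300'].iloc[0], df_b['300'].iloc[0])  # prints: 1 3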
Some solutions:
Go through all the files, change the columns names, then save the result in a new folder. Now when you read a file, you can go to the new folder and read it from there.
Wrap the normal file read function in another function that automatically changes the column names, and call that new function when you read a file.
Wrap column selection in a function. Use a try/except block to have the function try to access the given column and, if that fails, fall back to the other form (see the sketch below).
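A sketch of that third option (the helper name is mine and purely illustrative):

def get_column(df, name):
    # Return a column whose label may be stored as '300' or '300.0'.
    try:
        return df[name]  # try the label as given, e.g. '300'
    except KeyError:
        return df[str(float(name))]  # fall back to the float form, e.g. '300.0'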
This answer assumes you want only the integer part to remain in the column name. It takes the column names and does a float->int->string conversion to strip the decimal places.
Be careful, if you have numbers like '300.5' as a column name, this will turn them into '300'.
cols = df.columns.tolist()
new_columns = dict([(c,str(int(float(c)))) for c in cols])
df = df.rename(columns = new_columns)
For clarity, most of the 'magic' happens on the middle line: I iterate over the existing columns and build (old_name, new_name) pairs, which dict() turns into a mapping. df.rename takes that dictionary and does the renaming for you.
My thanks to user Nipun Batra for this answer that explained df.rename.

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to Python. I am using Python 2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be a little daunting when first approached, even for someone who knows Python well, so I will try to walk through this. I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume you can get it into a frame; if you cannot, ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
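As an aside, the same output can be produced without the explicit loop. A vectorized sketch with numpy, equivalent on the sample data and reusing the results frame from above:

import numpy as np

# Where results is True, emit the (stringified) index value; otherwise ''.
out = pd.DataFrame(
    np.where(results, results.index.values[:, None].astype(str), ''),
    index=results.index, columns=results.columns)
print(out.to_csv(header=False), end='')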
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice it, which lets me take my cut at answering first. The pandas tag is very well covered for well-formed questions, so I am pretty sure that if this answer is not optimal, someone else will come by and improve it. In the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you are new to Python, so I will put in a plug for using a good IDE. I use PyCharm; it and other good IDEs can make working in Python even more productive, so I highly recommend them.
