I have some data in an Excel file that I read with pandas' read_excel method.
I want to read the data in all columns as strings, including the date column.
The problem is that I want to keep the date column in its original string form. For example, the Excel file contains '31.01.2017' formatted as a date, and I want '31.01.2017' in my data frame.
I thought using the dtype parameter of read_excel with dtype=str was the correct approach. But pandas first reads the date column as datetime and then converts it to string, so I always end up with '2017-01-31 00:00:00' in my data frame.
Is there any way to do this?
The behavior of pandas makes sense:
If the Excel format of your date column is text, pandas will read the dates as strings by default.
If the Excel format of your date column is date, pandas will read the dates as dates.
However, you point out that in the Excel file the date column is formatted as a date. If that's the case, there is no string in your Excel file to begin with: the underlying data of the date column is stored as a float, and the string you are seeing is not the actual data. You can't read something as a raw string if it isn't a string.
Some more info: https://xlrd.readthedocs.io/en/latest/formatting.html
But let's say that, for some reason, you want Python to display the same format as Excel, but in string form, without looking at the Excel file.
First you'd have to find the format:
from openpyxl import load_workbook

wb = load_workbook('data.xlsx')
ws = wb.worksheets[0]
form = ws.cell(1, 5).number_format  # look at the cell you are interested in
print(form)
> '[$]dd/mm/yyyy;#'
and then convert it to something the strftime function understands:
https://www.programiz.com/python-programming/datetime/strftime#format-code
form = form[3:-2]  # strip the leading '[$]' and trailing ';#'
form = form.replace('dd', '%d')
form = form.replace('mm', '%m')
form = form.replace('yyyy', '%Y')
print(form)
> '%d/%m/%Y'
And apply it:
df.loc[:,"date_field"].apply(lambda x: x.strftime(form))
> 0 01/02/2018
1 02/02/2018
2 03/02/2018
3 04/02/2018
4 05/02/2018
However, if you're working with multiple Excel date formats you'd have to build a strftime mapping for each of them, for example like the sketch below.
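A minimal sketch of such a mapping, reusing the ws and df from above; the number-format strings in the dictionary are assumptions, so extend it with whatever formats actually occur in your workbook:

# hypothetical mapping from Excel number formats to strftime codes
EXCEL_TO_STRFTIME = {
    '[$]dd/mm/yyyy;#': '%d/%m/%Y',
    'dd.mm.yyyy': '%d.%m.%Y',
    'mm/dd/yy': '%m/%d/%y',
}

form = EXCEL_TO_STRFTIME.get(ws.cell(1, 5).number_format)
if form is not None:
    df['date_field'] = df['date_field'].apply(lambda x: x.strftime(form))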
There will probably be more practical ways of doing this, like receiving the data in csv format or just keeping the dates in Excel's text format in the first place.
As you are trying to keep the date column in its initial type, the following code may help you. The first line inserts into the variable "cols" all the columns except the date column, and the following two lines change the type of the remaining columns:
cols = [i for i in df.columns if i not in ["Date_column"]]
for col in cols:
    df[col] = df[col].astype('category')
Hope it helps! :-)
Alternatively, once pandas has parsed the column as datetime, you can convert it back to the original string format:
df['date_column'] = df['date_column'].dt.strftime('%d.%m.%Y')
I am in a tricky situation.
I have two columns of string type that are supposed to be DateType in the final output. So, to achieve that, I am passing a Python dictionary:
d = {"col1": DateType(), "col2": DateType()}
The above is passed to a function that applies this transformation to the data frame:
df = func.cast_attribute(df2, p_property.d)
col1 and col2 are in the "yyyy-MM-dd" format. But as per the business requirement, I need the "dd-MM-yyyy" format. To do that I was advised to use the PySpark built-in function date_format():
col_df = df.withColumn("col1", date_format("col1", format="dd-MM-yyyy"))
But the problem is that the above transformation from yyyy-MM-dd to dd-MM-yyyy converts the data type of the two columns back to string.
Note
func.cast_attribute() is defined as:
from pyspark.sql.functions import col

def cast_attribute(df, mapper={}):
    """Cast columns of the dataframe.

    Usage: cast_attribute(your_df, {'column_name': StringType()})

    Args:
        df (DataFrame): DataFrame to convert some of the attributes.
        mapper (dict, optional): Dictionary mapping column name to new datatype.

    Returns:
        DataFrame: Returns the dataframe with converted columns.
    """
    for column, dtype in mapper.items():
        df = df.withColumn(column, col(column).cast(dtype))
    return df
Please suggest how to keep the date type in the final data frame.
A date object does not have a format attached to it. It is just a date. The format only makes sense when you want to print that date somewhere (console, text file...). In that case, Spark needs to decide how to transform that date into a string before printing it. By default, Spark uses its default format yyyy-MM-dd.
Here is the documentation of date_format:
Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
Therefore, if you want to keep the date type, do not use date_format. If your problem is how the date is serialized when written to a CSV or a JSON file, you can configure the format with the dateFormat option of the DataFrameWriter.
from pyspark.sql import functions as F

df = spark.range(1).withColumn("date", F.current_date())
df.write.option("dateFormat", "dd-MM-yyyy").csv("data.csv")
> cat data.csv/*
0,31-12-2022
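If you also need the dd-MM-yyyy rendering inside the DataFrame itself (for display or export), one option is to keep the typed columns and derive a separate string column from them. A sketch, reusing the question's func.cast_attribute, df2 and mapping d:

from pyspark.sql import functions as F

df = func.cast_attribute(df2, d)  # col1 and col2 stay DateType
# the display column is a string; the original typed column is untouched
df = df.withColumn("col1_display", F.date_format("col1", "dd-MM-yyyy"))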
I have a pandas dataframe read in from Excel, as below. The column labels are dates.
Given today's date is 2020-04-13, I want to retrieve the C row values for next 7 days.
Currently, I set the index and retrieve the values of C; when I print rows I get the output for all the dates and their values of C.
I know that I should use date.today(). Can someone let me know how to capture the column for today's date (2020-04-13) in the C row? I am a beginner to python/pandas and am just learning the concepts of dataframes.
input_path = "./data/input/data.xlsx"
pd_xls_obj = pd.ExcelFile(input_path)
data = pd.read_excel(pd_xls_obj,sheet_name="Sheet1",index_col='Names')
rows = data.loc["C"]
An easy way to do it is to load the data from the workbook with headers and then do (in words) something like: show me from data[column: where the date is today][row: where data['Names'] equals 'C'].
I would not go for the version where you use a column (which anyway only has unique values) as the index.
Code example below; I needed to use try/except, because one of your headers is a string and calling .date() on it would throw an error.
import pandas as pd
import datetime

INPUT_PATH = "C:/temp/data.xlsx"
pd_xls_obj = pd.ExcelFile(INPUT_PATH)
data = pd.read_excel(pd_xls_obj, sheet_name="Sheet1")
column_headers = data.columns

# loop through headers and check if today's date is equal to the column header
for column_header in column_headers:
    # python would throw an AttributeError for all headers which are not
    # datetimes; in order to avoid that, we use a try - except
    try:
        # if the dates are equal, print an output
        if column_header.date() == datetime.date.today():
            # you can read the statement which puts together your result as
            # follows:
            # data[header_i_am_interested_in][where: data['Names'] equals 'C']
            result = data[column_header][data['Names'] == 'C']
            print(result)
    except AttributeError:
        pass
It's unorthodox in pandas to use dates as column labels instead of as the row index. Since pandas dtypes go by column, not by row, pandas won't detect the column labels as 'datetime' rather than string/object, so comparison and arithmetic operators on them won't work properly, and you'll have to do lots of unnecessary, avoidable manual conversions to/from datetime. Instead:
You should transpose the dataframe immediately at read-time:
data = pd.read_excel(...).T
Now your dates will be in one single column with the same dtype, and you can convert it with pd.to_datetime().
Then, make sure the dtypes are correct, i.e. the index's dtype should be 'datetime', not 'object', 'string' etc. (Please post your dataset or URL in the question to make this reproducible).
Now 'C' will be a column instead of a row.
You can access your entire 'C' column with:
rows = data['C']
... and similarly you can write an expression to select the rows for your desired dates; a sketch follows below. Please post a data snippet in the question to make this fully reproducible.
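A minimal sketch under those assumptions; the path and sheet name are taken from the question, and errors='coerce' is used because one header is reportedly not a date:

import pandas as pd

data = pd.read_excel("./data/input/data.xlsx", sheet_name="Sheet1", index_col="Names").T
data.index = pd.to_datetime(data.index, errors="coerce")  # non-date headers become NaT
data = data[data.index.notna()].sort_index()              # drop them and sort by date

today = pd.Timestamp.today().normalize()
rows = data.loc[today : today + pd.Timedelta(days=6), "C"]  # 'C' values for the next 7 days
print(rows)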
I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, dependent on the workbook, the data will be missing the leading zeros or will have more than 11 characters (but those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
print(x.zfill(13)[:13])
I would like to know the most optimal way to apply that format to the existing values, and only if those values are present (i.e. not touching null values).
Also, I need to ensure that the date columns are truly date values. Sometimes the dates are formatted as MM-DD-YYYY or sometimes MM/DD/YY, etc., and any of those are fine, but what is not fine is if the actual value in the date column is an Excel serial number that Excel can format as a date. Is there some way to apply validation logic to an entire dataframe column to ensure that there is a valid date instead of a serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there, you could try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
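A minimal sketch of that check, with one assumption made explicit: bare numbers are treated as Excel serials up front, because pd.to_datetime would otherwise read them as epoch-based timestamps instead of coercing them to NaT:

import pandas as pd

# flag values that are really numbers (likely Excel serials), then parse the rest;
# errors='coerce' turns anything unparseable into NaT so you can inspect it
is_numeric = pd.to_numeric(df['Date'], errors='coerce').notna()
parsed = pd.to_datetime(df['Date'].where(~is_numeric), errors='coerce')
print(df[parsed.isna()])  # rows holding serial numbers or unparseable dates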
For your STOCK_NUM question, you could apply a function to the column, but the way I approach this is with list comprehensions. The first thing I would do is replace all the NAs in your STOCK_NUM column with a unique string, and then apply the list comprehension as you can see in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
#replace NAs with a string
df.STOCK_NUM.fillna('IS_NA',inplace=True)
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
Then, for your question about converting an Excel serial number to a date, I looked at an already-answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime
def xldate_to_datetime(xldate):
    # Excel day 1 is 1900-01-01; subtracting 2 days accounts for Excel's
    # 1-based numbering and its phantom 1900-02-29
    temp = datetime.datetime(1900, 1, 1)
    delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
    return pd.to_datetime(temp + delta)

df['Date'] = [xldate_to_datetime(i) if type(i) == int else pd.to_datetime(i) for i in df.Date]
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.
Like the title states, I have two csv files I have read into Pandas dataframes, and I want to join the two tables on their "Dates" column values. I'm having an issue converting the special character "/" to "-" and switching the ordering to year-month-day. Is there an easy, quick way to convert all the row values from the "mm/dd/yyyy" format to the correct "yyyy-mm-dd" format for the join?
When you read your csv, add parse_dates:
pd.read_csv('q.csv', parse_dates=True)
Or
pd.read_csv('q.csv', parse_dates=['Dates'])  # your date column here
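For the join itself, a minimal sketch (the file names are placeholders): once both 'Dates' columns are real datetimes, the original string formats no longer matter for the merge.

import pandas as pd

left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')

# normalize both columns to datetime, regardless of the original string format
left['Dates'] = pd.to_datetime(left['Dates'], format='%m/%d/%Y')  # mm/dd/yyyy source
right['Dates'] = pd.to_datetime(right['Dates'])                   # already yyyy-mm-dd

merged = left.merge(right, on='Dates')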
I have a new Data-frame df. Which was created using:
df= pd.DataFrame()
I have a date value called 'day' which is in the format dd-mm-yyyy, and a cost value called 'cost'.
How can I append the date and cost values to the df and assign the date as the index?
So for example if I have the following values
day = 01-01-2001
cost = 123.12
the resulting df would look like
date cost
01-01-2001 123.12
I will eventually be adding paired values for multiple days, so the df will eventually look something like:
date cost
01-01-2001 123.12
02-01-2001 23.25
03-01-2001 124.23
: :
01-07-2016 2.214
I have tried to append the paired values to the data frame but am unsure of the syntax. I've tried various things, including the below, but without success.
df.append([day,cost], columns='date,cost',index_col=[0])
There are a few things here. First, making a column the index goes like this, though you can also do it when you load the dataframe from a file (see below):
df.set_index('date', inplace=True)
To add new rows, you should write them out to file first. Pandas isn't great at adding rows dynamically, and this way you can just read the data in when you need it for analysis.
new_row = ...  # a row of new data in string format with values
               # separated by commas and ending with \n
with open(path, 'a') as f:
    f.write(new_row)
You can do this in a loop, or singly, as many times as you need. Then when you're ready to work with it, you use:
df = pd.read_csv(path, index_col=0, parse_dates=True)
index_col=0 makes the first column on disk the index (index_col can also take the column name as a string, e.g. index_col='date'). Passing parse_dates=True will turn the datetime strings that you declared as the index into datetime objects.
Try this:
dfapp = pd.DataFrame([[day, cost]], columns=['date', 'cost']).set_index('date')
df = df.append(dfapp)  # append() returns a new DataFrame, so assign it back
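Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions a pd.concat sketch like the following is the safer equivalent:

import pandas as pd

df = pd.DataFrame()
day, cost = '01-01-2001', 123.12

# build a one-row frame with the date as index, then concatenate
row = pd.DataFrame([[day, cost]], columns=['date', 'cost']).set_index('date')
df = pd.concat([df, row])
print(df)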