How to reproduce this Excel formula in python-pandas - python

I have a pandas data frame with a column representing dates in the format yyyy-mm-dd. These are sorted oldest to newest. I want to add a column next to it with the difference in time between the date at that row and the previous date.
In Excel this would be something like:

Assuming your "date" column is stored as a datetime64 type, you can just do
df['difference'] = df.date.diff()
Check df.dtypes to ensure the date type is correct first.

solved it
data['lowered'] = data['date'].shift(+1)
data['difference'] = data['date'] - data['lowered']
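The two answers above should give the same result. A minimal sketch with made-up dates (the frame and the extra column names are illustrative assumptions, not from the question):

```python
import pandas as pd

# Made-up dates, sorted oldest to newest
df = pd.DataFrame({"date": pd.to_datetime(["2020-01-01", "2020-01-04", "2020-01-10"])})

# diff() subtracts the previous row from the current one
df["difference"] = df["date"].diff()

# The same thing written out with shift(): current date minus the previous date
df["difference_shift"] = df["date"] - df["date"].shift(1)

# Whole days as numbers; the first row has no previous date, so it is missing
df["difference_days"] = df["difference"].dt.days
```

The first row is NaT in both timedelta columns because there is no earlier date to subtract.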

Related

Python how to select specific cells on excel with pandas

I have an excel here as shown in this picture:
I am using pandas to read my excel file and it is working fine, this code below can print all the data in my excel:
import pandas as pd
df = pd.read_csv('alpha.csv')
print(df)
I want to get the values from C2 cell to H9 cell which month is October and day is Monday only. And I want to store these values in my python array below:
mynumbers= []
but I am not sure how should I do it, can you please help me?
You should consider slicing your dataframe and then using .values to store them. If you want them as a list, you can then call tolist():
First transform the Date column to a datetime:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
Then slice the frame and return the values of 'Column 2':
mynumbers = df[(df['Date'].dt.month == 10) & \
(df['Date'].dt.weekday == 0)]['Column 2'].values.tolist()
Assigning the following values to mynumbers:
[11,8]
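A self-contained sketch of the same filter, using made-up rows since the original spreadsheet is not shown. The day-month-year format and the 'Column 2' name are assumptions carried over from the answers:

```python
import pandas as pd

# Made-up data in dd-mm-yy form; the real file's layout is not known
df = pd.DataFrame({
    "Date": ["03-10-22", "04-10-22", "10-10-22", "07-11-22"],
    "Column 2": [11, 5, 8, 3],
})

# Parse with an explicit format so day and month cannot be swapped
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%y")

# weekday == 0 is Monday; month == 10 is October
is_october_monday = (df["Date"].dt.month == 10) & (df["Date"].dt.weekday == 0)
mynumbers = df.loc[is_october_monday, "Column 2"].tolist()
```

Here mynumbers holds the values for 3 and 10 October 2022, the only October Mondays in the sample.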
A first step would be to convert your Date column to datetime objects
import datetime
myDate = "10-11-22"
myDate = datetime.datetime.strptime(myDate, '%d-%m-%y')
Then, using myDate.month and myDate.weekday(), you can select the Mondays in October.

Length between two dates in a time series in pandas data frame

I have a time series composed of weekdays with anomalous/unpredictable holidays. On any given day, I want to know the length/number of rows to a date specified under column 'date1'. See below.
len(df.loc['2019-10-18':'2019-11-15']) returns the correct answer
I am trying to create a column 'shift' that will calculate the above.
Both DatetimeIndex and the 'date1' are dtype 'datetime64[ns]'
df['shift']=len(df.loc[df.index : df['date1']]) clearly doesn't work but might there be a solution that does?
IIUC use:
df['len'] = (df.index - df['date1']).dt.days
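Note that the expression above yields calendar days, not a row count. If the number of rows is what is needed, one possible sketch uses np.searchsorted on the index. It assumes the index is sorted and unique, and the dates below are made up:

```python
import numpy as np
import pandas as pd

# Made-up data: a sorted DatetimeIndex with gaps (weekend, holiday) and a target date per row
index = pd.to_datetime(["2019-10-18", "2019-10-21", "2019-10-22", "2019-10-24"])
df = pd.DataFrame(
    {"date1": pd.to_datetime(["2019-10-22", "2019-10-22", "2019-10-24", "2019-10-24"])},
    index=index,
)

# Calendar days between each row's index and its date1 (negative when date1 is later)
df["days"] = (df.index - df["date1"]).dt.days

# Position just past date1 in the index, minus the row's own position,
# gives the number of rows from the current row up to and including date1
end_pos = np.searchsorted(df.index.to_numpy(), df["date1"].to_numpy(), side="right")
df["rows"] = end_pos - np.arange(len(df))
```

For each row, the "rows" value should match len(df.loc[start:end]) for that row's index and date1, since label slicing on a DatetimeIndex includes both ends.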

Converting a pandas dataframe column to date type with a particular format of date?

This question differs from the existing questions and answers on Stack Overflow because I do not want to convert my data to strings in order to get the desired output.
I find this very confusing and have not been able to find a proper solution to my problem.
I read an Excel file that has one column, as follows:
Date
9/20/2017 7:27:30 PM
9/20/2017 7:27:30 PM
11/21/2018 8:28:30 AM
7/18/2019 9:30:08 PM
.
.
.
I load this data from the Excel sheet into a dataframe:
df = pd.read_excel("data.xlsx")
First, I want to remove the time from this column. I do it like this:
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(df['Date'], errors='ignore', format='%d/%b/%Y').dt.date
It produces the following output, with datatype datetime.date:
Date
20/9/2017
20/9/2017
21/11/2018
18/7/2019
.
.
.
But I want it in the following form without changing it into a string, because I want to store this data in another Excel file and this column must behave as a date column when filtering is applied in the Excel sheet.
Date
20/Sep/2017
20/Sep/2017
21/Nov/2018
18/Jul/2019
.
.
.
I can produce above output by
df['Date'] = df['Date'].apply(lambda x: x.strftime('%d/%b/%Y'))
But again this turns the date column into strings. I do not want strings; I want a datetime type without the time values in each cell.
A possible solution is to convert it back from string to datetime as follows, but that adds the time values again:
df['Date'] = pd.to_datetime(df['Date'])
After executing the above two steps, each date value again includes a time such as 12:00:00 AM or 00:00:00 AM.
Hope I am clear.
How can I obtain the desired result with the final column values as a date type?
"But I want it as following type without changing it into string"
No, that is not possible. If you want datetimes without times, the only pattern available in python/pandas is YYYY-MM-DD.
# datetimes with the time part set to midnight (dtype stays datetime64)
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p').dt.floor('D')
# alternatively, Python date objects (dtype becomes object)
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p').dt.date
Any custom format requires converting the datetimes to strings, like:
df['Date'] = df['Date'].dt.strftime('%d/%b/%Y')
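To make the trade-off concrete, here is a small sketch comparing the three options on sample values taken from the question. The variable names are illustrative, and dt.normalize() is used here as an equivalent of flooring to the day:

```python
import pandas as pd

# Sample values in the question's month/day/year 12-hour format
raw = pd.Series(["9/20/2017 7:27:30 PM", "11/21/2018 8:28:30 AM", "7/18/2019 9:30:08 PM"])
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

# Still a datetime64 column; the time part is reset to 00:00:00
midnight = parsed.dt.normalize()

# A column of Python datetime.date objects (no longer datetime64)
as_dates = parsed.dt.date

# Strings in day/abbreviated-month/year form; no longer dates at all
as_text = parsed.dt.strftime("%d/%b/%Y")
```

Only the first variant keeps a true datetime dtype, which is why a custom display format has to be applied at output time rather than stored in the column.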
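You can set the date format in the ExcelWriter. Note that it takes Excel number-format codes such as 'dd/mmm/yyyy' rather than strftime codes, and that datetime64 columns are governed by datetime_format:
# date_format applies to datetime.date values, datetime_format to datetimes
with pd.ExcelWriter("pandas_datetime.xlsx",
                    engine='xlsxwriter',
                    date_format='dd/mmm/yyyy',
                    datetime_format='dd/mmm/yyyy') as writer:
    df.to_excel(writer)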
I think I am a bit late here, but as a workaround:
do not format the date column; leave it as a regular dataframe date column, save the Excel workbook, then open the file again and format that column range with the openpyxl module
import openpyxl
workbook = openpyxl.load_workbook(file_path)
sheet = workbook['Sheet1']  # get the sheet named 'Sheet1'
# assuming that the column is M and data starts from M2
last_line_end = 'M' + str(len(df) + 1)
for row in sheet['M2:' + last_line_end]:
    for cell in row:
        cell.number_format = "DD/MM/YY"
workbook.save(file_path)  # save the workbook back to the same path
workbook.close()

Select Pandas dataframe rows based on 'hour' datetime

I have a pandas dataframe 'df' with a column 'DateTimes' of type datetime.time.
The entries of that column are hours of a single day:
00:00:00
.
.
.
23:59:00
Seconds are skipped, it counts by minutes.
How can I choose rows by hour, for example the rows between 00:00:00 and 00:01:00?
If I try this:
df.between_time('00:00:00', '00:00:10')
I get an error that index must be a DatetimeIndex.
I set the index as such with:
df=df.set_index(keys='DateTime')
but I get the same error.
I can't seem to get 'loc' to work either. Any suggestions?
Here is a working example of what you are trying to do:
import numpy as np
import pandas as pd
times = pd.date_range('3/6/2012 00:00', periods=100, freq='s', tz='UTC')
df = pd.DataFrame(np.random.randint(10, size=(100,1)), index=times)
df.between_time('00:00:00', '00:00:30')
Note the index has to be of type DatetimeIndex.
I understand you have a column with your dates/times. The problem probably is that your column is not of this type, so you have to convert it first, before setting it as index:
# Method A
df['column_name'] = pd.to_datetime(df['column_name'])
df = df.set_index('column_name', drop=True)
# Method B
df.index = pd.to_datetime(df['column_name'])
df = df.drop('column_name', axis=1)
(The drop is only necessary if you want to remove the original column after setting it as index.)
Check out these links:
convert column to date type: Convert DataFrame column type from string to datetime
filter dataframe on dates: Filtering Pandas DataFrames on dates
Hope this helps
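Since the question's column holds datetime.time objects rather than full timestamps, here is a hedged sketch of two ways to filter it. The sample frame is made up, and the 1900-01-01 date that results from parsing with only a time format is a throwaway placeholder:

```python
from datetime import time

import pandas as pd

# Made-up frame with a column of datetime.time objects, as in the question
df = pd.DataFrame({
    "DateTime": [time(0, 0), time(0, 1), time(0, 2), time(12, 0)],
    "value": [1, 2, 3, 4],
})

# Option 1: compare the time objects directly with a boolean mask
mask = (df["DateTime"] >= time(0, 0)) & (df["DateTime"] <= time(0, 1))
by_mask = df[mask]

# Option 2: build a DatetimeIndex so that between_time can be used.
# Parsing with only a time format places every row on the default date 1900-01-01.
stamps = pd.to_datetime(df["DateTime"].astype(str), format="%H:%M:%S")
indexed = df.set_index(pd.DatetimeIndex(stamps))
by_between = indexed.between_time("00:00:00", "00:01:00")
```

Both options include the two endpoints, so each selects the 00:00 and 00:01 rows here.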

Python - Pandas - Convert YYYYMM to datetime

Beginner python (and therefore pandas) user. I am trying to import some data into a pandas dataframe. One of the columns is the date, but in the format "YYYYMM". I have attempted to do what most forum responses suggest:
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')
This doesn't work though (ValueError: unconverted data remains: 3). The column actually includes an additional value for each year, with MM=13. The source used this row as an average of the past year. I am guessing to_datetime is having an issue with that.
Could anyone offer a quick solution, either to strip out all of the annual averages (those with the last two digits "13"), or to have to_datetime ignore them?
Pass errors='coerce' and then drop the NaT rows:
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m', errors='coerce')
df_cons = df_cons.dropna(subset=['YYYYMM'])
The duff month values will get converted to NaT values:
In[36]:
pd.to_datetime('201613', format='%Y%m', errors='coerce')
Out[36]: NaT
Alternatively you could filter them out before the conversion
df_cons['YYYYMM'] = pd.to_datetime(df_cons.loc[df_cons['YYYYMM'].str[-2:] != '13','YYYYMM'], format='%Y%m', errors='coerce')
although this could lead to alignment issues, as the returned Series needs to be the same length, so just passing errors='coerce' is a simpler solution.
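A small end-to-end sketch of the coerce-then-drop approach on made-up data (the "value" column is an assumption for illustration). The invalid month 13 becomes NaT and the row is then removed from the frame itself:

```python
import pandas as pd

# Made-up data: month "13" marks the annual-average row
df_cons = pd.DataFrame({
    "YYYYMM": ["201611", "201612", "201613", "201701"],
    "value": [1.0, 2.0, 1.5, 3.0],
})

# Unparseable entries such as 201613 are turned into NaT instead of raising
df_cons["YYYYMM"] = pd.to_datetime(df_cons["YYYYMM"], format="%Y%m", errors="coerce")

# Drop the rows whose date could not be parsed, then renumber the index
df_cons = df_cons.dropna(subset=["YYYYMM"]).reset_index(drop=True)
```

Dropping on the frame with dropna(subset=...) avoids any index re-alignment when the result is assigned back.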
Clean up the dataframe first:
df_cons = df_cons[~df_cons['YYYYMM'].str.endswith('13')].copy()
df_cons['YYYYMM'] = pd.to_datetime(df_cons['YYYYMM'], format='%Y%m')
May I suggest turning the column into a period index, if the YYYYMM column is unique in your dataset?
First turn YYYYMM into the index (once it holds datetimes), then convert it to a monthly period.
df_cons = df_cons.reset_index().set_index('YYYYMM').to_period('M')
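A minimal sketch of the period-index idea, assuming the column has already been converted to datetimes and cleaned of the month-13 rows. The data and the "value" column are made up:

```python
import pandas as pd

# Made-up, already-cleaned data with a datetime column
df_cons = pd.DataFrame({
    "YYYYMM": pd.to_datetime(["201611", "201612", "201701"], format="%Y%m"),
    "value": [1.0, 2.0, 3.0],
})

# Move the dates into the index, then convert the DatetimeIndex to monthly periods
monthly = df_cons.set_index("YYYYMM").to_period("M")

# Rows can then be looked up by month
december = monthly.loc[pd.Period("2016-12", freq="M"), "value"]
```

A monthly PeriodIndex represents each row as a whole month rather than as the first day of that month.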
