Dataframe date sorting is reversed. How to fix it? - python

So, I have a dataframe (mean_df) with a very messy date column. It's messy because it is in this format: 1/1/2018, 1/2/2018, 1/3/2018..., when it should be 01/01/2018, 02/01/2018, 03/01/2018... Not only is the format wrong, but the column is sorted by the first day of every month, then the second day of every month, and so on.
So I wrote this code to fix the format:
mean_df["Date"] = mean_df["Date"].astype('datetime64[ns]')
mean_df["Date"] = mean_df["Date"].dt.strftime('%d-%m-%Y')
Then, from displaying this:
It's now showing this (I have to run the same cell 3 times to make it work; it always throws an error the first time):
Finally, for the last few hours I've been trying to sort the 'Date' column in ascending order, but it keeps sorting the wrong way:
mean_df = mean_df.sort_values(by='Date') # I tried this
But this is the output:
As you can see, it is still sorted by day first.
Can someone guide me in the right direction?
Thank you in advance!

Convert it to the right format first:
mean_df["sort_date"] = pd.to_datetime(mean_df["Date"],format = '%d/%m/%Y')
mean_df = mean_df.sort_values(by='sort_date') # Try this now

You should sort the column right after converting it to datetime, because dt.strftime converts the datetime values back to strings:
mean_df["Date"] = pd.to_datetime(mean_df["Date"], dayfirst=True)
mean_df = mean_df.sort_values(by='Date')
mean_df["Date"] = mean_df["Date"].dt.strftime('%d-%m-%Y')
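To see why the order of operations matters, here is a minimal sketch. The three sample dates are made up for illustration, and the day/month/year order is an assumption about the asker's data.

```python
import pandas as pd

# Hypothetical sample data in day/month/year order (an assumption for illustration).
mean_df = pd.DataFrame({"Date": ["15/01/2018", "02/03/2018", "01/02/2018"]})

# Sorting the raw strings compares characters left to right, so the day wins.
string_sorted = mean_df.sort_values(by="Date")["Date"].tolist()

# Parsing first makes the sort chronological; format back to text only at the end.
parsed = pd.to_datetime(mean_df["Date"], format="%d/%m/%Y")
date_sorted = parsed.sort_values().dt.strftime("%d-%m-%Y").tolist()

print(string_sorted)
print(date_sorted)
```

The first list comes out ordered by day, the second by actual date, which is the behaviour the question is asking for.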

Here is my sample code.
import pandas as pd
df = pd.DataFrame()
df['Date'] = "1/1/2018, 1/2/2018, 1/3/2018".split(", ")
df['Date1'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Date2'] = df['Date1'].dt.strftime('%d/%m/%Y')
df.sort_values(by='Date1')  # sort on the datetime column, not on the formatted string
First, I convert Date to datetime format. As far as I can see, your data follows the '%d/%m/%Y' format. If you want to display the dates in another form, use a line like the following:
df['Date2'] = df['Date1'].dt.strftime('%d/%m/%Y')

Related

How can I change the Date column format in pandas?

I need to convert the date to Day, Month and Year. I tried some alternatives, but I was unsuccessful.
import pandas as pd
df = pd.read_excel(r"C:\__Imagens e Planilhas Python\Instagram\Postagem.xlsx")
print(df)
It's very confusing, because you're using two different formats between the image and the expected result (and you write you want the same).
Make sure the data column is parsed as dates with:
df['data'] = pd.to_datetime(df['data'])
Once you have this, just change the format with:
my_format = '%m-%d-%Y'
df['data'] = df['data'].dt.strftime(my_format)
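If separate day, month and year values are what is needed, the .dt accessor may be a simpler route. This is only a sketch: the two sample dates and the extra column names (day, month, year, data_text) are invented here, since the real spreadsheet is not shown.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet column (the real file is not shown).
df = pd.DataFrame({"data": ["2021-03-05", "2021-11-20"]})
df["data"] = pd.to_datetime(df["data"], format="%Y-%m-%d")

# Numeric parts come from the .dt accessor while the column is still datetime.
df["day"] = df["data"].dt.day
df["month"] = df["data"].dt.month
df["year"] = df["data"].dt.year

# A formatted copy is plain text, so keep the datetime column for sorting.
df["data_text"] = df["data"].dt.strftime("%d-%m-%Y")
print(df)
```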

Convert month, day in string to date in Python

I have 2 columns as month and day in my dataframe which are of the datatypes objects. I want to sort those in ascending order (Jan, Feb, Mar) but in order to do that, I need to convert them to date format. I tried using the following code, and some more but nothing seems to work.
ff['month'] = dt.datetime.strptime(ff['month'],format='%b')
and
ff['month'] = pd.to_datetime(ff['month'], format="%b")
Data Frame
Any help would be appreciated. Thank you
This works to convert Month Names to Integers:
import datetime as dt
ff['month'] = [dt.datetime.strptime(m, "%b").month for m in ff['month']]
(Basically, you're just passing strings one by one to the first function you mentioned, to make it work.)
You can then manipulate (e.g. sort) them.
Working with dataframe:
ff['month'] = ff['month'].apply(lambda x: dt.datetime.strptime(x, "%b"))
ff = ff.sort_values(by=['month'])
ff['month'] = ff['month'].apply(lambda x: x.strftime("%b"))
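A different technique from the answers above, offered only as a sketch, is an ordered categorical: the month names stay as text but sort in calendar order. The sample rows are hypothetical, the day column is assumed to be numeric, and calendar.month_abbr is assumed to yield English abbreviations (it follows the active locale).

```python
import calendar
import pandas as pd

# Hypothetical frame with abbreviated month names, as in the question.
ff = pd.DataFrame({"month": ["Mar", "Jan", "Feb", "Jan"], "day": [3, 15, 7, 2]})

# calendar.month_abbr[0] is an empty string, so skip it.
month_order = list(calendar.month_abbr)[1:]

# An ordered categorical sorts by position in month_order, not alphabetically.
ff["month"] = pd.Categorical(ff["month"], categories=month_order, ordered=True)
ff = ff.sort_values(by=["month", "day"]).reset_index(drop=True)
print(ff)
```

This avoids the round trip through datetime objects, at the cost of turning any month name outside month_order into a missing value.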

Speed up the apply function in pandas (python)

I am working with a Dataframe containing date in string format. Dates look like this: 19620201 so with year first, then month, then day.
I want to convert those dates into Datetime. I tried to use this:
pd.to_datetime(df.Date)
But it doesn't work because some dates have the day set to "00", sometimes the month as well, and sometimes even the year.
I don't want to drop those dates because I still want the year or the month.
So I tried to write a function like this one:
def handle_the_00_case(date):
    try:
        if date.endswith("0000"):
            return pd.to_datetime(date[:-4], format="%Y")
        elif date.endswith("00"):
            return pd.to_datetime(date[:-2], format="%Y%m")
        return pd.to_datetime(date, format="%Y%m%d")
    except ValueError:
        return
And use the following statement:
df.Date.apply(handle_the_00_case)
But this takes far too long to compute.
Do you have an idea how I can improve the speed of this?
I tried np.vectorize() and the swifter library, but neither works. I know I should change the way I wrote the function, but I don't know how.
Thank you if you can help me ! :)
You should first convert the column to valid dates, and then convert to datetime only once:
date = df['Date'].str.replace('0000$', '0101', regex=True)  # regex=True so that $ anchors the end of the string
date = date.str.replace('00$', '01', regex=True)
date = pd.to_datetime(date, format="%Y%m%d")
A first idea is a vectorized solution: pass whole columns to to_datetime and build the output column with numpy.where (np below is numpy, imported with import numpy as np):
d1 = pd.to_datetime(df['Date'].str[:-4], format="%Y", errors='coerce')
d2 = pd.to_datetime(df['Date'].str[:-2], format="%Y%m", errors='coerce')
d3 = pd.to_datetime(df['Date'], format="%Y%m%d", errors='coerce')
m1 = df['Date'].str.endswith("0000")
m2 = df['Date'].str.endswith("00")
df['Date_out'] = np.where(m1, d1, np.where(m2, d2, d3))
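The placeholder-replacement idea can be tried on a tiny sample. This is a sketch under assumptions: the three dates are invented, and the day_known flag column is a hypothetical addition for keeping track of which rows had a real day.

```python
import pandas as pd

# Hypothetical sample: a full date, a missing day, and a missing month and day.
df = pd.DataFrame({"Date": ["19620201", "19620200", "19620000"]})

# Remember which rows had a real day before the placeholders are overwritten.
df["day_known"] = ~df["Date"].str.endswith("00")

# Replace the "00" placeholders with "01" so every string is a valid date.
cleaned = df["Date"].str.replace(r"0000$", "0101", regex=True)
cleaned = cleaned.str.replace(r"00$", "01", regex=True)

# errors="coerce" turns anything still unparseable into NaT instead of raising.
df["Date_out"] = pd.to_datetime(cleaned, format="%Y%m%d", errors="coerce")
print(df)
```

Because the whole column is parsed in one call, this avoids calling pd.to_datetime once per row, which is what makes the apply version slow.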

How to find missing dates in an excel file by python

I'm a beginner in Python. I have an Excel file that shows the rainfall amount between 2016-1-1 and 2020-6-30. It has 2 columns: the first column is the date, the other is the rainfall. Some dates are missing from the file (the rainfall was not estimated). For example, there isn't a row for 2016-05-05 in my file. This is a sample of my Excel file.
Date rainfall (mm)
1/1/2016 10
1/2/2016 5
.
.
.
12/30/2020 0
I want to find the missing dates but my code doesn't work correctly!
import pandas as pd
from datetime import datetime, timedelta
from matplotlib import dates as mpl_dates
from matplotlib.dates import date2num
df=pd.read_excel ('rainfall.xlsx')
a= pd.date_range(start = '2016-01-01', end = '2020-06-30' ).difference(df.index)
print(a)
Here's a beginner-friendly way of doing it.
First you need to make sure that the Date in your dataframe is really a date and not a string or object.
Type (or print) df.info().
The date column should show up as datetime64[ns].
If not, df['Date'] = pd.to_datetime(df['Date'], dayfirst=False) fixes that. (Use dayfirst to say whether the month or the day comes first in your date string, because pandas can't know. Month first is the default, so it would work without the argument here.)
For the task of finding missing days, there are many ways to solve it. Here's one.
Turn all dates into a series
all_dates = pd.Series(pd.date_range(start = '2016-01-01', end = '2020-06-30' ))
Then print all dates from that series which are not in your dataframe "Date" column. The ~ sign means "not".
print(all_dates[~all_dates.isin(df['Date'])])
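The same idea can be checked on a small made-up frame before running it on the real spreadsheet. In this sketch the sample rows are hypothetical (2016-01-03 is left out on purpose) and the month-first format is an assumption based on the sample shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet: 2016-01-03 is deliberately absent.
df = pd.DataFrame({
    "Date": ["1/1/2016", "1/2/2016", "1/4/2016"],
    "rainfall (mm)": [10, 5, 0],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Build the full calendar for the period and subtract the dates that are present.
expected = pd.date_range(start="2016-01-01", end="2016-01-04")
missing = expected.difference(pd.DatetimeIndex(df["Date"]))
print(missing)
```

The original attempt compared against df.index, which holds row numbers rather than dates; comparing against the Date column is what makes the difference meaningful.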
Try:
df = pd.read_excel('rainfall.xlsx', usecols=[0])
a = pd.date_range(start = '2016-01-01', end = '2020-06-30').difference([l[0] for l in df.values])
print(a)
The dates in the file must be formatted like 2016/1/1.
To find the missing dates in a list, you can apply the Conditional Formatting function in Excel. 4. Click OK > OK; the positions of the missing dates are then highlighted. Note: The last date in the date list will be highlighted.
Note that this trick uses Excel itself, not Python.

How to sort dates imported from a CSV file?

I'm trying to write a program that prints a list of sorted dates, but it keeps sorting by the day instead of by the full date (day, month, year).
I'm very new to Python, so there's probably a lot I'm doing wrong, but any help would be greatly appreciated.
So I have it so that you can view the list over two pages.
The dates sort like this:
12/03/2004
13/08/2001
15/10/2014
but I need the full date sorted
df = pd.read_csv('Employee.csv')
df = df.sort_values('Date of Employment.')
List1 = df.iloc[:50, 1:]
List2 = df.iloc[50:99, 1:]
The datetime data type has to be used for the dates to be sorted correctly.
Use either one of these approaches to convert the dates to datetime objects:
Approach 1
pd.to_datetime + DataFrame.sort_values:
df['Date of Employment.'] = pd.to_datetime(df['Date of Employment.'], dayfirst=True)  # the dates are day/month/year
Approach 2
You can parse the dates at the same time that the Pandas DataFrame is being loaded:
df = pd.read_csv('Employee.csv', parse_dates=['Date of Employment.'], dayfirst=True)
This is equivalent to the first approach with the exception that everything is done in one step.
Next you need to sort the datetime values in either ascending or descending order.
Ascending:
`df.sort_values('Date of Employment.')`
Descending
`df.sort_values('Date of Employment.',ascending=False)`
You need to convert the 'Date of Employment.' column to datetime before sorting:
df['Date of Employment.'] = pd.to_datetime(df['Date of Employment.'],format= '%d/%m/%Y')
Otherwise the values are just strings to Python.
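Putting the conversion and the sort together, a minimal sketch could look like this. The CSV contents and the Name column are invented for illustration, since the real Employee.csv is not shown.

```python
import io
import pandas as pd

# Hypothetical CSV contents; the real Employee.csv is not shown in the question.
csv_text = (
    "Name,Date of Employment.\n"
    "Ann,15/10/2014\n"
    "Bob,12/03/2004\n"
    "Cat,13/08/2001\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Parse day/month/year text into real datetimes, then sort chronologically.
df["Date of Employment."] = pd.to_datetime(df["Date of Employment."], format="%d/%m/%Y")
df = df.sort_values("Date of Employment.").reset_index(drop=True)
print(df)
```

After the sort, slices such as df.iloc[:50, 1:] follow the chronological order.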
