I have a date column with 2 million rows. The dates are stored as 'Year-Month-Day', e.g. '2019-11-28'. Each time I load the document I have to convert the column to datetime (which takes a long time) with:
pd.to_datetime(df['old_date'])
I would like to rearrange the order to 'Month-Day-Year' so that I wouldn't have to change the format of the column each time I load it.
I tried doing:
df_1['new_date'] = df_1['old_date'].dt.month+'-'+df_1['old_date'].dt.day+'-'+df_1['old_date'].dt.year
But I received the following error: 'unknown type str32'
Could anyone help me?
Thanks!
You could use pandas.Series.dt.strftime to change the format of your dates. In the code below I start with a column in your old date format and create a new column with this method:
import pandas as pd
df = pd.DataFrame({'old format': pd.date_range(start = '2020-01-01', end = '2020-06-30', freq = 'd')})
df['new format'] = df['old format'].dt.strftime('%m-%d-%Y')
Output:
   old format  new format
0  2020-01-01  01-01-2020
1  2020-01-02  01-02-2020
2  2020-01-03  01-03-2020
3  2020-01-04  01-04-2020
4  2020-01-05  01-05-2020
5  2020-01-06  01-06-2020
6  2020-01-07  01-07-2020
7  2020-01-08  01-08-2020
8  2020-01-09  01-09-2020
9  2020-01-10  01-10-2020
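One caveat: dt.strftime returns plain strings, so a column saved this way would still need parsing after each load. If load time is the real concern, a minimal sketch, assuming the dates are stored as 'YYYY-MM-DD' strings, is to pass an explicit format to pd.to_datetime, which is often faster than letting pandas guess. The column name old_date follows the question; the sample values are made up:

```python
import pandas as pd

# Made-up sample standing in for the 2-million-row column from the question.
df = pd.DataFrame({"old_date": ["2019-11-28", "2019-11-29", "2020-01-05"]})

# An explicit format tells pandas exactly how to read each string.
df["old_date"] = pd.to_datetime(df["old_date"], format="%Y-%m-%d")

# strftime goes the other way: it turns datetimes back into plain strings.
df["new_date"] = df["old_date"].dt.strftime("%m-%d-%Y")

print(df.dtypes)
```

Here old_date ends up as a datetime column, while new_date is an ordinary string (object) column.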
I have a dataframe that looks similar to this :
Price From To
300€ 2020-01-01 2020-01-07
250€ 2020-01-04 2020-01-08
150€ 2020-02-01 2020-02-04
350€ 2020-02-04 2020-02-08
And then I have a list of dates. For example: list = [2020-01-03, 2020-02-04]
I would like to keep only the rows of the dataframe where at least one of the dates in the list falls between the From column and the To column (inclusive).
So, after transformation I would have the following dataframe.
Price From To
300€ 2020-01-01 2020-01-07
150€ 2020-02-01 2020-02-04
350€ 2020-02-04 2020-02-08
At first I thought of using apply with a lambda, but that seemed inefficient because my dataset is very large. Is there a simpler way to do this with pandas?
The result should be a single dataframe.
Let's try with numpy broadcasting:
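import numpy as np
# assumes From and To are already datetime columns
x, y = df[['From', 'To']].values.T
a = np.array(['2020-01-03', '2020-02-04'], dtype=np.datetime64)
# compare every row against every date, then keep rows matching any date
mask = ((x[:, None] <= a) & (y[:, None] >= a)).any(axis=1)
df[mask]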
Price From To
0 300€ 2020-01-01 2020-01-07
2 150€ 2020-02-01 2020-02-04
3 350€ 2020-02-04 2020-02-08
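For anyone who wants to run this end to end, here is a self-contained sketch of the same broadcasting idea. The dataframe is rebuilt from the question; the names start, end, inside and result are just illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Price": ["300€", "250€", "150€", "350€"],
    "From": pd.to_datetime(["2020-01-01", "2020-01-04", "2020-02-01", "2020-02-04"]),
    "To": pd.to_datetime(["2020-01-07", "2020-01-08", "2020-02-04", "2020-02-08"]),
})
dates = np.array(["2020-01-03", "2020-02-04"], dtype="datetime64[ns]")

# Column vectors of shape (n_rows, 1) broadcast against dates of shape (n_dates,).
start = df["From"].to_numpy()[:, None]
end = df["To"].to_numpy()[:, None]

# inside[i, j] is True when date j lies in the closed interval of row i.
inside = (start <= dates) & (dates <= end)

# Keep a row if any of the dates falls inside its interval.
mask = inside.any(axis=1)
result = df[mask]
print(result)
```

The intermediate boolean matrix has n_rows x n_dates entries, so with a very large dataframe and a long list of dates it may be worth processing the dates in chunks.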
One option is with Pandas IntervalIndex:
dates = ['2020-01-03', '2020-02-04']
dates = pd.to_datetime(dates)
intervals = pd.IntervalIndex.from_arrays(df.From, df.To, closed='both')
df.iloc[intervals.get_indexer_for(dates)] # for duplicates, you can use .unique
Price From To
0 300€ 2020-01-01 2020-01-07
2 150€ 2020-02-01 2020-02-04
3 350€ 2020-02-04 2020-02-08
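A variant of the same idea, sketched under the assumption that the list of dates is short: IntervalIndex.contains checks one scalar against every interval, so the per-date results can be OR-ed into a boolean mask. Unlike positional indexing, a mask keeps the original row order and never repeats a row. The dataframe is rebuilt from the question and the name mask is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Price": ["300€", "250€", "150€", "350€"],
    "From": pd.to_datetime(["2020-01-01", "2020-01-04", "2020-02-01", "2020-02-04"]),
    "To": pd.to_datetime(["2020-01-07", "2020-01-08", "2020-02-04", "2020-02-08"]),
})
dates = pd.to_datetime(["2020-01-03", "2020-02-04"])

# closed='both' makes the From and To endpoints count as inside the interval.
intervals = pd.IntervalIndex.from_arrays(df["From"], df["To"], closed="both")

# Start with all False and switch on every row that contains one of the dates.
mask = np.zeros(len(df), dtype=bool)
for d in dates:
    mask |= intervals.contains(d)

result = df[mask]
print(result)
```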
I have this dataframe:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-01 0.603989 S2B-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
I would like to group by the date column and, within each date, keep one row according to a ranking of my choosing over the source column. For example, my ranking is L8-SR > S2B-SR > GP6_r, meaning that for all rows with the same date, keep the row where source == L8-SR; if none contains L8-SR, keep the row where source == S2B-SR, and so on. How can I accomplish that with pandas groupby?
Output should look like this:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-11 0.717264 S2B-SR
4 2020-04-02 0.737118 L8-SR
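Let's try an ordered category dtype and drop_duplicates. Sources missing from the ranking (S2A-SR here) are appended at the end so they do not become NaN:
orders = ['L8-SR','S2B-SR','GP6_r']
orders += [s for s in df['source'].unique() if s not in orders]
df['source'] = pd.Categorical(df['source'], categories=orders, ordered=True)
df.sort_values(['date','source']).drop_duplicates(['date'])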
Output:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
Try the code below for the group-by operation. To order the result afterwards you can use sort_values:
# Import the pandas library
import pandas as pd
# Declare a dictionary containing the data from the table in the question
pandasdata_dict = {'date':['2020-02-14', '2020-02-15', '2020-03-01', '2020-03-01', '2020-03-11', '2020-04-02'],
'value':[0.438767, 0.422867, 0.657453, 0.603989, 0.717264, 0.737118],
'source':['L8-SR', 'S2A-SR', 'L8-SR', 'S2B-SR', 'S2B-SR', 'L8-SR']}
# Convert the dictionary to a data frame
df = pd.DataFrame(pandasdata_dict)
# Display the data frame
df
# Convert the date field to datetime
df["date"] = pd.to_datetime(df["date"])
# Group the data frame by date; this returns a GroupBy object,
# so a selection or aggregation still has to be applied to it to get rows back
df.groupby([df['date'].dt.date])
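To finish the groupby route, one possible sketch (not necessarily what the answer above had in mind) is to turn the ranking into a numeric priority and pick the best-ranked row per date with idxmin. Sources missing from the ranking are given the lowest priority, which is an assumption; the names rank, priority, best and result are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-14", "2020-02-15", "2020-03-01",
                            "2020-03-01", "2020-03-11", "2020-04-02"]),
    "value": [0.438767, 0.422867, 0.657453, 0.603989, 0.717264, 0.737118],
    "source": ["L8-SR", "S2A-SR", "L8-SR", "S2B-SR", "S2B-SR", "L8-SR"],
})

ranking = ["L8-SR", "S2B-SR", "GP6_r"]
rank = {name: i for i, name in enumerate(ranking)}

# Lower number = more important; unranked sources (S2A-SR) go last.
priority = df["source"].map(rank).fillna(len(ranking))

# For each date, find the index label of the row with the best priority.
best = priority.groupby(df["date"]).idxmin()

result = df.loc[best]
print(result)
```

For 2020-03-01 this keeps the L8-SR row (index 2) and drops the S2B-SR row (index 3).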
I would like to make an empty pandas series with a date index which is every day of 2020. That is 01-01-2020, 02-01-2020 etc.
Although this looks very simple I couldn’t find out how to do it.
Use date_range:
range_2020 = pd.date_range("2020-01-01", "2020-12-31", freq="D")
pd.DataFrame(range(366), index=range_2020)
The output is:
0
2020-01-01 0
2020-01-02 1
2020-01-03 2
2020-01-04 3
2020-01-05 4
...
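If you really want an empty Series rather than a DataFrame filled with a counter, a minimal sketch is to pass the same index to pd.Series with an explicit dtype, which leaves every value as NaN. The name empty_2020 is illustrative and float64 is just one reasonable dtype choice:

```python
import pandas as pd

# 2020 is a leap year, so this index has 366 daily entries.
range_2020 = pd.date_range("2020-01-01", "2020-12-31", freq="D")

# No data is given, so every value is NaN; dtype is set explicitly.
empty_2020 = pd.Series(index=range_2020, dtype="float64")

print(empty_2020.head())
```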
My 'Date' column format is switching the days and months after row 7. It seems to be pulling in the dates in the format yyyy/dd/mm; however, I want my dates to be 'mm/dd/yyyy' but it keeps getting mixed up and I'm not sure how to fix this.
For example:
Date
0 2019-02-01
1 2019-03-01
2 2019-04-01
3 2019-05-01
4 2019-06-01
5 2019-07-01
6 2019-08-01
7 2019-09-01
8 2019-01-10
9 2019-01-11
Here's what I'm doing:
#import the file from my computer
data = pd.ExcelFile(r'file on my computer')
#parse data
rates = data.parse("Sheet 1")
#print dimensions and glimpse
print(rates.shape)
rates.head(15)
#convert Date columns to desired format
rates['Date'] = pd.to_datetime(rates['Date'], format = '%Y%d%m')
#check if it worked
rates.head(16)
It still shows the dates exactly as above.
I have an observational data set which contains weather information. Each column contains a specific field, and the date and time are in two separate columns. The time column contains times of day such as 0000, 0600, up to 2300. What I am trying to do is filter the data set by a certain time frame, for example between 0000 UTC and 0600 UTC. When I read the data file into a pandas data frame, the time column is read as float by default. When I try to convert it to a datetime object, it produces a format which I am unable to work with. A code example is given below:
import pandas as pd
import datetime as dt
df = pd.read_excel("test.xlsx")
df.head()
which produces the following result:
tdate itime moonph speed ... qnh windir maxtemp mintemp
0 01-Jan-17 1000.0 NM7 5 ... $1,011.60 60.0 $32.60 $22.80
1 01-Jan-17 1000.0 NM7 2 ... $1,015.40 999.0 $32.60 $22.80
2 01-Jan-17 1030.0 NM7 4 ... $1,015.10 60.0 $32.60 $22.80
3 01-Jan-17 1100.0 NM7 3 ... $1,014.80 999.0 $32.60 $22.80
4 01-Jan-17 1130.0 NM7 5 ... $1,014.60 270.0 $32.60 $22.80
Then I extracted the time column with following line:
df["time"] = df.itime
df["time"]
0 1000.0
1 1000.0
2 1030.0
3 1100.0
4 1130.0
5 1200.0
6 1230.0
7 1300.0
8 1330.0
.
.
3261 2130.0
3262 2130.0
3263 600.0
3264 630.0
3265 730.0
3266 800.0
3267 830.0
3268 1900.0
3269 1930.0
3270 2000.0
Name: time, Length: 3279, dtype: float64
Then I tried to convert the time column to datetime object:
df["time"] = pd.to_datetime(df.itime)
which produced the following result:
df["time"]
0 1970-01-01 00:00:00.000001000
1 1970-01-01 00:00:00.000001000
2 1970-01-01 00:00:00.000001030
3 1970-01-01 00:00:00.000001100
It appears that it has successfully converted the data to datetime objects. However, it put the time value into the sub-second (nanosecond) part of the timestamp, which makes filtering difficult for me.
The final data format I would like to get is either:
1970-01-01 06:00:00
or
06:00
Any help is appreciated.
When you read the excel file specify the dtype of col itime as a str:
df = pd.read_excel("test.xlsx", dtype={'itime':str})
then you will have a time column of strings looking like:
df = pd.DataFrame({'itime':['2300', '0100', '0500', '1000']})
Then specify the format and convert to time:
df['Time'] = pd.to_datetime(df['itime'], format='%H%M').dt.time
itime Time
0 2300 23:00:00
1 0100 01:00:00
2 0500 05:00:00
3 1000 10:00:00
Just to add to Chris's answer: if the conversion fails because a leading zero is missing, apply the following to the dataframe first.
df['itime'] = df['itime'].apply(lambda x: x.zfill(4))
This is needed because the original values are not always four digits long, for example 945 instead of 0945.
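Putting the two answers together, here is a minimal sketch that goes from the float column to a time column and then filters the 00:00 to 06:00 window asked about in the question. It assumes itime has no missing values (astype(int) would fail on NaN); the sample values and the names hhmm, parsed, minutes and early are made up for illustration:

```python
import datetime as dt

import pandas as pd

# Made-up sample of the float column as pandas reads it from Excel.
df = pd.DataFrame({"itime": [1000.0, 600.0, 30.0, 2330.0, 0.0]})

# 600.0 -> 600 -> "600" -> "0600", so every value is a 4-digit HHMM string.
hhmm = df["itime"].astype(int).astype(str).str.zfill(4)

# Parse as hours and minutes; the date part is a placeholder and is dropped by .dt.time.
parsed = pd.to_datetime(hhmm, format="%H%M")
df["Time"] = parsed.dt.time

# Filter on minutes since midnight: keep 00:00 through 06:00 inclusive.
minutes = parsed.dt.hour * 60 + parsed.dt.minute
early = df[minutes <= 6 * 60]
print(early)
```

Filtering on minutes since midnight keeps the comparison numeric, which avoids comparing datetime.time objects inside an object column.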
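Try
t = pd.to_datetime(df.itime.astype(int).astype(str).str.zfill(4), format='%H%M')
df["time"] = t.dt.strftime('%Y-%m-%d %H:%M:%S')
df["time"] = t.dt.strftime('%H:%M')
for the first and second output formats you want, respectively. Calling pd.to_datetime(df.itime) without a format treats the floats as nanoseconds since the epoch, so the values have to be parsed as zero-padded '%H%M' strings first. Note that the date part then defaults to 1900-01-01 rather than 1970-01-01.
Best!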