Pandas groupby keep rows according to ranking - python

I have this dataframe:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-01 0.603989 S2B-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
I would like to group by the date column, keeping rows according to a ranking/importance of my choosing from the source column. For example, my ranking is L8-SR > S2B-SR > GP6_r, meaning that for all rows with the same date, keep the row where source == L8-SR; if none contains L8-SR, then keep the row where source == S2B-SR, and so on. How can I accomplish that with pandas groupby?
Output should look like this:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-11 0.717264 S2B-SR
4 2020-04-02 0.737118 L8-SR

Let's try category dtype and drop_duplicates:
orders = ['L8-SR', 'S2B-SR', 'GP6_r']
df.source = df.source.astype('category')
# set_categories returns a new Series, so assign it back; categories not in
# `orders` (here S2A-SR) are appended so they rank last instead of becoming NaN
extra = [c for c in df.source.cat.categories if c not in orders]
df.source = df.source.cat.set_categories(orders + extra, ordered=True)
df.sort_values(['date', 'source']).drop_duplicates(['date'])
Output:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
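Equivalently, the ordered dtype can be attached in one step with pd.CategoricalDtype (a sketch; note that any source not listed, such as S2A-SR here, becomes NaN under this cast, but still sorts last, so its rows are only kept when no listed source exists for that date):
# One-step ordered categorical; values outside the list become NaN
order_dtype = pd.CategoricalDtype(['L8-SR', 'S2B-SR', 'GP6_r'], ordered=True)
df['source'] = df['source'].astype(order_dtype)
df.sort_values(['date', 'source']).drop_duplicates(['date'])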

Try the code below for the groupby operation. For ordering afterwards you can use sort_values:
# Import the pandas library
import pandas as pd
# Declare a dictionary containing the data from the table above
pandasdata_dict = {'date': ['2020-02-14', '2020-02-15', '2020-03-01', '2020-03-01', '2020-03-11', '2020-04-02'],
                   'value': [0.438767, 0.422867, 0.657453, 0.603989, 0.717264, 0.737118],
                   'source': ['L8-SR', 'S2A-SR', 'L8-SR', 'S2B-SR', 'S2B-SR', 'L8-SR']}
# Convert the dictionary to a data frame
df = pd.DataFrame(pandasdata_dict)
# Display the data frame
df
# Convert the date field to datetime
df["date"] = pd.to_datetime(df["date"])
# Once the conversion is done, group the data frame on the date part of the field
df.groupby([df['date'].dt.date])
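On its own, the groupby above only builds the groups. To actually keep one row per date according to the ranking, here is a minimal sketch (the rank dictionary is an assumption based on the ordering stated in the question; unlisted sources such as S2A-SR rank last):
# Lower rank = more important; sources missing from the map rank last
rank = {'L8-SR': 0, 'S2B-SR': 1, 'GP6_r': 2}
df['rank'] = df['source'].map(rank).fillna(len(rank))
# Sort so the best-ranked row comes first in each group, then take it
result = df.sort_values('rank').groupby('date', as_index=False).first()
result = result.drop(columns='rank')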

Related

New column from slices of a column in pandas

I am working with time series data in pandas, and my data frame looks like this:
Date Layer
0 2000-01-01 0.408640
1 2000-01-02 0.842065
2 2000-01-03 1.271810
3 2000-01-04 1.699399
4 2000-01-05 2.128098
... ...
7300 2019-12-27 149.323520
7301 2019-12-28 149.744012
7302 2019-12-29 150.155702
7303 2019-12-30 150.562771
7304 2019-12-31 151.003031
I need to make a column for each year, like this:
2000 2001 2002
0 0.408640 0.415863 0.425689
1 0.852653 0.826542 0.863524
... ... ...
364 156.235978 158.564578 152.135689
365 156.685421 158.924556 152.528978
Is there a way I can do that? The resulting data can go in a new data frame.
The approach for this will be to create separate year and day of year columns, and then create a pivot table:
# Convert the Date column to pandas datetime if you haven't already
df['Date'] = pd.to_datetime(df['Date'])
# Create a year column
df['Year'] = df['Date'].dt.year
# Create a day-of-year column
df['DayOfYear'] = df['Date'].dt.dayofyear
# Create the pivot table in a new dataframe
df2 = pd.pivot_table(df, index='DayOfYear', columns='Year', values='Layer')
This won't look exactly like your desired output because the index will be numbered 1-365 (and have a name) rather than 0-364. If you want it to match exactly, you can add:
df2 = df2.reset_index()
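As a quick sanity check, the approach can be exercised on a small synthetic frame (a sketch with hypothetical two-year data; all calls are standard pandas):
import pandas as pd

# Hypothetical daily series covering 2000-2001
df = pd.DataFrame({'Date': pd.date_range('2000-01-01', '2001-12-31', freq='D')})
df['Layer'] = range(len(df))
df['Year'] = df['Date'].dt.year
df['DayOfYear'] = df['Date'].dt.dayofyear
df2 = pd.pivot_table(df, index='DayOfYear', columns='Year', values='Layer')
print(df2.shape)  # (366, 2): 2000 is a leap year, so day 366 is NaN for 2001
Note that rows align by day number, so in leap years the days after February 28 sit one slot later than in non-leap years.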

Convert date format to 'Month-Day-Year'

I have a column of dates spanning 2 million rows. The format is 'Year-Month-Day', e.g. '2019-11-28'. Each time I load the document I have to convert the column (which takes a long time) by doing:
pd.to_datetime(df['old_date'])
I would like to rearrange the order to 'Month-Day-Year' so that I wouldn't have to change the format of the column each time I load it.
I tried doing:
df_1['new_date'] = df_1['old_date'].dt.month+'-'+df_1['old_date'].dt.day+'-'+df_1['old_date'].dt.year
But I received the following error: 'unknown type str32'
Could anyone help me?
Thanks!
You could use pandas.Series.dt.strftime (documentation) to change the format of your dates. (Your attempt raises that error because .dt.month, .dt.day and .dt.year return integers, which cannot be concatenated with strings.) In the code below I have a column with your old-format dates and create a new column with this method:
import pandas as pd
df = pd.DataFrame({'old format': pd.date_range(start = '2020-01-01', end = '2020-06-30', freq = 'd')})
df['new format'] = df['old format'].dt.strftime('%m-%d-%Y')
Output:
old format new format
0 2020-01-01 01-01-2020
1 2020-01-02 01-02-2020
2 2020-01-03 01-03-2020
3 2020-01-04 01-04-2020
4 2020-01-05 01-05-2020
5 2020-01-06 01-06-2020
6 2020-01-07 01-07-2020
7 2020-01-08 01-08-2020
8 2020-01-09 01-09-2020
9 2020-01-10 01-10-2020
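Note that strftime produces plain strings, so if the file is reloaded later the new column needs to be parsed back to datetime; giving the format explicitly keeps that fast (a sketch; the file name is hypothetical):
# Parse the 'Month-Day-Year' strings back to datetime on reload
df = pd.read_csv('dates.csv')
df['new format'] = pd.to_datetime(df['new format'], format='%m-%d-%Y')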

python compare date list to start and end date columns in dataframe

Problem: I have a dataframe with two columns: start date and end date. I also have a list of dates. So let's say the data looks something like this:
data = [['1/1/2018','3/1/2018'], ['2/1/2018','3/1/2018'], ['4/1/2018','6/1/2018']]
df = pd.DataFrame(data, columns=['startdate','enddate'])
dates = ['1/1/2018','2/1/2018']
What I need to do is:
1) Create a new column for each date in the dates list.
2) For each row in the df, if the date for the new column is between the start and end date, assign 1; if not, assign 0.
I have tried to use zip, but then I realized that the df will have thousands of rows whereas the dates list will contain about 24 items (spanning 2 years), so zip stops when the dates list is exhausted, i.e., at 24.
So below is what the original df looks like and how it should look like afterwards:
Before:
startdate enddate
0 2018-01-01 2018-03-01
1 2018-02-01 2018-03-01
2 2018-04-01 2018-06-01
After:
startdate enddate 1/1/2018 2/1/2018
0 1/1/2018 3/1/2018 1 1
1 2/1/2018 3/1/2018 0 1
2 4/1/2018 6/1/2018 0 0
Any help on this would be much appreciated, thanks!
Using numpy broadcasting:
# Make sure the date columns and the date list are datetime64
s1 = pd.to_datetime(df.startdate).values
s2 = pd.to_datetime(df.enddate).values
# Shape (len(dates), 1) so it broadcasts against the row arrays
v = pd.to_datetime(pd.Series(dates)).values[:, None]
# 1 where the date falls inside [startdate, enddate], else 0
newdf = pd.DataFrame(((s1 <= v) & (s2 >= v)).T.astype(int), columns=dates, index=df.index)
pd.concat([df, newdf], axis=1)
startdate enddate 1/1/2018 2/1/2018
0 2018-01-01 2018-03-01 1 1
1 2018-02-01 2018-03-01 0 1
2 2018-04-01 2018-06-01 0 0
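Since the dates list is short (about 24 entries), a plain loop over it is also easy to read and fast enough (a sketch assuming startdate and enddate have already been converted to datetime):
# One 0/1 column per date in the list
for label in dates:
    d = pd.to_datetime(label)
    df[label] = ((df['startdate'] <= d) & (df['enddate'] >= d)).astype(int)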

Prepare Data Frames to be compared. Index manipulation, datetime and beyond

Ok, this is a question in two steps.
Step one: I have a pandas DataFrame like this:
date time value
0 20100201 0 12
1 20100201 6 22
2 20100201 12 45
3 20100201 18 13
4 20100202 0 54
5 20100202 6 12
6 20100202 12 18
7 20100202 18 17
8 20100203 6 12
...
As you can see, for instance between rows 7 and 8 there is data missing (in this case, the value for the 0 time). Sometimes, several hours or even a full day could be missing.
I would like to convert this DataFrame to the format like this:
value
2010-02-01 00:00:00 12
2010-02-01 06:00:00 22
2010-02-01 12:00:00 45
2010-02-01 18:00:00 13
2010-02-02 00:00:00 54
2010-02-02 06:00:00 12
2010-02-02 12:00:00 18
2010-02-02 18:00:00 17
...
I want this because I have another DataFrame (let's call it the "reliable DataFrame") in this format that I am sure has no missing values.
EDIT 2016/07/28: Studying the problem it seems there were also duplicated data in the dataframe. See the solution to also address this problem.
Step two: With the previous step done I want to compare row by row the index in the "reliable DataFrame" with the index in the DataFrame with missing values.
I want to add a row with the value NaN where there are missing entries in the first DataFrame. The final check would be to be sure that both DataFrames have the same dimension.
I know this is a long question, but I am stuck. I have tried to manage the dates with dateutil.parser.parse and to use set_index to set a new index, but I get lots of errors in the code. I am afraid this is clearly above my pandas level.
Thank you in advance.
Step 1 Answer
df['DateTime'] = pd.to_datetime(df['date'].astype(str) + ' ' + df['time'].astype(str) + ':00:00')
df.set_index('DateTime', inplace=True)
df.drop(['date', 'time'], axis=1, inplace=True)
If there are duplicates these can be removed by:
df = df.reset_index().drop_duplicates(subset='DateTime',keep='last').set_index('DateTime')
Step 2
# df1 is the "reliable DataFrame"; an outer join inserts NaN rows for any
# timestamps missing from either side
df_join = df.join(df1, how='outer', lsuffix='x', sort=True)
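If the reliable DataFrame is simply the complete 6-hourly grid, step two can also be done without a join by reindexing to that frequency (a sketch; asfreq assumes the DateTime index from step 1 and inserts NaN rows at missing timestamps):
# Expand the index to a full 6-hour grid; gaps appear as NaN rows
df_filled = df.asfreq('6H')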

How to access last element of a multi-index dataframe

I have a dataframe with IDs and timestamps as a multi-index. The index is sorted by IDs and timestamps, and I want to pick the latest timestamp for each ID. For example:
IDs timestamp value
0 2010-10-30 1
2010-11-30 2
1 2000-01-01 300
2007-01-01 33
2010-01-01 400
2 2000-01-01 11
So basically the result I want is
IDs timestamp value
0 2010-11-30 2
1 2010-01-01 400
2 2000-01-01 11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import io
content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_table(content, header=0, sep=r'\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
Using reset_index followed by groupby:
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
timestamp value
IDs
0 2010-11-30 00:00:00 2
1 2010-01-01 00:00:00 400
2 2000-01-01 00:00:00 11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To not skip NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
One can also use
df.groupby("IDs").tail(1)
This will take the last row of each label in level "IDs" and will not ignore NaN values.
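If the index is not guaranteed to be sorted, sorting it first makes tail(1) reliably pick the latest timestamp (a minimal sketch):
# Sort the MultiIndex, then take the last row per IDs level
print(df.sort_index().groupby(level='IDs').tail(1))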
