Pandas not sorting datetime columns?

I have a dataframe as:
df:
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
| | Unnamed: 0 | country | league | game | home_odds | draw_odds | away_odds | home_score | away_score | datetime |
+=====+==============+=========================+==============================+====================================================+=============+=============+=============+==============+==============+=====================+
| 0 | 0 | Chile | Primera Division | Nublense - A. Italiano | 2.25 | 3.33 | 3.11 | 1 | 0 | 2021-06-08 00:30:00 |
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
| 1 | 1 | China | Jia League | Zibo Cuju - Shaanxi Changan | 11.54 | 4.39 | 1.31 | nan | nan | 2021-06-08 08:00:00 |
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
| 2 | 2 | Algeria | U21 League | Medea U21 - MC Alger U21 | 2.38 | 3.23 | 2.59 | nan | nan | 2021-06-08 09:00:00 |
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
| 3 | 3 | Algeria | U21 League | Skikda U21 - CR Belouizdad U21 | 9.48 | 4.9 | 1.25 | nan | nan | 2021-06-08 09:00:00 |
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
| 4 | 4 | China | Jia League | Zhejiang Professional - Xinjiang Tianshan | 1.2 | 5.92 | 12.18 | nan | nan | 2021-06-08 10:00:00 |
+-----+--------------+-------------------------+------------------------------+----------------------------------------------------+-------------+-------------+-------------+--------------+--------------+---------------------+
I have converted the datetime column to datetime dtype:
df['datetime'] = pd.to_datetime(df['datetime'])
and then tried to sort it
df.sort_values(by=['datetime'], ascending=True)
However, the sorting does not work.
Can anybody help me understand why?
Please find the entire dataframe here for reference.
p.s. I am unable to paste the entire dataframe here because of character constraints.

I see in the comments you already found your solution: sort_values() returns a new, sorted DataFrame rather than modifying the original, so you have to assign the result back, e.g. df = df.sort_values(...). I'll add this as an answer.
df.sort_values(by=['datetime'], ascending=True, inplace=True)
With inplace=True the sort happens in place, so you don't have to assign the result back to df.
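To make the difference concrete, here is a minimal, self-contained sketch (with a made-up two-row frame) showing both the assignment and the inplace=True variants:

```python
import pandas as pd

# Hypothetical two-row frame with out-of-order timestamps.
df = pd.DataFrame({
    "game": ["late", "early"],
    "datetime": ["2021-06-08 08:00:00", "2021-06-08 00:30:00"],
})
df["datetime"] = pd.to_datetime(df["datetime"])

# Variant 1: sort_values() returns a sorted copy, so assign it back.
df = df.sort_values(by=["datetime"], ascending=True)

# Variant 2: sort in place; no reassignment needed.
df.sort_values(by=["datetime"], ascending=True, inplace=True)

print(df["game"].tolist())
```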

Related

Get sorted string array by column value in a pandas DataFrame

Given the following Python pandas DataFrame:
| date | days | country |
| ------------- | ---------- | --------- |
| 2022-02-01 | 1 | Spain |
| 2022-02-02 | 2 | Spain |
| 2022-02-01 | 3 | Italy |
| 2022-02-03 | 2 | France |
| 2022-02-03 | 1 | Germany |
| 2022-02-04 | 1 | Italy |
| 2022-02-04 | 1 | UK |
| 2022-02-05 | 2 | UK |
| 2022-02-04 | 5 | Spain |
| 2022-02-04 | 1 | Portugal |
I want to get a ranking by country according to its number of days.
| country | count_days |
| ---------------- | ----------- |
| Spain | 8 |
| Italy | 4 |
| UK | 3 |
| France | 2 |
| Germany | 1 |
| Portugal | 1 |
Finally I want to return the countries from most to least number of rows in a string array.
return: countries = ['Spain', 'Italy', 'UK', 'France', 'Germany', 'Portugal']
First aggregate with sum(), then sort the values and convert back to a DataFrame:
df1 = (df.groupby('country')['days']
         .sum()
         .sort_values(ascending=False)
         .reset_index(name='count_days'))
print(df1)
country count_days
0 Spain 8
1 Italy 4
2 UK 3
3 France 2
4 Germany 1
5 Portugal 1
Last, convert the column to a list:
countries = df1['country'].tolist()
Solution without DataFrame df1:
countries = df.groupby('country')['days'].sum().sort_values(ascending=False).index.tolist()
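As a quick end-to-end check, the one-liner can be run on the sample data (rebuilt here by hand, date column omitted since it plays no role). Note that the relative order of tied countries (Germany/Portugal, both 1) is not guaranteed by the default sort; kind="stable" pins it down:

```python
import pandas as pd

# Rebuild the sample frame from the question (date column omitted).
df = pd.DataFrame({
    "days": [1, 2, 3, 2, 1, 1, 1, 2, 5, 1],
    "country": ["Spain", "Spain", "Italy", "France", "Germany",
                "Italy", "UK", "UK", "Spain", "Portugal"],
})

# Sum days per country, sort descending, take the index as a list.
countries = (df.groupby("country")["days"]
               .sum()
               .sort_values(ascending=False, kind="stable")
               .index.tolist())
print(countries)
```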

How do I correctly remove all text from column in Pandas?

I have a dataframe as:
df:
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| | country | league | home_odds | draw_odds | away_odds | home_score | away_score | home_team | away_team | datetime |
+=====+=========================+==============================+=============+=============+=============+==============+==============+==========================+==============================+=====================+
| 63 | Chile | Primera Division | 2.80 | 3.05 | 2.63 | 3 | 1 | Melipilla | O'Higgins | 2021-06-07 00:30:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 64 | North & Central America | CONCACAF Nations League | 2.95 | 3.07 | 2.49 | 3 | 2 ET | USA | Mexico | 2021-06-07 01:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 66 | World | World Cup 2022 | 1.04 | 13.43 | 28.04 | 0 | 1 | Kyrgyzstan | Mongolia | 2021-06-07 07:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 65 | World | Friendly International | 1.52 | 3.91 | 7.01 | 1 | 1 | Serbia | Jamaica | 2021-06-07 07:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
I want the home_score and away_score columns to be plain integers, and I am trying a regex:
df[['home_score', 'away_score']] = re.sub('\D', '', '.*')
however, all the values come back blank.
How do I correctly do it?
You can try the str.extract() and astype() methods:
df['away_score'] = df['away_score'].str.extract(r'^(\d+)').astype(int)
df['home_score'] = df['home_score'].str.extract(r'^(\d+)').astype(int)
OR
df['away_score'] = df['away_score'].str.extract(r'([0-9]+)').astype(int)
df['home_score'] = df['home_score'].str.extract(r'([0-9]+)').astype(int)
output:
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| | country | league | home_odds | draw_odds | away_odds | home_score | away_score | home_team | away_team | datetime |
+=====+=========================+==============================+=============+=============+=============+==============+==============+==========================+==============================+=====================+
| 63 | Chile | Primera Division | 2.80 | 3.05 | 2.63 | 3 | 1 | Melipilla | O'Higgins | 2021-06-07 00:30:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 64 | North & Central America | CONCACAF Nations League | 2.95 | 3.07 | 2.49 | 3 | 2 | USA | Mexico | 2021-06-07 01:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 66 | World | World Cup 2022 | 1.04 | 13.43 | 28.04 | 0 | 1 | Kyrgyzstan | Mongolia | 2021-06-07 07:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
| 65 | World | Friendly International | 1.52 | 3.91 | 7.01 | 1 | 1 | Serbia | Jamaica | 2021-06-07 07:00:00 |
+-----+-------------------------+------------------------------+-------------+-------------+-------------+--------------+--------------+--------------------------+------------------------------+---------------------+
You can do df[['home_score', 'away_score']] = df[['home_score', 'away_score']].applymap(lambda x: int(float(x)))
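For a self-contained illustration, here is a sketch on a hypothetical mini-frame (scores stored as strings, one with an "ET" suffix), using expand=False so extract() returns a Series rather than a one-column DataFrame:

```python
import pandas as pd

# Hypothetical mini-frame: scores stored as strings, one with an "ET" suffix.
df = pd.DataFrame({"home_score": ["3", "3", "0"],
                   "away_score": ["1", "2 ET", "1"]})

# Keep only the leading digits, then cast to int.
for col in ["home_score", "away_score"]:
    df[col] = df[col].str.extract(r"^(\d+)", expand=False).astype(int)

print(df["away_score"].tolist())
```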

Python rolling period returns

I need to develop a rolling 6-month return on the following dataframe
date Portfolio Performance
2001-11-30 1.048134
2001-12-31 1.040809
2002-01-31 1.054187
2002-02-28 1.039920
2002-03-29 1.073882
2002-04-30 1.100327
2002-05-31 1.094338
2002-06-28 1.019593
2002-07-31 1.094096
2002-08-30 1.054130
2002-09-30 1.024051
2002-10-31 0.992017
A lot of the answers from previous questions describe rolling average returns, which I can do. However, I am not looking for the average. What I need is illustrated by the following example formula for a rolling 6-month return:
(1.100327 - 1.048134)/1.100327
The formula would then consider the next 6-month block between 2001-12-31 and 2002-05-31 and continue through to the end of the dataframe.
I've tried the following, but it doesn't provide the right answer:
portfolio['rolling'] = portfolio['Portfolio Performance'].rolling(window=6).apply(np.prod) - 1
Expected output would be:
date Portfolio Performance Rolling
2001-11-30 1.048134 NaN
2001-12-31 1.040809 NaN
2002-01-31 1.054187 NaN
2002-02-28 1.039920 NaN
2002-03-29 1.073882 NaN
2002-04-30 1.100327 0.0520
2002-05-31 1.094338 0.0422
2002-06-28 1.019593 -0.0280
The current output is:
Portfolio Performance rolling
date
2001-11-30 1.048134 NaN
2001-12-31 1.040809 NaN
2002-01-31 1.054187 NaN
2002-02-28 1.039920 NaN
2002-03-29 1.073882 NaN
2002-04-30 1.100327 0.413135
2002-05-31 1.094338 0.475429
2002-06-28 1.019593 0.445354
2002-07-31 1.094096 0.500072
2002-08-30 1.054130 0.520569
2002-09-30 1.024051 0.450011
2002-10-31 0.992017 0.307280
I simply added a column with the values shifted by 6 rows and applied the formula presented. Does this meet the intent of the question?
df['before_6m'] = df['Portfolio Performance'].shift(6)
df['rolling'] = (df['Portfolio Performance'] - df['before_6m'])/df['Portfolio Performance']
df
| | date | Portfolio Performance | before_6m | rolling |
|---:|:--------------------|------------------------:|------------:|------------:|
| 0 | 2001-11-30 00:00:00 | 1.04813 | nan | nan |
| 1 | 2001-12-31 00:00:00 | 1.04081 | nan | nan |
| 2 | 2002-01-31 00:00:00 | 1.05419 | nan | nan |
| 3 | 2002-02-28 00:00:00 | 1.03992 | nan | nan |
| 4 | 2002-03-29 00:00:00 | 1.07388 | nan | nan |
| 5 | 2002-04-30 00:00:00 | 1.10033 | nan | nan |
| 6 | 2002-05-31 00:00:00 | 1.09434 | 1.04813 | 0.042221 |
| 7 | 2002-06-28 00:00:00 | 1.01959 | 1.04081 | -0.0208083 |
| 8 | 2002-07-31 00:00:00 | 1.0941 | 1.05419 | 0.0364767 |
| 9 | 2002-08-30 00:00:00 | 1.05413 | 1.03992 | 0.0134803 |
| 10 | 2002-09-30 00:00:00 | 1.02405 | 1.07388 | -0.0486607 |
| 11 | 2002-10-31 00:00:00 | 0.992017 | 1.10033 | -0.109182 |
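The shift-based answer can be checked end-to-end on the question's data; the first non-NaN value lands six rows in (2002-05-31) and reproduces the 0.0422 figure from the expected output:

```python
import pandas as pd

# Rebuild the sample series from the question.
perf = [1.048134, 1.040809, 1.054187, 1.039920, 1.073882, 1.100327,
        1.094338, 1.019593, 1.094096, 1.054130, 1.024051, 0.992017]
df = pd.DataFrame({"Portfolio Performance": perf})

# Shift by 6 rows and apply the (current - past) / current formula.
df["before_6m"] = df["Portfolio Performance"].shift(6)
df["rolling"] = ((df["Portfolio Performance"] - df["before_6m"])
                 / df["Portfolio Performance"])

print(df["rolling"].iloc[6])
```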

Calculate rolling weekly/monthly change, using daily price data (with datetime stamps) in pandas

I have a pandas dataframe of daily stock price data, which is datetime-stamped. I'm wondering about the easiest way to create new columns with weekly, monthly, or annual growth rates for this price data, on a rolling basis.
Note that my daily price data only includes records for where there has been a change in price. i.e. no records for non-trading days.
For example, I want to generate something like this:
| | daily | weekly | monthly |
|------------|-------|--------|---------|
| 1/01/2000 | 2.00 | NaN | NaN |
| 3/01/2000 | 4.05 | NaN | NaN |
| 4/01/2000 | 2.10 | NaN | NaN |
| 5/01/2000 | 2.15 | NaN | NaN |
| 6/01/2000 | 3.20 | NaN | NaN |
| 7/01/2000 | 3.25 | 0.625 | NaN |
| 10/01/2000 | 3.30 | -0.185 | NaN |
| 11/01/2000 | 3.35 | 0.595 | NaN |
| 12/01/2000 | 3.40 | 0.581 | NaN |
| 13/01/2000 | 4.45 | 0.391 | NaN |
| 14/01/2000 | 2.50 | -0.231 | NaN |
| 17/01/2000 | 3.55 | 0.076 | NaN |
| 18/01/2000 | 4.60 | 0.373 | NaN |
| 19/01/2000 | 2.65 | -0.221 | NaN |
| 20/01/2000 | 4.70 | 0.056 | NaN |
| 21/01/2000 | 3.75 | 0.500 | NaN |
| 24/01/2000 | 2.80 | -0.211 | NaN |
| 25/01/2000 | 3.85 | -0.163 | NaN |
| 26/01/2000 | 3.90 | 0.472 | NaN |
| 27/01/2000 | 2.95 | -0.372 | NaN |
| 28/01/2000 | 3.00 | -0.200 | NaN |
| 31/01/2000 | 4.05 | 0.446 | NaN |
| 1/02/2000 | 3.10 | -0.195 | 0.550 |
| 2/02/2000 | 3.15 | -0.192 | 0.575 |
| 3/02/2000 | 5.20 | 0.763 | 0.284 |
| 4/02/2000 | 4.25 | 0.417 | 1.024 |
| 7/02/2000 | 5.30 | 0.309 | 0.631 |
| 8/02/2000 | 4.35 | 0.403 | 0.338 |
The weekly calculation is easy enough: it seems you can just lag by five rows (trading days):
shifted = data['daily'].shift(5)
data['weekly'] = (data['daily'] - shifted) / shifted
Monthly is harder: where a date is missing, you want to compare against the previous available date (i.e. on 2/2/2000 you compare to 1/1/2000 because there is no 2/1/2000), or at least that's what the expected result suggests. To do that, first fill in the missing dates using date_range() and reindex() with the "pad" fill method, which carries the previous day's value forward.
data.index = pd.to_datetime(data.index, dayfirst=True)
lag_data = data.reindex(pd.date_range(data.index.min(), data.index.max()), method='pad')
lag_data.index = lag_data.index + pd.DateOffset(31)
monthly = (data['daily'] - lag_data['daily']) / lag_data['daily']
data.join(monthly.rename('monthly'))
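Putting both pieces together, here is a self-contained sketch on the first seven rows of the sample (dates rewritten in ISO form; the monthly column stays NaN because this slice spans less than 31 days):

```python
import pandas as pd

# First seven rows of the sample: daily prices on trading days only.
data = pd.DataFrame(
    {"daily": [2.00, 4.05, 2.10, 2.15, 3.20, 3.25, 3.30]},
    index=pd.to_datetime(["2000-01-01", "2000-01-03", "2000-01-04",
                          "2000-01-05", "2000-01-06", "2000-01-07",
                          "2000-01-10"]),
)

# Weekly: compare each row to the price five trading rows earlier.
shifted = data["daily"].shift(5)
data["weekly"] = (data["daily"] - shifted) / shifted

# Monthly: forward-fill onto a full calendar, push the index forward
# 31 days, then align back onto the original trading dates.
filled = data["daily"].reindex(
    pd.date_range(data.index.min(), data.index.max()), method="pad")
lagged = filled.copy()
lagged.index = lagged.index + pd.DateOffset(31)
aligned = lagged.reindex(data.index)
data["monthly"] = (data["daily"] - aligned) / aligned

print(data["weekly"].round(3).tolist())
```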

pandas pivot table to data frame [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have a dataframe (df) that looks like this:
+---------+-------+------------+----------+
| subject | pills | date | strength |
+---------+-------+------------+----------+
| 1 | 4 | 10/10/2012 | 250 |
| 1 | 4 | 10/11/2012 | 250 |
| 1 | 2 | 10/12/2012 | 500 |
| 2 | 1 | 1/6/2014 | 1000 |
| 2 | 1 | 1/7/2014 | 250 |
| 2 | 1 | 1/7/2014 | 500 |
| 2 | 3 | 1/8/2014 | 250 |
+---------+-------+------------+----------+
When I use reshape in R, I get what I want:
reshape(df, idvar = c("subject","date"), timevar = 'strength', direction = "wide")
+---------+------------+--------------+--------------+---------------+
| subject | date | strength.250 | strength.500 | strength.1000 |
+---------+------------+--------------+--------------+---------------+
| 1 | 10/10/2012 | 4 | NA | NA |
| 1 | 10/11/2012 | 4 | NA | NA |
| 1 | 10/12/2012 | NA | 2 | NA |
| 2 | 1/6/2014 | NA | NA | 1 |
| 2 | 1/7/2014 | 1 | 1 | NA |
| 2 | 1/8/2014 | 3 | NA | NA |
+---------+------------+--------------+--------------+---------------+
Using pandas:
df.pivot_table(df, index=['subject','date'],columns='strength')
+---------+------------+-------+----+-----+
| | | pills |
+---------+------------+-------+----+-----+
| | strength | 250 | 500| 1000|
+---------+------------+-------+----+-----+
| subject | date | | | |
+---------+------------+-------+----+-----+
| 1 | 10/10/2012 | 4 | NA | NA |
| | 10/11/2012 | 4 | NA | NA |
| | 10/12/2012 | NA | 2 | NA |
+---------+------------+-------+----+-----+
| 2 | 1/6/2014 | NA | NA | 1 |
| | 1/7/2014 | 1 | 1 | NA |
| | 1/8/2014 | 3 | NA | NA |
+---------+------------+-------+----+-----+
How do I get exactly the same output as in R with pandas? I only want 1 header.
After pivoting, convert the dataframe to records and then back to a dataframe:
pivoted = df.pivot_table(index=['subject', 'date'], columns='strength')
flattened = pd.DataFrame(pivoted.to_records())
# subject date ('pills', 250) ('pills', 500) ('pills', 1000)
#0 1 10/10/2012 4.0 NaN NaN
#1 1 10/11/2012 4.0 NaN NaN
#2 1 10/12/2012 NaN 2.0 NaN
#3 2 1/6/2014 NaN NaN 1.0
#4 2 1/7/2014 1.0 1.0 NaN
#5 2 1/8/2014 3.0 NaN NaN
You can now "repair" the column names, if you want:
flattened.columns = [hdr.replace("('pills', ", "strength.").replace(")", "") \
for hdr in flattened.columns]
flattened
# subject date strength.250 strength.500 strength.1000
#0 1 10/10/2012 4.0 NaN NaN
#1 1 10/11/2012 4.0 NaN NaN
#2 1 10/12/2012 NaN 2.0 NaN
#3 2 1/6/2014 NaN NaN 1.0
#4 2 1/7/2014 1.0 1.0 NaN
#5 2 1/8/2014 3.0 NaN NaN
It's awkward, but it works.
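Less awkwardly, if you are free to name the values column explicitly, pivoting with values='pills' yields single-level columns that can be renamed directly, with no round-trip through to_records() (a sketch, not the only way):

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 2],
    "pills": [4, 4, 2, 1, 1, 1, 3],
    "date": ["10/10/2012", "10/11/2012", "10/12/2012",
             "1/6/2014", "1/7/2014", "1/7/2014", "1/8/2014"],
    "strength": [250, 250, 500, 1000, 250, 500, 250],
})

# values="pills" keeps the columns single-level (just the strength values).
pivoted = df.pivot_table(index=["subject", "date"],
                         columns="strength", values="pills")
pivoted.columns = [f"strength.{c}" for c in pivoted.columns]
flattened = pivoted.reset_index()
print(flattened.columns.tolist())
```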
