Sort table in pandas by certain value in column - python

I am using pandas to sort this table by "Departure date" and "Value", which I can do with sort_values(["Departure date", "Value"]). The problem is that I need to sort only Wednesday's flights, starting from the cheapest.
When I print(type(Data["Departure date"])) it says <class 'pandas.core.series.Series'>, if that helps.
City       Departure date  Airline   Value
Podgorica  Sat 1 Jan       Ryanair   14.46
Managua    Wed 5 Jan       Ryanair   1699.05
Bucharest  Tue 11 Jan      Ryanair   38.24
Oslo       Wed 12 Jan      Ryanair   24.32
Istanbul   Wed 12 Jan      Ryanair   120.00
Kyiv       Wed 12 Jan      Windrose  227.43
I could maybe split Departure date and extract only days of week, sort and join them later but it looks like a lot of work.
I just recently started with python and pandas so any help is welcomed. Thank you!

Isn't sorting by two columns the solution for you? Or do other dates need to stay in the same order?
data.sort_values(['Departure date', 'Value'])
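If only the Wednesday rows matter, one option is to filter on the weekday prefix first and then sort by price. This is a minimal sketch, assuming "Departure date" holds strings that start with a three-letter weekday, as in the table from the question. The frame below just re-creates that table, and the name wednesdays is made up for illustration.

```python
import pandas as pd

# Re-create the table from the question
data = pd.DataFrame({
    "City": ["Podgorica", "Managua", "Bucharest", "Oslo", "Istanbul", "Kyiv"],
    "Departure date": ["Sat 1 Jan", "Wed 5 Jan", "Tue 11 Jan",
                       "Wed 12 Jan", "Wed 12 Jan", "Wed 12 Jan"],
    "Airline": ["Ryanair", "Ryanair", "Ryanair", "Ryanair", "Ryanair", "Windrose"],
    "Value": [14.46, 1699.05, 38.24, 24.32, 120.00, 227.43],
})

# Keep only rows whose date string starts with "Wed", then sort by price
is_wednesday = data["Departure date"].str.startswith("Wed")
wednesdays = data[is_wednesday].sort_values("Value")
print(wednesdays)
```

This avoids splitting the date column at all; the other rows are simply left out of the result.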

Related

Sort out dates with different formats Python

I am having a hard time trying to sort dates with different formats. I have a Series with inputs containing dates in many different formats and need to extract them and sort them chronologically.
So far I have setup different regex for fully numerical dates (01/01/1989), dates with month (either Mar 12 1989 or March 1989 or 12 Mar 1989) and dates where only the year is given (see code below)
pat1=r'(\d{0,2}[/-]\d{0,2}[/-]\d{2,4})' # matches mm/dd/yy and mm/dd/yyyy
pat2=r'((\d{1,2})?\W?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\W+(\d{1,2})?\W?\d{4})' # "Jun" plus [a-z]* matches both Jun and June
pat3=r'((?<!\d)(\d{4})(?!\d))'
finalpat=pat1 + "|"+ pat2 + "|" + pat3
df2=df1.str.extractall(finalpat).groupby(level=0).first()
I now have a dataframe in which the matches of the different regular expressions above sit in different columns, and I need to turn them into usable dates.
The problem is that dates like Mar 12 1989, 12 Mar 1989 and Mar 1989 (no day) end up in the same column of my dataframe.
If only one of the two formats (Month dd YYYY and dd Month YYYY) were present, I could easily do this:
df3 = df2.copy()
dico = {"Jan":'01','Feb':'02','Mar':'03','Apr':'04','May':'05','Jun':'06','Jul':'07','Aug':'08','Sep':'09','Oct':'10','Nov':'11','Dec':'12'}
# Remove the letters of each month name after the first three
df3[1] = df3[1].str.replace(r"(?<=[A-Z]{1}[a-z]{2})\w*", "", regex=True)
# Replace the three-letter month by its number
for key, item in dico.items():
    df3[1] = df3[1].str.replace(key, item)
# Add 01 if no day is given
df3[1] = df3[1].str.replace(r"^(\d{1,2}/\d{4})", r'01/\g<1>', regex=True)
df3[1] = pd.to_datetime(df3[1], format='%d/%m/%Y').dt.strftime('%Y%m%d')
where df3[1] is the column of interest. I use a dictionary to change the months to their numbers and get the dates in the form I want.
The problem is that with two date formats (Mar 12 1989 and 12 Mar 1989), one of the two will be transformed incorrectly.
Is there a way to discriminate between the date formats and apply different transformations accordingly?
Thanks a lot
"problem I have is that I got dates like Mar 12 1989 and 12 Mar 1989 and Mar 1989 (no day) in the same column of my dataframe."
pandas.to_datetime can cope with that; consider the following example:
import pandas as pd
df = pd.DataFrame({'d_str':["Mar 12 1989", "12 Mar 1989", "Mar 1989"]})
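df['d_dt'] = pd.to_datetime(df.d_str, format='mixed')  # pandas >= 2.0 needs format='mixed' to parse each string individually; omit it on older versions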
print(df)
Output:
         d_str       d_dt
0  Mar 12 1989 1989-03-12
1  12 Mar 1989 1989-03-12
2     Mar 1989 1989-03-01
Now you can sort using d_dt, as it has dtype datetime64[ns], but keep in mind that a missing day is treated as the 1st day of the given month. Be warned, though, that fully numeric dates are ambiguous: by default pandas reads a string such as 01/02/89 month-first (mm/dd/yy), so day-first data would be misparsed unless you pass dayfirst=True.
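To make the sorting step concrete, here is a small sketch that parses each string on its own and then sorts chronologically. Parsing element by element is an assumption chosen so the example does not depend on how a given pandas version infers a single format for the whole column. The sample strings and the name df_sorted are made up for illustration.

```python
import pandas as pd

# Made-up sample mixing "Mon dd YYYY", "dd Mon YYYY" and "Mon YYYY"
df = pd.DataFrame({"d_str": ["12 Mar 1990", "Mar 1989", "Mar 12 1989", "1 Jan 1989"]})

# Parse every string separately so differing formats do not interfere
df["d_dt"] = df["d_str"].apply(lambda s: pd.to_datetime(s))

# Sort on the datetime column rather than on the raw strings
df_sorted = df.sort_values("d_dt").reset_index(drop=True)
print(df_sorted)
```

Element-wise parsing is slower than a vectorised call, so it is mainly worth it when the formats really are mixed.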

How can I order the table by month and then show it in plot chart? Python

I want to order the table by year and then by month.
sort_values doesn't work for me.
After that, I need to show it as a line chart with the months over time.
How can I do it?
df10=df.groupby(['year','month'],as_index=False).Sales.sum()
df10
    year month             Sales
0   2018   Apr  452546547.720000
1   2018   Aug  452830473.750001
2   2018   Dec  525888501.900000
3   2018   Feb  417589010.130000
4   2018   Jan  506665837.860000
5   2018   Jul  527113871.520000
6   2018   Jun  489527703.960000
7   2018   Mar  471807206.670001
8   2018   May  517740285.600000
9   2018   Nov  417862539.330000
10  2018   Oct  441153829.710001
11  2018   Sep  450298873.800000
12  2019   Apr  440397073.890000
13  2019   Feb  408684717.060001
14  2019   Jan  511212275.310001
15  2019   Mar  455560627.320000
16  2019   May  571120956.510000
sns.lineplot(x='month',y='Sales',data=df10)
To sort by month, you need the month as a number or as text that sorts correctly. Either way, the code below gets the month as a number; after that you can plot the df however you like.
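from time import strptime
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df: the frame with 'year', 'month' and 'Sales' columns (df10 in the question)
# Map abbreviated month names ('Jan', 'Feb', ...) to month numbers 1-12
df['month_num'] = [strptime(x, '%b').tm_mon for x in df['month']]
df = df.sort_values(['year', 'month_num'])
# Build a real datetime column so the x-axis is chronological
df['y-m'] = pd.to_datetime(df['year'].astype(str) + '-' + df['month'], format='%Y-%b')
sns.lineplot(y='Sales', x='y-m', data=df)
plt.xticks(rotation=45)
plt.show()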
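When sorting by dates, you first need to convert your data to values that order chronologically; the key parameter of sort_values (pandas >= 1.1) helps you with that.
key is called once per sort column and receives that column as a Series, so only the month column is converted here:
df10.sort_values(by=["year", "month"], key=lambda s: pd.to_datetime(s, format="%b").dt.month if s.name == "month" else s)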
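Another way to get chronological month order, shown here as a sketch rather than as what either answer proposes, is to turn the month column into an ordered categorical. The frame below is a shortened copy of df10 from the question; month_order and df_sorted are names introduced for illustration.

```python
import pandas as pd

# Shortened copy of the grouped frame from the question
df10 = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019],
    "month": ["Apr", "Feb", "Jan", "Feb", "Jan"],
    "Sales": [452546547.72, 417589010.13, 506665837.86, 408684717.06, 511212275.31],
})

# Calendar order for the abbreviated month names
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# An ordered categorical sorts by position in month_order, not alphabetically
df10["month"] = pd.Categorical(df10["month"], categories=month_order, ordered=True)
df_sorted = df10.sort_values(["year", "month"]).reset_index(drop=True)
print(df_sorted)
```

The month column keeps its readable labels, so no helper column has to be added or dropped afterwards.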

How to parse irregular date formats in pandas?

I'm parsing a date column that contains irregular date formats that pandas cannot interpret. The dates use different languages for days, months, and years, as well as varying formats. The entries often include timestamps as well. (Bonus: Would separating them by string/regex with lambda/loops be the fastest method?) What's the best option and workflow to tackle these several tens of thousands of date entries?
The entries are unknown to pandas and dateutil.parser.
Examples include:
19.8.2017, 21:23:32
31/05/2015 19:41:56
Saturday, 18. May
11 - 15 July 2001
2019/4/28 下午6:29:28
1 JuneMay 2000
19 aprile 2008 21:16:37 GMT+02:00
Samstag, 15. Mai 2010 20:55:10
So 23 Jun 2007 23:45 CEST
28 August 1998
30 June 2001
1 Ноябрь 2008 г. 18:46:59
Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time)
May-28-11 6:56:08 PM
Sat Jun 26 2010 21:55:54 GMT+0200 (West-Europa (zomertijd))
lunedì 5 maggio 2008 9.30.33
"ValueError: ('Unknown string format:', '1 JuneMay 2000')"
I realize this may be a cumbersome and undesirable task. Luckily the dates are currently nonessential to my project so they may be left alone, but a solution would be favorable. Any and all replies are appreciated, thank you.
Line by line, many of your dates work:
>>> pd.to_datetime('19.8.2017, 21:23:32')
Timestamp('2017-08-19 21:23:32')
But there are several issues:
As your format is irregular, pandas cannot guess whether 01-02-2019 is the first of February 2019 or the second of January 2019, and I don't know if you can either.
Some of your examples cannot be converted into a date at all, e.g. Saturday, 18. May: which year?
There are months and days in different languages (aprile seems Italian, Samstag is German).
Some of your examples work without the content in parentheses:
>>> pd.to_datetime('Sat Jun 18 2011 19:46:46 GMT+0200') # works
Timestamp('2011-06-18 19:46:46-0200', tz='pytz.FixedOffset(-120)')
>>> pd.to_datetime('Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time) ') # doesn't work.
...
ValueError: ('Unknown string format:', 'Sat Jun 18 2011 19:46:46 GMT+0200 (Romance Daylight Time) ')
You certainly cannot turn every date into a timestamp. I would try to create a new column with the correctly parsed dates as timestamps and the others saved as NaT.
For example:
date
02-01-2019
Saturday, 18. May
will become:
date               new date
02-01-2019         Timestamp('2019-01-02 00:00:00.00')
Saturday, 18. May  NaT
For this, I would first remove the parenthesized part from the initial column:
df2 = df.assign(
    # keep only the text before the first '('
    date2=lambda x: x['date'].str.split('(').str[0].str.strip(),
    # parse row by row; anything that cannot be parsed becomes NaT
    new_date=lambda x: x['date2'].apply(lambda y: pd.to_datetime(y, errors='coerce')),
)
# Referring to date2 inside the same assign call requires Python >= 3.6
Afterwards, you can inspect what remains by keeping only the rows with NaT values.
For translation, you can try to replace words, but that will take a long time.
Parsing this way is really slow (because of the row-by-row apply), but if your data are not consistent you cannot work directly on the whole column.
I hope this helps.
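The word-replacement idea for translation could look roughly like the sketch below. The translation table is hypothetical and deliberately tiny, the helper translate_months is a made-up name, and the sample strings are simplified versions of the examples in the question (time zone suffixes removed), so treat it as a starting point only.

```python
import re
import pandas as pd

# Hypothetical, deliberately tiny translation table (Italian and German months)
MONTH_TRANSLATIONS = {"aprile": "April", "maggio": "May", "Mai": "May"}

def translate_months(text):
    """Replace whole-word foreign month names with their English equivalents."""
    for foreign, english in MONTH_TRANSLATIONS.items():
        text = re.sub(rf"\b{re.escape(foreign)}\b", english, text)
    return text

# Simplified sample entries based on the question's examples
raw = pd.Series(["19 aprile 2008 21:16:37", "28 August 1998", "1 JuneMay 2000"])

# Translate first, then parse row by row; unparseable entries become NaT
parsed = raw.map(translate_months).apply(lambda s: pd.to_datetime(s, errors="coerce"))
print(parsed)
```

Rows that are still NaT after this pass show which words or layouts the table does not cover yet.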

combining and formatting dtypes in pandas

Working on a project where ultimately I want to try and predict NBA home game attendance. I've scraped my preliminary data, but still want to add other fields such as arena capacity, win streak and other fields I might find valuable.
In my initial dataframe I'm just not sure how to combine my date fields in a way that will make it easier to plot and work with later. Also any other input would be appreciated as far as other tips. Thanks.
You have three original fields here: Date, Year, and Time. (Weekday can be derived from these.)
One route would be to concatenate their string-forms and form a Series of datetimes:
>>> concat = df['Date'] + ' ' + df['Year'].astype(str) + ' ' + df['Time']
>>> df['Fulldate'] = pd.to_datetime(concat)
>>> df
  Weekday    Date  Year     Time            Fulldate
0     Tue  Oct 30  2012  7:00 pm 2012-10-30 19:00:00
1     Tue  Oct 30  2012  7:30 pm 2012-10-30 19:30:00
2     Tue  Oct 30  2012  7:00 pm 2012-10-30 19:00:00
3     Wed  Oct 31  2012  7:30 pm 2012-10-31 19:30:00
4     Wed  Oct 31  2012  8:00 pm 2012-10-31 20:00:00
From there, you're free to derive additional fields with the .dt accessor. For instance:
>>> df.Fulldate.dt.month
0 10
1 10
2 10
3 10
4 10
Name: Fulldate, dtype: int64
>>> df.Fulldate.dt.weekday.isin((5, 6)) # weekend games
Here's a full list of datetime-like properties:
https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties
In the future, try to make your question a little more specific and post something people can easily reproduce, not pictures.

sort pandas dataframe of month names in correct order

I have a dataframe with names of months of the year, i.e. Jan, Feb, March, etc.,
and I want to sort the data first by month, then by category, so it looks like
Month_Name | Cat
Jan        | 1
Jan        | 2
Jan        | 3
Feb        | 1
Feb        | 2
Feb        | 3
Older versions of pandas don't do custom sort functions for you (sort_values only gained a key parameter in pandas 1.1), but you can easily add a temporary column holding the month number and then sort by that:
import datetime
months = {datetime.datetime(2000, i, 1).strftime("%b"): i for i in range(1, 13)}
df["month_number"] = df["Month_Name"].map(months)
df.sort_values(by=[...])
You may wish to take advantage of pandas' good date parsing when reading in your dataframe, though: if you store the dates as dates instead of string month names then you'll be able to sort natively by them.
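A self-contained sketch of the temporary-column approach follows. The sample frame is made up to mirror the Month_Name and Cat columns from the question, and the month lookup is written out as an explicit list (an assumption made here so the result does not depend on the locale that strftime("%b") would use).

```python
import pandas as pd

# Made-up data with the column names from the question
df = pd.DataFrame({
    "Month_Name": ["Feb", "Jan", "Feb", "Jan", "Jan", "Feb"],
    "Cat": [2, 3, 1, 1, 2, 3],
})

# Explicit lookup from abbreviated month name to month number
month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
months = {name: number for number, name in enumerate(month_names, start=1)}

# Temporary numeric column, used only for sorting and dropped afterwards
df["month_number"] = df["Month_Name"].map(months)
df_sorted = (
    df.sort_values(["month_number", "Cat"])
      .drop(columns="month_number")
      .reset_index(drop=True)
)
print(df_sorted)
```

Month names missing from the lookup map to NaN and are placed last by sort_values, which makes unexpected spellings such as "March" easy to spot.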
Use the Sort_Dataframeby_MonthandNumeric_cols function to sort a dataframe by month and a numeric column.
You need to install the two packages shown below.
pip install sorted-months-weekdays
pip install sort-dataframeby-monthorweek
Example:
import pandas as pd
from sorted_months_weekdays import *
from sort_dataframeby_monthorweek import *
df = pd.DataFrame([['Jan',23],['Jan',16],['Dec',35],['Apr',79],['Mar',53],['Mar',12],['Feb',3]], columns=['Month','Sum'])
df
Out[11]:
Month Sum
0 Jan 23
1 Jan 16
2 Dec 35
3 Apr 79
4 Mar 53
5 Mar 12
6 Feb 3
To get the dataframe sorted by month and numeric column, I used the function above.
Sort_Dataframeby_MonthandNumeric_cols(df = df, monthcolumn='Month',numericcolumn='Sum')
Out[12]:
Month Sum
0 Jan 16
1 Jan 23
2 Feb 3
3 Mar 12
4 Mar 53
5 Apr 79
6 Dec 35
