I have a list:
['Sun Oct 24 10:31:10 +0000 2021','Sun Oct 24 10:45:02 +0000 2021','Mon Oct 25 04:13:27 +0000 2021',
'Mon Oct 25 04:26:20 +0000 2021','Mon Oct 25 04:32:32 +0000 2021','Mon Oct 25 04:56:39 +0000 2021',
'Mon Oct 25 05:21:21 +0000 2021','Mon Oct 25 06:46:27 +0000 2021','Mon Oct 25 08:59:13 +0000 2021']
How can I get this result:
['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25',
'Mon Oct 25', 'Mon Oct 25']
We could use re.sub here for a regex-based approach:
import re

inp = ['Sun Oct 24 10:31:10 +0000 2021', 'Sun Oct 24 10:45:02 +0000 2021', 'Mon Oct 25 04:13:27 +0000 2021', 'Mon Oct 25 04:26:20 +0000 2021', 'Mon Oct 25 04:32:32 +0000 2021', 'Mon Oct 25 04:56:39 +0000 2021', 'Mon Oct 25 05:21:21 +0000 2021', 'Mon Oct 25 06:46:27 +0000 2021', 'Mon Oct 25 08:59:13 +0000 2021']
# Remove everything from the whitespace before the time onwards.
output = [re.sub(r'\s+\d{2}:.*$', '', x) for x in inp]
print(output)
# ['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25',
#  'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25']
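If you prefer to avoid a regex, splitting on whitespace and keeping the first three fields gives the same result; a minimal sketch on a couple of the input strings:
inp = ['Sun Oct 24 10:31:10 +0000 2021', 'Mon Oct 25 04:13:27 +0000 2021']
# Keep only the weekday, month and day fields.
output = [' '.join(x.split()[:3]) for x in inp]
print(output)
# ['Sun Oct 24', 'Mon Oct 25']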
If the dates have a fixed format, you can just take the first 10 characters of every date string:
dates = ['Sun Oct 24 10:31:10 +0000 2021','Sun Oct 24 10:45:02 +0000 2021','Mon Oct 25 04:13:27 +0000 2021',
'Mon Oct 25 04:26:20 +0000 2021','Mon Oct 25 04:32:32 +0000 2021','Mon Oct 25 04:56:39 +0000 2021',
'Mon Oct 25 05:21:21 +0000 2021','Mon Oct 25 06:46:27 +0000 2021','Mon Oct 25 08:59:13 +0000 2021']
trunc_dates = [
date[:10] for date in dates
]
print(trunc_dates)
Output
['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25']
You can also use a more reliable solution that parses each date with dateutil and formats it back:
from dateutil import parser
dates = ['Sun Oct 24 10:31:10 +0000 2021','Sun Oct 24 10:45:02 +0000 2021','Mon Oct 25 04:13:27 +0000 2021',
'Mon Oct 25 04:26:20 +0000 2021','Mon Oct 25 04:32:32 +0000 2021','Mon Oct 25 04:56:39 +0000 2021',
'Mon Oct 25 05:21:21 +0000 2021','Mon Oct 25 06:46:27 +0000 2021','Mon Oct 25 08:59:13 +0000 2021']
trunc_dates = [
parser.parse(date).strftime('%a %b %d')
for date in dates
]
print(trunc_dates)
Output
['Sun Oct 24', 'Sun Oct 24', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25', 'Mon Oct 25']
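If you would rather stay in the standard library, the strings appear to follow a single fixed layout, so datetime.strptime should work as well; a minimal sketch under that assumption:
from datetime import datetime

dates = ['Sun Oct 24 10:31:10 +0000 2021', 'Mon Oct 25 04:13:27 +0000 2021']
# '%a %b %d %H:%M:%S %z %Y' matches the fixed layout of the input strings.
trunc_dates = [datetime.strptime(d, '%a %b %d %H:%M:%S %z %Y').strftime('%a %b %d')
               for d in dates]
print(trunc_dates)
# ['Sun Oct 24', 'Mon Oct 25']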
Related
I'm trying to match dates in a dataframe with 500 entries using regex:
The dates can appear in the following formats:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
dates[dates[0].str.contains(r'(?P<year>\d?\d?\d\d)')].shape
returns a tuple of shape (500, 1),
but
dates[dates[0].str.contains(r'((?P<day>(\d?\d)?(\s|-|/|th|st|nd)?)??P<year>(\d?\d?\d\d))')].shape
returns a tuple of shape (0, 1). The day group is optional, so shouldn't it still match the year group?
Ok I got it.
The correct regex pattern is:
r'((?P<day>(\d?\d)?(\s|-|/|th|st|nd)?)?(?P<year>\d?\d?\d\d))'
The opening bracket for the year group was in the wrong position.
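A quick check of the corrected pattern on a small made-up frame (str.contains will warn about the match groups but still returns a boolean mask):
import pandas as pd

dates = pd.DataFrame(['04/20/2009', 'Mar 20, 2009', '2010'])
pattern = r'((?P<day>(\d?\d)?(\s|-|/|th|st|nd)?)?(?P<year>\d?\d?\d\d))'
# Every row contains a year, so all three rows match.
print(dates[dates[0].str.contains(pattern)].shape)
# (3, 1)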
I have a df, where the data looks like this:
Time      Value
          60.8
Jul 2019  58.1
          58.8
          56.9
Oct 2019  51.8
          54.6
          56.8
Jan 2020  58.8
          54.2
          51.3
Apr 2020  52.2
I want to fill in the blank cells in the Time variable according to the calendar year. So:
Time Value
Jun 2019 60.8
Jul 2019 58.1
Aug 2019 58.8
Sep 2019 56.9
Oct 2019 51.8
Nov 2019 54.6
Dec 2019 56.8
Jan 2020 58.8
Feb 2020 54.2
Mar 2020 51.3
Apr 2020 52.2
I saw a post where pandas could be used to fill in numeric values, but since my variable isn't necessarily defined in a numeric way, I'm not entirely sure how to apply it in this situation.
There seem to me to be two ways of approaching this: 1) modifying the list before writing it to the df; 2) modifying the df.
I prefer the first solution, but not sure if it is possible.
Thanks.
My script:
import pandas as pd

totalmonth=['', 'Jul 2019', '', '', 'Oct 2019', '', '', 'Jan 2020', '', '', 'Apr 2020', '']
totalvalue=['60.8', '58.1', '58.8', '56.9', '51.8', '54.6', '56.8', '58.8', '54.2', '51.3', '52.2', '48.7']
df = pd.DataFrame({'Time': totalmonth,
'Value': totalvalue})
OK, this took me longer than I would like to admit. I solved it for your first approach (modifying the list before writing it to the df).
Output:
***********************BEFORE********************************
['', 'Jul 2019', '', '', 'Oct 2019', '', '', 'Jan 2020', '', '', 'Apr 2020', '']
Time Value
0 60.8
1 Jul 2019 58.1
2 58.8
3 56.9
4 Oct 2019 51.8
5 54.6
6 56.8
7 Jan 2020 58.8
8 54.2
9 51.3
10 Apr 2020 52.2
11 48.7
***********************AFTER********************************
['Jun 2019', 'Jul 2019', 'Aug 2019', 'Sep 2019', 'Oct 2019', 'Nov 2019', 'Dec 2019', 'Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020']
Time Value
0 Jun 2019 60.8
1 Jul 2019 58.1
2 Aug 2019 58.8
3 Sep 2019 56.9
4 Oct 2019 51.8
5 Nov 2019 54.6
6 Dec 2019 56.8
7 Jan 2020 58.8
8 Feb 2020 54.2
9 Mar 2020 51.3
10 Apr 2020 52.2
11 May 2020 48.7
Code:
from datetime import datetime
from dateutil.relativedelta import relativedelta
totalmonth=['', 'Jul 2019', '', '', 'Oct 2019', '', '', 'Jan 2020', '', '', 'Apr 2020', '']
new_totalmonth = [datetime.strptime(x,'%b %Y') for x in totalmonth if x != '' ]
index = totalmonth.index(min(new_totalmonth).strftime('%b %Y'))
new_totalmonth = [(min(new_totalmonth) + relativedelta(months=x)).strftime('%b %Y') for x in range(-index,len(totalmonth) - index)]
print(new_totalmonth)
Breakdown
This line of code creates a list of all the valid dates and puts them in a format that I can run the min() function on.
new_totalmonth = [datetime.strptime(x,'%b %Y') for x in totalmonth if x != '' ]
What this prints out
print(new_totalmonth)
[datetime.datetime(2019, 7, 1, 0, 0), datetime.datetime(2019, 10, 1, 0, 0), datetime.datetime(2020, 1, 1, 0, 0), datetime.datetime(2020, 4, 1, 0, 0)]
This is creating the variable index and assigning it the index of the minimum date in totalmonth
index = totalmonth.index(min(new_totalmonth).strftime('%b %Y'))
min(new_totalmonth) # this is finding the minimum date in new_totalmonth
print(min(new_totalmonth))
2019-07-01 00:00:00
min(new_totalmonth).strftime('%b %Y') # This is putting that minimum in a format that matches what is in totalmonth so the function totalmonth.index() can get the correct index
print(min(new_totalmonth).strftime('%b %Y'))
Jul 2019
This uses a list comprehension:
new_totalmonth = [(min(new_totalmonth) + relativedelta(months=x)).strftime('%b %Y') for x in range(-index,len(totalmonth) - index)]
I am using the index of the minimum date in totalmonth to control the range of offsets (how many months) that I add to the minimum month in totalmonth:
range(-index,len(totalmonth) - index)
print(list(range(-index,len(totalmonth) - index)))
[-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Since the minimum month (Jul 2019) is at index 1, I need to add -1 months to it to get the month that comes before it, which is Jun 2019.
So it can be broken out to:
(min(new_totalmonth) + relativedelta(months=-1)).strftime('%b %Y') = Jun 2019
(min(new_totalmonth) + relativedelta(months=0)).strftime('%b %Y') = Jul 2019
(min(new_totalmonth) + relativedelta(months=1)).strftime('%b %Y') = Aug 2019
...
(min(new_totalmonth) + relativedelta(months=10)).strftime('%b %Y') = May 2020
Take all those values and put them in the list new_totalmonth
print(new_totalmonth)
['Jun 2019', 'Jul 2019', 'Aug 2019', 'Sep 2019', 'Oct 2019', 'Nov 2019', 'Dec 2019', 'Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020']
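To finish option 1 from the question (modifying the list before writing it to the df), the rebuilt list can simply be used when constructing the DataFrame; a short sketch reusing the lists from the question's script:
import pandas as pd

totalvalue = ['60.8', '58.1', '58.8', '56.9', '51.8', '54.6',
              '56.8', '58.8', '54.2', '51.3', '52.2', '48.7']
# new_totalmonth is the filled-in month list computed above.
df = pd.DataFrame({'Time': new_totalmonth, 'Value': totalvalue})
print(df.head())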
The start month is the minimum in the 'Time' column minus one month, the end is the maximum plus two months, and the column is rebuilt with date_range() to get consecutive months.
import datetime
import pandas as pd

df['Time'] = pd.to_datetime(df['Time'])
# startM is one month before the earliest filled-in month; endM is two months after
# the latest, so the final month-end (May 2020) still falls inside the range.
startM = datetime.datetime((df['Time'].min()).year,(df['Time'].min()).month-1,1)
endM = datetime.datetime((df['Time'].max()).year,(df['Time'].max()).month+2,1)
df['Time'] = pd.date_range(startM,endM, freq='1M')
df
df
Time Value
0 2019-06-30 60.8
1 2019-07-31 58.1
2 2019-08-31 58.8
3 2019-09-30 56.9
4 2019-10-31 51.8
5 2019-11-30 54.6
6 2019-12-31 56.8
7 2020-01-31 58.8
8 2020-02-29 54.2
9 2020-03-31 51.3
10 2020-04-30 52.2
11 2020-05-31 48.7
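If the desired display is 'Jun 2019' rather than month-end timestamps, the column can be formatted back afterwards; a small follow-up sketch:
# Convert the month-end timestamps back to the 'Mon YYYY' text form.
df['Time'] = df['Time'].dt.strftime('%b %Y')
print(df['Time'].tolist())
# ['Jun 2019', 'Jul 2019', 'Aug 2019', ..., 'May 2020']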
First use pd.to_datetime to convert the Time column to a pandas datetime series t. Next use pd.period_range to generate a period range with a monthly frequency, a starting period equal to the calculated period (see Details below), and a number of periods equal to the length of the series t. Finally use .strftime with the format specifier %b %Y to return the string representation of the period range in the desired format:
t = pd.to_datetime(df['Time'])
df['Time'] = pd.period_range(
t.min().to_period('M') - t.idxmin(), periods=len(t), freq='M').strftime('%b %Y')
Details:
# print(t)
0 NaT
1 2019-07-01
2 NaT
3 NaT
4 2019-10-01
5 NaT
6 NaT
7 2020-01-01
8 NaT
9 NaT
10 2020-04-01
11 NaT
Name: Time, dtype: datetime64[ns]
# print(t.min(), t.idxmin())
Timestamp('2019-07-01 00:00:00'), 1
# print(t.min().to_period('M') - t.idxmin())
Period('2019-06', 'M') # starting period of the period range
Result:
# print(df)
Time Value
0 Jun 2019 60.8
1 Jul 2019 58.1
2 Aug 2019 58.8
3 Sep 2019 56.9
4 Oct 2019 51.8
5 Nov 2019 54.6
6 Dec 2019 56.8
7 Jan 2020 58.8
8 Feb 2020 54.2
9 Mar 2020 51.3
10 Apr 2020 52.2
11 May 2020 48.7
I have following string:
"04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010; 31/May/2019; 01/October/2019; 1st April"
With my current regex I am able to find all date formats except two: 31/May/2019 and 01/October/2019.
The current regex I am using:
(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+
Can anyone help me build a regex for extracting all the dates mentioned above? I want to solve this using regex only.
Try:
import pandas as pd

dates = """04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010; 31/May/2019; 01/October/2019"""
pd.to_datetime(pd.Series(dates.split(';')))
0 2009-04-20
1 2009-04-20
2 2009-04-20
3 2009-04-03
4 2009-03-20
5 2009-03-20
6 2009-03-20
7 2009-03-20
8 2009-03-20
9 2009-03-20
10 2009-03-02
11 2009-03-20
12 2009-03-20
13 2009-03-21
14 2009-03-22
15 2009-02-01
16 2009-09-01
17 2010-10-01
18 2008-06-01
19 2009-12-01
20 2009-01-01
21 2010-01-01
22 2019-05-31
23 2019-10-01
dtype: datetime64[ns]
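If you do want a regex-only route, one option is an extra narrow pattern just for the day/MonthName/year forms the current regex misses (31/May/2019, 01/October/2019); a sketch, where the month-name alternation is an assumption about your inputs rather than a general date matcher:
import re

text = "31/May/2019; 01/October/2019"
pattern = (r'\b\d{1,2}/'
           r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
           r'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
           r'/\d{4}\b')
# All groups are non-capturing, so findall returns the whole matches.
print(re.findall(pattern, text))
# ['31/May/2019', '01/October/2019']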
I want to copy the table (id=symbolMarket) and save it as a pandas dataframe in this link https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data
How should I do it in a simple/clean way, please?
Obviously I can retrieve the elements one by one, but I believe there is a better way.
(I am using selenium to access the page, if this helps)
Many thanks for sharing your knowledge with me.
I was pretty hesitant to post this since it is pretty basic, and there is an abundance of solutions that show how to read an HTML table into a pandas DataFrame. It makes me wonder if you even attempted to look it up first.
But just use .read_html(). This returns a list of DataFrames, so you'll just have to figure out which DataFrame in that list you want:
import pandas as pd
url = 'https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data'
tables = pd.read_html(url)
table = tables[3]
print(table)
Output:
0 1 ... 5 6
0 Date Open ... Change (Pips) Change (%)
1 Mar 20, 2019 21:00 25737 ... +253.0 +0.97%
2 Mar 19, 2019 21:00 25871 ... -135.0 -0.52%
3 Mar 18, 2019 21:00 25935 ... -63.0 -0.24%
4 Mar 17, 2019 21:00 25864 ... +70.0 +0.27%
5 Mar 16, 2019 21:00 25864 ... -20.0 -0.08%
6 Mar 14, 2019 21:00 25716 ... +153.0 +0.59%
7 Mar 13, 2019 21:00 25756 ... -40.0 -0.16%
8 Mar 12, 2019 21:00 25575 ... +185.0 +0.72%
9 Mar 11, 2019 21:00 25686 ... -93.0 -0.36%
10 Mar 10, 2019 21:00 25470 ... +212.0 +0.83%
11 Mar 09, 2019 21:00 25470 ... -29.0 -0.11%
12 Mar 07, 2019 21:00 25459 ... +61.0 +0.24%
13 Mar 06, 2019 21:00 25673 ... -197.0 -0.77%
14 Mar 05, 2019 21:00 25786 ... -108.0 -0.42%
15 Mar 04, 2019 21:00 25805 ... +3.0 +0.01%
16 Mar 03, 2019 21:00 26114 ... -300.0 -1.16%
17 Feb 28, 2019 21:00 25911 ... +138.0 +0.53%
18 Feb 27, 2019 21:00 26018 ... -89.0 -0.34%
19 Feb 26, 2019 21:00 26005 ... +31.0 +0.12%
20 Feb 25, 2019 21:00 26093 ... -63.0 -0.24%
21 Feb 24, 2019 21:00 26094 ... -3.0 -0.01%
22 Feb 21, 2019 21:00 25825 ... +210.0 +0.81%
23 Feb 20, 2019 21:00 25962 ... -120.0 -0.46%
24 Feb 19, 2019 21:00 25877 ... +88.0 +0.34%
25 Feb 18, 2019 21:00 25894 ... -9.0 -0.03%
26 Feb 17, 2019 21:00 25905 ... +5.0 +0.02%
27 Feb 14, 2019 21:00 25404 ... +500.0 +1.93%
28 Feb 13, 2019 21:00 25483 ... -68.0 -0.27%
29 Feb 12, 2019 21:00 25418 ... +102.0 +0.40%
.. ... ... ... ... ...
71 Dec 11, 2018 21:00 24341 ... +208.0 +0.85%
72 Dec 10, 2018 21:00 24490 ... -152.0 -0.62%
73 Dec 09, 2018 21:00 24338 ... +144.0 +0.59%
74 Dec 06, 2018 21:00 24921 ... -517.0 -2.12%
75 Dec 05, 2018 21:00 25118 ... -189.0 -0.76%
76 Dec 04, 2018 21:00 25033 ... +134.0 +0.53%
77 Dec 03, 2018 21:00 25837 ... -798.0 -3.19%
78 Dec 02, 2018 21:00 25897 ... -55.0 -0.21%
79 Nov 29, 2018 21:00 25367 ... +220.0 +0.86%
80 Nov 28, 2018 21:00 25327 ... +62.0 +0.24%
81 Nov 27, 2018 21:00 24794 ... +568.0 +2.24%
82 Nov 26, 2018 21:00 24546 ... +253.0 +1.02%
83 Nov 25, 2018 21:00 24300 ... +230.0 +0.94%
84 Nov 22, 2018 21:00 24367 ... -80.0 -0.33%
85 Nov 21, 2018 21:00 24497 ... -144.0 -0.59%
86 Nov 20, 2018 21:00 24461 ... +38.0 +0.16%
87 Nov 19, 2018 21:00 25063 ... -604.0 -2.47%
88 Nov 18, 2018 21:00 25410 ... -342.0 -1.36%
89 Nov 15, 2018 21:00 25335 ... +135.0 +0.53%
90 Nov 14, 2018 21:00 25085 ... +256.0 +1.01%
91 Nov 13, 2018 21:00 25378 ... -273.0 -1.09%
92 Nov 12, 2018 21:00 25422 ... -65.0 -0.26%
93 Nov 11, 2018 21:00 25987 ... -577.0 -2.27%
94 Nov 08, 2018 21:00 26184 ... -202.0 -0.78%
95 Nov 07, 2018 21:00 26190 ... +15.0 +0.06%
96 Nov 06, 2018 21:00 25663 ... +572.0 +2.18%
97 Nov 05, 2018 21:00 25481 ... +200.0 +0.78%
98 Nov 04, 2018 21:00 25267 ... +221.0 +0.87%
99 Nov 01, 2018 21:00 25240 ... +40.0 +0.16%
100 Oct 31, 2018 21:00 25090 ... +229.0 +0.90%
[101 rows x 7 columns]
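Since the question already knows the table's id (symbolMarket), another option worth noting is read_html's attrs filter, which selects the table directly instead of guessing its position in the list; a sketch, assuming the id is present in the HTML pandas downloads (if the table is rendered by JavaScript, pass Selenium's driver.page_source instead of the URL):
import pandas as pd

url = 'https://www.myfxbook.com/en/forex-market/currencies/US30-historical-data'
# attrs keeps only <table> elements carrying these HTML attributes.
tables = pd.read_html(url, attrs={'id': 'symbolMarket'})
df = tables[0]
print(df.head())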
I need to find a Python regular expression to match every valid date in a raw text file. I split the text into lines and put them in a Pandas Series; the goal now is to extract only the date from every line, getting a series of dates. I was able to match most of the numerical date formats, but I got stuck when I had to deal with literal months (Jan, January, Feb, February, ...). In particular, I need a regex (or a set of them) which matches the following formats:
- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010
Any help will be appreciated,
thank you in advance!
In line with the comment I made, I suggest using split and strip to generate a list of possible dates from your string and then feeding them to dateutil.parser.parse() to turn each one into a proper datetime object which you can manipulate to your liking.
A possible implementation is below:
test = '''- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
- Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
- Feb 2009; Sep 2009; Oct 2010'''
list_of_dates = []
for line in test.split('\n'):
    for date in line.split(';'):
        list_of_dates.append(date.strip(' - '))
from dateutil.parser import parse
def is_date(string):
    try:
        parse(string)
        return True
    except ValueError:
        return False
found_dates = []
for date in list_of_dates:
    if is_date(date):
        found_dates.append(parse(date))
for date in found_dates:
    print(date)
Result:
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-20 00:00:00
2009-03-21 00:00:00
2009-03-22 00:00:00
2009-02-04 00:00:00
2009-09-04 00:00:00
2010-10-04 00:00:00
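Note that the last three rows pick up the day of the month the script happened to run on, because parse fills unspecified fields from today's date. If that is unwanted, parse accepts a default datetime to pin those fields; a small sketch:
from datetime import datetime
from dateutil.parser import parse

# With a default, month-only strings resolve to the 1st instead of today's day.
print(parse('Feb 2009', default=datetime(2009, 1, 1)))
# 2009-02-01 00:00:00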