How can I splice out the date from similar strings? - python

I have a bunch of dates from some web scraping, but it seems that a country is also in the date string. Here is a sample:
Nov. 4, 2015Bangladesh
April 8, 2015Saudi Arabia
Jan. 14, 2016Indonesia
June 26, 2015Tunisia
Jan. 11, 2016France
I know regex is really great for working with strings, but I am just not experienced enough to know how to start.
How can I remove the country while keeping the dates intact?

This regex will get you just the date string from all of those. This could probably also be fixed by showing us your code for scraping the dates, but that's not what this question is about.
^.+?\s\d+,\s\d+
Example:
import re
dates = ["Nov. 4, 2015Bangladesh",
         "April 8, 2015Saudi Arabia ",
         "Jan. 14, 2016Indonesia ",
         "June 26, 2015Tunisia ",
         "Jan. 11, 2016France "]
for item in dates:
    print(re.match(r"^.+?\s\d+,\s\d+", item).group(0))
This prints:
Nov. 4, 2015
April 8, 2015
Jan. 14, 2016
June 26, 2015
Jan. 11, 2016
Explanation
^ - assert position at start of string
.+? - match one or more of any character except newline (as few as possible)
\s - match a whitespace character
\d+ - match one or more digits
, - match a literal comma
\s - match a whitespace character
\d+ - match one or more digits
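If the country is worth keeping as well, the same pattern can be extended with named groups. This is only a sketch: the helper name split_date_country is made up for illustration, and it assumes the date always ends in a four-digit year with the country following it directly.

```python
import re

# Assumption: the date ends in a four-digit year and the country follows it directly.
DATE_COUNTRY = re.compile(r"^(?P<date>.+?\s\d+,\s\d{4})(?P<country>.*)$")


def split_date_country(text):
    """Return (date, country) for a scraped string, or None if it does not match."""
    match = DATE_COUNTRY.match(text.strip())
    if match is None:
        return None
    return match.group("date"), match.group("country").strip()


print(split_date_country("April 8, 2015Saudi Arabia "))
# ('April 8, 2015', 'Saudi Arabia')
```

Returning None for non-matching input avoids the AttributeError that calling .group() on a failed match would raise.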

You could try the following:
^(.*\d{4})
Check the demo here:
import re
dates = """Nov. 4, 2015Bangladesh
April 8, 2015Saudi Arabia
Jan. 14, 2016Indonesia
June 26, 2015Tunisia
Jan. 11, 2016France"""
print(re.findall(r'^(.*\d{4})', dates, re.M))
# ['Nov. 4, 2015', 'April 8, 2015', 'Jan. 14, 2016', 'June 26, 2015', 'Jan. 11, 2016']
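Once the date text is isolated, it can be turned into a datetime object. The sketch below is an assumption-laden illustration, not part of either answer: parse_scraped_date is a hypothetical helper, it only knows the two month styles seen in the sample (an abbreviation with a trailing dot, or a full month name), and it relies on an English locale.

```python
from datetime import datetime

# Assumed formats, based only on the sample strings in the question.
FORMATS = ("%b. %d, %Y", "%B %d, %Y")


def parse_scraped_date(date_text):
    """Try each assumed format in turn and return the first successful parse."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {date_text!r}")


print(parse_scraped_date("Nov. 4, 2015"))   # 2015-11-04 00:00:00
print(parse_scraped_date("April 8, 2015"))  # 2015-04-08 00:00:00
```

Other abbreviations (for example "Sept.") would need their own handling.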

Related

How to handle spelled out days in strptime

I have an array of dates that is formatted like so:
['October 22nd, 2019', 'February 8th, 2020', 'July 31st, 2020', 'September 21st, 2020', ...]
I'd like to turn it into datetime objects using strptime, but I can't figure out how to handle the spelled out parts of the days, e.g. 22nd or 8th and it doesn't say in the format documentation.
The following works when there's no written out part of the day:
from datetime import datetime
dt_obj = datetime.strptime('October 22, 2019', '%B %d, %Y')
But I can't figure out how to parse a string that has the day written out:
in: dt_obj = datetime.strptime('October 22nd, 2019', '%B %d, %Y')
out: ValueError: time data 'October 22nd, 2019' does not match format '%B %d, %Y'
What's the proper format for this? Thank you!
"e.g. 22nd or 8th and it doesn't say in the format documentation"
You're right: the documentation doesn't mention it because no such format code exists. One way to parse these strings is to use a regex to convert them into a form that Python's datetime does have a format for.
import re
from datetime import datetime
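date_strings = ['October 22nd, 2019', 'February 8th, 2020', 'July 31st, 2020', 'September 21st, 2020']
# keep only the month name and the runs of digits, e.g. 'October 22 2019'
cleaned = [' '.join(re.findall(r'^\w+|\d+', each)) for each in date_strings]
[datetime.strptime(x, '%B %d %Y') for x in cleaned]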
#output:
[datetime.datetime(2019, 10, 22, 0, 0), datetime.datetime(2020, 2, 8, 0, 0), datetime.datetime(2020, 7, 31, 0, 0), datetime.datetime(2020, 9, 21, 0, 0)]
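An alternative sketch, assuming English ordinal suffixes and the 'Month day, year' layout from the question, removes only the suffix and leaves the rest of the string untouched. The helper name parse_ordinal_date is made up for illustration.

```python
import re
from datetime import datetime

# Matches 'st', 'nd', 'rd' or 'th' only when it directly follows a digit,
# so the 'st' in 'August' is left alone.
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(?:st|nd|rd|th)\b")


def parse_ordinal_date(date_text):
    """Strip the ordinal suffix from the day, then parse with strptime."""
    cleaned = ORDINAL_SUFFIX.sub("", date_text)
    return datetime.strptime(cleaned, "%B %d, %Y")


print(parse_ordinal_date("October 22nd, 2019"))  # 2019-10-22 00:00:00
print(parse_ordinal_date("August 1st, 2020"))    # 2020-08-01 00:00:00
```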

How do I sum consecutive items of a list when i'm using another parameter to decide it

I'm doing a problem right now and I'm kinda stuck. I have 2 lists, "sells" and "date". These sells are about several products, and I want to sum all the sells that are from the same month.
Let's say I have
sells = [25, 30, 1, 5, 15, 12]
date = [July 18, July 18, August 18, September 18, September 18, September 18]
Right now I'm trying to solve it like this:
last = None
sell = []
for s, d in zip(sells, date):
    if d == last
        sell.append(sum(s)
I'm kinda following the explanation I read over here: Check if next value is equal o current value in python loop? but I get no output at all.
What am I doing wrong?
You can use itertools.groupby with zip:
from itertools import groupby
sells = [25, 30, 1, 5, 15, 12]
date = ['July 18', 'July 18', 'August 18', 'September 18', 'September 18', 'September 18']
new_results = groupby(sorted(zip(date, sells), key=lambda x:x[0]), key=lambda x:x[0])
final_data = {a:sum(b for _, b in c) for a, c in new_results}
Output:
{'August 18': 1, 'July 18': 55, 'September 18': 32}
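itertools.groupby only groups items that are adjacent, which is why the pairs are sorted first. If the lists are already in date order, as in the question, a sketch like the following skips the sort and so keeps the original month order. The function name sum_consecutive is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sum_consecutive(keys, values):
    """Sum values over each run of equal, adjacent keys, keeping the input order."""
    pairs = zip(keys, values)
    return [(key, sum(value for _, value in run))
            for key, run in groupby(pairs, key=itemgetter(0))]


sells = [25, 30, 1, 5, 15, 12]
date = ['July 18', 'July 18', 'August 18', 'September 18', 'September 18', 'September 18']
print(sum_consecutive(date, sells))
# [('July 18', 55), ('August 18', 1), ('September 18', 32)]
```

Note that if the same month appears in two separate runs, this version reports it twice, once per run.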

Representing date format for three letter month in Python

How do I represent a 3 letter month date format in python such as the following:
Jan 16, 2012
I know that for January 16, 2012 the format is %B %d, %Y. Any ideas?
There's the three letter month format %b:
In [37]: datetime.strptime('Jan 16, 2012', '%b %d, %Y')
Out[37]: datetime.datetime(2012, 1, 16, 0, 0)
date_str = 'Jan 16, 2012'
date_components = date_str.split(' ')
date_components[0] = date_components[0][:3]  # keep only the first three letters of the month
print(' '.join(date_components))
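To go from the long form to the short form without slicing strings by hand, the two directives can be combined: parse with %B, then format with %b. This is a minimal sketch assuming an English locale; to_short_month is a made-up name.

```python
from datetime import datetime


def to_short_month(date_text):
    """Parse a 'January 16, 2012' style string and re-emit it with a 3-letter month."""
    parsed = datetime.strptime(date_text, "%B %d, %Y")
    return parsed.strftime("%b %d, %Y")


print(to_short_month("January 16, 2012"))  # Jan 16, 2012
```

Because %d zero-pads on output, a single-digit day such as 'January 6, 2012' comes back as 'Jan 06, 2012'.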

change datetime object to string

I want to format datetime.datetime(2016, 11, 21, 5, 34, 38, 826339, tzinfo=<UTC>) as Nov. 21, 2016, 11:04 a.m.
The time in the datetime object is in UTC, but I want it converted to IST (UTC+05:30).
I tried using strftime as:
>>> datetime(2016, 11, 21, 5, 34, 38, 826339, tzinfo=<UTC>).isoformat(' ')
File "<stdin>", line 1
datetime(2016, 11, 21, 5, 34, 38, 826339, tzinfo=<UTC>).isoformat(' ')
^
SyntaxError: invalid syntax
Can I get some help here.
PS: I am using python
EDIT:
cr_date = datetime(2016, 11, 21, 5, 34, 38, 826339) #excluding the timezone
I can get part of the desired result with:
cr_date.strftime('%b. %d, %Y %H:%M')
'Oct. 31, 2013 18:23'
I didn't get the am/pm, though.
For the am/pm part, couldn't you use %p for the last field? I don't know python, I'm just assuming python is taking syntax from the unix date command.
Please find the code below:
from datetime import datetime
cr_date = datetime(2016, 11, 21, 5, 34, 38, 826339)
cr_date.strftime('%b. %d, %Y %H:%M %P')
'Nov. 21, 2016 05:34 am'
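Use %p for 'AM' or 'PM'. %P gives lowercase 'am' or 'pm', but it is a glibc extension rather than a documented Python directive, so it may not work on every platform.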
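Neither answer covers the time zone conversion the question asks for. A possible sketch, using only the standard library and a fixed UTC+05:30 offset for IST, is shown below. The name format_in_ist is hypothetical, and the "a.m."/"p.m." text is built by hand because strftime has no directive for that exact spelling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: IST is modelled as a fixed offset of +05:30 from UTC.
IST = timezone(timedelta(hours=5, minutes=30), "IST")


def format_in_ist(dt_utc):
    """Convert an aware datetime to IST and format it like 'Nov. 21, 2016, 11:04 a.m.'."""
    local = dt_utc.astimezone(IST)
    month = local.strftime("%b")              # abbreviated month name, locale-dependent
    hour12 = local.hour % 12 or 12            # 0 and 12 both display as 12
    suffix = "a.m." if local.hour < 12 else "p.m."
    return f"{month}. {local.day}, {local.year}, {hour12}:{local.minute:02d} {suffix}"


cr_date = datetime(2016, 11, 21, 5, 34, 38, 826339, tzinfo=timezone.utc)
print(format_in_ist(cr_date))  # Nov. 21, 2016, 11:04 a.m.
```

The datetime must be timezone-aware; tzinfo=timezone.utc is the runnable equivalent of the tzinfo=<UTC> shown in the question's repr.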

Need to grab a 3x3 neighbourhood of an input cell from a 2d numpy array

I am trying to define a function that will return the 3x3 neighbourhood of an input cell. Right now I have:
def queen_neighbourhood(in_forest, in_row, in_col):
    neighbourhood = in_forest[in_row-1:in_row+1, in_col-1:in_col+1]
    return neighbourhood
(in_forest is the input array).
When I run this, it only seems to return a 2x2 matrix, instead of a 3x3. Why is this? It seems to me that I am inputting a row and column reference, and then slicing out a square that starts one row behind the input row, and ends one row ahead of it, and then the same for columns.
So for example, given an input array as such:
[ 01, 02, 03, 04, 05
06, 07, 08, 09, 10
11, 12, 13, 14, 15
16, 17, 18, 19, 20
21, 22, 23, 24, 25 ]
And then using row 2, col 3, I want to return a matrix as such:
[ 02, 03, 04
07, 08, 09
12, 13, 14 ]
When you say in_forest[in_row-1:in_row+1, in_col-1:in_col+1] you are saying "give me a square from in_row-1 inclusive to in_row+1 exclusive, and from in_col-1 inclusive to in_col+1 exclusive". A slice goes up to, but does not include, the second index.
Simply use in_row-1:in_row+2 and in_col-1:in_col+2 instead, so that the slice includes the "+1" row and column.
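Put together, the corrected function might look like the sketch below. It uses zero-based indices, and the clamping with max() is an addition not in the answer: without it, a cell in row 0 or column 0 would produce a start index of -1, which NumPy reads as "last element", giving an empty slice.

```python
import numpy as np


def queen_neighbourhood(in_forest, in_row, in_col):
    """Return the block of cells around (in_row, in_col), clipped at the array edges."""
    row_start = max(in_row - 1, 0)  # avoid a negative start index at the top edge
    col_start = max(in_col - 1, 0)  # avoid a negative start index at the left edge
    # The stop index is exclusive, so +2 is needed to include the +1 row and column.
    return in_forest[row_start:in_row + 2, col_start:in_col + 2]


grid = np.arange(1, 26).reshape(5, 5)
print(queen_neighbourhood(grid, 1, 2))
# [[ 2  3  4]
#  [ 7  8  9]
#  [12 13 14]]
```

For an edge or corner cell the result is smaller than 3x3; a stop index past the end of the array is simply truncated by the slice.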
