This specific str.replace() in Python with BeautifulSoup isn't working

I'm trying to automate a task that occurs roughly monthly, which is adding a hyperlink to a page that looks like:
2013: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2012: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011: Jan Feb Mar ...
Whenever we get a new document for that month, we wrap that month's abbreviation (for example Jul) in anchor tags so it becomes a hyperlink.
So I'm using BeautifulSoup in Python. You can see below that I'm picking out the HTML "p" tag that contains this data and doing a replace() on the first month it finds (it looks the month up by number in the Months dictionary I create, and the third argument of replace() limits it to the first occurrence).
# Modify link in hr.php:
import calendar
from bs4 import BeautifulSoup   # or "from BeautifulSoup import BeautifulSoup" for BeautifulSoup 3

hrphp = open('\\\\intranet\\websites\\infonet\\hr\\hr.php', 'r').read()
soup = BeautifulSoup(hrphp)  # parse the page with BeautifulSoup

# Month-number -> abbreviation lookup, e.g. Months[7] == "Jul"
Months = {k: v for k, v in enumerate(calendar.month_abbr)}

print hrphp + "\n\n\n\n\n"  # DEBUGGING: compare output before

# InterlinkDate is built earlier in the script; InterlinkDate[1][-5:-3] is the month number.
hrphp = hrphp.replace(
    str(soup.findAll('p')[4]),
    str(soup.findAll('p')[4]).replace(
        Months[int(InterlinkDate[1][-5:-3])],
        "" + Months[int(InterlinkDate[1][-5:-3])] + "",  # the anchor-tag markup was stripped from this snippet
        1),
    1
)

print hrphp  # DEBUGGING: compare output after
See how it's a nested replace()? The logic seems to work out fine, but for some reason it doesn't actually change the value. Earlier in the script I do something similar with the Months[] dictionary and str.replace() on a segment of the page, and that works out, although it doesn't have a nested replace() like this nor does it search for a block of text using soup.findAll().
I'm starting to bang my head on the desk, so any help would be greatly appreciated. Thanks in advance.

What you end up doing with str(soup.findAll('p')[4]).replace(...) is replacing values inside a string representation of the result of soup.findAll('p')[4], and that representation will more than likely differ from the corresponding string in hrphp, because "Beautiful Soup gives you Unicode" after it parses.
Beautiful Soup's documentation holds the answer: have a look at the "Changing Attribute Values" section.
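A minimal sketch of that approach, assuming bs4 and using placeholder file and href values (not the asker's real ones): edit the parsed tree and write the whole soup back out, instead of string-replacing against the raw file contents.
from bs4 import BeautifulSoup

with open('hr.php') as f:                        # placeholder path
    soup = BeautifulSoup(f.read(), 'html.parser')

p = soup.findAll('p')[4]                         # the paragraph holding the month list
month = 'Jul'                                    # abbreviation to turn into a link

# Rebuild just that paragraph with the month wrapped in an anchor tag
# (the href is a placeholder), then swap the new paragraph into the tree.
new_markup = str(p).replace(month, '<a href="hr_2013_07.php">%s</a>' % month, 1)
p.replace_with(BeautifulSoup(new_markup, 'html.parser').p)

with open('hr.php', 'w') as f:
    f.write(str(soup))
Note that re-serializing the whole page will normalize the rest of the markup, which may or may not be acceptable for a .php file.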

Related

Python Feedparser pubdate to one timezone

I need to parse an RSS feed. I am using feedparser in Python. Basically my task is to run the script every N seconds and check for the latest entry. My idea was to check on each iteration whether the date is less than 15 seconds old. But there is a problem: pubDate comes in different timezones.
I think published_parsed is not working correctly, because it gives me these:
2020-06-17 05:46:45
-
Wed, 17 Jun 2020 04:46:45 GMT
and this
2020-06-17 11:19:39
-
Wed, 17 Jun 2020 10:19:39 IST
So it is not parsed to a single timezone. I tried converting each timezone using pytz, but there is no IST timezone, which does not work for me.
How can I parse this variety of dates to one timezone?
Wed, 17 Jun 2020 13:12:43 IST
Tue, 16 Jun 2020 21:49:32 GMT
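One way to approach this (a sketch, not from the original thread): parse the raw pubDate strings with python-dateutil, supplying an explicit mapping for the ambiguous IST abbreviation, then convert everything to UTC. Treating IST as Indian Standard Time is an assumption.
from datetime import timezone
from dateutil import parser, tz

TZINFOS = {"IST": tz.gettz("Asia/Kolkata")}   # assumption: IST means Indian Standard Time

for raw in ("Wed, 17 Jun 2020 13:12:43 IST",
            "Tue, 16 Jun 2020 21:49:32 GMT"):
    dt = parser.parse(raw, tzinfos=TZINFOS)   # timezone-aware datetime
    print(dt.astimezone(timezone.utc))        # everything normalized to UTC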

Extract timestamp from a given string using python

I tried multiple packages to extract a timestamp from a given string, but no package gives correct results. I used dateutils, datefinder, parsedatetime, etc. for this task. They extract some datetimes which are in certain formats but not all formats, and sometimes they also extract unwanted numbers as timestamps.
Is there any Python package which extracts a datetime from a given string?
Assume I have 2 strings like these:
scala> val xorder= new order(1,"2016-02-22 00:00:00.00",100,"COMPLETED")
and
Fri, 10 Jun 2011 11:04:17 +0200 (CEST)
and want to extract only the datetime. Is there any function which extracts both formats of datetime from the above strings? In other cases the formats may be different, but it should still pick out the datetime strings.
You can use datetime.strptime() as follows:
from datetime import datetime

dt = datetime.strptime("21/11/06 16:30", "%d/%m/%y %H:%M")
You can create your own format string and use the function the same way.
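For example, applied to the two strings from the question (a sketch that assumes you know each format up front, since strptime() does not guess formats; Python 3's %z handles the numeric offset, and the trailing "(CEST)" is stripped first):
from datetime import datetime

dt1 = datetime.strptime("2016-02-22 00:00:00.00", "%Y-%m-%d %H:%M:%S.%f")
dt2 = datetime.strptime("Fri, 10 Jun 2011 11:04:17 +0200",
                        "%a, %d %b %Y %H:%M:%S %z")
print(dt1, dt2)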
I created a small Python package, datetime_extractor, to pull out timestamps from a given string. It can extract many datetime formats from given strings. Hope it will be useful.
pip install datetime-extractor
from datetime_extractor import DateTimeExtractor
samplestring1 = 'scala> val xorder= new order(1,"2016-02-22 00:00:00.00",100,"COMPLETED")'
DateTimeExtractor(samplestring1)
Out: ['2016-02-22 00:00:00.00']
samplestring2 = 'Fri, 10 Jun 2011 11:04:17 +0200 (CEST)'
DateTimeExtractor(samplestring2)
Out: ['10 Jun 2011 11:04:17']
#Allan & #Manmeet Singh, Let me know your comments.

Using python to substitute awk for Linux commands

I am new to Python and I need to learn it for work purposes. I am having trouble figuring out a way to use Python to replace awk for column prints.
For example, I need to print out the date:
root@user:~# date
Mon Jun 24 01:30:08 EDT 2013
But, I only need a certain part of it:
root@user:~# date | awk '{print $2" "$3" "$4" "$5}'
Jun 24 01:30:54 EDT
Is there a way in Python to do this without needing to do the following:
import os
os.system("date | awk '{print $2\" \"$3\" \"$4\" \"$5}'")
I have tried to do an extensive Google/Bing/Ask/Yahoo search and have seemed to have come up short on this.
You probably want to look at the datetime.datetime.strftime() function for that particular task.
However, for the more general task of printing out certain fields, you'd use .split() and list slicing:
date_string = "Mon Jun 24 01:30:08 EDT 2013"
fields = date_string.split()
print ' '.join(fields[1:5]) # Prints "Jun 24 01:30:08 EDT"
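If you do want the strftime() route mentioned above instead of slicing the output of date, here is a sketch using time.strftime(), which fills %Z in from the local timezone:
import time

# e.g. "Jun 24 01:30:54 EDT" -- the same fields awk selects with $2" "$3" "$4" "$5
print(time.strftime("%b %d %H:%M:%S %Z"))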

Browsers cookie problem

Well, Opera and Chrome add 2 hours to the expiration where I only want 15 minutes to be added. Actually, they both get the 15-minute part right, but for some reason I don't understand yet, they also add another 2 hours to the date.
Here is response header:
Content-Type:text/html
Date:Thu, 28 Apr 2011 15:59:27 GMT
Server:lighttpd/1.4.28
Set-Cookie:SID=2554373e-9144-34af-b9ad-a67b2ccdc8cd; expires=Thu, 28 Apr 2011 16:14:27 GMT; Path=/
Thu, 28 Apr 2011 16:14:27 GMT
Transfer-Encoding:chunked
This is also fine; it is exactly the date I want. But when I check the browser's cookie list, I see expires=Thu, 28 Apr 2011 18:14:27 GMT.
What could cause that?
Thanks
Edit: Info:
To create the cookie I use Python. Everything depends on the server time, which is the same for all browsers.
And all browsers were tested in the same environment.
Edit Code Sample:
def createCookie(self):
    expiration = datetime.datetime.now() + datetime.timedelta(hours=0, minutes=15)
    self.cookie['SID'] = self.SID
    self.cookie['SID']['path'] = "/"
    self.cookie['SID']['Expires'] = expiration.strftime("%a, %d %b %Y %H:%M:%S GMT")
As you did not originally post the related code with your question, it is impossible to say for sure what is causing the issue.
But my nose tells me you are probably mixing timezones in your timedelta code.
Here is some info when dealing with timezone aware time and datetime objects in Python:
http://blog.mfabrik.com/2008/06/30/relativity-of-time-shortcomings-in-python-datetime-and-workaround/
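A minimal sketch of what that suspicion would mean for the code above: datetime.now() returns local time, so stamping it with a literal "GMT" shifts the expiry by the local UTC offset; building the expiry from UTC avoids that. (This is an assumption about the cause, not a confirmed fix.)
import datetime

# assumption: the goal is an expiry 15 minutes from now, expressed in GMT/UTC
expiration = datetime.datetime.utcnow() + datetime.timedelta(minutes=15)
expires_value = expiration.strftime("%a, %d %b %Y %H:%M:%S GMT")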

Select Distinct Years and Months for Django Archive Page

I want to make an archive_index page for my Django site. However, the date-based generic views really aren't any help here. I want the dictionary returned by the view to contain all the years and months for which at least one instance of the object type exists. So if my blog started in September 2007, but there were no posts in April 2008, I would get something like this:
2009 - Jan, Feb, Mar
2008 - Jan, Feb, Mar, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
2007 - Sep, Oct, Nov, Dec
This will give you a list of unique posting dates:
Posts.objects.filter(draft=False).dates('post_date','month',order='DESC')
Of course you might not need the draft filter, and change 'post_date' to your field name, etc.
I found the answer to my own question.
It's in the QuerySet API reference in the documentation.
There's a method called dates() that will give you distinct dates. So I can do
Entry.objects.dates('pub_date', 'month') to get a list of datetime objects, one for each year/month that has an entry.
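As a sketch (not from the thread) of turning that dates() queryset into the year-to-months structure described in the question -- Entry and pub_date are just the names used above:
from collections import OrderedDict

def month_archive(queryset):
    # Group the distinct posting months by year, newest year first.
    archive = OrderedDict()
    for d in queryset.dates('pub_date', 'month'):      # ascending by default
        archive.setdefault(d.year, []).append(d.strftime('%b'))
    return OrderedDict(sorted(archive.items(), reverse=True))

# e.g. month_archive(Entry.objects.all()) might give
# {2009: ['Jan', 'Feb', 'Mar'], 2008: [...], 2007: ['Sep', 'Oct', 'Nov', 'Dec']}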
You should be able to get all the info you describe from the built-in views. Can you be more specific as to what you cannot get? This should have everything you need:
django.views.generic.date_based.archive_month
Reference page (search for the above string on that page)
