I scraped some data from Google News into a dataframe:
DataFrame:
df
title link pubDate description source source_url
0 Australian research finds cost-effective way t... https://news.google.com/__i/rss/rd/articles/CB... Sat, 15 Oct 2022 23:51:00 GMT Australian research finds cost-effective way t... The Guardian https://www.theguardian.com
1 Something New Under the Sun: Floating Solar Pa... https://news.google.com/__i/rss/rd/articles/CB... Tue, 18 Oct 2022 11:49:11 GMT Something New Under the Sun: Floating Solar Pa... Voice of America - VOA News https://www.voanews.com
2 Adapt solar panels for sub-Saharan Africa - Na... https://news.google.com/__i/rss/rd/articles/CB... Tue, 18 Oct 2022 09:06:41 GMT Adapt solar panels for sub-Saharan AfricaNatur... Nature.com https://www.nature.com
3 Cost of living: The people using solar panels ... https://news.google.com/__i/rss/rd/articles/CB... Wed, 05 Oct 2022 07:00:00 GMT Cost of living: The people using solar panels ... BBC https://www.bbc.co.uk
4 Business Matters: Solar Panels on Commercial P... https://news.google.com/__i/rss/rd/articles/CB... Mon, 17 Oct 2022 09:13:35 GMT Business Matters: Solar Panels on Commercial P... Insider Media https://www.insidermedia.com
... ... ... ... ... ... ...
What I want to do now is basically to iterate through the "link" column, summarize every article with newspaper3k (whose .nlp() method uses NLTK under the hood), and add the summary to a new column. Here is an example:
from newspaper import Article

article = Article(df.iloc[4, 1])  # get the URL from the "link" column
article.download()
article.parse()
article.nlp()
summary = article.summary
print(summary)
Output:
North WestGemma Cornwall, Head of Sustainability of Anderton Gables, looks into the benefit of solar panels.
And, with the cost of solar panels continually dropping, it is becoming increasingly affordable for commercial property owners.
Reduce your energy spendMost people are familiar with solar energy, but many are unaware of the significant financial savings that can be gained by installing solar panels in commercial buildings.
As with all things, there are pros and cons to weigh up when considering solar panels.
If you’re considering solar panels for your property, contact one of the Anderton Gables team, who can advise you on the best course of action.
I tried a little bit, but I couldn't make it work...
Thanks for your help!
A for loop will be a slow solution, but it might work for a small dataset: iterate through all the links, apply the transformations needed, and finally create a new column in the dataframe.
summaries = []
for link in df['link'].values:  # the article URLs live in "link", not "source_url"
    article = Article(link)
    article.download()
    article.parse()
    article.nlp()
    summaries.append(article.summary)
df['summaries'] = summaries
Or you could define a custom function and then use pd.apply:
def get_summary(x):
    art = Article(x)
    art.download()
    art.parse()
    art.nlp()
    return art.summary

df['summary'] = df['link'].apply(get_summary)
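One caveat: Article.download() raises an exception for dead, moved, or paywalled links, which scraped Google News feeds often contain. A minimal defensive sketch (the get_summary_safe name and the empty-string fallback are my additions, not part of the answer above):

from newspaper import Article
from newspaper.article import ArticleException

def get_summary_safe(url):
    try:
        art = Article(url)
        art.download()
        art.parse()
        art.nlp()  # needs NLTK's 'punkt' data: nltk.download('punkt')
        return art.summary
    except ArticleException:
        return ''  # dead link, paywall, timeout, ...

df['summary'] = df['link'].apply(get_summary_safe)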
Is there a Python function with which I can group the series titles together into one record? I would like to remove the additional extensions from the series names.
Title
1.Evening Edition 16 March 2022 (Part 6)
2.Evening Edition 17/01/2022
3.Evening Edition 30 Nov 2021 (Part 1)
4.Winter Olympic Games 2022: Daily Highlights Day 13 Part 2
5.Winter Olympic Games 2022: Daily Highlights Day 15 Part 2
The result that I'm looking for is like this:
Title
1.Evening Edition
2.Winter Olympic Games 2022
I think that what you are looking for is the longest possible common substring.
In order to do that, you can do the following:
from difflib import SequenceMatcher

title1 = "something Evening Edition something else"
title2 = "Evening Edition 30 Nov 2021 (Part 1)"

# find_longest_match() with no arguments needs Python 3.9+; it returns a
# Match(a, b, size) tuple, not the substring itself
match = SequenceMatcher(None, title1, title2).find_longest_match()
clean_title = title1[match.a:match.a + match.size].strip()
This will give you 'Evening Edition' for this pair, but for the two Winter Olympics titles it will give you 'Winter Olympic Games 2022: Daily Highlights Day 1' (the 'Day 1' prefix of 'Day 13' and 'Day 15' is also shared). I am not sure whether this will work for you; without additional information about the data it is very tricky to do anything smarter.
Maybe you will find something better for your use case in the difflib documentation: https://docs.python.org/3/library/difflib.html
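Given only the five sample rows, another option is to cut each title at the first date or at the ": Daily Highlights" tail with a regex and then deduplicate. This is only a sketch: the pattern below is guessed from the examples shown and will need tuning for real data.

import re
import pandas as pd

titles = pd.Series([
    "Evening Edition 16 March 2022 (Part 6)",
    "Evening Edition 17/01/2022",
    "Evening Edition 30 Nov 2021 (Part 1)",
    "Winter Olympic Games 2022: Daily Highlights Day 13 Part 2",
    "Winter Olympic Games 2022: Daily Highlights Day 15 Part 2",
])

# Cut at ": Daily Highlights ..." or at the first date, whether it is
# written as "17/01/2022" or as "16 March 2022" / "30 Nov 2021".
pattern = r"\s*(?::\s*Daily Highlights.*|\d{1,2}(?:/\d{1,2}/\d{4}|\s+[A-Za-z]+\s+\d{4}).*)$"

clean = titles.str.replace(pattern, "", regex=True).drop_duplicates()
print(clean.tolist())
# ['Evening Edition', 'Winter Olympic Games 2022']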
I am trying to parse the response of an API that takes a batch of phone numbers and returns information on their status, i.e. active or not.
This is what the response looks like:
# API call
s.get('https://api/data/stuff')
# response
',MSISDN,Status,Error Code,Error Text,Original Network,Current Network,Current Country,Roaming Country,Type,Date Checked\n447541255456,447541255456,Undelivered,27,Absent Subscriber,O2 (UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n447856999555,447856999555,Undelivered,1,Dead,O2 (UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n447854111222,447854111222,Undelivered,1,Dead,Orange,,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n'
I can see that MSISDN, Status, Error Code, Error Text, Original Network, Current Network, Current Country, Roaming Country, Type, Date Checked are the headers, and the rest are the rows.
But I can't get this into a structure I can read easily, such as a dataframe.
There were some suggested answers while I was typing this question, which use io and pd.read_table, etc., but I couldn't get any of them to work.
I guess I could save it as a txt file and then read it back in as a comma-separated CSV. But is there a native pandas or other easier way to do this?
I believe you need:
from io import StringIO

response = s.get('https://api/data/stuff')
df = pd.read_csv(StringIO(response.text), index_col=0)  # .text gives the str; the leading comma makes column 0 an unnamed index
Or, if the endpoint needs no session authentication, you can let pandas fetch the URL itself:
df = pd.read_csv('https://api/data/stuff')
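To make the parsing reproducible without hitting the API, here is the posted response fed straight to pandas. StringIO wraps the str in a file-like object, and index_col=0 absorbs the unnamed first column created by the leading comma:

import pandas as pd
from io import StringIO

raw = (',MSISDN,Status,Error Code,Error Text,Original Network,Current Network,'
       'Current Country,Roaming Country,Type,Date Checked\n'
       '447541255456,447541255456,Undelivered,27,Absent Subscriber,O2 (UK),,,,Mobile,'
       'Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n'
       '447856999555,447856999555,Undelivered,1,Dead,O2 (UK),,,,Mobile,'
       'Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n')

df = pd.read_csv(StringIO(raw), index_col=0)
print(df[['MSISDN', 'Status', 'Error Text']])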
I have a large txt file (a log file) where each entry starts with a timestamp such as Sun, 17 Mar 2013 18:58:06.
I want to split the file into multiple txt files by mm/yy, sorted.
The general code I planned is below, but I do not know how to implement it. I know how to split a file by number of lines, etc., but not by a specified timestamp.
import re
my_regex = re.compile('regex goes here')
body = []
with open("log.txt", "r") as f:
    for line in f:
        if my_regex.match(line):  # a new entry starts here
            if body:
                write_one(body)   # planned helper that writes out one entry
                body = []
        body.append(line)
if body:
    write_one(body)               # don't forget the final entry
example of lines from txt
2Sun, 17 Mar 2013 18:58:06 Pro IDS2.0 10E22E37-B2A1-4D55-BE20-84661D420196 nCWgKUtjalmYx053ykGeobwgWW V3
3Sun, 17 Mar 2013 19:17:33 <AwaitingDHKey c i FPdk 1:0 pt 0 Mrse> 0000000000000000000000000000000000000000 wo>
HomeKit keychain state:HomeKit: mdat=2017-01-01 01:41:47 +0000,cdat=2017-01-01 01:41:47 +0000,acct=HEDF3,class=genp,svce=AirPort,labl=HEDF3
4Sun, 13 Apr 2014 19:10:26 values in decoded form...
oak: <C: gen:'[ 21:10 5]' ak>
<PI#0x7fc01dc05d90: [name: Bourbon] [--SrbK-] [spid: zP8H/Rpy] [os: 15G31] [devid: 49645DA6] [serial: C17J9LGKDTY3] -
5Sun, 16 Feb 2014 18:59:41 tLastKVSKeyCleanup:
ak|nCWgKUtjalmYx053ykGeobwgWW:sk1Kv+37Clci7VwR2IGa+DNVEA: DHMessage (0x02): 112
You could use a regex (such as [0-9]{4} ([01]\d|2[0123]):([012345]\d):([012345]\d)), but in the example posted the date is always at the beginning of the string. If that is the case, you can just slice the string at fixed positions to parse the date.
import datetime

lines = []
lines.append("2Sun, 17 Mar 2013 18:58:06 Pro IDS2.0 10E22E37-B2A1-4D55-BE20-84661D420196 nCWgKUtjalmYx053ykGeobwgWW V3")
lines.append("3Sun, 17 Mar 2013 19:17:33 <AwaitingDHKey c i FPdk 1:0 pt 0 Mrse> 0000000000000000000000000000000000000000 wo> HomeKit keychain state:HomeKit: mdat=2017-01-01 01:41:47 +0000,cdat=2017-01-01 01:41:47 +0000,acct=HEDF3,class=genp,svce=AirPort,labl=HEDF3")
lines.append("4Sun, 13 Apr 2014 19:10:26 values in decoded form... oak: <C: gen:'[ 21:10 5]' ak> <PI#0x7fc01dc05d90: [name: Bourbon] [--SrbK-] [spid: zP8H/Rpy] [os: 15G31] [devid: 49645DA6] [serial: C17J9LGKDTY3] -")

for l in lines:
    # characters 6..25 hold the timestamp, e.g. "17 Mar 2013 18:58:06"
    datetime_object = datetime.datetime.strptime(l[6:26], '%d %b %Y %H:%M:%S')
    print(datetime_object)
Which gives the correct output for the three examples you provided
2013-03-17 18:58:06
2013-03-17 19:17:33
2014-04-13 19:10:26
The datetime object has attributes such as .month and .year, so you can use a simple equality check to test whether two dates fall in the same month and/or year.
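To finish the split the question asks for (one file per month), here is a sketch building on the same fixed-position strptime trick; the grouping logic, the log_mm-yy.txt naming, and the treatment of unparseable lines as continuations are my assumptions:

import datetime
from collections import defaultdict

entries = defaultdict(list)  # (year, month) -> list of complete entries
key, entry = None, []

with open('log.txt') as f:
    for line in f:
        try:
            ts = datetime.datetime.strptime(line[6:26], '%d %b %Y %H:%M:%S')
            if key is not None and entry:   # flush the previous entry
                entries[key].append(''.join(entry))
            key, entry = (ts.year, ts.month), [line]
        except ValueError:
            entry.append(line)              # continuation line of the current entry
if key is not None and entry:
    entries[key].append(''.join(entry))     # flush the final entry

for year, month in sorted(entries):
    with open('log_%02d-%02d.txt' % (month, year % 100), 'w') as out:
        out.writelines(entries[(year, month)])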
In Python pexpect, I want to filter the output. For example, in the code below I want only the date to be printed.
#!/usr/bin/env python
import pexpect,time
p = pexpect.spawn('ssh myusername@192.168.151.80')
p.expect('Password:')
p.sendline('mypassword')
time.sleep(2)
p.sendline('date')
p.expect('IST')
current_date = p.before
print 'the current date in remote server is: %s' % current_date
Actual output:
the current date in remote server is:
Last login: Thu Aug 23 22:58:02 2012 from solaris3
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
You have new mail.
welcome
-bash-3.00$ date
Thu Aug 23 23:03:10
Expected output:
the current date in remote server is: Thu Aug 23 23:03:10
p.before will give you everything received since the previous expect call, up to (but not including) the matched text.
You could split the output on newline:
current_date = p.before.split('\n')[-1]
However it would be better to expect the prompt instead of sleeping 2 seconds:
p.sendline('mypassword')
p.expect('[#\$] ')
p.sendline('date')
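Putting it together, a sketch of the whole exchange (in Python 3 pexpect yields bytes, so decode before splitting; the prompt pattern is a guess for this particular box):

import pexpect

p = pexpect.spawn('ssh myusername@192.168.151.80')
p.expect('Password:')
p.sendline('mypassword')
p.expect(r'[#\$] ')             # wait for the shell prompt instead of sleeping
p.sendline('date')
p.expect('IST')                 # the timezone marks the end of the date string
output = p.before.decode()      # p.before is bytes in Python 3
current_date = output.splitlines()[-1].strip()
print('the current date in remote server is: %s' % current_date)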