Using regex separators with read_csv() in Python?

I have a lot of CSV files formatted like this:
date1::tweet1::location1::language1
date2::tweet2::location2::language2
date3::tweet3::location3::language3
and so on. Some files contain up to 200 000 tweets. I want to extract 4 fields and put them in a pandas dataframe, as well as count the number of tweets. Here's the code I'm using for now:
import pandas as pd

try:
    data = pd.read_csv(tweets_data_path, sep="::", header=None, engine='python')
    data.columns = ["timestamp", "tweet", "location", "lang"]
    print 'Number of tweets: ' + str(len(data))
except BaseException, e:
    print 'Error: ', str(e)
I get the following error thrown at me
Error: expected 4 fields in line 4581, saw 5
I tried setting error_bad_lines=False, manually deleting the lines that crash the program, and setting nrows to a lower number, and I still get those "expected fields" errors for seemingly random lines. Say I delete the bottom half of the file: I get the same error but for line 1787, which doesn't make sense to me, since that line was processed correctly before. Visually inspecting the CSV files doesn't reveal any abnormal pattern that suddenly appears in the offending lines either.
The date fields and tweets contain colons, URLs and so on, so perhaps a regex separator would make sense?
Can someone help me figure out what I'm doing wrong? Many thanks in advance!
Sample of the data as requested below:
Fri Apr 22 21:41:03 +0000 2016::RT @TalOfer: Barack Obama: Brexit would put UK back of the queue for trade talks [short url] #EuRef #StrongerIn::United Kingdom::en
Fri Apr 22 21:41:07 +0000 2016::RT @JamieRoss7: It must be awful to strongly believe in Brexit and be watching your campaigns make an absolute horse's arse of it.::The United Kingdom::en
Fri Apr 22 21:41:07 +0000 2016::Whether or not it rains on June 23rd will have more influence on the vote than Obama's lunch with the Queen and LiGA with George. #brexit.::Dublin, Ireland::en
Fri Apr 22 21:41:08 +0000 2016::FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexit vote would send UK to 'back of trade queue' #skypapers [short url]::Mardan, Pakistan::en

Start with this:
pd.read_csv(tweets_data_path, sep="::", header=None, usecols=[0, 1, 2, 3])
The above should bring in 4 columns; then you can figure out how many lines were dropped and whether the data makes sense.
To sanity-check a column, use this pattern:
data["lang"].unique()
Since you have a problem somewhere in the data and don't know where it is, you need to step back and use Python's csv reader. This should get you started.
import csv
import pandas as pd

reader = csv.reader(open(tweets_data_path))
tweetList = []
for row in reader:
    try:
        tweetList.append(row[0].split('::'))
    except BaseException, e:
        print 'Error: ', str(e)
print tweetList
tweetsDf = pd.DataFrame(tweetList)
print tweetsDf
                                0  \
0  Fri Apr 22 21:41:03 +0000 2016
1  Fri Apr 22 21:41:07 +0000 2016
2  Fri Apr 22 21:41:07 +0000 2016
3  Fri Apr 22 21:41:08 +0000 2016

                                                   1                   2     3
0  RT @TalOfer: Barack Obama: Brexit would put UK...      United Kingdom    en
1  RT @JamieRoss7: It must be awful to strongly b...  The United Kingdom    en
2   Whether or not it rains on June 23rd will hav...              Dublin  None
3  FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexi...              Mardan  None
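Note the truncated Dublin and the None values above: csv.reader splits on commas before the '::' split ever happens, so any comma inside a tweet or location chops the row short. A hedged variant that skips the csv module and splits the raw lines directly (the len(parts) == 4 filter is an assumption; rows with a stray '::' inside the tweet will have more fields):
import pandas as pd

rows = []
with open(tweets_data_path) as f:
    for line in f:
        parts = line.rstrip('\n').split('::')
        if len(parts) == 4:
            rows.append(parts)
        # rows with any other field count can be collected separately for inspection

tweetsDf = pd.DataFrame(rows, columns=["timestamp", "tweet", "location", "lang"])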

Have you tried read_table instead? I got this kind of error when I tried to use read_csv before, and I solved the problem by switching to it. Please refer to this post; it might give you some ideas about how to solve the error. And maybe also try sep=r":{2}" as the delimiter.
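For reference, a minimal sketch of the regex-separator idea (in pandas, a multi-character sep is treated as a regular expression and forces the python engine, so r":{2}" behaves like "::"; the column names are taken from the question):
import pandas as pd

# add error_bad_lines=False (on_bad_lines='skip' in pandas >= 1.3) to skip malformed rows
data = pd.read_csv(tweets_data_path, sep=r":{2}", header=None, engine='python',
                   names=["timestamp", "tweet", "location", "lang"])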

Related

Python/Pandas/NLTK: Iterating through a DataFrame, get value, transform it and add the new value to a new column

I scraped some data from google news into a dataframe:
DataFrame:
df
title link pubDate description source source_url
0 Australian research finds cost-effective way t... https://news.google.com/__i/rss/rd/articles/CB... Sat, 15 Oct 2022 23:51:00 GMT Australian research finds cost-effective way t... The Guardian https://www.theguardian.com
1 Something New Under the Sun: Floating Solar Pa... https://news.google.com/__i/rss/rd/articles/CB... Tue, 18 Oct 2022 11:49:11 GMT Something New Under the Sun: Floating Solar Pa... Voice of America - VOA News https://www.voanews.com
2 Adapt solar panels for sub-Saharan Africa - Na... https://news.google.com/__i/rss/rd/articles/CB... Tue, 18 Oct 2022 09:06:41 GMT Adapt solar panels for sub-Saharan AfricaNatur... Nature.com https://www.nature.com
3 Cost of living: The people using solar panels ... https://news.google.com/__i/rss/rd/articles/CB... Wed, 05 Oct 2022 07:00:00 GMT Cost of living: The people using solar panels ... BBC https://www.bbc.co.uk
4 Business Matters: Solar Panels on Commercial P... https://news.google.com/__i/rss/rd/articles/CB... Mon, 17 Oct 2022 09:13:35 GMT Business Matters: Solar Panels on Commercial P... Insider Media https://www.insidermedia.com
... ... ... ... ... ... ...
What I want to do now is basically iterate through the "link" column, summarize every article with NLTK, and add the summary to a new column. Here is an example:
article = Article(df.iloc[4, 1]) #get the url from the link column
article.download()
article.parse()
article.nlp()
article = article.summary
print(article)
Output:
North WestGemma Cornwall, Head of Sustainability of Anderton Gables, looks into the benefit of solar panels.
And, with the cost of solar panels continually dropping, it is becoming increasingly affordable for commercial property owners.
Reduce your energy spendMost people are familiar with solar energy, but many are unaware of the significant financial savings that can be gained by installing solar panels in commercial buildings.
As with all things, there are pros and cons to weigh up when considering solar panels.
If you’re considering solar panels for your property, contact one of the Anderton Gables team, who can advise you on the best course of action.
I tried a little bit, but I couldn't make it work...
Thanks for your help!
This will be a very slow solution with a for loop, but it might work for a small dataset: iterate through all the links, apply the needed transformations, and finally create a new column in the dataframe.
from newspaper import Article

summaries = []
for l in df['link'].values:   # the 'link' column holds the article URLs
    article = Article(l)
    article.download()
    article.parse()
    article.nlp()
    summaries.append(article.summary)
df['summaries'] = summaries
Or you could define a custom function and then use apply:
def get_description(x):
    art = Article(x)
    art.download()
    art.parse()
    art.nlp()
    return art.summary

df['summary'] = df['link'].apply(get_description)
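Either way, a single dead link or timeout will abort the whole run, since download() raises on failure. A hedged variant of the helper that degrades to an empty string instead (the broad except is deliberate; newspaper raises its own exception types):
def get_description_safe(x):
    try:
        art = Article(x)
        art.download()
        art.parse()
        art.nlp()
        return art.summary
    except Exception:
        return ''   # a failed download/parse should not kill the loop

df['summary'] = df['link'].apply(get_description_safe)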

Is there a Python function where I can get the Series title group together in 1 record?

Is there a Python function with which I can group the series titles together in one record? I would like to remove the additional extensions from the series names.
Title
1.Evening Edition 16 March 2022 (Part 6)
2.Evening Edition 17/01/2022
3.Evening Edition 30 Nov 2021 (Part 1)
4.Winter Olympic Games 2022: Daily Highlights Day 13 Part 2
5.Winter Olympic Games 2022: Daily Highlights Day 15 Part 2
The result that I'm looking for is like this:
Title
1.Evening Edition
2.Winter Olympic Games 2022
I think what you are looking for is the longest possible common substring.
To do that, you can use difflib:
from difflib import SequenceMatcher

title1 = "something Evening Edition something else"
title2 = "Evening Edition 30 Nov 2021 (Part 1)"
# find_longest_match() with no arguments needs Python 3.9+;
# on older versions pass the bounds explicitly, as here
match = SequenceMatcher(None, title1, title2).find_longest_match(0, len(title1), 0, len(title2))
clean_title = title1[match.a:match.a + match.size].strip()
This will give you 'Evening Edition' for this case, but it will give you 'Winter Olympic Games 2022: Daily Highlights' for the second case. I am not sure whether this will work for you, but without additional information about the data it is very tricky to do anything better.
Maybe you will find something better for your use case here: https://docs.python.org/3/library/difflib.html
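To fold that idea over the whole column, one hedged sketch is to compare each title against the next one and keep a shared part when it is long enough (the 10-character threshold is an assumption to suppress spurious matches, and the trim back to the last whole word handles matches that end mid-token, e.g. the "1" shared by "16" and "17"):
from difflib import SequenceMatcher

titles = [
    "Evening Edition 16 March 2022 (Part 6)",
    "Evening Edition 17/01/2022",
    "Evening Edition 30 Nov 2021 (Part 1)",
    "Winter Olympic Games 2022: Daily Highlights Day 13 Part 2",
    "Winter Olympic Games 2022: Daily Highlights Day 15 Part 2",
]

def common_part(a, b):
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    shared = a[m.a:m.a + m.size]
    return shared[:shared.rfind(' ')].rstrip(' :')  # cut any dangling partial token

series_names = set()
for a, b in zip(titles, titles[1:]):
    shared = common_part(a, b)
    if len(shared) >= 10:   # assumed threshold against spurious matches
        series_names.add(shared)

print(series_names)
This still leaves residue such as 'Daily Highlights Day' on the Winter Olympics pair; how aggressively to trim further (for example, cutting at the first ':') is a data-specific choice.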

Convert text string to dataframe - comma separated

I am trying to parse the response of an API. It takes a batch of phone numbers and returns information on their status, i.e. active or not.
This is what the response looks like:
# API call
s.get('https://api/data/stuff')
# response
',MSISDN,Status,Error Code,Error Text,Original Network,Current Network,Current Country,Roaming
Country,Type,Date Checked\n447541255456,447541255456,Undelivered,27,Absent Subscriber,O2
(UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n447856999555,447856999555,Undelivered,1,Dead,O2
(UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000
(UTC)\n447854111222,447854111222,Undelivered,1,Dead,Orange,,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000
(UTC)\n'
I can see that MSISDN,Status,Error Code,Error Text,Original Network,Current Network,Current Country,Roaming
Country,Type,Date Checked are headers, and the rest are the rows.
But I can't get this into a structure I can read easily, such as a dataframe.
There were some suggested answers while typing this question, which use import io and pd.read_table etc. but I couldn't get any of them to work.
I guess I could save it as a txt file then read it back in as a comma separated csv. But is there a native pandas or other easier way to do this?
Here's the response string pasted directly into stack overflow with no tidying:
',MSISDN,Status,Error Code,Error Text,Original Network,Current Network,Current Country,Roaming Country,Type,Date Checked\n447541255456,447541255456,Undelivered,27,Absent Subscriber,O2 (UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n447856999555,447856999555,Undelivered,1,Dead,O2 (UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n447854111222,447854111222,Undelivered,1,Dead,Orange,,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n'
I believe you need:
from io import StringIO
import pandas as pd

# .text gives the response body as a string; StringIO can't take the response object itself
df = pd.read_csv(StringIO(s.get('https://api/data/stuff').text))
Or try:
df = pd.read_csv('https://api/data/stuff')
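As a quick check against the sample response above (the leading comma in the header means the first field is an unnamed index-like column, so index_col=0 is used here; only the first record is reproduced):
from io import StringIO
import pandas as pd

raw = (',MSISDN,Status,Error Code,Error Text,Original Network,'
       'Current Network,Current Country,Roaming Country,Type,Date Checked\n'
       '447541255456,447541255456,Undelivered,27,Absent Subscriber,O2 (UK),'
       ',,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n')

df = pd.read_csv(StringIO(raw), index_col=0)
print(df[['MSISDN', 'Status', 'Error Text']])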

how to generate multiple txt file based on months/year?

I have a large txt file (a log file) where each entry starts with a timestamp such as Sun, 17 Mar 2013 18:58:06.
I want to split the file into multiple txt files by mm/yy, sorted.
The general code I planned is below, but I do not know how to implement it. I know how to split a file by number of lines etc., but not by a specified timestamp.
import re

f = open("log.txt", "r")
my_regex = re.compile('regex goes here')
body = []
for line in f:
    if my_regex.match(line):
        if body:
            write_one(body)  # planned helper that writes out one entry
        body = []
    body.append(line)
f.close()
example of lines from txt
2Sun, 17 Mar 2013 18:58:06 Pro IDS2.0 10E22E37-B2A1-4D55-BE20-84661D420196 nCWgKUtjalmYx053ykGeobwgWW V3
3Sun, 17 Mar 2013 19:17:33 <AwaitingDHKey c i FPdk 1:0 pt 0 Mrse> 0000000000000000000000000000000000000000 wo>
HomeKit keychain state:HomeKit: mdat=2017-01-01 01:41:47 +0000,cdat=2017-01-01 01:41:47 +0000,acct=HEDF3,class=genp,svce=AirPort,labl=HEDF3
4Sun, 13 Apr 2014 19:10:26 values in decoded form...
oak: <C: gen:'[ 21:10 5]' ak>
<PI#0x7fc01dc05d90: [name: Bourbon] [--SrbK-] [spid: zP8H/Rpy] [os: 15G31] [devid: 49645DA6] [serial: C17J9LGKDTY3] -
5Sun, 16 Feb 2014 18:59:41 tLastKVSKeyCleanup:
ak|nCWgKUtjalmYx053ykGeobwgWW:sk1Kv+37Clci7VwR2IGa+DNVEA: DHMessage (0x02): 112
You could use a regex (such as [0-9]{4} ([01]\d|2[0123]):([012345]\d):([012345]\d)), but in the examples posted the date always sits at the beginning of the string. If that is the case, you can just slice the date out by position.
import datetime

lines = []
lines.append("2Sun, 17 Mar 2013 18:58:06 Pro IDS2.0 10E22E37-B2A1-4D55-BE20-84661D420196 nCWgKUtjalmYx053ykGeobwgWW V3")
lines.append("3Sun, 17 Mar 2013 19:17:33 <AwaitingDHKey c i FPdk 1:0 pt 0 Mrse> 0000000000000000000000000000000000000000 wo> HomeKit keychain state:HomeKit: mdat=2017-01-01 01:41:47 +0000,cdat=2017-01-01 01:41:47 +0000,acct=HEDF3,class=genp,svce=AirPort,labl=HEDF3")
lines.append("4Sun, 13 Apr 2014 19:10:26 values in decoded form... oak: <C: gen:'[ 21:10 5]' ak> <PI#0x7fc01dc05d90: [name: Bourbon] [--SrbK-] [spid: zP8H/Rpy] [os: 15G31] [devid: 49645DA6] [serial: C17J9LGKDTY3] -")

for l in lines:
    # positions 6:26 hold the "17 Mar 2013 18:58:06" part of every entry line
    datetime_object = datetime.datetime.strptime(l[6:26], '%d %b %Y %H:%M:%S')
    print(datetime_object)
Which gives the correct output for the three examples you provided
2013-03-17 18:58:06
2013-03-17 19:17:33
2014-04-13 19:10:26
The datetime object has attributes such as month and year, so you can use a simple equality check to test whether two dates fall in the same month and/or year.
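Putting the pieces together, here is a minimal sketch of the actual split into per-month files (the slice positions and log.txt follow the code above; treating unparseable lines as continuations of the current entry is an assumption):
import datetime

open_files = {}   # "YYYY-MM" -> file handle
current = None    # file receiving the entry being copied

with open("log.txt") as f:
    for line in f:
        try:
            ts = datetime.datetime.strptime(line[6:26], '%d %b %Y %H:%M:%S')
            key = '%04d-%02d' % (ts.year, ts.month)
            if key not in open_files:
                open_files[key] = open(key + '.txt', 'w')
            current = open_files[key]
        except ValueError:
            pass   # continuation line: keep writing to the current entry's file
        if current is not None:
            current.write(line)

for handle in open_files.values():
    handle.close()
Naming the files "YYYY-MM.txt" means an ordinary alphabetical sort of the output files is also chronological.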

How to filter the output of pexpect

Using Python's pexpect, I want to filter the output. For example, in the code below I want only the date to be printed.
#!/usr/bin/env python
import pexpect,time
p = pexpect.spawn('ssh myusername@192.168.151.80')
p.expect('Password:')
p.sendline('mypassword')
time.sleep(2)
p.sendline('date')
p.expect('IST')
current_date = p.before
print 'the current date in remote server is: %s' % current_date
Actual output:
the current date in remote server is:
Last login: Thu Aug 23 22:58:02 2012 from solaris3
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
You have new mail.
welcome
-bash-3.00$ date
Thu Aug 23 23:03:10
Expected output:
the current date in remote server is: Thu Aug 23 23:03:10
p.before will give you everything since the previous expect call.
You could split the output on newline:
current_date = p.before.split('\n')[-1]
However it would be better to expect the prompt instead of sleeping 2 seconds:
p.sendline('mypassword')
p.expect('[#\$] ')
p.sendline('date')
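Put together, a minimal sketch of the whole exchange (host, credentials, and the prompt pattern '[#\$] ' are placeholders; matching on 'IST' assumes the remote timezone from the question):
import pexpect

p = pexpect.spawn('ssh myusername@192.168.151.80')
p.expect('Password:')
p.sendline('mypassword')
p.expect('[#\$] ')   # wait for the shell prompt instead of sleeping
p.sendline('date')
p.expect('IST')
# p.before holds everything between the two matches; the date is on the last line
current_date = p.before.split('\n')[-1].strip()
print('the current date in remote server is: %s' % current_date)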
