Convert text string to dataframe - comma separated - python

I am trying to parse the response of an API that takes a batch of phone numbers and returns information on their status, i.e. active or not.
This is what the response looks like:
# API call
s.get('https://api/data/stuff')
# response
',MSISDN,Status,Error Code,Error Text,Original Network,Current Network,Current Country,Roaming
Country,Type,Date Checked\n447541255456,447541255456,Undelivered,27,Absent Subscriber,O2
(UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n447856999555,447856999555,Undelivered,1,Dead,O2
(UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000
(UTC)\n447854111222,447854111222,Undelivered,1,Dead,Orange,,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000
(UTC)\n'
I can see that MSISDN, Status, Error Code, Error Text, Original Network, Current Network, Current Country, Roaming Country, Type, Date Checked are the headers, and the rest are the rows.
But I can't get this into a structure I can read easily, such as a dataframe.
There were some suggested answers while typing this question, which use import io and pd.read_table etc., but I couldn't get any of them to work.
I guess I could save it as a txt file and then read it back in as a comma-separated csv. But is there a native pandas or otherwise easier way to do this?
Here's the response string pasted directly into stack overflow with no tidying:
',MSISDN,Status,Error Code,Error Text,Original Network,Current Network,Current Country,Roaming Country,Type,Date Checked\n447541255456,447541255456,Undelivered,27,Absent Subscriber,O2 (UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n447856999555,447856999555,Undelivered,1,Dead,O2 (UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n447854111222,447854111222,Undelivered,1,Dead,Orange,,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n'

I believe you need:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(s.get('https://api/data/stuff').text))
Note that s.get() returns a Response object, so pass its .text attribute to StringIO rather than the response itself.
Or try:
df = pd.read_csv('https://api/data/stuff')
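Since the header row begins with a bare comma, pandas will otherwise create an unnamed first column that just duplicates the MSISDN. A small sketch of reading that column as the index instead, assuming s is your requests session and the response text is exactly as shown above:
from io import StringIO
import pandas as pd

resp = s.get('https://api/data/stuff')              # requests Response object
df = pd.read_csv(StringIO(resp.text), index_col=0)  # unnamed first column becomes the index
print(df[['Status', 'Error Text', 'Original Network']])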

Related

How to convert different rss feed dates with Python so they can be ordered

Trying to make an RSS feed reader using django, feedparser and dateutil
Getting this error: can't compare offset-naive and offset-aware datetimes
I just have five feeds right now. These are the datetimes from the feeds:
Sat, 10 Sep 2022 23:08:59 -0400
Sun, 11 Sep 2022 04:08:30 +0000
Sun, 11 Sep 2022 13:12:18 +0000
2022-09-10T01:01:16+00:00
Sat, 17 Sep 2022 11:27:15 EDT
I was able to order the first four feeds and then I got the error when I added the last one.
## create a list of lists - each inner list holds entries from a feed
parsed_feeds = [feedparser.parse(url)['entries'] for url in feed_urls]
## put all entries in one list
parsed_feeds2 = [item for feed in parsed_feeds for item in feed]
## sort entries by date
parsed_feeds2.sort(key=lambda x: dateutil.parser.parse(x['published']), reverse=True)
How can I make all the datetimes from the feeds the same so they can be ordered?
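The error happens because dateutil returns an offset-naive datetime when it cannot resolve an abbreviation like EDT, and naive datetimes cannot be compared with offset-aware ones. A minimal sketch of one way around it, assuming a tzinfos mapping for the abbreviations your feeds use and treating any remaining zone-less stamp as UTC:
from dateutil import parser, tz

tzinfos = {'EDT': tz.gettz('America/New_York')}  # abbreviations dateutil can't resolve by itself

def parse_aware(s):
    dt = parser.parse(s, tzinfos=tzinfos)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=tz.UTC)  # assumption: stamps with no zone at all are UTC
    return dt

parsed_feeds2.sort(key=lambda x: parse_aware(x['published']), reverse=True)
Once every parsed datetime is offset-aware, the comparison inside sort() works across all five feeds.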

Python dateparser fails when the timezone is in the middle

I'm trying to parse a date string using the following code:
from dateutil.parser import parse
datestring = 'Thu Jul 25 15:13:16 GMT+06:00 2019'
d = parse(datestring)
print (d)
The parsed date is:
datetime.datetime(2019, 7, 25, 15, 13, 16, tzinfo=tzoffset(None, -21600))
As you can see, instead of adding 6 hours to GMT, it actually subtracted 6 hours.
What am I doing wrong here? Any help on how I can parse a datestring in this format?
There's a comment in the source: https://github.com/dateutil/dateutil/blob/cbcc0871792e7eed4a42cc62630a08ec7a78be30/dateutil/parser/_parser.py#L803.
# Check for something like GMT+3, or BRST+3. Notice
# that it doesn't mean "I am 3 hours after GMT", but
# "my time +3 is GMT". If found, we reverse the
# logic so that timezone parsing code will get it
# right.
Important parts
Notice that it doesn't mean "I am 3 hours after GMT", but "my time +3 is GMT"
If found, we reverse the logic so that timezone parsing code will get it right
The last sentence in that comment (and the 2nd bullet point above) explains why 6 hours are subtracted: Thu Jul 25 15:13:16 GMT+06:00 2019 is read as 15:13:16 at an offset of -06:00, i.e. Thu Jul 25 21:13:16 2019 GMT.
Take a look at http://www.timebie.com/tz/timediff.php?q1=Universal%20Time&q2=GMT%20+6%20Time for more context.
Note that dateutil.parser.parse does not convert the time to GMT; it attaches the offset it inferred, which is why the result shows tzinfo=tzoffset(None, -21600).
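If you want the ISO-style reading instead, i.e. GMT+06:00 as six hours ahead of GMT, one option (assuming your strings always follow this exact layout) is to bypass dateutil and let strptime handle the offset; since Python 3.7, %z accepts the colon form:
from datetime import datetime

datestring = 'Thu Jul 25 15:13:16 GMT+06:00 2019'
d = datetime.strptime(datestring, '%a %b %d %H:%M:%S GMT%z %Y')
print(d)  # 2019-07-25 15:13:16+06:00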

How to generate multiple txt files based on month/year?

I have a large txt file (log file), where each entry starts with timestamp such as Sun, 17 Mar 2013 18:58:06
I want to split the file into multiple txt files by mm/yy, sorted.
The general code I planned is below, but I do not know how to implement it. I know how to split a file by number of lines etc., but not by a specified timestamp.
import re
f = open("log.txt", "r")
my_regex = re.compile('regex goes here')
body = []
for line in f:
    if my_regex.match(line):
        if body:
            write_one(body)
        body = []
    body.append(line)
f.close()
Example lines from the txt file:
2Sun, 17 Mar 2013 18:58:06 Pro IDS2.0 10E22E37-B2A1-4D55-BE20-84661D420196 nCWgKUtjalmYx053ykGeobwgWW V3
3Sun, 17 Mar 2013 19:17:33 <AwaitingDHKey c i FPdk 1:0 pt 0 Mrse> 0000000000000000000000000000000000000000 wo>
HomeKit keychain state:HomeKit: mdat=2017-01-01 01:41:47 +0000,cdat=2017-01-01 01:41:47 +0000,acct=HEDF3,class=genp,svce=AirPort,labl=HEDF3
4Sun, 13 Apr 2014 19:10:26 values in decoded form...
oak: <C: gen:'[ 21:10 5]' ak>
<PI#0x7fc01dc05d90: [name: Bourbon] [--SrbK-] [spid: zP8H/Rpy] [os: 15G31] [devid: 49645DA6] [serial: C17J9LGKDTY3] -
5Sun, 16 Feb 2014 18:59:41 tLastKVSKeyCleanup:
ak|nCWgKUtjalmYx053ykGeobwgWW:sk1Kv+37Clci7VwR2IGa+DNVEA: DHMessage (0x02): 112
You could use a regex (such as [0-9]{4} ([01]\d|2[0123]):([012345]\d):([012345]\d)), but in the example posted the date is always at the beginning of the string. If that is the case, you can simply use its fixed position in the string to parse the date.
import datetime
lines =[]
lines.append("2Sun, 17 Mar 2013 18:58:06 Pro IDS2.0 10E22E37-B2A1-4D55-BE20-84661D420196 nCWgKUtjalmYx053ykGeobwgWW V3")
lines.append("3Sun, 17 Mar 2013 19:17:33 <AwaitingDHKey c i FPdk 1:0 pt 0 Mrse> 0000000000000000000000000000000000000000 wo> HomeKit keychain state:HomeKit: mdat=2017-01-01 01:41:47 +0000,cdat=2017-01-01 01:41:47 +0000,acct=HEDF3,class=genp,svce=AirPort,labl=HEDF3")
lines.append("4Sun, 13 Apr 2014 19:10:26 values in decoded form... oak: <C: gen:'[ 21:10 5]' ak> <PI#0x7fc01dc05d90: [name: Bourbon] [--SrbK-] [spid: zP8H/Rpy] [os: 15G31] [devid: 49645DA6] [serial: C17J9LGKDTY3] -")
for l in lines:
    datetime_object = datetime.datetime.strptime(l[6:26], '%d %b %Y %H:%M:%S')
    print(datetime_object)
This gives the correct output for the three examples you provided:
2013-03-17 18:58:06
2013-03-17 19:17:33
2014-04-13 19:10:26
The datetime object has attributes such as .month and .year, so you can use a simple equality check to see whether two dates fall in the same month and/or year.
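Building on that, here is a minimal sketch of the splitting step itself, assuming the fixed [6:26] slice holds for every entry header: keep one open file per yyyy-mm and send each line to the file of the most recent matching header.
import datetime

handles = {}  # maps 'YYYY-MM' to an open file handle
current = None
with open('log.txt') as f:
    for line in f:
        try:
            dt = datetime.datetime.strptime(line[6:26], '%d %b %Y %H:%M:%S')
            key = '%04d-%02d' % (dt.year, dt.month)
            if key not in handles:
                handles[key] = open('log_%s.txt' % key, 'w')
            current = handles[key]
        except ValueError:
            pass  # not an entry header; the line stays with the current entry
        if current is not None:
            current.write(line)
for h in handles.values():
    h.close()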

Make timeline-chart out of .csv data

for item in list_of_dictionaries:
    print(item['created_time'])
I have a list of 70k dictionaries, and each dict has the above key with its value in this format:
Wed Sep 20 23:40:58 +0000 2017
What I need is to make a dictionary that would tell me the number of entries for every hour of each day, such as:
for Tue:
[00:200,01:231,02 ... ]
Is there any way to convert that string (Wed Sep 20 23:40:58 +0000 2017) into a datetime object so I could group them for every half an hour?
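That string is Twitter's created_time layout, which strptime can parse directly. A sketch of the conversion and the per-weekday, per-hour counting, assuming every value follows that layout:
from datetime import datetime
from collections import Counter

counts = Counter()
for item in list_of_dictionaries:
    dt = datetime.strptime(item['created_time'], '%a %b %d %H:%M:%S %z %Y')
    counts[(dt.strftime('%a'), dt.hour)] += 1  # add dt.minute // 30 to the key for half-hour buckets

print(counts[('Tue', 0)])  # number of entries on Tuesdays between 00:00 and 00:59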

Using regex separators with read_csv() in python?

I have a lot of csv files formatted as such:
date1::tweet1::location1::language1
date2::tweet2::location2::language2
date3::tweet3::location3::language3
and so on. Some files contain up to 200 000 tweets. I want to extract 4 fields and put them in a pandas dataframe, as well as count the number of tweets. Here's the code I'm using for now:
try:
    data = pd.read_csv(tweets_data_path, sep="::", header=None, engine='python')
    data.columns = ["timestamp", "tweet", "location", "lang"]
    print 'Number of tweets: ' + str(len(data))
except BaseException, e:
    print 'Error: ', str(e)
I get the following error thrown at me
Error: expected 4 fields in line 4581, saw 5
I tried setting error_bad_lines = False, manually deleting the lines that make the program break, and setting nrows to a lower number, and I still get those "expected fields" errors for random lines. Say I delete the bottom half of the file: I will get the same error, but for line 1787, which doesn't make sense to me, as that line was processed correctly before. Visually inspecting the csv files doesn't reveal abnormal patterns that suddenly appear in the buggy line either.
The date fields and tweets contain colons, urls and so on so perhaps regex would make sense?
Can someone help me figure out what I'm doing wrong? Many thanks in advance!
Sample of the data as requested below:
Fri Apr 22 21:41:03 +0000 2016::RT #TalOfer: Barack Obama: Brexit would put UK back of the queue for trade talks [short url] #EuRef #StrongerIn::United Kingdom::en
Fri Apr 22 21:41:07 +0000 2016::RT #JamieRoss7: It must be awful to strongly believe in Brexit and be watching your campaigns make an absolute horse's arse of it.::The United Kingdom::en
Fri Apr 22 21:41:07 +0000 2016::Whether or not it rains on June 23rd will have more influence on the vote than Obama's lunch with the Queen and LiGA with George. #brexit.::Dublin, Ireland::en
Fri Apr 22 21:41:08 +0000 2016::FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexit vote would send UK to 'back of trade queue' #skypapers [short url]::Mardan, Pakistan::en
Start with this:
pd.read_csv(tweets_data_path, sep="::", header=None, usecols=[0,1,2,3], engine='python')
The above should bring in 4 columns, then you can figure out how many lines were dropped, and if the data makes sense.
Then, to check whether the data makes sense, use this pattern:
data["lang"].unique()
Since you have a problem somewhere in the data and do not know where it is, you need to step back and use Python's csv reader. This should get you started.
import csv
import pandas as pd

tweetList = []
with open(tweets_data_path) as f:  # csv.reader needs a file object, not a path string
    reader = csv.reader(f)
    for row in reader:
        try:
            tweetList.append(row[0].split('::'))
        except BaseException, e:
            print 'Error: ', str(e)
print tweetList
tweetsDf = pd.DataFrame(tweetList)
print tweetsDf
0 \
0 Fri Apr 22 21:41:03 +0000 2016
1 Fri Apr 22 21:41:07 +0000 2016
2 Fri Apr 22 21:41:07 +0000 2016
3 Fri Apr 22 21:41:08 +0000 2016
1 2 3
0 RT #TalOfer: Barack Obama: Brexit would put UK... United Kingdom en
1 RT #JamieRoss7: It must be awful to strongly b... The United Kingdom en
2 Whether or not it rains on June 23rd will hav... Dublin None
3 FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexi... Mardan None
Have you tried read_table instead? I got this kind of error when I tried to use read_csv before, and solved the problem by using read_table. Please refer to this post; it might give you some ideas about how to solve the error. Maybe also try sep=r":{2}" as the delimiter.
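If you'd rather locate the offending rows before handing the file to pandas, here is a small diagnostic sketch (hypothetical, not from the answers above), assuming well-formed rows have exactly four fields and therefore exactly three separators:
with open(tweets_data_path) as f:
    for i, line in enumerate(f, start=1):
        if line.count('::') != 3:  # flag any raw line with the wrong field count
            print i, repr(line)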
