Splitting a list of twitter data - python

I have a file full of hundreds of un-separated tweets all formatted like so:
{"text": "Just posted a photo # Navarre Conference Center", "created_at": "Sun Nov 13 01:52:03 +0000 2016", "coordinates": [-86.8586, 30.40299]}
I am trying to split them up so I can assign each part to a variable:
The text
The timestamp
The location coordinates
I was able to split the tweets up using .split('{}') but I don't really know how to split the rest into the three things that I want.
My basic idea that didn't work:
file = open('tweets_with_time.json', 'r')
line = file.readline()
for line in file:
    line = line.split(',')
    message = (line[0])
    timestamp = (line[1])
    position = (line[2])
    # just to test if it's working
    print(position)
Thanks!

I just downloaded your file; it's not as bad as you said. Each tweet is on a separate line. It would be nicer if the file were a JSON list, but we can still parse it fairly easily, line by line. Here's an example that extracts the first 10 tweets.
import json

fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line into a Python dict
        data = json.loads(line)
        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']
        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()
        # Only print the first 10 tweets
        if i == 10:
            break
Unfortunately, I can't show the output of this script: Stack Exchange won't allow me to put those shortened URLs into a post.
Here's a modified version that cuts off each message at the URL.
import json

fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line to a Python dict
        data = json.loads(line)
        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']
        # Remove the URL from the message
        idx = message.find('https://')
        if idx != -1:
            message = message[:idx]
        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()
        # Only print the first 10 tweets
        if i == 10:
            break
Output:
1
Message: Just posted a photo # Navarre Conference Center
Timestamp: Sun Nov 13 01:52:03 +0000 2016
Position: [-86.8586, 30.40299]
2
Message: I don't usually drink #coffee, but I do love a good #Vietnamese drip coffee with condense milk…
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-123.04437109, 49.26211779]
3
Message: #bestcurryπŸ’₯πŸ‘£πŸ‘ŒπŸ½πŸ˜ŽπŸ€‘πŸ‘πŸ½πŸ‘πŸΌπŸ‘ŠπŸΌβ˜πŸ½πŸ™ŒπŸΌπŸ’ͺπŸΌπŸŒ΄πŸŒΊπŸŒžπŸŒŠπŸ·πŸ‰πŸπŸŠπŸΌπŸ„πŸ½πŸ‹πŸ½πŸŒβœˆοΈπŸ’ΈβœπŸ’―πŸ†’πŸ‡ΏπŸ‡¦πŸ‡ΊπŸ‡ΈπŸ™πŸΌ#johanvanaarde #kauai #rugby #surfing…
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-159.4958861, 22.20321232]
4
Message: #thatonePerezwedding πŸ’πŸ’ # Scenic Springs
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-98.68685568, 29.62182898]
5
Message: Miami trends now: Heat, Wade, VeteransDay, OneLetterOffBands and TheyMightBeACatfishIf.
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-80.19240081, 25.78111669]
6
Message: Thank you family for supporting my efforts. I love you all!…
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-117.83012, 33.65558157]
7
Message: If you're looking for work in #HONOLULU, HI, check out this #job:
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-157.7973653, 21.2868901]
8
Message: Drinking a L'Brett d'Apricot by #CrookedStave # FOBAB β€”
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-87.6455, 41.8671]
9
Message: Can you recommend anyone for this #job? Barista (US) -
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-121.9766823, 38.350109]
10
Message: He makes me happy # Frank and Bank
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-75.69360487, 45.41268776]

It looks like well-formatted JSON data. Try the following:
import json
from pprint import pprint
file_ptr = open('tweets_with_time.json', 'r')
data = json.load(file_ptr)
pprint(data)
It should parse your data into a nice Python dictionary. You can access the elements by their names like:
# Return the first 'coordinates' data point as a list of floats
data[0]["coordinates"]
# Return the 5th 'text' data point as a string
data[4]["text"]
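One caveat: the earlier answer notes that this file actually holds one JSON object per line rather than a single JSON document. If that is the case, json.load on the whole file will raise a JSONDecodeError, and you would build the list yourself, line by line:
import json

with open('tweets_with_time.json') as f:
    data = [json.loads(line) for line in f if line.strip()]

# The same indexing then works:
print(data[0]["coordinates"])
print(data[4]["text"])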

Related

regex from txt file python

I'm building a task manager and I need to find the tasks in a txt file that are assigned to the user who is logged in. I add the output from my txt file to a variable counts and then use regex to find the needed text. My problem comes in when I try to match the output with re.findall().
this is the text file I read from
admin, Register Users with taskManager.py, Use taskManager.py to add the usernames and passwords for all team members that will be using this program., 10 Oct 2019, 20 Oct 2019, No
admin, Assign initial tasks, Use taskManager.py to assign each team member with appropriate tasks, 10 Oct 2019, 25 Oct 2019, No
chris, build task manager, finish manager, 29 June 2022, 22 May 2025, No
tony, build new suit, builing a new iron man suit, 5 July 2022, 25 October 2023, no
this is my code
elif menu == "vm":
    # find_tasks = input("Please enter your username: ")
    with open(r"C:\Users\27711\Desktop\PROGRAMMING\Bootcamp lvl 1\task 20 Capstone\tasks.txt", 'r') as file3:
        for line in file3:
            iteration = line.split(", ")
            counts = f'''\nAssigned to:\t {iteration[0]}
\nTask:\t {iteration[1]}
\nDate assigned:\t {iteration[3]}
\nDue Date:\t {iteration[4]}
\nTask Complete:\t {iteration[5]}
\nTask Description: \n{iteration[2]}\n'''
            found = re.findall('Task:\s*(.*)\s*Assigned to:.\s*(admin)\s*Date assigned:\s*(.*)\s*Due Date:\s*(.*)\s*Task Complete:\s*(.*)\s*Task Description:\s*(.*)', counts)
            for item in found:
                print(f'''\nTask:\t {item[1]}
\nAssigned to:\t {item[0]}
\nDate assigned:\t {item[2]}
\nDue Date:\t {item[0]}
\nTask Complete:\t {item[0]}
\nTask Description: \n{item[0]}\n''')
So my code turns the txt file contents into this:
Task: Register Users with taskManager.py
Assigned to: admin
Date assigned: 10 Oct 2019
Due Date: 20 Oct 2019
Task Complete: No
Task Description:
Use taskManager.py to add the usernames and passwords for all team members that will be using this program.
Why won't my regex work on this? I need to print only the tasks that are assigned to the user that's logged in.
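As an aside, the regex may not be needed at all here: iteration[0] already holds the username, so you can filter on it directly and print from the split fields. A minimal sketch (the logged-in username and the file path are placeholders for whatever your program already has):
logged_in_user = "admin"  # placeholder: use the username stored at login

with open("tasks.txt", "r") as file3:  # placeholder path
    for line in file3:
        fields = line.strip().split(", ")
        if len(fields) >= 6 and fields[0] == logged_in_user:
            print(f"\nAssigned to:\t {fields[0]}"
                  f"\nTask:\t {fields[1]}"
                  f"\nDate assigned:\t {fields[3]}"
                  f"\nDue Date:\t {fields[4]}"
                  f"\nTask Complete:\t {fields[5]}"
                  f"\nTask Description: \n{fields[2]}\n")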

Take specific dates in a log file and process them

I want to process a log file that contains event logs, but only today's entries.
The log file looks like this:
Aug 23 07:23:05 iZk1a211s8hkb4hkecu7w1Z sshd[19569]: Invalid user test from 10.148.0.13 port 48382
...
Sep 20 07:23:06 iZk1a211s8hkb4hkecu7w1Z sshd[19569]: Failed password for invalid user test from 10.148.0.13 port 48382 ssh2
...
Aug 23 07:23:07 iZk1a211s8hkb4hkecu7w1Z sshd[19564]: Failed password for invalid user sysadm from 10.148.0.13 port 48380 ssh2
...
Oct 15 07:23:09 iZk1a211s8hkb4hkecu7w1Z sshd[19573]: Invalid user sinusbot from 10.148.0.13 port 48384
...
Sep 08 07:23:11 iZk1a211s8hkb4hkecu7w1Z sshd[19573]: Failed password for invalid user sinusbot from 10.148.0.13 port 48384 ssh2
...
Nov 01 07:23:16 iZk1a211s8hkb4hkecu7w1Z sshd[19587]: Invalid user smkim from 10.148.0.13 port 48386
...
Nov 12 07:23:18 iZk1a211s8hkb4hkecu7w1Z sshd[19587]: Failed password for invalid user smkim from 10.148.0.13 port 48386 ssh2
How do I grab only today's lines from the log?
I've tried this and got stuck on finding the pattern:
from datetime import date
today = date.today()
today = today.strftime("%B %d")
with open('file.log', 'r') as f:
    for line in f:
        date = line.find("*idk I'm stuck at this point*")
        if date = today:
            *run my process script*
Does anyone have any suggestions?
You need to extract the part of the string containing the date, parse it as a datetime, and convert it to a date:
from datetime import datetime, date

today: date = date.today()
with open('file.log', 'r') as f:
    for line in f:
        line_date = datetime.strptime(line[:15], "%b %d %H:%M:%S").date().replace(year=today.year)
        if line_date == today:
            *run my process script*
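To see what line[:15] captures and how strptime reads it, here is a quick check against the first sample line from the question; note that the log uses abbreviated month names, so the format is "%b", not the "%B" from the original attempt:
from datetime import datetime, date

sample = "Aug 23 07:23:05 iZk1a211s8hkb4hkecu7w1Z sshd[19569]: Invalid user test from 10.148.0.13 port 48382"
print(sample[:15])                                         # Aug 23 07:23:05
parsed = datetime.strptime(sample[:15], "%b %d %H:%M:%S")  # year defaults to 1900
print(parsed.date().replace(year=date.today().year))       # e.g. 2025-08-23, depending on the current year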

How to parse a multi-line Catalina log in Python - regex

I have a Catalina log:
oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated
WARNING: HTTP Session created without LoggedInSessionBean
oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
at ais.api.rest.rdss.Resource.lookAT(Resource.java:22)
at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
I try to parse it in Python. My problem is that I don't know how many lines each log entry has; the minimum is two. I read the file line by line, and when a line starts with j, m, s, o, etc., it is the first line of an entry, because these are the first letters of the month names. But I don't know how to continue. When do I stop reading lines? When the next line starts with one of those letters? But how do I do that?
import datetime
import re
import MySQLdb

SPACE = r'\s'
TIME = r'(?P<time>.*?M)'
PATH = r'(?P<path>.*?\S)'
METHOD = r'(?P<method>.*?\S)'
REQUEST = r'(?P<request>.*)'
TYPE = r'(?P<type>.*?\:)'
REGEX = TIME + SPACE + PATH + SPACE + METHOD + SPACE + TYPE + SPACE + REQUEST

def parser(log_line):
    match = re.search(REGEX, log_line)
    return (match.group('time'),
            match.group('path'),
            match.group('method'),
            match.group('type'),
            match.group('request'))

db = MySQLdb.connect(host="localhost", user="myuser", passwd="mypsswd", db="Database")
with db:
    cursor = db.cursor()
    with open("Mylog.log", "rw") as f:
        for line in f:
            if (line.startswith('j')) or (line.startswith('f')) or (line.startswith('m')) or (line.startswith('a')) or (line.startswith('s')) or (line.startswith('o')) or (line.startswith('n')) or (line.startswith('d')):
                logLine = line
                result = parser(logLine)
                sql = ("INSERT INTO ..... ")
                data = (result[0])
                cursor.execute(sql, data)
    f.close()
db.close()
The best idea I have is to read just two lines at a time, but that means discarding all the other data. There must be a better way.
I want to read lines like this:
1.line - oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean
2.line - oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException SEVERE: Mapped exception to response: 500 (Internal Server Error) javax.ws.rs.WebApplicationException at ais.api.rest.rdss.Resource.lookAT(Resource.java:22) at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl java:43)
3.line - oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean
So I want to start reading when a line starts with a datetime (this is no problem). The problem is that I want to stop reading when the next line starts with a datetime.
This may be what you want.
I read lines from the log inside a generator so that I can determine whether they are datetime lines or other lines. Also, importantly, I can flag that end-of-file has been reached in the log file.
In the main loop of the program I start accumulating lines in a list when I get a datetime line. Each time I see a new datetime line I print the previously accumulated entry, if there is one, and start a new one. Since the program will still be holding a complete entry when end-of-file occurs, I arrange to print that accumulated entry at that point too.
import re

a_date, other, EOF = 0, 1, 2

def One_line():
    with open('caroline.txt') as caroline:
        for line in caroline:
            line = line.strip()
            m = re.match(r'[a-z]{3}\s+[0-9]{1,2},\s+[0-9]{4}\s+[0-9]{1,2}:[0-9]{2}:[0-9]{2}\s+[AP]M', line, re.I)
            if m:
                yield a_date, line
            else:
                yield other, line
    yield EOF, ''

complete_line = []
for kind, content in One_line():
    if kind in [a_date, EOF]:
        if complete_line:
            print(' '.join(complete_line))
        complete_line = [content]
    else:
        complete_line.append(content)
Output:
oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean
oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException SEVERE: Mapped exception to response: 500 (Internal Server Error) javax.ws.rs.WebApplicationException at ais.api.rest.rdss.Resource.lookAT(Resource.java:22) at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Python Error: "ValueError: need more than 1 value to unpack" in socket programming

In Python, when I run this code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import socket

s = socket.socket()
s.connect(('www.sina.com.cn', 80))
s.send(b'GET /HTTP/1.1\r\nHost: www.sina.com.cn\r\nConnection: close\r\n\r\n')

buffer = []
while True:
    d = s.recv(1024)
    if d:
        buffer.append(d)
    else:
        break
data = b''.join(buffer)
s.close()

header, html = data.split(b'\r\n\r\n', 1)
print(header.decode('utf-8'))
with open('sina_test.html', 'wb') as f:
    f.write(html)
I get this error:
line 19, in (header,html,h) = data.split(b'\r\n\r\n',1)
ValueError: need more than 1 value to unpack
What does that error mean?
The second argument to split (maxsplit) limits how many splits are performed, so
header, html = data.split(b'\r\n\r\n', 1)
returns at most two items. Here you are trying to unpack more values than the split actually produced; "need more than 1 value" means the call returned only a single item.
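A quick interpreter check of what maxsplit does (the byte strings below are made up for illustration):
data = b'HTTP/1.1 200 OK\r\nServer: nginx\r\n\r\n<html>...</html>'
print(len(data.split(b'\r\n\r\n', 1)))      # 2 -- maxsplit=1 means at most one split, two items
header, html = data.split(b'\r\n\r\n', 1)   # unpacks fine
bad = b'no blank line separator in here'
print(len(bad.split(b'\r\n\r\n', 1)))       # 1 -- separator not found, single item
# header, html = bad.split(b'\r\n\r\n', 1)  # this is the situation that raises the ValueError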
There's a space missing before HTTP in the request line:
# wrong
s.send(b'GET /HTTP/1.1\r\nHost: www.sina.com.cn\r\nConnection: close\r\n\r\n')
# correct
s.send(b'GET / HTTP/1.1\r\nHost: www.sina.com.cn\r\nConnection: close\r\n\r\n')
The wrong request may get you a response like this:
print(data)
<HTML>\n<HEAD>\n<TITLE>Not Found on Accelerator</TITLE>\n</HEAD>\n\n<BODY BGCOLOR="white" FGCOLOR="black">\n<H1>Not Found on Accelerator</H1>\n<HR>\n\n<FONT FACE="Helvetica,Arial"><B>\nDescription: Your request on the specified host was not found.\nCheck the location and try again.\n</B></FONT>\n<HR>\n</BODY>\n
A correct response starts with headers like this:
HTTP/1.1 200 OK
Server: nginx
Date: Tue, 23 Jan 2018 03:28:38 GMT
Content-Type: text/html
Content-Length: 605221
Connection: close
Last-Modified: Tue, 23 Jan 2018 03:27:02 GMT
Vary: Accept-Encoding
Expires: Tue, 23 Jan 2018 03:29:37 GMT
Cache-Control: max-age=60
X-Powered-By: shci_v1.03
Age: 1
Via: http/1.1 ctc.jiangsu.ha2ts4.82 (ApacheTrafficServer/6.2.1 [cHs f ])
X-Cache: HIT.82
X-Via-CDN: f=edge,s=ctc.jiangsu.ha2ts4.83.nb.sinaedge.com,c=58.213.91.6;f=Edge,s=ctc.jiangsu.ha2ts4.82,c=61.155.142.83
X-Via-Edge: 1516678118627065bd53afa8e9b3d553f23b9
That's why ValueError occurs.
This error means that your data does not contain the separator you are splitting on, so data.split(b'\r\n\r\n', 1) returns a single-element list, which cannot be unpacked into header and html.
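If you want the script to keep going instead of crashing when the separator is missing, a small defensive sketch (using the same data variable as above):
parts = data.split(b'\r\n\r\n', 1)
if len(parts) == 2:
    header, html = parts
else:
    # No blank line found: treat everything as header and leave the body empty
    header, html = parts[0], b''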

Parsing email headers with regular expressions in python

I'm a python beginner trying to extract data from email headers. I have thousands of email messages in a single text file, and from each message I want to extract the sender's address, recipient(s) address, and the date, and write it to a single, semicolon-delimitted line in a new file.
this is ugly, but it's what I've come up with:
import re

emails = open("demo_text.txt", "r")   # opens the file to analyze
results = open("results.txt", "w")    # creates a new file for search results

resultsList = []
for line in emails:
    if "From - " in line:  # recognizes the beginning of an email message and adds a line break
        newMessage = re.findall(r'\w\w\w\s\w\w\w.*', line)
        if newMessage:
            resultsList.append("\n")
    if "From: " in line:
        address = re.findall(r'[\w.-]+@[\w.-]+', line)
        if address:
            resultsList.append(address)
            resultsList.append(";")
    if "To: " in line:
        if "Delivered-To:" not in line:  # avoids confusion with the 'Delivered-To:' tag
            address = re.findall(r'[\w.-]+@[\w.-]+', line)
            if address:
                for person in address:
                    resultsList.append(person)
                    resultsList.append(";")
    if "Date: " in line:
        date = re.findall(r'\w\w\w\,.*', line)
        resultsList.append(date)
        resultsList.append(";")

for result in resultsList:
    results.writelines(result)

emails.close()
results.close()
and here's my 'demo_text.txt':
From - Sun Jan 06 19:08:49 2013
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
Delivered-To: somebody_1@hotmail.com
Received: by 10.48.48.3 with SMTP id v3cs417003nfv;
Mon, 15 Jan 2007 10:14:19 -0800 (PST)
Received: by 10.65.211.13 with SMTP id n13mr5741660qbq.1168884841872;
Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Return-Path: <nobody@hotmail.com>
Received: from bay0-omc3-s21.bay0.hotmail.com (bay0-omc3-s21.bay0.hotmail.com [65.54.246.221])
by mx.google.com with ESMTP id e13si6347910qbe.2007.01.15.10.13.58;
Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Received-SPF: pass (google.com: domain of nobody@hotmail.com designates 65.54.246.221 as permitted sender)
Received: from hotmail.com ([65.54.250.22]) by bay0-omc3-s21.bay0.hotmail.com with Microsoft SMTPSVC(6.0.3790.2668);
Mon, 15 Jan 2007 10:13:48 -0800
Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC;
Mon, 15 Jan 2007 10:13:47 -0800
Message-ID: <BAY115-F12E4E575FF2272CF577605A1B50#phx.gbl>
Received: from 65.54.250.200 by by115fd.bay115.hotmail.msn.com with HTTP;
Mon, 15 Jan 2007 18:13:43 GMT
X-Originating-IP: [200.122.47.165]
X-Originating-Email: [nobody@hotmail.com]
X-Sender: nobody@hotmail.com
From: =?iso-8859-1?B?UGF1bGEgTWFy7WEgTGlkaWEgRmxvcmVuemE=?=
<nobody@hotmail.com>
To: somebody_1@hotmail.com, somebody_2@gmail.com, 3_nobodies@yahoo.com.ar
Bcc:
Subject: fotos
Date: Mon, 15 Jan 2007 18:13:43 +0000
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_NextPart_000_d98_1c4f_3aa9"
X-OriginalArrivalTime: 15 Jan 2007 18:13:47.0572 (UTC) FILETIME=[E68D4740:01C738D0]
Return-Path: nobody@hotmail.com
The output is:
somebody_1@hotmail.com;somebody_2@gmail.com;3_nobodies@yahoo.com.ar;Mon, 15 Jan 2007 18:13:43 +0000;
This output would be fine except that there's a line break in the 'From:' field in my demo_text.txt (line 24), so I miss 'nobody@hotmail.com'.
I'm not sure how to tell my code to skip the line break and still find the email address in the From: tag.
More generally, I'm sure there are many more sensible ways to go about this task. If anyone could point me in the right direction, I'd sure appreciate it.
Your demo text is practically the mbox format, which can be processed directly with the appropriate class in the mailbox module:
from mailbox import mbox
import re

PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+@[0-9A-Za-z._-]+")

mymbox = mbox("demo.txt")
for email in mymbox.values():
    from_address = PAT_EMAIL.findall(email["from"])
    to_address = PAT_EMAIL.findall(email["to"])
    date = [email["date"]]
    print(";".join(from_address + to_address + date))
In order to handle the wrapped header, you can't read the file strictly line by line. You can instead load the whole file and use your keywords (From, To, etc.) as boundaries: when you search for the text after 'From:', the match stops at the next keyword rather than at the end of the line. A rough sketch of that idea follows.
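This reads the whole file and lets the regex run across the wrapped From: header (the pattern is only a starting point, checked against your demo text rather than real data):
import re

with open("demo_text.txt") as f:
    raw = f.read()

# Each message starts with a 'From - ' line; split the file on those
messages = re.split(r'^From - ', raw, flags=re.MULTILINE)[1:]

for msg in messages:
    # DOTALL lets the match continue across the wrapped line;
    # the lookahead stops it at the next 'Keyword:' header line
    m = re.search(r'^From:(.*?)(?=^\S+:)', msg, flags=re.MULTILINE | re.DOTALL)
    if m:
        print(re.findall(r'[\w.-]+@[\w.-]+', m.group(1)))
The mailbox-based answer above is still the more robust option; this is just the manual version of the boundary idea.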
Also, mentioning this because you said you're a beginner: the "Pythonic" way of naming non-class variables is with underscores, so resultsList should be results_list.
