search string in text file and capture following n characters - python

I'm using subprocess.Popen and stdout to write the output of a curl from Backendless (BaaS).
What's written to the output file is a long single line of data, separated by commas. Here's a small portion of it.
{"APIEndpoint":"asdfaasdfa","created":1429550024000,"updated":null,"objectId":"EE51537D-A9AC-721C-FF33-F4B258931E00"...}
The value I need from this output file is the 37-character string following "objectId":". I've read many similar questions but haven't been able to find a solution to this specific one. I've tried something like:
objectid = 37
searchfile = open('backendless.txt', 'r')
for line in searchfile:
    if "\"objectId\":\"" in line:
        print(right[:objectid])
which returns nothing. Please correct me if I'm using line incorrectly. I'm very new to this. Also, is there a way to achieve the same result without saving it to the text file first and instead performing the curl with PIPE and communicate?
I'm using Python 3.4. Thank you.
EDIT/SOLUTION:
import json
from subprocess import *

baascurl = Popen(['curl', '-H', appid, '-H', secretkey, '-H', apptype, '-H', contenttype,
                  '-X', 'GET', '-v', 'https://api.backendless.com/v1/data/Latency/last'],
                 stdout=PIPE).communicate()[0]
objidbytes = baascurl.decode(encoding='utf-8')
objid = json.loads(objidbytes)["objectId"]

Your data looks pretty much like JSON, so maybe you can use Python's json module:
import json
objid = json.loads(datastring)["objectId"]
If you want to stay on the text level, the right tool for this job is a regular expression. Look into Python's re module.
import re
m = re.search(r'"objectId":"([^"]+)"', datastring, re.IGNORECASE)
if m:
    objid = m.group(1)
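If you do want to keep the fixed-width idea from the question, str.find can locate the key and slice out the value directly. A minimal sketch, reusing the line variable from the loop in the question (note the UUID in the sample is actually 36 characters, not 37):
key = '"objectId":"'
pos = line.find(key)
if pos != -1:
    start = pos + len(key)
    objid = line[start:start + 36]  # the sample value is a 36-character UUID
    print(objid)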

Extract only additions from diff in python

I am trying to solve a problem:
I receive an auto-generated email from the government with no tags in the HTML. It's one table nested upon another, an abomination of a template. I get it every few days and I want to extract some fields from it. My idea was this:
1. Use the HTML in the email as a template. Remove all fields that change with every mail, like the name of my client, their unique ID, and the issue explained in the mail.
2. Diff this HTML template, with its fields removed, against each new email. That will give me all the new info in one shot without having to parse the email.
The problem is, I can't find any way of loading only these additions. I am trying to use difflib in Python, and it returns byte streams of additions and subtractions in each line that I am not able to process properly. I want a way to return only the additions and nothing else. I am open to using other libraries or methods. I do not want to write a huge regex with tons of HTML.
When I got the stdout from calling diff via Popen, it also returned bytes.
You can convert the bytes to characters and then continue with your processing.
You could do something similar to what I do below to convert your bytes to a string.
The code below calls diff on two files and prints only the lines beginning with the '>' symbol (i.e. new in the right-hand file):
#!/usr/bin/env python
import os
import sys, subprocess

file1 = 'test1'
file2 = 'test2'
if len(sys.argv) == 3:
    file1 = sys.argv[1]
    file2 = sys.argv[2]
if not os.access(file1, os.R_OK):
    print(f"Unable to read: '{file1}'")
    sys.exit(1)
if not os.access(file2, os.R_OK):
    print(f"Unable to read: '{file2}'")
    sys.exit(1)

argv = ['diff', file1, file2]
runproc = subprocess.Popen(args=argv, stdout=subprocess.PIPE)
out, err = runproc.communicate()

# stdout from communicate() is bytes; build a str from it character by character
outstr = ''
for c in out:
    outstr += chr(c)

for line in outstr.split('\n'):
    if len(line) == 0:
        continue
    if line[0] == '>':
        print(line)
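If you would rather skip the external diff process entirely, Python's difflib can give you the added lines directly. A minimal sketch, assuming the stripped template and a new email are saved as text files (the file names here are made up):
import difflib

with open('template.html') as f1, open('new_email.html') as f2:
    old_lines = f1.readlines()
    new_lines = f2.readlines()

# unified_diff yields plain strings; additions start with '+',
# but the '+++' file header must be skipped
for line in difflib.unified_diff(old_lines, new_lines):
    if line.startswith('+') and not line.startswith('+++'):
        print(line[1:], end='')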

Issue using replace function in python

I am trying to encrypt a word and then replace it in the given text; for that I am using replace() in Python. The method replaces the word, but the original one also stays in the text. Below is my code:
import subprocess
import bz2
import base64
from subprocess import Popen, PIPE

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/user/cloudera/xxx.dat"], stdout=subprocess.PIPE)
for line in cat.stdout:
    code = line.split('|')[0]
    if (code == "ID"):
        name = line.split('|')[5]
        address = line.split('|')[11]
        ciphername = base64.b64encode(bz2.compress(name))
        cipheraddr = base64.b64encode(bz2.compress(address))
        line.replace(name, ciphername).replace(address, cipheraddr)
    print line
Sample:
ID|1|ZXD0629|ZXD0629||HODJON||11383129|M|||221 B POLLARD RD��KAsODK�TBN�37764|||||||629Z800060|480837
Output:
'ID|1|ZXD0629|ZXD0629||QlpoOTFBWSZTWbk9uLgAAAIGCAbRiAACACAAMQZMQQaMItAUVNzxdyRThQkLk9uLgA==||11383129|M|||QlpoOTFBWSZTWT0tjHQAAAQeCEAALeAkDdQAAgAgADFNMjExMQpo0ZqBmowcuKOA3JhB1VMGcoxTGvi7kinChIHpbGOg|||||||QlpoOTFBWSZTWc5EbhIAAAQKAFNgABAgACEpppkIYBoRvMsvi7kinChIZyI3CQA=|480837\n'
ID|1|ZXD0629|ZXD0629||HODJON||11383129|M|||221 B POLLARD RD��KAsODK�TBN�37764|||||||629Z800060|480837
Expected Output:
ID|1|ZXD0629|ZXD0629||QlpoOTFBWSZTWbk9uLgAAAIGCAbRiAACACAAMQZMQQaMItAUVNzxdyRThQkLk9uLgA==||11383129|M|||QlpoOTFBWSZTWT0tjHQAAAQeCEAALeAkDdQAAgAgADFNMjExMQpo0ZqBmowcuKOA3JhB1VMGcoxTGvi7kinChIHpbGOg|||||||QlpoOTFBWSZTWc5EbhIAAAQKAFNgABAgACEpppkIYBoRvMsvi7kinChIZyI3CQA=|480837\n
I don't need the original text without encryption; I only need the encrypted one in my text. I have huge records, so I cannot post the entire sample here; that's why I have posted a small one. I don't know whether this issue is because of replace() or some mistake I made while implementing. Please help.
When calling str.replace() you don't change the original string value; the replace() function returns a new value, so here you need to rebind your original string to the replaced one:
line = line.replace(name,ciphername).replace(address,cipheraddr)
print line
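Applied to the loop from the question, that reassignment is the only change needed. A sketch of the corrected (Python 2) loop, reusing the question's imports and variables:
for line in cat.stdout:
    fields = line.split('|')
    if fields[0] == "ID":
        ciphername = base64.b64encode(bz2.compress(fields[5]))
        cipheraddr = base64.b64encode(bz2.compress(fields[11]))
        # replace() returns a new string, so bind it back to line
        line = line.replace(fields[5], ciphername).replace(fields[11], cipheraddr)
    print line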

Getting a '\n' (newline) during a subprocess call on date format, why?

Here's the traceback:
IOError: [Errno 2] No such file or directory: 'security01/30/15,15:32:58\n.jpg'
# how and why does the \n get here?
for these lines:
p = subprocess.Popen(['date +%m/%d/%y,%H:%M:%S'],
                     stdout=subprocess.PIPE, shell=True)
Additional lines used:
(output, err) = p.communicate()
args = ['fswebcam','--no-banner','-r',' 960x720','filename' + str(output) + '.jpg']
subprocess.call(args)
Another line:
mail('mail@mail.com',
     'subject',
     'body',
     'filename' + str(output) + '.jpg')
The date command ends its output with a newline, so you need the strip() method to get rid of it.
p = subprocess.Popen(['date +%m/%d/%y,%H:%M:%S'], stdout=subprocess.PIPE, shell=True)
stdout = p.communicate()[0].strip()
print 'security{date}.jpg'.format(date=stdout)
>>> security01/30/15,10:36:24.jpg
If you're looking to add a date/timestamp to your filename though, you would be better off doing it directly in Python using the datetime module and the strftime function:
from datetime import datetime
print datetime.now().strftime('%m/%d/%y,%H:%M:%S')
>>> '01/30/15,10:47:37'
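Folded into the snippet from the question, that becomes the following sketch (the fswebcam arguments are copied from the question; slashes, colons and the comma are swapped for filename-safe characters, since slashes in particular cannot appear in file names):
import subprocess
from datetime import datetime

stamp = datetime.now().strftime('%m-%d-%y_%H.%M.%S')
args = ['fswebcam', '--no-banner', '-r', ' 960x720', 'filename' + stamp + '.jpg']
subprocess.call(args)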
The date command, and basically every other command intended for interactive command-line use, terminates its output with a newline.
If that is not what you need, trimming the final newline from the output of a subprocess call is a very common thing to do (in the shell, it is the hardcoded default behavior of command substitutions, both with `backticks` and the modern $(command) syntax).
But you don't need a subprocess to create a date string -- Python has extensive (albeit slightly clunky) support for this in its standard library, out of the box. See e.g. here.
import time
filename = time.strftime('security%Y.%m.%d_%H.%M.%S.jpg')
or, adapted into your first example snippet,
args = ['fswebcam', '--no-banner', '-r', ' 960x720',
        time.strftime('filename%Y.%m.%d_%H.%M.%S.jpg')]
Because both slashes and (to a lesser extent, but still) colons are problematic characters to have in file names, I have replaced them with dots. For purely aesthetic reasons, I also changed the comma to an underscore (debatable; underscores are ugly, too).
I also switched the generated file names to use a standard datestamp naming convention with a full-digit year first, so that file listings and glob loops produce the files in correct date order.
Probably the code should be further tweaked to contain a proper ISO 8601 date in the file name; then if you want to parse and reformat it for human consumption separately, you are free to do that. But avoid custom date formats when there are standard formats which can be both read and written by existing code, as well as unambiguously understood by humans.
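For instance, a minimal sketch of such a name (the exact format is an assumption; the time colons are again replaced with dots for filename safety):
import time

# e.g. security_2015-01-30T15.32.58.jpg -- ISO 8601-like, sorts chronologically
filename = time.strftime('security_%Y-%m-%dT%H.%M.%S.jpg')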

Pythonic way to send contents of a file to a pipe and count # lines in a single step

Given the >4 GB file myfile.gz, I need to zcat it into a pipe for consumption by Teradata's fastload. I also need to count the number of lines in the file. Ideally, I want to make only a single pass through the file. I use awk to output the entire line ($0) to stdout and, through awk's END clause, write the number of rows (awk's NR variable) to a separate file (outfile).
I've managed to do this using awk but I'd like to know if a more pythonic way exists.
#!/usr/bin/env python
from subprocess import Popen, PIPE
from os import path
the_file = "/path/to/file/myfile.gz"
outfile = "/tmp/%s.count" % path.basename(the_file)
cmd = ["-c",'zcat %s | awk \'{print $0} END {print NR > "%s"} \' ' % (the_file, outfile)]
zcat_proc = Popen(cmd, stdout = PIPE, shell=True)
The pipe is later consumed by a call to Teradata's fastload, which reads from
"/dev/fd/" + str(zcat_proc.stdout.fileno())
This works, but I'd like to know if it's possible to skip awk and take better advantage of Python. I'm also open to other methods. I have multiple large files that I need to process in this manner.
There's no need for either zcat or Awk. Counting the lines in a gzipped file can be done with
import gzip
nlines = sum(1 for ln in gzip.open("/path/to/file/myfile.gz"))
If you want to do something else with the lines, such as pass them to a different process, do
nlines = 0
for ln in gzip.open("/path/to/file/myfile.gz"):
    nlines += 1
    # pass the line to the other process
Counting lines and unzipping gzip-compressed files can be easily done with Python and its standard library. You can do everything in a single pass:
import gzip, subprocess, os

fifo_path = "path/to/fastload-fifo"
os.mkfifo(fifo_path)
# start the consumer first: opening a FIFO for writing blocks until a reader appears
fastload = subprocess.Popen(["fastload", "--read-from", fifo_path])
fastload_fifo = open(fifo_path, "w")
with gzip.open("/path/to/file/myfile.gz") as f:
    for i, line in enumerate(f):
        fastload_fifo.write(line)
fastload_fifo.close()
print "Number of lines", i + 1
os.unlink(fifo_path)
I don't know how to invoke Fastload -- substitute the correct parameters in the invocation.
This can be done in one simple line of bash:
zcat myfile.gz | tee >(wc -l >&2) | fastload
This will print the line count on stderr. If you want it somewhere else you can redirect the wc output however you like.
Actually, it should not be possible to pipe the data to Fastload at all, so it would be great if somebody who has managed it could post an exact example here.
From the Teradata documentation on the Fastload configuration: http://www.info.teradata.com/htmlpubs/DB_TTU_14_00/index.html#page/Load_and_Unload_Utilities/B035_2411_071A/2411Ch03.026.028.html#ww1938556
FILE=filename
Keyword phrase specifying the name of the data source that contains the input data. fileid must refer to a regular file. Specifically, pipes are not supported.

Newbie question about file formatting in Python

I'm writing a simple program in Python 2.7 using pycURL library to submit file contents to pastebin.
Here's the code of the program:
#!/usr/bin/env python2
import pycurl, os

def send(file):
    print "Sending file to pastebin...."
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, "http://pastebin.com/api_public.php")
    curl.setopt(pycurl.POST, True)
    curl.setopt(pycurl.POSTFIELDS, "paste_code=%s" % file)
    curl.setopt(pycurl.NOPROGRESS, True)
    curl.perform()

def main():
    content = raw_input("Provide the FULL path to the file: ")
    open = file(content, 'r')
    send(open.readlines())
    return 0

main()
The resulting pastebin looks like a standard Python list: ['string\n', 'line of text\n', ...] etc.
Is there any way I could format it so it looks better and is actually human-readable? Also, I would be very happy if someone could tell me how to use multiple data inputs in POSTFIELDS. The Pastebin API uses paste_code as its main data input, but it can take optional fields like paste_name, which sets the name of the upload, or paste_private, which makes it private.
First, use .read() as virhilo said.
The other step is to use urllib.urlencode() to get a string:
curl.setopt(pycurl.POSTFIELDS, urllib.urlencode({"paste_code": file}))
This will also allow you to post more fields:
curl.setopt(pycurl.POSTFIELDS, urllib.urlencode({"paste_code": file, "paste_name": name}))
import pycurl, os

def send(file_contents, name):
    print "Sending file to pastebin...."
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, "http://pastebin.com/api_public.php")
    curl.setopt(pycurl.POST, True)
    curl.setopt(pycurl.POSTFIELDS, "paste_code=%s&paste_name=%s" \
        % (file_contents, name))
    curl.setopt(pycurl.NOPROGRESS, True)
    curl.perform()

if __name__ == "__main__":
    content = raw_input("Provide the FULL path to the file: ")
    with open(content, 'r') as f:
        send(f.read(), "yournamehere")
    print
When reading files, use the with statement (this makes sure your file gets closed properly if something goes wrong).
There's no need to be having a main function and then calling it. Use the if __name__ == "__main__" construct to have your script run automagically when called (unless when importing this as a module).
For posting multiple values, you can manually build the URL: just separate different key-value pairs with an ampersand (&). Like this: key1=value1&key2=value2. Or you can build one with urllib.urlencode (as others suggested).
EDIT: using urllib.urlencode on strings which are to be posted makes sure content is encoded properly when your source string contains some funny / reserved / unusual characters.
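For example, a quick sketch of what urllib.urlencode does with reserved characters (Python 2; the values are made up):
import urllib

# a list of pairs keeps the field order stable
fields = urllib.urlencode([("paste_code", "x = 1 & y = 2"), ("paste_name", "demo")])
print fields  # paste_code=x+%3D+1+%26+y+%3D+2&paste_name=demo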
use .read() instead of .readlines()
The POSTFIELDS should be sent the same way you send query string arguments. So, in the first place, it's necessary to encode the string that you're sending as paste_code, and then, using &, you can add more POST arguments.
Example:
paste_code=hello%20world&paste_name=test
Good luck!
