Extract only additions from diff in python

I am trying to solve a problem:
I receive auto-generated email from government with no tags in HTML. It's one table nested upon another. An abomination of a template. I get it every few days and I want to extract some fields from it. My idea was this
Use HTML in the email as template. Remove all fields that change with every mail like Name of my client, their Unique ID and issue explained in the mail.
Use this html template with missing fields and diff it with new emails. That will give me all the new info in one shot without having to parse this email.
Problem is, I can't find any way of loading only these additions. I am trying to use difflib in python and it returns byte streams of additions and subtractions in each line that I am not able to process properly. I want to find a way to only return the additions and nothing else. I am open to using other libraries or methods. I do not want to write a huge regex with tons of html.

When I captured stdout from calling diff via Popen, it also came back as bytes.
You can convert the bytes to characters, then continue with your processing.
You could do something similar to what I do below to convert the bytes to a string.
The script below calls diff on two files and prints only the lines beginning with '>' (lines that are new in the right-hand file):
#!/usr/bin/env python
import os
import sys
import subprocess

# Default filenames; override with two command-line arguments.
file1 = 'test1'
file2 = 'test2'
if len(sys.argv) == 3:
    file1 = sys.argv[1]
    file2 = sys.argv[2]

if not os.access(file1, os.R_OK):
    print(f"Unable to read: '{file1}'")
    sys.exit(1)
if not os.access(file2, os.R_OK):
    print(f"Unable to read: '{file2}'")
    sys.exit(1)

# Run diff and capture its stdout (which arrives as bytes).
argv = ['diff', file1, file2]
runproc = subprocess.Popen(args=argv, stdout=subprocess.PIPE)
out, err = runproc.communicate()

# Convert the bytes to a string, one character at a time.
outstr = ''
for c in out:
    outstr += chr(c)

# Keep only the lines that are additions in the second file.
for line in outstr.split('\n'):
    if len(line) == 0:
        continue
    if line[0] == '>':
        print(line)
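If you would rather stay in pure Python than shell out to diff, difflib can hand you just the additions directly. Here is a minimal sketch under the assumption that the stripped-down template and the new email are saved as plain-text files; the filenames template.html and new_email.html are placeholders:
import difflib

# Placeholder filenames: the stripped template and the latest email.
with open('template.html') as f:
    template_lines = f.readlines()
with open('new_email.html') as f:
    email_lines = f.readlines()

# ndiff marks added lines with '+ ', removed lines with '- ',
# unchanged lines with '  ' and hints with '? '; keep only the additions.
additions = [line[2:] for line in difflib.ndiff(template_lines, email_lines)
             if line.startswith('+ ')]

for added in additions:
    print(added, end='')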

Related

How to use wiktextract

I am trying to extract data from a Wiktionary XML dump using the wiktextract Python module. However, their website does not give me enough information. I could not use the command-line program that comes with it, since it isn't a Windows executable, so I tried the programmatic way. The following code takes a while to run, so it seems to be doing something, but then I'm not sure what to do with the ctx variable. Can anyone help me?
import wiktextract

def word_cb(data):
    print(data)

ctx = wiktextract.parse_wiktionary(
    r'myfile.xml', word_cb,
    languages=["English", "Translingual"])
You are on the right track, but don't have to worry too much about the ctx object.
As the documentation says:
The parse_wiktionary call will call word_cb(data) for words and redirects found in the
Wiktionary dump. data is information about a single word and part-of-speech as a dictionary (multiple senses of the same part-of-speech are combined into the same dictionary). It may also be a redirect (indicated by presence of a redirect key in the dictionary).
The returned ctx object mostly contains summary information (the number of sections processed, etc.); you can use dir(ctx) to see some of its fields.
The useful results are not the ones in the returned ctx object, but the ones passed to word_cb on a word-by-word basis. So you might just try something like the following to get a JSON dump from a Wiktionary XML dump. Because the full dumps are many gigabytes, I put a small one on a server for convenience in this example.
import json
import wiktextract
import requests

xml_fn = 'enwiktionary-20190220-pages-articles-sample.xml'
print("Downloading XML dump to " + xml_fn)
response = requests.get('http://45.61.148.79/' + xml_fn, stream=True)
# Throw an error for bad status codes
response.raise_for_status()
with open(xml_fn, 'wb') as handle:
    for block in response.iter_content(4096):
        handle.write(block)
print("Downloaded XML dump, beginning processing...")

fh = open("output.json", "wb")

def word_cb(data):
    # Encode so this also works on Python 3, where the file is opened in binary mode
    fh.write(json.dumps(data).encode("utf-8"))

ctx = wiktextract.parse_wiktionary(
    r'enwiktionary-20190220-pages-articles-sample.xml', word_cb,
    languages=["English", "Translingual"])
print("{} English entries processed.".format(ctx.language_counts["English"]))
print("{} bytes written to output.json".format(fh.tell()))
fh.close()
For me this produces:
Downloading XML dump to enwiktionary-20190220-pages-articles-sample.xml
Downloaded XML dump, beginning processing...
684 English entries processed.
326478 bytes written to output.json
with the small dump extract I placed on a server for convenience. It will take much longer to run on the full dump.
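Note that the entries end up back-to-back in output.json with no separator between the JSON objects, so reading the file back in takes a streaming decode. A small sketch of loading them again, assuming the file was produced by the script above:
import json

decoder = json.JSONDecoder()
with open("output.json") as fh:
    text = fh.read()

entries = []
pos = 0
while pos < len(text):
    # raw_decode returns the next object and the index where it ended
    obj, pos = decoder.raw_decode(text, pos)
    entries.append(obj)

print("{} entries loaded".format(len(entries)))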

Scripting a website-opener

I'm using a Python script to read in a file containing a bunch of website URLs and open all of them in new tabs. However, I'm getting an error message when opening the first website; this is what I get:
0:41: execution error: "https://www.pandora.com/
" doesn’t understand the “open location” message. (-1708)
My script thus far looks like this:
import os
import webbrowser

websites = []
with open("websites.txt", "r+") as my_file:
    websites.append(my_file.readline())

for x in websites:
    try:
        webbrowser.open(x)
    except:
        print(x + " does not work.")
My file consists of a bunch of URLs on their own lines.
I tried running your code and it works on my machine with Python 2.7.9.
It may be a character-encoding issue when you are trying to open the file.
This is my suggestion, with the following edits:
import webbrowser

with open("websites.txt", "r+") as sites:
    # readlines() returns a list of all the lines in your file, which keeps the code concise;
    # we reuse the variable 'sites' to hold the list returned by sites.readlines()
    sites = sites.readlines()

print sites  # send the list to the shell to make sure it contains the right information

for url in sites:
    # encode just in case, so the webbrowser module can interpret special characters;
    # characters like '\' or '/' can cause issues unless we encode/decode them or use raw strings
    webbrowser.open_new_tab(url.encode('utf-8'))
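One more thing worth checking: the error message shows a trailing newline inside the URL ("https://www.pandora.com/\n"), so stripping each line before opening it may also help. A small sketch combining that with the readlines() approach above, using the same websites.txt file:
import webbrowser

with open("websites.txt") as my_file:
    # strip() removes the trailing newline (and stray spaces) from each line
    websites = [line.strip() for line in my_file if line.strip()]

for url in websites:
    webbrowser.open_new_tab(url)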
Hope this helps!

subprocess, Popen, and stdin: Seeking practical advice on automating user input to .exe

Despite my obviously beginner-level Python skills, I've got a script that pulls a line of data from a 2,000-row CSV file, reads key parameters, outputs a buffer CSV file organized as an N-by-2 rectangle, and uses the subprocess module to call the external program POVCALLC.EXE, which takes a CSV file organized that way as input. The relevant portion of the code is shown below. I THINK that subprocess or one of its methods should allow me to interact with the external program, but I'm not quite sure how, or indeed whether this is the module I need.
In particular, when POVCALLC.EXE starts, it first asks for the input file, which in this case is buffer.csv. It then asks for several additional parameters, including the name of an output file, which come from outside the snippet below. It then starts computing results and asks for further user input, including several carriage returns. Obviously, I would prefer to automate this interaction for the 2,000 rows in the original CSV.
Am I on the right track with subprocess, or should I be looking elsewhere to automate this interaction with the external executable?
Many thanks in advance!
# Begin inner loop to fetch Lorenz curve data for each survey
for i in range(int(L_points_number)):
    index = 3 * i
    line = []
    P = L_points[index]
    line.append(P)
    L = L_points[index + 1]
    line.append(L)
    with open('buffer.csv', 'a', newline='') as buffer:
        writer = csv.writer(buffer, delimiter=',')
        P = 1
        line.append(P)
        L = 1
        line.append(L)
        writer.writerow(line)

subprocess.call('povcallc.exe')
# TODO: CALL povcallc and compute results
# TODO: USE Regex to interpret results and append them to
#       output file
If your program expects these arguments on the standard input (e.g. after running POVCALLC you type csv filenames into the console), you could use subprocess.Popen() [see https://docs.python.org/3/library/subprocess.html#subprocess.Popen ] with stdin redirection (stdin=PIPE), and use the returned object to send data to stdin.
It would look something like this:
my_proc = subprocess.Popen('povcallc.exe', stdin=subprocess.PIPE)
my_proc.communicate(input="my_filename_which_is_expected_by_the_program.csv")
You can also use the tuple returned by communicate to automatically check the program's stdout and stderr (see the linked docs for more).
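If the program prompts for several values in sequence (the input filename, an output filename, and a few plain carriage returns), you can usually join all of the responses with newlines and hand them to communicate() in one go. A sketch under that assumption, with hypothetical filenames and prompt order; note that it relies on text=True (Python 3.7+), otherwise the input would have to be bytes:
import subprocess

# Answers to each prompt, in the order the program asks for them.
# The empty strings become bare carriage returns (just press Enter).
responses = [
    "buffer.csv",      # input file
    "results.csv",     # output file (hypothetical name)
    "",                # further prompts answered with Enter
    "",
]

proc = subprocess.Popen(
    "povcallc.exe",
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,          # send/receive str instead of bytes
)
out, err = proc.communicate(input="\n".join(responses) + "\n")
print(out)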

Specifying filename in os.system call from python

I am writing a simple Python script to reorganize some text data I grabbed from a website. I put the data in a .txt file and then want to use the "tail" command to get rid of the first 5 lines. I'm able to make this work with the simple filename shown below, but when I try to change the filename (to what I'd actually like it to be) I get an error. My code:
start = 2010
end = 2010
for i in range(start, end + 1):
    year = str(i)
    # ...write data to a file called file...
    teamname = open(file).readline()  # want to use this in the new filename
    teamfname = teamname.replace(" ", "")  # getting rid of spaces
    file2 = "gotdata2_" + year + ".txt"
    os.system("tail -n +5 gotdata_" + year + ".txt > " + file2)
The above code works as intended, creating file, then creating file2 that excludes the first 5 lines of file. However, when I change the name of file2 to be:
file2 = teamfname+"_"+year+".txt"
I get the error:
sh: line 1: _2010.txt: command not found
It's as if the end of my file2 statement is getting chopped off and the .txt part isn't being recognized. In this case, my code outputs a file but it is missing the _2010.txt at the end. I've double-checked that both year and teamfname are strings. I've also tried it with and without spaces in the teamfname string. I get the same error when I try to include an os.system mv statement that would rename the file to what I want it to be, so there must be something wrong with my understanding of how to specify the string here.
Does anyone have any ideas about what causes this? I haven't been able to find a solution, but I've found this problem difficult to search for.
Without knowing what your actual strings are, it's impossible to be sure what the problem is. However, it's almost certainly something to do with failing to properly quote and/or escape arguments for the command line.
My first guess would be that you have a newline in the middle of your filename, and the shell is truncating the command at the newline. But I wouldn't bet too heavily on that. If you actually printed out the repr of the pathname, I could tell you for sure. But why go through all this headache?
The solution to almost any problem with os.system is to not use os.system.
If you look at the docs, they even tell you this:
The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.
If you use subprocess instead of os.system, you can avoid the shell entirely. You can also pass arguments as a list instead of trying to figure out how to quote them and escape them properly. Which would completely avoid the exact problem you're having.
For example, if you do this:
file2 = "gotdata2_"+year+".txt"
with open(file2, 'wb') as f:
subprocess.check_call(['tail', '-n', '+5', "gotdata_"+year+".txt"], stdout=f)
Then, if you change that first line to this:
file2 = teamfname+"_"+year+".txt"
It will still work even if teamfname has a space or a quote or another special character in it.
That being said, I'm not sure why you want to use tail in the first place. You can skip the first 5 lines just as easily directly in Python.
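For example, a rough sketch that drops the leading lines directly, using itertools.islice to skip them as the file streams. One caveat: tail -n +5 actually starts output at line 5, i.e. it drops only the first 4 lines, so pick whichever count you really mean:
from itertools import islice

infile = "gotdata_" + year + ".txt"
file2 = teamfname + "_" + year + ".txt"

with open(infile) as src, open(file2, "w") as dst:
    # islice(src, 5, None) yields everything after the first 5 lines;
    # use 4 instead to match exactly what tail -n +5 was doing
    dst.writelines(islice(src, 5, None))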

Newbie question about file formatting in Python

I'm writing a simple program in Python 2.7 that uses the pycURL library to submit file contents to pastebin.
Here's the code of the program:
#!/usr/bin/env python2
import pycurl, os

def send(file):
    print "Sending file to pastebin...."
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, "http://pastebin.com/api_public.php")
    curl.setopt(pycurl.POST, True)
    curl.setopt(pycurl.POSTFIELDS, "paste_code=%s" % file)
    curl.setopt(pycurl.NOPROGRESS, True)
    curl.perform()

def main():
    content = raw_input("Provide the FULL path to the file: ")
    open = file(content, 'r')
    send(open.readlines())
    return 0

main()
The resulting pastebin looks like a standard Python list: ['string\n', 'line of text\n', ...] etc.
Is there any way I could format it so it looks better and is actually human-readable? Also, I would be very happy if someone could tell me how to use multiple data inputs in POSTFIELDS. The Pastebin API uses paste_code as its main data input, but it accepts optional fields like paste_name, which sets the name of the upload, or paste_private, which makes it private.
First, use .read() as virhilo said.
The other step is to use urllib.urlencode() to get a string:
curl.setopt(pycurl.POSTFIELDS, urllib.urlencode({"paste_code": file}))
This will also allow you to post more fields:
curl.setopt(pycurl.POSTFIELDS, urllib.urlencode({"paste_code": file, "paste_name": name}))
import pycurl, os

def send(file_contents, name):
    print "Sending file to pastebin...."
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, "http://pastebin.com/api_public.php")
    curl.setopt(pycurl.POST, True)
    curl.setopt(pycurl.POSTFIELDS, "paste_code=%s&paste_name=%s" \
                % (file_contents, name))
    curl.setopt(pycurl.NOPROGRESS, True)
    curl.perform()

if __name__ == "__main__":
    content = raw_input("Provide the FULL path to the file: ")
    with open(content, 'r') as f:
        send(f.read(), "yournamehere")
    print
When reading files, use the with statement (this makes sure your file gets closed properly if something goes wrong).
There's no need to have a main function and then call it. Use the if __name__ == "__main__" construct to have your script run automagically when executed directly (but not when imported as a module).
For posting multiple values, you can manually build the POST string: just separate the different key-value pairs with an ampersand (&), like this: key1=value1&key2=value2. Or you can build one with urllib.urlencode (as others suggested).
EDIT: using urllib.urlencode on the strings to be posted makes sure the content is encoded properly when your source string contains funny / reserved / unusual characters.
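As a quick illustration of that point (assuming Python 2, like the rest of this answer), urlencode takes care of spaces, ampersands and quotes for you:
import urllib

fields = {"paste_code": "print 'hello & goodbye'", "paste_name": "my test"}
print urllib.urlencode(fields)
# e.g. paste_code=print+%27hello+%26+goodbye%27&paste_name=my+test (key order may vary)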
use .read() instead of .readlines()
POSTFIELDS should be sent the same way you send query-string arguments. So, first, it's necessary to URL-encode the string that you're sending as paste_code, and then, using &, you can add more POST arguments.
Example:
paste_code=hello%20world&paste_name=test
Good luck!
