I have a piece of code to read two files, convert them to sets, and then subtract one set from the other. I would like to use a string variable (installedPackages) for "a" instead of a file. I would also like to write to a variable for "c".
a = open("/home/user/packages1.txt")
b = open("/home/user/packages.txt")
c = open("/home/user/unique.txt", "w")
for line in set(a) - set(b):
    c.write(line)
a.close()
b.close()
c.close()
I have tried the following and it does not work:
for line in set(installedPackages) - set(b):
I have tried to use StringIO, but I think I am using it improperly.
Here, finally, is how I have created installedPackages:
stdout, stderr = p.communicate()
installedPackages = re.sub(r'\n$', '', re.sub(r'install$', '', re.sub(r'\t', '', stdout), 0, re.MULTILINE))
Sample of packages.txt:
humanity-icon-theme
hunspell-en-us
hwdata
hyphen-en-us
ibus
ibus-gtk
ibus-gtk3
ibus-pinyin
ibus-pinyin-db-android
ibus-table
If you want to write to a file-like string buffer, use StringIO (on Python 3, import it as from io import StringIO):
>>> from StringIO import StringIO
>>> installed_packages = StringIO()
>>> installed_packages.write('test')
>>> installed_packages.getvalue()
'test'
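For the question's actual goal you don't need a file-like object at all, though; a minimal sketch (assuming installedPackages holds newline-separated package names, as the question's re.sub chain produces):
installed = set(installedPackages.splitlines())
with open("/home/user/packages.txt") as b:
    known = set(ln.strip() for ln in b)
unique = "\n".join(installed - known)  # a string variable standing in for the old "c" file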
Something like the following?
Edit: after several iterations:
from subprocess import Popen, PIPE

DEBUG = True

if DEBUG:
    def log(msg, data):
        print(msg)
        print(repr(data))
else:
    def log(msg, data):
        pass

def setFromFile(fname):
    with open(fname) as inf:
        return set(ln.strip() for ln in inf)

def setFromString(s):
    return set(ln.strip() for ln in s.split("\n"))

def main():
    # get list of installed packages
    p = Popen(['dpkg', '--get-selections'], stdout=PIPE, stderr=PIPE)
    stdout, stderr = p.communicate()
    installed_packages = setFromString(stdout)
    # get list of expected packages
    known_packages = setFromFile('/home/john/packages.txt')
    # calculate the difference
    unknown_packages = installed_packages - known_packages
    unknown_packages_string = "\n".join(unknown_packages)
    log("Installed packages:", installed_packages)
    log("Known packages:", known_packages)
    log("Unknown packages:", unknown_packages)

if __name__ == "__main__":
    main()
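One caveat worth adding: on Python 3, p.communicate() returns bytes by default, so setFromString(stdout) would fail; a minimal adjustment, if that applies:
p = Popen(['dpkg', '--get-selections'], stdout=PIPE, stderr=PIPE,
          universal_newlines=True)  # decode stdout/stderr to str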
The set data type takes an iterable as a parameter, so if installedPackages is a string with multiple items you need to split it by the delimiter. For example, the following code would split the string by all commas:
for line in set(installedPackages.split(',')) - set(b):
    c.write(line)
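Since the package list in the question is newline-separated rather than comma-separated, splitlines() is probably the better fit; the same idea under that assumption:
for line in set(installedPackages.splitlines()) - set(ln.strip() for ln in b):
    c.write(line + "\n")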
I want to prevent anything from being written to stdout and instead capture it in a variable.
Now the problem:
I have two methods of printing something to stdout.
Method 1:
print("Normal Print")
Method 2:
import sys
from typing import BinaryIO, List

fds: List[BinaryIO] = [sys.stdin.buffer, sys.stdout.buffer, sys.stderr.buffer]
fds[1].write(b"%d\n" % a)  # `a` is an integer defined elsewhere
fds[1].flush()
Now I tried something like this:
import sys
from io import BufferedWriter, BytesIO, StringIO

mystdout = StringIO()
mystdout.buffer = BufferedWriter(raw=BytesIO())
sys.stdout = mystdout
But with this I get no output at all.
What is the best way to achieve this?
What do you mean that you get no output at all? It's in the variable:
import sys
from io import BufferedRandom, BytesIO, StringIO

mystdout = StringIO()
mystdout.buffer = BufferedRandom(raw=BytesIO())  # you can read back from a BufferedRandom
sys.stdout = mystdout

sys.stdout.buffer.write(b"BUFFER")
print("PRINT")

sys.stdout = sys.__stdout__  # restore the original stdout
print(mystdout.getvalue())   # PRINT
mystdout.buffer.seek(0)
print(mystdout.buffer.read())  # b"BUFFER"
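If only text written via print-style calls needs capturing, contextlib.redirect_stdout is a tidier pattern; a minimal sketch (note that a plain StringIO has no working .buffer attribute, so byte writes to sys.stdout.buffer still need the BytesIO shim above):
import io
from contextlib import redirect_stdout

buf = io.StringIO()
with redirect_stdout(buf):
    print("captured")        # goes into buf, not the terminal
captured = buf.getvalue()    # "captured\n"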
Is there any way to write both commands and their output to an external file?
Let's say I have a script outtest.py:
import random
from statistics import median, mean
d = [random.random()**2 for _ in range(1000)]
d_mean = round(mean(d), 2)
print(f'mean: {d_mean}')
d_median = round(median(d), 2)
print(f'median: {d_median}')
Now if I want to capture its output only I know I can just do:
python3 outtest.py > outtest.txt
However, this will only give me an outtest.txt file with e.g.:
mean: 0.34
median: 0.27
What I'm looking for is a way to get an output file like:
import random
from statistics import median, mean
d = [random.random()**2 for _ in range(1000)]
d_mean = round(mean(d), 2)
print(f'mean: {d_mean}')
>> mean: 0.34
d_median = round(median(d), 2)
print(f'median: {d_median}')
>> median: 0.27
Or some other format (markdown, whatever). Essentially, something like jupyter notebook or Rmarkdown but with using standard .py files.
Is there any easy way to achieve this?
Here's a script I just wrote which quite comprehensively captures printed output and prints it alongside the code, no matter how it's printed or how much is printed in one go. It uses the ast module to parse the Python source, executes the program one statement at a time (kind of as if it were fed to the REPL), then prints the output from each statement. Python 3.6+ (but easily modified for e.g. Python 2.x):
import ast
import sys

if len(sys.argv) < 2:
    print(f"Usage: {sys.argv[0]} <script.py> [args...]")
    exit(1)

# Replace stdout so we can mix program output and source code cleanly
real_stdout = sys.stdout

class FakeStdout:
    ''' A replacement for stdout that prefixes # to every line of output, so it can be mixed with code. '''
    def __init__(self, file):
        self.file = file
        self.curline = ''

    def _writerow(self, row):
        self.file.write('# ')
        self.file.write(row)
        self.file.write('\n')

    def write(self, text):
        if not text:
            return
        rows = text.split('\n')
        self.curline += rows.pop(0)
        if not rows:
            return
        for row in rows:
            self._writerow(self.curline)
            self.curline = row

    def flush(self):
        if self.curline:
            self._writerow(self.curline)
            self.curline = ''

sys.stdout = FakeStdout(real_stdout)

class EndLineFinder(ast.NodeVisitor):
    ''' This class functions as a replacement for the somewhat unreliable end_lineno attribute.
        It simply finds the largest line number among all child nodes. '''
    def __init__(self):
        self.max_lineno = 0
    def generic_visit(self, node):
        if hasattr(node, 'lineno'):
            self.max_lineno = max(self.max_lineno, node.lineno)
        ast.NodeVisitor.generic_visit(self, node)

# Pretend the script was called directly
del sys.argv[0]

# We'll walk each statement of the file and execute it separately.
# This way, we can place the output for each statement right after the statement itself.
filename = sys.argv[0]
source = open(filename, 'r').read()
lines = source.split('\n')
module = ast.parse(source, filename)
env = {'__name__': '__main__'}
prevline = 0
endfinder = EndLineFinder()

for stmt in module.body:
    # note: end_lineno will be 1-indexed (but it's always used as an endpoint, so no off-by-one errors here)
    endfinder.visit(stmt)
    end_lineno = endfinder.max_lineno
    for line in range(prevline, end_lineno):
        print(lines[line], file=real_stdout)
    prevline = end_lineno
    # run a one-line "module" containing only this statement
    exec(compile(ast.Module([stmt]), filename, 'exec'), env)
    # flush any incomplete output (FakeStdout is "line-buffered")
    sys.stdout.flush()
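One portability note: on Python 3.8 and newer, ast.Module has a required type_ignores field, so the exec line needs a small tweak there:
exec(compile(ast.Module(body=[stmt], type_ignores=[]), filename, 'exec'), env)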
Here's a test script:
print(3); print(4)
print(5)
if 1:
    print(6)
x = 3
for i in range(6):
    print(x + i)
import sys
sys.stdout.write('I love Python')
import pprint
pprint.pprint({'a': 'b', 'c': 'd'}, width=5)
and the result:
print(3); print(4)
# 3
# 4
print(5)
# 5
if 1:
    print(6)
# 6
x = 3
for i in range(6):
    print(x + i)
# 3
# 4
# 5
# 6
# 7
# 8
import sys
sys.stdout.write('I love Python')
# I love Python
import pprint
pprint.pprint({'a': 'b', 'c': 'd'}, width=5)
# {'a': 'b',
#  'c': 'd'}
You can call the script and inspect its output. You'll have to make some assumptions, though: the output comes only from stdout, only from lines containing the string "print", and each print produces only one line of output. That being the case, an example command to run it:
> python writer.py script.py
And the script would look like this:
from sys import argv
from subprocess import run, PIPE

script = argv[1]
# pass an argument list rather than one string, so no shell is needed
r = run(['python', script], stdout=PIPE)
out = r.stdout.decode().split('\n')

with open(script, 'r') as f:
    lines = f.readlines()

with open('out.txt', 'w') as f:
    i = 0
    for line in lines:
        f.write(line)
        if 'print' in line:
            f.write('>>> ' + out[i] + '\n')  # captured output lines have no trailing newline
            i += 1
And the output:
import random
from statistics import median, mean
d = [random.random()**2 for _ in range(1000)]
d_mean = round(mean(d), 2)
print(f'mean: {d_mean}')
>>> mean: 0.33
d_median = round(median(d), 2)
print(f'median: {d_median}')
>>> median: 0.24
For more complex cases with multiline output, or with other statements producing output, this won't work; I guess it would require some in-depth introspection.
My answer might be similar to @Felix's answer, but I'm trying to do it in a Pythonic way.
Just grab the source code (mycode), then execute it and capture the results, and finally write them back to the output file.
Remember that I've used exec to demonstrate the solution; you shouldn't use it in a production environment.
I also used this answer to capture stdout from exec.
import mycode
import inspect
import sys
from io import StringIO
import contextlib

source_code = inspect.getsource(mycode)
output_file = "output.py"

@contextlib.contextmanager
def stdoutIO(stdout=None):
    old = sys.stdout
    if stdout is None:
        stdout = StringIO()
    sys.stdout = stdout
    yield stdout
    sys.stdout = old

# execute code
with stdoutIO() as s:
    exec(source_code)

# capture stdout
stdout = s.getvalue().splitlines()[::-1]

# write to file
with open(output_file, "w") as f:
    for line in source_code.splitlines():
        f.write(line)
        f.write('\n')
        if 'print' in line:
            f.write(">> {}".format(stdout.pop()))
            f.write('\n')
You can assign each result to a variable before printing it, and later write those values to a file.
Here is an example:
import random
from statistics import median, mean

d = [random.random()**2 for _ in range(1000)]
d_mean = round(mean(d), 2)
foo = f'mean: {d_mean}'
print(foo)
d_median = round(median(d), 2)
bar = f'median: {d_median}'
print(bar)

# save to file
with open("outtest.txt", "a+") as file:
    file.write(foo + "\n")
    file.write(bar + "\n")
    # no explicit close() needed; the with statement closes the file
The argument "a+" means the file is opened for appending, so new content is added after any existing content instead of overwriting it if the file already exists.
I have a csv file which contains hundreds of thousands of rows, and below are some sample lines:
1,Ni,23,28-02-2015 12:22:33.2212-02
2,Fi,21,28-02-2015 12:22:34.3212-02
3,Us,33,30-03-2015 12:23:35-01
4,Uk,34,31-03-2015 12:24:36.332211-02
I need to get the last column of the csv data, which is in the wrong datetime format. So I need to convert the last column of the data to the default datetime format ("YYYY-MM-DD hh:mm:ss[.nnn]").
I have tried the following script to get lines from it and write them into a flow file.
import json
import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        text = IOUtils.readLines(inputStream, StandardCharsets.UTF_8)
        for line in text[1:]:
            outputStream.write(line + "\n")

flowFile = session.get()
if (flowFile != None):
    flowFile = session.write(flowFile, PyStreamCallback())
    flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename'))
    session.transfer(flowFile, REL_SUCCESS)
but I am not able to find a way to convert it to output like the below:
1,Ni,23,28-02-2015 12:22:33.221
2,Fi,21,29-02-2015 12:22:34.321
3,Us,33,30-03-2015 12:23:35
4,Uk,34,31-03-2015 12:24:36.332
I have checked for solutions with my friend (Google) and was still not able to find one.
Can anyone guide me on converting this input data into my required output?
In this transformation the unnecessary data is located at the end of each line, so it's really easy to handle the transform task with a regular expression.
^(.*:\d\d)((\.\d{1,3})(\d*))?(-\d\d)?
Check the regular expression and explanation here:
https://regex101.com/r/sAB4SA/2
Since you have a large file, it's better not to load it into memory. The following code loads the whole file into memory:
IOUtils.readLines(inputStream, StandardCharsets.UTF_8)
Better to iterate line by line.
So this code is for the NiFi ExecuteScript processor with the Python (Jython) language:
import sys
import re
import traceback
from org.apache.commons.io import IOUtils
from org.apache.nifi.processor.io import StreamCallback
from org.python.core.util import StringUtil
from java.lang import Class
from java.io import BufferedReader
from java.io import InputStreamReader
from java.io import OutputStreamWriter

class TransformCallback(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        try:
            writer = OutputStreamWriter(outputStream, "UTF-8")
            reader = BufferedReader(InputStreamReader(inputStream, "UTF-8"))
            line = reader.readLine()
            p = re.compile(r'^(.*:\d\d)((\.\d{1,3})(\d*))?(-\d\d)?')
            while line != None:
                # print line
                match = p.search(line)
                writer.write(match.group(1) + (match.group(3) if match.group(3) != None else ''))
                writer.write('\n')
                line = reader.readLine()
            writer.flush()
            writer.close()
            reader.close()
        except:
            traceback.print_exc(file=sys.stdout)
            raise

flowFile = session.get()
if flowFile != None:
    flowFile = session.write(flowFile, TransformCallback())
    # Finish by transferring the FlowFile to an output relationship
    session.transfer(flowFile, REL_SUCCESS)
And since the question is about NiFi, here are alternatives that seem to be easier.
The same code as above, but in Groovy for the NiFi ExecuteScript processor:
def ff = session.get()
if(!ff)return
ff = session.write(ff, {rawIn, rawOut->
    // ## transform streams into reader and writer
    rawIn.withReader("UTF-8"){reader->
        rawOut.withWriter("UTF-8"){writer->
            reader.eachLine{line, lineNum->
                if(lineNum>1) { // # skip the first line
                    // ## let's use a regular expression to transform each line
                    writer << line.replaceAll( /^(.*:\d\d)((\.\d{1,3})(\d*))?(-\d\d)?/ , '$1$3' ) << '\n'
                }
            }
        }
    }
} as StreamCallback)
session.transfer(ff, REL_SUCCESS)
ReplaceText processor
And if a regular expression is OK, the easiest way in NiFi is a ReplaceText processor that can do a regular-expression replace line by line.
In this case you don't need to write any code; just build the regular expression and configure your processor correctly.
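A sketch of the relevant ReplaceText properties (names as in recent NiFi releases; verify against your version):
Search Value:          ^(.*:\d\d)((\.\d{1,3})(\d*))?(-\d\d)?$
Replacement Value:     $1$3
Replacement Strategy:  Regex Replace
Evaluation Mode:       Line-by-Line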
Just using pure Jython. This is an example that can be adapted to the OP's needs.
Define a datetime parser for this csv file
from datetime import datetime

def parse_datetime(dtstr):
    mydatestr = '-'.join(dtstr.split('-')[:-1])
    try:
        return datetime.strptime(mydatestr, '%d-%m-%Y %H:%M:%S.%f').strftime('%d-%m-%Y %H:%M:%S.%f')[:-3]
    except ValueError:
        return datetime.strptime(mydatestr, '%d-%m-%Y %H:%M:%S').strftime('%d-%m-%Y %H:%M:%S')
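A quick sanity check of the parser against two of the sample rows (expected results in the comments):
print(parse_datetime('28-02-2015 12:22:33.2212-02'))  # 28-02-2015 12:22:33.221
print(parse_datetime('30-03-2015 12:23:35-01'))       # 30-03-2015 12:23:35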
My test.csv includes data like this (2015 didn't have a 29 Feb, so I had to change the OP's example):
1,Ni,23,27-02-2015 12:22:33.2212-02
2,Fi,21,28-02-2015 12:22:34.3212-02
3,Us,33,30-03-2015 12:23:35-01
4,Uk,34,31-03-2015 12:24:36.332211-02
Now the solution:
with open('test.csv') as fi:
    for line in fi:
        line_split = line.split(',')
        out_line = ', '.join(word if i < 3 else parse_datetime(word) for i, word in enumerate(line_split))
        #print(out_line)
        #you can write this out_line to a file here.
Printing out_line looks like this:
1, Ni, 23, 27-02-2015 12:22:33.221
2, Fi, 21, 28-02-2015 12:22:34.321
3, Us, 33, 30-03-2015 12:23:35
4, Uk, 34, 31-03-2015 12:24:36.332
You can get them with a regex:
(\d\d-\d\d-\d\d\d\d\ \d\d:\d\d:)(\d+(?:\.\d+)*)(-\d\d)$
Then just replace group #2 with a rounded version of itself.
See the regex example at regexr.com.
You could even do it "nicer" by getting every single value with a capturing group, putting them into a datetime.datetime object, and printing it from there, but imho that would be overkill in maintainability and lose you too much performance.
Code (I had no possibility to test it):
import re
...
pattern = r'^(.{25})(\d+(?:\.\d+)*)(-\d\d)$'  # used a fixed offset for simplicity
....
for line in text[1:]:
    match = re.search(pattern, line)
    # match.group(2) is a string, so convert it before rounding
    line = match.group(1) + str(round(float(match.group(2)), 3)) + match.group(3)
    outputStream.write(line + "\n")
Can I configure Python to have MATLAB-like print behaviour, so that when I just have a function call
returnObject()
it simply prints that object without me having to type print around it? I assume this is not easy, but I want something like: if an object does not get bound to some other variable, it should get printed, so that this would work:
a = 5 #prints nothing
b = getObject() #prints nothing
a #prints 5
b #prints getObject()
getObject() #prints the object
If you use an IPython notebook, individual cells work like this. But you can only view one object per cell by typing the object's name. To see multiple objects you'd need to call print, or use lots of cells.
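In IPython/Jupyter you can also ask the shell to echo every top-level expression rather than only the last one; a minimal sketch (assuming a reasonably recent IPython):
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # echo every expression, not only the last

a = 5
a        # now displayed
a * 2    # also displayed, in the same cell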
You could write a script to modify the original script based on a set of rules that define what to print, then run the modified script.
A basic script to do this would be:
f = open('main.py', 'r')
p = open('modified.py', 'w')
p.write('def main(): \n')
for line in f:
    temp = line
    if len(temp.strip()) == 1:  # a bare single-character expression such as `x`
        temp = 'print(' + temp.strip() + ')\n'
    p.write('\t' + temp)
p.close()
f.close()

from modified import main
main()
The script main.py would then look like this:
x = 236
x
output:
236
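For the main.py above, the generated modified.py would look roughly like this (hypothetical output of the rewriting script; the real file uses a tab for indentation):
def main(): 
    x = 236
    print(x)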
The idea is as follows: parse the AST of the Python code, replace every expression statement with a call to print that takes the content of the expression as its argument, and then run the modified version. I'm not sure whether it works with every piece of code, but you might try. Save it as matlab.py and run your code as python3 -m matlab file.py.
#!/usr/bin/env python3
import ast
import sys

class PrintAdder(ast.NodeTransformer):
    def add_print(self, node):
        print_func = ast.Name("print", ast.Load())
        print_call = ast.Call(print_func, [node.value], [])
        print_statement = ast.Expr(print_call)
        return print_statement

    def visit_Expr(self, node):
        # leave existing print(...) calls alone (getattr: the callee may not be a plain name)
        if isinstance(node.value, ast.Call) and getattr(node.value.func, 'id', None) == 'print':
            return node
        return self.add_print(node)

def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', type=argparse.FileType(), nargs='?', default='-')
    args = parser.parse_args()

    with args.infile as infile:
        code = infile.read()
    file_name = args.infile.name
    tree = ast.parse(code, file_name, 'exec')
    tree = PrintAdder().visit(tree)
    tree = ast.fix_missing_locations(tree)
    bytecode = compile(tree, file_name, 'exec')
    exec(bytecode)

if __name__ == '__main__':
    main()
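A quick run against a two-line file (hypothetical session):
$ cat file.py
a = 5
a
$ python3 -m matlab file.py
5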
#!/usr/bin/python
import subprocess as sp
args = ["awk", r'/^word/ { print $1}','anyfile.py' ]
p = sp.Popen(args, stdin = sp.PIPE, stdout = sp.PIPE, stderr = sp.PIPE )
How can I get a word at the beginning of a line in spite of leading tabs and/or whitespace?
print p.stdout.read()
You can simply use a regular expression, like this:
import re
re.match(r'^\s*word', line)
Here,
^ indicates beginning of string
\s* means zero or more whitespace characters
word is the actual word you are looking for.
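Applied to the question's subprocess output, a sketch (assuming the Python 2 style of the question, where p.stdout.read() returns a str):
import re
for line in p.stdout.read().splitlines():
    if re.match(r'^\s*word', line):
        print(line)  # the line starts with "word" after any leading whitespace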
What about using startswith after using strip?
>>> '\n \t \r asasasas ash'.strip().startswith('asa')
True
You can adapt the search pattern used by awk to accept leading whitespace characters in your input file:
import subprocess as sp
args = ["awk", r'/^\s*word/ { print $1}','anyfile.py' ]
p = sp.Popen(args, stdin = sp.PIPE, stdout = sp.PIPE, stderr = sp.PIPE )
print p.stdout.read()
But in this case, I don't see why you wouldn't perform the parsing directly in Python:
with open("anyfile.py") as f:
    for line in f:
        if line.lstrip().startswith("word"):
            print "found match!"
For reference:
https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
https://docs.python.org/2/library/stdtypes.html#str.lstrip
https://docs.python.org/2/library/stdtypes.html#str.startswith
Using only built-in string methods, it's as easy as:
str(p.stdout.read()).split()[0]
This should give you the first word in the string.
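If the first word of every line is wanted, rather than the first word of the whole output, a per-line variant under the same assumptions:
for line in p.stdout.read().splitlines():
    words = line.split()
    if words:  # skip blank lines
        print(words[0])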
Sounds like what you really want to do is parse the abstract syntax tree of the Python file using the ast module. There are much better ways of doing that than using regular expressions. Here's an example:
import ast

class FunctionVisitor(ast.NodeVisitor):
    def __init__(self):
        self.second_arg_names = []

    def visit_FunctionDef(self, func):
        """collect function names and the second argument of functions that
        have two or more arguments"""
        args = func.args.args
        if len(args) > 1:
            self.second_arg_names.append((func.name, args[1].id))
        self.generic_visit(func)

def find_func_args(filename=__file__):
    """defaults to looking at this file"""
    with open(filename) as f:
        source = f.read()
    tree = ast.parse(source)
    visitor = FunctionVisitor()
    visitor.visit(tree)
    print visitor.second_arg_names
    assert visitor.second_arg_names == [("visit_FunctionDef", "func")]

if __name__ == "__main__":
    find_func_args()