Accessing MALLET's diagnostics file via Gensim

Accessing MALLET's diagnostics file via Gensim - python

Is there a way to access MALLET's diagnostics file or its content by using the provided API via Gensim in Python?

Seems like there is no possibility.
I solved this issue by running MALLET in the command line via Python's subprocess module:
import subprocess
from pathlib import Path
MALLET_PATH = r"C:\mallet" # set to where your "bin/mallet" path is
seglen = 500
topic_count = 20
start = 0
iterations = 20
num_threads = 10 # determines threads used for parallel training
# remember to change backslashes if needed
wdir = Path("../..")
corpusdir = wdir.joinpath("5_corpus", f"seglen-{seglen}")
corpusdir.mkdir(exist_ok=True, parents=True)
mallet_dir = wdir.joinpath("6_evaluation/models/mallet", f"seglen-{seglen}")
topic_dir = mallet_dir.joinpath(f"topics-{topic_count}")
def create_input_files():
# create MALLETs input files
for file in corpusdir.glob("*.txt"):
output = mallet_dir.joinpath(f"{file.stem}.mallet")
# doesn't need to happen more than once -- usually.
if output.is_file(): continue
print(f"--{file.stem}")
cmd = f"bin\\mallet import-file " \
f"--input {file.absolute()} " \
f"--output {output.absolute()} " \
f"--keep-sequence"
subprocess.call(cmd, cwd=MALLET_PATH, shell=True)
print("import finished")
def modeling():
# start modeling
for file in mallet_dir.glob("*.mallet"):
for i in range(start, iterations):
print("iteration ", str(i))
print(f"--{file.stem}")
# output directory
formatdir = topic_dir.joinpath(f"{file.stem.split('-')[0]}")
outputdir = formatdir.joinpath(f"iteration-{i}")
outputdir.mkdir(parents=True, exist_ok=True)
outputdir = str(outputdir.absolute())
# output files
statefile = outputdir + r"\topic-state.gz"
keysfile = outputdir + r"\keys.txt"
compfile = outputdir + r"\composition.txt"
diagnostics_xml = outputdir + r"\diagnostics.xml"
# building cmd string
cmd = f"bin\\mallet train-topics " \
f"--input {file.absolute()} " \
f"--num-topics {topic_count} " \
f"--output-state {statefile} " \
f"--output-topic-keys {keysfile} " \
f"--output-doc-topics {compfile} " \
f"--diagnostics-file {diagnostics_xml} " \
f"--num-threads {num_threads}"
# call mallet
subprocess.call(cmd, cwd=MALLET_PATH, shell=True)
print("models trained")
#create_input_files()
modeling()

Related

Trimming videos with 'ffmpeg and ffprobe'

I am working on an ETL process, and I'm now in the final stage of preprocessing my videos. I used the script below (reference: #FarisHijazi) to first auto detected black-screen frames using ffprobe and trim them out using ffmpeg.
The script worked for me but the problems are:
It cut off all other good frames together with the first bad frames. e.g. if gBgBgBgB represents a sequence of good and BAD frames for 5sec each, the script only returned the first g(5sec) and cut off the other BgBgBgB after it. I want to have only g g g g where all B B B B has been removed
I also want to detect other colors aside black-screen e.g. green-screen or red-screen or blurry part of video
Script doesn't work if video has no audio in it.
import argparse
import os
import shlex
import subprocess
parser = argparse.ArgumentParser(
__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
parser.add_argument("input", type=str, help="input video file")
parser.add_argument(
"--invert",
action="store_true",
help="remove nonblack instead of removing black",
)
args = parser.parse_args()
##FIXME: sadly you must chdir so that the ffprobe command will work
os.chdir(os.path.split(args.input)[0])
args.input = os.path.split(args.input)[1]
spl = args.input.split(".")
outpath = (
".".join(spl[:-1])
+ "."
+ ("invert" if args.invert else "")
+ "out."
+ spl[-1]
)
def delete_back2back(l):
from itertools import groupby
return [x[0] for x in groupby(l)]
def construct_ffmpeg_trim_cmd(timepairs, inpath, outpath):
cmd = f'ffmpeg -i "{inpath}" -y -r 20 -filter_complex '
cmd += '"'
for i, (start, end) in enumerate(timepairs):
cmd += (
f"[0:v]trim=start={start}:end={end},setpts=PTS-STARTPTS,format=yuv420p[{i}v]; "
+ f"[0:a]atrim=start={start}:end={end},asetpts=PTS-STARTPTS[{i}a]; "
)
for i, (start, end) in enumerate(timepairs):
cmd += f"[{i}v][{i}a]"
cmd += f"concat=n={len(timepairs)}:v=1:a=1[outv][outa]"
cmd += '"'
cmd += f' -map [outv] -map [outa] "{outpath}"'
return cmd
def get_blackdetect(inpath, invert=False):
ffprobe_cmd = f'ffprobe -f lavfi -i "movie={inpath},blackdetect[out0]" -show_entries tags=lavfi.black_start,lavfi.black_end -of default=nw=1 -v quiet'
print("ffprobe_cmd:", ffprobe_cmd)
lines = (
subprocess.check_output(shlex.split(ffprobe_cmd))
.decode("utf-8")
.split("\n")
)
times = [
float(x.split("=")[1].strip()) for x in delete_back2back(lines) if x
]
assert len(times), "no black scene detected"
if not invert:
times = [0] + times[:-1]
timepairs = [
(times[i], times[i + 1]) for i in range(0, len(times) // 2, 2)
]
return timepairs
if __name__ == "__main__":
timepairs = get_blackdetect(args.input, invert=args.invert)
cmd = construct_ffmpeg_trim_cmd(timepairs, args.input, outpath)
print(cmd)
os.system(cmd)

Weird NameError with python function in Snakemake script

This is an extension of a question I asked yesterday. I have looked all over StackOverflow and have not found an instance of this specific NameError:
Building DAG of jobs...
Updating job done.
InputFunctionException in line 148 of /home/nasiegel/2022-h1n1/Snakefile:
Error:
NameError: free variable 'combinator' referenced before assignment in enclosing scope
Wildcards:
Traceback:
File "/home/nasiegel/2022-h1n1/Snakefile", line 131, in aggregate_decompress_h1n1
I assumed it was an issue having to do with the symbolic file paths in my function:
def aggregate_decompress_h1n1(wildcards):
checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
filenames = expand(
SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz",
SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip",
OUTPUTDIR + "{basenames}_quant/quant.sf",
basenames = glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames)
return filenames
However, hardcoding the paths does not resolve the issue. I've attached the full Snakefile below any advice would be appreciated.
Original file
# Snakemake file - input raw reads to generate quant files for analysis in R
configfile: "config.yaml"
import io
import os
import pandas as pd
import pathlib
from snakemake.exceptions import print_exception, WorkflowError
#----SET VARIABLES----#
PROJ = config["proj_name"]
INPUTDIR = config["raw-data"]
SCRATCH = config["scratch"]
REFERENCE = config["ref"]
OUTPUTDIR = config["outputDIR"]
# Adapters
SE_ADAPTER = config['seq']['SE']
SE_SEQUENCE = config['seq']['trueseq-se']
# Organsim
TRANSCRIPTOME = config['transcriptome']['rhesus']
SPECIES = config['species']['rhesus']
SAMPLE_LIST = glob_wildcards(INPUTDIR + "{basenames}_R1.fastq.gz").basenames
rule all:
input:
"final.txt",
# dowload referemce files
REFERENCE + SE_ADAPTER,
REFERENCE + SPECIES,
# multiqc
SCRATCH + "fastqc/raw_multiqc.html",
SCRATCH + "fastqc/raw_multiqc_general_stats.txt",
SCRATCH + "fastqc/trimmed_multiqc.html",
SCRATCH + "fastqc/trimmed_multiqc_general_stats.txt"
rule download_trimmomatic_adapter_file:
output: REFERENCE + SE_ADAPTER
shell: "curl -L -o {output} {SE_SEQUENCE}"
rule download_transcriptome:
output: REFERENCE + SPECIES
shell: "curl -L -o {output} {TRANSCRIPTOME}"
rule download_data:
output: "high_quality_files.tgz"
shell: "curl -L -o {output} https://osf.io/pcxfg/download"
checkpoint decompress_h1n1:
output: directory(INPUTDIR)
input: "high_quality_files.tgz"
shell: "tar xzvf {input}"
rule fastqc:
input: INPUTDIR + "{basenames}_R1.fastq.gz"
output:
raw_html = SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
raw_zip = SCRATCH + "fastqc/{basenames}_R1_fastqc.zip"
conda: "env/rnaseq.yml"
wrapper: "0.80.3/bio/fastqc"
rule multiqc:
input:
raw_qc = expand(SCRATCH + "fastqc/{basenames}_R1_fastqc.zip", basenames=SAMPLE_LIST),
trim_qc = expand(SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip", basenames=SAMPLE_LIST)
output:
raw_multi_html = SCRATCH + "fastqc/raw_multiqc.html",
raw_multi_stats = SCRATCH + "fastqc/raw_multiqc_general_stats.txt",
trim_multi_html = SCRATCH + "fastqc/trimmed_multiqc.html",
trim_multi_stats = SCRATCH + "fastqc/trimmed_multiqc_general_stats.txt"
conda: "env/rnaseq.yml"
shell:
"""
multiqc -n multiqc.html {input.raw_qc} #run multiqc
mv multiqc.html {output.raw_multi_html} #rename html
mv multiqc_data/multiqc_general_stats.txt {output.raw_multi_stats} #move and rename stats
rm -rf multiqc_data #clean-up
#repeat for trimmed data
multiqc -n multiqc.html {input.trim_qc} #run multiqc
mv multiqc.html {output.trim_multi_html} #rename html
mv multiqc_data/multiqc_general_stats.txt {output.trim_multi_stats} #move and rename stats
rm -rf multiqc_data #clean-up
"""
rule trimmmomatic_se:
input:
reads= INPUTDIR + "{basenames}_R1.fastq.gz",
adapters= REFERENCE + SE_ADAPTER,
output:
reads = SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
unpaired = SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz"
conda: "env/rnaseq.yml"
log: SCRATCH + "logs/fastqc/{basenames}_R1_trim_unpaired.log"
shell:
"""
trimmomatic SE {input.reads} \
{output.reads} {output.unpaired} \
ILLUMINACLIP:{input.adapters}:2:0:15 LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 MINLEN:25
"""
rule fastqc_trim:
input: SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz"
output:
html = SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
zip = SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip"
log: SCRATCH + "logs/fastqc/{basenames}_R1_trimmed.log"
conda: "env/rnaseq.yml"
wrapper: "0.35.2/bio/fastqc"
rule salmon_quant:
input:
reads = SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
index_dir = OUTPUTDIR + "quant/sc_ensembl_index"
output: OUTPUTDIR + "{basenames}_quant/quant.sf"
params: OUTPUTDIR + "{basenames}_quant"
log: SCRATCH + "logs/salmon/{basenames}_quant.log"
conda: "env/rnaseq.yml"
shell:
"""
salmon quant -i {input.index_dir} --libType A -r {input.reads} -o {params} --seqBias --gcBias --validateMappings
"""
def aggregate_decompress_h1n1(wildcards):
checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
filenames = expand(
SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz",
SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip",
OUTPUTDIR + "{basenames}_quant/quant.sf",
basenames = glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames)
return filenames
rule salmon_index:
input: REFERENCE + SPECIES
output: directory(OUTPUTDIR + "quant/sc_ensembl_index")
conda: "env/rnaseq.yml"
shell: "salmon index --index {output} --transcripts {input} # --type quasi"
rule done:
input: aggregate_decompress_h1n1
output: "final.txt"
shell: "touch {output}"

I think it's due to you used expand function in a wrong way, expand only accepts two positional arguments, where the first one is pattern and the second one is function (optional). If you want to supply multiple patterns you should wrap these patterns in list.
After some studying on source code of snakemake, it turns out expand function doesn't check if user provides < 3 positional arguments, there is a variable combinator in if-else that would only be created when there are 1 or 2 positional arguments, the massive amount of positional arguments you provide skip this part and lead to the error when it tries to use combinator later.
Source code: https://snakemake.readthedocs.io/en/v6.5.4/_modules/snakemake/io.html

Trucanted string in subprocess.call()

import os,subprocess,io
path = "C:\\Users\\Awesome\\Music\\unconverted"
des = "C:\\Users\\Awesome\\Music\\converted"
def convert( path, des):
command = "ffmpeg -i " +path+" -ab 192k "+des + "-y "
subprocess.call(command)
for song in os.listdir(path):
filepath = os.path.join(path,song)
despath = os.path.join(des, song[len(song)-3]+"mp3")
convert(filepath,despath)
print("complete")
this code return this error
C:\Users\Awesome\Music\unconverted\KYLE: No such file or directory
the full file name is C:\Users\Awesome\Music\unconverted\KYLE - Playinwitme (feat Kehlani).m4a I have no idea why it is truncating after the first word.

The problem is that the command will have a path with a space in it like this ffmpeg -i C:\\Users\\Awesome\\Music\\unconverted\\KYLE - Playinwitme (feat Kehlani).m4a .....,
You should remove the spaces from the name of the file or insert the whole name inside double-quotes. Also change song[len(song)-3]+"mp3" to song[0 : len(song)-3]+"mp3"
import os,subprocess,io
path = "C:\\Users\\Awesome\\Music\\unconverted"
des = "C:\\Users\\Awesome\\Music\\converted"
def convert( path, des):
command = "ffmpeg -i " + f"\"{path}\"" + " -ab 192k " + f"\"{des}\"" + " -y"
subprocess.call(command)
for song in os.listdir(path):
filepath = os.path.join(path,song)
despath = os.path.join(des, song[0 : len(song)-3]+"mp3")
convert(filepath,despath)
print("complete")

Instead of forming a command string and passing it to subprocess.call, passing it as a list of arguments to the method will do the trick.
import os,subprocess,io
path = "C:\\Users\\Awesome\\Music\\unconverted"
des = "C:\\Users\\Awesome\\Music\\converted"
def convert( path, des):
command_lis = ["ffmpeg", "-i", path, "-ab", "192k",des,"-y"]
subprocess.call(command_lis)
for song in os.listdir(path):
filepath = os.path.join(path,song)
despath = os.path.join(des, song[0:len(song)-3]+"mp3")
convert(filepath,despath)
print("complete")

How to make a file transfer program in python

The title might not be relevant for my question becuase I don't actually want a wireless file transfering script, I need a file manager type.
I want something with which I can connect my phone with my pc (eg: hotspot and wifi) and then I would like to show text file browser (I have the code for that) by sending lists of all files and folders using os.listdir(), whenever the selected option is a file (os.path.isdir() == False), I would like to transfer the file and run it(like: picture, video, etc).
The file Browser code which I wrote runs on windows and also Android (after making a few changes) using qpython.
My code is
import os
def FileBrowser(cwd = os.getcwd()):
while True:
if cwd[-1:] != "\\":
cwd = cwd + "\\"
files = os.listdir(cwd)
count = 1
tmpp = ""
print("\n\n" + "_"*50 +"\n\n")
print(cwd + "\n")
for f in files:
if os.path.isdir(cwd + f) == True:
s1 = str(count) + ". " + f
tmps1 = 40 - (len(s1)+5)
t2 = int(tmps1/3)
s1 = s1 + " " * t2 + "-" * (tmps1 - t2)
print(s1 + "<dir>")
else:
print(str(count) + ". " + f + tmpp)
count = count + 1
s = raw_input("Enter the file/Directory: ")
if s == "...":
tmp1 = cwd.count("\\")
tmp2 = cwd.rfind("\\")
if tmp1 > 1:
cwd = cwd[0:tmp2]
tmp2 = cwd.rfind("\\")
cwd = cwd[0:tmp2+1]
continue
else:
continue
else:
s = int(s) - 1
if os.path.isdir(cwd + files[s]) == True:
cwd = cwd + files[s] + "\\"
continue
else:
f1 = files[s]
break
return f1
def main():
fb = FileBrowser()
main()

A very naive approach using Python is to go to the root of the directory you want to be served and use:
python -m SimpleHTTPServer
The connect to it on port 8000.

you may need to socket programming. creating a link (connection) between your PC and you smart phone and then try to transfer files

Imagemagick's convert errors in python script with subprocess.Popen

I am trying to generate transparent background images with a python script run from the command line but I have a hard time passing all the arguments to subprocess.Popen so that Imagemagick's convert doesn't through me errors.
Here is my code:
# Import modules
import os
import subprocess as sp
# Define useful variables
fileList = os.listdir('.')
fileList.remove(currentScriptName)
# Interpret return code
def interpretReturnCode(returnCode) :
return 'OK' if returnCode is 0 else 'ERROR, check the script'
# Create background images
def createDirectoryAndBackgroundImage() :
# Ask if numbers-height or numbers-width before creating the directory
numbersDirectoryType = raw_input('Numbers directory: type "h" for "numbers-height" or "w" for "numbers-width": ')
if numbersDirectoryType == 'h' :
# Create 'numbers-height' directory
numbersDirectoryName = 'numbers-height'
numbersDirectory = interpretReturnCode(sp.call(['mkdir', numbersDirectoryName]))
print '%s%s' % ('Create "numbers-height" directory...', numbersDirectory)
# Create background images
startNumber = int(raw_input('First number for the background images: '))
endNumber = (startNumber + len(fileList) + 1)
for x in range(startNumber, endNumber) :
createNum = []
print 'createNum just after reset and before adding things to it: ', createNum, '\n'
print 'start' , x, '\n'
createNum = 'convert -size 143x263 xc:transparent -font "FreeSans-Bold" -pointsize 22 -fill \'#242325\' "text 105,258'.split()
createNum.append('\'' + str(x) + '\'"')
createNum.append('-draw')
createNum.append('./' + numbersDirectoryName + '/' + str(x) + '.png')
print 'createNum set up, createNum submittet to subprocess.Popen: ', createNum
createNumImage = sp.Popen(createNum, stdout=sp.PIPE)
createNumImage.wait()
creationNumReturnCode = interpretReturnCode(createNumImage.returncode)
print '%s%s%s' % ('\tCreate numbers image...', creationNumReturnCode, '\n')
elif numbersDirectoryType == 'w' :
numbersDirectoryName = 'numbers-width'
numbersDirectory = interpretReturnCode(sp.call(['mkdir', numbersDirectoryName]))
print '%s%s' % ('Create "numbers-width" directory...', numbersDirectory)
# Create background images
startNumber = int(raw_input('First number for the background images: '))
endNumber = (startNumber + len(fileList) + 1)
for x in range(startNumber, endNumber) :
createNum = []
print 'createNum just after reset and before adding things to it: ', createNum, '\n'
print 'start' , x, '\n'
createNum = 'convert -size 224x122 xc:transparent -font "FreeSans-Bold" -pointsize 22-fill \'#242325\' "text 105,258'.split()
createNum.append('\'' + str(x) + '\'"')
createNum.append('-draw')
createNum.append('./' + numbersDirectoryName + '/' + str(x) + '.png')
print 'createNum set up, createNum submittet to subprocess.Popen: ', createNum
createNumImage = sp.Popen(createNum, stdout=sp.PIPE)
createNumImage.wait()
creationNumReturnCode = interpretReturnCode(createNumImage.returncode)
print '%s%s%s' % ('\tCreate numbers image...', creationNumReturnCode, '\n')
else :
print 'No such directory type, please start again'
numbersDirectoryType = raw_input('Numbers directory: type "h" for "numbers-height" or "w" for "numbers-width": ')
For this I get the following errors, for each picture:
convert.im6: unable to open image `'#242325'': No such file or directory # error/blob.c/OpenBlob/2638.
convert.im6: no decode delegate for this image format `'#242325'' # error/constitute.c/ReadImage/544.
convert.im6: unable to open image `"text': No such file or directory # error/blob.c/OpenBlob/2638.
convert.im6: no decode delegate for this image format `"text' # error/constitute.c/ReadImage/544.
convert.im6: unable to open image `105,258': No such file or directory # error/blob.c/OpenBlob/2638.
convert.im6: no decode delegate for this image format `105,258' # error/constitute.c/ReadImage/544.
convert.im6: unable to open image `'152'"': No such file or directory # error/blob.c/OpenBlob/2638.
convert.im6: no decode delegate for this image format `'152'"' # error/constitute.c/ReadImage/544.
convert.im6: option requires an argument `-draw' # error/convert.c/ConvertImageCommand/1294.
I tried to change the order of the arguments without success, to use shell=True in Popen (but then the function interpretReturCode returns a OK while no image is created (number-heights folder is empty).

I would strongly recommend following the this process:
Pick a single file and directory
change the above so that sp.Popen is replaced by a print statement
Run the modified script from the command line
Try using the printed command output from the command line
Modify the command line until it works
Modify the script until it produces the command line that is exactly the same
Change the print back to sp.Popen - Then, (if you still have a problem:
Try modifying your command string to start echo convert so that
you can see what, if anything, is happening to the parameters during
the processing by sp.Popen.
There is also this handy hint from the python documents:
>>> import shlex, subprocess
>>> command_line = raw_input()
/bin/vikings -input eggs.txt -output "spam spam.txt" -cmd "echo '$MONEY'"
>>> args = shlex.split(command_line)
>>> print args
['/bin/vikings', '-input', 'eggs.txt', '-output', 'spam spam.txt', '-cmd', "echo '$MONEY'"]
>>> p = subprocess.Popen(args) # Success!

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Accessing MALLET's diagnostics file via Gensim - python

Is there a way to access MALLET's diagnostics file or its content by using the provided API via Gensim in Python?

Related

Trimming videos with 'ffmpeg and ffprobe'

Weird NameError with python function in Snakemake script

Trucanted string in subprocess.call()

How to make a file transfer program in python

Imagemagick's convert errors in python script with subprocess.Popen

Categories

Resources