Python - DocxTemplate - Not Printing "&" in final document

I am running a script that takes names from a CSV file and populates them into individual Word documents from a template. I have that part working. But here is where I need a bit of help.
Some cells in the CSV file contain two names, such as "Bobby & Sammy." When I check the populated Word document, it only has "Bobby Sammy." I know that "&" is a special character, but I am not sure what I have to do for it to populate the Word documents correctly.
Any and all help is appreciated.
Edit: Code
csvfn = "Addresses.csv"
df = pd.read_csv('Addresses.csv')
def mkw(n):
tpl = DocxTemplate('Envelope_Template.docx')
df_to_doct = df.to_dict()
x = df.to_dict(orient='records')
context = x
tpl.render(context[n])
tpl.save("%s.docx" %str(n))
wait = time.sleep(random.randint(1,3))
~
print ("There will be ", df2, "files")
~
for i in range(0, df2):
    print("Making file:", f"{i},", "..Please Wait...")
    mkw(i)
print("Done! - Now check your files")
~ denotes a new cell; I am using JupyterLab.
(Screenshots: the standard CSV file; the output without "&", which prints fine; and an empty space where the "&" is supposed to be.)
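In case it helps future readers: "&" is special in the underlying Word XML, so an unescaped value can be dropped from the rendered document. A minimal sketch of two escaping options, assuming a reasonably recent docxtpl (the name placeholder is an illustration, not taken from the question's template):

from docxtpl import DocxTemplate

tpl = DocxTemplate('Envelope_Template.docx')
# Option 1: let docxtpl XML-escape every value passed in the context
tpl.render({'name': 'Bobby & Sammy'}, autoescape=True)
tpl.save('escaped_example.docx')

# Option 2: escape a single placeholder in the template itself by
# writing it as {{ name|e }} instead of {{ name }}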

Related

Using Python and regex to find strings and put them together to replace the filename of a .pdf - the rename fails when using more than one group

I have several thousand PDFs which I need to rename based on their content. The layouts of the PDFs are inconsistent. To rename them I need to locate a specific string, "MEMBER". I need the value after the string "MEMBER" and the values from the two lines above MEMBER, which are Time and Date values respectively.
So:
STUFF
STUFF
STUFF
DD/MM/YY
HH:MM:SS
MEMBER ######
STUFF
STUFF
STUFF
I have been using regex101.com and have ((.*(\n|\r|\r\n)){2})(MEMBER.\S+) which matches all of the values I need. But it puts them across four groups with group 3 just showing a carriage return.
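To see the grouping concretely, here is a quick check against the sample layout above (the hash digits are made up):

import re

pdf_regex = r'((.*(\n|\r|\r\n)){2})(MEMBER.\S+)'
sample = "DD/MM/YY\nHH:MM:SS\nMEMBER 123456\n"
m = re.search(pdf_regex, sample)
print(repr(m.group(1)))  # 'DD/MM/YY\nHH:MM:SS\n' - both lines, newlines included
print(repr(m.group(2)))  # 'HH:MM:SS\n'           - only the last repetition
print(repr(m.group(3)))  # '\n'                   - the lone line break
print(repr(m.group(4)))  # 'MEMBER 123456'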
What I have so far looks like this:
import re
from glob import glob
from os import chdir, getcwd, rename

import fitz  # PyMuPDF

failed_pdfs = []
count = 0
pdf_regex = r'((.*(\n|\r|\r\n)){2})(MEMBER.\S+)'
text = ""
get_curr = getcwd()
directory = 'PDF_FILES'
chdir(directory)
pdf_list = glob('*.pdf')
for pdf in pdf_list:
    with fitz.open(pdf) as pdf_obj:
        for page in pdf_obj:
            text += page.get_text()
    new_file_name = re.search(pdf_regex, text).group().strip().replace(":", "").replace("-", "") + '.pdf'
    text = ""
    #clean_new_file_name = new_file_name.translate({(":"): None})
    print(new_file_name)
    # Tries to rename a pdf. If the filename doesn't already exist
    # then rename. If it does exist then throw an error and add to failed list
    try:
        rename(pdf, new_file_name)
    except WindowsError:
        count += 1
        failed_pdfs.append(str(count) + ' - FAILED TO RENAME: [' + pdf + " ----> " + str(new_file_name) + "]")
If I specify a group in the re.search portion - like, for instance, group 4, which contains the MEMBER ##### value - then the file renames successfully with just that value. Similarly, group 2 renames with the TIME value. I think the multiple lines are preventing it from using all of the values I need. When I run it with group(), the print value shows as
DATE
TIME
MEMBER ######.pdf
And the log count reflects the failures.
I am very new at this, and stumbling around trying to piece together how things work. Is the issue with how I built the regex or with the re.search portion? Or everything?
I have tried re-doing the Regular Expression, but I end up with multiple lines in the results, which seems connected to the rename failure.
The strategy is to read the page's text by words and sort them adequately. If we then find "MEMBER", the word following it represents the hashes ####, the word before it is the time, and the word before that is the date.
found = False
for page in pdf_obj:  # pdf_obj is the fitz document from the question's loop
    words = page.get_text("words", sort=True)
    # all space-delimited strings on the page, sorted vertically,
    # then horizontally
    for i, word in enumerate(words):
        if word[4] == "MEMBER":
            hashes = words[i + 1][4]       # this must be the word after MEMBER!
            time_string = words[i - 1][4]  # the time
            date_string = words[i - 2][4]  # the date
            found = True
            break
    if found:  # no need to look in other pages
        break
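From there, the three values can be joined into the new filename, mirroring the sanitizing step from the question (a sketch; pdf and rename come from the question's loop, and the replace calls strip ':' and '/', which are not allowed in Windows filenames):

if found:
    new_file_name = f"{date_string} {time_string} {hashes}".replace(":", "").replace("/", "") + ".pdf"
    rename(pdf, new_file_name)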

Parse lines in files with similar strings using Python

AH! I'm new to Python. Trying to get the pattern here, but could use some assistance to get unblocked.
Scenario:
testZip.zip file with test.rpt files inside
The .rpt files have multiple areas of interest ("AOI") to parse
AOI1: Line starting with $$
AOI2: Multiple lines starting with a single $
Goal:
To get the AOIs into tabular format for upload to SQL
Sample file:
$$ADD ID=TEST BATCHID='YEP' PASSWORD=NOPE
###########################################################################################
$KEY= 9/21/2020 3:53:55 PM/2002/B0/295.30/305.30/4FAOA973_3.0_v2.19.2.0_20150203_1/20201002110149
$TIMESTAMP= 20201002110149
$MORECOLUMNS= more columns
$YETMORE = yay
Tried so far:
import zipfile

def get_aoi1(zip):
    z = zipfile.ZipFile(zip)
    for f in z.namelist():
        with z.open(f, 'r') as rptf:
            for l in rptf.readlines():
                if l.find(b"$$") != -1:
                    return l

def get_aoi2(zip):
    z = zipfile.ZipFile(zip)
    for f in z.namelist():
        with z.open(f, 'r') as rptf:
            for l in rptf.readlines():
                if l.find(b"$") != -1:
                    return l

aoi1 = get_aoi1('testZip.zip')
aoi2 = get_aoi2('testZip.zip')
print(aoi1)
print(aoi2)
Results:
I get the same results for both functions
b"$$ADD ID=TEST BATCHID='YEP' PASSWORD=NOPE\r\n"
b"$$ADD ID=TEST BATCHID='YEP' PASSWORD=NOPE\r\n"
How do I get the results as text instead of bytes (the b prefix) and remove the \r\n from AOI1?
There doesn't seem to be a text-mode option for z.open()
I've been unsuccessful with .strip()
EDIT 1:
Thanks for the tip, @furas!
return l.strip().decode() worked for removing the newline and the b prefix
How do I get the correct results from AOI2 (lines with a single $ in a tabular format)?
EDIT 2:
@furas 2021!
Adding the following logic to the aoi2 function worked great.
col_array = []
for l in rptf.readlines():
    if not l.startswith(b"$$") and l.startswith(b"$"):
        col_array.append(l)
return col_array
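Putting both edits together, a sketch of the full revised function (same testZip.zip layout as above; bytes are decoded per EDIT 1, and only single-$ lines are collected):

import zipfile

def get_aoi2(zip):
    col_array = []
    z = zipfile.ZipFile(zip)
    for f in z.namelist():
        with z.open(f, 'r') as rptf:
            for l in rptf.readlines():
                # keep the single-$ lines, skip the $$ header line
                if not l.startswith(b"$$") and l.startswith(b"$"):
                    col_array.append(l.strip().decode())
    return col_array

print(get_aoi2('testZip.zip'))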

Python: is there a maximum number of values the write() function can process?

I'm new to Python, so I would be thankful for any help...
My problem is the following:
I wrote a program in Python that analyses gene sequences from a huge database (more than 600 genes). Using the write() function, the program should write the results to a text file - one result per gene. When I open my output file, there are only the first few genes, followed by "...", followed by the last gene.
Is there a maximum this function can process? How do I make Python write all the results?
relevant part of code:
fasta_df3 = pd.read_table(fasta_out3, delim_whitespace=True,
                          names=('qseqid', 'sseqid', 'evalue', 'pident'))
fasta_df3_sorted = fasta_df3.sort_values(by='qseqid', ascending=True)
fasta_df3_grouped = fasta_df3_sorted.groupby('qseqid')
for qseqid, fasta_df3_sorted in fasta_df3_grouped:
    subj3_pident_max = str(fasta_df3_grouped['pident'].max())
    subj3_pident_min = str(fasta_df3_grouped['pident'].min())
    current_gene = str(qseqid)
    with open(dir_output + outputall_file + ".txt", "a") as gene_list:
        gene_list.write("\n" + "subj3: {} \t {} \t {}".format(current_gene,
                        subj3_pident_max, subj3_pident_min))
    gene_list.close()
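For what it's worth, write() has no such limit. The "..." is almost certainly pandas' truncated repr: the loop calls .max()/.min() on the whole fasta_df3_grouped object, so each str(...) is the abbreviated Series over all genes rather than a single value. A sketch of the per-group version, assuming the intent is one max/min line per gene:

for qseqid, group in fasta_df3_grouped:
    # .max()/.min() on the single group's column return scalars,
    # not a truncated Series over all genes
    subj3_pident_max = str(group['pident'].max())
    subj3_pident_min = str(group['pident'].min())
    with open(dir_output + outputall_file + ".txt", "a") as gene_list:
        gene_list.write("\n" + "subj3: {} \t {} \t {}".format(str(qseqid),
                        subj3_pident_max, subj3_pident_min))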

Parse ~4k files for a string (sophisticated conditions)

Problem description
There is a set of ~4000 Python files with the following structure:
@ScriptInfo(number=3254,
            attribute=some_value,
            title="crawler for my website",
            some_other_key=some_value)
scenario_name = entity.get_script_by_title(title)
The goal
The goal is to get the value of the title from the ScriptInfo decorator (in this case it is "crawler for my website"), but there are a couple of problems:
1) There is no rule for naming a variable that contains the title. That's why it can be title_name, my_title, etc. See example:
@ScriptInfo(number=3254,
            attribute=some_value,
            my_title="crawler for my website",
            some_other_key=some_value)
scenario_name = entity.get_script_by_title(my_title)
2) The @ScriptInfo decorator may have more than two arguments, so grabbing its contents from between the parentheses in order to get the second parameter's value is not an option
My (very naive) solution
The piece of code that stays unchanged, however, is the scenario_name = entity.get_script_by_title(my_title) line. Taking this into account, I've come up with this solution:
import re

title_variable_re = r"scenario_name\s?=\s?entity\.get_script_by_title\((.*)\)"
with open("python_file.py") as file:
    for line in file:
        if re.match(title_variable_re, line):
            title_variable = re.match(title_variable_re, line).group(1)

title_re = title_variable + r"\s?=\s\"(.*)\"?"
with open("python_file.py") as file:
    for line in file:
        if re.match(title_re, line):
            title_value = re.match(title_re, line).group(1)

print title_value
This snippet of code does the following:
1) Traverses the script file (see the first with open) and gets the variable holding the title, since its name is up to the programmer
2) Traverses the script file again (see the second with open) and gets the title's value
The question for the stackoverflow family
Is there a better and more efficient way to get the title's (my_title's, title_name's, etc) value than traversing the script file two times?
If you open the file only once and save all lines into fileContent, add break where appropriate, and reuse the matches to access the captured groups, you obtain something like this (with parentheses after print for 3.x, without for 2.7):
import re

title_value = None
title_variable_re = r"scenario_name\s?=\s?entity\.get_script_by_title\((.*)\)"
with open("scenarioName.txt") as file:
    fileContent = list(file.read().split('\n'))

title_variable = None
for line in fileContent:
    m1 = re.match(title_variable_re, line)
    if m1:
        title_variable = m1.group(1)
        break

title_re = r'\s*' + title_variable + r'\s*=\s*"([^"]*)"[,)]?\s*'
for line in fileContent:
    m2 = re.match(title_re, line)
    if m2:
        title_value = m2.group(1)
        break

print(title_value)
Here is an unsorted list of changes to the regular expressions:
Allow space before the title_variable; that's what the r'\s*' + is for
Allow space around =
Allow a comma or closing parenthesis at the end of the line in title_re; that's what the [,)]? is for
Allow some space at the end of the line
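For instance, with title_variable equal to my_title, the assembled title_re matches the decorator line from the example file:

import re

title_re = r'\s*' + "my_title" + r'\s*=\s*"([^"]*)"[,)]?\s*'
m = re.match(title_re, '            my_title="crawler for my website",')
print(m.group(1))  # crawler for my website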
When tested on the following file as input:
@ScriptInfo(number=3254,
            attribute=some_value,
            my_title="crawler for my website",
            some_other_key=some_value)
scenario_name = entity.get_script_by_title(my_title)
it produces the following output:
crawler for my website

How to load a dataframe from a file containing unwanted characters?

I'm in need of some knowledge on how to fix an error I have made while collecting data. The collected data has the following structure:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
I normally wouldn't have added "[" or "]" to the .txt file when writing the data to it, line by line. However, the mistake was made, and as a result the brackets break up the data when the file is loaded.
Is there a way to load the data properly into pandas?
On the snippet that I can cut and paste from the question (which I named test.txt), I could successfully read a dataframe via
Purging square brackets (with sed on a Linux command line, but this can be done e.g. with a text editor, or in python if need be)
sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
Loading the dataframe (in a python console)
import pandas as pd
pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
(not sure that this will work for the entirety of your file though).
Consider the code below, which reads the text in myfile.txt, which looks like this:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]
The code below removes [ and ] from the text and then splits every string in the list of strings on commas, excluding the first string, which holds the headers. Some Message values themselves contain commas, which would otherwise create an extra (NaN) column, so the code joins those pieces back into one string, as intended.
Code:
import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()
text = text.replace("[", "")
text = text.replace("]", "")
df = pd.DataFrame({
    'Author': [i.split(',')[0] for i in text.split('\n')[1:]],
    'Message': [''.join(i.split(',')[1:]) for i in text.split('\n')[1:]]
}).applymap(lambda x: x.replace('"', ''))
Output:
Author Message
0 littleblackcat There's a lot of redditors here that live in the area maybe/hopefully someone saw something.
1 Kruse In other words it's basically creating a mini tornado.
Here are a few more options to add to the mix:
You could parse the lines yourself using ast.literal_eval, and then load them into a pd.DataFrame directly using an iterator over the lines:
import pandas as pd
import ast

with open('data', 'r') as f:
    lines = (ast.literal_eval(line) for line in f)
    header = next(lines)
    df = pd.DataFrame(lines, columns=header)  # consume the generator while the file is open
print(df)
Note, however, that calling ast.literal_eval once for each line may not be very fast, especially if your data file has a lot of lines. However, if the data file is not too big, this may be an acceptable, simple solution.
Another option is to wrap an arbitrary iterator (which yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) allows you to manipulate the contents of any file and then re-package it into a file-like object. Thus, you can fix the contents of the file, and yet still pass it to any function which expects a file-like object, such as pd.read_csv. (Note: I've answered a similar question using the same tool, here.)
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.
    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).
    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def clean(f):
    for line in f:
        yield line.strip()[1:-1] + b'\n'

with open('data', 'rb') as f:
    # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
    df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')
print(df)
A pure pandas option is to change the separator from , to ", " in order to have only 2 columns, and then strip the unwanted characters, which to my understanding are [, ], " and spaces:
import pandas as pd
import io

string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''

# the \" instead of simply " is to make sure python does not interpret it as an end-of-string character
df = pd.read_csv(io.StringIO(string), sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
df.columns = [df.columns[0][2:], df.columns[1][:-2]]
print(df)
# Output (note the space before "There's" is also gone):
#           Author                                            Message
# 0  littleblackcat  There's a lot of redditors here that live in t...
# 1           Kruse  In other words, it's basically creating a mini...
For now, the following solution was found:
sep = '[|"|]'
Using a multi-character separator allowed the brackets to be stored in separate columns of a pandas dataframe, which were then dropped. This avoids having to strip the words line by line.
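A sketch of that last approach (the file name test.txt is assumed; regex separators force the python engine and, per the pandas docs, ignore quoting, so the brackets and quotes simply become empty columns to drop):

import pandas as pd

# '[|"|]' is a regex separator: split on '[', '"' or ']'
df = pd.read_csv("test.txt", sep='[|"|]', engine='python', header=None, skiprows=1)
df = df.dropna(axis=1, how="all")  # the brackets/quotes produce all-empty columns
df = df.iloc[:, [0, 2]]            # the middle column left over is the literal ', '
df.columns = ["Author", "Message"]
print(df)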
