Python docx add_paragraph() inserts leading newline - python

I can use a paragraph object to set font size, color, bold, etc. within a table cell. However, add_paragraph() always seems to insert a leading \n into the cell, which messes up the formatting of some tables.
If I just assign to cell.text, the newline is not inserted, but then I can't control the text attributes.
Is there a way to eliminate this leading newline?
Here is my function:
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Inches, Pt, RGBColor

def add_table_cell(table, row, col, text, fontSize=8, r=0, g=0, b=0, width=-1):
    cell = table.cell(row, col)
    if width != -1:
        cell.width = Inches(width)
    para = cell.add_paragraph(style=None)
    para.alignment = WD_ALIGN_PARAGRAPH.LEFT
    run = para.add_run(text)
    run.bold = False
    run.font.size = Pt(fontSize)
    # font.color.type is read-only; assigning .rgb below makes it an RGB color
    run.font.color.rgb = RGBColor(r, g, b)

I tried the following and it worked for me. I'm not sure if it is the best approach:
cells[0].text = 'Some text' #Write the text to the cell
#Modify the paragraph alignment, first paragraph
cells[0].paragraphs[0].paragraph_format.alignment=WD_ALIGN_PARAGRAPH.CENTER

The solution I found is to use the text attribute instead of add_paragraph(), and then use add_run():
row_cells[0].text = ''
row_cells[0].paragraphs[0].add_run('Total').bold = True
row_cells[0].paragraphs[0].paragraph_format.alignment = WD_ALIGN_PARAGRAPH.RIGHT

I've looked through the documentation for the cell class, and the problem is not add_paragraph() itself. The problem is that a cell already contains one paragraph by default.
class docx.table._Cell:
paragraphs: ... By default, a new cell contains a single paragraph. Read-only
Therefore, if you want the paragraph you add to be the first one in the cell, you should delete the default paragraph first. Since python-docx doesn't have paragraph.delete(), you can use the function mentioned in this GitHub issue: feature: Paragraph.delete()
def delete_paragraph(paragraph):
    p = paragraph._element
    p.getparent().remove(p)
    # clear the proxy's references to the removed XML element
    paragraph._p = paragraph._element = None
Therefore, you should do something like:
cell = table.cell(0,0)
paragraph = cell.paragraphs[0]
delete_paragraph(paragraph)
paragraph = cell.add_paragraph('text you want to add', style='style you want')
Update (10/8/2022)
Sorry, the above approach is unnecessary.
It's much more intuitive to edit the default paragraph than to delete it and add a new one.
In the add_table_cell function, just replace the add_paragraph() call with para = cell.paragraphs[0]. Setting para.style = None is not necessary, as that should already be the default for a new paragraph.

Here is what worked for me. I don't call add_paragraph(). I just reference the first paragraph with this call -> para = cell.paragraphs[0]. Everything else after that is the usual api calls.
table = doc.add_table( rows=1, cols=3 ) # bar codes
for tableRow in table.rows:
    for cell in tableRow.cells:
        para = cell.paragraphs[0]
        run = para.add_run( "*" + specIDStr + "*" )
        font = run.font
        font.name = 'Free 3 of 9'
        font.size = Pt( 20 )
        run = para.add_run( "\n" + specIDStr
                            + "\n" + firstName + " " + lastName
                            + "\tDOB: " + dob )
        font = run.font
        font.name = 'Arial'
        font.size = Pt( 8 )

Related

Selenium - Get text inside of table cells

I'm trying to get the text inside table cells, but have had no luck.
These are the cells I am trying to read:
(th and td)
The code works, kind of: it prints each value as a single " " (space).
Code:
driver.get('https://www.komplett.se/product/1165487/datorutrustning/datorkomponenter/chassibarebone/big-tower/phanteks-eclipse-p500-air')
parent_table = driver.find_element_by_xpath("/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div")
count_of_tables = len(parent_table.find_elements_by_xpath("./table"))
for x in range(count_of_tables):
    parent_tr = driver.find_element_by_xpath(f"/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div/table[{x + 1}]/tbody")
    count_of_tr = len(parent_tr.find_elements_by_xpath("./tr"))
    print(count_of_tr)
    for y in range(count_of_tr):
        th = driver.find_element_by_xpath(f'/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div/table[{x + 1}]/tbody/tr[{y+1}]/th')
        td = driver.find_element_by_xpath(f'/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div/table[{x + 1}]/tbody/tr[{y + 1}]/td')
        print(th.text)
        print(td.text)
for y in range(count_of_tr):
    th = driver.find_element_by_xpath(
        f'/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div/table[{x + 1}]/tbody/tr[{y+1}]/th')
    td = driver.find_element_by_xpath(
        f'/html/body/div[2]/main/div[2]/div[2]/div[3]/div/div[2]/div/section[2]/div/div/div/table[{x + 1}]/tbody/tr[{y + 1}]/td')
    print(th.get_attribute("textContent"))
    print(td.get_attribute("textContent"))
Use get_attribute("textContent") instead of .text, because .text only returns text that is visibly rendered on the page.
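The same idea can be written without the long absolute XPaths by searching relative to each row. This is only a sketch: the helper name read_table_rows is hypothetical, it assumes the Selenium 4 style find_elements(by, value) call, and it assumes each row holds at most one th and one td of interest:

```python
def read_table_rows(root):
    """Return (header, value) text pairs for every table row under `root`.

    `root` can be a WebDriver or a WebElement; both offer
    find_elements(by, value) in Selenium 4.
    """
    pairs = []
    # "xpath" is the string value of selenium's By.XPATH constant.
    for row in root.find_elements("xpath", ".//table/tbody/tr"):
        headers = row.find_elements("xpath", "./th")
        values = row.find_elements("xpath", "./td")
        if not headers or not values:
            continue  # skip rows that lack either cell
        # textContent is returned even when the text is not rendered as visible
        header = (headers[0].get_attribute("textContent") or "").strip()
        value = (values[0].get_attribute("textContent") or "").strip()
        pairs.append((header, value))
    return pairs
```

With a live session this would be called as read_table_rows(driver), or as read_table_rows(parent_table) to limit the search to one section of the page.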

Camelot switches characters around

I'm trying to parse tables in a PDF using Camelot. The cells have multiple lines of text in them, and some have an empty line separating portions of the text:
First line
Second line

Third line
I would expect this to be parsed as First line\nSecond line\n\nThird line (notice the double line break), but I get this instead: T\nFirst line\nSecond line\nhird line. The first character after a double line break moves to the beginning of the text, and I only get a single line break instead.
I also tried using tabula, but that one messes up the entire table (data frame, actually) when there is an empty row in the table, and for some words it puts a space between the characters.
EDIT:
My main issue is the removal of multiple line-breaks. The other one I could fix from code if I knew where the empty lines were.
Can you check the example here:
https://camelot-py.readthedocs.io/en/master/user/advanced.html#improve-guessed-table-rows
tables = camelot.read_pdf('group_rows.pdf', flavor='stream', row_tol=10)
tables[0].df
I solved the same problem with the code below
tables = camelot.read_pdf(file, flavor = 'stream', table_areas=['24,618,579,93'], columns=['67,315,369,483,571'], row_tol=10,strip_text='t\r\n\v')
I also encountered the same problem with double line breaks. Camelot was switching characters around, just as in your case. I spent some time looking at the code, made some changes, and fixed the issue. You can use the code below.
After adding the code below, call the custom function read_pdf_custom() that I wrote instead of camelot.read_pdf.
For a better experience, I suggest using camelot version 0.8.2.
import os
import sys
import warnings

import pandas as pd

import camelot
from camelot import read_pdf
from camelot import handlers
from camelot.core import Table, TableList
from camelot.parsers import Lattice, Stream
from camelot.parsers.base import BaseParser
from camelot.utils import (
    validate_input, remove_extra, TemporaryDirectory, get_page_layout,
    get_text_objects, get_rotation, is_url, download_url, scale_image,
    scale_pdf, segments_in_bbox, text_in_bbox, merge_close_lines,
    get_table_index, compute_accuracy, compute_whitespace,
)
from camelot.image_processing import (
    adaptive_threshold,
    find_lines,
    find_contours,
    find_joints,
)


class custom_lattice(Lattice):
    # NOTE: self.find_between() is called below but is not defined anywhere
    # in this answer; it has to be supplied separately.
    def _generate_columns_and_rows(self, table_idx, tk):
        # select elements which lie within table_bbox
        t_bbox = {}
        v_s, h_s = segments_in_bbox(
            tk, self.vertical_segments, self.horizontal_segments
        )
        custom_horizontal_indexes = []
        custom_vertical_indexes = []
        for zzz in self.horizontal_text:
            try:
                h_extracted_text = self.find_between(str(zzz), "'", "'").strip()
                h_text_index = self.find_between(str(zzz), "LTTextLineHorizontal", "'").strip().split(",")
                custom_horizontal_indexes.append(h_text_index[1])
            except:
                pass
        inserted = 0
        # move text detected as vertical into the horizontal list, next to
        # the horizontal line that shares its second coordinate
        for xxx in self.vertical_text:
            v_extracted_text = self.find_between(str(xxx), "'", "'").strip()
            v_text_index = self.find_between(str(xxx), "LTTextLineVertical", "'").strip().split(",")
            custom_vertical_indexes.append(v_text_index[1])
            vertical_second_index = v_text_index[1]
            try:
                horizontal_index = custom_horizontal_indexes.index(vertical_second_index)
                self.horizontal_text.insert(horizontal_index, xxx)
            except Exception as exxx:
                pass
        self.vertical_text = []
        t_bbox["horizontal"] = text_in_bbox(tk, self.horizontal_text)
        t_bbox["vertical"] = text_in_bbox(tk, self.vertical_text)
        t_bbox["horizontal"].sort(key=lambda x: (-x.y0, x.x0))
        t_bbox["vertical"].sort(key=lambda x: (x.x0, -x.y0))
        self.t_bbox = t_bbox
        cols, rows = zip(*self.table_bbox[tk])
        cols, rows = list(cols), list(rows)
        cols.extend([tk[0], tk[2]])
        rows.extend([tk[1], tk[3]])
        cols = merge_close_lines(sorted(cols), line_tol=self.line_tol)
        rows = merge_close_lines(sorted(rows, reverse=True), line_tol=self.line_tol)
        cols = [(cols[i], cols[i + 1]) for i in range(0, len(cols) - 1)]
        rows = [(rows[i], rows[i + 1]) for i in range(0, len(rows) - 1)]
        return cols, rows, v_s, h_s

    def _generate_table(self, table_idx, cols, rows, **kwargs):
        print("\n")
        v_s = kwargs.get("v_s")
        h_s = kwargs.get("h_s")
        if v_s is None or h_s is None:
            raise ValueError("No segments found on {}".format(self.rootname))
        table = Table(cols, rows)
        table = table.set_edges(v_s, h_s, joint_tol=self.joint_tol)
        table = table.set_border()
        table = table.set_span()
        pos_errors = []
        for direction in ["vertical", "horizontal"]:
            for t in self.t_bbox[direction]:
                indices, error = get_table_index(
                    table,
                    t,
                    direction,
                    split_text=self.split_text,
                    flag_size=self.flag_size,
                    strip_text=self.strip_text,
                )
                if indices[:2] != (-1, -1):
                    pos_errors.append(error)
                    indices = Lattice._reduce_index(
                        table, indices, shift_text=self.shift_text
                    )
                    for r_idx, c_idx, text in indices:
                        # a lone character keeps no surrounding line breaks
                        temp_text = text.strip().replace("\n", "")
                        if len(temp_text) == 1:
                            text = temp_text
                        table.cells[r_idx][c_idx].text = text
        accuracy = compute_accuracy([[100, pos_errors]])
        if self.copy_text is not None:
            table = Lattice._copy_spanning_text(table, copy_text=self.copy_text)
        data = table.data
        table.df = pd.DataFrame(data)
        table.shape = table.df.shape
        whitespace = compute_whitespace(data)
        table.flavor = "lattice"
        table.accuracy = accuracy
        table.whitespace = whitespace
        table.order = table_idx + 1
        table.page = int(os.path.basename(self.rootname).replace("page-", ""))
        # for plotting
        _text = []
        _text.extend([(t.x0, t.y0, t.x1, t.y1) for t in self.horizontal_text])
        _text.extend([(t.x0, t.y0, t.x1, t.y1) for t in self.vertical_text])
        table._text = _text
        table._image = (self.image, self.table_bbox_unscaled)
        table._segments = (self.vertical_segments, self.horizontal_segments)
        table._textedges = None
        return table


class PDFHandler(handlers.PDFHandler):
    def parse(
        self, flavor="lattice", suppress_stdout=False, layout_kwargs={}, **kwargs
    ):
        tables = []
        with TemporaryDirectory() as tempdir:
            for p in self.pages:
                self._save_page(self.filepath, p, tempdir)
            pages = [
                os.path.join(tempdir, f"page-{p}.pdf") for p in self.pages
            ]
            parser = custom_lattice(**kwargs) if flavor == "lattice" else Stream(**kwargs)
            for p in pages:
                t = parser.extract_tables(
                    p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
                )
                tables.extend(t)
        return TableList(sorted(tables))


def read_pdf_custom(
    filepath,
    pages="1",
    password=None,
    flavor="lattice",
    suppress_stdout=False,
    layout_kwargs={},
    **kwargs
):
    if flavor not in ["lattice", "stream"]:
        raise NotImplementedError(
            "Unknown flavor specified." " Use either 'lattice' or 'stream'"
        )
    with warnings.catch_warnings():
        if suppress_stdout:
            warnings.simplefilter("ignore")
        validate_input(kwargs, flavor=flavor)
        p = PDFHandler(filepath, pages=pages, password=password)
        kwargs = remove_extra(kwargs, flavor=flavor)
        tables = p.parse(
            flavor=flavor,
            suppress_stdout=suppress_stdout,
            layout_kwargs=layout_kwargs,
            **kwargs
        )
        return tables

Updating folium changed the Popup box width

I recently updated folium from 0.5.0 to 0.11.0, and since then I have had a problem with the popup box. After the update the popup box seems to have shrunk in width, and text that used to appear on a single line with the previous version of folium now wraps onto separate lines. No changes were made to the code.
How can I make the popup box look like the previous one, i.e., so that the text does not break across lines?
Popup box code:
fgc.add_child(folium.Marker(location=[lt, ln], popup="<h4><b>Thana :&nbsp;" + di + "</b></h4>"
    + "<br><b>Cases Total: &nbsp;: </b>" + str(ca) + " person "
    + "<br><b>Cases 24 hours : </b>" + str(da) + " person "
    + "<br><b>Cases 7 days: </b>" + str(we) + " person "
    + "<br><b>Neighbourhood affected : </b>" + str(ne)))
I handled this by creating an IFrame from the dataframe values and then passing that to the Popup class. This should work for data from a database or a dataframe.
for (index, row) in df.iterrows():
    if row.loc['BRANCH'] == 1:
        iframe = folium.IFrame('Account#:' + str(row.loc['ACCT']) + '<br>' + 'Name: ' + row.loc['NAME'] + '<br>' + 'Terr#: ' + str(row.loc['TERR']))
        popup = folium.Popup(iframe, min_width=300, max_width=300)
        folium.Marker(location=[row.loc['LAT'], row.loc['LON']], icon=folium.Icon(color=row.loc['COLOR'], icon='map-marker', prefix='fa'), popup=popup).add_to(map1)
Without reproducible code it is not possible to give you a tailored solution. As a general suggestion, you could use folium.Popup() with the combo of min_width and max_width parameters to force the width of a popup.
For example:
import folium
m = folium.Map(location=[43.775, 11.254],
               zoom_start=5)
html = '''1 aaaaaaaaaaaaaaaaaa aaaa aaa aa aaaaa aaa aaaa a a a a<br>2 aaaaaaaaaa aaa aaaaa aaaaa<br>3 aaaaa aaaaaa aaaaa aaa aaaaa<br>4 aaa aaa aaaaaaaa
'''
iframe = folium.IFrame(html)
popup = folium.Popup(iframe,
                     min_width=500,
                     max_width=500)
marker = folium.Marker([43.775, 11.254],
                       popup=popup).add_to(m)
m
and you get a popup whose width is fixed at 500 pixels.
def color(elev):
    if elev == "STARTED":
        col = 'orange.png'
    elif elev == "COMPLETED":
        col = 'vehicle3_w30.png'
    elif elev == "DELIVERED":
        col = 'vehicle3_w30.png'
    else:
        col = 'grey.png'
    return col
icon_url = "grey.png"
icon = folium.features.CustomIcon(icon_url,
                                  icon_size=(12, 12))
for lat,lan,name,event_name,officer,update_at in zip(df['fSourceLatitude'],df['fSourceLongitude'],df['officer_name'],df['event_name'],df["user_name"],df["update_at"]):
    bikeColor = color(event_name)
    biker = folium.features.CustomIcon(bikeColor, icon_size=(20,40))
    popContent = ("Updated At: " + str(update_at) + '<br>' +
                  "Officer ID : " + str(officer) + '<br>' +
                  "Status: {}".format(event_name))
    iframe = folium.IFrame(popContent)
    popup1 = folium.Popup(iframe,
                          min_width=500,
                          max_width=500)
    folium.Marker(location=[lat,lan],popup = popup1,icon= biker).add_to(map5)
This worked for me. You need to create the marker with its custom icon in each iteration, as shown in this code.
The asker is fetching the data from a database, which is why the text is breaking. If the data were written directly inside the HTML tags there would be no problem, so the key point is that the fetched data has to be placed inside the HTML tags.

Python-PPTX: Changing table style or adding borders to cells

I've started putting together some code to take Pandas data and put it into a PowerPoint slide. The template I'm using defaults to Medium Style 2 - Accent 1, which would be fine since changing the font and background is fairly easy, but python-pptx doesn't appear to implement a way to change cell borders. Below is my code; I'm open to any solution. (Altering the XML or changing the template default to a better style would be good options for me, but I haven't found good documentation on how to do either.) Medium Style 4 would be ideal for me, as it has exactly the borders I'm looking for.
import pandas
import numpy
from pptx import Presentation
from pptx.util import Inches, Pt
from pptx.dml.color import RGBColor
#Template Location
tmplLoc = 'C:/Desktop/'
#Read in Template
prs = Presentation(tmplLoc+'Template.pptx')
#Import data as Pandas Dataframe - dummy data for now
df = pandas.DataFrame(numpy.random.randn(10,10),columns=list('ABCDEFGHIJ'))
#Determine Table Header
header = list(df.columns.values)
#Determine rows and columns
in_rows = df.shape[0]
in_cols = df.shape[1]
#Insert table from C1 template
slide_layout = prs.slide_layouts[11]
slide = prs.slides.add_slide(slide_layout)
#Set slide title
title_placeholder = slide.shapes.title
title_placeholder.text = "Slide Title"
#Augment placeholder to be a table
placeholder = slide.placeholders[1]
graphic_frame = placeholder.insert_table(rows = in_rows+1, cols = in_cols)
table = graphic_frame.table
#table.apply_style = 'MediumStyle4'
#table.apply_style = 'D7AC3CCA-C797-4891-BE02-D94E43425B78'
#Set column widths
table.columns[0].width = Inches(2.23)
table.columns[1].width = Inches(0.9)
table.columns[2].width = Inches(0.6)
table.columns[3].width = Inches(2)
table.columns[4].width = Inches(0.6)
table.columns[5].width = Inches(0.6)
table.columns[6].width = Inches(0.6)
table.columns[7].width = Inches(0.6)
table.columns[8].width = Inches(0.6)
table.columns[9].width = Inches(0.6)
#total_width = 2.23+0.9+0.6+2+0.6*6
#Insert data into table
for rows in range(in_rows+1):  # range() replaces Python 2's xrange()
    for cols in range(in_cols):
        #Write column titles
        if rows == 0:
            table.cell(rows, cols).text = header[cols]
            table.cell(rows, cols).text_frame.paragraphs[0].font.size=Pt(14)
            table.cell(rows, cols).text_frame.paragraphs[0].font.color.rgb = RGBColor(255, 255, 255)
            table.cell(rows, cols).fill.solid()
            table.cell(rows, cols).fill.fore_color.rgb=RGBColor(0, 58, 111)
        #Write rest of table entries
        else:
            table.cell(rows, cols).text = str("{0:.2f}".format(df.iloc[rows-1,cols]))
            table.cell(rows, cols).text_frame.paragraphs[0].font.size=Pt(10)
            table.cell(rows, cols).text_frame.paragraphs[0].font.color.rgb = RGBColor(0, 0, 0)
            table.cell(rows, cols).fill.solid()
            table.cell(rows, cols).fill.fore_color.rgb=RGBColor(255, 255, 255)
#Write Table to File
prs.save('C:/Desktop/test.pptx')
This may not be very clean code, but it allowed me to adjust all borders of all cells in a table:
from pptx.oxml.xmlchemy import OxmlElement

def SubElement(parent, tagname, **kwargs):
    element = OxmlElement(tagname)
    element.attrib.update(kwargs)
    parent.append(element)
    return element

def _set_cell_border(cell, border_color="000000", border_width='12700'):
    tc = cell._tc
    tcPr = tc.get_or_add_tcPr()
    # one line element per side: left, right, top, bottom
    for lines in ['a:lnL','a:lnR','a:lnT','a:lnB']:
        ln = SubElement(tcPr, lines, w=border_width, cap='flat', cmpd='sng', algn='ctr')
        solidFill = SubElement(ln, 'a:solidFill')
        srgbClr = SubElement(solidFill, 'a:srgbClr', val=border_color)
        prstDash = SubElement(ln, 'a:prstDash', val='solid')
        round_ = SubElement(ln, 'a:round')
        headEnd = SubElement(ln, 'a:headEnd', type='none', w='med', len='med')
        tailEnd = SubElement(ln, 'a:tailEnd', type='none', w='med', len='med')
Based on this post: https://groups.google.com/forum/#!topic/python-pptx/UTkdemIZICw
In case someone else comes across this issue, some changes should be made to the solution posted by JuuLes87 so that Microsoft PowerPoint does not ask to repair the generated presentation.
After carefully inspecting the XML of the table generated by python-pptx, I found that the repair prompt seemed to be caused by duplicated 'a:lnL', 'a:lnR', 'a:lnT', or 'a:lnB' nodes among the children of 'a:tcPr'. So we only need to remove any existing nodes with these tags before the new ones are inserted, as below.
from pptx.oxml.xmlchemy import OxmlElement

def SubElement(parent, tagname, **kwargs):
    element = OxmlElement(tagname)
    element.attrib.update(kwargs)
    parent.append(element)
    return element

def _set_cell_border(cell, border_color="000000", border_width='12700'):
    tc = cell._tc
    tcPr = tc.get_or_add_tcPr()
    for lines in ['a:lnL','a:lnR','a:lnT','a:lnB']:
        # Every time before a node is inserted, the nodes with the same tag should be removed.
        tag = lines.split(":")[-1]
        for e in tcPr.getchildren():
            if tag in str(e.tag):
                tcPr.remove(e)
        # end
        ln = SubElement(tcPr, lines, w=border_width, cap='flat', cmpd='sng', algn='ctr')
        solidFill = SubElement(ln, 'a:solidFill')
        srgbClr = SubElement(solidFill, 'a:srgbClr', val=border_color)
        prstDash = SubElement(ln, 'a:prstDash', val='solid')
        round_ = SubElement(ln, 'a:round')
        headEnd = SubElement(ln, 'a:headEnd', type='none', w='med', len='med')
        tailEnd = SubElement(ln, 'a:tailEnd', type='none', w='med', len='med')
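The remove-before-insert idea can be checked in isolation with the standard library. The following is only a simplified sketch on a bare tcPr element built with xml.etree.ElementTree, not python-pptx itself; the function name set_borders is hypothetical and only the solid fill child is added:

```python
import xml.etree.ElementTree as ET

# DrawingML main namespace, which the "a:" prefix stands for
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
BORDER_TAGS = ("lnL", "lnR", "lnT", "lnB")


def set_borders(tc_pr, color="000000", width="12700"):
    """Replace the four border line elements under a tcPr element."""
    for name in BORDER_TAGS:
        qname = f"{{{A_NS}}}{name}"
        # Drop any existing element for this side so it is never duplicated.
        for old in tc_pr.findall(qname):
            tc_pr.remove(old)
        ln = ET.SubElement(tc_pr, qname, w=width, cap="flat", cmpd="sng", algn="ctr")
        fill = ET.SubElement(ln, f"{{{A_NS}}}solidFill")
        ET.SubElement(fill, f"{{{A_NS}}}srgbClr", val=color)


tc_pr = ET.Element(f"{{{A_NS}}}tcPr")
set_borders(tc_pr)                  # first call adds four line elements
set_borders(tc_pr, color="FF0000")  # second call replaces them
border_count = len(tc_pr)
colors = [c.get("val") for c in tc_pr.iter(f"{{{A_NS}}}srgbClr")]
print(border_count, colors)
```

Calling the function twice on the same element still leaves exactly one element per side, which is the property the fix above relies on.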
I had a hard time figuring out why this wasn't working. For anyone else struggling with this, I had to add the following to the end of the function:
    return cell
When calling the function, use it like this:
cell = _set_cell_border(cell)

Freeze cells in excel using xlwt

I am creating worksheets on the fly and not naming them anything. I am unable to freeze the first column and row. I tried naming the sheet when adding it to the workbook, and that works. However, it doesn't work on the fly. Below is the code:
base = xlwt.Workbook()
for k,v in MainDict.items():
    base.add_sheet(k.upper())
    col_width = 256 * 50
    xlwt.add_palette_colour("custom_colour", 0x21)
    pattern = 'url:(.*)'
    search = re.compile(pattern)
    base.set_colour_RGB(0x21, 251, 228, 228)
    style = xlwt.easyxf('pattern: pattern solid, fore_colour custom_colour;font : bold on;alignment: horiz center;font: name Times New Roman size 20;font:underline single')
    index = MainDict.keys().index(k)
    ws = base.get_sheet(index)
    ws.set_panes_frozen(True)
    try:
        for i in itertools.count():
            ws.col(i).width = col_width
    except ValueError:
        pass
    style1 = xlwt.easyxf('font: name Times New Roman size 15')
    style2 = xlwt.easyxf('font : bold on;font: name Times New Roman size 12')
    col=0
    for sk in MainDict[k].keys():
        ws.write(0,col,sk.upper(),style)
        col+=1
        row =1
        for mk in MainDict[k][sk].keys():
            for lk,lv in MainDict[k][sk][mk].items():
                for items in lv:
                    text = ('%s URL: %s')%(items,lk)
                    links =('No data Found. Please visit the URL: %s')% (lk)
                    url = re.findall(pattern,text)
                    if len(items) != 0:
                        if re.match(pattern,text)==True:
                            ws.write(row,col-1,url,style2)
                        else:
                            ws.write(row,col-1,text,style1)
                        row+=1
                    else:
                        ws.write(row,col-1,links,style2)
                        #ws.Column(col-1,ws).width = 10000
                        row+=1
default_book_style = base.default_style
default_book_style.font.height = 20 * 36
base.save('project7.xls')
You have to use
ws.set_panes_frozen(True)
ws.set_horz_split_pos(1)
ws.set_vert_split_pos(1)
to make frozen take effect.
The reason this isn't working may be the get_sheet() function. Instead, store the result of the add_sheet() call in ws and use that:
#base.add_sheet(k.upper())
ws = base.add_sheet(k.upper())
Then you need this sequence of attributes to freeze the top row and first column:
#ws = base.get_sheet(index)
#ws.set_panes_frozen(True)
ws.set_horz_split_pos(1)
ws.set_vert_split_pos(1)
ws.panes_frozen = True
ws.remove_splits = True
I tested this using your code snippet and it works on my end.
For reference, you can set these attributes either via function or as assignment:
ws.set_panes_frozen(True)
ws.set_remove_splits(True)
