I want to print the dataframe to a PDF, in a table-like structure. Also, I have other data that I want to print on the same page.
This is what I tried, printing the dataframe row by row:
from fpdf import FPDF
import pandas as pd
pdf = FPDF(format='letter', unit='in')
pdf.add_page()
pdf.set_font('helvetica', 'BU', 8)
pdf.ln(0.25)
data = [
    [1, 'denumire1', 'cant1', 'pret1', 'valoare1'],
    [2, 'denumire2', 'cant2', 'pret2', 'valoare2'],
    [3, 'denumire3', 'cant3', 'pret3', 'valoare3'],
    [4, 'denumire4', 'cant4', 'pret4', 'valoare4'],
]
df = pd.DataFrame(data, columns=['Nr. crt.', 'Denumire', 'Cant.', 'Pret unitar', 'Valoarea'])
for index, row in df.iterrows():
    pdf.cell(7, 0.5, str(row['Nr. crt.']) + str(row['Denumire']) + str(row['Cant.']) + str(row['Pret unitar']) + str(row['Valoarea']))
pdf.output('test.pdf', 'F')
However, the format is not readable.
How could I print the dataframe to the PDF using FPDF, so that it aligns on the page?
The fpdf module is a rather low-level library. You have to explicitly write each cell after computing the cell width. Here you use letter size (8.5 x 11 in.) and have 5 columns, so a 1.6 in. width seems legitimate. Code could be:
...
for index, row in df.iterrows():
    for data in row.values:
        pdf.cell(1.6, 0.5, str(data))  # write each data value for the row in its cell
    pdf.ln()  # go to the next line after each row
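As a sanity check on that width: letter paper is 8.5 in. wide, so an even split over 5 columns follows from the margins. A quick sketch (the 0.25 in. margin here is an assumption for illustration; fpdf's default margin is about 0.39 in.):

```python
# compute an even cell width for a 5-column table on letter paper
page_width = 8.5   # letter width in inches
margin = 0.25      # assumed margin on each side (fpdf's default is larger)
n_cols = 5

cell_width = (page_width - 2 * margin) / n_cols
print(cell_width)  # 1.6
```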
IMPORTANT: my solution iterates through the DataFrame, and I know this is not ideal, since it is very time-consuming for larger DataFrames. But since you are printing the results in a table, I'm assuming it is a small sample. Still, consider using more efficient methods where you can.
First, let's import the needed modules and create the DataFrame:
import pandas as pd
import math
from fpdf import FPDF
data = [
    [1, 'denumire1', 'cant1', 'pret1', 'valoare1'],
    [2, 'denumire2', 'cant2', 'pret2', 'valoare2'],
    [3, 'denumire3', 'cant3', 'pret3', 'valoare3'],
    [4, 'denumire4', 'cant4', 'pret4', 'valoare4'],
]
df = pd.DataFrame(data, columns=['Nr. crt.', 'Denumire', 'Cant.', 'Pret unitar',
                                 'Valoarea'])
Now we can create our document, add a page, and set margins and font:
# Creating document
pdf = FPDF("P", "mm", "A4")
pdf.set_margins(left=10, top=10)
pdf.set_font("Helvetica", style="B", size=14)
pdf.set_text_color(r=0, g=0, b=0)
pdf.add_page()
Now we can create the first element of our table: the header. I'm assuming we will print on the table only the given columns so I'll use their names as headers.
Since we have 5 columns with multiple characters each, we must take into consideration that we might need more than one line for the header, in case a cell has too many characters for a single line.
To solve that, the line height must equal the font size times the number of lines needed (e.g., a string 150 wide in a cell 100 wide needs 2 lines: 1.5 rounded up). We do this for every column name and use the highest value as our number of lines.
Also, I'm assuming you will equally divide the whole width of the page minus margins for the 5 columns (cells).
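The wrapped-line count described above can be sketched as a small helper (`lines_needed` is a hypothetical name, not part of fpdf):

```python
import math

def lines_needed(text_width, cell_width):
    # a 150-wide string in a 100-wide cell needs ceil(1.5) = 2 lines;
    # anything that fits still needs at least 1 line
    return max(1, math.ceil(text_width / cell_width))

print(lines_needed(150, 100))  # 2
```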
# Creating our table headers
cell_width = (210 - 10 - 10) / len(df.columns)
line_height = pdf.font_size
number_lines = 1
for i in df.columns:
    new_number_lines = math.ceil(pdf.get_string_width(str(i)) / cell_width)
    if new_number_lines > number_lines:
        number_lines = new_number_lines
Now, with our line height for the header, we can iterate through the columns names and print each one. I'll use style "B" and size 14 for the headers (defined earlier).
for i in df.columns:
    pdf.multi_cell(w=cell_width, h=line_height * number_lines * 1.5,
                   txt=str(i), align="C", border="B", new_x="RIGHT", new_y="TOP",
                   max_line_height=line_height)
pdf.ln(line_height * 1.5 * number_lines)
After that we must iterate through the whole dataframe, and for each row create cells with its content. We also have to account for differences in text size and, therefore, in the number of lines per row. By now you have probably figured out that the process is the same as before: we iterate through the row to calculate the number of lines needed, then use that value to print the cells with the content.
Before printing the body of the table, I'm removing the bold style.
# Changing font style
pdf.set_font("Helvetica", style="", size=14)
# Creating our table row by row
for index, row in df.iterrows():
    number_lines = 1
    for i in range(len(df.columns)):
        # .iloc gives positional access, avoiding label-lookup warnings
        new_number_lines = math.ceil(pdf.get_string_width(str(row.iloc[i])) / cell_width)
        if new_number_lines > number_lines:
            number_lines = new_number_lines
    for i in range(len(df.columns)):
        pdf.multi_cell(w=cell_width, h=line_height * number_lines * 1.5,
                       txt=str(row.iloc[i]), align="C", border="B", new_x="RIGHT", new_y="TOP",
                       max_line_height=line_height)
    pdf.ln(line_height * 1.5 * number_lines)
pdf.output("table.pdf")
I keep getting this error when trying to add an empty column to an imported CSV.
"IndexError: index 27 is out of bounds for axis 0 with size 25"
The original CSV spans columns A-Z (indices 0-25), then AA, AB, AC, AD (26, 27, 28, 29).
The CSV with the error currently stretches A-Z, but the error occurs when trying to add the column after that - in this case AA. I guess that would be index 26.
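That guess checks out: spreadsheet column letters map to zero-based indices like this (`col_index` is a hypothetical helper, shown only to illustrate the numbering):

```python
def col_index(letters):
    # 'A' -> 0, 'Z' -> 25, 'AA' -> 26, 'AD' -> 29
    n = 0
    for c in letters:
        n = n * 26 + (ord(c) - ord('A') + 1)
    return n - 1

print(col_index('AA'))  # 26
```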
Here is the code:
```
#import CSV to dataframe
orders = pd.read_csv("Orders.csv", header=None)
#copy columns needed from order to ordersNewCols
ordersNewCols = orders.iloc[:,[1, 3, 11, 12, 15]]
#create new dataframe - ordersToSubmit
ordersToSubmit = pd.DataFrame()
#copy columns from ordersNewCols to ordersToSubmit
ordersToSubmit = ordersNewCols.copy()
ordersToSubmit.to_csv("ordersToSubmit.csv", index=False)
#Insert empty columns where needed.
ordersToSubmit.insert(2,None,'')
ordersToSubmit.insert(3,None,'')
ordersToSubmit.insert(4,None,'')
ordersToSubmit.insert(6,None,'')
ordersToSubmit.insert(7,None,'')
ordersToSubmit.insert(8,None,'')
ordersToSubmit.insert(9,None,'')
ordersToSubmit.insert(10,None,'')
ordersToSubmit.insert(11,None,'')
ordersToSubmit.insert(12,None,'')
ordersToSubmit.insert(13,None,'')
ordersToSubmit.insert(14,None,'')
ordersToSubmit.insert(15,None,'')
ordersToSubmit.insert(16,None,'')
ordersToSubmit.insert(18,None,'')
ordersToSubmit.insert(19,None,'')
ordersToSubmit.insert(20,None,'')
ordersToSubmit.insert(21,None,'')
ordersToSubmit.insert(22,None,'')
ordersToSubmit.insert(23,None,'')
ordersToSubmit.insert(27,None,'')
```
IndexError: index 27 is out of bounds for axis 0 with size 25
How do I expand it to not bring up the error?
Without having a look at your csv file, it is hard to tell what is causing this issue.
Anyways...
From pandas.DataFrame.insert documentation
loc : int
    Insertion index. Must verify 0 <= loc <= len(columns).
As you can see, loc must satisfy 0 <= loc <= len(columns), so what you are trying is invalid according to this. If you insert at an index smaller than len(columns), the remaining columns shift one position to the right.
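A minimal sketch of that bound (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2]})   # len(df.columns) == 2
df.insert(1, 'x', '')                     # valid: 0 <= 1 <= 2; 'b' shifts right
# df.insert(5, 'y', '')                   # would raise: 5 > len(df.columns)
print(list(df.columns))  # ['a', 'x', 'b']
```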
Here is the code I am working with.
dfs=dfs[['Reserved']] #the column that I need to insert
dfs=dfs.applymap(str) #json did not accept the nan so needed to convert
sh=gc.open_by_key('KEY') #would open the google sheet
sh_dfs=sh.get_worksheet(0) #getting the worksheet
sh_dfs.insert_rows(dfs.values.tolist()) #inserts the dfs into the new worksheet
Running this code inserts the rows at the first column of the worksheet, but what I am trying to accomplish is adding/inserting the column at the very end, column P.
In your situation, how about the following modification? First, the maximum column count is retrieved; then the column number is converted to a column letter, and the values are written to the column after the last one.
From:
sh_dfs.insert_rows(dfs.values.tolist())
To:
# Ref: https://stackoverflow.com/a/23862195
def colnum_string(n):
    string = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        string = chr(65 + remainder) + string
    return string
values = sh_dfs.get_all_values()
col = colnum_string(max([len(r) for r in values]) + 1)
sh_dfs.update(col + '1', dfs.values.tolist(), value_input_option='USER_ENTERED')
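As a quick check of the conversion, colnum_string maps a 1-based column number to its A1-notation letters:

```python
def colnum_string(n):
    # convert a 1-based column number to its spreadsheet letters
    string = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        string = chr(65 + remainder) + string
    return string

print(colnum_string(1), colnum_string(26), colnum_string(27))  # A Z AA
```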
Note:
If an error like "exceeds grid limits" occurs, please insert a blank column first.
Reference:
gspread's update method
I have a .xlsx file which looks like the attached file. What is the most common way to extract the different data parts from this Excel file in Python?
Ideally there would be a method that is defined as :
pd.read_part_csv(columns=['data1', 'data2','data3'], rows=['val1', 'val2', 'val3'])
and returns an iterator over pandas dataframes which hold the values in the given table.
Here is a solution with pylightxl that might be a good fit for your project if all you are doing is reading. I wrote the solution in terms of rows, but you could just as well have done it in terms of columns. See the docs for more info on pylightxl: https://pylightxl.readthedocs.io/en/latest/quickstart.html
import pylightxl
db = pylightxl.readxl('Book1.xlsx')

# pull out all the rowIDs where data groups start
keyrows = [rowID for rowID, row in enumerate(db.ws('Sheet1').rows, 1) if 'val1' in row]

# find the columnIDs where data groups start (like in your example, not all data groups start in col A)
keycols = []
for keyrow in keyrows:
    # add +1 since python index starts from 0
    keycols.append(db.ws('Sheet1').row(keyrow).index('val1') + 1)

# define a dict to hold your data groups
datagroups = {}

# populate datatables
for tableIndex, keyrow in enumerate(keyrows, 1):
    i = 0
    # data groups: keys are group IDs starting from 1, list: list of data rows (ie: val1, val2...)
    datagroups.update({tableIndex: []})
    while True:
        # pull out the current group row of data, and remove leading cells with keycols
        datarow = db.ws('Sheet1').row(keyrow + i)[keycols[tableIndex - 1]:]
        # check if the current row is still part of the datagroup
        if datarow[0] == '':
            # current row is empty and is no longer part of the data group
            break
        datagroups[tableIndex].append(datarow)
        i += 1

print(datagroups[1])
print(datagroups[2])
[[1, 2, 3, ''], [4, 5, 6, ''], [7, 8, 9, '']]
[[9, 1, 4], [2, 4, 1], [3, 2, 1]]
Note that the output of table 1 has extra '' entries; that is because the sheet's data area is larger than your group size. You can easily remove these if you like.
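One caveat: list.remove('') only removes the first empty cell it finds, wherever it is. Trimming only the trailing padding is safer; a sketch:

```python
def trim_trailing_empty(row):
    # drop '' padding cells from the end of a row without touching
    # legitimate empty cells in the middle
    row = list(row)
    while row and row[-1] == '':
        row.pop()
    return row

print(trim_trailing_empty([1, 2, 3, '']))  # [1, 2, 3]
```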
My goal for this question is to insert a comma between every character in every column value; the values have been hashed and padded to a length of 19 digits.
The code below works partially, but the array values get messed up by applying the f_comma function to the whole column value... thanks for any help!
I've taken some of the answers from other questions and have created the following code:
using this function -
def f_comma(p_string, n=1):
    p_string = str(p_string)
    return ','.join(p_string[i:i+n] for i in range(0, len(p_string), n))
And opening a tsv file
data = pd.read_csv('a1.tsv', sep = '\t', dtype=object)
I have modified another answer to do the following -
h = 1
try:
    while data.columns[h]:
        a = data.columns[h]
        data[a] = f_comma((abs(data[a].apply(hash))).astype(str).str.zfill(19))
        h += 1
except IndexError:
    pass
which returns this array
array([[ '0, , , , ,4,1,7,5,7,0,1,4,5,4,6,1,6,5,3,1,4,6,1,\n,N,a,m,e,:, ,d,a,t,e,,, ,d,t,y,p,e,:, ,o,b,j,e,c,t',
'0, , , , ,6,2,9,1,6,7,0,8,4,2,8,2,9,1,0,9,5,9,4,\n,N,a,m,e,:, ,n,a,m,e,,, ,d,t,y,p,e,:, ,o,b,j,e,c,t']], dtype=object)
without the f_comma function the array looks like -
array([['3556968867719847281', '3691880917405293133']], dtype=object)
The goal is an array like this -
array([['3,5,5,6,9,6,8,8,6,7,7,1,9,8,4,7,2,8,1', '3,6,9,1,8,8,0,9,1,7,4,0,5,2,9,3,1,3,3']], dtype=object)
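For what it's worth, f_comma itself produces exactly that format when called on one string; the garbled output above comes from passing it the whole Series, whose str() includes the index, Name, and dtype:

```python
def f_comma(p_string, n=1):
    p_string = str(p_string)
    return ','.join(p_string[i:i+n] for i in range(0, len(p_string), n))

# called on a single value, the output matches the goal array
print(f_comma('3556968867719847281'))
# 3,5,5,6,9,6,8,8,6,7,7,1,9,8,4,7,2,8,1
```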
You should be able to use pandas string functions.
e.g. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.join.html
df["my_column"].str.join(',')
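Since each value is already a string, str.join treats it as a sequence of characters and joins them with the delimiter. Applied to the sample hashes from the question, it gives exactly the target array:

```python
import pandas as pd

s = pd.Series(['3556968867719847281', '3691880917405293133'])
out = s.str.join(',')  # joins the characters of each string with commas
print(out.tolist())
```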