Pandas PDF to CSV with Auto Column Adjuster - python

Someone helped me with a program that converts PDF files to CSV, and I would like to add an automatic column-width adjuster to it.
Mass PDF to CSV Code:
import os
import glob
import tabula
path="/Users/username/Downloads/"
for filepath in glob.glob(path+'*.pdf'):
    name=os.path.basename(filepath)
    tabula.convert_into(input_path=filepath,
                        output_path=path+name+".csv",
                        pages="all")
Auto Column Adjuster Code:
import pandas as pd
from UliPlot.XLSX import auto_adjust_xlsx_column_width
# Load example dataset
file_encoding = "cp1252"
df = pd.read_csv("/Users/username/Downloads/", encoding=file_encoding)
# df.set_index("Timestamp", inplace=True)
# Export dataset to XLSX
with pd.ExcelWriter("example.xlsx") as writer:
    df.to_excel(writer, sheet_name="MySheet")
    auto_adjust_xlsx_column_width(df, writer, sheet_name="MySheet", margin=0)
If these two programs could be merged, I would no longer have to adjust every file manually. Note that the PDF to CSV code takes a folder as input, whereas the Auto Column Adjuster code takes a single file.
Link to an example of my datasets:
https://drive.google.com/drive/folders/1nkLgo5tSFsxOTCa5EMWZlezDFi8AyaDq?usp=sharing
Thanks for helping
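One way the two scripts might be merged is sketched below. It assumes tabula-py and UliPlot are installed and that each CSV produced by tabula parses cleanly with pd.read_csv, which may not hold for PDFs containing several differently shaped tables. The names output_paths and convert_folder are hypothetical, and the output files are named report.csv / report.xlsx rather than report.pdf.csv as in the original loop.

```python
from pathlib import Path

import pandas as pd


def output_paths(pdf_path):
    """Return the CSV and XLSX paths that sit next to the given PDF."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_suffix(".csv"), pdf_path.with_suffix(".xlsx")


def convert_folder(folder, encoding="cp1252"):
    """Convert every PDF in `folder` to CSV, then to a width-adjusted XLSX."""
    # Imported lazily so output_paths() works without these packages.
    import tabula
    from UliPlot.XLSX import auto_adjust_xlsx_column_width

    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        csv_path, xlsx_path = output_paths(pdf_path)

        # Step 1: PDF -> CSV, same call as in the first script.
        tabula.convert_into(input_path=str(pdf_path),
                            output_path=str(csv_path),
                            pages="all")

        # Step 2: CSV -> XLSX with adjusted widths, same calls as in the second script.
        df = pd.read_csv(csv_path, encoding=encoding)
        with pd.ExcelWriter(xlsx_path) as writer:
            df.to_excel(writer, sheet_name="MySheet")
            auto_adjust_xlsx_column_width(df, writer, sheet_name="MySheet", margin=0)
```

Calling convert_folder("/Users/username/Downloads/") would then process the whole folder in one pass, one file at a time.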

Related

pandas dataframe exporting to csv or excel missing rows/records

I have looped through two HTML files (the two files have some similar columns/headers, one file has additional columns, screenshots included below). I wanted to load everything into a single dataframe and then export it to excel or csv. But, when I export, I only see records from one HTML file. In my case, I am only seeing the records from the collection_item_shorterned.html file.
Screenshot of HTML files:
Screenshot of printout when running program:
Screenshot of the excel output (there should be 7 records total):
Code:
import os
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5000)
yelp_path = os.path.join('..','data_files','yelp\\',)
print(f"printing path: {yelp_path}")
dir = os.listdir(yelp_path)
print(f"print dir: {dir}")
for i in range(len(dir)):
    data = pd.read_html(yelp_path + dir[i])
    dataframe = pd.concat(data)
    print(dataframe)
dataframe.to_excel("combined_yelp_data.xlsx", index=False)
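A likely cause is that dataframe is reassigned on every pass through the loop, so only the tables from the last file reach to_excel. A minimal sketch of one possible fix follows: collect the tables from every file in a list and concatenate once at the end. The function name combine_html_tables and its reader parameter are hypothetical; reader defaults to pd.read_html, which needs an HTML parser such as lxml installed.

```python
import os

import pandas as pd


def combine_html_tables(folder, reader=pd.read_html):
    """Concatenate every table from every file in `folder` into one DataFrame."""
    frames = []
    for name in sorted(os.listdir(folder)):
        # The reader returns a list of DataFrames, one per table in the file.
        frames.extend(reader(os.path.join(folder, name)))
    # Columns missing from one file are filled with NaN in the combined frame.
    return pd.concat(frames, ignore_index=True)
```

The result could then be exported once, for example with combine_html_tables(yelp_path).to_excel("combined_yelp_data.xlsx", index=False).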

Read CSV files and Report creation in XLSX

Overview:
This program/function reads all of the individual metrics files and creates an Excel report with all metrics.
import glob, os, sys
import csv
import xlsxwriter
from pathlib import Path
import pandas as pd
from openpyxl import Workbook
#Output file name and location
#format for header object.
# Write the column headers with the defined format.
for col_number, value in enumerate(f3.columns.values):
    worksheet_object.write(0, col_number + 1, value,
                           header_format_object)
writer_object.save()
Output in Terminal (Success)
PS C:\Users\Python-1> &
Actual output of file in Folder:
C:\Users\Desktop\Cobol\Outputs
Actual Output in XLSX file
Problem: The results are good; however, the S.No column in the XLSX file (the program count) starts at zero instead of 1.
S.No
0
1
Have you tried a reindex?
Set the index before writing the file.
For example:
import numpy as np
f3.index = np.arange(1, len(f3) + 1)
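To illustrate the suggestion, here is a small sketch with made-up data. The frame f3 and its Program column are placeholders for the real report, and naming the index S.No is an assumption about how the column header is produced.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real report DataFrame.
f3 = pd.DataFrame({"Program": ["PGM001", "PGM002", "PGM003"]})

# Replace the default 0-based index with a 1-based one.
f3.index = np.arange(1, len(f3) + 1)
f3.index.name = "S.No"

print(f3)
```

Because to_excel writes the index as the first column by default, the serial numbers in the output would then start at 1.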

Python/Pandas: Filter out files with specific keyword

I am splitting an xlsm file (with multiple sheets) into CSV files, one per sheet. I want to save only the sheets whose names contain the keyword "Robot" or "Auto". How can I do it? Currently it saves all sheets as CSV files. Here is the code I am using:
import pandas as pd
xl = pd.ExcelFile('Sample_File.xlsm')
for sheet in xl.sheet_names:
    df = pd.read_excel(xl,sheet_name=sheet)
    df.to_csv(f"{sheet}.csv",index=False)
Can you try this?
import pandas as pd
import re
xl = pd.ExcelFile('Sample_File.xlsm')
for sheet in xl.sheet_names:
    if re.search('Robot|Auto', sheet):
        df = pd.read_excel(xl,sheet_name=sheet)
        df.to_csv(f"{sheet}.csv",index=False)
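The filtering step can be tried on its own, without an Excel file. In this sketch the function name sheets_matching and the sample sheet names are hypothetical; the match is case-sensitive, as in the re.search call above.

```python
import re


def sheets_matching(sheet_names, keywords=("Robot", "Auto")):
    """Return the sheet names that contain any of the keywords (case-sensitive)."""
    # re.escape keeps keywords with special characters from being read as regex syntax.
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    return [name for name in sheet_names if pattern.search(name)]


print(sheets_matching(["Robot_1", "Summary", "AutoRun", "robot"]))
# prints ['Robot_1', 'AutoRun']
```

Passing flags=re.IGNORECASE to re.compile would also pick up names such as "robot", if that is wanted.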

Concatenating Excel and CSV files

I've been asked to compile data files into one Excel spreadsheet using Python, but they are all either Excel files or CSVs. I'm trying to use the following code:
import glob, os
import shutil
import pandas as pd
par = set(glob.glob("*Light*")) - set(glob.glob("*all*")) - set(glob.glob("*Untitled"))
par
df = pd.DataFrame()
for file in par:
    print(file)
    df = pd.concat([df, pd.read(file)])
Is there a way I can use the pd.concat function to read the files in more than one format (i.e. both xlsx and csv), instead of one or the other?
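pandas has no generic pd.read function, and pd.concat only joins DataFrames that have already been loaded, so one possible approach is to pick the reader from the file extension first. A minimal sketch, with hypothetical helper names read_any and concat_files; reading Excel files additionally assumes an engine such as openpyxl is installed.

```python
from pathlib import Path

import pandas as pd


def read_any(path):
    """Load a CSV or Excel file into a DataFrame, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xlsm", ".xls"}:
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {path}")


def concat_files(paths):
    """Read every file and stack the results into a single DataFrame."""
    # Building a list and concatenating once avoids repeated copying in a loop.
    return pd.concat([read_any(p) for p in paths], ignore_index=True)
```

With the question's variables this would be called as df = concat_files(par).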

Pandas Create Excel with Table formatted as a Table

I have a .csv file that I am converting to Excel using the following Python script. To make this useful, I need the data to sit in an actual Excel table (as created with Insert > Table). Is this possible within Python? I feel like it should be relatively easy, but I can't find anything on the internet.
The idea is that the Python script takes the csv file, converts it to xlsx with a table embedded on sheet1, and then moves it to the correct folder.
import os
import shutil
import pandas as pd
src = r"C:\Users\xxxx\Python\filename.csv"
src2 = r"C:\Users\xxxx\Python\filename.xlsx"
read_file = pd.read_csv(src)  # convert to Excel
read_file.to_excel(src2, index=None, header=True)
dest = path = r"C:\Users\xxxx\Python\repository"
destination = shutil.copy2(src2, dest)
Edit: I got sidetracked by the original MWE.
This should work, using xlsxwriter:
import pandas as pd
import xlsxwriter
#Dummy data
my_data={"list1":[1,2,3,4], "list2":"a b c d".split()}
df1=pd.DataFrame(my_data)
df1.to_csv("myfile.csv", index=False)
df2=pd.read_csv("myfile.csv")
#List of column name dictionaries
headers=[{"header" : i} for i in list(df2.columns)]
#Create and populate workbook
workbook=xlsxwriter.Workbook('output.xlsx')
worksheet1=workbook.add_worksheet()
worksheet1.add_table(0, 0, len(df2), len(df2.columns)-1, {"columns":headers, "data":df2.values.tolist()})
workbook.close()
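The add_table arguments above are zero-based (first_row, first_col, last_row, last_col). Row 0 holds the header, so the last data row is len(df2) and the last column is len(df2.columns)-1. The sketch below factors that arithmetic into a helper so it can be checked separately; table_spec and write_table are hypothetical names, and xlsxwriter is assumed to be installed for the writing step.

```python
import pandas as pd


def table_spec(df):
    """Return (last_row, last_col, options) for xlsxwriter's add_table."""
    options = {
        "columns": [{"header": str(name)} for name in df.columns],
        "data": df.values.tolist(),
    }
    # Row 0 is the header row, so the data occupies rows 1..len(df).
    return len(df), len(df.columns) - 1, options


def write_table(df, path):
    """Write `df` to `path` as a formatted Excel table on the first sheet."""
    import xlsxwriter  # imported lazily; only needed when actually writing

    last_row, last_col, options = table_spec(df)
    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    worksheet.add_table(0, 0, last_row, last_col, options)
    workbook.close()
```

For the question's use case, write_table(pd.read_csv(src), src2) would replace the to_excel call before the file is copied.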
