Handling multiple pdf files - python

I have created a folder with 158 PDF files, and I want to extract the data from each one. Here is what I have done so far.
Importing modules
from itertools import chain
import pandas as pd
import tabulate
from tabula import read_pdf
Reading data file
data_A = read_pdf('D:\\Code\\Scraping\\DMKQ\\A.pdf', pages='all',encoding='latin1')
data_B = read_pdf('D:\\Code\\Scraping\\DMKQ\\B.pdf', pages='all',encoding='latin1')
# Generating Dataframe and print(len) for each file.
data_A_c = chain(*[data_A[i].values for i in range(0,len(data_A))])
headers=chain(data_A[0])
df_A = pd.DataFrame(data_A_c,columns=headers)
df_A.set_index('Name', inplace=True)
print(len(df_A.index))
data_B_c = chain(*[data_B[i].values for i in range(0,len(data_B))])
headers=chain(data_B[0])
df_B = pd.DataFrame(data_B_c,columns=headers)
df_B.set_index('Name', inplace=True)
print(len(df_B.index))
At the moment I have to copy this block and change the file name for each new file, which is time-consuming and practically impossible given that my folder has 158 files in total.
Does anybody know how to run this entire process more efficiently?
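A minimal sketch of one way to do it: glob over every PDF in the folder and keep each result in a dict keyed by file name, assuming every file has the same layout (including the 'Name' column) as A.pdf and B.pdf. Since read_pdf already returns a list of DataFrames, pd.concat does the same stacking as the chain calls:
import glob
import os
import pandas as pd
from tabula import read_pdf

folder = 'D:\\Code\\Scraping\\DMKQ'
dataframes = {}  # one DataFrame per PDF, keyed by file name

for pdf_path in glob.glob(os.path.join(folder, '*.pdf')):
    tables = read_pdf(pdf_path, pages='all', encoding='latin1')  # list of per-page tables
    df = pd.concat(tables, ignore_index=True)  # stack them into one frame
    df.set_index('Name', inplace=True)
    name = os.path.splitext(os.path.basename(pdf_path))[0]  # 'A', 'B', ...
    dataframes[name] = df
    print(name, len(df.index))
Each file then ends up as dataframes['A'], dataframes['B'], and so on, instead of hand-named df_A / df_B variables.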

Related

Speed up reading multiple pickle (or csv?) files

At my company the sales data for each month is stored in a folder as a CSV file. To speed up the reading process in Python I am converting the CSV files to pickle files. Right now I have the following code to read all the individual pickle files and append them together into one dataframe:
import glob
import os
import pandas as pd
# Enter path of folder
path = "link to the folder"
# find all pickle files
all_files = glob.glob(path + "/*.pkl")
df = pd.concat(
(pd.read_pickle(file).assign(filename=file) for file in all_files),
ignore_index=True,
)
I have 38 individual pickle files and their total size is 95 MB. That doesn't seem like a lot to me, but it still takes 56 s to load all the data into the dataframe.
Is there anything that can speed up this process? Many thanks in advance!
Best,
Kav
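If most of the 56 s is spent waiting on the drive (a network share, for example), reading the pickles concurrently may help; here is a sketch using the same path variable. If the time is actually CPU-bound unpickling, switching the storage format to Parquet or Feather is the more usual fix:
import glob
import os
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

path = "link to the folder"  # placeholder path from the question
all_files = glob.glob(os.path.join(path, "*.pkl"))

def load_one(file):
    # read a single pickle and remember which file it came from
    return pd.read_pickle(file).assign(filename=file)

with ThreadPoolExecutor() as pool:
    frames = list(pool.map(load_one, all_files))

df = pd.concat(frames, ignore_index=True)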

Loop/iterate through a directory of excel files & add to the bottom of the dataframe

I am currently working on importing and formatting a large number of Excel files (all with the same format/schema, but different values) with Python.
I have already read in and formatted one file, and everything worked fine so far.
I now want to do the same for all the other files and combine everything into one dataframe, i.e. read the first Excel file into a dataframe, append the second at the bottom of the dataframe, append the third at the bottom, and so on until all the Excel files are in one dataframe.
So far my script looks something like this:
import pandas as pd
import numpy as np
import xlrd
import os
path = os.getcwd()
path = "path of the directory"
wbname = "name of the excel file"
files = os.listdir(path)
files
wb = xlrd.open_workbook(path + wbname)
# I only need the second sheet
df = pd.read_excel(path + wbname, sheet_name="sheet2", skiprows=2, header=None,
                   skipfooter=132)
# here is where all the formatting is happening ...
df
So, "files" is a list with all file relevant names. Now I have to try to put one file after the other into a loop (?) so that they all eventually end up in df.
Has anyone ever done something like this or can help me here.
Something like this might work:
import os
import pandas as pd

folder = 'path_to_all_xlsx'
list_dfs = []
for file in os.listdir(folder):
    # os.listdir returns bare file names, so join each one with the folder path
    df = pd.read_excel(os.path.join(folder, file), <the rest of your config to parse>)
    list_dfs.append(df)
all_dfs = pd.concat(list_dfs)
You read all the dataframes and add them to a list, and then the concat method joins them all together into one big dataframe.
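For this question specifically, the placeholder can be filled in with the read_excel options already used for the single file; a sketch, assuming every workbook shares that layout (the folder path is still the placeholder from the question):
import os
import pandas as pd

path = "path of the directory"  # placeholder from the question

list_dfs = []
for file in os.listdir(path):
    if not file.endswith(".xlsx"):
        continue  # skip anything that is not an Excel workbook
    df = pd.read_excel(os.path.join(path, file), sheet_name="sheet2",
                       skiprows=2, header=None, skipfooter=132)
    # ... per-file formatting goes here ...
    list_dfs.append(df)

all_dfs = pd.concat(list_dfs, ignore_index=True)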

Concatenating Excel and CSV files

I've been asked to compile data files into one Excel spreadsheet using Python, but they are all either Excel files or CSVs. I'm trying to use the following code:
import glob, os
import shutil
import pandas as pd
par = set(glob.glob("*Light*")) - set(glob.glob("*all*")) - set(glob.glob("*Untitled"))
par
df = pd.DataFrame()
for file in par:
    print(file)
    df = pd.concat([df, pd.read(file)])
Is there a way I can use the pd.concat function with files in more than one format (i.e. both xlsx and csv), instead of just one or the other?
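pandas has no generic pd.read; one approach is to pick the reader from the file extension inside the loop. Here is a sketch reusing the glob filters from the question ('combined.xlsx' is just a made-up output name):
import glob
import pandas as pd

par = set(glob.glob("*Light*")) - set(glob.glob("*all*")) - set(glob.glob("*Untitled"))

frames = []
for file in sorted(par):
    # choose the reader based on the extension
    if file.lower().endswith((".xlsx", ".xls")):
        frames.append(pd.read_excel(file))
    elif file.lower().endswith(".csv"):
        frames.append(pd.read_csv(file))

df = pd.concat(frames, ignore_index=True)
df.to_excel("combined.xlsx", index=False)  # made-up output file name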

How to avoid truncating my CSV output in a Python for loop?

I have a large folder of CSV files and I need to go into each one and add a new column with a new field.
My code seems to only return the number of rows that are in the first file. All my output files now have only 67 rows. I'm thinking this is because the first CSV had 67 rows and then my code just stuck with that? Any ideas?
import pandas as pd
import glob, os
files = glob.glob('path/*.csv')
for file in files:  # loop through each file
    df['client'] = 'newContent'
    df.to_csv(file)
If my comment wasn't clear enough, here's the modified program:
import pandas as pd
import glob, os
files = glob.glob('path/*.csv')
for file in files:  # loop through each file
    df = pd.read_csv(file)  # re-read the current file so df is not stale
    df['client'] = 'newContent'
    df.to_csv(file)
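One detail worth noting: to_csv writes the DataFrame index by default, so every rerun of the loop prepends another unnamed column to each file. If that is not wanted, the last line can pass index=False (a small tweak, not part of the original answer):
    df.to_csv(file, index=False)  # write only the original columns plus 'client'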

How to merge multiple Excel files from a folder and its subfolders using Python

I have multiple Excel spreadsheets in a given folder and its subfolders. They all have the same file name string, with the date and time as a suffix. How do I merge them all into one single file, using the worksheet name and titles as the index when appending the data frames? Typically the subfolders would hold either ~100 small files of about 200 KB each or ~10 files of about 20 MB each.
This may help you merge all the xlsx files in the current directory.
import glob
import os
import pandas as pd
frames = []
for file in glob.glob(os.getcwd() + "\\*.xlsx"):
    cn = pd.read_excel(file)
    frames.append(cn)
output = pd.concat(frames, ignore_index=True)
output.to_csv(os.getcwd() + "\\outPut.csv", index=False, na_rep="NA", header=None)
print("Completed")
Note: you need an Excel reader library along with pandas to read xlsx files (xlrd 1.1.0 for older pandas versions, openpyxl for current ones).
I have tried a version that uses static file name definitions (below); it would be better if it did the consolidation by column header from a dynamically picked file list, covering anything ending in .xls* (xls / xlsx / xlsb / xlsm), plus .csv and .txt.
import pandas as pd
db = pd.read_excel("/data/Sites/Cluster1 0815.xlsx")
db1 = pd.read_excel("/data/Sites/Cluster2 0815.xlsx")
db2 = pd.read_excel("/data/Sites/Cluster3 0815.xlsx")
sdb = db.append(db1)
sdb = sdb.append(db2)
sdb.to_csv("/data/Sites/sites db.csv", index = False, na_rep = "NA", header=None)
The dynamic file list merge produced the expected output; however, the processing time still has to be accounted for. When run on batches of files, the code raised an error (note that these files carry asymmetric information).
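The glob loop in the first answer only looks in the current directory; to also walk subfolders and cover the mixed extensions listed in the question, glob's recursive ** pattern can be combined with per-extension readers. A sketch, reusing the /data/Sites folder from the code above and assuming the files have compatible columns and comma-delimited .txt files (.xlsb additionally needs the pyxlsb engine installed):
import glob
import os
import pandas as pd

root = "/data/Sites"  # root folder, taken from the paths above
frames = []
for file in glob.glob(os.path.join(root, "**", "*"), recursive=True):
    if file.lower().endswith((".xls", ".xlsx", ".xlsm", ".xlsb")):
        frames.append(pd.read_excel(file))
    elif file.lower().endswith((".csv", ".txt")):
        frames.append(pd.read_csv(file))

merged = pd.concat(frames, ignore_index=True, sort=False)
merged.to_csv(os.path.join(root, "sites db.csv"), index=False, na_rep="NA")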
