Extract text from multiple PDFs and write to a single CSV - python

I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
Here is my code:
import os
pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")]  # get all PDF files in the directory
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))]  # add the directory path

import pandas as pd
df = pd.DataFrame(columns=['FileName', 'Text'])
for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    df.append({'FileName': pdf_files[i], 'Text': scraped_text[i]}, ignore_index=True)
df.to_csv('output.csv')
The variables have the following values:
pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']
pdf_files_path: ['C:\\My\\Directory\\Path\\12280_2007_Article_9000.pdf', etc...]
df:
Empty DataFrame
Columns: [FileName, Text]
Index: []
Update: based on a suggestion from @AMC I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters of the first PDF file rather than through each file in the directory. Also, the contents of the loop are not getting written to the dataframe or CSV.
12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
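In other words, scraped_text is already a single string, so scraped_text[i] picks out its i-th character, and DataFrame.append returns a new frame rather than modifying df in place. A minimal corrected loop (just a sketch, assuming convert_pdf_to_txt works as above) would be:
df = pd.DataFrame(columns=['FileName', 'Text'])
for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])  # one string per PDF
    # append() returns a new DataFrame, so the result must be reassigned
    df = df.append({'FileName': pdf_files[i], 'Text': scraped_text}, ignore_index=True)
df.to_csv('output.csv')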

I guess you don't need pandas for that. You can make it simpler by using the csv module from the standard library.
Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.
Here is an almost complete example:
import csv
from pathlib import Path

folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')

with csv_file.open('w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)  # QUOTE_ALL must be passed via the quoting keyword
    writer.writerow(['FileName', 'Text'])
    for pdf_file in folder.glob('*.pdf'):
        pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
        writer.writerow([pdf_file.name, pdf_text])
Another thing to bear in mind is to be sure pdf_text will be a single line, or else your CSV file will be kind of broken. One way to work around that is to pick an arbitrary character to use in place of the newline marks. If you pick the pipe character, for example, then you can do something like this prior to writer.writerow (as the example above already does inline):
pdf_text = pdf_text.replace('\n', '|')
It is not meant to be a complete example but a starting point. I hope it helps.

Related

Creating Multiple .txt files from an Excel file with Python Loop

My work is in the process of switching from SAS to Python and I'm trying to get a head start on it. I'm trying to write separate .txt files, each one representing all of the values of an Excel file column.
I've been able to upload my Excel sheet and create a .txt file fine, but for practical uses, I need to find a way to create a loop that will go through and make each column into its own .txt file, naming the file "ColumnName.txt".
Uploading Excel Sheet:
import pandas as pd
wb = pd.read_excel('placements.xls')
Creating single .txt file: (Named each column A-Z for easy reference)
with open("A.txt", "w") as f:
    for item in wb['A']:
        f.write("%s\n" % item)
Trying my hand at a for loop (to no avail):
import glob
for file in glob.glob("*.txt"):
    f = open((file.rsplit(".", 1)[0]) + ".txt", "w")
    f.write("%s\n" % item)
    f.close()
The first portion worked like a charm and gave me a .txt file with all of the relevant data.
When I used the glob command to attempt making some iterations, it doesn't error out, but it only gives me one output file (A.txt), and the only data point in A.txt is the letter A. I'm sure my inputs are way off... after scrounging around forever, this is what I found that made sense and ran, but I don't think I'm understanding the inputs going into the command, or whether what I'm running is just totally inaccurate.
Any help anyone would give would be much appreciated! I'm sure it's a simple loop, just hard to wrap your head around when you're so new to python programming.
Thanks again!
I suggest using pandas to write the files via to_csv, only changing the extension to .txt:
# Uploading Excel sheet:
import pandas as pd
df = pd.read_excel('placements.xls')

# Creating one .txt file per column (named A-Z for easy reference):
for col in df.columns:
    print(col)
    # Python 3.6+:
    df[col].to_csv(f"{col}.txt", index=False, header=None)
    # Python below 3.6:
    # df[col].to_csv("{}.txt".format(col), index=False, header=None)

Take average of each column in multiple csv files using Python

I am a beginner in Python. I have searched for this problem but could not find exactly what I need.
I have a folder in which multiple files are stored for each experimental measurement. Their names follow a trend, e.g. XY0001.csv, XY0002.csv ... XY0040.csv. I want to read all of these files and take the average of each column across all files, storing the result in 'result.csv' in the same format.
I would suggest using pandas (import pandas as pd). Start by reading the files with pd.read_csv(); how to read them exactly depends on how your CSV files are formatted, which I cannot tell from here. Reading all the files in a directory in one go may be the easiest solution for this problem.
Then you can concatenate all the files using pd.concat(). Lastly, you can calculate the metrics you want to generate (use the search functionality to find how to calculate each specific metric). A nice function that does a lot of the work for you is describe().
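A rough sketch of that approach (assuming all the CSVs share the same columns; the path and pattern are just examples):
import glob
import pandas as pd

# Read every measurement file and stack them into one DataFrame
files = sorted(glob.glob('/home/root/csv_directory/XY*.csv'))
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

print(combined.mean())      # per-column averages across all files
print(combined.describe())  # count, mean, std, min, quartiles, max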
For accessing multiple files you can use the glob module.
import glob
path = r'/home/root/csv_directory'
filenames = glob.glob(path + "/*.csv")
Python's pandas module has a method to parse CSV files, along with options to manage and process them.
import pandas as pd
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
The .read_csv() method is used to parse CSV files.
df = pd.concat(dfs, ignore_index=True)
.concat() concatenates all the data into one dataframe, which is easy to process.
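To get the averages the question asks for and store them as a single row in result.csv (a sketch building on the code above):
# mean() gives a Series of per-column averages; transpose it into one row and write it out
df.mean().to_frame().T.to_csv('result.csv', index=False)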
The following makes use of the glob module to get a list of all files in the current folder of the form X*.csv, i.e. all CSV files whose names start with X. For each file it finds, it first skips a header row (optional) and then loads all remaining rows, using a zip() trick to transpose the list of rows into a list of columns.
For each column, it converts each cell into an integer and sums the values, dividing this total by the number of elements found, thus giving an average for each column. It then writes the values to your output result.csv in the format filename, av_col1, av_col2 etc:
import glob
import csv

with open('result.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    for filename in glob.glob('X*.csv'):
        print(filename)
        with open(filename, newline='') as f_input:
            csv_input = csv.reader(f_input)
            header = next(csv_input)    # skip the header row
            averages = []
            for col in zip(*csv_input):
                averages.append(sum(int(x) for x in col) / len(col))
        csv_output.writerow([filename] + averages)
So if you had XY0001.csv containing:
Col1,Col2,Col3
6,1,10
2,1,20
5,2,30
result.csv would be written as follows:
XY0001.csv,4.333333333333333,1.3333333333333333,20.0
Tested using Python 3.5.2
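The zip(*rows) transpose used above works like this on the sample data:
rows = [['6', '1', '10'], ['2', '1', '20'], ['5', '2', '30']]
print(list(zip(*rows)))  # [('6', '2', '5'), ('1', '1', '2'), ('10', '20', '30')]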

Looping through multiple excel files in python using pandas

I know this type of question is asked all the time. But I am having trouble figuring out the best way to do this.
I wrote a script that reformats a single excel file using pandas.
It works great.
Now I want to loop through multiple excel files, perform the same reformat, and place the newly reformatted data from each excel sheet at the bottom, one after another.
I believe the first step is to make a list of all excel files in the directory.
There are so many different ways to do this so I am having trouble finding the best way.
Below is the code I currently using to import multiple .xlsx and create a list.
import os
import glob
os.chdir('C:\ExcelWorkbooksFolder')
for FileList in glob.glob('*.xlsx'):
    print(FileList)
I am not sure if the previous glob code actually created the list that I need.
Then I have trouble understanding where to go from there.
The code below fails at pd.ExcelFile(File)
I believe I am missing something....
# create for loop
for File in FileList:
    for x in File:
        # Import the excel file and call it xlsx_file
        xlsx_file = pd.ExcelFile(File)
        xlsx_file
        # View the excel file's sheet names
        xlsx_file.sheet_names
        # Load the xlsx file's Data sheet as a dataframe
        df = xlsx_file.parse('Data', header=None)
        # select important rows,
        df_NoHeader = df[4:]
        # then it does some more reformatting.
Any help is greatly appreciated
I solved my problem. Instead of using the glob function I used os.listdir to read all my excel sheets, looped through each excel file, reformatted it, then appended the final data to the end of a list.
# First create an empty appended_data list to store the info.
appended_data = []
for WorkingFile in os.listdir('C:\ExcelFiles'):
    if os.path.isfile(WorkingFile):
        # Import the excel file and call it xlsx_file
        xlsx_file = pd.ExcelFile(WorkingFile)
        # View the excel file's sheet names
        xlsx_file.sheet_names
        # Load the xlsx file's Data sheet as a dataframe
        df = xlsx_file.parse('sheet1', header=None)
        # ... do some reformatting; call the finished sheet reformatedDataSheet
        reformatedDataSheet
        appended_data.append(reformatedDataSheet)
appended_data = pd.concat(appended_data)
And that's it, it does everything I wanted.
you need to change
os.chdir('C:\ExcelWorkbooksFolder')
for FileList in glob.glob('*.xlsx'):
    print(FileList)
to just
os.chdir('C:\ExcelWorkbooksFolder')
FileList = glob.glob('*.xlsx')
print(FileList)
Why does this fix it? glob returns a single list. Since you put for FileList in glob.glob(...), you walk that list one item at a time, putting each result into FileList. At the end of your loop, FileList is a single filename - a single string.
When you do this code:
for File in FileList:
    for x in File:
the first line will assign File to the first character of the last filename (as a string). The second line will assign x to the first (and only) character of File. This is not likely to be a valid filename, so it throws an error.
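A quick demonstration of that behaviour (the filename is hypothetical):
FileList = 'book1.xlsx'   # after the loop, FileList is just a string
for File in FileList:
    print(File)           # prints 'b', 'o', 'o', 'k', '1', ...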

Python csv write module, need output into different rows instead of columns

I am trying to get a list of files for a certain path into a csv file. I get the desired results, but they are all in a single row. How can I make the output appear in different rows?
My code is as follows:
import csv
import os

path = raw_input("Enter Path:")
dirList = os.listdir(path)

csvOut = open('outputnew.csv', 'w')
w = csv.writer(csvOut)
w.writerow(dirList)
csvOut.close()
Call writerow multiple times to put the list into different rows:
for directory in dirList:
    w.writerow([directory])
(But in this case I don't see the need of using CSV...)
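Equivalently, writerows can take the whole list at once if each name is wrapped in a one-element list:
w.writerows([[d] for d in dirList])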

Find "string" in Text File - Add it to Excel File Using Python

I ran a grep command and found several hundred instances of a string in a large directory of data. The resulting file is 2 MB and has strings that I would like to extract and put into an Excel file for easy access later. The part that I'm extracting is a path to a data file I need to work on later.
I have been reading about Python lately and thought I could somehow do this extraction automatically. But I'm a bit stumped how to start. I have this so far:
data = open("C:\python27\text.txt").read()
if "string" in data:
But then I'm not sure what to use to get out of the file what I want. Anything for a beginner to chew on?
EDIT
Here is some more info on what I was looking for. I have several hundred lines in a text file. Each line has a path and some strings like this:
/path/to/file:STRING=SOME_STRING, ANOTHER_STRING
What I would like from these lines are the paths of those lines with a specific "STRING=SOME_STRING". For example if the line looks like this, I want the path (/path/to/file) to be extracted to another file:
/path/to/file:STRING=SOME_STRING
All this is quite easily done with standard Python, but for "Excel" (xls or xlsx) files you'd have to install a third-party library. However, if you just need a 2D table that can be opened in a spreadsheet, you can use Comma Separated Values (CSV) files - these are compatible with Excel and other spreadsheet software, and support comes integrated in Python.
As for searching for a string inside a file, it is straightforward. You may not even need regular expressions for most things. What information do you want along with the string?
Also, the os module in the standard library has some functions to list all files in a directory, or in a directory tree. The most straightforward is os.listdir(path).
String methods like count and find can be used beyond in to locate the string in a file, or to count the number of occurrences.
And finally, the csv module can write a properly formatted file that will open in any spreadsheet.
Along the way, you can use Python's built-in list objects as an easy way to manipulate your data sets.
Here is a sample program that counts occurrences of the strings given on the command line in the files of a given directory, and assembles a .csv table with them:
# -*- coding: utf-8 -*-
# (Python 2 example)
import csv
import sys, os

output_name = "count.csv"

def find_in_file(path, string_list):
    count = []
    file_ = open(path)
    data = file_.read()
    file_.close()
    for string in string_list:
        count.append(data.count(string))
    return count

def main():
    if len(sys.argv) < 3:
        print "Use: %s directory_path <string1> [string2 [...]]\n" % sys.argv[0]
        sys.exit(1)
    target_dir = sys.argv[1]
    string_list = sys.argv[2:]
    csv_file = open(output_name, "wt")
    writer = csv.writer(csv_file)
    header = ["Filename"] + string_list
    writer.writerow(header)
    for filename in os.listdir(target_dir):
        path = os.path.join(target_dir, filename)
        if not os.path.isfile(path):
            continue
        line = [filename] + find_in_file(path, string_list)
        writer.writerow(line)
    csv_file.close()

if __name__ == "__main__":
    main()
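Running it would look something like this (the script name is hypothetical), producing count.csv with one row per file and one count column per string:
python count_strings.py /path/to/data SOME_STRING ANOTHER_STRING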
The steps to do this are as follows:
Make a list of all files in the directory (This isn't necessary if you're only interested in a single file)
Extract the names of those files that you're interested in
In a loop, read in those files line by line
See if the line matches your pattern
Extract the part of the line before the first : character
So, the code would look something like this, provided your text files are formatted the way you've shown in the question and that this format is reliably correct:
import sys, os, glob

dir_path = sys.argv[1]
if dir_path[-1] != os.sep:
    dir_path += os.sep

# use standard *NIX wildcards to get your file names; in this case, all the files with a .txt extension
file_list = glob.glob(dir_path + '*.txt')

with open('out_file.csv', 'w') as out_file:
    for filename in file_list:
        with open(filename, 'r') as in_file:
            for line in in_file:
                if 'STRING=SOME_STRING' in line:
                    out_file.write(line.split(':')[0] + '\n')
This program would be run as python extract_paths.py path/to/directory and would give you a file called out_file.csv in your current directory.
This file can then be imported into Excel as a CSV file. If your input is less reliable than you've suggested, regular expressions might be a better choice.
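For instance, a regex anchored to the exact line format shown in the question (a sketch; the pattern is an assumption about your data) could replace the substring test:
import re

# path, a colon, then STRING=SOME_STRING either at end of line or followed by a comma
line_re = re.compile(r'^([^:]+):STRING=SOME_STRING(?:,|$)')

m = line_re.match(line)
if m:
    out_file.write(m.group(1) + '\n')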
