I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
Here is my code:
import os
pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")] #get all files in directory
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))] #add directory path
import pandas as pd
df = pd.DataFrame(columns=['FileName','Text'])
for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    df.append({'FileName': pdf_files[i], 'Text': scraped_text[i]}, ignore_index=True)
df.to_csv('output.csv')
The variables have the following values:
pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']
pdf_files_path: ['C:\\My\\Directory Path\\12280_2007_Article_9000.pdf', etc...]
df:
Empty DataFrame
Columns: [FileName, Text]
Index: []
Update: based on a suggestion from @AMC I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters in the first PDF file, rather than looping through each file in the directory. Also, the contents of the loop are not getting written to the dataframe or CSV.
12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
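For reference, both symptoms are consistent with two details in the loop above: DataFrame.append returns a new frame rather than modifying df in place, and scraped_text[i] picks out a single character of the extracted string instead of storing the whole text. A sketch of the loop with those two lines changed (not from the original post; note that DataFrame.append was removed in pandas 2.0 in favour of pd.concat):
import pandas as pd
df = pd.DataFrame(columns=['FileName', 'Text'])
for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])  # full text of one PDF
    # keep the frame returned by append(), and store the whole string
    df = df.append({'FileName': pdf_files[i], 'Text': scraped_text}, ignore_index=True)
df.to_csv('output.csv')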
I guess you don't need pandas for that. You can make it simpler by using the standard library's csv module.
Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.
Here is an almost complete example:
import csv
from pathlib import Path
folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')
with csv_file.open('w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['FileName', 'Text'])
    for pdf_file in folder.glob('*.pdf'):
        pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
        writer.writerow([pdf_file.name, pdf_text])
Another thing to bear in mind is to be sure pdf_text will be a single line, or else your csv file will be kind of broken. One way to work around that is to pick an arbitrary character to use in place of the newline marks. If you pick the pipe character, for example, then you can do something like this prior to writer.writerow:
pdf_text = pdf_text.replace('\n', '|')
It is not meant to be a complete example but a starting point. I hope it helps.
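If you later need the original text back, the substitution can be inverted when reading. A small sketch, assuming no literal '|' characters occurred in the PDF text:
import csv
with open('c:/path/to/output.csv', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for file_name, text in reader:
        original_text = text.replace('|', '\n')  # restore the newlines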
I have a large csv file containing information on sampled pathogens representing several different species. I want to split this csv file by species, so I will have one csv file per species. The data in the file aren't in any particular order. My csv file looks like this:
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044420,EQUI0208,1336,Streptococcus equi,15/10/2010,2010,Belgium,Belgium
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852528,2789STDY5834916,154046,Hungatella hathewayi,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852530,2789STDY5834918,33039,Ruminococcus torques,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852533,2789STDY5834921,40520,Blautia obeum,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852535,2789STDY5834923,1150298,Fusicatenibacter saccharivorans,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852537,2789STDY5834925,1407607,Fusicatenibacter,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852540,2789STDY5834928,39492,Eubacterium siraeum,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852544,2789STDY5834932,292800,Flavonifractor plautii,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852551,2789STDY5834939,169435,Anaerotruncus colihominis,2013,2013,United Kingdom,UK
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044418,EQUI0206,1336,Streptococcus equi,05/02/2010,2010,Belgium,Belgium
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044419,EQUI0207,1336,Streptococcus equi,29/07/2010,2010,Belgium,Belgium
The name of the species is at index 5.
I originally tried this:
import csv
from itertools import groupby
for key, rows in groupby(csv.reader(open("file.csv")), lambda row: row[5]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
But this fails because the data aren't ordered by species, and there isn't an append argument for the output (that I'm aware of), so each time the script encounters a new entry for a species it has already written to a file, it overwrites the earlier entries.
Is there a simple way to order the data by species and then execute the above script or a way to append the output of the above script to a file instead of overwriting it?
Also I'd ideally like each of the output files to be named after the species they contain.
Thanks.
In reference to your comment: "there isn't an append argument for the output (that I'm aware of)", you can use 'a' instead of 'w' to append to the file, like:
with open("%s.csv" % key, "a")
This is probably not the best approach, though, because if you run the code twice you'll get every row duplicated.
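Put together with the loop from the question, that looks like the sketch below; note again that re-running it appends duplicates:
import csv
from itertools import groupby
# Same grouping as the question, but each species file is opened in
# append mode, so out-of-order groups accumulate instead of overwriting.
for key, rows in groupby(csv.reader(open("file.csv")), lambda row: row[5]):
    with open("%s.csv" % key, "a") as output:
        for row in rows:
            output.write(",".join(row) + "\n")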
You could sort the csv rows using the same lambda function as you're using for the groupby operation:
import csv
from itertools import groupby
groupfunc = lambda row: row[5]
for key, rows in groupby(sorted(csv.reader(open("file.csv")), key=groupfunc), groupfunc):
    with open("%s.csv" % key, "w") as output:
        cw = csv.writer(output)
        cw.writerows(rows)
Notes:
I rewrote the write routine to use the csv module for output.
I created a variable for your lambda to avoid copy-pasting it.
Note that you have to clean up your csv files if you change your input data, because if one species isn't in the new data, the old csv remains on disk. I would do that with some code like:
import glob, os
for f in glob.glob("*.csv"):
    os.remove(f)
But beware of the *.csv pattern because it's too wide and it may be a little too effective on your other csv files :)
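If you want a narrower blast radius, one option (a sketch, assuming the input is file.csv as above) is to collect the species keys first and delete only those files:
import csv
import os
# Remove only the per-species files this script would regenerate.
with open("file.csv") as f:
    keys = {row[5] for row in csv.reader(f)}
for key in keys:
    try:
        os.remove("%s.csv" % key)
    except OSError:
        pass  # the file didn't exist yet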
Note: This method uses sort and is therefore more memory-hungry. You could instead open each file in append mode, as the other solution suggests, to save memory at the cost of more file I/O.
I am a beginner in Python. I have searched for my problem but could not find an exact match.
I have a folder in which multiple files are stored for each experimental measurement. Their names follow a trend, e.g. XY0001.csv, XY0002.csv ... XY0040.csv. I want to read all of these files, take the average of each column across all files, and store the result in 'result.csv' in the same format.
I would suggest using pandas (import pandas as pd). Start by reading the files using pd.read_csv(). How to read them exactly depends on how your CSV files are formatted, which I cannot tell from here. If you want to read all files in a directory (which may be the easiest route for this problem), glob for them first.
Then, you could concatenate all files using pd.concat(). Lastly, you can calculate the metrics you want to generate (use the search functionality to find how to calculate each specific metric). A nice function that does a lot of stuff for you is the describe function.
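As a rough sketch of that pipeline (the XY*.csv pattern comes from your file names; adjust the read options to your format):
import glob
import pandas as pd
# Read every matching file and stack them into one dataframe.
frames = [pd.read_csv(name) for name in sorted(glob.glob("XY*.csv"))]
combined = pd.concat(frames, ignore_index=True)
# One row of per-column means, written in the same column format.
combined.mean().to_frame().T.to_csv("result.csv", index=False)
# describe() summarizes count/mean/std/min/quartiles/max per column.
print(combined.describe())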
To access multiple files you can use the glob module.
import glob
path = r'/home/root/csv_directory'
filenames = glob.glob(path + "/*.csv")
Python's pandas module has a method to parse csv files. It also has options to manage and process them.
import pandas as pd
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
The .read_csv() method is used to parse csv files.
df = pd.concat(dfs, ignore_index=True)
.concat() concatenates all the data into one dataframe, which is easy to process.
The following makes use of the glob module to get a list of all files in the current folder of the form X*.csv, i.e. all CSV files starting with X. For each file it finds, it first skips a header row (optional) and then loads all remaining rows, using a zip() trick to transpose the list of rows into a list of columns.
For each column, it converts each cell into an integer and sums the values, dividing this total by the number of elements found, thus giving an average for each column. It then writes the values to your output result.csv in the format filename, av_col1, av_col2 etc:
import glob
import csv
with open('result.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    for filename in glob.glob('X*.csv'):
        print(filename)
        with open(filename, newline='') as f_input:
            csv_input = csv.reader(f_input)
            header = next(csv_input)
            averages = []
            for col in zip(*csv_input):
                averages.append(sum(int(x) for x in col) / len(col))
        csv_output.writerow([filename] + averages)
So if you had XY0001.csv containing:
Col1,Col2,Col3
6,1,10
2,1,20
5,2,30
result.csv would be written as follows:
XY0001.csv,4.333333333333333,1.3333333333333333,20.0
Tested using Python 3.5.2
I have a set of data saved across multiple .csv files with a fixed number of columns. Each column corresponds to a different measurement.
I would like to add a header to each file. The header will be identical for all files, and is comprised of three rows. Two of these rows are used to identify their corresponding columns.
I am thinking that I could save the header in a separate .csv file, then iteratively merge it with each data file using a for loop.
How can I do this in python? I am new to the language.
Yeah, you can do that easily with pandas. It will be faster and easier than what you're currently planning, which may create problems.
Three simple commands will be used for reading, merging, and writing the result to a new file:
pandas.read_csv()
pandas.merge()
pandas.to_csv()
You can read about the arguments you have to use, and more details about them, in the pandas documentation.
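As a sketch of that approach (the file names below are placeholders; note that for stacking a fixed header on top of each data file, pandas.concat is a closer fit than pandas.merge, which joins on key columns):
import glob
import pandas as pd
# Read the three header rows once, then prepend them to each data file.
header = pd.read_csv("header.csv", header=None)   # hypothetical header file
for name in glob.glob("data_*.csv"):              # hypothetical data files
    data = pd.read_csv(name, header=None)
    merged = pd.concat([header, data], ignore_index=True)
    merged.to_csv(name, index=False, header=False)  # overwrite with header added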
For your case, you may first need to create new files with the headers in them; then do another loop to add the rows, skipping the header.
import csv
with open("data_out.csv","a") as fout:
# first file:
with open("data.csv") as f: # you header file
for line in f:
fout.write(line)
with open("data_2.csv") as f:
next(f) # this will skip first line
for line in f:
fout.write(line)
Instead of running a for loop appending two files for multiple files, an easier solution would be to put all the csv files you want to merge into a single folder and feed the path to the program. This will merge all the csv files into a single csv file.
(Note: The attributes of each file must be the same.)
import os
import pandas as pd

# give the path to the folder containing the multiple csv files
path = r"path/to/your/csv/folder"  # placeholder: set this to your folder
dirList = os.listdir(path)

# Put all their names into a list
filenames = []
for item in dirList:
    if ".csv" in item:
        filenames.append(item)

# Create a dataframe and make sure it's empty (not required but safe practice if using for appending)
df1 = pd.DataFrame()
df1.drop(df1.index, inplace=True)

# Convert each file to a dataframe and append it to dataframe df1
for f in filenames:
    df = pd.read_csv(os.path.join(path, f))
    df1 = df1.append(df)

# Write the combined dataframe to a single csv file (output name is arbitrary)
df1.to_csv("merged.csv", encoding='utf-8', index=False)
I am fairly new to programming and I am currently stumped by an issue. I wrote a little script in Python to compare two csv files which contain usernames for email groups that must be maintained. The script I wrote was working perfectly for csv files in the 200-300 item range; however, I just started testing files with a couple thousand values and my script seems to miss the very last item on each list.
The idea here is that I have 2 csv files, an old list and a new list. The way I receive the files is a bit finicky, so before processing the lists, I create a new csv from each of the csvs I am dealing with: basically an old_clean csv and a new_clean csv. I then check each item in old_clean against new_clean to see if it is in the new list; if it's not, it gets added to remove.csv to be processed into the email system later. I then run the test the other way around to find new names, which go into add.csv. The issue I have is that the last name on the list, which is on both the old and new csv file, shows up in both add.csv and remove.csv.
As I stated, this only happens with larger files. My code is below, any help would be appreciated.
import sys
import csv
import re
import os
import time
###Works in python 2.5###
#create a new csv for cleaned values from first csv entered
o = open("first.csv","w")
data = open(sys.argv[1]).read()
o.write( re.sub(" ","",data) )
o.close()
#create a new csv for cleaned values from second csv entered
n = open("second.csv","w")
data = open(sys.argv[2]).read()
n.write( re.sub(" ","",data) )
n.close()
#create csv of names to remove from group
remove = open("Changes/remove.csv","w")
#create csv of names to add to group
add = open("Changes/add.csv","w")
time.sleep(3)
#adds any names from first list not found in second list to the remove.csv
for line in open("first.csv"):
if line not in open("second.csv"):
remove.write(line)
remove.close()
#adds any names from second list not found in first list to the add.csv
for line in open ("second.csv"):
if line not in open("first.csv"):
add.write(line)
add.close()
#remove the generated "clean" csv files
os.remove("first.csv")
os.remove("second.csv")