Reading CSV files and rewriting them without certain rows Python - python

I am new to programming. I have hundreds of CSV files in a folder and certain files have the letters DIF in the second column. I want to rewrite the CSV files without those lines in them. I have attempted doing that for one file and have put my attempt below. I need also need help getting the program to do that dfor all the files in my directory. Any help would be appreciated.
Thank you
import csv
reader=csv.reader(open("40_5.csv","r"))
for row in reader:
if row[1] == 'DIF':
csv.writer(open('40_5N.csv', 'w')).writerow(row)

I made some changes to your code:
import csv
import glob
import os
fns = glob.glob('*.csv')
for fn in fns:
reader=csv.reader(open(fn,"rb"))
with open (os.path.join('out', fn), 'wb') as f:
w = csv.writer(f)
for row in reader:
if not 'DIF' in row:
w.writerow(row)
The glob command produces a list of all files ending with .csv in the current directory. If you want to give the source directory as an argument to your program, have a look into sys.argv or argparse (especially the latter is very powerful for command line parsing).
You also have to be careful when opening a file in 'w' mode: It means truncating the file, i.e. in your loop you would always overwrite the existing file, ending up in only one csv line.
The direcotry 'out' must exist or the script will produce an IOError.
Links:
open
sys.argv
argparse
glob

Most sequence types support the in or not in operators, which are much simpler to use to test for values than figuring index positions.
for row in reader:
if not 'DIF' in row:
csv.writer(open('40_5N.csv', 'w')).writerow(row)

If you're willing to install numpy, you can also read a csv file into the convenient numpy array format with either recfromcsv or the more general genfromtxt (genfromtxt requires you specify the comma delimiter), and you can specify which rows and columns to ignore. Documentation can be found here for genfromtxt:
http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
And here for recfromcsv: http://nullege.com/codes/search/numpy.recfromcsv?fulldoc=1

Related

Extract text from multiple PDFs and write to a single CSV

I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
Here is my code:
import os
pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")] #get all files in directory
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))] #add directory path
import pandas as pd
df = pd.DataFrame(columns=['FileName','Text'])
for i in range(len(pdf_files)):
scraped_text = convert_pdf_to_txt(pdf_files_path[i])
df.append({ 'FileName': pdf_files[i], 'Text': scraped_text[i]},ignore_index=True)
df.to_csv('output.csv')
The variables have the following values:
pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']
pdf_files_path: ['C:\\My\\Directory Path\\12280_2007_Article_9000.pdf', etc...]
Empty DataFrame
Columns: [FileName, Text]
Index: []
Update: based on a suggestion from #AMC I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters in the first PDF file, rather than looping through each file in the directly. Also, the contents of the loop are not getting written to the dataframe or CSV.
12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
I guess you don't need pandas for that. You can make it simpler by using the standard library csv.
Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.
Here is an almost complete example:
import csv
from pathlib import Path
folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')
with csv_file.open('w', encoding='utf-8') as f:
writer = csv.writer(f, csv.QUOTE_ALL)
writer.writerow(['FileName', 'Text'])
for pdf_file in folder.glob('*.pdf'):
pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
writer.writerow([pdf_file.name, pdf_text])
Another thing to bear in mind is to be sure pdf_text will be a single line or else your csv file will be kind of broken. One way to work around that is to pick an arbitrary character to use in place of the new line marks. If you pick the pipe character, for example, than you can do something like this, prior to writer.writerow:
pdf_text.replace('\n', '|')
It is not meant to be a complete example but a starting point. I hope it helps.

How can I make the program to load all text files in glob one by one?

Every time, all the operations are taking place only on the first file of glob folder..
The code is not updating the file in globs.. It's is taking only the first text file.
The code is written in the image..
There is no error msg. I want it to perform same operations on all text files inside glob.
From what I understood, you need to first open your writer, and then read what you need to read and write what you read in a loop . (then close your writer if you need) Not specifically your answer but should be something similar to this:
with open('some.csv', 'w') as somefile:
for filename in glob.glob('path'):
with open(filename, 'r') as file_to_read:
some_data=file_to_read.readlines()
#if you need loop again in data
#or do whatever you want
for i in range(len(some_data)):
data=some_data[i]
writer=csv.writer(somefile)
writer.writerow([filename, data])

How do you use Python to print a list of files from a CSV?

I was hoping to use a CSV file with a list of file paths in one column and use Python to print the actual files.
We are using Window 7 64-bit.
I have got it to print a file directly:
import os
os.startfile(r'\\fileserver\Sales\Sell Sheet1.pdf, 'print')
The issues comes in when I bring in the CSV file. I think I'm not formatting it correctly because I keep getting:
FileNotFoundError: [WinError2] The system cannot find the file specified: "['\\\\fileserver\\Sales\\Sell Sheet1']"
This is where I keep getting hung up:
import os
import csv
with open (r'\\fileserver\Sales\TestList.csv') as csv_file:
TestList = csv.reader(csv_file, delimiter=',')
for row in TestList:
os.startfile(str(row),'print')
My sample CSV file contains:
\\fileserver\Sales\SellSheet1
\\fileserver\Sales\SellSheet2
\\fileserver\Sales\SellSheet3
Is this an achievable goal?
You shouldn't be using str() there. The CSV reader gives you a list of rows, and each row is a list of fields. Since you just want the first field, you should get that:
os.startfile(row[0], 'print')

Take average of each column in multiple csv files using Python

I am a beginner in Python. I have searched about my problem but could not find the exact requirement.
I have a folder in which there are multiple files getting scored for each experimental measurement. Their names follow a trend, e.g. XY0001.csv, XY0002.csv ... XY0040.csv. I want to read all of these files and take the average of each column in all files, storing it in 'result.csv' in the same format.
I would suggest to use pandas (import pandas as pd). I suggest to start by reading the file using pd.read_csv(). How to read the files exactly depends on how your CSV files are formatted, I cannot tell that from here. If you want to read all files in a directory (which may be the easiest solution for this problem), try to use read all files.
Then, you could concatenate all files using pd.concat(). Lastly, you can calculate the metrics you want to generate (use the search functionality to find how to calculate each specific metric). A nice function that does a lot of stuff for you is the describe function.
For access multiple files you can use glob module.
import glob
path =r'/home/root/csv_directory'
filenames = glob.glob(path + "/*.csv")
Python's pandas module have a method to parse csv file. It also some options to manage and process csv files.
import pandas as pd
dfs = []
for filename in filenames:
dfs.append(pd.read_csv(filename))
.read_csv() method is used for parse csv files.
pd.concat(dfs, ignore_index=True)
.concat() used to concatenate all data into one dataframe and its easy for processing.
The following makes use of the glob module to get a list of all files in the current folder of the form X*.csv, i.e. all CSV files starting with x. For each file it finds, it first skips a header row (optional) and it then loads all remaining rows using a zip() trick to transpose the list of rows into a list of columns.
For each column, it converts each cell into an integer and sums the values, dividing this total by the number of elements found, thus giving an average for each column. It then writes the values to your output result.csv in the format filename, av_col1, av_col2 etc:
import glob
import csv
with open('result.csv', 'w', newline='') as f_output:
csv_output = csv.writer(f_output)
for filename in glob.glob('X*.csv'):
print (filename)
with open(filename, newline='') as f_input:
csv_input = csv.reader(f_input)
header = next(csv_input)
averages = []
for col in zip(*csv_input):
averages.append(sum(int(x) for x in col) / len(col))
csv_output.writerow([filename] + averages)
So if you had XY0001.csv containing:
Col1,Col2,Col3
6,1,10
2,1,20
5,2,30
result.csv would be written as follows:
XY0001.csv,4.333333333333333,1.3333333333333333,20.0
Tested using Python 3.5.2

Python error: need more than one value to unpack

Ok, so I'm learning Python. But for my studies I have to do rather complicated stuff already. I'm trying to run a script to analyse data in excel files. This is how it looks:
#!/usr/bin/python
import sys
#lots of functions, not relevant
resultsdir = /home/blah
filename1=sys.argv[1]
filename2=sys.argv[2]
out = open(sys.argv[3],"w")
#filename1,filename2="CNVB_reads.403476","CNVB_reads.403447"
file1=open(resultsdir+"/"+filename1+".csv")
file2=open(resultsdir+"/"+filename2+".csv")
for line in file1:
start.p,end.p,type,nexons,start,end,cnvlength,chromosome,id,BF,rest=line.split("\t",10)
CNVs1[chr].append([int(start),int(end),float(BF)])
for line in file2:
start.p,end.p,type,nexons,start,end,cnvlength,chromosome,id,BF,rest=line.split("\t",10)
CNVs2[chr].append([int(start),int(end),float(BF)])
These are the titles of the columns of the data in the excel files and I want to split them, I'm not even sure if that is necessary when using data from excel files.
#more irrelevant stuff
out.write(filename1+","+filename2+","+str(chromosome)+","+str(type)+","+str(shared)+"\n")
This is what it should write in my output, 'shared' is what I have calculated, the rest is already in the files.
Ok, now my question, finally, when I call the script like that:
python script.py CNVB_reads.403476 CNVB_reads.403447 script.csv in my shell
I get the following error message:
start.p,end.p,type,nexons,start,end,cnvlength,chromosome,id,BF,rest=line.split("\t",10)
ValueError: need more than 1 value to unpack
I have no idea what is meant by that in relation to the data... Any ideas?
The line.split('\t', 10) call did not return eleven elements. Perhaps it is empty?
You probably want to use the csv module instead to parse these files.
import csv
import os
for filename, target in ((filename1, CNVs1), (filename2, CNVs2)):
with open(os.path.join(resultsdir, filename + ".csv"), 'rb') as csvfile:
reader = csv.reader(csvfile, delimiter='\t')
for row in reader:
start.p, end.p = row[:2]
BF = float(row[8])
target[chr].append([int(start), int(end), BF])

Categories