Using the csv module with readlines() - Python

Yesterday I posted the below link:
Python CSV Module read and write simultaneously
Several people suggested: "If file b is not extremely large I would suggest using readlines() to get a list of all lines and then iterate over the list and change lines as needed."
I still want to use the functionality of the csv module while doing what they suggested. I am new to Python and don't quite understand how to do this.
Could someone please provide me with an example of how I should do this?

Here is a sample that reads a CSV file using a DictReader and uses a DictWriter to write to stdout. The file has a column named PERCENT_CORRECT_FLAG, and this modifies the CSV file to set this field to 0.
#!/usr/bin/env python
from __future__ import with_statement
from __future__ import print_function
from csv import DictReader, DictWriter
import sys

def modify_csv(filename):
    with open(filename) as f:
        reader = DictReader(f)
        writer = DictWriter(sys.stdout, fieldnames=reader.fieldnames)
        for i, s in enumerate(writer.fieldnames):
            print(i, s, file=sys.stdout)
        for row in reader:
            row['PERCENT_CORRECT_FLAG'] = '0'
            writer.writerow(row)

if __name__ == '__main__':
    for filename in sys.argv[1:]:
        modify_csv(filename)
If you do not want to write to stdout, you can open another file for write and use that. Note that if you want to write back to the original file, you have to either:
Read the file into memory and close the file before opening for write or
Open a file with a different name for write and rename it after closing it.
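A minimal sketch of that second, rename-based approach (Python 3 syntax; the helper name `zero_flag_column` and the temp-file handling are my own, not from the question):

```python
import csv
import os
import tempfile

def zero_flag_column(filename):
    # Write the modified rows to a temporary file in the same directory,
    # then rename it over the original once writing has finished.
    dirname = os.path.dirname(os.path.abspath(filename))
    with open(filename, newline='') as src:
        reader = csv.DictReader(src)
        fd, tmpname = tempfile.mkstemp(dir=dirname, suffix='.csv')
        with os.fdopen(fd, 'w', newline='') as dst:
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                row['PERCENT_CORRECT_FLAG'] = '0'
                writer.writerow(row)
    os.replace(tmpname, filename)  # atomically replaces the original file
```

Because the temporary file is on the same filesystem as the original, the final rename is atomic, so a crash mid-write never leaves you with a half-written CSV.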


How to parse jsonlines file using pandas

I am new to Python and am trying to parse data from a file that contains millions of lines. I tried to open it in Excel, but it fails on a file that large. How can I parse the information efficiently and export it to an Excel file so that it is easier for other people to read?
I tried using this code provided by someone else, but no luck so far:
import re
import pandas as pd

def clean_data(filename):
    with open(filename, "r") as inputfile:
        for row in inputfile:
            if re.match("\[", row) is None:
                yield row

with open(clean_file, 'w') as outputfile:
    for row in clean_data(filename):
        outputfile.write(row)
NameError: name 'clean_file' is not defined
It looks like clean_file is not defined, which is probably a problem from copy/pasting code.
Did you mean to write to a file called "clean_file"? In that case you need to wrap it in quotes: with open("clean_file", 'w')
If you want to work with JSON, I suggest looking into the json package, which has lots of tools for loading and parsing JSON. Otherwise, if the JSON is flat, you can just use the built-in pandas function read_json.
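For a JSON-lines file specifically, every line is an independent JSON document, so the stdlib json module can parse it one record at a time; pandas can also do it in one call with pd.read_json(path, lines=True). A small sketch (the function name and filename are illustrative):

```python
import json

def read_jsonlines(path):
    # Each line of a .jsonl file is a complete JSON object;
    # skip blank lines and parse the rest individually.
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Parsing line by line keeps memory use proportional to one record, which matters for a file with millions of lines.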

How do you use Python to print a list of files from a CSV?

I was hoping to use a CSV file with a list of file paths in one column and use Python to print the actual files.
We are using Windows 7 64-bit.
I have got it to print a file directly:
import os
os.startfile(r'\\fileserver\Sales\Sell Sheet1.pdf', 'print')
The issue comes in when I bring in the CSV file. I think I'm not formatting it correctly, because I keep getting:
FileNotFoundError: [WinError 2] The system cannot find the file specified: "['\\\\fileserver\\Sales\\Sell Sheet1']"
This is where I keep getting hung up:
import os
import csv

with open(r'\\fileserver\Sales\TestList.csv') as csv_file:
    TestList = csv.reader(csv_file, delimiter=',')
    for row in TestList:
        os.startfile(str(row), 'print')
My sample CSV file contains:
\\fileserver\Sales\SellSheet1
\\fileserver\Sales\SellSheet2
\\fileserver\Sales\SellSheet3
Is this an achievable goal?
You shouldn't be using str() there. The CSV reader gives you a list of rows, and each row is a list of fields. Since you just want the first field, you should get that:
os.startfile(row[0], 'print')
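Putting it together, the loop might look like the sketch below. Since os.startfile is Windows-only, the path extraction is pulled into its own helper here (the helper name is my own):

```python
import csv

def read_paths(csv_path):
    # Collect the first field of every non-empty row.
    with open(csv_path) as f:
        return [row[0] for row in csv.reader(f) if row]

# On Windows, each path could then be sent to the printer:
# import os
# for path in read_paths(r'\\fileserver\Sales\TestList.csv'):
#     os.startfile(path, 'print')
```

The `if row` guard skips blank lines, which would otherwise raise an IndexError on row[0].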

Reading CSV files and rewriting them without certain rows Python

I am new to programming. I have hundreds of CSV files in a folder, and certain files have the letters DIF in the second column. I want to rewrite the CSV files without those lines in them. I have attempted doing that for one file and have put my attempt below. I also need help getting the program to do that for all the files in my directory. Any help would be appreciated.
Thank you
import csv

reader = csv.reader(open("40_5.csv", "r"))
for row in reader:
    if row[1] == 'DIF':
        csv.writer(open('40_5N.csv', 'w')).writerow(row)
I made some changes to your code:
import csv
import glob
import os

fns = glob.glob('*.csv')
for fn in fns:
    reader = csv.reader(open(fn, "rb"))
    with open(os.path.join('out', fn), 'wb') as f:
        w = csv.writer(f)
        for row in reader:
            if 'DIF' not in row:
                w.writerow(row)
The glob command produces a list of all files ending with .csv in the current directory. If you want to give the source directory as an argument to your program, have a look into sys.argv or argparse (especially the latter is very powerful for command line parsing).
You also have to be careful when opening a file in 'w' mode: It means truncating the file, i.e. in your loop you would always overwrite the existing file, ending up in only one csv line.
The directory 'out' must exist, or the script will raise an IOError.
Links:
open
sys.argv
argparse
glob
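As a minimal sketch of the argparse suggestion above (the argument and option names are illustrative, not a fixed interface):

```python
import argparse

def parse_args(argv=None):
    # Build a parser for the source and output directories.
    parser = argparse.ArgumentParser(description="Strip DIF rows from CSV files.")
    parser.add_argument("srcdir", help="directory containing the input .csv files")
    parser.add_argument("-o", "--outdir", default="out",
                        help="directory for the filtered files (default: out)")
    return parser.parse_args(argv)
```

Called without arguments, parse_args reads sys.argv; passing a list instead makes the parser easy to exercise in tests.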
Most sequence types support the in and not in operators, which are much simpler to use to test for values than figuring out index positions. Note that the output file should be opened once, before the loop; reopening it in 'w' mode for every row would truncate it each time:
writer = csv.writer(open('40_5N.csv', 'w'))
for row in reader:
    if 'DIF' not in row:
        writer.writerow(row)
If you're willing to install numpy, you can also read a csv file into the convenient numpy array format with either recfromcsv or the more general genfromtxt (genfromtxt requires you to specify the comma delimiter), and you can specify which rows and columns to ignore. Documentation can be found here for genfromtxt:
http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
And here for recfromcsv: http://nullege.com/codes/search/numpy.recfromcsv?fulldoc=1
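A short sketch of genfromtxt with those options (assuming numpy is installed; the inline data is made up for illustration):

```python
import numpy as np
from io import StringIO

# genfromtxt needs the delimiter spelled out for CSV data;
# skip_header drops leading rows and usecols selects columns to keep.
data = np.genfromtxt(StringIO("1,2,3\n4,5,6\n7,8,9\n"),
                     delimiter=',', skip_header=1, usecols=(0, 2))
# data is a 2x2 array holding columns 0 and 2 of the last two rows
```

In a real script you would pass the filename instead of a StringIO object.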

Reading command Line Args

I am running a script in python like this from the prompt:
python gp.py /home/cdn/test.in..........
Inside the script I need to take the path of the input file test.in, and the script should read and print the file content. The code below was working fine, but the file path is hard-coded in the script. Now I want to pass the path as a command line argument.
Working Script
#!/usr/bin/python
import sys
inputfile = '/home/cdn/test.in'
f = open (inputfile,"r")
data = f.read()
print data
f.close()
Script Not Working
#!/usr/bin/python
import sys
print "\n".join(sys.argv[1:])
data = argv[1:].read()
print data
f.close()
What change do I need to make in this ?
While Brandon's answer is a useful solution, the reason your code is not working also deserves explanation.
In short, a list of strings is not a file object. In your first script, you open a file and operate on that object (which is a file object.). But writing ['foo','bar'].read() does not make any kind of sense -- lists aren't read()able, nor are strings -- 'foo'.read() is clearly nonsense. It would be similar to just writing inputfile.read() in your first script.
To make things explicit, here is an example of getting all of the content from all of the files specified on the commandline. This does not use fileinput, so you can see exactly what actually happens.
# iterate over the filenames passed on the commandline
for filename in sys.argv[1:]:
    # open the file, assigning the file-object to the variable 'f'
    with open(filename, 'r') as f:
        # print the content of this file.
        print f.read()
# Done.
Check out the fileinput module: it interprets command line arguments as filenames and hands you the resulting data in a single step!
http://docs.python.org/2/library/fileinput.html
For example:
import fileinput

for line in fileinput.input():
    print line
In the script that isn't working for you, you are simply not opening the file before reading it. Also, sys.argv[1:] is a list, so to open the first file you need sys.argv[1]. So change it to
#!/usr/bin/python
import sys

print "\n".join(sys.argv[1:])
f = open(sys.argv[1], "r")
data = f.read()
print data
f.close()
Also, f.close() would error out on its own, because f had not been defined. The above changes take care of that.
BTW, you should use longer, descriptive variable names; single-letter names make a script harder to read.

Python error: need more than one value to unpack

OK, so I'm learning Python, but for my studies I have to do rather complicated stuff already. I'm trying to run a script to analyse data in Excel files. This is how it looks:
#!/usr/bin/python
import sys
#lots of functions, not relevant
resultsdir = "/home/blah"
filename1 = sys.argv[1]
filename2 = sys.argv[2]
out = open(sys.argv[3], "w")
#filename1,filename2="CNVB_reads.403476","CNVB_reads.403447"
file1 = open(resultsdir + "/" + filename1 + ".csv")
file2 = open(resultsdir + "/" + filename2 + ".csv")
for line in file1:
    start.p,end.p,type,nexons,start,end,cnvlength,chromosome,id,BF,rest = line.split("\t", 10)
    CNVs1[chr].append([int(start), int(end), float(BF)])
for line in file2:
    start.p,end.p,type,nexons,start,end,cnvlength,chromosome,id,BF,rest = line.split("\t", 10)
    CNVs2[chr].append([int(start), int(end), float(BF)])
These are the titles of the columns of the data in the Excel files, and I want to split them; I'm not even sure if that is necessary when using data from Excel files.
#more irrelevant stuff
out.write(filename1+","+filename2+","+str(chromosome)+","+str(type)+","+str(shared)+"\n")
This is what it should write in my output, 'shared' is what I have calculated, the rest is already in the files.
Ok, now my question, finally, when I call the script like that:
python script.py CNVB_reads.403476 CNVB_reads.403447 script.csv in my shell
I get the following error message:
start.p,end.p,type,nexons,start,end,cnvlength,chromosome,id,BF,rest=line.split("\t",10)
ValueError: need more than 1 value to unpack
I have no idea what is meant by that in relation to the data... Any ideas?
The line.split('\t', 10) call did not return eleven elements. Perhaps it is empty?
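To see why the message appears, here is a small, self-contained demonstration (the helper name is my own): splitting a line that contains no tabs yields a single-element list, so unpacking it into eleven names fails. A length check is one way to guard against blank or malformed lines:

```python
def try_unpack(line):
    # Split into at most 11 fields; return None if the line is too short
    # to unpack (e.g. a blank line or one without tab separators).
    parts = line.split("\t", 10)
    if len(parts) < 11:
        return None
    return parts
```

With such a guard, lines that would otherwise raise ValueError can simply be skipped.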
You probably want to use the csv module instead to parse these files.
import csv
import os

for filename, target in ((filename1, CNVs1), (filename2, CNVs2)):
    with open(os.path.join(resultsdir, filename + ".csv"), 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter='\t')
        for row in reader:
            start, end = row[4], row[5]
            chromosome = row[7]
            BF = row[9]
            target[chromosome].append([int(start), int(end), float(BF)])
