Reading csv file and compare objects to a list - python

I have a .txt file, the primary list, with strings like this:
f
r
y
h
g
j
and I have a .csv file, the recipes list, with rows like this:
d,g,r,e,w,s
j,f,o,b,x,q,h
y,n,b,w,q,j
My program goes through each row and counts the number of items that belong to the primary list. For example, in this case the outcome is:
2
3
2
I always get 0, the mistake must be silly, but I can't figure it out:
from __future__ import print_function
import csv
primary_data = open('test_list.txt','r')
primary_list = []
for line in primary_data.readlines():
    line.strip('\n')
    primary_list.append(line)
recipes_reader = csv.reader(open('test.csv','r'), delimiter =',')
for row in recipes_reader:
    primary_count = 0
    for i in row:
        if i in primary_list:
            primary_count += 1
    print (primary_count)

Here's the bare-essentials pedal-to-the-metal version:
from __future__ import print_function
import csv
with open('test_list.txt', 'r') as f: # with statement ensures your file is closed
    primary_set = set(line.strip() for line in f)
with open('test.csv', 'rb') as f: #### see note below ###
    for row in csv.reader(f): # delimiter=',' is the default
        print(sum(i in primary_set for i in row)) # i in primary_set has int value 0 or 1
Note: In Python 2.x, always open csv files in binary mode. In Python 3.x, always open csv files with newline=''.
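For readers on Python 3, the same idea can be sketched with text-mode files and newline=''. This is only an illustration: the helper name count_matches is made up here, and the sample files are recreated in a temporary directory so the snippet runs on its own.

```python
import csv
import os
import tempfile

def count_matches(list_path, csv_path):
    """Return, for each CSV row, how many fields appear in the primary list."""
    with open(list_path, "r") as f:
        primary_set = {line.strip() for line in f}
    # Python 3: open CSV files in text mode with newline=''
    with open(csv_path, "r", newline="") as f:
        return [sum(field in primary_set for field in row) for row in csv.reader(f)]

# Recreate the sample files from the question in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "test_list.txt")
    csv_path = os.path.join(tmp, "test.csv")
    with open(list_path, "w") as f:
        f.write("f\nr\ny\nh\ng\nj\n")
    with open(csv_path, "w", newline="") as f:
        f.write("d,g,r,e,w,s\nj,f,o,b,x,q,h\ny,n,b,w,q,j\n")
    counts = count_matches(list_path, csv_path)

print(counts)  # [2, 3, 2]
```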

Reading into primary_list leaves a trailing \n on each string. In your code, line.strip('\n') returns a new string and leaves line unchanged, so the stripped value is thrown away. You should keep it:
When appending to primary_list do:
for line in primary_data:
    primary_list.append(line.strip())
Note the strip call. Also, as you can see, you don't really need readlines, since for line in primary_data already does what you need when primary_data is a file object.
Now, as a general comment, since you're using the primary list for lookup, I suggest replacing the list by a set - this will make things much faster if the list is large. Python sets are very efficient for key-based lookup, lists are not designed for that purpose.
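To make the point about the newline concrete, here is a small sketch (the variable names are illustrative only) showing that str.strip returns a new string rather than changing the original, and how that affects the membership test:

```python
line = "f\n"

line.strip("\n")             # returns "f", but the result is thrown away
still_has_newline = line     # line is unchanged: strings are immutable

stripped = line.strip("\n")  # keep the returned value instead

primary_list = [still_has_newline]
primary_set = {stripped}

print("f" in primary_list)   # False: the list holds "f\n"
print("f" in primary_set)    # True
```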

The following code would solve the problem.
from __future__ import print_function
import csv
primary_data = open('test_list.txt','r')
primary_list = [line.rstrip() for line in primary_data]
recipies_reader = csv.reader(open('recipies.csv','r'), delimiter =',')
for row in recipies_reader:
    count = 0
    for i in row:
        if i in primary_list:
            count += 1
    print (count)
Output
2
3
2

Related

Reading numbers off a list from a txt file, but only up to a comma

This is data from a lab experiment (around 717 lines of data). Rather than trying to Excel it, I want to import and graph it in either Python or MATLAB. I'm new here btw... and am a student!
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
more numbers : see Screenshot of more data from my file
I just can't figure out how to read the line up until a comma. Specifically, I need the Load numbers for one of my arrays/list, so for example on the first line I only need 62.638 (which would be the first number on my first index on my list/array).
How can I get an array/list of this, something that iterates/reads the list and ignores strings?
Thanks!
NOTE: I use Anaconda + Jupyter Notebooks for Python & Matlab (school provided software).
EDIT: Okay, so I came home today and worked on it again. I hadn't dealt with CSV files before, but after some searching I was able to learn how to read my file, somewhat.
import csv
from itertools import islice
with open('Blue_bar_GroupD.txt','r') as BB:
    BB_csv = csv.reader(BB)
    x = 0
    BB_lb = []
    while x < 7: #to skip the string data
        next(BB_csv)
        x+=1
    for row in islice(BB_csv,0,758):
        print(row[0]) #testing if I can read row data
Okay, here is where I am stuck. I want to make an array/list that has the 0th index value of each row. Sorry if I'm a freaking noob!
Thanks again!
You can skip all lines up to the first data row and then parse the data into a list for later use; 700+ lines can easily be processed in memory.
To do that you need to:
read the file line by line
remember the last non-empty line before number/comma/dot ( == header )
see if the line is only number/comma/dot, else increase a skip-counter (== data )
seek to 0
skip enough lines to get to header or data
read the rest into a data structure
Create test file:
text = """
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
"""
with open ("t.txt","w") as w:
    w.write(text)
Some helpers and the skipping/reading logic:
import re
import csv
def convert_row(row):
    """Convert one row of data into a list of mixed floats and others.
    Float is the preferred data type, else string is used - no other tried."""
    d = []
    for v in row:
        try:
            # convert to float and add
            d.append(float(v))
        except ValueError:
            # not a float, append as is
            d.append(v)
    return d
def count_to_first_data(fh):
    """Count lines in fh not consisting of numbers, dots and commas.
    Side effect: will reset position in fh to 0."""
    skiplines = 0
    header_line = 0
    fh.seek(0)
    for line in fh:
        if re.match(r"^[\d.,]+$",line):
            fh.seek(0)
            return skiplines, header_line
        else:
            if line.strip():
                header_line = skiplines
            skiplines += 1
    raise ValueError("File does not contain pure number rows!")
Usage of helpers / data conversion:
data = []
skiplines = 0
with open("t.txt","r") as csvfile:
    skip_to_data, skip_to_header = count_to_first_data(csvfile)
    for _ in range(skip_to_header): # skip_to_data if you do not want the headers
        next(csvfile)
    reader = csv.reader(csvfile, delimiter=',',quotechar='"')
    for row in reader:
        row_data = convert_row(row)
        if row_data:
            data.append(row_data)
print(data)
Output (reformatted):
[['Load (lbf)', 'Time (s)', 'Crosshead (in)', 'Extensometer (in)'],
[62.638, 0.9, 0.0, 8e-05],
[122.998, 1.7, 0.001, 0.00012]]
Documentation:
re.match
csv.reader
Methods of file objects (e.g. seek())
With this you now have "clean" data that you can use for further processing - including your headers.
For visualization you can have a look at matplotlib
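As a follow-up to the original question (a list of only the Load values), one possible way to pull a single column out of the parsed rows is sketched below. It assumes data has the shape shown in the output above, with the header row first; the names header, rows, load_index and loads are illustrative.

```python
# Parsed rows as shown in the output above: one header row, then numeric rows.
data = [
    ["Load (lbf)", "Time (s)", "Crosshead (in)", "Extensometer (in)"],
    [62.638, 0.9, 0.0, 8e-05],
    [122.998, 1.7, 0.001, 0.00012],
]

header, rows = data[0], data[1:]

# Column index looked up by its header name (assumes the header row is present).
load_index = header.index("Load (lbf)")
loads = [row[load_index] for row in rows]

print(loads)  # [62.638, 122.998]
```

The resulting list can then be handed to a plotting library as the y-values.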
I would recommend reading your file with Python:
data = []
with open('my_txt.txt', 'r') as fd:
    # Skip header lines
    for i in range(6):
        fd.readline()
    # Read each data line up to the first comma
    for line in fd:
        index = line.find(',')
        if index >= 0:
            data.append(float(line[0:index]))
This leads to a list containing the data of the first column:
>>> data
[62.638, 122.998]
The MATLAB solution is less nice, since you have to know the number of data lines in your file (which you do not need to know in the python solution)
n_header = 6
n_lines = 2 % Insert here 717 (as you mentioned)
M = csvread('my_txt.txt', n_header, 0, [n_header 0 n_header+n_lines-1 0])
leads to:
>> M
M =
62.6380
122.9980
For the sake of clarity: you can also use MATLAB's textscan function to achieve what you want without knowing the number of lines, but still, the Python code would be the better choice in my opinion.
Based on your format, you will need three steps: first, read all lines; second, determine which lines to use; last, get the floats and assign them to a list.
Assuming your file name is name.txt, try:
f = open("name.txt", "r")
all_lines = f.readlines()
grid = []
for line in all_lines:
    if ('"' not in line) and (line != '\n'):
        grid.append(list(map(float, line.strip('\n').split(','))))
f.close()
The grid will then contain a series of lists containing your group of floats.
Explanation for fun:
In the for loop, I check for a double quote to eliminate any string line, since all strings are enclosed in quotes. The other condition skips empty lines.
Based on your needs, you can use the list grid as you please. For example, to fetch the first line's first number, do
grid[0][0]
since Python lists are indexed from 0 to n-1 for n elements.
This is super simple in Matlab, just 2 lines:
data = dlmread('data.csv', ',', 6,0);
column1 = data(:,1);
Where 6 and 0 should be replaced by the row and column offset you want. So in this case, the data starts at row 7 and you want all the columns, then just copy over the data in column 1 into another vector.
As another note, try typing doc dlmread in matlab - it brings up the help page for dlmread. This is really useful when you're looking for matlab functions, as it has other suggestions for similar functions down the bottom.

Trying to copy column1 from a csv file to another empty file using python

I'm looking for a way using python to copy the first column from a csv into an empty file. I'm trying to learn python so any help would be great!
So if this is test.csv
A 32
D 21
C 2
B 20
I want this output
A
D
C
B
I've tried the following commands in python but the output file is empty
f= open("test.csv",'r')
import csv
reader = csv.reader(f,delimiter="\t")
names=""
for each_line in reader:
    names=each_line[0]
First, you want to open your files. A good practice is to use the with statement (that, technically speaking, introduces a context manager) so that when your code exits from the with block all the files are automatically closed
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
next you want a loop on the lines of the input file (note the indentation, we are inside the with block), line splitting is automatic when you read a text file with lines separated by newlines…
    for line in inpfile:
each line is a string, but you think of it as two fields separated by white space — this situation is so common that strings have a method to deal with this situation (note again the increasing indent, we are in the for loop block)
        fields = line.split()
by default .split() splits on white space, but you can use, e.g., split(',') to split on commas, etc — that said, fields is a list of strings, for your first record it is equal to ['A', '32'] and you want to output just the first field in this list… for this purpose a file object has the .write() method, that writes a string, just a string, to the file, and fields[0] IS a string, but we have to add a newline character to it because, in this respect, .write() is different from print().
        outfile.write(fields[0]+'\n')
That's all, but if you omit my comments it's 4 lines of code
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
    for line in inpfile:
        fields = line.split()
        outfile.write(fields[0]+'\n')
When you are done with learning (some) Python, ask for an explanation of this...
with open('test.csv') as ifl, open('out.csv', 'w') as ofl:
    ofl.write('\n'.join(line.split()[0] for line in ifl))
Addendum
The csv module in such a simple case adds the additional conveniences of
auto-splitting each line into a list of strings
taking care of the details of output (newlines, etc)
and when learning Python it's more fruitful to see how these steps can be done using the bare language, or at least that is my opinion…
The situation is different when your data file is complex, has headers, has quoted strings possibly containing quoted delimiters, and so on. In those cases the use of csv is recommended, as it takes into account all the gory details. For complex data analysis requirements you will need other packages, not included in the standard library, e.g., numpy and pandas, but that is another story.
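To illustrate the point about quoted delimiters, here is a small sketch comparing plain splitting with csv.reader. The sample record is made up for the example.

```python
import csv
import io

# Hypothetical record: the first field contains a comma inside quotes.
line = '"Smith, John",42\n'

# Plain string splitting cuts the quoted field in two.
naive = line.strip().split(",")

# csv.reader honors the quotes and returns two fields.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['"Smith', ' John"', '42']
print(parsed)  # ['Smith, John', '42']
```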
This answer reads the CSV file, treating the space character as the column delimiter. You have to add header=None, otherwise the first row will be taken to be the header / names of columns.
ss is a slice - the 0th column, taking all rows as denoted by :
The last line writes the slice to a new filename.
import pandas as pd
df = pd.read_csv('test.csv', sep=' ', header=None)
ss = df.iloc[:, 0]  # .ix was removed in pandas 1.0; .iloc selects by position
ss.to_csv('new_path.csv', sep=' ', index=False, header=False)
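import csv
# Python 2 file modes; on Python 3 use 'r' / 'w' with newline=''
reader = csv.reader(open("test.csv","rb"), delimiter='\t')
writer = csv.writer(open("output.csv","wb"))
for e in reader:
    writer.writerow([e[0]])  # writerow expects a sequence of fields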
The best you can do is create an empty list, append the column values to it, and then write that new list to another csv, for example:
import csv
def writetocsv(l):
    # make a copy of the values as a list
    b = list(l)
    print (b)
    with open("newfile.csv",'w',newline='',) as f:
        w = csv.writer(f, delimiter=',')
        for value in b:
            w.writerow([value])
adcb_list = []
f= open("test.csv",'r')
reader = csv.reader(f,delimiter="\t")
for each_line in reader:
    adcb_list.append(each_line[0])  # keep only the first column
writetocsv(adcb_list)
hope this works for you :-)

AttributeError: FileInput instance has no attribute '__exit__'

I am trying to read from multiple input files and print the second row from each file next to each other as a table
import sys
import fileinput
with fileinput.input(files=('cutflow_TTJets_1l.txt ', 'cutflow_TTJets_1l.txt ')) as f:
    for line in f:
        proc(line)
def proc(line):
    parts = line.split("&") # split line into parts
    if "&" in line: # if at least 2 parts/columns
        print parts[1] # print column 2
But I get a "AttributeError: FileInput instance has no attribute '__exit__'"
The problem is that as of python 2.7.10, the fileinput module does not support being used as a context manager, i.e. the with statement, so you have to handle closing the sequence yourself. The following should work:
f = fileinput.input(files=('cutflow_TTJets_1l.txt ', 'cutflow_TTJets_1l.txt '))
for line in f:
    proc(line)
f.close()
Note that in Python 3.2 and later, you can use this module as a context manager.
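For comparison, a Python 3 sketch of the with-statement form might look like the following. The helper name second_columns and the two small input files are hypothetical stand-ins for the cutflow tables, created in a temporary directory so the snippet is self-contained.

```python
import fileinput
import os
import tempfile

def second_columns(paths):
    """Collect the second '&'-separated field of every line across all files."""
    values = []
    # On Python 3.2+ FileInput supports the with statement and closes itself.
    with fileinput.input(files=paths) as f:
        for line in f:
            parts = line.split("&")
            if len(parts) > 1:
                values.append(parts[1].strip())
    return values

# Two small hypothetical input files stand in for the cutflow tables.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, rows in [("a.txt", ["cut1 & 10", "cut2 & 7"]),
                       ("b.txt", ["cut1 & 20", "cut2 & 15"])]:
        path = os.path.join(tmp, name)
        with open(path, "w") as out:
            out.write("\n".join(rows) + "\n")
        paths.append(path)
    values = second_columns(paths)

print(values)  # ['10', '7', '20', '15']
```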
For the second part of the question, assuming that each file is similarly formatted with an equal number of data lines of the form xxxxxx & xxxxx, one can make a table of the data from the second column of each data as follows:
Start with an empty list to be a table where the rows will be lists of second column entries from each file:
table = []
Now iterate over all lines in the fileinput sequence, using the fileinput.isfirstline() to check if we are at a new file and make a new row:
for line in f:
    if fileinput.isfirstline():
        row = []
        table.append(row)
    parts = line.split('&')
    if len(parts) > 1:
        row.append(parts[1].strip())
f.close()
Now table will be the transpose of what you really want, which is each row containing the second column entries of a given line of each file. To transpose the list, one can use zip and then loop over the rows of the transposed table, using the join string method to print each row with a comma separator (or whatever separator you want):
for row in zip(*table):
    print(', '.join(row))
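The transposition step can be tried in isolation. In this sketch the contents of table are made-up values standing in for the second-column entries of two files:

```python
# Hypothetical table: one inner list per input file, holding its second-column entries.
table = [
    ["10", "7", "3"],   # from the first file
    ["20", "15", "9"],  # from the second file
]

# zip(*table) pairs up the i-th entry of every file, i.e. it transposes the table.
transposed = list(zip(*table))
lines = [", ".join(row) for row in transposed]

for text in lines:
    print(text)
```

Note that zip stops at the shortest inner list, which is why the answer assumes an equal number of data lines per file.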
If something has open/close methods, use contextlib.closing:
import sys
import fileinput
from contextlib import closing
with closing(fileinput.input(files=('cutflow_TTJets_1l.txt ', 'cutflow_TTJets_1l.txt '))) as f:
    for line in f:
        proc(line)

Python Error When Attempting to Iterate Over a List of File Names

I have a list of file names, all of which have the .csv ending. I am trying to use the linecache.getline function to get 2 parts of each csv - the second row, 5th item and the 46th row, 5th item and comparing the two values (they're stock returns).
import csv
import linecache
d = open('successful_scrapes.csv')
csv = csv.reader(d)
k = []
for row in csv:
    k.append(row)
x =linecache.getline('^N225.csv',2)
y = float(x.split(",")[4])
for c in k:
    g = linecache.getline(c,2)
    t = float(g.split(",")[4])
Everything works until the for loop over the k list. It keeps returning the error "Unhashable type: list." I've tried including quotation marks before and after each .csv file name in the list. Additionally, all the files are included in the same directory. Any thoughts?
Thanks!
You're misusing linecache, which is for working with files. There is no point in using it at all if you are going to pull the whole file into memory first.
In this case, since you have the whole CSV copied into k, just do the comparison:
yourComparisonFunction(k[1][4],k[45][4])
Alternately, you could use linecache instead of csv, and do it like so:
import linecache
file_list = ['file1','file2','file3','etc']
for f in file_list:
    line2 = linecache.getline(f,2)
    line2val = float(line2.split(",")[4])
    line46 = linecache.getline(f,46)
    line46val = float(line46.split(",")[4])
And then, I assume, add some comparison logic.
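The error in the question most likely arises because each row produced by csv.reader is a list, so linecache.getline receives a list where it expects a file name string. A minimal sketch of one way around that follows, assuming the file names sit in the first column of the list file. The helper field_as_float, the file ABC.csv and its values are all hypothetical, and lines 2 and 3 stand in for lines 2 and 46 to keep the sample short.

```python
import csv
import linecache
import os
import tempfile

def field_as_float(path, lineno, index=4):
    """Return field `index` (0-based) of line `lineno` (1-based) as a float."""
    line = linecache.getline(path, lineno)
    return float(line.split(",")[index])

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical price file: a header line plus two data lines.
    stock_path = os.path.join(tmp, "ABC.csv")
    with open(stock_path, "w") as out:
        out.write("Date,Open,High,Low,Close\n"
                  "2015-01-02,1,2,0.5,1.5\n"
                  "2015-01-05,1.5,3,1,2.5\n")
    # Hypothetical list file with one file name per row.
    list_path = os.path.join(tmp, "successful_scrapes.csv")
    with open(list_path, "w", newline="") as out:
        out.write("ABC.csv\n")

    results = {}
    with open(list_path, newline="") as listing:
        for row in csv.reader(listing):
            name = row[0]  # row is a list; the file name is its first element
            path = os.path.join(tmp, name)
            results[name] = (field_as_float(path, 2), field_as_float(path, 3))
    linecache.clearcache()

print(results)  # {'ABC.csv': (1.5, 2.5)}
```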
You can read the file and then just append the values to a list depending on the row number.
import csv
with open("C/a.csv", "rb") as f:
    reader = csv.reader(f)
    lst = [x[4] for i, x in enumerate(reader) if i == 1 or i == 45]
Then you can do the comparison with the lst's items

Python output entire column

Using Python 2.4 I have a .txt file sorted into 3 columns, with 9 spaces between each column, which is the reason for x.split - roughly 1000 lines, e.g.:
$1$sda4356:[sgb1_diska5.ldlbat44.libabl]talild_0329_base.rpt talild_0329_base.rpt 00000000000000005062
I'm using the following code to sort by column 3 (which is file size)
fobj = open('data.txt')
data = [ x.split() for x in fobj ]
fobj.close()
from operator import itemgetter
data.sort(key=itemgetter(2), reverse=True)
I want to print the output of an entire column and, if possible with Python 2.4, even name the columns. If I do something like data[1] it just outputs line 2; how can I get it to show column 2 instead? If I can't name the columns, I see a few things with import csv, but I can't figure out the right command to use the data I've already sorted instead of reading the .txt file again. Most examples expect a file name, as shown below:
with open(filename, 'r') as f:
def getcolumn(n,data):
    return (i[n] for i in data) # if this doesn't work replace () with []
for i in getcolumn(1,data):
    print i
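On the "naming" part of the question, one possible approach is a small mapping from column names to positions. The sketch below is written for Python 3 rather than 2.4. The column names, the get_column helper and the sample rows are assumptions made for illustration, since the original file has no header row.

```python
# Hypothetical rows in the same three-column layout as data.txt.
data = [
    ["dev:[dir]a.rpt", "a.rpt", "00000000000000005062"],
    ["dev:[dir]b.rpt", "b.rpt", "00000000000000000017"],
    ["dev:[dir]c.rpt", "c.rpt", "00000000000000120000"],
]

# Assumed column names; the original file has no header row.
COLUMNS = {"path": 0, "name": 1, "size": 2}

def get_column(rows, name):
    """Return every value of the named column as a list."""
    index = COLUMNS[name]
    return [row[index] for row in rows]

# Sort by size, largest first, comparing as integers rather than strings.
data.sort(key=lambda row: int(row[COLUMNS["size"]]), reverse=True)

names = get_column(data, "name")
sizes = [int(s) for s in get_column(data, "size")]

print(names)  # ['c.rpt', 'a.rpt', 'b.rpt']
print(sizes)  # [120000, 5062, 17]
```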
