Search for a specific value in a specific column with Python

I've got a text file that is tab delimited and I'm trying to figure out how to search for a value in a specific column in this file.
I think I need to use the csv module but have been unsuccessful so far. Can someone point me in the right direction?
Thanks!
**Update**
Thanks for everyone's updates. I know I could probably use awk for this but simply for practice, I am trying to finish it in python.
I am getting the following error now:
if row.split(' ')[int(searchcolumn)] == searchquery:
IndexError: list index out of range
And here is the snippet of my code:
#open the directory and find all the files
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        f = open(file, 'r')
        lines = f.readlines()
        for line in lines:
            #the first 4 lines of the file are crap, skip them
            if linescounter > startfromline:
                with open(file) as infile:
                    for row in infile:
                        if row.split(' ')[int(searchcolumn)] == searchquery:
                            rfile = open(resultsfile, 'a')
                            rfile.writelines(line)
                            rfile.write("\r\n")
                            print "Writing line -> " + line
                            resultscounter += 1
            linescounter += 1
        f.close()
I am taking both searchcolumn and searchquery as raw_input from the user. I'm guessing the reason I am getting the list index out of range now is that it's not parsing the file correctly?
Thanks again.

You can also use the Sniffer (example taken from http://docs.python.org/library/csv.html):
csvfile = open("example.csv", "rb")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)
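In Python 3 the file would be opened in text mode rather than "rb"; here is a minimal sketch of the same sniffing approach, using an in-memory sample in place of example.csv (the data is made up):

```python
import csv
import io

# In-memory stand-in for example.csv (hypothetical tab-delimited data)
sample = "name\tage\tcity\nalice\t30\tberlin\nbob\t25\tparis\n"

csvfile = io.StringIO(sample)
dialect = csv.Sniffer().sniff(csvfile.read(1024))  # guess the delimiter from a sample
csvfile.seek(0)                                    # rewind before reading rows
reader = csv.reader(csvfile, dialect)
rows = list(reader)
print(rows[1])  # → ['alice', '30', 'berlin']
```

With a real file you would use `open("example.csv", newline="")` in place of the StringIO object.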

Yes, you'll want to use the csv module, and you'll want to set delimiter to '\t':
spamReader = csv.reader(open('spam.csv', 'rb'), delimiter='\t')
After that you should be able to iterate:
for row in spamReader:
    print row[n]

This prints all rows in filename with 'myvalue' in the fourth tab-delimited column:
with open(filename) as infile:
    for row in infile:
        if row.split('\t')[3] == 'myvalue':
            print row
Replace 3, 'myvalue', and print as appropriate.
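Tying this back to the original question, here is a hedged sketch that searches a user-chosen column with csv.reader and guards against short rows, so the IndexError from the update can't occur (the data and variable values are made up):

```python
import csv
import io

# In-memory stand-in for the tab-delimited file (hypothetical data)
data = "alice\t42\tred\nbob\t17\tblue\ncarol\t42\tgreen\n"

searchcolumn = 1    # column index, e.g. int(raw_input(...)) from the user
searchquery = "42"  # value to match, compared as a string

matches = []
reader = csv.reader(io.StringIO(data), delimiter="\t")
for row in reader:
    # Guard against short rows so a missing column can't raise IndexError
    if len(row) > searchcolumn and row[searchcolumn] == searchquery:
        matches.append(row)

print(matches)  # → [['alice', '42', 'red'], ['carol', '42', 'green']]
```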

Related

Want to append a column in a file without using Pandas

I have a file say, outfile.txt which looks like below:
1,2,3,4,0,0.95
1,2,4,4,0,0.81
5,6,3,1,0,0.89
7,6,8,8,0,0.77
6,6,4,9,0,0.88
9,9,9,1,0,0.66
4,3,6,9,0,0.85
1,2,6,7,0,0.61
Now I want to append one extra 1 to each row. So the desired output file looks like:
1,2,3,4,0,0.95,1
1,2,4,4,0,0.81,1
5,6,3,1,0,0.89,1
7,6,8,8,0,0.77,1
6,6,4,9,0,0.88,1
9,9,9,1,0,0.66,1
4,3,6,9,0,0.85,1
1,2,6,7,0,0.61,1
How can I do it? Whenever I google for a solution, everything I find uses Pandas, but I don't want to use that.
Since your file is in CSV format, the csv module can help you. Iterating over the reader object gives you a list of the items in each line of the file; then simply .append() what you want.
import csv
with open("outfile.txt") as f:
    reader = csv.reader(f)
    for line in reader:
        line.append("1")
        print(",".join(line))
If you have a whole column of values, like column below, you can zip it with the reader object and append the corresponding element in the loop:
import csv
column = range(10)
with open("outfile.txt") as f:
    reader = csv.reader(f)
    for line, n in zip(reader, map(str, column)):
        line.append(n)
        print(",".join(line))
These examples print the result; you can write it to a new file instead.
You can read and write files line by line with the csv module. A reader object will iterate the rows of the input file, and writer.writerows will consume that iterator. You just need a bit of extra code to add the 1; using a generator expression, this example adds the extra column.
import csv
import os
filename = "outfile.txt"
tmp = filename + ".tmp"
with open(filename, newline="") as infile, open(tmp, "w", newline="") as outfile:
    csv.writer(outfile).writerows(row + [1] for row in csv.reader(infile))
os.rename(tmp, filename)
Just iterate through the file line by line and add ,1 at the end of each line:
with open('outfile.txt', 'r') as input:
    with open('outfile_final.txt', 'w') as output:
        for line in input:
            line = line.rstrip('\n') + ',1'
            print(line, file=output)

How to format txt file in Python

I am trying to convert a txt file into a csv file in Python. The current format of the txt file is several strings separated by spaces. I would like to write each string into one cell in the csv file.
The txt file has got following structure:
UserID Desktop Display (Version) (Server/Port handle), Date
UserID Desktop Display (Version) (Server/Port handle), Date
etc.
My approach would be following:
with open('licfile.txt', "r+") as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(" ") for line in stripped if line)
with open('licfile.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('user', 'desktop', 'display', 'version', 'server', 'handle', 'date'))
    writer.writerows(lines)
Unfortunately this is not working as expected. I get the following ValueError: I/O operation on closed file. Additionally, only the intended row headers are shown, all in one cell of the csv file.
Any tips on how to proceed? Many thanks in advance.
how about
with open('licfile.txt', 'r') as in_file, open('licfile.csv', 'w') as out_file:
    for line in in_file:
        if line.strip():
            out_file.write(line.strip().replace(' ', ',') + '\n')
and for the german Excel enthusiasts...
...
...
...
... .replace(' ', ';') + '\n')
:)
You can also use the built in csv module to accomplish this easily:
import csv
with open('licfile.txt', 'r') as in_file, open('licfile.csv', 'w') as out_file:
    reader = csv.reader(in_file, delimiter=" ")
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerows(reader)
I used the lineterminator='\n' argument here because the default is '\r\n', which in most cases ends up giving you an extra blank line per row.
There are also a few arguments you could use if say quoting is needed or a different delimiter is desired: https://docs.python.org/3/library/csv.html#csv-fmt-params
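For instance, the delimiter and quoting format parameters change how fields are written; a small sketch (with QUOTE_MINIMAL, only fields that contain the delimiter or quote character get quoted, and embedded quotes are doubled):

```python
import csv
import io

out = io.StringIO()
writer = csv.writer(out, delimiter=",", quoting=csv.QUOTE_MINIMAL,
                    lineterminator="\n")
# A plain field, a field containing the delimiter, and a field with a quote
writer.writerow(["plain", "has,comma", 'has"quote'])
print(out.getvalue())  # → plain,"has,comma","has""quote"
```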
You are using comprehensions with round brackets, which create generator objects. Generators are evaluated lazily, so by the time writer.writerows consumes them, in_file has already been closed, hence the ValueError. Use square brackets instead, which build lists eagerly:
stripped = [line.strip() for line in in_file]
lines = [line.split(" ") for line in stripped if line]
Alternatively, pandas can read it in one call:
licfile_df = pd.read_csv('licfile.txt', sep=",", header=None)

python 3 csv reader + Ignore empty records [duplicate]

This is my code. I am able to print each line, but when a blank line appears it prints ; because of the CSV file format, so I want to skip blank lines.
import csv
import time
ifile = open("C:\Users\BKA4ABT\Desktop\Test_Specification\RDBI.csv", "rb")
for line in csv.reader(ifile):
    if not line:
        empty_lines += 1
        continue
    print line
If you want to skip all whitespace lines, use the str.isspace() test on the raw line.
Since you may want to do something more complicated than just printing the non-blank lines to the console (no need to use the csv module for that), here is an example that involves a DictReader:
#!/usr/bin/env python
# Tested with Python 2.7

# I prefer this style of importing - hides the csv module
# in case you do from this_file.py import * inside of __init__.py
import csv as _csv


# Real comments are more complicated ...
def is_comment(line):
    return line.startswith('#')


# Kind of silly wrapper
def is_whitespace(line):
    return line.isspace()


def iter_filtered(in_file, *filters):
    for line in in_file:
        if not any(fltr(line) for fltr in filters):
            yield line


# A disadvantage of this approach is that it requires storing rows in RAM
# However, the largest CSV files I worked with were all under 100 Mb
def read_and_filter_csv(csv_path, *filters):
    with open(csv_path, 'rb') as fin:
        iter_clean_lines = iter_filtered(fin, *filters)
        reader = _csv.DictReader(iter_clean_lines, delimiter=';')
        return [row for row in reader]


# Stores all processed lines in RAM
def main_v1(csv_path):
    for row in read_and_filter_csv(csv_path, is_comment, is_whitespace):
        print(row)  # Or do something else with it


# Simpler, less refactored version, does not use with
def main_v2(csv_path):
    try:
        fin = open(csv_path, 'rb')
        reader = _csv.DictReader((line for line in fin
                                  if not line.startswith('#')
                                  and not line.isspace()),
                                 delimiter=';')
        for row in reader:
            print(row)  # Or do something else with it
    finally:
        fin.close()


if __name__ == '__main__':
    csv_path = "C:\Users\BKA4ABT\Desktop\Test_Specification\RDBI.csv"
    main_v1(csv_path)
    print('\n' * 3)
    main_v2(csv_path)
Instead of
if not line:
This should work:
if not ''.join(line).strip():
My suggestion would be to just use the csv reader, which can split the file into rows. That way you can simply check whether the row is empty and, if so, continue.
import csv

with open('some.csv', 'r') as csvfile:
    # the delimiter depends on how your CSV separates values
    csvReader = csv.reader(csvfile, delimiter='\t')
    for row in csvReader:
        # check if row is empty
        if not row:
            continue
You can always check the number of comma-separated values; it seems to be more productive and efficient.
When reading the lines iteratively, each row comes back as a list object, so if it has no elements (a blank line) we can skip it.
with open(filename) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=",")
    for row in csv_reader:
        if len(row) == 0:
            continue
You can strip leading and trailing whitespace, and if the length is zero after that the line is empty.
import csv

with open('userlist.csv') as f:
    reader = csv.reader(f)
    user_header = next(reader)  # Add this line if there is a header
    user_list = []              # Create a new user list for input
    for row in reader:
        if any(row):            # Pick up only the non-blank rows
            print(row)          # Just for verification
            user_list.append(row)  # Collect the rest of the data into the list
This example just prints the data in array form while skipping the empty lines:
import csv

file = open("data.csv", "r")
data = csv.reader(file)
for line in data:
    if line:
        print line
file.close()
I find it much clearer than the other provided examples.
import csv

ifile = csv.reader(open('C:\Users\BKA4ABT\Desktop\Test_Specification\RDBI.csv', 'rb'), delimiter=';')
for line in ifile:
    if set(line).pop() == '':
        pass
    else:
        for cell_value in line:
            print cell_value


How can I speed up this really basic python script for offsetting lines of numbers

I have a simple text file which contains numbers in ASCII text separated by spaces as per this example.
150604849
319865.301865 5810822.964432 -96.425797 -1610
319734.172256 5810916.074753 -52.490280 -122
319730.912949 5810918.098465 -61.864395 -171
319688.240891 5810889.851608 -0.339890 -1790
*<continues like this for millions of lines>*
Basically I want to copy the first line as-is; then, for all following lines, I want to offset the first value (x), offset the second value (y), leave the third value unchanged, and offset and halve the last number.
I've cobbled together the following code as a Python learning exercise (apologies if it's crude and offensive, truly I mean no offence) and it works OK. However, the input file I'm using it on is several GB in size, and I'm wondering if there are ways to speed up execution. Currently a 740 MB file takes 2 minutes 21 seconds.
import glob

# offset values
offsetx = -306000
offsety = -5806000

files = glob.glob('*.pts')
for file in files:
    currentFile = open(file, "r")
    out = open(file[:-4] + "_RGB_moved.pts", "w")
    firstline = str(currentFile.readline())
    out.write(str(firstline.split()[0]))
    while 1:
        lines = currentFile.readlines(100000)
        if not lines:
            break
        for line in lines:
            out.write('\n')
            words = line.split()
            newwords = [str(float(words[0]) + offsetx),
                        str(float(words[1]) + offsety),
                        str(float(words[2])),
                        str((int(words[3]) + 2050) / 2)]
            out.write(" ".join(newwords))
Many thanks
Don't use .readlines(). Use the file directly as an iterator:
for file in files:
    with open(file, "r") as currentfile, open(file[:-4] + "_RGB_moved.pts", "w") as out:
        firstline = next(currentfile)
        out.write(firstline.split(None, 1)[0])
        for line in currentfile:
            out.write('\n')
            words = line.split()
            newwords = [str(float(words[0]) + offsetx),
                        str(float(words[1]) + offsety),
                        words[2],
                        str((int(words[3]) + 2050) / 2)]
            out.write(" ".join(newwords))
I also added a few Python best-practices, and you don't need to turn words[2] into a float, then back to a string again.
You could also look into using the csv module, it can handle splitting and rejoining lines in C code:
import csv

for file in files:
    with open(file, "rb") as currentfile, open(file[:-4] + "_RGB_moved.pts", "wb") as out:
        reader = csv.reader(currentfile, delimiter=' ', quoting=csv.QUOTE_NONE)
        writer = csv.writer(out, delimiter=' ', quoting=csv.QUOTE_NONE)
        writer.writerow([next(reader)[0]])
        for row in reader:
            newrow = [str(float(row[0]) + offsetx),
                      str(float(row[1]) + offsety),
                      row[2],
                      str((int(row[3]) + 2050) / 2)]
            writer.writerow(newrow)
Use the csv package. It may be more optimized than your script and will simplify your code.
