Trying to remove rows in a csv file based on a column value - python

I'm trying to remove duplicated rows in a csv file based on if a column has a unique value. My code looks like this:
seen = set()
for line in fileinput.FileInput('DBA.csv', inplace=1):
    if line[2] in seen:
        continue # skip duplicated line
    seen.add(line[2])
    print(line, end='')
I'm trying to get the value of the 2 index column in every row and check if it's unique. But for some reason my seen set looks like this:
{'b', '"', 't', '/', 'k'}
Any advice on where my logic is flawed?

You're reading your file line by line, so when you pick line[2] you're actually picking the third character of each line you're running this on.
If you want to capture the value of the third column (index 2) for each row, you need to parse your CSV first, something like:
import csv
seen = set()
# in Python 3 the csv module expects a text-mode file opened with newline=""
with open("DBA.csv", "r", newline="") as f:
    reader = csv.reader(f)
    for line in reader:
        if line[2] in seen:
            continue
        seen.add(line[2])
        print(line) # this will NOT print valid CSV, it will print Python list
If you want to edit your CSV in place I'm afraid it will be a bit more complicated than that. If your CSV is not huge, you can load it in memory, truncate it and then write down your lines:
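import csv
seen = set()
# "r+" opens for reading and writing without truncating; newline="" is what csv expects
with open("DBA.csv", "r+", newline="") as f:
    handler = csv.reader(f)
    data = list(handler) # load every row into memory
    f.seek(0) # go back to the start of the file
    f.truncate() # and empty it
    handler = csv.writer(f)
    for line in data:
        if line[2] in seen:
            continue
        seen.add(line[2])
        handler.writerow(line)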
Otherwise you'll have to read your file line by line and use a buffer that you'll pass to csv.reader() to parse it, check the value of its third column and if not seen write the line to the live-editing file. If seen, you'll have to seek back to the previous line beginning before writing the next line etc.
Of course, you don't need to use the csv module if you know your line structure well, which can simplify things (you won't need to pass buffers left and right), but for a universal solution it's highly advisable to let the csv module do your bidding.
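For files too large to hold in memory, a commonly used alternative to seeking back and forth inside the live file is to stream the rows into a temporary file and swap it in at the end. The sketch below takes that route, which is not the seek-based approach described above. The function name dedupe_csv_in_place and its column parameter are made up for illustration, and it assumes every row has at least column + 1 fields.

```python
import csv
import os
import tempfile

def dedupe_csv_in_place(path, column=2):
    """Keep only the first row seen for each distinct value in `column`."""
    seen = set()
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives next to the source so the final swap
    # stays on one filesystem.
    with open(path, "r", newline="") as src, \
            tempfile.NamedTemporaryFile("w", newline="", dir=directory,
                                        delete=False) as tmp:
        writer = csv.writer(tmp)
        for row in csv.reader(src):
            # assumption: every row has at least `column + 1` fields
            if row[column] in seen:
                continue  # skip a row whose key was already written
            seen.add(row[column])
            writer.writerow(row)
    # Both files are closed here; replace the original with the filtered copy.
    os.replace(tmp.name, path)
```

Only the set of seen values is kept in memory, so the file size is limited by free disk space rather than RAM.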

Related

split the data into separate file after encountering a column name

eno,ename,
101,'sam',
102,'bill',
eno,ename,
103,'jack',
eno,ename,
104,'pam',
I have a huge .csv file in which the column names reappear after a certain number of rows. Is there a way in Python to split such data into multiple files as soon as it encounters the "repeated column names"?
I would like the above data to be in 3 separate .csv files since the same column names appear 3 times.
Challenging! Here's my solution. There is likely a more straightforward way to do this though.
with open("./file.csv", "r") as readfile:
    file_number = 0
    current_line_no = 0
    tmpline = None
    for line in readfile:
        # count which file you're on. Use write mode "w" for the first line, else append.
        with open(f"./writefile{file_number}.csv", ("w" if current_line_no == 0 else "a")) as writefile:
            # check if the "headers" are appearing and if the current file has more than 1 line.
            # Not sure if the header check is the best for your use case. Maybe regex is best here.
            if current_line_no != 0 and ("eno" in line and "ename" in line):
                file_number += 1 # increment to next file
                current_line_no = 0 # reset line number
                tmpline = line # remember the "current line". This needs to be added to next file.
                continue # continue to next line in readfile
            # if there is a tmpline from previous, add it to this as header.
            if tmpline is not None:
                writefile.write(tmpline)
                tmpline = None
            # write the line and increment to new line
            writefile.write(line)
            current_line_no += 1
I've tried to comment as best as possible. The code basically opens the files one by one as it loops through the lines of the readfile. When it reads the contents it checks if the current line is a "header". Here I simply checked if "eno" and "ename" are in the line, but there is probably a better approach for your use case. If the current line is a header, then you need to close the current file and open a new one. Hopefully this helps!
I know you asked for Python, but there are some questions that just cry out for the power of AWK :)
awk '/eno,ename/{x="F"++i ".csv";}{print > x;}' input.csv
One way of doing it is to save the headers to a variable, and then when reading the file check if the current row matches the header. If it does, increment a counter that can be used to determine which file to write to.
import csv
with open('data.csv', newline="") as f:
    HEADERS = next(csv.reader(f)) # the first row is the header to look for
print(HEADERS)
with open('data.csv', newline="") as f:
    reader = csv.reader(f)
    file_name_counter = 0
    for row in reader:
        if row == HEADERS:
            file_name_counter += 1
        # start a new file on a header row, otherwise append to the current one
        with open(f'data{file_name_counter}.csv', ('w' if row == HEADERS else "a"), newline="") as out:
            writer = csv.writer(out)
            writer.writerow(row)
NOTE: The csv documentation recommends opening files with newline=""; without it, csv.writer() will add an extra blank line between rows on Windows.
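Reopening the output file for every row gets slow on a huge input. A possible variation, sketched here under the assumption that the very first row of the file is the header, keeps a single output file open and only switches when the header row shows up again. The function name split_on_repeated_header and the prefix parameter are invented for this example.

```python
import csv

def split_on_repeated_header(path, prefix="data"):
    """Write rows to prefix1.csv, prefix2.csv, ... starting a new file at each header row."""
    written = []   # names of the files created so far
    out = None     # currently open output file
    try:
        with open(path, newline="") as src:
            header = None
            for row in csv.reader(src):
                if header is None:
                    header = row  # assumption: the first row is the header
                if row == header:
                    if out is not None:
                        out.close()  # finish the previous chunk
                    name = f"{prefix}{len(written) + 1}.csv"
                    out = open(name, "w", newline="")
                    writer = csv.writer(out)
                    written.append(name)
                writer.writerow(row)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling split_on_repeated_header("data.csv") on the sample from the question should leave three files, each starting with the eno,ename, header.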

Python: Access "field" in line

I have the following .txt-File (modified bash emboss-dreg report, the original report has seqtable format):
Start End Strand Pattern Sequence
43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT
52037 52064 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC
188334 188360 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC
I would like to access the elements under "sequence" only, to compare them with some variables and delete the whole lines, if the comparison does not give the desired result (using Levenshtein distance for comparison).
But I can't even get started .... :(
I am searching for something like the linux -f option, to directly get to the right "field" in the line to do my comparison.
I came across re.split:
with open(textFile) as f:
    for line in f:
        cleaned=re.split(r'\t',line)
        print(cleaned)
which results in:
[' Start End Strand Pattern Sequence\n']
['\n']
[' 43392 43420 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n']
['\n']
[' 52037 52064 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC\n']
['\n']
[' 188334 188360 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC\n']
['\n']
That is the closest I got to "split my lines into elements". I feel like I am going totally the wrong way, but searching Stack Overflow and Google did not turn up anything :(
I have never worked with the seqtable format before, so I tried to deal with it as .txt. Maybe there is another approach that is better for dealing with it?
Python is the main language I am learning, I am not so firm in Bash, but bash-answers for dealing with the issue would be ok for me, too.
I am thankful for any hint/link/help :)
The format itself seems to be using empty lines as row separators, while your split on r'\t' is not doing anything: the regex \t matches a tab character and, based on what you've pasted, the data does not use tab delimiters but a variable number of spaces to pad the table.
To address both, you can read the file, treat the first line as a header (if you need it), then read the rest line by line, strip the leading/trailing whitespace, check if there is any data there and, if there is, further split it on whitespace to get to your line elements:
with open("your_data", "r") as f:
    header = f.readline().split() # read the first line as a header
    for line in f: # read the rest of the file line-by-line
        line = line.strip() # first clear out the whitespace
        if line: # check if there is any content left or is it an empty line
            elements = line.split() # split the data on whitespace to get your elements
            print(elements[-1]) # print the last element
TGATCGCACGCCGAATGGAAACACGTTTT
TGACCCTGCTTGGCGATCCCGGCGTTTC
TGATCGCGCAACTGCAGCGGGAGTTAC
As a bonus, since you have the header, you can turn it into a map and then use 'proxied' named access to get the element you're looking for so you don't need to worry about the element position:
with open("your_data", "r") as f:
    # read the header and turn it into a value:index map
    header = {v: i for i, v in enumerate(f.readline().split())}
    for line in f: # read the rest of the file line-by-line
        line = line.strip() # first clear out the whitespace
        if line: # check if there is any content left or is it an empty line
            elements = line.split()
            print(elements[header["Sequence"]]) # print the Sequence element
You can also use a header map to turn your rows into dict structures for even easier access.
UPDATE: Here's how to create a header map and then use it to build a dict out of your lines:
with open("your_data", "r") as f:
    # read the header and turn it into an index:value map
    header = {i: v for i, v in enumerate(f.readline().split())}
    for line in f: # read the rest of the file line-by-line
        line = line.strip() # first clear out the whitespace
        if line: # check if there is any content left or is it an empty line
            # split the line, iterate over it and use the header map to create a dict
            row = {header[i]: v for i, v in enumerate(line.split())}
            print(row["Sequence"]) # ... or you can append it to a list for later use
As for how to 'delete' lines that you don't want for some reason, you'll have to create a temporary file, loop through your original file, compare your values, write the ones that you want to keep into the temporary file, delete the original file and finally rename the temporary file to match your original file, something like:
import shutil
from tempfile import NamedTemporaryFile
SOURCE_FILE = "your_data" # path to the original file to process
def compare_func(seq): # a simple comparison function for our sequence
    return not seq.endswith("TC") # use Levenshtein distance or whatever you want instead
# open a temporary file for writing and our source file for reading
with NamedTemporaryFile(mode="w", delete=False) as t, open(SOURCE_FILE, "r") as f:
    header_line = f.readline() # read the header
    t.write(header_line) # write the header immediately to the temporary file
    header = {v: i for i, v in enumerate(header_line.split())} # create a header map
    last_line = "" # a var to store the whitespace to keep the same format
    for line in f: # read the rest of the file line-by-line
        row = line.strip() # first clear out the whitespace
        if row: # check if there is any content left or is it an empty line
            elements = row.split() # split the row into elements
            # now lets call our comparison function
            if compare_func(elements[header["Sequence"]]): # keep the line if True
                t.write(last_line) # write down the last whitespace to the temporary file
                t.write(line) # write down the current line to the temporary file
        else:
            last_line = line # store the whitespace for later use
shutil.move(t.name, SOURCE_FILE) # finally, overwrite the source with the temporary file
This will produce the same file sans the second row from your example, since its sequence ends in TC and our compare_func() returns False in that case.
For a bit less complexity, instead of using temporary files you can load your whole source file into the working memory and then just overwrite it, but that would work only for files that can fit your working memory while the above approach can work with files as large as your free storage space.
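Since the question mentions Levenshtein distance, here is one way the placeholder comparison could be swapped for an edit-distance check. This is a plain-Python sketch of the textbook dynamic-programming algorithm, not a particular library's API; make_compare_func, reference and max_distance are names invented for this example, and the threshold you actually want depends on your data.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute cost 1 each)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

def make_compare_func(reference, max_distance):
    """Build a predicate that keeps sequences within max_distance of reference."""
    def compare_func(seq):
        return levenshtein(seq, reference) <= max_distance
    return compare_func
```

A predicate built with make_compare_func(some_reference, some_threshold) takes a single sequence string and returns True or False, so it can be used wherever a one-argument comparison function is expected.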

Python CSV writer keeps adding unnecessary quotes

I'm trying to write to a CSV file with output that looks like this:
14897,40.50891,-81.03926,168.19999
but the CSV writer keeps writing the output with quotes at beginning and end
'14897,40.50891,-81.03926,168.19999'
When I print the line normally, the output is correct but I need to do line.split() or else the csv writer puts output as 1,4,8,9,7 etc...
But when I do line.split() the output is then
['14897,40.50891,-81.03926,168.19999']
Which is written as '14897,40.50891,-81.03926,168.19999'
How do I make the quotes go away? I already tried csv.QUOTE_NONE but doesn't work.
with open(results_csv, 'wb') as out_file:
    writer = csv.writer(out_file, delimiter=',')
    writer.writerow(["time", "lat", "lon", "alt"])
    for f in file_directory:
        for line in open(f):
            print line
            line = line.split()
            writer.writerow(line)
With line.split(), you're not splitting on commas but on whitespace (spaces, linefeeds, tabs). Since there is none inside the line, you end up with only 1 item per row.
Since this item contains commas, the csv module has to quote it to distinguish those commas from the actual separator (which is also a comma). You would need line.strip().split(",") for it to work, but using csv to read your data would be a better way to fix this.
Replace this:
for line in open(some_file):
    print line
    line = line.split()
    writer.writerow(line)
with:
with open(some_file) as f:
    cr = csv.reader(f) # default separator is comma already
    writer.writerows(cr)
You don't need to read the file manually. You can simply use csv reader.
Replace the inner for loop with:
# with ensures that the file handle is closed, after the execution of the code inside the block
with open(some_file) as file:
    row = csv.reader(file) # read rows
    writer.writerows(row) # write multiple rows at once
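To see the quoting behaviour in isolation, here is a small sketch that writes the same line both ways into an in-memory buffer. It is written for Python 3 (the thread's code is Python 2), and write_row is a helper invented for this example.

```python
import csv
import io

line = "14897,40.50891,-81.03926,168.19999\n"

def write_row(fields):
    """Return the text csv.writer produces for a single row."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue()

# split() splits on whitespace only: one field that still contains commas
whole_line = write_row(line.split())
# strip().split(",") yields four separate fields
by_comma = write_row(line.strip().split(","))

print(repr(whole_line))
print(repr(by_comma))
```

The first result is wrapped in double quotes because the single field contains the delimiter; the second has four plain fields and needs no quoting.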

writing the data in text file while converting it to csv

I am very new with python. I have a .txt file and want to convert it to a .csv file with the format I was told but could not manage to accomplish. a hand can be useful for it. I am going to explain it with screenshots.
I have a txt file with the name of bip.txt. and the data inside of it is like this
I want to convert it to csv like this csv file
So far, what I could do is only writing all the data from text file with this code:
read_files = glob.glob("C:/Users/Emrehana1/Desktop/bip.txt")
with open("C:/Users/Emrehana1/Desktop/Test_Result_Report.csv", "w") as outfile:
    for f in read_files:
        with open(f, "r") as infile:
            outfile.write(infile.read())
So is there a solution to convert it to a csv file in the format I desire? I hope I have explained it clearly.
There's no need to use the glob module if you only have one file and you already know its name. You can just open it. It would have been helpful to quote your data as text, since as an image someone wanting to help you can't just copy and paste your input data.
For each entry in the input file you will have to read multiple lines to collect together the information you need to create an entry in the output file.
One way is to loop over the lines of input until you find one that begins with "test:", then get the next line in the file using next() to create the entry.
The following code will produce the split you need. Creating the csv file can be done with the standard library csv module, and is left as an exercise. I used a different file name, as you can see.
with open("/tmp/blip.txt") as f:
    for line in f:
        if line.startswith("test:"):
            test_name = line.strip().split(None, 1)[1] # text after "test:"
            result = next(f) # the outcome is expected on the following line
            if not result.startswith("outcome:"):
                raise ValueError("Test name not followed by outcome for test "+test_name)
            outcome = result.strip().split(None, 1)[1]
            print(test_name, outcome)
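For the part left as an exercise, a sketch along these lines might do. The real layout of bip.txt was only posted as a screenshot, so the "test:" and "outcome:" line prefixes, the sample data, and the helper names parse_results and write_report are all assumptions made for illustration.

```python
import csv
import io

def parse_results(lines):
    """Yield (test_name, outcome) pairs from an iterable of text lines."""
    lines = iter(lines)
    for line in lines:
        if line.startswith("test:"):
            test_name = line.strip().split(None, 1)[1]
            result = next(lines, "")  # the outcome should be on the next line
            if not result.startswith("outcome:"):
                raise ValueError("Test name not followed by outcome for test " + test_name)
            yield test_name, result.strip().split(None, 1)[1]

def write_report(pairs, out_file):
    """Write a header row followed by one row per (test, outcome) pair."""
    writer = csv.writer(out_file)
    writer.writerow(["test", "outcome"])
    writer.writerows(pairs)

# Made-up sample input; a real run would pass an open file instead.
sample = io.StringIO("test: login\noutcome: passed\n\ntest: logout\noutcome: failed\n")
report = io.StringIO()
write_report(parse_results(sample), report)
print(report.getvalue())
```

To write a real file, open the output with open(path, "w", newline="") and pass that to write_report in place of the StringIO buffer.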
You do not use the glob function to open a file; it searches for file names matching a pattern. You could open the file bip.txt, read each line and put the values into a list, then, when all of the values have been found, join the columns with a comma and the rows with a new line and write the result to a csv file, like this:
# set the csv column headers
values = [["test", "outcome"]]
current_row = []
with open("bip.txt", "r") as f:
    for line in f:
        # when a blank line is found, append the row
        if line == "\n" and current_row != []:
            values.append(current_row)
            current_row = []
        if ":" in line:
            # get the value after the colon
            value = line[line.index(":")+1:].strip()
            current_row.append(value)
# append the final row to the list, unless the file ended with a blank line
if current_row:
    values.append(current_row)
# join the columns with a comma and the rows with a new line
csv_result = ""
for row in values:
    csv_result += ",".join(row) + "\n"
# output the csv data to a file
with open("Test_Result_Report.csv", "w") as f:
    f.write(csv_result)

reading from a particular tuple onwards from a file in python

Using seek and tell is not functioning properly as the tell returns the current position in bytes; I need to get the line number rather the position of file pointer to proceed.
I have a file glass.csv and I need to cluster the datasets. Each line in the file contains a number 1,2,3... like the below:
65,1.52172,13.48,3.74,0.90,72.01,0.18,9.61,0.00,0.07,1
66,1.52099,13.69,3.59,1.12,71.96,0.09,9.40,0.00,0.00,1
67,1.52152,13.05,3.65,0.87,72.22,0.19,9.85,0.00,0.17,1
68,1.52152,13.05,3.65,0.87,72.32,0.19,9.85,0.00,0.17,1
69,1.52152,13.12,3.58,0.90,72.20,0.23,9.82,0.00,0.16,1
70,1.52300,13.31,3.58,0.82,71.99,0.12,10.17,0.00,0.03,1
71,1.51574,14.86,3.67,1.74,71.87,0.16,7.36,0.00,0.12,2
72,1.51848,13.64,3.87,1.27,71.96,0.54,8.32,0.00,0.32,2
73,1.51593,13.09,3.59,1.52,73.10,0.67,7.83,0.00,0.00,2
74,1.51631,13.34,3.57,1.57,72.87,0.61,7.89,0.00,0.00,2
142,1.51851,13.20,3.63,1.07,72.83,0.57,8.41,0.09,0.17,2
143,1.51662,12.85,3.51,1.44,73.01,0.68,8.23,0.06,0.25,2
144,1.51709,13.00,3.47,1.79,72.72,0.66,8.18,0.00,0.00,2
145,1.51660,12.99,3.18,1.23,72.97,0.58,8.81,0.00,0.24,2
146,1.51839,12.85,3.67,1.24,72.57,0.62,8.68,0.00,0.35,2
147,1.51769,13.65,3.66,1.11,72.77,0.11,8.60,0.00,0.00,3
148,1.51610,13.33,3.53,1.34,72.67,0.56,8.33,0.00,0.00,3
149,1.51670,13.24,3.57,1.38,72.70,0.56,8.44,0.00,0.10,3
150,1.51643,12.16,3.52,1.35,72.89,0.57,8.53,0.00,0.00,3
I need to take some inputs from those tuples having 1 as the last number and save it in another file, (train.txt), and the remaining in another file, (test.txt). Likewise I need to take certain lines from those having 2 as the last number and append to the first file i.e. train.txt and remaining to test.txt.
I cannot get the second set of inputs; the first result is appended again instead.
The easiest way, assuming that you have a large file and can not simply load the whole file would be to use 1 file for each to do your sorting. If it is a small(ish) input file then just load as a comma separated file using the csv module.
As a quick and dirty method, (assuming smallish files).
data = []
with open('glass.csv', 'r') as infile:
    for line in infile:
        linedata = [float(val) for val in line.strip().split(',')]
        data.append(linedata)
adata = sorted(data, key=lambda items: items[-1])
## Then open both your output files and write them in the required fields.
The default behavior for reading a text file is line-by-line. You can just do something like this:
with open('input.csv', 'r') as f, open('output_1.csv', 'w') as output_1, open('output_2.csv', 'w') as output_2:
    for line in f:
        line_fields = line.strip().split(',') # split the line on commas
        if line_fields[-1] == '1':
            output_1.write(line)
            continue
        if line_fields[-1] == '2':
            output_2.write(line)
Or you can use the CSV module, it's much easier https://docs.python.org/2/library/csv.html
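As a rough sketch of the train/test split the question describes, the code below sends the first few rows of each class to one file and the rest to another. The question does not say how many rows per class are wanted, so train_per_class is an assumed parameter, and split_by_class and split_file are names made up here. It loads all rows into memory, which suits a small file like the sample.

```python
import csv
from collections import defaultdict

def split_by_class(rows, train_per_class):
    """Send the first train_per_class rows of each class to train, the rest to test."""
    taken = defaultdict(int)  # how many rows of each class went to train so far
    train, test = [], []
    for row in rows:
        label = row[-1]  # the class is the last field of the row
        if taken[label] < train_per_class:
            train.append(row)
            taken[label] += 1
        else:
            test.append(row)
    return train, test

def split_file(source, train_path, test_path, train_per_class):
    """Read source as CSV and write the two groups to train_path and test_path."""
    with open(source, newline="") as src:
        train, test = split_by_class(csv.reader(src), train_per_class)
    for path, rows in ((train_path, train), (test_path, test)):
        with open(path, "w", newline="") as out:
            csv.writer(out).writerows(rows)
```

For example, split_file("glass.csv", "train.txt", "test.txt", 4) would put the first four rows of each class into train.txt and everything else into test.txt.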
