Partial Intersection of Specific Columns in Large CSV Files - python

I'm working on a script to find the intersection between large CSV files, based on the contents of only two specific columns in each file: Query ID and Subject ID.
The files come in Left/Right pairs for each species, and every single file looks something like this:
Similarity (%) Query ID Subject ID
100.000000 BRADI5G01462.1_1 BRADI5G16060.1_36
90.000000 BRADI5G02480.1_5 NCRNA_11838_6689
100.000000 BRADI5G06067.1_8 NCRNA_32597_1525
90.000000 BRADI5G08380.1_12 NCRNA_32405_1776
100.000000 BRADI5G09460.2_17 BRADI5G16060.1_36
90.909091 BRADI5G10680.1_20 NCRNA_2505_6156
Right files are always longer and larger in size than the Left ones!
Here's the code snippet I have so far:
import csv
with open('#Left(Brachypodium_Japonica).csv', 'r', newline='') as Afile, \
     open('#Right(Brachypodium_Japonica).csv', 'r', newline='') as Bfile, \
     open('Intrsc-(Brachypodium_Japonica).csv', 'w', newline='') as Intrsct:
    reader1 = csv.reader(Afile, delimiter="\t", skipinitialspace=True)
    next(reader1, None)
    reader2 = csv.reader(Bfile, delimiter="\t", skipinitialspace=True)
    next(reader2, None)
    Intrsct = csv.writer(Intrsct, delimiter="\t", skipinitialspace=True)
    Intrsct.writerow(["Query ID", "Subject ID", "Left Similarity (%)", "Right Similarity (%)"])
    for row1, row2 in zip(Afile, Bfile):
        if row1[1] in row2[1] and row1[2] in row2[2]:
            Intrsct.writerow([row1.strip().split('\t')[1], row1.strip().split('\t')[2],
                              row1.strip().split('\t')[0], row2.strip().split('\t')[0]])
The code above iterates over the records of the two files simultaneously and searches for the contents of row[1] and row[2] of the first file in row[1] and row[2] of the second file, i.e. column-wise (it compares the Query ID in both files as well as the Subject ID), and prints the matches to a new file in a certain order.
The results are not exactly what I was expecting; obviously it finds the matches for the first wanted column only... I tried to trace the procedure manually and found that BRADI5G02480.1_5, for instance, exists in both files, but NCRNA_11838_6689 exists only on the Left side, not the Right!
Aren't they supposed to be mirror reflections, aside from the numerical values?!
I used this thread to write the script, but it compares line by line and doesn't check the rest of the column contents for matches.
Also, I found this, but it uses dictionaries and lists, which aren't suitable for my files' size.
To handle the simultaneous iteration I used this thread, but what was mentioned there about handling different-sized files wasn't really clear to me, so I haven't tried it yet!
I would really appreciate it if someone could tell me what I'm missing here: is the code correct, or am I using the in condition wrong?!
Please, I really need help with this... thanks in advance :)

The following solution is a copy of my answer to your other question, and should hopefully give you an idea how to integrate it with your current solution.
The script reads two (or more) CSV files in and writes the intersection of row entries to a new CSV file. By that I mean that if a row in input1.csv is found anywhere in input2.csv, the row is written to the output, and so on.
import csv

files = ["input1.csv", "input2.csv"]
ldata = []

for file in files:
    with open(file, "r") as f_input:
        csv_input = csv.reader(f_input, delimiter="\t", skipinitialspace=True)
        set_rows = set()
        for row in csv_input:
            set_rows.add(tuple(row))
        ldata.append(set_rows)

with open("Intersection(Brachypodium_Japonica).csv", "wb") as f_output:
    csv_output = csv.writer(f_output, delimiter="\t", skipinitialspace=True)
    csv_output.writerows(set.intersection(*ldata))
You will need to add your file name mangling. This format made it easier to test. Tested using Python 2.7.
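If you only need the intersection on the Query ID and Subject ID columns, while keeping the similarity value from each side, a dictionary keyed on those two columns sidesteps the line-by-line zip comparison entirely. Below is a minimal Python 3 sketch under that assumption; since your Right files are the larger ones, it indexes the smaller Left file and streams the Right file one row at a time:
import csv

# Index the (smaller) Left file: (Query ID, Subject ID) -> Similarity.
left_index = {}
with open('#Left(Brachypodium_Japonica).csv', 'r', newline='') as Afile:
    reader = csv.reader(Afile, delimiter='\t', skipinitialspace=True)
    next(reader, None)  # skip the header row
    for row in reader:
        left_index[(row[1], row[2])] = row[0]

# Stream the (larger) Right file and write out only the
# (Query ID, Subject ID) pairs that also appear on the Left.
with open('#Right(Brachypodium_Japonica).csv', 'r', newline='') as Bfile, \
     open('Intrsc-(Brachypodium_Japonica).csv', 'w', newline='') as Intrsct:
    reader = csv.reader(Bfile, delimiter='\t', skipinitialspace=True)
    next(reader, None)
    writer = csv.writer(Intrsct, delimiter='\t')
    writer.writerow(["Query ID", "Subject ID", "Left Similarity (%)", "Right Similarity (%)"])
    for row in reader:
        key = (row[1], row[2])
        if key in left_index:
            writer.writerow([row[1], row[2], left_index[key], row[0]])
Only the Left file is held in memory (two IDs and a similarity per row), so this avoids both the memory concern you mention and the assumption that matching records sit on the same line in both files.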

Related

Cross referencing two csv files in python

So, as I'm out of ideas, I've turned to the geniuses on this site.
What I want to be able to do is to have two separate CSV files: one with a bunch of store names on it, and the other with blacklisted stores.
I'd like to be able to run a Python script that reads the 'blacklisted' sheet, then checks if those specific names are within the other sheet, and if they are, deletes them from the main sheet.
I've tried for about two days straight and cannot for the life of me get it to work. So I'm coming to you guys to help me out.
Thanks so much in advance.
P.S. If you can comment the hell out of the script so I know what's going on, it would be greatly appreciated.
EDIT: I deleted the code I originally had, but hopefully this will give you an idea of what I was trying to do. (I also realise it's completely incorrect.)
import csv
with open('Black List.csv', 'r') as bl:
    reader = csv.reader(bl)
    with open('Destinations.csv', 'r') as dest:
        readern = csv.reader(dest)
        for line in reader:
            if line in readern:
                with open('Destinations.csv', 'w'):
                    del(line)
The first thing you need to be aware of is that you can't update the file you are reading. Text files (which include .csv files) don't work like that. So you have to read the whole of Destinations.csv into memory, and then write it out again, under a new name, but skipping the rows you don't want. (You can overwrite your input file, but you will very quickly discover that is a bad idea.)
import csv

blacklist_rows = []
with open('Black List.csv', 'r') as bl:
    reader = csv.reader(bl)
    for line in reader:
        blacklist_rows.append(line)

destination_rows = []
with open('Destinations.csv', 'r') as dest:
    readern = csv.reader(dest)
    for line in readern:
        destination_rows.append(line)
Now at this point you need to loop through destination_rows, drop any that match something in blacklist_rows, and write out the rest. I can't suggest what the matching test should look like, because you haven't shown us your input data, so I don't actually know what blacklist_rows and destination_rows contain.
with open('FilteredDestinations.csv', 'w') as output:
    writer = csv.writer(output)
    for r in destination_rows:
        if not r:  # trap for blank rows in the input
            continue
        if r *matches something in blacklist_rows*:  # you have to code this
            continue
        writer.writerow(r)
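For example, if the store name is the first column in both files, the matching test can be a constant-time set lookup. Here is a concrete version of the block above under that assumption (the column index 0 is a guess; adjust it to your actual data):
# Build a set of blacklisted names once, so each lookup is O(1).
# Assumes the store name is in column 0 of Black List.csv.
blacklist_names = set(r[0].strip().lower() for r in blacklist_rows if r)

with open('FilteredDestinations.csv', 'w', newline='') as output:
    writer = csv.writer(output)
    for r in destination_rows:
        if not r:  # trap for blank rows in the input
            continue
        if r[0].strip().lower() in blacklist_names:  # blacklisted: skip it
            continue
        writer.writerow(r)
Normalising with strip() and lower() makes the comparison tolerant of stray whitespace and capitalisation differences between the two sheets.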
You could try Pandas:
import pandas as pd

df1 = pd.read_csv("Destinations.csv")
df2 = pd.read_csv("Black List.csv")

# Collect the blacklisted names, then keep only the destination
# rows whose name is NOT in that list.
blacklist = df2["column_name_in_blacklist_file"].tolist()
df3 = df1[~df1['destination_column_name'].isin(blacklist)]
df3.to_csv("results.csv", index=False)
print(df3)
(Note that "column_name_in_blacklist_file" and "destination_column_name" are placeholders for the actual header names in your two files.)

Python program for line search and replacement in a CSV file

I am building a small tool in Python. The function of the tool is the following:
Open a master data file (csv format)
Open a log file (csv format)
Ask the user to select the pointer for the field that will need to be compared in both files
Start comparing record by record
When the field is not found in the record, look forward in the log file until the field can be found; in the meantime, keep in memory the pointer for where the comparison will be continued
Once the field is found, cut the whole record off that line and place it in the right position
Here is an example:
Data file
"1","1234","abc"
"2","5678","def"
"3","9012","ghi"
log file
"1","1234","abc"
"3","9012","ghi"
"2","5678","def"
final log file :
"1","1234","abc"
"2","5678","def"
"3","9012","ghi"
I was looking at the csv lib in Python and the sqlite3 lib, but there is nothing that seems to really do a swap inside a file, so I was thinking that maybe I should just create a new file with all the records in order.
What could be done in this regard? Is there a library or a command that can move records in an existing file?
I would prefer to modify the existing file instead of creating a new one, but if that's not possible I would just move to creating a new one.
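Something like this rough sketch is what I have in mind for the create-a-new-file route, assuming the first quoted field is the key and that the log fits in memory (log_sorted.csv is just a name I picked):
import csv

# Read the master data file and remember the desired key order.
with open('data.csv', 'r', newline='') as f:
    key_order = [row[0] for row in csv.reader(f) if row]

# Read the log file and index its records by key.
with open('log.csv', 'r', newline='') as f:
    log_records = {row[0]: row for row in csv.reader(f) if row}

# Write a new log file with the records in master-file order.
with open('log_sorted.csv', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for key in key_order:
        if key in log_records:
            writer.writerow(log_records[key])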
In addition to that, the code I was planning to use to verify the files was this:
import csv

reader1 = csv.reader(open('data.csv', 'rb'), delimiter=',', quotechar='"')
row1 = reader1.next()

reader2 = csv.reader(open('log.csv', 'rb'), delimiter=',', quotechar='"')
row2 = reader2.next()

if (row1[0] == row2[0]) and (row1[2:] == row2[2:]):
    # here it moves to the next record
    pass
else:
    # here it would run a function that replaces the field
    pass
Please note that this piece of code was found at this page:
Python: Comparing specific columns in two csv files
(I don't want to take away the glory from another coder.)
I just like it for its simplicity.
Thanks to all for the attention.
Regards
Danilo

Using Python v3.5 to load a tab-delimited file, omit some rows, and output max and min floating numbers in a specific column to a new file

I've tried researching this for several hours, but none of the possible solutions I found suited my particular needs.
I have written the following in Python (v3.5) to download a tab-delimited .txt file.
#!/usr/bin/env /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
import urllib.request
import time
timestr = time.strftime("%Y-%m-%d %H-%M-%S")
filename="/data examples/"+ "ace-magnetometer-" + timestr + '.txt'
urllib.request.urlretrieve('http://services.swpc.noaa.gov/text/ace-magnetometer.txt', filename=filename)
This downloads the file from here and renames it based on the current time. It works perfectly.
I am hoping that I can then use the "filename" variable to load the file and do some things to it (rather than having to write out the full file path and file name, because my ultimate goal is to do the following to several hundred different files, so using a variable will be easier in the long run).
This using-the-variable idea seems to work, because adding the following to the above prints the contents of the file to STDOUT (so it's able to find the file without any issues):
import csv
with open(filename, 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        print(row)
As you can see from the file, the first 18 lines are informational.
Line 19 provides the actual column names. Then there is a line of dashes.
The actual data I'm interested in starts on line 21.
I want to find the minimum and maximum numbers in the "Bt" column (third column from the right). One of the possible solutions I found would only work with integers, and this dataset has floating-point numbers.
Another possible solution involved importing the pyexcel module, but I can't seem to install that correctly...
import pyexcel as pe
data = pe.load(filename, name_columns_by_row=19)
min(data.column["Bt"])
I'd like to be able to print the minimum Bt and maximum Bt values into two separate files called minBt.txt and maxBt.txt.
I would appreciate any pointers anyone may have, please.
This is meant to be a comment on your latest question to Apoc, but I'm new, so I'm not allowed to comment. One thing that might create problems is that bz_values (and bt_values, for that matter) might be a list of strings (at least it was when I tried to run Apoc's script on the example file you linked to). min and max on strings compare lexicographically, which gives wrong answers, especially for negative numbers. You could solve this by substituting this:
min_bz = min([float(x) for x in bz_values])
max_bz = max([float(x) for x in bz_values])
for this:
min_bz = min(bz_values)
max_bz = max(bz_values)
The following will work as long as all the files are formatted in the same way, i.e. the data starts 21 lines in, same number of columns, and so on. Also, the file that you linked did not appear to be tab-delimited, so I've simply used the string split method on each row instead of the csv reader. The column is read from the file into a list, and that list is used to calculate the maximum and minimum values:
from itertools import islice

# Line that data starts from, zero-indexed.
START_LINE = 20
# The column containing the data in question, zero-indexed.
DATA_COL = 10
# The value present when a measurement failed.
FAILED_MEASUREMENT = '-999.9'

with open('data.txt', 'r') as f:
    bt_values = []
    for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None)):
        if val != FAILED_MEASUREMENT:
            bt_values.append(float(val))

min_bt = min(bt_values)
max_bt = max(bt_values)

with open('minBt.txt', 'a') as minFile:
    print(min_bt, file=minFile)
with open('maxBt.txt', 'a') as maxFile:
    print(max_bt, file=maxFile)
I have assumed that since you are doing this to multiple files you are looking to accumulate multiple max and min values in the maxBt.txt and minBt.txt files, and hence I've opened them in 'append' mode. If this is not the case, please swap out the 'a' argument for 'w', which will overwrite the file contents each time.
Edit: Updated to include workaround for failed measurements, as discussed in comments.
Edit 2: Updated to fix a problem with negative numbers, also noted by Derek in a separate answer.
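Since you mentioned wanting to repeat this over several hundred files, the body above drops neatly into a function. A sketch under the same assumptions (the downloaded_files list is hypothetical; substitute whatever your download loop produces):
from itertools import islice

START_LINE = 20
DATA_COL = 10
FAILED_MEASUREMENT = '-999.9'

def min_max_bt(filename):
    # Collect the Bt column as floats, skipping failed measurements.
    with open(filename, 'r') as f:
        bt_values = [float(val)
                     for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None))
                     if val != FAILED_MEASUREMENT]
    return min(bt_values), max(bt_values)

# Hypothetical list of the files produced by your download loop.
downloaded_files = ['ace-magnetometer-example.txt']

for filename in downloaded_files:
    min_bt, max_bt = min_max_bt(filename)
    with open('minBt.txt', 'a') as minFile:
        print(min_bt, file=minFile)
    with open('maxBt.txt', 'a') as maxFile:
        print(max_bt, file=maxFile)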

How to extract data from rows in .csv file into separate .txt files using python?

I have a CSV file of interview transcripts exported from an h5 file. When I read the rows into python, the output looks something like this:
line[0]=['title,date,responses']
line[1]=['[\'Transcript 1 title\'],"[\' July 7, 1997\']","[ '\nms. vogel: i look at all sectors of insurance, although to date i\nhaven\'t really focused on the reinsurers and the brokers.\n']']
line[2]=['[\'Transcript 2 title\'],"[\' July 8, 1997\']","[ '\nmr. tozzi: i formed cambridge in 1981. we are top-down sector managers,\nconstantly searching for non-consensus companies and industries.\n']']
etc...
I'd like to extract the text from the "responses" column ONLY into separate .txt files for every row in the CSV file, saving the .txt files into a specified directory and naming them as "t1.txt", "t2.txt", etc. according to the row number. The CSV file has roughly 30K rows.
Drawing from what I've already been able to find online, this is the code I have so far:
import csv
with open("twst.csv", "r") as f:
    reader = csv.reader(f)
    rownumber = 0
    for row in reader:
        g = open("t" + str(rownumber) + ".txt", "w")
        g.write(row)
        rownumber = rownumber + 1
        g.close()
My biggest problem is that this pulls all columns from the row into the .txt file, but I only want the text from the "responses" column. Once I have that, I know I can loop through the various rows in the file (right now, what I have set up just tests the first row), but I haven't found any guidance on pulling specific columns in the Python documentation. I'm also not familiar enough with Python to figure out the code on my own.
Thanks in advance for the help!
There may be something that can be done with the built-in csv module. However, if the format of the csv does not change, the following code should work by just using for loops and built-in read/write.
with open('test.csv', 'r') as file:
    data = file.read().split('\n')

for x in range(1, len(data)):
    third_col = data[x].split(',')
    with open('t' + str(x) + '.txt', 'w') as output:
        output.write(third_col[2])
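That said, because the responses contain commas and newlines inside their quotes, a plain split(',') will break on rows like the ones shown above; the csv module parses the quoting correctly. A sketch of the same idea with csv.reader, assuming "responses" is the third column and the first row is the header:
import csv

with open('twst.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    for rownumber, row in enumerate(reader, start=1):
        # row[2] is the "responses" column
        with open('t{}.txt'.format(rownumber), 'w') as out:
            out.write(row[2])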

Batch Appending matching rows to csv files using python

I have a set of csv files and another csv file, GroundTruth2010_edited_copy.csv, which contains information I'd like to append to the end of the rows of the Set of files. The files contain information describing geologic samples. For all the files, including GroundTruth2010_edited_copy.csv, each row has an identifying 'rockid' that identifies the sample and the remainder of the row describes various parameters of the sample. I want to append corresponding information from GroundTruth2010_edited_copy.csv to the Set of csv files. That is, if the rows have the same 'rockid,' I want to combine them into a new row in a new csv file. Hence, there is a new csv file for each original csv file in the Set. Here is my code.
import os
import csv

# read in ground truth data
csvfilename = 'GroundTruth/GroundTruth2010_edited_copy.csv'
with open(csvfilename) as csvfile:
    rocreader = csv.reader(csvfile)
    path = os.getcwd()
    filenames = os.listdir(path)
    for filename in filenames:
        if filename.endswith('.csv'):
            # read csv files
            r = csv.reader(open(filename))
            new_data = []
            for row in r:
                rockid = row[-1]
                for krow in rocreader:
                    entry = krow[0]
                    newentry = entry[:5] + entry[6:]  # remove extra '0' from middle of entry
                    if newentry == rockid:
                        print('Ok!')
                        # append ground truth data
                        new_data.append([row, krow[1], krow[2], krow[3], krow[4]])
            # write csv files
            newfilename = "".join(filename.split(".csv")) + "_GT.csv"
            with open(newfilename, "w") as f:
                writer = csv.writer(f)
                writer.writerows(new_data)
The code runs and makes my new csv files, but they are all empty. The problem seems to be that my second if statement is never true: the console never prints 'Ok!'. I've tried troubleshooting for a bit and have been rather frustrated. Perhaps the most frustrating thing is that after the program finishes, if I enter
rockid == newentry
the console returns True, so it seems to me I should get at least one 'Ok!' for the final iteration. Can anyone help me find what's wrong?
Also, since my if statement is never true, there may also be a problem with the way I append to new_data.
You only open rocreader once, so when you try to use it later in the loop, you'll only get rows from it the first time through; in the rest of the loop's runs, you're reading 0 rows (and of course getting no matches). To read it over and over, open and close it once for each time you need to use it.
But instead of re-scanning the Ground Truth file from disk (slow!) for every row of each of the other CSVs, you should read it once into a dictionary, so you can look up IDs in one step.
with open(csvfilename) as csvfile:
    rocreader = csv.reader(csvfile)
    rocindex = dict((row[-1], row) for row in rocreader)
Then for any key newentry, you can just check like this:
if newentry in rocindex:
    truth = rocindex[newentry]
    # Merge it with the row that has key `newentry`
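Putting the pieces together, the whole program might look something like this sketch (untested against your actual files; it assumes, as in your original code, that the ground-truth key is the first column with the extra '0' removed, and that the sample key is the last column of each row):
import os
import csv

# Read the ground truth once into a dictionary keyed by the fixed-up ID.
csvfilename = 'GroundTruth/GroundTruth2010_edited_copy.csv'
with open(csvfilename) as csvfile:
    rocindex = {}
    for krow in csv.reader(csvfile):
        newentry = krow[0][:5] + krow[0][6:]  # remove extra '0' from middle of entry
        rocindex[newentry] = krow

for filename in os.listdir(os.getcwd()):
    if not filename.endswith('.csv'):
        continue
    new_data = []
    with open(filename) as f:
        for row in csv.reader(f):
            rockid = row[-1]
            if rockid in rocindex:
                krow = rocindex[rockid]
                # Extend the sample row with the four ground-truth columns
                # (flattened, rather than nesting the row as a sub-list).
                new_data.append(row + krow[1:5])
    newfilename = filename[:-len('.csv')] + '_GT.csv'
    with open(newfilename, 'w', newline='') as f:
        csv.writer(f).writerows(new_data)
Each of the Set files is now scanned exactly once, and every lookup in rocindex is a single dictionary probe instead of a pass over the whole ground-truth file.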
