Batch appending matching rows to CSV files using Python

I have a set of CSV files and another CSV file, GroundTruth2010_edited_copy.csv, which contains information I'd like to append to the ends of the rows of the set of files. The files contain information describing geologic samples. In every file, including GroundTruth2010_edited_copy.csv, each row has an identifying 'rockid' that identifies the sample, and the remainder of the row describes various parameters of the sample. I want to append the corresponding information from GroundTruth2010_edited_copy.csv to each CSV file in the set: if two rows have the same 'rockid', I want to combine them into a new row in a new CSV file. Hence, there is a new CSV file for each original CSV file in the set. Here is my code.
import os
import csv
# read in ground truth data
csvfilename = 'GroundTruth/GroundTruth2010_edited_copy.csv'
with open(csvfilename) as csvfile:
    rocreader = csv.reader(csvfile)
    path = os.getcwd()
    filenames = os.listdir(path)
    for filename in filenames:
        if filename.endswith('.csv'):
            # read csv files
            r = csv.reader(open(filename))
            new_data = []
            for row in r:
                rockid = row[-1]
                for krow in rocreader:
                    entry = krow[0]
                    newentry = entry[:5] + entry[6:]  # remove extra '0' from middle of entry
                    if newentry == rockid:
                        print('Ok!')
                        # append ground truth data
                        new_data.append([row, krow[1], krow[2], krow[3], krow[4]])
            # write csv files
            newfilename = "".join(filename.split(".csv")) + "_GT.csv"
            with open(newfilename, "w") as f:
                writer = csv.writer(f)
                writer.writerows(new_data)
The code runs and creates my new CSV files, but they are all empty. The problem seems to be that my second 'if' statement is never true: the console never prints 'Ok!'. I've been troubleshooting for a while and gotten rather frustrated. Perhaps the most frustrating thing is that after the program finishes, if I enter
rockid==newentry
the console returns 'True', so it seems to me I should get at least one 'Ok!' for the final iteration. Can anyone help me find what's wrong?
Also, since my if statement is never true, there may be a problem with the way I append to 'new_data' as well.

You only open rocreader once, so when you try to use it later in the loop, you'll only get rows from it the first time through; in the rest of the loop's runs, you're reading 0 rows (and of course getting no matches). To read it over and over, you would have to open and close it once for each time you need to use it.
But instead of re-scanning the Ground Truth file from disk (slow!) for every row of each of the other CSVs, you should read it once into a dictionary, so you can look up IDs in one step.
with open(csvfilename) as csvfile:
    rocreader = csv.reader(csvfile)
    rocindex = dict((row[-1], row) for row in rocreader)
Then for any key newentry, you can just check like this:
if newentry in rocindex:
    truth = rocindex[newentry]
    # Merge it with the row that has key `newentry`
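Putting the pieces together, here is a minimal sketch of the whole merge (an illustration rather than a drop-in fix: it assumes, as in the question, that the ground-truth key is the first column with its sixth character removed, that the sample key is the last column of each data row, and that the appended ground-truth fields are columns 1 through 4):
import csv
import os
# Build the ground-truth index once, keyed by the cleaned-up id.
with open('GroundTruth/GroundTruth2010_edited_copy.csv') as csvfile:
    rocindex = {}
    for krow in csv.reader(csvfile):
        entry = krow[0]
        rocindex[entry[:5] + entry[6:]] = krow  # remove extra '0' from middle of entry
for filename in os.listdir(os.getcwd()):
    if not filename.endswith('.csv') or filename.endswith('_GT.csv'):
        continue  # skip non-CSV files and output files from earlier runs
    new_data = []
    with open(filename) as infile:
        for row in csv.reader(infile):
            rockid = row[-1]
            if rockid in rocindex:
                krow = rocindex[rockid]
                new_data.append(row + krow[1:5])  # flat row: sample fields + ground truth
    with open(filename[:-4] + "_GT.csv", "w", newline='') as f:
        csv.writer(f).writerows(new_data)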

Related

Using Python v3.5 to load a tab-delimited file, omit some rows, and output max and min floating numbers in a specific column to a new file

I've tried for several hours to research this, but none of the possible solutions has suited my particular needs.
I have written the following in Python (v3.5) to download a tab-delimited .txt file.
#!/usr/bin/env /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
import urllib.request
import time
timestr = time.strftime("%Y-%m-%d %H-%M-%S")
filename = "/data examples/" + "ace-magnetometer-" + timestr + ".txt"
urllib.request.urlretrieve('http://services.swpc.noaa.gov/text/ace-magnetometer.txt', filename=filename)
This downloads the file from here and renames it based on the current time. It works perfectly.
I am hoping that I can then use the "filename" variable to load the file and do some things to it (rather than having to write out the full file path and name, because my ultimate goal is to do the following to several hundred different files, so using a variable will be easier in the long run).
This using-the-variable idea seems to work, because adding the following to the above prints the contents of the file to STDOUT (so it's able to find the file without any issues):
import csv
with open(filename, 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        print(row)
As you can see from the file, the first 18 lines are informational.
Line 19 provides the actual column names. Then there is a line of dashes.
The actual data I'm interested in starts on line 21.
I want to find the minimum and maximum numbers in the "Bt" column (third column from the right). One of the possible solutions I found would only work with integers, and this dataset has floating-point numbers.
Another possible solution involved importing the pyexcel module, but I can't seem to install that correctly...
import pyexcel as pe
data = pe.load(filename, name_columns_by_row=19)
min(data.column["Bt"])
I'd like to be able to print the minimum Bt and maximum Bt values into two separate files called minBt.txt and maxBt.txt.
I would appreciate any pointers anyone may have, please.
This is meant to be a comment on your latest question to Apoc, but I'm new, so I'm not allowed to comment. One thing that might create problems is that bz_values (and bt_values, for that matter) might be a list of strings (at least it was when I tried to run Apoc's script on the example file you linked to). You could solve this by substituting this:
min_bz = min([float(x) for x in bz_values])
max_bz = max([float(x) for x in bz_values])
for this:
min_bz = min(bz_values)
max_bz = max(bz_values)
The following will work as long as all the files are formatted in the same way, i.e. the data starts 21 lines in, with the same number of columns, and so on. Also, the file that you linked did not appear to be tab-delimited, so I've simply used the string split method on each row instead of the csv reader. The column is read from the file into a list, and that list is used to calculate the maximum and minimum values:
from itertools import islice
# Line that data starts from, zero-indexed.
START_LINE = 20
# The column containing the data in question, zero-indexed.
DATA_COL = 10
# The value present when a measurement failed.
FAILED_MEASUREMENT = '-999.9'
with open('data.txt', 'r') as f:
    bt_values = []
    for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None)):
        if val != FAILED_MEASUREMENT:
            bt_values.append(float(val))
min_bt = min(bt_values)
max_bt = max(bt_values)
with open('minBt.txt', 'a') as minFile:
    print(min_bt, file=minFile)
with open('maxBt.txt', 'a') as maxFile:
    print(max_bt, file=maxFile)
I have assumed that since you are doing this to multiple files you are looking to accumulate multiple max and min values in the maxBt.txt and minBt.txt files, and hence I've opened them in 'append' mode. If this is not the case, please swap out the 'a' argument for 'w', which will overwrite the file contents each time.
Edit: Updated to include workaround for failed measurements, as discussed in comments.
Edit 2: Updated to fix problem with negative numbers, also noted by Derek in separate answer.
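Since the end goal is to repeat this over several hundred downloaded files, one option is to wrap the logic above in a function and loop over the files. A minimal sketch, assuming the same fixed layout and the file-naming pattern from the question:
import glob
from itertools import islice
START_LINE = 20               # data starts here, zero-indexed
DATA_COL = 10                 # Bt column, zero-indexed
FAILED_MEASUREMENT = '-999.9'
def bt_min_max(path):
    # Return (min, max) of the Bt column for one data file.
    with open(path, 'r') as f:
        bt_values = [float(val)
                     for val in (row.split()[DATA_COL]
                                 for row in islice(f, START_LINE, None))
                     if val != FAILED_MEASUREMENT]
    return min(bt_values), max(bt_values)
for path in glob.glob('/data examples/ace-magnetometer-*.txt'):
    min_bt, max_bt = bt_min_max(path)
    with open('minBt.txt', 'a') as f:
        print(min_bt, file=f)
    with open('maxBt.txt', 'a') as f:
        print(max_bt, file=f)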

Parse Specific Text File to CSV Format with Headers

I have a log file that is updated every few milliseconds; however, the information is currently saved with four (4) different delimiters. The log files contain millions of lines, so performing the operation in Excel is not an option.
What I have left to work on resembles:
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
I would like these converted to:
Sequence,Status,Report,Header,Profile
3433,true,223313,,xxxx
0323,true,43838,The,xxxx
5323,true,6541998,,xxxx
Meaning that I would need the header created from all portions that are followed by the "=" symbol. All of the other operations within the file are taken care of; this will be used to perform a comparative check between files and to replace or append fields. As I am new to Python, I only need assistance with this portion of the program I am writing.
Thank you all in advance!
You can try this.
First of all, I imported the csv library to reduce the work of placing commas and quotes.
import csv
Then I made a function that takes a single line from your log file and outputs a dictionary with the fields passed in the header. If the current line doesn't have a particular field from the header, it will stay filled with an empty string.
def convert_to_dict(line, header):
    d = {}
    for cell in header:
        d[cell] = ''
    row = line.strip().split(';')
    for cell in row:
        if cell:
            key, value = cell.split('=')
            d[key] = value
    return d
Since the header and the number of fields can vary between your files, I made a function to extract them. For this, I used a set, a collection of unique elements that is also unordered, so I converted it to a list and used the sorted function. Don't forget the seek(0) call, to rewind the file!
def extract_fields(logfile):
    fields = set()
    for line in logfile:
        row = line.strip().split(';')
        for cell in row:
            if cell:
                key, value = cell.split('=')
                fields.add(key)
    logfile.seek(0)
    return sorted(list(fields))
Lastly, I made the main piece of code, which opens both the log file for reading and the csv file for writing. It extracts and writes the header, then writes each converted line.
if __name__ == '__main__':
    with open('report.log', 'r') as logfile:
        with open('report.csv', 'wb') as csvfile:
            csvwriter = csv.writer(csvfile)
            header = extract_fields(logfile)
            csvwriter.writerow(header)
            for line in logfile:
                d = convert_to_dict(line, header)
                csvwriter.writerow([d[cell] for cell in header])
These are the files I used as an example:
report.log
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
report.csv
Header,Profile,Report,Sequence,Status
,xxxx,223313,3433,true
The,xxxx,43838,0323,true
,xxxx,6541998,5323,true
I hope it helps! :D
EDIT: I added support for different headers.
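As a side note, much of the manual dictionary work can be delegated to the standard library's csv.DictWriter, whose restval argument fills in missing fields. A minimal Python 3 sketch, reusing the extract_fields function from above:
import csv
with open('report.log') as logfile, open('report.csv', 'w', newline='') as csvfile:
    header = extract_fields(logfile)  # also rewinds the log file via seek(0)
    writer = csv.DictWriter(csvfile, fieldnames=header, restval='')
    writer.writeheader()
    for line in logfile:
        # Build a {field: value} dict from the 'key=value;' cells of the line.
        row = dict(cell.split('=') for cell in line.strip().split(';') if cell)
        writer.writerow(row)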

Partial Intersection of Specific Columns in Large CSV Files

I'm working on a script to find the intersection between large CSV files, based on the contents of only two specific columns in each file: Query ID and Subject ID.
The files come in Left and Right pairs for each species, and every single file looks something like this:
Similarity (%) Query ID Subject ID
100.000000 BRADI5G01462.1_1 BRADI5G16060.1_36
90.000000 BRADI5G02480.1_5 NCRNA_11838_6689
100.000000 BRADI5G06067.1_8 NCRNA_32597_1525
90.000000 BRADI5G08380.1_12 NCRNA_32405_1776
100.000000 BRADI5G09460.2_17 BRADI5G16060.1_36
90.909091 BRADI5G10680.1_20 NCRNA_2505_6156
Right files are always longer and larger in size than Left ones.
Here's the code snippet I have so far:
import csv
with open('#Left(Brachypodium_Japonica).csv', 'r', newline='') as Afile, \
     open('#Right(Brachypodium_Japonica).csv', 'r', newline='') as Bfile, \
     open('Intrsc-(Brachypodium_Japonica).csv', 'w', newline='') as Intrsct:
    reader1 = csv.reader(Afile, delimiter="\t", skipinitialspace=True)
    next(reader1, None)
    reader2 = csv.reader(Bfile, delimiter="\t", skipinitialspace=True)
    next(reader2, None)
    Intrsct = csv.writer(Intrsct, delimiter="\t", skipinitialspace=True)
    Intrsct.writerow(["Query ID", "Subject ID", "Left Similarity (%)", "Right Similarity (%)"])
    for row1, row2 in zip(Afile, Bfile):
        if row1[1] in row2[1] and row1[2] in row2[2]:
            Intrsct.writerow([row1.strip().split('\t')[1], row1.strip().split('\t')[2],
                              row1.strip().split('\t')[0], row2.strip().split('\t')[0]])
The code above iterates over the records of the two files simultaneously and searches for the contents of row[1] and row[2] of the first file in row[1] and row[2] of the second file, i.e. column-wise (it compares the Query ID in both files as well as the Subject ID), and prints the matches to a new file in a certain order.
The results are not exactly what I was expecting; apparently it finds the matches for the first wanted column only. I tried to trace the procedure manually and found that BRADI5G02480.1_5, for instance, exists in both files, but NCRNA_11838_6689 only exists on the Left side, not the Right!
Aren't they supposed to be mirror reflections of each other, aside from the numerical values?
I used this thread to write the script, but it compares line by line and doesn't check the rest of the column's contents for matches.
Also, I found this, but it uses dictionaries and lists, which aren't suitable for my files' size.
To handle the simultaneous iteration I used this thread, but what was mentioned there about handling different-sized files wasn't really clear to me, so I haven't tried it yet.
I would really appreciate it if someone could tell me what I'm missing here. Is the code correct, or am I using the in condition wrong?
Please, I really need help with this... thanks in advance :)
The following solution is a copy of my answer given to your other question, and should hopefully give you an idea how to integrate it with your current solution.
The script reads two (or more) CSV files in and writes the intersection of row entries to a new CSV file. By that I mean if row1 in input1.csv is found anywhere in input2.csv, the row is written to the output, and so on.
import csv
files = ["input1.csv", "input2.csv"]
ldata = []
for file in files:
    with open(file, "r") as f_input:
        csv_input = csv.reader(f_input, delimiter="\t", skipinitialspace=True)
        set_rows = set()
        for row in csv_input:
            set_rows.add(tuple(row))
        ldata.append(set_rows)
with open("Intersection(Brachypodium_Japonica).csv", "wb") as f_output:
    csv_output = csv.writer(f_output, delimiter="\t", skipinitialspace=True)
    csv_output.writerows(set.intersection(*ldata))
You will need to add your file name mangling. This format made it easier to test. Tested using Python 2.7.
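Since the question asks for the intersection on just the Query ID and Subject ID columns, while keeping both similarity values, a variant that indexes only the smaller Left file and streams the larger Right file might look like the following minimal Python 3 sketch (the file names are placeholders, and the column order is assumed to match the sample above):
import csv
# Index the smaller Left file by its (Query ID, Subject ID) pair.
left_index = {}
with open('left.csv', newline='') as left:
    reader = csv.reader(left, delimiter='\t', skipinitialspace=True)
    next(reader, None)  # skip the header row
    for similarity, query, subject in reader:
        left_index[(query, subject)] = similarity
# Stream the larger Right file and emit rows whose ID pair also occurs on the Left.
with open('right.csv', newline='') as right, \
     open('intersection.csv', 'w', newline='') as out:
    reader = csv.reader(right, delimiter='\t', skipinitialspace=True)
    writer = csv.writer(out, delimiter='\t')
    next(reader, None)
    writer.writerow(["Query ID", "Subject ID", "Left Similarity (%)", "Right Similarity (%)"])
    for similarity, query, subject in reader:
        if (query, subject) in left_index:
            writer.writerow([query, subject, left_index[(query, subject)], similarity])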

Look for a filename in a directory matching a row from a sqlite database and copy its contents

I'm really green and new to the Python world, and I'm learning as I go.
I'm trying to extract a series of rows from a sqlite database (which I've done).
I then write those to a csv file.
I'm now trying to take a row from that database and look into a directory where there is a filename with the same value.
So if the row's cell data is 1000, the file would be 1000.txt, amongst a long list of others in a directory. They're all in the same folder.
Once I find the file, I want to read that file's contents and add them to another row in the csv that I've created.
So my main question is how to search the directory based on the cell data, which is the filename (no extension is provided in the cell, just a number reference).
# Open file, get lines, close file.
# Probably prudent to add try-except here for bad file names.
msgfile = {}
filename = {}
for row in c:
    msgfile = msgID[row[1]]
    for filenames in os.walk(r"D:\my_source_directory"):
        ## stuck here
        f_open = open(msgfile, 'r')
        lines = f_open.readlines()
        f_open.close()
        print()
os.walk yields tuples that contain (current_dir, child_dirs, files), so at the very least your loop needs to look like for _, _, files in os.walk(...). That may be the root of your problems, but there's not much to go on here.
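To sketch the rest of what the question describes (find <value>.txt under the directory, read it, and append its contents to the CSV row), something along these lines could work. Note that the database name, query, and output file here are hypothetical placeholders:
import csv
import os
import sqlite3
SOURCE_DIR = r"D:\my_source_directory"
conn = sqlite3.connect("messages.db")                    # hypothetical database
c = conn.execute("SELECT col_a, col_b FROM messages")    # hypothetical query
with open("output.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in c:
        target = str(row[1]) + ".txt"  # e.g. cell value 1000 -> "1000.txt"
        for dirpath, dirnames, files in os.walk(SOURCE_DIR):
            if target in files:
                with open(os.path.join(dirpath, target)) as f:
                    contents = f.read()
                writer.writerow(list(row) + [contents])
                break  # stop walking once the file is found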

Why can't I repeat the 'for' loop for csv.reader?

I am a beginner at Python. I am now trying to figure out why the second 'for' loop doesn't work in the following script. I mean that I only get the result of the first 'for' loop, but nothing from the second one. I copied and pasted my script and the data csv below.
It would be helpful if you could tell me why it behaves this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv
file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)
for e in read:
    print(e['a'])
for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you've gone through it, you've read to the end of the file, so there is no more to read. If you need to go through it again, you can seek back to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
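For instance, here is a minimal sketch of the list approach applied to the script above (using text mode with newline='', as the Python 3 csv docs recommend):
import csv
with open("data.csv", newline='') as fh:
    data = list(csv.DictReader(fh))  # materialize all rows once
for e in data:
    print(e['a'])
for e in data:  # works: data is a plain list, so it can be iterated again
    print(e['b'])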
I have created a small function that takes the path of a csv file, reads it, and returns a list of dicts all at once; then you can loop through the list very easily:
import csv
def read_csv_data(path):
    """
    Read the CSV at the given path and return a list of dicts, one per row.
    """
    data = csv.reader(open(path))
    # Read the column names from the first line of the file
    fields = next(data)
    data_lines = []
    for row in data:
        items = dict(zip(fields, row))
        data_lines.append(items)
    return data_lines
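Used with the data.csv from the question, this might look like:
for row in read_csv_data("data.csv"):
    print(row['a'], row['b'])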
Regards
