replacing values from two csv files - python

I have a small issue that I hope you'll be able to help me with :) I've tried to provide simplified examples to help you see what I mean. I'm using Python 2.6.
So, I'm currently trying to re-assign some values in a file which represents interactions between two objects. The interaction file (file1) looks something like this:
Thing1 Thing2 0.625
Thing2 Thing3 0.191
Thing1 Thing3 0.173
Whilst my other file (file2), also a tsv, looks something like:
DiffName1 Thing1 ...
DiffName2 Thing2 ...
DiffName3 Thing3 ...
Essentially, I'd like to take file1, find the corresponding 'DiffName' value in file2, and make a new file with the same layout as file1 but with 'Thing1' replaced by 'DiffName1' and so on, whilst maintaining the structure of file1, i.e. two columns with the corresponding interaction value.
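So, derived from the examples above, the desired output would be:
DiffName1 DiffName2 0.625
DiffName2 DiffName3 0.191
DiffName1 DiffName3 0.173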
So far, from asking questions and reading answers on here, I've achieved similar results with this script (I've checked, but there may be some redundant/wrong things in here):
import csv
import sys

interaction_file = sys.argv[1]
Out_file = sys.argv[2]
f_output = open(Out_file, 'wb')

ids = {}
with open('file2') as f_file2:
    csv_file2 = csv.reader(f_file2, skipinitialspace=True)
    header = next(csv_file2)
    for cols in csv_file2:
        ids[cols[7]] = cols[0]

with open(interaction_file, 'rb') as f_file1:
    csv_file1 = csv.reader(f_file1, delimiter='\t')
    csv_output = csv.writer(f_output, delimiter='\t')
    for cols in csv_file1:
        csv_output.writerow([ids.get(cols[0], cols[0]), ids.get(cols[1], cols[1]), cols[2]])
But for whatever reason (I suspect due to the slightly different layout of file2 compared to the file this script was originally written for), I've been unable to make this work for me. I've spent quite a bit of time trying to understand each line of this script, but I still can't quite get it running, possibly because I don't fully understand the final line:
csv_output.writerow([ids.get(cols[0], cols[0]), ids.get(cols[1], cols[1]), cols[2]])
Is anyone able to give me some advice?
Cheers,
Matthew

Is ids[cols[7]] = cols[0] in that line just a typo? You seem to have only two columns in your example, yet you are trying to use index 7, i.e. the eighth column.
What the script does is declare a dictionary and populate it from the second file. Then, when you look up a key with ids.get(cols[0], cols[0]), it searches the dictionary for the key cols[0]; if that key is not present, get returns its second argument, in this case cols[0] itself.
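A quick demonstration of that fallback behaviour (a minimal sketch, with made-up values):
ids = {'Thing1': 'DiffName1'}
print(ids.get('Thing1', 'Thing1'))  # 'DiffName1' - key found, mapped value returned
print(ids.get('Thing9', 'Thing9'))  # 'Thing9' - key missing, the default is returned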

I added some annotations to your script and changed/shortened some bits. The docs on dict.get should help you understand the last line:
import csv, sys

interaction_file, out_file = sys.argv[1], sys.argv[2]
f_output = open(out_file, 'wb')

with open('file2') as f_file2:
    # get lines as a list and slice off the header row
    rows = list(csv.reader(f_file2, skipinitialspace=True, delimiter='\t'))[1:]
    # ids: Thing* as key, DiffName* as value
    ids = {row[1]: row[0] for row in rows}

with open(interaction_file, 'rb') as f_file1:
    csv_file1 = csv.reader(f_file1, delimiter='\t')
    csv_output = csv.writer(f_output, delimiter='\t')
    for row in csv_file1:
        # ids.get(row[0], row[0]): dict.get(key[, default])
        # use the value (DiffName*) for key row[0] (Thing*) from ids,
        # or fall back to row[0] (Thing*) itself
        # if it is not present as a key in ids
        csv_output.writerow([ids.get(row[0], row[0]), ids.get(row[1], row[1]), row[2]])

Check that your input files actually use the delimiters the script expects. Seeing the actual error message would also help.
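One quick way to check the delimiter (a small sketch): print the raw first line of each file and look at the separators.
with open('file2') as f:
    print(repr(f.readline()))  # tabs show up as \t, so the delimiter is obvious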

Mapping values to other corresponding values

I have an Excel file with 400k+ rows of protein-protein interactions identified by Entrez identifiers, and I want to map those identifiers to the corresponding identifiers of a different database, UniProt.
Provided that I have the corresponding UniProt ID for each Entrez ID, could you please suggest an efficient way to do this? I can't think of anything other than iterating over the database.
OK, this took me a minute to grok, but I think I have this for you. We discussed the example in chat, so you should probably update your question to reflect my answer since it varies from the original.
This just iterates over the tables, so it's not an especially efficient version, but I wasn't sure whether you had anything at this point to start from, so at least this is something.
We're trying to create table2 from table1 and table3:
Starting with these CSV files:
table1.csv
paperA_db1,paperB_db1
9240,8601
8933,91289
table3.csv
paper_db1,paper_db2
9240,Q8ND90
8933,A6ZKI3
8601,O76081
91289,Q9BU23
We can do this using Python's csv module, like this:
import csv

# Build the db1 -> db2 identifier mapping from table3.
mappings = {}
with open("table3.csv", newline="") as mapping_csv:
    reader = csv.DictReader(mapping_csv)
    for row in reader:
        mappings[row["paper_db1"]] = row["paper_db2"]

# Translate each interaction pair from table1.
# (Despite the name, this dict holds the rows of the future table2.)
table3 = {}
with open("table1.csv", newline="") as table1_csv:
    reader = csv.DictReader(table1_csv)
    for row in reader:
        table3[mappings[row["paperA_db1"]]] = mappings[row["paperB_db1"]]

# Write the translated pairs out as table2.
with open("table2.csv", newline="", mode="w") as table2_csv:
    fieldnames = ['paperA_db2', 'paperB_db2']
    writer = csv.DictWriter(table2_csv, fieldnames=fieldnames)
    writer.writeheader()
    for paperA_db2, paperB_db2 in table3.items():
        writer.writerow(dict(paperA_db2=paperA_db2, paperB_db2=paperB_db2))
Mine is very similar to @DanielSchroederDev's.
I've got an error check on the lookup, so the script keeps going.
I also just use a csv.reader rather than a csv.DictReader; two columns is pretty easy to keep in your head.
It also seems like overkill to use pandas, but if your data is in Excel you'll need an Excel reader. It's much easier to work with text files, so save as csv! (If you do want to read the Excel file directly, see the pandas sketch after the code below.)
import csv

trans = dict()
with open("key_file.csv", "r", encoding="utf8") as f:
    c = csv.reader(f)
    next(c)  # skip the header row
    for row in c:
        trans[row[0]] = row[1]
print(trans)

def lookup(p):
    try:
        return trans[p]
    except KeyError:
        print(f"No translation for {p}")
        return 0

with open("protiens.csv", "r", encoding="utf8") as f:
    c = csv.reader(f)
    next(c)  # skip the header row
    new_protiens = list(map(lambda x: [lookup(x[0]), lookup(x[1])], c))
print(new_protiens)

# newline="" stops csv.writer emitting blank lines between rows on Windows
with open("translated.csv", "w", encoding="utf8", newline="") as f:
    c = csv.writer(f)
    c.writerow(["proA_unipro", "proB_unipro"])
    for row in new_protiens:
        c.writerow(row)
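If the data really does live in an Excel file, a vectorised pandas version is one way to avoid looping row by row. This is only a sketch under assumptions: the file name interactions.xlsx and the column positions are hypothetical, and pandas.read_excel needs an engine such as openpyxl installed.
import pandas as pd

# Build the Entrez -> UniProt mapping from the key file (hypothetical name/layout).
mapping = pd.read_csv("key_file.csv", dtype=str)
trans = dict(zip(mapping.iloc[:, 0], mapping.iloc[:, 1]))

# Read the 400k-row interaction table straight from Excel and map both columns.
df = pd.read_excel("interactions.xlsx", dtype=str)
df.iloc[:, 0] = df.iloc[:, 0].map(trans)  # IDs with no translation become NaN
df.iloc[:, 1] = df.iloc[:, 1].map(trans)
df.to_csv("translated.csv", index=False)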

How would I validate that a CSV file is delimited by a certain character (in this case a backtick `)?

I have these huge CSV files that I need to validate; I need to make sure they are all delimited by the backtick character `. I have a reader opening each file and printing its content. Just wondering about the different ways you all would go about validating that each value is delimited by the backtick character.
for csvfile in self.fullcsvpathfiles:
    # print("file..")
    with open(self.fullcsvpathfiles[0], mode='r') as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter="`")
        for row in csv_reader:
            print(row)
Not sure how to go about validating that each value is separated by a backtick and throwing an error otherwise. These tables are huge (not that that's a problem for electricity ;) ).
Method 1
With the pandas library you could use the pandas.read_csv() function to read the csv file with sep='`' (which specifies the delimiter). If it parses the file into a dataframe in good shape, you can be fairly confident the delimiter is right.
Also, to automate the validation process, you could check whether the number of NaN values in the dataframe is within an acceptable level. Assuming your csv files do not have many blanks (so only a few NaN values are expected), you could compare the number of NaN values with a threshold you set.
import pandas as pd

nan_threshold = 20

for csvfile in self.fullcsvpathfiles:
    # if it fails at this step, then something (probably the delimiter) must be wrong
    my_df = pd.read_csv(csvfile, sep="`")
    nans = my_df.isnull().sum().sum()  # total NaN count across the whole dataframe
    if nans > nan_threshold:
        print(csvfile)  # make some warning here
Refer to the pandas documentation for more information about pandas.read_csv().
Method 2
As mentioned in the comments, you could also check that the number of occurrences of the delimiter is equal in each line of the file.
num_of_sep = -1  # initial value

# assume you are at the step of reading a file f
for line in f:
    num = line.count("`")
    if num_of_sep == -1:
        num_of_sep = num
    elif num != num_of_sep:
        print('Some warning here')
If you don't know how many columns are in a file, you could check that all the rows have the same number of columns. If you expect the header (first line) to always be correct, use it to determine the number of columns.
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter="`")
        ncols = len(next(csv_reader))  # the header sets the expected width
        if not all(len(row) == ncols for row in csv_reader):
            pass  # do something

for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as f:
        row = next(f)
        ncols = row.count('`')  # count separators rather than columns
        if not all(row.count('`') == ncols for row in f):
            pass  # do something
If you know how many columns are in a file...
for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as csv_file:
        # figure out how many columns it is supposed to have here
        ncols = special_process()
        csv_reader = csv.DictReader(csv_file, delimiter="`")
        if not all(len(row) == ncols for row in csv_reader):
            pass  # do something

for csvfile in self.fullcsvpathfiles:
    # figure out how many columns it is supposed to have here
    ncols = special_process()
    with open(csvfile, mode='r') as f:
        if not all(row.count('`') == ncols for row in f):
            pass  # do something
If you know the number of expected elements, you could inspect each line:
f = open(filename, 'r')
for line in f:
    line = line.split("`")
    if len(line) != numElements:
        raise Exception("Bad file")
If you know the delimiter that is being accidentally inserted, you could also try to recover instead of throwing an exception. Perhaps something like:
line="`".join(line).replace(wrongDelimiter,"`").split("`")
Of course, once you're that far into reading the file, there's no great need for using an external library to read the data. Just go ahead and use it.
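Another option, a minimal sketch using the standard library's csv.Sniffer (an approach not shown above): let it guess the dialect from a sample and check that the detected delimiter really is the backtick.
import csv

def looks_backtick_delimited(path, sample_size=4096):
    with open(path, newline='') as f:
        sample = f.read(sample_size)
    try:
        # restrict the sniffer to plausible delimiters, including the backtick
        dialect = csv.Sniffer().sniff(sample, delimiters="`,;\t")
    except csv.Error:
        return False  # no consistent delimiter found in the sample
    return dialect.delimiter == "`"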

Copy specific Column from a .csv file to another .csv file and save it as a new .csv file issue

Let's say File1 is the file I want to copy the column from, and File2 is the file where I want that column pasted; once pasted, the result should be saved as a new .csv file. It seems like simple code to write, yet my first attempt gave me this error: "AttributeError: 'file' object has no attribute 'writerow'". Clearly, I have no idea what I am doing wrong here, so I was wondering if you guys could help me. Here is the code I have written so far:
import csv

File1 = 'C:/Users/Alan Cedeno/Desktop/Test_Folder/dyn_0.csv'
File2 = 'C:/Users/Alan Cedeno/Desktop/Test_Folder/HiSAM1_data_160215_164858.csv'

with open(File1, "r") as r, open(File2, "a") as w:
    reader = csv.reader(r, lineterminator="\n")
    writer = csv.writer(w, lineterminator="\n")
    for row in reader:
        w.writerow(row[0])
If the question needs formatting, please let me know. Also, if you think the code will not do what I want, a hint about where to get started would definitely help. Please keep in mind I am a slow learner, so showing me how to make it work step by step would be a huge help! I just need a starter I can follow to write my own. Thanks :o)
Your most immediate problem is that w is the file object... you want writer. But you've got a few other issues. First, you described three files, not two. Next, you need to actually insert the column. Finally, you have to decide what to do if the two files have different lengths (see the note after the code below). In this example I assumed you wanted to take the first column from the first csv file and insert it as the first column in the merged result. I tweaked the file names to (hopefully) make things clearer.
The following code has several techniques for merging csvs as noted in the comments. You need to change them to your circumstances.
import os
import csv

File1 = 'C:/Users/Alan Cedeno/Desktop/Test_Folder/dyn_0.csv'
File2 = 'C:/Users/Alan Cedeno/Desktop/Test_Folder/HiSAM1_data_160215_164858.csv'

root, ext = os.path.splitext(File2)
output = root + '-new.csv'

with open(File1) as r1, open(File2) as r2, open(output, 'w') as w:
    writer = csv.writer(w)
    merge_from = csv.reader(r1)
    merge_to = csv.reader(r2)
    # skip 3 lines of headers
    for _ in range(3):
        next(merge_from)
    for merge_from_row, merge_to_row in zip(merge_from, merge_to):
        # insert "from" col 0 as "to" col 0
        merge_to_row.insert(0, merge_from_row[0])
        # replace "to" col 1 with "from" col 3
        merge_to_row[1] = merge_from_row[3]
        # delete merge_to cols 5,6,7 completely
        del merge_to_row[5:8]
        writer.writerow(merge_to_row)
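On the different-lengths point: plain zip silently stops at the shorter file. If you would rather keep every row, itertools.zip_longest (izip_longest in Python 2) is one option; a tiny demonstration, not part of the original answer:
from itertools import zip_longest

a = [1, 2, 3]
b = ['x']
print(list(zip(a, b)))          # [(1, 'x')] - zip stops at the shorter input
print(list(zip_longest(a, b)))  # [(1, 'x'), (2, None), (3, None)]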

Trying to convert a CSV file to int in Python [duplicate]

I am asking Python to print the minimum number from a column of CSV data, but the top row is the column number, and I don't want Python to take the top row into account. How can I make sure Python ignores the first line?
This is the code so far:
import csv
with open('all16.csv', 'rb') as inf:
    incsv = csv.reader(inf)
    column = 1
    datatype = float
    data = (datatype(column) for row in incsv)
    least_value = min(data)
    print least_value
Could you also explain what you are doing, not just give the code? I am very very new to Python and would like to make sure I understand everything.
You could use an instance of the csv module's Sniffer class to deduce the format of a CSV file and detect whether a header row is present along with the built-in next() function to skip over the first row only when necessary:
import csv

with open('all16.csv', 'r', newline='') as file:
    has_header = csv.Sniffer().has_header(file.read(1024))
    file.seek(0)  # Rewind.
    reader = csv.reader(file)
    if has_header:
        next(reader)  # Skip header row.
    column = 1
    datatype = float
    data = (datatype(row[column]) for row in reader)
    least_value = min(data)
    print(least_value)
Since datatype and column are hardcoded in your example, it would be slightly faster to process the row like this:
data = (float(row[1]) for row in reader)
Note: the code above is for Python 3.x. For Python 2.x use the following line to open the file instead of what is shown:
with open('all16.csv', 'rb') as file:
To skip the first line just call:
next(inf)
Files in Python are iterators over lines.
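Applied to the question's code, that's a one-line change (a sketch in the question's Python 2 style):
import csv
with open('all16.csv', 'rb') as inf:
    next(inf)  # discard the header line before the reader sees it
    incsv = csv.reader(inf)
    print min(float(row[1]) for row in incsv)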
Borrowed from the Python Cookbook, a more concise template might look like this:
import csv
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        pass  # Process row ...
In a similar use case I had to skip annoying lines before the line with my actual column names. This solution worked nicely. Read the file first, then pass the list to csv.DictReader.
with open('all16.csv') as tmp:
    # Skip first line (if any)
    next(tmp, None)
    # {line_num: row}
    data = dict(enumerate(csv.DictReader(tmp)))
You would normally use next(incsv), which advances the iterator one row, so you skip the header. The other option (say you wanted to skip 30 rows) would be:
from itertools import islice
for row in islice(incsv, 30, None):
    pass  # process
Use csv.DictReader instead of csv.reader. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the field names. You would then be able to access field values using row["1"] etc.
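For example, a short sketch assuming the header cell of the relevant column is literally the text "1" (as in the question, where the top row holds column numbers):
import csv
with open('all16.csv') as f:
    reader = csv.DictReader(f)  # consumes the first row as field names
    print(min(float(row["1"]) for row in reader))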
Python 2.x
csvreader.next()
Return the next row of the reader’s iterable object as a list, parsed
according to the current dialect.
csv_data = csv.reader(open('sample.csv'))
csv_data.next()  # skip first row
for row in csv_data:
    print(row)  # should print second row
Python 3.x
csvreader.__next__()
Return the next row of the reader’s iterable object as a list (if the
object was returned from reader()) or a dict (if it is a DictReader
instance), parsed according to the current dialect. Usually you should
call this as next(reader).
csv_data = csv.reader(open('sample.csv'))
csv_data.__next__()  # skip first row
for row in csv_data:
    print(row)  # should print second row
The documentation for the Python 3 CSV module provides this example:
with open('example.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...
The Sniffer will try to auto-detect many things about the CSV file. You need to explicitly call its has_header() method to determine whether the file has a header line. If it does, then skip the first row when iterating the CSV rows. You can do it like this:
if sniffer.has_header():
    for header_row in reader:
        break

for data_row in reader:
    pass  # do something with the row
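The loop-and-break idiom works, but the built-in next with a default does the same thing more directly (a small sketch):
if sniffer.has_header():
    next(reader, None)  # discard the header row; no error on an empty file

for data_row in reader:
    pass  # do something with the row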
This might be a very old question, but with pandas we have a very easy solution:
import pandas as pd

data = pd.read_csv('all16.csv', skiprows=1)
data['column'].min()
With skiprows=1 we skip the first row, and then we can find the least value using data['column'].min().
The new 'pandas' package might be more relevant than 'csv'. The code below will read a CSV file, by default interpreting the first line as the column header and find the minimum across columns.
import pandas as pd
data = pd.read_csv('all16.csv')
data.min()
Because this is related to something I was doing, I'll share here.
What if we're not sure there's a header, and you also don't feel like importing Sniffer and other things?
If your task is basic, such as printing or appending to a list or array, you could just use an if statement:
import csv

array = []  # collected rows

# Let's say there's 4 columns
with open('file.csv') as csvfile:
    csvreader = csv.reader(csvfile)
    # read first line
    first_line = next(csvreader)
    # My headers were just text. You can use any suitable conditional here
    if len(first_line) == 4:
        array.append(first_line)
    # Now we'll just iterate over everything else as usual:
    for row in csvreader:
        array.append(row)
Well, my mini wrapper library would do the job as well.
>>> import pyexcel as pe
>>> data = pe.load('all16.csv', name_columns_by_row=0)
>>> min(data.column[1])
Meanwhile, if you know the header of column index one, for example "Column 1", you can do this instead:
>>> min(data.column["Column 1"])
For me the easiest way to go is to use range:
import csv
with open('files/filename.csv') as I:
    reader = csv.reader(I)
    fulllist = list(reader)

# Starting with data, skipping the header
for item in range(1, len(fulllist)):
    # Print each row using "item" as the index value
    print(fulllist[item])
I would convert the csv reader to a list, then pop the first element:
import csv
with open(fileName, 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    data = list(csvreader)  # Convert to list
    data.pop(0)  # Removes the first row
    for row in data:
        print(row)
I would use tail to get rid of the unwanted first line:
tail -n +2 $INFIL | whatever_script.py
Just add [1:]. Example below:
data = pd.read_csv("/Users/xyz/Desktop/xyxData/xyz.csv", sep=',', header=None)[1:]
That works for me in IPython.
Python 3.X
Handles UTF8 BOM + HEADER
It was quite frustrating that the csv module could not easily get the header; there is also a bug with the UTF-8 BOM (the first character in the file).
This works for me using only the csv module:
import csv

def read_csv(self, csv_path, delimiter):
    with open(csv_path, newline='', encoding='utf-8') as f:
        # https://bugs.python.org/issue7185
        # Remove UTF8 BOM.
        txt = f.read()[1:]
        # Remove header line.
        header = txt.splitlines()[:1]
        lines = txt.splitlines()[1:]
        # Convert to list.
        csv_rows = list(csv.reader(lines, delimiter=delimiter))
        for row in csv_rows:
            value = row[INDEX_HERE]
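For what it's worth, the 'utf-8-sig' codec strips the BOM automatically, so the manual f.read()[1:] slice isn't needed; a minimal sketch:
import csv

with open('data.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    header = next(reader)  # header row, BOM already removed by the codec
    for row in reader:
        pass  # process row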
A simple solution is to use csv.DictReader():
import csv

def read_csv(file):
    with open(file, 'r') as file:
        reader = csv.DictReader(file)
        for row in reader:
            print(row["column_name"])  # Replace with the name of your column header.
