Iteratively copy specific rows from CSV file to new file - python

I have a large tab-delimited csv file with the following format:
#mirbase_acc mirna_name gene_id gene_symbol transcript_id ext_transcript_id mirna_alignment gene_alignment mirna_start mirna_end gene_start gene_end genome_coordinates conservation align_score seed_cat energy mirsvr_score
What I would like to be able to do is iterate through rows and select items based on data (strings) in the "gene_id" field, then copy those rows to a new file.
I am a Python noob, and thought this would be a good way to get my feet wet, but it is harder than it looks! I have been trying to use the csv module to manipulate the files, reading and writing basic stuff using DictReader and DictWriter. If anyone can help me come up with a template for the iterative searching aspect, I would be greatly indebted. So far I have:
import csv
f = open("C:\Documents and Settings\Administrator\Desktop\miRNA Scripting\mirna_predictions_short.txt", "r")
reader = csv.DictReader(f, delimiter='\t')
writer = open("output.txt",'wb')
writer = csv.writer(writer, delimiter='\t')
Then the iterative bit, bleurgh:
for row in reader:
    if reader.gene_id == str(CG11710):
        writer.writerow
This obviously doesn't work. Any ideas on better ways to structure this?

You're almost there! The code is nearly correct :)
Accessing dicts goes like this:
some_dict['some_key']
Instead of:
some_object.some_attribute
Creating a string isn't done with str(...) but with quotes, like 'CG11710'.
In your case:
for row in reader:
    if row['gene_id'] == 'CG11710':
        writer.writerow(row)

Dictionaries in Python are addressed like dictionary['key']. So for you it'd be row['gene_id']. Also, strings are declared in quotes, like "text", not like str(text); str(text) will try to cast whatever is stored in the variable text to a string, which is not what I think you want. Also, writer.writerow is a function, and functions take arguments, so you need to do writer.writerow(row).
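Putting the corrections together, a complete Python 3 version might look like this (a sketch; since DictReader yields dicts, csv.DictWriter is the natural counterpart, and the output file is opened in text mode with newline='' instead of the old 'wb'):
import csv

with open('mirna_predictions_short.txt', newline='') as infile, \
        open('output.txt', 'w', newline='') as outfile:
    reader = csv.DictReader(infile, delimiter='\t')
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, delimiter='\t')
    writer.writeheader()  # keep the header row in the output
    for row in reader:
        if row['gene_id'] == 'CG11710':
            writer.writerow(row)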

Related

Storing DataFrame output utilizing Pandas to either csv or MySql DB

A question regarding pandas:
Say I created a DataFrame and generated output under separate variables rather than printing them. How would I go about combining them back into another DataFrame correctly, so I can either send it as a CSV and then upload it to a DB, or upload it directly to a DB?
Everything works fine code-wise; I just haven't really seen, or don't know, the best practice for doing this. I know we can store things in a list, dict, etc.
What I did was:
#imported all modules
object = df.iloc[0,0]
#For loop magic goes here
#nested for loop
#if conditions are met, do this
result = df.iloc[i, k+1]
print(object, result)
I've also stored them into a separate DataFrame trying:
df2 = pd.DataFrame({'object': object, 'result' : result}, index=[0])
df2.to_csv('output.csv', index=False, mode='a')
The only problem with that is that it appends everything to each row, most likely due to the append mode and perhaps not including it in the for loop. Which is odd, because the raw output is EXACTLY how I'm trying to get it into a CSV or into a DB.
As I was saying, though, I'm looking to combine both values back into a DataFrame for speed. I tried concat etc., but no luck, so I was wondering what the correct format would be? Thanks
So it turns out that after more research and revising, I solved my issue.
I referenced the question below, plus personal revisions; this is the basis of what I did:
Empty space in between rows after using writer in python
import csv

# Had to wrap this in a for loop (not listed) and append to the file while
# clearing it first, to remove the blank line after each row
with open('csvexample.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
Additional supporting material:
Confused by python file mode "w+"
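For the original goal of combining the values back into a DataFrame, a common pattern is to collect the pairs in a plain list inside the loop and build the DataFrame once at the end. A minimal sketch with stand-in data and a hypothetical condition:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})  # stand-in input data

records = []
for i in range(len(df)):              # the "for loop magic" goes here
    if df.iloc[i, 1] > 10:            # hypothetical condition
        records.append({'object': df.iloc[0, 0], 'result': df.iloc[i, 1]})

df2 = pd.DataFrame(records)           # one DataFrame, one row per match
df2.to_csv('output.csv', index=False)  # single write, no repeated headers
# or directly to a database:
# df2.to_sql('results', con=engine, if_exists='append', index=False)
Building the list first and writing once avoids the repeated headers that mode='a' produced.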

Labelling and Grouping Postcodes using Python

I'm fairly new to Python and I am attempting to group various postcodes together under predefined labels. For example, "SA31" would be labelled a "HywelDDAPostcode".
I have some code where I read lots of postcodes from a single-column file into a list and compare them with postcodes in predefined lists. However, when I output my postcode labels, only the label "UKPostcodes" is output for every postcode in my original file. It would appear that the first two conditions in my code always evaluate to false, no matter what. Am I doing the right thing using "in"? Or perhaps it's a file-reading issue? I'm not sure.
The input file is simply a file which contains a list of postcodes (in reality it has thousands of rows).
Here is my code:
import csv

with open('postcodes.csv', newline='') as f:
    reader = csv.reader(f)
    your_list = list(reader)

my_list = []
HywelDDAPostcodes=["SA46","SY23","SY24","SA18","SA16","SA43","SA31","SA65","SA61","SA62","SA17","SA48","SA40","SA19","SA20","SA44","SA15","SA14","SA73","SA32","SA67","SA45",
"SA38","SA42","SA41","SA72","SA71","SA69","SA68","SA33","SA70","SY25","SA34","LL40","LL42","LL36","SY18","SY17","SY20","SY16","LD6"]
NationalPostcodes=["LL58","LL59","LL60","LL61","LL62","LL63","LL64","LL65","LL66","LL67","LL68","LL69","LL70","LL71","LL72","LL73","LL74","LL75","LL76","LL77","LL78",
"NP1","NP2","NP23","NP3","CF31","CF32","CF33","CF34","CF35","CF36","CF3","CF46","CF81","CF82","CF83","SA35","SA39","SA4","SA47","LL16","LL18","LL21","LL22","LL24","LL25","LL26","LL27","LL28","LL29","LL30","LL31","LL32","LL33","LL34","LL57","CH7","LL11","LL15","LL16","LL17","LL18","LL19","LL20","LL21","LL22","CH1","CH4","CH5","CH6","CH7","LL12","CF1","CF32","CF35","CF5","CF61","CF62","CF63","CF64","CF71","LL23","LL37","LL38","LL39","LL41","LL43","LL44","LL45","LL46","LL47","LL48","LL49","LL51","LL52","LL53","LL54","LL55","LL56","LL57","CF46","CF47","CF48","NP4","NP5","NP6","NP7","SA10","SA11","SA12","SA13","SA8","CF3","NP10","NP19","NP20","NP9","SA36","SA37","SA63","SA64","SA66","CF44","CF48","HR3","HR5","LD1","LD2","LD3","LD4","LD5","LD7","LD8","NP8","SY10","SY15","SY19","SY21","SY22","SY5","CF37","CF38","CF39","CF4","CF40","CF41","CF42","CF43","CF45","CF72","SA1","SA2","SA3","SA4","SA5","SA6","SA7","SA1","NP4","NP44","NP6","LL13","LL14","SY13","SY14"]
NationalPostcodes2= list(dict.fromkeys(NationalPostcodes))
labels=["HywelDDA","NationalPostcodes","UKPostcodes"]
for postcode in your_list:
    #print(postcode)
    if postcode in HywelDDAPostcodes:
        my_list.append(labels[0])
    if postcode in NationalPostcodes2:
        my_list.append(labels[1])
    else:
        my_list.append(labels[2])

with open('DiscretisedPostcodes.csv', 'w') as result_file:
    wr = csv.writer(result_file, dialect='excel')
    for item in my_list:
        wr.writerow([item,])
If anyone has any advice as to what could be causing the issue or just any advice surrounding Python, in general, I would very much appreciate it. Thank you!
The reason why your comparison block isn't working is that when you use csv.reader to read your file, each line is added to your_list as a list. So you are making a list of lists, and when you compare those things they don't match:
['LL58'] == 'LL58' # fails
So, inspect your_list and see what I mean. You should create an empty your_list before you read the file and append each reading to it as a plain string. Then inspect that to make sure it looks good. It would also behoove you to use strip() to remove whitespace from each item; I can't recall if csv.reader does that automatically.
Also... a better structure for testing for membership is to use sets instead of lists. in will work for lists, but it is MUCH faster for sets, so I would put your comparison items into sets.
Lastly, it isn't clear what you are trying to do with NationalPostcodes2. Just use your NationalPostcodes, but put them in a set with {}.
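Put together, the fixes above might look like this (a sketch reusing the lists and labels from the question; note the elif, so a Hywel DDA code doesn't also fall through to the else branch):
import csv

with open('postcodes.csv', newline='') as f:
    reader = csv.reader(f)
    # take the first cell of each row, so you compare strings, not one-element lists
    your_list = [row[0].strip() for row in reader if row]

hywel_set = set(HywelDDAPostcodes)     # sets make the 'in' test much faster
national_set = set(NationalPostcodes)

my_list = []
for postcode in your_list:
    if postcode in hywel_set:
        my_list.append(labels[0])
    elif postcode in national_set:
        my_list.append(labels[1])
    else:
        my_list.append(labels[2])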
@Jeff H's answer is correct, but for what it's worth here's how I might write this code (untested):
# Note: Since, as you wrote, these are only single-column files I did not use the csv
# module, as it would just add unnecessary overhead.

# Read the known data from files--this will always be more flexible and maintainable
# than hard-coding them in your code. This is just one possible scheme for doing this;
# e.g. you could also put all of them into a single JSON file.
standard_postcode_files = {
    'HywelDDA': 'hyweldda.csv',
    'NationalPostcodes': 'nationalpostcodes.csv',
    'UKPostcodes': 'ukpostcodes.csv'
}

def read_postcode_file(filename):
    with open(filename) as f:
        # exclude blank lines and strip additional whitespace
        return [line.strip() for line in f if line.strip()]

standard_postcodes = {}
for key, filename in standard_postcode_files.items():
    standard_postcodes[key] = set(read_postcode_file(filename))

# Assuming all post codes are unique to a set, map postcodes to the set they belong to
postcodes_reversed = {v: k for k, s in standard_postcodes.items() for v in s}

your_postcodes = read_postcode_file('postcodes.csv')
labels = [postcodes_reversed[code] for code in your_postcodes]

with open('DiscretisedPostCodes.csv', 'w') as f:
    for label in labels:
        f.write(label + '\n')
I would probably do other things, like not hard-coding the input filename. If you need to work with multiple columns, using the csv module would also be fine with minimal additional changes, but since you're just writing one item per line I figured it was unnecessary.

From Excel to Python: from columns to strings

I have an Excel file with 2 columns (it is a CSV file). The first contains dates, while the second contains numbers.
I want to make a list out of this with Python, in the following way: ['date1', 'number1', 'date2', 'number2']. At the moment I always get: ['date1;number1', 'date2;number2']. I basically want every element to be treated as a string on its own.
At the moment I'm using following code:
import csv

abc = []
with open('C:...doc.csv', 'r', encoding="utf8") as f:
    reader = csv.reader(f)
    for row in reader:
        abc.extend([row])
I've tried everything I could come up with, e.g. nested for loops, etc., but nothing seems to work.
Can anyone help me out please? Thank you!
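For what it's worth: the output ['date1;number1', 'date2;number2'] suggests the file is semicolon-delimited, so a likely fix (sketched here with a placeholder path) is to tell csv.reader about the delimiter and extend the list with the row itself rather than a list wrapped around it:
import csv

abc = []
with open('doc.csv', 'r', encoding='utf8') as f:  # placeholder path
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        abc.extend(row)  # each cell becomes its own string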

List from CSV column in Python

So I had written a little snippet for Scrapy to search the country on a site by zip code, but it seems like a waste to go through all the nonexistent zip codes, so first, this is what I had...
def start_requests(self):
    for i in xrange(100000):
        yield self.make_requests_from_url("http://www.example.com/zipcode/%05d/search.php" % i)
The idea is obvious, but I downloaded a CSV with all of the US zip codes in a column - how would I easily use this as a list (or a more efficient structure than a list) in the above example? I have pandas if that would make things easier.
If I'm understanding you correctly, you have a file that is comma-delimited and formatted such that in a particular column (Perhaps titled 'ZipCodes') a zipcode is present on each row.
If there's a header line and different columns and you know the name of the column that contains the zipcodes you could do this:
def start_requests(self, filename, columnname):
    with open(filename) as file:
        headers = file.readline().strip().split(',')
        for line in file.readlines():
            zipcode = line.strip().split(',')[headers.index(columnname)]
            yield self.make_requests_from_url("http://www.example.com/zipcode/%s/search.php" % zipcode)
Open file, read lines, grab zip codes, yield ...
for line in open('zipcodes.csv', 'r').readlines():
    zipcode = line.strip().split(',')[columnNumberOfTheZipCodesStartingFrom0]
    yield self.make_requests_from_url("http://foo.com/blah/%s/search.php" % (zipcode,))
Just to round out the array of perfectly good suggestions, here's another. The main idea of this approach is that it doesn't require special libraries like pandas, but isn't just reading plain file contents either, in which case you have to re-invent the wheel as far as CSV markup goes (not the hardest thing, but why bother?). If your csv file is simple enough, it might be easier just to read out the file contents, as suggested by dg99.
Use python's built-in csv library!
import csv

ziplist = []
with open('zipcodes.csv', 'rb') as csvfile:
    zipreader = csv.reader(csvfile)
    for row in zipreader:
        ziplist.append(row[i])
Notes:
I have row[i], where i is the column index for the zipcodes in your csv file. If the file lists zip+4 codes, you might use row[i][:5]. Interestingly, if you don't know which column number the zipcodes will be in, but you do know the column header (field name), you can use:
zipreader = csv.DictReader(csvfile)
for zipDict in zipreader:
    ziplist.append(zipDict['Zip Code Column Name Here'])
According to this post, getting info back out of a list is just as efficient as a tuple, so this seems like the way to go.
So you want to read in a csv to a list... well, I think this should be easy:
import pandas
colname = ['zip code', 'city']
zipdata = pandas.read_csv('uszipcodes.csv', names=colname)
I hope I understood you right!
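To then get a plain Python list of zip codes out of that DataFrame, one more line should do it (assuming the 'zip code' column name above):
ziplist = zipdata['zip code'].tolist()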
Maybe like this?
#!/usr/local/cpython-3.3/bin/python
import csv
import pprint

def gen_zipcodes(file_):
    reader = csv.reader(file_, delimiter='|', quotechar='"')
    for row in reader:
        yield row[0]

def main():
    with open('zipcodes_2006.txt', 'r') as file_:
        zipcodes = list(gen_zipcodes(file_))
        pprint.pprint(zipcodes[:10])

main()

Python fast way to read several rows of csv text?

I wish to do the following as fast as possible with Python:
read rows i to j of a csv file
create the concatenation of all the strings in csv[row=(loop i to j)][column=3]
My first code was a loop (i to j) of the following:
with open('Train.csv', 'rt') as f:
row = next(itertools.islice(csv.reader(f), row_number, row_number+1))
tags = (row[3].decode('utf8'))
return tags
but my code above re-reads the csv from the start to fetch one row at a time, which is slow.
How can I read all the rows in one call and concatenate them quickly?
Edit, for additional information:
the csv file size is 7GB; I have only 4GB of RAM, on Windows XP; but I don't need to read all columns (only 1% of the 7GB would be good, I think).
Since I know which data you are interested in, I can speak from experience:
import csv

with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        row[0]  # ID
        row[1]  # title
        row[2]  # body
        row[3]  # tags
You can of course select anything you want from each row, and store it as you like.
By using a counter variable, you can decide which rows to collect:
import csv

with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    linenum = 0
    tags = []  # you can preallocate memory to this list if you want though.
    for row in reader:
        if linenum > 1000 and linenum < 2000:
            tags.append(row[3])  # tags
        if linenum == 2000:
            break  # so it won't read the next 3 million rows
        linenum += 1
The good thing about this is also that it will use very little memory, since you read line by line.
As mentioned, if you want the later rows, the reader still has to parse the data to get there (this is inevitable, since there are newlines within the text, so you can't skip to a certain row). Personally, I just roughly split the file into chunks with Linux's split, then edited them to make sure they start at an ID (and end with a tag).
Then I used:
train = pandas.io.parsers.read_csv(file, quotechar="\"")
To quickly read in the split files.
If the file is not HUGE (hundreds of megabytes) and you actually need to read a lot of rows, then probably just
tags = " ".join(x.split("\t")[3]
                for x in open("Train.csv").readlines()[from_row:to_row+1])
is going to be the fastest way.
If the file is instead very big, the only thing you can do is iterate over all the lines, because CSV unfortunately uses (in general) variable-sized records.
If by chance the specific CSV uses a fixed-size record format (not uncommon for large files), then seeking directly into the file may be an option.
If the file uses variable-sized records and the search must be done several times with different ranges, then creating a simple external index just once (e.g. line -> file offset for all line numbers that are a multiple of 1000) can be a good idea; a sketch follows.
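A minimal sketch of such an index, under the assumption that each record sits on its own line (no embedded newlines):
def build_line_index(path, every=1000):
    # map every 1000th line number to its byte offset in the file
    index = {}
    offset = 0
    with open(path, 'rb') as f:
        for lineno, line in enumerate(f):
            if lineno % every == 0:
                index[lineno] = offset
            offset += len(line)
    return index

# To reach row r later: f.seek(index[r // 1000 * 1000]), then read forward r % 1000 lines.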
Your question does not contain enough information, probably because you don't see some existing complexity: most CSV files contain one record per line. In that case it's simple to skip the rows you're not interested in. But in CSV, records can span lines, so a general solution (like the CSV reader from the standard library) has to parse the records to skip lines. It's up to you to decide what optimization is OK in your use case.
The next problem is that you don't know which part of the code you posted is too slow. Measure it. Your code will never run faster than the time you need to read the file from disk. Have you checked that? Or have you guessed which part is too slow?
If you want to do fast transformations of CSV data which fits into memory, I would propose using (and learning) Pandas. So it would probably be a good idea to split your code in two steps, as sketched below:
Reduce the file to the required data.
Transform the remaining data.
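For step 1, pandas can do the reduction in a single call. A sketch, assuming one record per line, no header row, and that column 3 holds the tags (skiprows, nrows and usecols are standard read_csv parameters; the row range is hypothetical):
import pandas as pd

i, j = 1000, 2000  # hypothetical row range
chunk = pd.read_csv('Train.csv', skiprows=i, nrows=j - i + 1,
                    usecols=[3], header=None)
tags = " ".join(chunk[3].astype(str))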
sed is designed for the task 'read rows i to j of a csv file'.
If the solution does not have to be pure Python, I think preprocessing the csv file with sed -n 'i,jp' and then parsing the output with Python would be simple and quick.
