I'm creating a script to read a CSV file into namedtuples whose fields come from the column headers. I will then use these namedtuples to pull out the rows of data that meet certain criteria.
I've worked out the input (shown below), but am having issues with filtering the data before outputting it to another file.
import csv
from collections import namedtuple

with open('test_data.csv') as f:
    f_csv = csv.reader(f)  # read using csv.reader()
    Base = namedtuple('Base', next(f_csv))  # create namedtuple fields from the header row
    for r in f_csv:  # for each row in the file
        row = Base(*r)
        # Process row
        print(row)  # print data
The contents of my input file are as follows:
Locus Total_Depth Average_Depth_sample Depth_for_17
chr1:6484996 1030 1030 1030
chr1:6484997 14 14 14
chr1:6484998 0 0 0
And they are printed from my code as follows:
Base(Locus='chr1:6484996', Total_Depth='1030',
Average_Depth_sample='1030', Depth_for_17='1030')
Base(Locus='chr1:6484997', Total_Depth='14',
Average_Depth_sample='14', Depth_for_17='14')
Base(Locus='chr1:6484998', Total_Depth='0', Average_Depth_sample='0',
Depth_for_17='0')
I want to be able to pull out only the records with a Total_Depth greater than 15.
Intuitively I tried the following check:
if Base.Total_Depth >= 15:
    print(row)
However this only prints the final row of data (from the above output table). I think the problem is twofold. As far as I can tell, I'm not storing my namedtuples anywhere for them to be referenced later. And secondly, the numbers are being read in as strings rather than as integers.
Firstly, can someone correct me if I do need to store my namedtuples somewhere?
And secondly, how do I convert the string values to integers? Or is this not possible because namedtuples are immutable?
Thanks!
I previously asked a similar question with respect to dictionaries, but now would like to use namedtuples instead. :)
Map your values to int when creating the named tuple instances:
row = Base(r[0], *map(int, r[1:]))
This keeps the r[0] value as a string, and maps the remaining values to int().
This does require knowledge of the CSV columns, since which ones can be converted to integer is hardcoded here.
Demo:
>>> from collections import namedtuple
>>> Base = namedtuple('Base', ['Locus', 'Total_Depth', 'Average_Depth_sample', 'Depth_for_17'])
>>> r = ['chr1:6484996', '1030', '1030', '1030']
>>> Base(r[0], *map(int, r[1:]))
Base(Locus='chr1:6484996', Total_Depth=1030, Average_Depth_sample=1030, Depth_for_17=1030)
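If you'd rather not hardcode which columns are numeric, a small sketch of a fallback converter (an alternative, not required for your file):

def convert(value):
    # Try int first; fall back to the original string (e.g. for 'chr1:6484996')
    try:
        return int(value)
    except ValueError:
        return value

row = Base(*map(convert, r))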
Note that you should test against the rows, not the Base class:
if row.Total_Depth >= 15:
within the loop, or in a new loop over the collected rows (a full sketch follows).
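Putting both fixes together, a minimal sketch of the whole script, collecting the matching rows in a list so they can be referenced later:

import csv
from collections import namedtuple

filtered = []
with open('test_data.csv') as f:
    f_csv = csv.reader(f)
    Base = namedtuple('Base', next(f_csv))
    for r in f_csv:
        row = Base(r[0], *map(int, r[1:]))  # Locus stays a string, depths become ints
        if row.Total_Depth >= 15:
            filtered.append(row)  # store the matches so they can be written out later

for row in filtered:
    print(row)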
Related
I have a csv file with over 5,000,000 rows of data that looks like this (except that it is in Farsi):
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766140,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,1050000,7266.44,5,Concrete,13890108,5166884645
766146,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,700000,4844.29,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
770822,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,50,500000,1730.1,5,Concrete,13890114,5166884645
I would like code that lists the distinct values in a specific column.
For example, I'd like it to return {Kish, Qeshm, Tabriz} for the 'city' column.
You first need to import the csv module into your Python file, then read over each row in the file and save the value you want in a list, like so:
import csv

cities = []
with open("yourfile.csv", "r") as file:
    reader = csv.DictReader(file)  # uses the first line as the header, so it skips that line automatically
    for row in reader:
        city = row["City"]
        cities.append(city)
This will give you a list like cities = ['Kish', 'Qeshm', 'Tabriz', ...].
It appears you want to remove duplicates as well, which you can get by simply casting the finished list to a set. Here's how to do it with pandas:
import pandas as pd

cities = pd.read_csv('yourfile.csv', usecols=['City'])['City']
# just cast to list if you want a plain list instead of a Series
cities_list = list(cities)
# use set to remove the duplicates
unique_cities = set(cities)
In case you need to preserve ordering, you can use a dict with just keys, since dicts keep insertion order in Python 3.7+; see the sketch below.
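For example, a one-line order-preserving version:

# dict keys are unique and keep insertion order (Python 3.7+)
unique_cities_ordered = list(dict.fromkeys(cities_list))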
Also, in case you're short on memory trying to read 5M rows in one go, you can read them in chunks:
import pandas as pd

cities_chunks_list = [chunk['City'] for chunk in pd.read_csv('yourfile.csv', usecols=['City'], chunksize=1000)]
# let's flatten the list
cities_list = [city for cities_chunk in cities_chunks_list for city in cities_chunk]
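If only the unique values are needed, a variant sketch that never holds all 5M values at once (the chunk size is just an illustrative choice):

import pandas as pd

unique_cities = set()
for chunk in pd.read_csv('yourfile.csv', usecols=['City'], chunksize=100000):
    unique_cities.update(chunk['City'].dropna())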
Hope I helped.
The problem:
I have lists of genes expressed in 53 different tissues. Originally, this data was stored in a maximal array of the genes, with 'NaN' where there was no expression. I am trying to create new lists for each tissue that have just the genes expressed, as it was very inefficient to search through this array every time I ran my script. I have code that finds the genes for each tissue as required, but I do not know how to store the output.
I was using a pandas DataFrame, and then converting to csv. But this does not accept lists of varying length, unless I put each list in as a single item. However, when I then save the data frame to a csv, it tries to squeeze this very long list (all genes expressed for one tissue) into a single cell. I get an error of the string length exceeding the Excel character-per-cell limit.
Therefore I need a way of either dealing with this limit, or storing my lists in a different way. I would rather just have one file for all lists.
My code:
import csv
import pandas as pd
import math
import numpy as np

# Import list of tissues:
df = pd.read_csv(r'E-MTAB-5214-query-results.tsv', skiprows=[0, 1, 2, 3], sep='\t')
tissuedict = df.to_dict()
tissuelist = list(tissuedict.keys())[2:]
all_genes = [gene for key, gene in tissuedict['Gene Name'].items()]
data = []
for tissue in tissuelist:
    # Indices of genes with no expression value (NaN) in this tissue
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    datatis = [tissue, tissueExpression.tolist()]
    print(datatis)
    data.append(datatis)
print(data)
df = pd.DataFrame(data)
df.to_csv(r'tissue_expression_data.csv')
Link to data (either one):
https://github.com/joanna-lada/gene_data/blob/master/E-MTAB-5214-query-results.tsv
https://raw.githubusercontent.com/joanna-lada/gene_data/master/E-MTAB-5214-query-results.tsv
IIUC you need lists of the gene names found in each tissue. This writes these lists as columns into a csv:
import pandas as pd

df = pd.read_csv('E-MTAB-5214-query-results.tsv', skiprows=[0, 1, 2, 3], sep='\t')
df = df.drop(columns='Gene ID').set_index('Gene Name')
res = pd.DataFrame()
for c in df.columns:
    res = pd.concat([res, pd.Series(df[c].dropna().index, name=c)], axis=1)
res.to_csv('E-MTAB-5214-query-results.csv', index=False)
(Writing them as rows would have been easier, but Excel can't import so many columns)
Don't open the csv in Excel directly, but use a blank worksheet and import the csv (Data - External data, From text), otherwise you can't separate them into Excel columns in one run (at least in Excel 2010).
Create your data variable as a dictionary; you can then save the dictionary to a JSON file using json.dump:
import json

data = {}
for tissue in tissuelist:
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    data[tissue] = tissueExpression.tolist()
with open('filename.json', 'w') as fp:
    json.dump(data, fp)
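Reading the lists back later is just as easy:

import json

with open('filename.json') as fp:
    data = json.load(fp)  # data[tissue] is the list of expressed genes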
I am attempting to determine if the data inside a list is within a dataframe column. I am new to Pandas and have been struggling with this, so at the moment I am turning the dataframe column of interest into a list. However, when I call .tolist() on it, the resulting list has unicode markers around every string. As I am attempting to compare this with text from the other list, which is not in unicode, I am running into issues.
I attempted to turn the other list into unicode, but then the list had items that read like u'["item"]', which didn't help. I have also tried to remove the unicode from the dataframe but only get errors. I cannot iterate, as pandas tells me that the dataframe is too long to iterate over. Below is my code:
import csv
import pandas as pd

SDC_wb = pd.ExcelFile('C:\ BLeh')
df = SDC_wb.parse(SDC_wb.sheet_names[1], header=1)

def Follower_count(filename):
    filename = open(filename)
    reader = csv.reader(filename)
    handles = df['things'].tolist()
    print handles
    dict1 = {}
    for item in reader:
        if item in handles:
            user = api.get_user(item)
            dict1[item] = user.Follower_count
    newdf = pd.DataFrame(dict1)
    newdf.to_csv('test1.csv', encoding='utf-8')
Here is what the list from the dataframe looks like:
[u'#Mastercard', u'#Visa', u'#AmericanExpress', u'#CapitalOne']
Here is what x = [unicode(s) for s in some_list] looks like:
u"['#HomeGoods']", u"['#pier1']", u"['#houzz']", u"['#InteriorDesign']", u"['#zulily']"]
Naturally these don't align to check the "in" requirement. Thus, I need a method of converting the .tolist() object from:
[u'#Mastercard', u'#Visa', u'#AmericanExpress', u'#CapitalOne']
to:
[#Mastercard, #Visa, #AmericanExpress, #CapitalOne]
so that the if item in handles check will see matching handles (one possible normalization is sketched below).
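For instance, assuming the second list really holds stringified one-item lists as shown above, a minimal normalization sketch:

import ast

raw = [u"['#HomeGoods']", u"['#pier1']", u"['#houzz']"]
# Unwrap each stringified one-item list back into a plain handle string
cleaned = [ast.literal_eval(s)[0] for s in raw]
# cleaned == ['#HomeGoods', '#pier1', '#houzz']; for ASCII text these
# compare equal to the u'...' handles produced by .tolist()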
Thanks for your help.
I am new to Python. I use Fortran to generate the data file I wish to read. For many reasons I would like to use Python to calculate averages and statistics on the data rather than Fortran.
I need to read the entries in the first three rows as strings, and then the data, which begins in the fourth row onwards, as numbers. I don't need the first column, but I do need the rest of the columns each as their own arrays.
# Instantaneous properties
# MC_STEP Density Pressure Energy_Total
# (molec/A^3) (bar) (kJ/mol)-Ext
0 0.34130959E-01 0.52255964E+05 0.26562549E+04
10 0.34130959E-01 0.52174646E+05 0.25835710E+04
20 0.34130959E-01 0.52050492E+05 0.25278775E+04
And the data goes on for thousands, and sometimes millions of lines.
I have tried the following, but run into problems since I can't analyze the lists I have made, and I can't seem to convert them to arrays. I would prefer to just create arrays to begin with, but if I can convert my lists to arrays that would work too. In my approach I get an error when I try to use an element in one of the lists, e.g. Energy(i).
with open('nvt_test_1.out.box1.prp1') as f:
    Title = f.readline()
    Properties = f.readline()
    Units = f.readline()
    Density = []
    Pressure = []
    Energy = []
    for line in f:
        row = line.split()
        Density.append(row[1])
        Pressure.append(row[2])
        Energy.append(row[3])
I appreciate any help!
I would use the pandas module for this task:
import pandas as pd
In [9]: df = pd.read_csv('a.csv', delim_whitespace=True,
                         comment='#', skiprows=3, header=None,
                         names=['MC_STEP', 'Density', 'Pressure', 'Energy_Total'])
Data Frame:
In [10]: df
Out[10]:
MC_STEP Density Pressure Energy_Total
0 0 0.034131 52255.964 2656.2549
1 10 0.034131 52174.646 2583.5710
2 20 0.034131 52050.492 2527.8775
Average values for all columns:
In [11]: df.mean()
Out[11]:
MC_STEP 10.000000
Density 0.034131
Pressure 52160.367333
Energy_Total 2589.234467
dtype: float64
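Other statistics work the same way; for example:

df['Energy_Total'].std()  # standard deviation of one column
df.describe()             # count, mean, std, min, quartiles and max for every column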
You can consider a list in Python like an array in other languages, and it's very optimised. If you have special needs there is an array type available, but it is rarely used; alternatively there is numpy.array, which is designed for scientific computation. You have to install the NumPy package for that.
Before performing calculations, cast the strings to float, as in Energy.append(float(row[3])).
Maybe do it all at once using the map function:
row = map(float, line.split())
Last, as #Hamms said, access the elements by using square brackets e = energy[i]
You can also use the csv module's DictReader to read each row into a dictionary, as follows:
import csv

Density, Pressure, Energy = [], [], []
with open('filename', 'r') as f:
    # csv only accepts single-character delimiters; skipinitialspace absorbs runs of spaces
    reader = csv.DictReader(f, delimiter=' ', skipinitialspace=True,
                            fieldnames=('MC_STEP', 'DENSITY', 'PRESSURE', 'ENERGY_TOTAL'))
    for row in reader:
        Density.append(float(row['DENSITY']))
        Pressure.append(float(row['PRESSURE']))
        Energy.append(float(row['ENERGY_TOTAL']))
Of course this assumes that the file is formatted more like a CSV (that is, no comments). If the file does have comments at the top, you can skip them before initializing the DictReader by calling next(f) once per comment line:
for _ in range(3):  # one next() per comment line
    next(f)
I'm trying to create code that checks whether the value in the index column of a CSV is the same in different rows, and if so, finds the most frequently occurring values in the other columns and uses those as the final data. That's not a very good explanation; basically I want to take this data.csv:
customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
And create a new answer.csv that recognizes that there are multiple rows for the same customer, so it finds the values that occur the most in each column and outputs those into one row:
customer_ID,month,ABC
1003,Jan,114
1004,Feb,251
I'd also like to learn: if there are values with the same number of occurrences (month and B for customer 1004), how can I choose which one gets output?
I've currently written (thanks to Andy Hayden on a previous question I just asked):
import pandas as pd
df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print df
res.to_frame(name='answer').to_csv('answer.csv')
All this does, however, is create this (I was ignoring month previously, but now I'd like to incorporate it so that I can learn how to not only find the mode of a column of numbers, but also the most occurring string):
customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0
Note: I don't know why it is outputting the .0 at the end of the ABC; the values seem to be read as floats rather than strings. I want each column to be output as just the 3-digit number.
Edit: I'm also having an issue that if the value in column A is 0, the output becomes 2 digits and drops the leading 0.
What about something like this? This is not using Pandas though, I am not a Pandas expert.
from collections import Counter

dataDict = {}
# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:
    next(dataFile)  # skip the header row so it is not counted as a customer
    for line in dataFile:
        # split the line by ',' since it is a csv file...
        entry = line.strip().split(',')
        # Check to make sure that there is data in the line
        if entry and len(entry[0]) > 0:
            # if the customer_id is not in dataDict, add it
            if entry[0] not in dataDict:
                dataDict[entry[0]] = {'month': [entry[1]],
                                      'time': [entry[2]],
                                      'ABC': [''.join(entry[3:])],
                                      }
            # customer_id is already in dataDict, add values
            else:
                dataDict[entry[0]]['month'].append(entry[1])
                dataDict[entry[0]]['time'].append(entry[2])
                dataDict[entry[0]]['ABC'].append(''.join(entry[3:]))

# Now write the output file
with open('out.csv', 'w') as f:
    # Loop through sorted customers
    for customer in sorted(dataDict.keys()):
        # use Counter to find the most common entries
        commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
        commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
        commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]
        # Write the line to the csv file
        f.write(','.join([customer, commonMonth, commonTime, commonABC]) + '\n')
It generates a file called out.csv that looks like this:
1003,Jan,2:00,114
1004,Feb,8:00,251
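If you'd rather stay in pandas after all, here is a hedged sketch of the same idea. It reads every column as a string (so the .0 suffix and the lost leading zeros go away) and picks each group's most common value; on ties, whichever value value_counts ranks first wins:

import pandas as pd

# read everything as strings so A, B, C keep their exact digits
df = pd.read_csv('data.csv', dtype=str)

def most_common(s):
    # most common value in the column; ties go to value_counts' first-ranked entry
    return s.value_counts().index[0]

res = df.groupby('customer_ID').agg(most_common)
res['ABC'] = res['A'] + res['B'] + res['C']  # concatenate the three digits
res[['month', 'ABC']].to_csv('answer.csv')

With the sample data this writes rows like 1003,Jan,114 under a customer_ID,month,ABC header.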