What is the cleanest way of reading in a multi-column tsv file in python with headers, but where the first column has no header and instead contains the row numbers for each row?
This is apparently a common format for files exported from R data frames.
Example:
A B C
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
Any ideas?
Depends on what you want to do with the data afterwards (and if the file is truly a tsv with a \t delimiter). If you just want it in a set of lists you can use the csv module like so:
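import csv

with open("tsv.tsv", newline="") as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter="\t")
    header = next(tsvreader)  # the header line has no row-number field
    print(header)
    for line in tsvreader:
        print(line[1:])  # skip the leading row number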
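If you also want each value paired with its column name, one possible variation is sketched below. It assumes the R-style layout from the question (the header line has one field fewer than the data lines), and it reads from an in-memory string called `sample`, a made-up stand-in for the real file.

```python
import csv
import io

# Stand-in for the real file; assumes tab-separated data as written by R.
sample = "A\tB\tC\n1\ta1\tb1\tc1\n2\ta2\tb2\tc2\n3\ta3\tb3\tc3\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)  # column names only, no entry for the row numbers

# Drop the leading row number, then pair the remaining values with the header.
rows = [dict(zip(header, line[1:])) for line in reader]

for row in rows:
    print(row)
```

With a real file you would pass the object returned by `open("tsv.tsv", newline="")` to `csv.reader` instead of the `io.StringIO` wrapper.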
However, I'd also recommend the DataFrame class from pandas for anything beyond simple Python operations. It can be used like this:
import pandas as pd
df = pd.read_csv("tsv.tsv", sep="\t")
DataFrames allow for high-level manipulation of data sets, such as adding columns, finding averages, etc.
df = DataFrame.from_csv("tsv.tsv", sep="\t") is deprecated since version 0.21.0
df = pd.read_csv("tsv.tsv", sep="\t") is the way to go
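For the R-style file in the question, a short sketch with `pd.read_csv` follows. As far as I know, when the header line has one field fewer than the data lines, pandas uses the first column as the index automatically, so no extra arguments should be needed. The `sample` string is an assumption standing in for the real file.

```python
import io

import pandas as pd

# Stand-in for the real file: header has three names, data lines have four fields.
sample = "A\tB\tC\n1\ta1\tb1\tc1\n2\ta2\tb2\tc2\n3\ta3\tb3\tc3\n"

df = pd.read_csv(io.StringIO(sample), sep="\t")

print(df.columns.tolist())  # the named columns
print(df.index.tolist())    # the row numbers, used as the index
```

If the header did contain a (possibly empty) name for the first column, passing `index_col=0` would be the explicit way to get the same result.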
How about using plain Python code like the following:
with open('tsvfilename') as f:
    lines = f.read().splitlines()
    for i, line in enumerate(lines):
        if i == 0:  # header
            column_names = line.split()
            # ...
        else:
            data = line.split()  # data[0] is the row number
            # ...
Import Pandas library
import pandas as pd
data = pd.read_csv('/ABC/DEF/TSV.tsv', sep='\t')
DataFrame.from_csv("tsv.tsv", sep="\t")
is not working anymore.
Use
pd.read_csv("tsv.tsv", sep="\t")
pandas.read_csv("file.tsv", sep="\t")
DataFrame.from_csv() doesn't work. DataFrame.read_csv() isn't right.
Related
I have a csv file that looks like
F1 F2 F3
A1 2 4 2
A2 4 1 2
When I read the file using pandas, I see that the first column is unnamed.
import pandas as pd
df = pd.read_csv("data.csv")
features = df.columns
print( features )
Index(['Unnamed: 0', 'F1', 'F2', 'F3'])
In fact I want to get only F1, F2 and F3. I can fix that with some array manipulation. But I want to know if pandas has some builtin capabilities to do that. Any thought?
UPDATE:
Using index_col = False or None doesn't work either.
The unnamed column appears only because the index column is being read as data. You can pass index_col=[0] to read_csv to resolve it.
This picks the first column as the index instead of treating it as a feature.
import pandas as pd
df = pd.read_csv("data.csv", index_col=[0])
features = df.columns
print( features )
Index(['F1', 'F2', 'F3'])
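To make the difference concrete, here is a small sketch that reads the same data both ways. It assumes the file is comma-separated and that its header line starts with an empty field (which is what typically produces `Unnamed: 0`); the `sample` string is a stand-in for data.csv.

```python
import io

import pandas as pd

# Assumed file content: the first header field is empty.
sample = ",F1,F2,F3\nA1,2,4,2\nA2,4,1,2\n"

# Default read: the first column becomes a regular, auto-named column.
df_default = pd.read_csv(io.StringIO(sample))
print(df_default.columns.tolist())

# With index_col=0 the first column becomes the row index instead.
df_indexed = pd.read_csv(io.StringIO(sample), index_col=0)
print(df_indexed.columns.tolist())
print(df_indexed.index.tolist())
```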
I have a .csv file having multiple rows and columns:
chain Resid Res K B C Tr Kw Bw Cw Tw
A 1 ASP 1 0 0.000104504 NA 0 0 0.100087974 0.573972285
A 2 GLU 2 627 0.000111832 0 0.033974309 0.004533331 0.107822844 0.441666022
Whenever I open the file using pandas or using with open, it shows that there is only one column and multiple rows:
629 rows x 1 columns
Here is the code I'm using:
data = pd.read_csv("F:/file.csv", sep='\t')
print(data)
and the result I'm getting is this:
A,1,ASP,1,0,0.0001045041279130...
I want the output to be in a dataframe form so that I can carry out future calculations. Any help will be highly appreciated
The separator is a comma, so you can omit the sep parameter, because sep=',' is the default separator in read_csv:
data = pd.read_csv("F:/file.csv")
You can read the csv using the following code snippet:
import pandas as pd
data = pd.read_csv('F:/file.csv', sep=',')
Don't use '\t', because your file is comma-separated rather than tab-separated, so use the below:
data = pd.read_csv("F:/file.csv")
Or, if the columns are really separated by runs of two or more whitespace characters (for example, because the data values themselves contain single spaces), use:
data = pd.read_csv("F:/file.csv", sep=r'\s{2,}', engine='python')
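If you are not sure which separator a file uses, one option is to let the standard library guess from the first few lines before calling read_csv. This is only a sketch: the `sample` string below is a shortened, made-up version of the data in the question, and `csv.Sniffer` is a heuristic that can fail on irregular files.

```python
import csv

# First lines of a hypothetical file; values shortened from the question.
sample = "chain,Resid,Res,K\nA,1,ASP,1\nA,2,GLU,2\n"

# Restrict the guess to the two candidates we care about: comma and tab.
dialect = csv.Sniffer().sniff(sample, delimiters=",\t")

print(repr(dialect.delimiter))
```

The detected value can then be passed on, for example as `pd.read_csv(path, sep=dialect.delimiter)`.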
I am dealing with a csv file that contains three columns and three rows containing numeric data. The csv data file simply looks like the following:
Colum1,Colum2,Colum3
1,2,3
1,2,3
1,2,3
My question is how to write Python code that takes a single value from one of the columns and performs a specific operation on it. For example, let's say I want to take the first value in 'Colum1' and subtract it from the sum of all the values in the column.
Here is my attempt:
import csv
f = open('columns.csv')
rows = csv.DictReader(f)
value_of_single_row = 0.0
for i in rows:
    value_of_single_Row += float(i) # trying to isolate a single value here!
print value_of_single_row - sum(float(r['Colum1']) for r in rows)
f.close()
Based on the code you provided, I suggest you take a look at the doc to see the preferred approach on how to read through a csv file. Take a look here:
How to use CsvReader
With that being said, you can modify the beginning of your code slightly to this:
import csv
with open('data.csv', newline='') as f:
    rows = csv.DictReader(f)
    for row in rows:
        pass  # perform operation per row
From there you now have access to each row.
This should give you what you need to do proper row-by-row operations.
What I suggest you do is play around with printing out your rows to see what your data looks like. You will see that each row being outputted is a dictionary.
So if you were going through each row, you can just simply do something like this:
for row in rows:
    row['Colum1']  # or row.get('Colum1')
    # to do some math to add everything in Colum1
    s += float(row['Colum1'])
So all of that will look like this:
import csv
s = 0
with open('data.csv', newline='') as f:
    rows = csv.DictReader(f)
    for row in rows:
        s += float(row['Colum1'])
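Putting the pieces together for the exact operation asked about (the sum of 'Colum1' minus its first value), a minimal sketch could look like this. It reads the question's data from an in-memory string (`sample`, an assumption in place of the real file) and turns the reader into a list first, because a DictReader can only be iterated once.

```python
import csv
import io

# Stand-in for the csv file from the question.
sample = "Colum1,Colum2,Colum3\n1,2,3\n1,2,3\n1,2,3\n"

# Materialise the rows so they can be indexed and reused.
rows = list(csv.DictReader(io.StringIO(sample)))
values = [float(row["Colum1"]) for row in rows]

first_value = values[0]        # the single value we want to isolate
total = sum(values)            # sum of the whole column
result = total - first_value   # subtract the first value from the sum

print(result)
```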
You can do pretty much all of this with pandas
import pandas as pd

Location = r'path/test.csv'
# header=0 marks the first line as the header, so the values are parsed as numbers
df = pd.read_csv(Location, header=0, names=['Colum1','Colum2','Colum3'])
print(df)
# add 5 to the first value in Colum1; .loc avoids chained assignment
df.loc[0, 'Colum1'] += 5
print(df)
You can write back to your csv using df.to_csv('File path', index=False, header=True). Having header=True (the default) writes the column names as the first line.
To do this more along the lines of what you have, you can do it like this:
Location = r'C:/Users/tnabrelsfo/Documents/Programs/Stack/test.csv'
data = []
with open(Location, 'r') as f:
    for line in f:
        # strip the newline and spaces, then split into fields
        data.append(line.replace('\n','').replace(' ','').split(','))
data = data[1:]  # drop the column names
print(data)
data[1][1] = 5
print(data)
It will read in each row, cut out the column names, and then you can modify the values by index.
So here is my simple solution using the pandas library. Suppose we have a sample.csv file:
import pandas as pd
df = pd.read_csv('sample.csv') # df is now a DataFrame
df['Colum1'] = df['Colum1'] - df['Colum1'].sum() # here we replace the column by subtracting the sum of the values in the column
print(df)
df.to_csv('sample.csv', index=False) # save dataframe back to csv file
You can also use the map function to apply an operation to one column, for example:
import pandas as pd
df = pd.read_csv('sample.csv')
col_sum = df['Colum1'].sum() # sum of the first column
df['Colum1'] = df['Colum1'].map(lambda x: x - col_sum)
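If what you need is a single value rather than a transformed column, positional access with `.iloc` is one way to pick it out. The sketch below uses the data from the question via an in-memory string (`sample` is an assumed stand-in for sample.csv) and computes the column sum minus the first value.

```python
import io

import pandas as pd

# Stand-in for sample.csv from the question.
sample = "Colum1,Colum2,Colum3\n1,2,3\n1,2,3\n1,2,3\n"
df = pd.read_csv(io.StringIO(sample))

first_value = df["Colum1"].iloc[0]           # first value in the column
result = df["Colum1"].sum() - first_value    # sum of the column minus that value

print(result)
```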
I have 2 text files. file1 has one column containing some names. file2 has 50 columns: the first column has names and the remaining ones are values. All of the names in file1 are in file2 (file2 is much bigger). I want to find the rows in file2 whose names appear in file1 and write those rows to a new file.
file2:
"ENSG00000000003.10" 17.83196398 69.91920499
"ENSG00000000419.8" 27.0839105 57.01053354
"ENSG00000000457.9" 15.09256081 14.86654192
"ENSG00000000460.12" 3.824827056 11.81359135
"ENSG00000000938.8" 21.29498307 26.8460545
"ENSG00000000971.11" 324.9552392 581.2884261
"ENSG00000001036.9" 51.89359951 77.12018624
"ENSG00000001084.6" 39.79887612 105.2936106
file1:
"ENSG00000000003.10"
"ENSG00000000419.8"
"ENSG00000000457.9"
output:
"ENSG00000000003.10" 17.83196398 69.91920499
"ENSG00000000419.8" 27.0839105 57.01053354
"ENSG00000000457.9" 15.09256081 14.86654192
Using inner_join() from dplyr
library(dplyr)
d3 <- inner_join(d1, d2, by="name")
You get:
> d3
name value1 value2
1 ENSG00000000003.10 17.83196 69.91920
2 ENSG00000000419.8 27.08391 57.01053
3 ENSG00000000457.9 15.09256 14.86654
Assuming the files are in csv format with headers.
import pandas as pd
df1 = pd.read_csv('first_file.csv')
df2 = pd.read_csv('second_file.csv')
df3 = df1.merge(df2, on=['Name'])
print(df3)
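If the files have no headers and are whitespace-delimited, as in the question, a plain-Python alternative is to load the names from file1 into a set and stream through file2 once. This is a sketch under those assumptions; the two `io.StringIO` objects stand in for the real files, and the names keep their surrounding quotes because both files quote them the same way.

```python
import io

# Stand-ins for the real files (assumed headerless and whitespace-delimited).
file1 = io.StringIO('"ENSG00000000003.10"\n"ENSG00000000457.9"\n')
file2 = io.StringIO(
    '"ENSG00000000003.10" 17.83196398 69.91920499\n'
    '"ENSG00000000419.8" 27.0839105 57.01053354\n'
    '"ENSG00000000457.9" 15.09256081 14.86654192\n'
)

# Set membership tests are fast, so file2 only needs a single pass.
wanted = {line.strip() for line in file1 if line.strip()}

matches = []
for line in file2:
    fields = line.split()
    if fields and fields[0] in wanted:
        matches.append(line.rstrip("\n"))

for line in matches:
    print(line)
```

With real files, the matching lines could be written to a new file instead of being collected in a list.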
Here's how you do this efficiently with R using the data.table package (you didn't provide column names so I assumed your first column in file2 is called V1)
library(data.table)
setkey(setDT(file2), V1)[file1]
# 1: ENSG00000000003.10 17.83196 69.91920
# 2: ENSG00000000419.8 27.08391 57.01053
# 3: ENSG00000000457.9 15.09256 14.86654
Memory-map the larger file and do regular expression searches using the names from the first file. I'm assuming that your names are unique, but you could use re.findall instead of re.search if they are not. This example works with Python 3.4, where a memory map behaves like a bytearray object.
import re
import mmap
output = []
with open('file2.txt', 'r') as f2, open('file1.txt', 'r') as f1:
    mm = mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ)
    for line in f1:
        # escape the name so that '.' is matched literally, then match the rest of the line
        pattern = re.escape(line.rstrip()).encode('utf-8') + b".*"
        nameMatch = re.search(pattern, mm)
        if nameMatch:
            output.append(nameMatch.group().decode('utf-8'))
    mm.close()
I have a text file:
sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442
I import it:
with open('file.txt', 'r') as fo:
    notes = next(fo)
    headers,*raw_data = [row.strip('\r\n').split('\t') for row in fo] # get column headers and data
names = [row[0] for row in raw_data] # extract first column (sample names)
data= np.array([row[1:] for row in raw_data],dtype=float) # drop the first column
If I then convert it:
s = pd.DataFrame(data,index=names,columns=headers[1:])
the data is recognized as floats. I could get the sample names back as a column with s = s.reset_index().
If I do
s = pd.DataFrame(raw_data,columns=headers)
the floats are stored as objects and I cannot perform standard calculations.
How would you build the data frame? Is it better to import the data as a dict?
BTW, I am using Python 3.3.
You can parse your data file directly into data frame as follows:
df = pd.read_csv('file.txt', sep='\t', index_col='sample')
Which will give you:
value1 value2
sample
A 0.12120 0.2354
B 0.23493 1.3442
[2 rows x 2 columns]
Then, you can do your computations.
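As a quick check that the values really come out as floats, here is a self-contained sketch of the same call. It assumes a tab-separated file and uses an in-memory string (`sample`) in place of file.txt; `row_sums` is just an illustrative computation.

```python
import io

import pandas as pd

# Stand-in for file.txt, assumed to be tab-separated.
sample = "sample\tvalue1\tvalue2\nA\t0.1212\t0.2354\nB\t0.23493\t1.3442\n"

df = pd.read_csv(io.StringIO(sample), sep="\t", index_col="sample")

print(df.dtypes)             # dtype of each value column
row_sums = df.sum(axis=1)    # example computation: value1 + value2 per sample
print(row_sums)
```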
To parse such a file, one should use the pandas read_csv function.
Below is a minimal example showing the use of read_csv with a whitespace separator (sep=r'\s+'). Older answers use delim_whitespace=True for the same effect, but recent pandas releases deprecate that parameter:
import pandas as pd
from io import StringIO  # Python 3
data = \
"""sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442"""
# Creation of the dataframe
df = pd.read_csv(StringIO(data), sep=r'\s+')