I have a tsv file containing an array which has been read using read_csv().
The dtype of the array is shown as dtype: object. How do I read it and access it as an array?
For example:
df=
id values
1 [0,1,0,3,5]
2 [0,0,2,3,4]
3 [1,1,0,2,3]
4 [2,4,0,3,5]
5 [3,5,0,3,5]
Currently I am unpacking it as below:
for index,row in df.iterrows():
    string = row['col2']
    string=string.replace('[',"")
    string=string.replace(']',"")
    v1,v2,v3,v4,v5=string.split(",")
    v1=int(v1)
    v2=int(v2)
    v3=int(v3)
    v4=int(v4)
    v5=int(v5)
Is there any alternative to this?
I want to do this because I want to create another column in the dataframe taking the average of all the values.
Adding additional details:
My tsv file looks as below:
id values
1 [0,1,0,3,5]
2 [0,0,2,3,4]
3 [1,1,0,2,3]
4 [2,4,0,3,5]
5 [3,5,0,3,5]
I am reading the tsv file as follows:
df=pd.read_csv('tsv_file_name.tsv',sep='\t', header=0)
You can use json to simplify your parsing:
import json
df['col2'] = df.col2.apply(lambda t: json.loads(t))
edit: following your comment, getting the average is easy:
# using numpy
import numpy as np
df['col2_mean'] = df.col2.apply(lambda t: np.array(t).mean())
# by hand
df['col2_mean'] = df.col2.apply(lambda t: sum(t)/len(t))
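Putting the pieces together, here is a minimal sketch. It assumes the list column is named col2, as in the answer above, and that every cell is a JSON-style list. The inline text read through StringIO is only a stand-in for the real tsv file. The converters argument of read_csv parses the column while the file is being read, so no separate apply step is needed for the parsing.

```python
import json
from io import StringIO

import pandas as pd

# Inline stand-in for the tsv file; the column names id/col2 are assumed.
tsv_text = "id\tcol2\n1\t[0,1,0,3,5]\n2\t[0,0,2,3,4]\n3\t[1,1,0,2,3]\n"

# converters applies json.loads to every cell of col2 while the file is read,
# so the column holds real Python lists instead of strings.
df = pd.read_csv(StringIO(tsv_text), sep="\t", converters={"col2": json.loads})

# Average of each list, stored in a new column.
df["col2_mean"] = df["col2"].apply(lambda values: sum(values) / len(values))
print(df)
```

For a real file, replace StringIO(tsv_text) with the file path.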
import csv
with open('myfile.tsv') as tsvfile:
    line = csv.reader(tsvfile, delimiter='...')
    ...
OR
import pandas as pd
df = pd.read_csv("myfile.tsv", sep="...")  # DataFrame.from_csv was removed from pandas; read_csv replaces it
In my code, I have data stored in numpy arrays. I want to save my data to a csv file using Pandas. Additionally, I want my data to have a special format in the csv file.
My data:
data1 = np.array([1,2,3],[4,5,6],[7,8,9])
data2 = np.array([10,11,12])
data3 = np.array([13,14,15])
What I want: a csv file that has three columns labeled 'data1', 'data2' and 'data3'.
'data1' should have 3 rows, with row 1 containing the values 1,2,3 (or 1 2 3), row 2 containing 4,5,6 (or 4 5 6), and so on.
The column labeled 'data2' should contain 10 in row 1, 11 in row 2 and 12 in row 3.
Similarly for 'data3'.
How can I achieve that?
Using the constraints described in the question, this is how I would approach this:
import numpy as np
import pandas as pd
data1 = np.array([[1,2,3],[4,5,6],[7,8,9]]) # list of lists in declaration
data2 = np.array([10,11,12])
data3 = np.array([13,14,15])
df = pd.DataFrame(zip(data1, data2, data3), columns=['data1', 'data2', 'data3'], dtype='str')
df['data1'] = df['data1'].str.replace('[', '', regex=False).str.replace(']', '', regex=False) # treat the brackets as literal characters
df.to_csv('./out.csv', index=False) # saving to file in cwd with name 'out.csv'
Resulting in out.csv containing:
data1,data2,data3
1 2 3,10,13
4 5 6,11,14
7 8 9,12,15
EDIT:
To answer the comment below regarding converting back from csv of the above format to the original arrays:
# let pandas infer the data types (data1: str, data2: int, data3: int. By default)
df = pd.read_csv('./out.csv')
# convert each entry in data1 to a numpy array using fromstring method
df['data1'] = df['data1'].apply(lambda x : np.fromstring(x, sep=' '))
# nuance to get series of arrays back to numpy.ndarray
data1 = np.array(df['data1'].to_list())
# simply use to_numpy method for integer columns
data2 = df['data2'].to_numpy()
data3 = df['data3'].to_numpy()
I am trying to load and convert a csv file to a list of lists in the following format:
Example: people.csv is in the format:
Name | Age | Sex
----------------
bob | 21 | M
Tina | 22 | F
Tim | 25 | M
I am trying to convert it to list of lists in this format:
[['Name=bob', 'Age=21', 'Sex=M'],['Name=Tina', 'Age=22', 'Sex=F'], ['Name=Tim','Age=25','Sex=M']]
where name, age and sex are the column headers in the csv file.
I have tried formatting one value at a time but is there a better way of performing this operation with or without a pandas dataframe.
Thank You
Maybe not the most elegant solution, but easy to understand.
import pandas as pd
df = pd.read_csv("file_name")
entries = []
for i in range(0,len(df)):
    tup = (df.loc[i,'Name'],df.loc[i,'Age'],df.loc[i,'Sex'])
    entries.append(tup)
Using csv module --> csv.DictReader.
Ex:
import csv
with open(filename) as infile:
    reader = csv.DictReader(infile)
    result = [["{}={}".format(k, v) for k, v in row.items()] for row in reader]
print(result)
I recommend using pandas for this job
import pandas as pd
df = pd.read_csv("file_name") #you can define if there is a header or not in the file
arr= df.values.tolist()
and now you have a 2D array of all the values in the file.
If you would like to add the labels "Name=", "Age=" and "Sex=", you can go through the list manually and prepend them using the following list comprehension:
parsedArr = [["Name=" + str(x[0]), "Age=" + str(x[1]), "Sex=" + str(x[2])] for x in arr]
Or you can let the DataFrame do this for you with df.apply, which avoids writing out each column by hand.
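The df.apply route mentioned above could look like the following minimal sketch. It assumes people.csv is comma-separated with the headers Name, Age and Sex; the inline text is only a stand-in for the file, and the names labelled and result are illustrative.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for people.csv, assumed to be comma-separated.
csv_text = "Name,Age,Sex\nbob,21,M\nTina,22,F\nTim,25,M\n"
df = pd.read_csv(StringIO(csv_text))

# apply passes one column at a time; col.name is that column's header,
# so every value gets prefixed with "<header>=".
labelled = df.apply(lambda col: col.name + "=" + col.astype(str))

# One inner list per row of the file.
result = labelled.to_numpy().tolist()
print(result)
```

Because the prefix is taken from col.name, the same line works for any number of columns without listing them.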
I have a .csv file having multiple rows and columns:
chain Resid Res K B C Tr Kw Bw Cw Tw
A 1 ASP 1 0 0.000104504 NA 0 0 0.100087974 0.573972285
A 2 GLU 2 627 0.000111832 0 0.033974309 0.004533331 0.107822844 0.441666022
Whenever I open the file using pandas or using with open, it shows that there is only one column and multiple rows:
629 rows x 1 columns
Here is the code I'm using:
data= pd.read_csv("F:/file.csv", sep='\t')
print(data)
and the result I'm getting is this:
A,1,ASP,1,0,0.0001045041279130...
I want the output to be in a dataframe form so that I can carry out future calculations. Any help will be highly appreciated
The separator is a comma, so you can omit the sep parameter, because sep=',' is the default separator in read_csv:
data= pd.read_csv("F:/file.csv")
You can read the csv using the following code snippet:
import pandas as pd
data = pd.read_csv('F:/file.csv', sep=',')
Don't use '\t', because the file contains no tab characters: the values are separated by commas, so use the below:
data = pd.read_csv("F:/file.csv")
Or, if the columns really are separated by runs of two or more spaces (for example because single values contain spaces), use:
data = pd.read_csv("F:/file.csv", sep=r'\s{2,}', engine='python')
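To see why the separator matters, here is a small sketch that reads the same comma-separated text twice. The sample text is a shortened, assumed version of the file in the question, and the names wrong and right are illustrative.

```python
from io import StringIO

import pandas as pd

# Shortened stand-in for the file in the question, assumed comma-separated.
csv_text = "chain,Resid,Res,K\nA,1,ASP,1\nA,2,GLU,2\n"

# With sep='\t' no separator is ever found, so each line becomes one field.
wrong = pd.read_csv(StringIO(csv_text), sep="\t")

# With the default sep=',' the line is split into its four columns.
right = pd.read_csv(StringIO(csv_text))

print(wrong.shape)
print(right.shape)
```

A frame with exactly one column is usually a sign that the sep argument does not match the file.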
I am dealing with a csv file that contains three columns and three rows containing numeric data. The csv data file simply looks like the following:
Colum1,Colum2,Colum3
1,2,3
1,2,3
1,2,3
My question is how to write Python code that takes a single value from one of the columns and performs a specific operation on it. For example, let's say I want to take the first value in 'Colum1' and subtract it from the sum of all the values in the column.
Here is my attempt:
import csv
f = open('columns.csv')
rows = csv.DictReader(f)
value_of_single_row = 0.0
for i in rows:
    value_of_single_Row += float(i) # trying to isolate a single value here!
print value_of_single_row - sum(float(r['Colum1']) for r in rows)
f.close()
Based on the code you provided, I suggest you take a look at the doc to see the preferred approach on how to read through a csv file. Take a look here:
How to use CsvReader
with that being said, you can modify the beginning of your code slightly to this:
import csv
with open('data.csv', newline='') as f:  # text mode with newline='' is what the csv module expects in Python 3
    rows = csv.DictReader(f)
    for row in rows:
        pass  # perform operation per row
From there you now have access to each row.
This should give you what you need to do proper row-by-row operations.
What I suggest you do is play around with printing out your rows to see what your data looks like. You will see that each row being outputted is a dictionary.
So if you were going through each row, you can just simply do something like this:
for row in rows:
    row['Colum1'] # or row.get('Colum1')
    # to do some math to add everything in Colum1
    s += float(row['Colum1'])
So all of that will look like this:
import csv
s = 0
with open('data.csv', newline='') as f:
    rows = csv.DictReader(f)
    for row in rows:
        s += float(row['Colum1'])
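For the operation actually asked about (the first value of Colum1 subtracted from the column's sum), a minimal sketch with csv.DictReader follows. The inline text stands in for the csv file, and the names column, first_value and result are illustrative.

```python
import csv
from io import StringIO

# Inline stand-in for the csv file from the question.
csv_text = "Colum1,Colum2,Colum3\n1,2,3\n1,2,3\n1,2,3\n"

with StringIO(csv_text) as f:
    # Read the whole column into a list so it can be used more than once;
    # a DictReader can only be iterated a single time.
    column = [float(row["Colum1"]) for row in csv.DictReader(f)]

first_value = column[0]
result = sum(column) - first_value
print(result)
```

Collecting the column first avoids the problem in the original attempt, where the reader was already exhausted by the time sum() ran over it.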
You can do pretty much all of this with pandas
import pandas as pd

Location = r'path/test.csv'
# header=0 marks the first line as the old header row, which names= replaces
df = pd.read_csv(Location, names=['Colum1','Colum2','Colum3'], header=0)
print(df)
df.loc[0, 'Colum1'] = int(df.loc[0, 'Colum1']) + 5  # add 5 to the first value of Colum1
print(df)
You can write back to your csv using df.to_csv('File path', index=False, header=True). Having header=True will add the headers back in.
To do this more along the lines of what you have you can do it like this
Location = r'C:/Users/tnabrelsfo/Documents/Programs/Stack/test.csv'
data = []
with open(Location, 'r') as f:
    for line in f:
        data.append(line.replace('\n','').replace(' ','').split(','))
data = data[1:]
print(data)
data[1][1] = 5
print(data)
It will read in each row, cut out the column names, and then you can modify the values by index.
So here is my simple solution using pandas library. Suppose we have sample.csv file
import pandas as pd
df = pd.read_csv('sample.csv') # df is now a DataFrame
df['Colum1'] = df['Colum1'] - df['Colum1'].sum() # here we replace the column by subtracting sum of value in the column
print(df)
df.to_csv('sample.csv', index=False) # save dataframe back to csv file
You can also use the map function to apply an operation to one column, for example:
import pandas as pd
df = pd.read_csv('sample.csv')
col_sum = df['Colum1'].sum() # sum of the first column
df['Colum1'] = df['Colum1'].map(lambda x: x - col_sum)
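If only the single number from the question is needed (the column sum minus its first value) rather than a modified column, a short pandas sketch is enough. The inline text stands in for sample.csv, and the names first_value and result are illustrative.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for sample.csv.
csv_text = "Colum1,Colum2,Colum3\n1,2,3\n1,2,3\n1,2,3\n"
df = pd.read_csv(StringIO(csv_text))

# iloc[0] picks the first value by position, whatever the index labels are.
first_value = df["Colum1"].iloc[0]
result = df["Colum1"].sum() - first_value
print(result)
```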
I have a text file:
sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442
I import it:
with open('file.txt', 'r') as fo:
    notes = next(fo)
    headers,*raw_data = [row.strip('\r\n').split('\t') for row in fo] # get column headers and data
names = [row[0] for row in raw_data] # extract first column (sample names)
data= np.array([row[1:] for row in raw_data],dtype=float) # drop the first column, keep the values
if i then convert it:
s = pd.DataFrame(data,index=names,columns=headers[1:])
the data is recognized as floats. I could get the sample names back as column by s=s.reset_index().
if i do
s = pd.DataFrame(raw_data,columns=headers)
the floats are stored as objects and I cannot perform standard calculations.
How would you build the data frame? Is it better to import the data as a dict?
BTW, I am using Python 3.3.
You can parse your data file directly into data frame as follows:
df = pd.read_csv('file.txt', sep='\t', index_col='sample')
Which will give you:
value1 value2
sample
A 0.12120 0.2354
B 0.23493 1.3442
[2 rows x 2 columns]
Then, you can do your computations.
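If the rows have already been split by hand, as in the question, the object columns can also be converted after the fact. This is a minimal sketch that assumes every value field is a plain numeric string; headers and raw_data mirror the variables in the question and are filled with the sample values here.

```python
import pandas as pd

# Rows as they come out of the manual split: every field is still a string.
headers = ["sample", "value1", "value2"]
raw_data = [["A", "0.1212", "0.2354"], ["B", "0.23493", "1.3442"]]

# Move the sample names into the index so only value columns remain.
s = pd.DataFrame(raw_data, columns=headers).set_index("sample")

# Convert the remaining columns from strings to float64.
s = s.astype(float)

print(s.dtypes)
print(s.mean())
```

astype(float) raises an error on a field that is not numeric, which makes malformed rows visible early.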
To parse such a file, one should use the pandas read_csv function.
Below is a minimal example showing the use of read_csv with sep set to r'\s+', which splits on any run of whitespace (older pandas versions offered delim_whitespace=True for the same purpose, which is now deprecated)
import pandas as pd
from io import StringIO # Python 3
data = \
"""sample value1 value2
A 0.1212 0.2354
B 0.23493 1.3442"""
# Creation of the dataframe
df = pd.read_csv(StringIO(data), sep=r'\s+')