reading csv file with tensorflow - python

In the examples, the variable names for the columns of the data set are given manually. But my data set already has the names as headers, and I want to use them. How can I get the header names of a .csv file with TensorFlow in Python?
import tensorflow as tf

filename_queue = tf.train.string_input_producer(
    ['final_data1.csv'], num_epochs=1)

# to read the csv file
reader = tf.TextLineReader(skip_header_lines=1)
_, csv_row = reader.read(filename_queue)  # read_up_to() would also need a num_records argument
print(type(csv_row))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())  # num_epochs creates a local variable
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    print(sess.run(reader.num_records_produced()))

I can't understand exactly what you are trying to do -- you need to give more details.
If you want to work with a CSV file, I think you'd better use the DataFrame data structure from the pandas library. There is a method read_csv where you can choose which row to consider as the header.
Try this:
from pandas import read_csv
df = read_csv('YOUR_DATAFILE.csv', header=0)
print(df)
With the 'header' parameter you can select which row of your dataset contains the headers; in this case it's the first row.
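To answer the literal question of getting the header names: since the first row of the file holds them, you can read just that row with pandas and reuse the names elsewhere. A minimal sketch, using the final_data1.csv file from the question:
import pandas as pd

# nrows=0 parses only the header row, so no data rows are loaded.
header_df = pd.read_csv('final_data1.csv', nrows=0)
column_names = list(header_df.columns)
print(column_names)
These names can then be passed along to the TensorFlow side, e.g. as keys for the values produced by tf.decode_csv.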

Related

Removal of rows containing a particular text in a csv file using Python

I have a genomic dataset consisting of more than 3500 rows. I need to remove rows based on the values in two columns ("Length" and "Protein Name"). How do I specify the condition for this purpose?
import csv  # import the csv module

# open the csv file
file = open('C:\\Users\\Admin\\Downloads\\csv.csv', 'r')

# read the csv file
csvreader = csv.reader(file)
header = next(csvreader)  # the first row holds the column headers
print(header)

# extract the remaining rows from the csv file
rows = []
for row in csvreader:
    rows.append(row)
print(rows)
file.close()
I am a beginner in Python bioinformatic data analysis and I haven't tried any extensive methods. I have done the work of opening and reading the csv file and have also extracted the column headers, but I don't know how to proceed from here. Please help.
Try this (note that it operates on a pandas DataFrame, not on the csv.reader from the question):
df = df[df["columnName"].str.contains("string to delete") == False]
It will be better to read the CSV with pandas since you have lots of rows; that will be the smart decision to make. Also set the conditional variables you will use to perform the operation. If this does not help, I suggest you provide a sample of your CSV file.
import pandas as pd

df = pd.read_csv('C:\\Users\\Admin\\Downloads\\csv.csv')
length = 10
protein_name = "replace with protein name"
df = df[(df["Length"] > length) & (df["Protein Name"] != protein_name)]
print(df)
You can save the df back to a CSV file if you want:
df.to_csv('C:\\Users\\Admin\\Downloads\\new_csv.csv', index=False)

Append/Copy a dataframe with multiple columns headers to existing excel file

I'm trying to copy/append a DataFrame with multiple column headers (similar to the one below) to an existing Excel sheet, starting from a particular cell, AA2.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'sub1': [np.nan, 'E', np.nan, 'S'],
                    'sub2': [np.nan, 'D', np.nan, 'A']})
df2 = pd.DataFrame({'sub1': [np.nan, 'D', np.nan, 'S'],
                    'sub2': [np.nan, 'C', np.nan, 'S']})
df = pd.concat({'Af': df1, 'Dp': df2}, axis=1)
df
I'm thinking of exporting this DataFrame to Excel starting at that particular cell, and then using openpyxl to copy the data from one workbook to the other, column by column... but I'm not sure that is the correct approach. Any ideas?
(The Excel sheet I'm working with has formatting, so I can't just read it into a DataFrame and use merge.)
I've had success manipulating Excel files in the past with xlsxwriter (you will need to pip install it as a dependency first, although it does not need to be explicitly imported).
import io
import pandas as pd

# Load your file here instead
file_bytes = io.BytesIO()

with pd.ExcelWriter(file_bytes, engine='xlsxwriter') as writer:
    # Write a DataFrame to Excel into specific cells
    pd.DataFrame().to_excel(
        writer,
        sheet_name='test_sheet',
        startrow=10, startcol=5,
        index=False
    )
    # Note: You can repeat any of these operations within the context manager
    # and keep adding stuff...

    # Add some text to cells as well:
    writer.sheets['test_sheet'].write('A1', 'Your text goes here')

file_bytes.seek(0)
# Then write your bytes to a file...
# Overwriting it in your case?
Bonus:
You can add plots too: just write them to a BytesIO object, call <your_image_bytes>.seek(0), and then use it in the insert_image() function.
...  # still inside the ExcelWriter context manager
# (assumes: import matplotlib.pyplot as plt)
plot_bytes = io.BytesIO()
# Create your plot in matplotlib here
plt.savefig(plot_bytes, format='png')  # Instead of plt.show()
plot_bytes.seek(0)
writer.sheets['test_sheet'].insert_image(
    5,  # Row start
    5,  # Col start
    'some_image_name.png',
    options={'image_data': plot_bytes}
)
The full documentation is really helpful too:
https://xlsxwriter.readthedocs.io/working_with_pandas.html
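Since the original question is about writing into an existing, formatted workbook at cell AA2 specifically, pandas can also append through the openpyxl engine instead. A sketch, assuming pandas >= 1.4 (for if_sheet_exists='overlay'); the workbook path and sheet name are placeholders, and df is the MultiIndex DataFrame from the question:
import pandas as pd

# mode='a' opens the existing workbook instead of creating a new one;
# if_sheet_exists='overlay' writes into the existing sheet in place.
with pd.ExcelWriter('existing_workbook.xlsx', engine='openpyxl',
                    mode='a', if_sheet_exists='overlay') as writer:
    # Cell AA2 in 0-indexed terms: column AA is index 26, row 2 is index 1.
    df.to_excel(writer, sheet_name='Sheet1', startrow=1, startcol=26)
Note that pandas writes the row index alongside MultiIndex column headers (index=False has historically not been supported for MultiIndex columns), so the data itself lands one column to the right of startcol.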

how to read data (using pandas?) so that it is correctly formatted?

I have a txt file with the following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So in the text file everything is saved on one line; a CSV file is not available.
I would like to have it as a DataFrame in pandas. When I use read_csv:
df = pd.read_csv('time-series-data.txt', sep = ",")
the output of print(df) is something like [0 rows x 3455.. columns]
So currently everything is read in as a single row. However, I would like to have 22 columns (time, ActivePower0, CosPhi0, ...). Any tips are appreciated, thank you very much.
Is a pandas DataFrame even suitable for this? The text files are up to 2 GB in size.
Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the Python code used to read it in:
import json
import pandas as pd

# Read the test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f:
    data = json.load(f)

# Get the first element of the results list, and the first element of the series list.
# You may need a loop here if your real data has more than one of these.
subset = data['results'][0]['series'][0]
values = subset['values']
columns = subset['columns']
df = pd.DataFrame(values, columns=columns)
print(df)
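Since the real files are up to 2 GB, a streaming parse is worth sketching too. This uses the ijson package mentioned in the comment above; the prefix string is an assumption that matches the nesting of the sample file:
import ijson
import pandas as pd

rows = []
with open("test.json", "rb") as f:
    # Stream one "values" row at a time instead of loading the whole file;
    # the prefix walks results -> series -> values.
    for row in ijson.items(f, "results.item.series.item.values.item"):
        rows.append(row)

# The column names are small, so they can be read separately or hard-coded.
df = pd.DataFrame(rows)
print(df.head())
Collecting the rows into a list still needs memory for the final DataFrame; for true streaming you would write each row out in chunks as it arrives.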

How to save a Tensorflow dataset to csv?

I find a lot of documents/forums telling how to convert a csv to a TensorFlow dataset, but not a single one saying how to convert a dataset to a csv. I have a csv with two columns for now (filename, weight; more columns may be added later). I read that into TensorFlow and create a dataset. At the end of the script the 2nd column is modified, and I need to save these columns to a csv. I need them in csv (not a checkpoint) because I may need to do stuff with them in Matlab.
I tried to call the dataset's map function and save to csv inside the map function, but it doesn't work as expected.
import csv

import tensorflow as tf

# reading the csv into a dataset
def map_func1(line):
    FIELD_DEFAULTS = [[""], [0.0]]
    sample, weight = tf.decode_csv(line, FIELD_DEFAULTS)
    return sample, weight

ds = tf.data.TextLineDataset('sample_weights.csv')
ds_1 = ds.map(map_func1)
# the dataset is then modified into ds_2 (code not included - it's just another map func)

# trying to save to csv
def map_func3(writer, x):
    x0, x1 = x
    writer.writerow([x0, x1])
    return x

with open('sample_weights_mod.csv', 'w') as file:
    writer = csv.writer(file)
    ds_3 = ds_2.map(lambda *x: map_func3(writer, x))
This doesn't work as expected; it just writes the symbolic tensors to the csv: Tensor("arg0:0", shape=(), dtype=string) Tensor("arg1:0", shape=(), dtype=float32)
This solution is probably a bad one. I really need a neat way to do this.
Though not a good way of doing it, for now I did it as below:
import pandas as pd

type(movies)  # the movies variable is of type tensorflow.python.data.ops.dataset_ops.MapDataset
z = []
for example in movies:
    z.append(example.numpy().decode("utf-8"))
mv = {'movie_title': z}
pd.DataFrame(mv).to_csv('movie.csv')
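For the two-column (filename, weight) dataset in the question, a plain eager loop may be neater than writing inside map (map only builds a graph transformation, which is why symbolic tensors ended up in the file). A sketch, assuming TF 2.x eager execution and that ds_2 yields (string, float) pairs:
import csv

# Iterate the dataset eagerly and write plain Python values, not symbolic tensors.
with open('sample_weights_mod.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['filename', 'weight'])  # hypothetical header row
    for sample, weight in ds_2.as_numpy_iterator():
        # sample arrives as bytes, weight as a numpy float
        writer.writerow([sample.decode('utf-8'), float(weight)])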

How do I write scikit-learn dataset to csv file

I can load a data set from scikit-learn using
from sklearn import datasets
data = datasets.load_boston()
print(data)
What I'd like to do is write this data set to a flat file (.csv)
Using the open() function,
f = open('boston.txt', 'w')
f.write(str(data))
works, but includes the description of the data set.
I'm wondering if there is some way that I can generate a simple .csv with headers from this Bunch object so I can move it around and use it elsewhere.
data = datasets.load_boston() returns a dictionary-like object. In order to write the data to a .csv file you need the actual data, data['data'], and the columns, data['feature_names']. You can use these to generate a pandas DataFrame and then use to_csv() to write the data to a file:
from sklearn import datasets
import pandas as pd

data = datasets.load_boston()
print(data)
df = pd.DataFrame(data=data['data'], columns=data['feature_names'])
df.to_csv('boston.txt', sep=',', index=False)
and the output boston.txt should be:
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
...
There are various toy datasets in scikit-learn, such as the Iris and Boston datasets. Let's load the Boston dataset:
from sklearn import datasets
boston = datasets.load_boston()
What type of object is this? If we examine its type, we see that this is a scikit-learn Bunch object.
print(type(boston))
Output:
<class 'sklearn.utils.Bunch'>
A scikit-learn Bunch object is a kind of dictionary, so we should treat it as such and use dictionary methods. Let's look at the keys:
print(boston.keys())
Output:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
Here we are interested in data, feature_names and target keys. We will import pandas module and use these keys to create a pandas DataFrame.
import pandas as pd
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
We should also add the target variable to the DataFrame. The target variable is what we try to predict. Its name is written in the "DESCR"; we can print(boston["DESCR"]) and read the full description of the dataset.
In the description we see that the name of the target variable is MEDV. Now, we can add the target variable to the DataFrame:
df['MEDV'] = boston['target']
There is only one step left. We are exporting the DataFrame to a csv file without index numbers:
df.to_csv("scikit_learn_boston_dataset.csv", index=False)
BONUS: The Iris dataset has additional parameters that we can utilize (see the scikit-learn documentation). The following code automatically creates the DataFrame with the target variable included:
iris = datasets.load_iris(as_frame=True)
df = iris["frame"]
Note: If we print(iris.keys()), we can see the 'frame' key:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
BONUS2: If we print(boston["filename"]) or print(iris["filename"]), we can see the physical locations of the csv files of these datasets. For instance:
C:\Users\user\anaconda3\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv
Just wanted to modify the reply above by adding that you should probably include the target variable, "MEDV", as well. An additional line is added below:
from sklearn import datasets
import pandas as pd

data = datasets.load_boston()
print(data)
df = pd.DataFrame(data=data['data'], columns=data['feature_names'])
df['MEDV'] = data['target']  # the added line: include the target column
df.to_csv('boston.txt', sep=',', index=False)
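One caveat worth adding: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the snippets above only run on older versions. A sketch of an alternative for newer versions, assuming the OpenML copy of the dataset (named "boston") is reachable:
from sklearn.datasets import fetch_openml

# as_frame=True returns pandas objects, so no manual DataFrame assembly is needed.
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston["frame"]  # features plus the MEDV target in a single DataFrame
df.to_csv("boston.csv", index=False)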
