Numpy: load multiple CSV files as dictionaries - python

I wanted to use the numpy loadtxt method to read .csv files for my experiment. I have three time-series datasets with different characteristics in the following format, where the first column is the timestamp and the second column is the value.
0.086206438,10
0.086425551,12
0.089227066,20
0.089262508,24
0.089744425,30
0.090036815,40
0.090054172,28
0.090377569,28
0.090514071,28
0.090762872,28
0.090912691,27
For reproducibility, I have shared the three time-series datasets I am using here.
If I do it like the following
import numpy as np
fname="data1.csv"
col_time,col_window = np.loadtxt(fname,delimiter=',').T
It works fine as intended. However, instead of reading only a single file, I want to pass a dictionary to col_time, col_window = np.loadtxt(types, delimiter=',').T, with the dictionary defined as follows:
protocols = {}
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
so that I can read multiple csv files and plot all the results at once using one for loop, as in the following.
for protname, fname in types.items():
    col_time, col_window = protocols[protname]["col_time"], protocols[protname]["col_window"]
    rt = np.exp(np.diff(np.log(col_window)))
    plt.plot(quotient_times, quotient, ".", markersize=4, label=protname)
    plt.title(protname)
    plt.xlabel("t")
    plt.ylabel("values")
    plt.legend()
    plt.show()
But it is giving me an error ValueError: could not convert string to float: b'data1'. How can I load multiple csv files as a dictionary?

Assuming that you want to build a protocols dict that will be usable in your code, you can easily build it with a simple loop:
types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}
protocols = {}
for name, file in types.items():
    col_time, col_window = np.loadtxt(file, delimiter=',').T
    protocols[name] = {'col_time': col_time, 'col_window': col_window}
You can then successfully plot the 3 graphs:
for protname, fname in types.items():
    col_time, col_window = protocols[protname]["col_time"], protocols[protname]["col_window"]
    rt = np.exp(np.diff(np.log(col_window)))
    plt.plot(col_time, col_window, ".", markersize=4, label=protname)
    plt.title(protname)
    plt.xlabel("t")
    plt.ylabel("values")
    plt.legend()
    plt.show()
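If you prefer, the same protocols dictionary can also be built in a single expression. A minimal sketch, assuming the same types mapping and the data1.csv/data2.csv/data3.csv files from the question:
import numpy as np

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}

# Each value unpacks the two transposed columns into named entries,
# mirroring the loop above but written as a dict comprehension.
protocols = {
    name: dict(zip(("col_time", "col_window"), np.loadtxt(file, delimiter=',').T))
    for name, file in types.items()
}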

Neither pandas nor numpy loads multiple CSV files in a single call. You can load the files one by one and combine them with pandas' concat function. The example below demonstrates this using pandas; replace StringIO with a file object when reading from disk.
data="""
0.086206438,10
0.086425551,12
0.089227066,20
0.089262508,24
0.089744425,30
0.090036815,40
0.090054172,28
0.090377569,28
0.090514071,28
0.090762872,28
0.090912691,27
"""
data2="""
0.086206438,29
0.086425551,32
0.089227066,50
0.089262508,54
"""
data3="""
0.086206438,69
0.086425551,72
0.089227066,70
0.089262508,74
"""
import pandas as pd
from io import StringIO
files={"data1":data,"data2":data2,"data3":data3}
# Load the first file into data frame
key=list(files.keys())[0]
df=pd.read_csv(StringIO(files.get(key)),header=None,usecols=[0,1],names=['data1','data2'])
print(df.head())
# remove file from dictionary
files.pop(key,None)
print("final values")
# Efficient :Concat this dataframe with remaining files
df = pd.concat(
    [pd.read_csv(StringIO(files[i]), header=None, usecols=[0, 1], names=['data1', 'data2'])
     for i in files.keys()],
    ignore_index=True)
print(df.tail())
For more insight: pandas append vs concat
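When reading from files on disk rather than StringIO, the same concat pattern might look like the following sketch (the data1.csv/data2.csv/data3.csv file names and the column names are carried over from the question and are assumptions here):
import pandas as pd

files = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}

# Read each file and concatenate; passing a dict to pd.concat keeps track of
# which rows came from which file via an extra index level.
df = pd.concat(
    {name: pd.read_csv(path, header=None, names=['col_time', 'col_window'])
     for name, path in files.items()},
    names=['protocol', 'row'])
print(df.loc['data1'].head())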

Related

Prevent pandas from changing int to float/date?

I'm trying to merge a series of xlsx files into one, which works fine.
However, when I read a file, columns containing ints are transformed into floats (or dates?) when I merge and output them to csv. I have tried to visualize this in the picture. I have seen some solutions to this where dtype is used to "force" specific columns into int format. However, I do not always know the index nor the title of the column, so I need a more scalable solution.
Anyone with some thoughts on this?
Thank you in advance
#specify folder with xlsx-files
xlsFolder = "{}/system".format(directory)
dfMaster = pd.DataFrame()

#make a list of all xlsx-files in folder
xlsFolderContent = os.listdir(xlsFolder)
xlsFolderList = []
for file in xlsFolderContent:
    if file[-5:] == ".xlsx":
        xlsFolderList.append(file)

for xlsx in xlsFolderList:
    print(xlsx)
    xl = pd.ExcelFile("{}/{}".format(xlsFolder, xlsx))
    for sheet in xl.sheet_names:
        if "_Errors" in sheet:
            print(sheet)
            dfSheet = xl.parse(sheet)
            dfSheet.fillna(0, inplace=True)
            dfMaster = dfMaster.append(dfSheet)
            print("len of dfMaster:", len(dfMaster))

dfMaster.to_csv("{}/dfMaster.csv".format(xlsFolder), sep=";")
Data sample: (screenshot not included)
Try using dtype='object' as a parameter of pd.read_csv (or ExcelFile.parse) to prevent pandas from inferring the data type of each column. You can also simplify your code using pathlib:
import pandas as pd
import pathlib

directory = pathlib.Path('your_path_directory')
xlsFolder = directory / 'system'

data = []
for xlsFile in xlsFolder.glob('*.xlsx'):
    sheets = pd.read_excel(xlsFile, sheet_name=None, dtype='object')
    for sheetname, df in sheets.items():
        if '_Errors' in sheetname:
            data.append(df.fillna('0'))

pd.concat(data).to_csv(xlsFolder / 'dfMaster.csv', sep=';')

Problem either with number of characters exceeding cell limit, or storing lists of variable length

The problem:
I have lists of genes expressed in 53 different tissues. Originally, this data was stored in a maximal array of the genes, with 'NaN' where there was no expression. I am trying to create new lists for each tissue that just have the genes expressed, as it was very inefficient to be searching through this array every time I was running my script. I have code that finds the genes for each tissue as required, but I do not know how to store the output.
I was using a pandas data frame and then converting to csv. But this does not accept lists of varying length, unless I put the list in as a single item. However, when I save the data frame to a csv, it tries to squeeze this very long list (all genes expressed for one tissue) into a single cell. I get an error of the string length exceeding the Excel character-per-cell limit.
Therefore I need a way of either dealing with this limit, or storing my lists in a different way. I would rather just have one file for all the lists.
My code:
import csv
import pandas as pd
import math
import numpy as np
#Import list of tissues:
df = pd.read_csv(r'E-MTAB-5214-query-results.tsv', skiprows = [0,1,2,3], sep='\t')
tissuedict=df.to_dict()
tissuelist = list(tissuedict.keys())[2:]
all_genes = [gene for key,gene in tissuedict['Gene Name'].items()]
data = []
for tissue in tissuelist:
    #Create array to keep track of the protein mRnaS in tissue that are not present in the network
    #initiate with first tissue, protein
    nanInd = [key for key,value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    datatis = [tissue, tissueExpression.tolist()]
    print(datatis)
    data.append(datatis)
print(data)

df = pd.DataFrame(data)
df.to_csv(r'tissue_expression_data.csv')
Link to data (either one):
https://github.com/joanna-lada/gene_data/blob/master/E-MTAB-5214-query-results.tsv
https://raw.githubusercontent.com/joanna-lada/gene_data/master/E-MTAB-5214-query-results.tsv
IIUC you need lists of the gene names found in each tissue. This writes these lists as columns into a csv:
import pandas as pd

df = pd.read_csv('E-MTAB-5214-query-results.tsv', skiprows=[0,1,2,3], sep='\t')
df = df.drop(columns='Gene ID').set_index('Gene Name')

res = pd.DataFrame()
for c in df.columns:
    res = pd.concat([res, pd.Series(df[c].dropna().index, name=c)], axis=1)

res.to_csv('E-MTAB-5214-query-results.csv', index=False)
(Writing them as rows would have been easier, but Excel can't import so many columns)
Don't open the csv in Excel directly, but use a blank worksheet and import the csv (Data - External data, From text), otherwise you can't separate them into Excel columns in one run (at least in Excel 2010).
Create your data variable as a dictionary.
You can then save the dictionary to a JSON file using json.dump (refer here):
import json

data = {}
for tissue in tissuelist:
    nanInd = [key for key,value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    data[tissue] = tissueExpression.tolist()

with open('filename.json', 'w') as fp:
    json.dump(data, fp)
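To load the lists back later, a short sketch (assuming the filename.json written above):
import json

# Read the per-tissue gene lists back into a dictionary of lists.
with open('filename.json') as fp:
    data = json.load(fp)

print(list(data.keys())[:5])  # first few tissue names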

Read and save multiple csv files from a for-loop

I am trying to read multiple csv files from a list of file paths and save them all as separate pandas dataframes.
I feel like there should be a way to do this, however I cannot find a succinct explanation.
import pandas as pd
data_list = [['df_1', 'filepath1.csv'],
             ['df_2', 'filepath2.csv'],
             ['df_3', 'filepath3.csv']]

for name, filepath in data_list:
    name = pd.read_csv(filepath)
I have also tried:
data_list = [[df_1, 'filepath1.csv'],
             [df_2, 'filepath2.csv'],
             [df_3, 'filepath3.csv']]

for name, filepath in data_list:
    name = pd.read_csv(filepath)
I would like to be able to call each dataframe by its assigned name.
Ex):
df_1.head()
df_dct = {name:pd.read_csv(filepath) for name, filepath in data_list}
would create a dictionary of DataFrames. This may help you organize your data.
You may also want to look into glob.glob to create your list of files. For example, to get all CSV files in a directory:
file_paths = glob.glob(my_file_dir+"/*.csv")
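For instance, a sketch combining glob.glob with the dictionary comprehension above (keying each DataFrame by its file name without the extension is an assumption):
import glob
import os
import pandas as pd

my_file_dir = "data"  # hypothetical directory
df_dct = {
    os.path.splitext(os.path.basename(p))[0]: pd.read_csv(p)
    for p in glob.glob(my_file_dir + "/*.csv")
}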
I recommend numpy. Read the csv files with numpy:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
You will get ndarrays. After that you can include them in pandas.
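For example, a minimal sketch of wrapping the resulting array in a DataFrame (the column names are assumptions):
import pandas as pd
from numpy import genfromtxt

# genfromtxt returns a 2-D ndarray; name the columns when building the DataFrame.
my_data = genfromtxt('my_file.csv', delimiter=',')
df = pd.DataFrame(my_data, columns=['col_time', 'col_window'])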
You can make use of a dictionary for this...
import pandas as pd

data_list = ['filepath1.csv', 'filepath2.csv', 'filepath3.csv']

d = {}
for i, path in enumerate(data_list):
    file_name = "df" + str(i)
    d[file_name] = pd.read_csv(path)
Here d is the dictionary which contains all your dataframes.
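For example, assuming the df0/df1/df2 naming scheme above, an individual DataFrame can then be accessed by its key:
print(d["df0"].head())  # first file
print(d["df2"].head())  # third file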

How do I write scikit-learn dataset to csv file

I can load a data set from scikit-learn using
from sklearn import datasets
data = datasets.load_boston()
print(data)
What I'd like to do is write this data set to a flat file (.csv)
Using the open() function,
f = open('boston.txt', 'w')
f.write(str(data))
works, but includes the description of the data set.
I'm wondering if there is some way that I can generate a simple .csv with headers from this Bunch object so I can move it around and use it elsewhere.
data = datasets.load_boston() returns a dictionary-like object. In order to write the data to a .csv file, you need the actual data (data['data']) and the column names (data['feature_names']). You can use these to generate a pandas DataFrame and then use to_csv() to write the data to a file:
from sklearn import datasets
import pandas as pd
data = datasets.load_boston()
print(data)
df = pd.DataFrame(data=data['data'], columns = data['feature_names'])
df.to_csv('boston.txt', sep = ',', index = False)
and the output boston.txt should be:
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
...
There are various toy datasets in scikit-learn, such as the Iris and Boston datasets. Let's load the Boston dataset:
from sklearn import datasets
boston = datasets.load_boston()
What type of object is this? If we examine its type, we see that this is a scikit-learn Bunch object.
print(type(boston))
Output:
<class 'sklearn.utils.Bunch'>
A scikit-learn Bunch object is a kind of dictionary. So, we should treat it as such. We can use dictionary methods. Let's look at the keys:
print(boston.keys())
output:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
Here we are interested in data, feature_names and target keys. We will import pandas module and use these keys to create a pandas DataFrame.
import pandas as pd
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
We should also add the target variable (what we are trying to predict) to the DataFrame. Its name is given in the dataset description: we can print(boston["DESCR"]) and read the full description of the dataset.
In the description we see that the name of the target variable is MEDV. Now we can add the target variable to the DataFrame:
df['MEDV'] = boston['target']
There is only one step left. We are exporting the DataFrame to a csv file without index numbers:
df.to_csv("scikit_learn_boston_dataset.csv", index=False)
BONUS: The Iris dataset has additional parameters that we can utilize (see here). The following code automatically creates the DataFrame with the target variable included:
iris = datasets.load_iris(as_frame=True)
df = iris["frame"]
Note: If we print(iris.keys()), we can see the 'frame' key:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
BONUS2: If we print(boston["filename"]) or print(iris["filename"]), we can see the physical locations of the csv files of these datasets. For instance:
C:\Users\user\anaconda3\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv
Just wanted to modify the reply by adding that you should probably include the target variable--"MV"--as well. Added an additional line below:
from sklearn import datasets
import pandas as pd
data = datasets.load_boston()
print(data)
df = pd.DataFrame(data=data['data'], columns = data['feature_names'])
df['MV'] = data['target']  # added line
df.to_csv('boston.txt', sep = ',', index = False)

Why is the cdc_list getting updated after calling the function read_csv() in total_list?

# Program to combine data from 2 csv files
The cdc_list gets updated after second call of read_csv
overall_list = []

def read_csv(filename):
    file_read = open(filename,"r").read()
    file_split = file_read.split("\n")
    string_list = file_split[1:len(file_split)]
    #final_list = []
    for item in string_list:
        int_fields = []
        string_fields = item.split(",")
        string_fields = [int(x) for x in string_fields]
        int_fields.append(string_fields)
        #final_list.append()
        overall_list.append(int_fields)
    return(overall_list)

cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
print(len(cdc_list))  #3652

total_list = read_csv("US_births_2000-2014_SSA.csv")
print(len(total_list))  #9131
print(len(cdc_list))  #9131
I don't think the code you pasted explains the issue you've had; at least, it's not anywhere I can determine. It seems like there's a lot of code you did not include in what you pasted above, which might be responsible.
However, if all you want to do is merge two csvs (assuming they both have the same columns), you can use Pandas' read_csv and Pandas DataFrame methods append and to_csv, to achieve this with 3 lines of code (not including imports):
import pandas as pd
# Read CSV file into a Pandas DataFrame object
df = pd.read_csv("first.csv")
# Read and append the 2nd CSV file to the same DataFrame object
df = df.append( pd.read_csv("second.csv") )
# Write merged DataFrame object (with both CSV's data) to file
df.to_csv("merged.csv")
