Convert CSV File into Python Dictionary, Array and Binary File - python

I have a CSV file of tab-separated data with headers and data of different types which I would like to convert into a dictionary of vectors. Eventually I would like to convert the dictionary into numpy arrays, and store them in some binary format for fast retrieval by different scripts. This is a large file with approximately 700k records and 16 columns. The following is a sample:
"answer_option" "value" "fcast_date" "expertise"
"a" 0.8 "2013-07-08" 3
"b" 0.2 "2013-07-08" 3
I have started implementing this with the DictReader class, which I'm just learning about.
import csv

with open("filename.tab", 'r') as records:
    reader = csv.DictReader(records, dialect='excel-tab')
    row = list(reader)

n = len(row)
d = {}
keys = list(row[0])
for key in keys:
    a = []
    for i in range(n):
        a.append(row[i][key])
    d[key] = a
which gives the result
{'answer_option': ['a', 'b'],
'value': ['0.8', '0.2'],
'fcast_date': ['2013-07-08', '2013-07-08'],
'expertise': ['3', '3']}
Besides the small nuisance of having to strip the quotation characters that enclose the numerical values, I thought that perhaps there is something ready-made. I'm also wondering whether there is anything that extracts directly from the file into numpy vectors, since I do not necessarily need to transform my data into dictionaries.
I took a look at SciPy.org, and a search for CSV also points to HDF5 and genfromtxt, but I haven't dived into those suggestions yet. Ideally I would like to store the data in a fast-to-load format, so that it is simple to load from other scripts with a single command, with all vectors made available the same way it is possible in Matlab/Octave. Suggestions are appreciated.
EDIT: the data are tab-separated, with strings enclosed in quotation marks.

This will read the CSV into a pandas DataFrame and remove the quotes:
import pandas as pd
import csv
import io

with open('data_with_quotes.csv') as f_input:
    # Strip the quote characters from each line, then parse the line as CSV.
    data = [next(csv.reader(io.StringIO(line.replace('"', '')))) for line in f_input]

df = pd.DataFrame(data[1:], columns=data[0])
print(df)
  answer_option value  fcast_date expertise
0             a   0.8  2013-07-08         3
1             b   0.2  2013-07-08         3
You can easily convert the data to a numpy array using df.values:
array([['a', '0.8', '2013-07-08', '3'],
       ['b', '0.2', '2013-07-08', '3']], dtype=object)
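Since every column comes back as a string after stripping the quotes, here is a small sketch (reusing the column names from the sample above) for casting the numeric columns before converting:

# Cast the numeric columns so the resulting array is not all strings
# (column names taken from the sample data above).
df = df.astype({'value': float, 'expertise': int})
numeric = df[['value', 'expertise']].values  # plain float64 numpy array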
To save the data in a binary format, I recommend using HDF5:
import h5py

with h5py.File('file.hdf5', 'w') as f:
    dset = f.create_dataset('default', data=df)
To load the data, use the following:
with h5py.File('file.hdf5', 'r') as f:
    data = f['default'][()]  # read the dataset into memory while the file is still open
You can also use Pandas to save and load the data in binary format:
# Save the data
df.to_hdf('data.h5', key='df', mode='w')
# Load the data
df = pd.read_hdf('data.h5', 'df')
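The question also asks about extracting directly into numpy vectors; a minimal sketch using genfromtxt (mentioned in the question), assuming the tab-separated layout and file name shown above, together with numpy's own .npy binary format for fast reloading (the .npy file name here is arbitrary):

import numpy as np

# dtype=None lets genfromtxt infer a per-column type; names=True takes the
# field names from the header row of the tab-separated file.
arr = np.genfromtxt('filename.tab', delimiter='\t', names=True, dtype=None, encoding='utf-8')
# genfromtxt does not strip the surrounding quotation marks, so the string
# fields may still need cleaning afterwards.
np.save('records.npy', arr)   # fast binary round-trip
arr = np.load('records.npy')  # columns should then be addressable by name, e.g. arr['value']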

Related

Problem either with number of characters exceeding cell limit, or storing lists of variable length

The problem:
I have lists of genes expressed in 53 different tissues. Originally, this data was stored in a maximal array of the genes, with 'NaN' where there was no expression. I am trying to create new lists for each tissue that contain just the genes expressed, as it was very inefficient to search through this array every time I ran my script. I have code that finds the genes for each tissue as required, but I do not know how to store the output.
I was using a pandas DataFrame and then converting to CSV. But this does not accept lists of varying length, unless I put each list in as a single item. However, when I then save the DataFrame to a CSV, it tries to squeeze the very long list (all genes expressed for one tissue) into a single cell, and I get an error about the string length exceeding the Excel character-per-cell limit.
Therefore I need a way of either dealing with this limit or storing my lists in a different way. I would rather just have one file for all lists.
My code:
import csv
import pandas as pd
import math
import numpy as np

# Import list of tissues:
df = pd.read_csv(r'E-MTAB-5214-query-results.tsv', skiprows=[0, 1, 2, 3], sep='\t')
tissuedict = df.to_dict()
tissuelist = list(tissuedict.keys())[2:]
all_genes = [gene for key, gene in tissuedict['Gene Name'].items()]

data = []
for tissue in tissuelist:
    # Create array to keep track of the protein mRNAs in tissue that are not present in the network
    # initiate with first tissue, protein
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    datatis = [tissue, tissueExpression.tolist()]
    print(datatis)
    data.append(datatis)
print(data)

df = pd.DataFrame(data)
df.to_csv(r'tissue_expression_data.csv')
Link to data (either one):
https://github.com/joanna-lada/gene_data/blob/master/E-MTAB-5214-query-results.tsv
https://raw.githubusercontent.com/joanna-lada/gene_data/master/E-MTAB-5214-query-results.tsv
IIUC you need lists of the gene names found in each tissue. This writes these lists as columns into a csv:
import pandas as pd

df = pd.read_csv('E-MTAB-5214-query-results.tsv', skiprows=[0, 1, 2, 3], sep='\t')
df = df.drop(columns='Gene ID').set_index('Gene Name')

res = pd.DataFrame()
for c in df.columns:
    # One column per tissue, holding the names of the genes expressed in it
    res = pd.concat([res, pd.Series(df[c].dropna().index, name=c)], axis=1)

res.to_csv('E-MTAB-5214-query-results.csv', index=False)
(Writing them as rows would have been easier, but Excel can't import so many columns)
Don't open the csv in Excel directly, but use a blank worksheet and import the csv (Data - External data, From text), otherwise you can't separate them into Excel columns in one run (at least in Excel 2010).
Create your data variable as a dictionary.
You can then save the dictionary to a JSON file using json.dump (refer to the json module documentation):
import json

data = {}
for tissue in tissuelist:
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    data[tissue] = tissueExpression.tolist()

with open('filename.json', 'w') as fp:
    json.dump(data, fp)

Read only specific fields from large JSON and import into a Pandas Dataframe

I have a folder with roughly 10 JSON files, each between 500 and 1000 MB.
Each file contains about 1,000,000 lines like the following:
{
    "dateTime": "2019-01-10 01:01:000.0000",
    "cat": 2,
    "description": "This description",
    "mail": "mail#mail.com",
    "decision": [{"first": "01", "second": "02", "third": "03"},
                 {"first": "04", "second": "05", "third": "06"}],
    "Field001": "data001",
    "Field002": "data002",
    "Field003": "data003",
    ...
    "Field999": "data999"
}
My goal is to analyze the data with pandas, so I would like to load the data coming from all the files into one DataFrame.
If I loop over all the files, Python crashes because I don't have enough free resources to manage the data.
For my purpose I only need a DataFrame with two columns, cat and dateTime, from all the files, which I suppose is lighter than a DataFrame with all the columns. I have tried to read only these two columns with the following snippet:
Note: at the moment I am working with only one file; once I have fast reading code I will loop over all the other files (A.json, B.json, ...).
import pandas as pd
import json
import os.path
from glob import glob

cols = ['cat', 'dateTime']
df = pd.DataFrame(columns=cols)

file_name = 'this_is_my_path/File_A.json'
with open(file_name, encoding='latin-1') as f:
    for line in f:
        data = json.loads(line)
        lst_dict = {'cat': data['cat'], 'dateTime': data['dateTime']}
        df = df.append(lst_dict, ignore_index=True)
The code works, but it is very slow: it takes more than one hour for a single file, while reading the whole file and storing it into a DataFrame usually takes me 8-10 minutes.
Is there a faster way to read only two specific columns and append them to a DataFrame?
I have tried reading the whole JSON file into a DataFrame and then dropping all columns except 'cat' and 'dateTime', but it seems to be too heavy for my MacBook.
I had the same problem. I found out that appending a dict to a DataFrame is very very slow. Extract the values as a list instead. In my case it took 14 s instead of 2 h.
import json
import pandas as pd

cols = ['cat', 'dateTime']
data = []

file_name = 'this_is_my_path/File_A.json'
with open(file_name, encoding='latin-1') as f:
    for line in f:
        doc = json.loads(line)
        lst = [doc['cat'], doc['dateTime']]
        data.append(lst)

df = pd.DataFrame(data=data, columns=cols)
Will this help?
Step 1. Read your JSON file with pandas.read_json().
Step 2. Then filter out your two columns from the DataFrame.
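A minimal sketch of those two steps, assuming the files are one JSON object per line (as the json.loads(line) loop in the question suggests); chunking keeps memory usage down for files this size:

import pandas as pd

# Read the line-delimited JSON in chunks and keep only the two needed columns.
chunks = pd.read_json('this_is_my_path/File_A.json', lines=True, chunksize=100000)
df = pd.concat(chunk[['cat', 'dateTime']] for chunk in chunks)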
Let me know if you still face any issue.
Thanks

Creating dictionary from CSV file using lists

I have a csv file which contains four columns and many rows, each representing different data, e.g.
OID DID HODIS BEAR
1 34 67 98
I have already opened and read the CSV file; however, I am unsure how I can make each column into a key. I believe the format I have used in the code below is best for the task I am working on.
Please see my code below, sorry if the explanation is a bit confusing.
Note that the "#Values in column 1" part is what I am stuck on; I am unsure how I can define each column.
for line in file_2:
    the_dict = {}
    OID = line.strip().split(',')
    DID = line.strip().split(',')
    HODIS = line.strip().split(',')
    BEAR = line.strip().split(',')
    the_dict['KeyOID'] = OID
    the_dict['KeyDID'] = DID
    the_dict['KeyHODIS'] = HODIS
    the_dict['KeyBEAR'] = BEAR
    dictionary_list.append(the_dict)
print(dictionary_list)
There is a handy Python string method that splits a string on a delimiter, .split(delim), where delim is the delimiter; it returns the pieces as a list.
Building on the code in your screenshot, you can use the following to split on a comma, which I assume is your delimiter because you said that your file is a CSV.
...
for line in file_contents_2:
    the_dict = {}
    values = line.strip().split(',')
    OID = values[0]
    DID = values[1]
    HODIS = values[2]
    BEAR = values[3]
...
Also, in case you ever need to split a string based on whitespace, that is the default argument for .split() (the default argument is used when no argument is provided).
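A quick illustration, using the row from the question's sample data:

"1 34 67 98".split()      # ['1', '34', '67', '98']  (splits on any whitespace)
"1,34,67,98".split(',')   # ['1', '34', '67', '98']  (splits on the comma delimiter)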
Here is how I would do it as one whole piece of code:
lod = []
with open(file, 'r') as f:
    l = f.readlines()
    for i in l[1:]:
        # Zip the header fields with each data row to build one dict per row
        lod.append(dict(zip(l[0].rstrip().split(), i.split())))
split() doesn't need a parameter here; just use a simple for loop inside the with open block, and there is no need to know the keys in advance.
And if you care about empty dictionaries, do:
lod=list(filter(None,lod))
print(lod)
Output:
[{'OID': '1', 'DID': '34', 'HODIS': '67', 'BEAR': '98'}]
If you want integers:
lod=[{k:int(v) for k,v in i.items()} for i in lod]
print(lod)
Output:
[{'OID': 1, 'DID': 34, 'HODIS': 67, 'BEAR': 98}]
Another way to do it is to use a library such as pandas, which is powerful for working with tabular data. It is fast because we avoid explicit loops. In the example below you only need pandas and the name of the CSV file; I used io just to turn string data into a file-like object that mimics a CSV file.
import pandas as pd
from io import StringIO

data = StringIO('''
OID,DID,HODIS,BEAR\n
1,34,67,98''')  # mimic csv file

df = pd.read_csv(data, sep=',')
print(df.T.to_dict()[0])
In the end you need only a one-liner that chains the commands: read the CSV, transpose, and transform it into a dictionary:
import pandas as pd
csv_dict = pd.read_csv('mycsv.csv',sep=',').T.to_dict()[0]

Operations on a very large csv with pandas

I have been using pandas on csv files to get some values out of them. My data looks like this:
"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"
I have a simple script that reads the CSV and computes the frequency of each WORD by group, so the output is like:
group freqW1 freqW2
A 1 0
B 1 0
C 0 1
Then I do some other operations on the values. The problem is that I now have to deal with very large CSV files (20+ GB) that can't be held in memory. I tried the chunksize=x option in pd.read_csv, but because a 'TextFileReader' object is not subscriptable, I can't do the necessary operations on the chunks.
I suspect there is some easy way to iterate through the csv and do what I want.
My code is like this:
from collections import Counter
import pandas as pd

df = pd.read_csv("csvfile.txt", sep=",", header=None,
                 names=["group", "val1", "val2", "text"])
freq = Counter(df['group'])
word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)

outfile = open("csv_out.txt", "w", encoding='utf-8')
df1.to_csv(outfile, sep=",")
outfile.close()
You can specify a chunksize option in the read_csv call; see the pandas read_csv documentation for details.
Alternatively you could use the Python csv library, create your own csv reader or DictReader, and then use that to read the data in whatever chunk size you choose.
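A minimal sketch of that alternative, reusing the file name and column names from the question; it reads the file in fixed-size batches with the standard csv module rather than pandas:

import csv
from itertools import islice

def iter_chunks(path, chunk_size=1000000):
    """Yield lists of row dicts, chunk_size rows at a time, without loading the whole file."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f, fieldnames=['group', 'val1', 'val2', 'text'])
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk

for chunk in iter_chunks('csvfile.txt'):
    # count WORD1 / WORD2 occurrences per group here, as in the chunked
    # pandas solution shown below
    pass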
Okay I misunderstood the chunk parameter. I solved it by doing this:
frame = pd.DataFrame()
chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"], chunksize=1000000)

for df in chunks:
    freq = Counter(df['group'])
    word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
    word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
    df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
    frame = frame.add(df1, fill_value=0)

outfile = open("csv_out.txt", "w", encoding='utf-8')
frame.to_csv(outfile, sep=",")
outfile.close()

Extracting data using fieldnames

I have parsed and extracted lines and field names from a CSV data file using:
reader = csv.DictReader(open('Sourcefile.txt','rt'), delimiter = '\t')
fn = reader.fieldnames
How do I access the parsed data in any line using the field name? For example, the field names could be AA, BB, CC, DD. How do I obtain the value for DD in line 5, or AA in line 3?
As long as the file is not too big, just convert the reader into a list:
data = list(reader)
Now access the column AA in row 1:
data[0]['AA']
or the value of the third field name in row 3:
data[2][fn[2]]
An easy way is to store all of the data from your file in a container. As an alternative to the other answer, you can use a pandas DataFrame:
>>> import pandas as pd
>>> from io import StringIO  # just to make a fake in-memory file
>>> s = StringIO('AA\tBB\tCC\tDD\n1\t2\t3\t4\n11\t12\t13\t14\n')  # fake data
>>> data = pd.read_table(s)
>>> data.loc[0, 'AA']  # access row 0 (the first row) and column AA
1
Edit: This does use a different package not included with the standard library, but I often find it easier to use pandas than csv when I am constantly reading and manipulating csv files.
