Making a pandas Series safe for YAML - python

I'm working with one script that dumps a pandas Series to a YAML file:
with open('ex.py','w') as f:
    yaml.dump(a_series,f)
And then another script that opens the YAML file and loads the pandas Series:
with open('ex.py','r') as f:
    yaml.safe_load(a_series,f)
I'm trying to safe_load the series, but I get a constructor error. How can I specify that the pandas Series is safe to load?

When you use PyYAML's load(), you declare that everything in the YAML document you are loading is safe to instantiate; since you normally cannot guarantee that, you need to use yaml.safe_load().
In your case this leads to an error, because safe_load doesn't know how to construct the pandas internals that appear in the YAML document with tags like:
!!python/name:pandas.core.indexes.base.Index
and
!!python/tuple
etc.
You would need to provide constructors for all of those objects, register them on the SafeLoader, and then do a_series = yaml.safe_load(f).
Doing that can be a lot of work, especially since what looks like a small change to the data used in your Series might require you to add more constructors.
You could instead dump the dict representation of your Series and load that back. Of course some information is lost in this process; I am not sure whether that is acceptable:
import yaml
from pandas import Series

def series_representer(dumper, data):
    return dumper.represent_mapping(u'!pandas.series', data.to_dict())

yaml.add_representer(Series, series_representer, Dumper=yaml.SafeDumper)

def series_constructor(loader, node):
    d = loader.construct_mapping(node)
    return Series(d)

yaml.add_constructor(u'!pandas.series', series_constructor, Loader=yaml.SafeLoader)

data = Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

with open('ex.yaml', 'w') as f:
    yaml.safe_dump(data, f)

with open('ex.yaml') as f:
    s = yaml.safe_load(f)

print(s)
print(type(s))
which gives:
a 1
b 2
c 3
d 4
e 5
dtype: int64
<class 'pandas.core.series.Series'>
And the ex.yaml file contains:
!pandas.series {a: 1, b: 2, c: 3, d: 4, e: 5}
There are a few things to note:
YAML documents are normally written to files with a .yaml extension. Using .py is bound to confuse you, or to have you overwrite some program source file at some point.
yaml.load() and yaml.safe_load() take a stream as their first parameter. You use them like:
data = yaml.safe_load(stream)
and not like:
yaml.safe_load(data, stream)
It would be better to have a two-step constructor, which allows you to construct self-referential data structures. However, Series.append() doesn't seem to work for that:
def series_constructor(loader, node):
    d = Series()
    yield d
    d.append(Series(loader.construct_mapping(node)))
If dumping the Series via a dictionary is not good enough (because it simplifies the Series' data), and if you don't care about the readability of the generated YAML, you can use to_pickle() instead of to_dict(). However, you would have to work with temporary files, as that method is not flexible enough to handle file-like objects and expects a file name string as its argument.
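A rough sketch of that pickle-based route follows; the !pandas.series.pickle tag, the base64 encoding of the payload, and the helper names are my own choices, not anything pandas or PyYAML provides, and unpickling is of course only as safe as the source of the YAML file:
import base64
import os
import tempfile
import yaml
import pandas as pd
from pandas import Series

def series_pickle_representer(dumper, data):
    # to_pickle() expects a file name, so go through a temporary file
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        data.to_pickle(path)
        with open(path, 'rb') as f:
            payload = f.read()
    finally:
        os.remove(path)
    return dumper.represent_scalar(u'!pandas.series.pickle',
                                   base64.b64encode(payload).decode('ascii'))

def series_pickle_constructor(loader, node):
    payload = base64.b64decode(loader.construct_scalar(node))
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, 'wb') as f:
            f.write(payload)
        return pd.read_pickle(path)
    finally:
        os.remove(path)

yaml.add_representer(Series, series_pickle_representer, Dumper=yaml.SafeDumper)
yaml.add_constructor(u'!pandas.series.pickle', series_pickle_constructor, Loader=yaml.SafeLoader)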

Related

How to read a vcf.gz file in Python?

I have a file in the vcf.gz format (e.g. file_name.vcf.gz) - and I need to read it somehow in Python.
I understood that first I have to decompress it and then read it. I found this solution, but it doesn't work for me, unfortunately. Even for the first line (bgzip file_name.vcf or tabix file_name.vcf.gz) it says SyntaxError: invalid syntax.
Could you help me please?
Both cyvcf and pyvcf can read vcf files, but cyvcf is much faster and is more actively maintained.
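With PyVCF, for instance, the reading part looks roughly like this (a sketch; the file name is just the placeholder from the question):
import vcf
vcf_reader = vcf.Reader(filename='file_name.vcf.gz')
for record in vcf_reader:
    print(record.CHROM, record.POS, record.REF, record.ALT)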
The best approach is to use programs that do this for you, as mentioned by basesorbytes. However, if you want your own code, you could use this approach:
# Import libraries
import gzip
import io
import pandas as pd

class ReadFile():
    '''
    This class reads a VCF file and does some data manipulation;
    the output is the full data found in the input of this class.
    The filtering process happens in a following step.
    '''
    def __init__(self, file_path):
        '''
        This is the built-in constructor method
        '''
        self.file_path = file_path

    def load_data(self):
        '''
        1) Convert the VCF file into a data frame
        2) Read the header of the body dynamically and assign dtypes
        '''
        # Open the gzipped VCF file and keep everything except the meta lines (##...)
        with io.TextIOWrapper(gzip.open(self.file_path, 'r')) as f:
            lines = [l for l in f if not l.startswith('##')]
        # The first remaining line (starting with #CHROM) holds the column names;
        # save them into a dict with the dtypes as values
        dynamic_header_as_key = lines[0].rstrip('\n').split('\t')
        # Declare dtypes (one per fixed VCF column plus one sample column)
        values = [str, int, str, str, str, int, str, str, str, str]
        columns2dtype = dict(zip(dynamic_header_as_key, values))
        vcf_df = pd.read_csv(
            io.StringIO(''.join(lines)),
            dtype=columns2dtype,
            sep='\t'
        ).rename(columns={'#CHROM': 'CHROM'})
        return vcf_df
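Assuming the class above does what you want, usage would look roughly like this (the file name is the placeholder from the question):
reader = ReadFile('file_name.vcf.gz')
vcf_df = reader.load_data()
print(vcf_df.head())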

Pandas to_excel as variable (without destination file) [duplicate]

This question already has an answer here:
Pandas XLSWriter - return instead of write
(1 answer)
Closed 4 years ago.
I recently had to take a dataframe and prepare it to output to an Excel file. However, I didn't want to save it to the local system, but rather pass the prepared data to a separate function that saves to the cloud based on a URI. After searching through a number of ExcelWriter examples, I couldn't find what I was looking for.
The goal is to take the dataframe, e.g.:
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
And temporarily store it as bytes in a variable, e.g.:
processed_data = <bytes representing the excel output>
The solution I came up with is provided in the answers and hopefully will help someone else. Would love to see others' solutions as well!
Update #2 - Example Use Case
In my case, I created an io module that allows you to use URIs to specify different cloud destinations. For example, "paths" starting with gs:// get sent to Google Storage (using gsutil-like syntax). I process the data as my first step, and then pass that processed data to a "save" function, which itself filters to determine the right path.
df.to_csv() actually works with no path and automatically returns a string (at least in recent versions), so this is my solution to allow to_excel() to do the same.
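For reference, that to_csv() behaviour is simply (a quick illustration using the df defined above):
csv_string = df.to_csv()  # no path given, so a string is returned
print(csv_string)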
Works like the common examples, but instead of specifying the file in ExcelWriter, it uses the standard library's BytesIO to store in a variable (processed_data):
from io import BytesIO
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6]
})

output = BytesIO()
writer = pd.ExcelWriter(output)
df.to_excel(writer)  # plus any **kwargs
writer.save()  # in newer pandas versions, use writer.close() instead
processed_data = output.getvalue()
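As a quick sanity check that processed_data really holds a valid workbook, the bytes can be read straight back with read_excel on a BytesIO buffer (a small sketch, not part of the original solution):
df_roundtrip = pd.read_excel(BytesIO(processed_data))
print(df_roundtrip)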

Writing Dictionary to .csv

After looking around for about a week, I have been unable to find an answer that I can get to work. I am making an assignment manager for a project for my first year CS class. Everything else works how I'd like it to (no GUI, just text) except that I cannot save data to use each time you reopen it. Basically, I would like to save my classes dictionary:
classes = {period_1:assignment_1, period_2:assignment_2, period_3:assignment_3, period_4:assignment_4, period_5:assignment_5, period_6:assignment_6, period_7:assignment_7}
after the program closes so that I can retain the data stored in the dictionary. However, I cannot get anything I have found to work. Again, this is a beginner CS class, so I don't need anything fancy, just something basic that will work. I am using a school-licensed form of Canopy for the purposes of the class.
L3viathan's post might be the direct answer to this question, but for your purpose I would suggest using pickle:
import pickle
# To save a dictionary to a pickle file:
pickle.dump(classes, open("assignments.p", "wb"))
# To load from a pickle file:
classes = pickle.load(open("assignments.p", "rb"))
With this method, the variable retains its original structure without you having to write and convert to different formats manually.
Either use the csv library, or do something simple like:
with open("assignments.csv", "w") as f:
for key, value in classes.items():
f.write(key + "," + value + "\n")
Edit: Since it seems that you can't read or write files in your system, here's an alternative solution (with pickle and base85):
import pickle, base64
def save(something):
    pklobj = pickle.dumps(something)
    print(base64.b85encode(pklobj).decode('utf-8'))

def load():
    pklobj = base64.b85decode(input("> ").encode('utf-8'))
    return pickle.loads(pklobj)
To save something, you call save() on your object and copy the string that is printed to your clipboard; you can then store it in a file, for instance.
>>> save(classes) # in my case: {34: ['foo#', 3]}
fCGJT081iWaRDe;1ONa4W^ZpJaRN&NWpge
To load, you call load() and enter the string:
>>> load()
> fCGJT081iWaRDe;1ONa4W^ZpJaRN&NWpge
{34: ['foo#', 3]}
The pickle approach described by Ébe Isaac and L3viathan is the way to go. In case you also want to do something else with the data, you might want to consider pandas (which you should only use if you are doing more than just exporting the data).
As there are only basic strings in your dictionary (according to your comment below your question), it is straightforward to use; if you have more complicated data structures, you should use the pickle approach:
import pandas as pd
classes = {'period_1':'assignment_1', 'period_2':'assignment_2', 'period_3':'assignment_3', 'period_4':'assignment_4', 'period_5':'assignment_5', 'period_6':'assignment_6', 'period_7':'assignment_7'}
pd.DataFrame.from_dict(classes, orient='index').sort_index().rename(columns={0: 'assignments'}).to_csv('my_csv.csv')
That gives you the following output:
assignments
period_1 assignment_1
period_2 assignment_2
period_3 assignment_3
period_4 assignment_4
period_5 assignment_5
period_6 assignment_6
period_7 assignment_7
In detail:
.from_dict(classes, orient='index') creates the actual dataframe using the dictionary as input
.sort_index() sorts the index, which is unordered because you used a dictionary to create the dataframe
.rename(columns={0: 'assignments'}) assigns a more reasonable name to your column (by default 0 is used)
.to_csv('my_csv.csv') finally exports the dataframe to a csv file
If you want to read in the file again you can do it as follows:
df2 = pd.read_csv('my_csv.csv', index_col=0)
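If you then want the plain dictionary back from that dataframe, a small sketch:
classes = df2['assignments'].to_dict()
print(classes)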

Efficiently save to disk (heterogeneous) graph of lists, tuples, and NumPy arrays

I am regularly dealing with large amounts of data (order of several GB), which are stored in memory in NumPy arrays. Often, I will be dealing with nested lists/tuples of such NumPy arrays. How should I store these to disk? I want to preserve the list/tuple structure of my data, the data has to be compressed to conserve disk space, and saving/loading needs to be fast.
(The particular use case I'm facing right now is a 4000-element long list of 2-tuples x where x[0].shape = (201,) and x[1].shape = (201,1000).)
I have tried several options, but all have downsides:
pickle storage into a gzip archive. This works well, and results in acceptable disk space usage, but is extremely slow and consumes a lot of memory while saving.
numpy.savez_compressed. Is much faster than pickle, but unfortunately only allows either a sequence of numpy arrays (not nested tuples/lists as I have) or a dictionary-style way of specifying the arguments.
Storing into HDF5 through h5py. This seems too cumbersome for my relatively simple needs. More importantly, I looked a lot into this, and also there does not seem to be a straightforward way to store heterogeneous (nested) lists.
hickle seems to do exactly what I want; unfortunately, however, it's incompatible with Python 3 at the moment (which is what I'm using).
I was thinking of writing a wrapper around numpy.savez_compressed, which would determine the nested structure of the data, store this structure in some variable nest_structure, flatten the full graph, and store both nest_structure and all the flattened data using numpy.savez_compressed. Then, the corresponding wrapper around numpy.load would understand the nest_structure variable, and re-create the graph and return it. However, I was hoping there is something like this already out there.
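A rough sketch of what I had in mind (assuming only nested lists/tuples of NumPy arrays; the function names and the JSON-encoded nest_structure key are placeholders, not an existing API):
import json
import numpy as np

def _flatten(obj, arrays):
    # Containers record their type; leaves become indices into the flat array list
    if isinstance(obj, (list, tuple)):
        kind = 'list' if isinstance(obj, list) else 'tuple'
        return [kind, [_flatten(x, arrays) for x in obj]]
    arrays.append(np.asarray(obj))
    return ['leaf', len(arrays) - 1]

def _rebuild(structure, arrays):
    kind, payload = structure
    if kind == 'leaf':
        return arrays[payload]
    children = [_rebuild(child, arrays) for child in payload]
    return children if kind == 'list' else tuple(children)

def save_nested(path, obj):
    arrays = []
    nest_structure = _flatten(obj, arrays)
    np.savez_compressed(path, nest_structure=json.dumps(nest_structure),
                        **{'arr_%d' % i: a for i, a in enumerate(arrays)})

def load_nested(path):
    with np.load(path) as data:
        nest_structure = json.loads(str(data['nest_structure']))
        arrays = [data['arr_%d' % i] for i in range(len(data.files) - 1)]
    return _rebuild(nest_structure, arrays)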
You may like the shelve module. It effectively wraps heterogeneous pickled objects in a convenient file. shelve is oriented more toward "persistent storage" than toward the classic save-to-file model.
The main benefit of using shelve is that you can conveniently save most kinds of structured data. The main disadvantage is that it is Python-specific. Unlike HDF5 or saved Matlab files or even simple CSV files, it isn't so easy to use other tools with your data.
Example of saving (out of habit, I created objects and copied them into df, but you don't need to do this; you could just save directly to items in df):
import shelve
import numpy as np
a = np.arange(0, 1000, 12)
b = "This is a string"
class C(object):
    alpha = 1.0
    beta = [3, 4]

c = C()
df = shelve.open('test.shelve', 'c')
df['a'] = a
df['b'] = b
df['c'] = c
df.sync()
exit()
Following the above example, recovering data:
import shelve
import numpy as np
class C():
    alpha = 1.0
    beta = [3, 4]
df = shelve.open('test.shelve')
print(df['a'])
print(df['b'])
print(df['c'].alpha)

How do you create your own data dictionary/structure in python

In the scikit-learn Python library there are many datasets that can be accessed easily with the following commands;
for example, to load the iris dataset:
iris=datasets.load_iris()
And we can now assign data and target/label variables as follows:
X=iris.data # assigns feature dataset to X
Y=iris.target # assigns labels to Y
My question is how to turn my own data (in CSV, XML, or any other format) into a similar data dictionary, so the data can be called easily and the features/labels are easily accessed.
Is this possible? Someone help me!!
By the way I am using the spyder (anaconda) platform by continuum.
Thanks!
I see at least two (easy) solutions to your problem.
First, you can store your data in whichever structure you like.
# Storing in a list
my_list = []
my_list.append(iris.data)
my_list[0] # your data
# Storing in a dictionary
my_dict = {}
my_dict["data"] = iris.data
my_dict["data"] # your data
Or, you can create your own class:
class MyStructure:
    def __init__(self, data, target):
        self.data = data
        self.target = target

my_class = MyStructure(iris.data, iris.target)
my_class.data # your data
Hope it helps
If ALL you want to do is read data from csv files and have them organized, I would recommend simply using either pandas or numpy's genfromtxt function.
mydata=numpy.genfromtxt(filepath,*params)
If the CSV is formatted regularly, you can extract for example the names of each column by specifying:
mydata=numpy.genfromtxt(filepath,unpack=True,names=True,delimiter=',')
then you can access any column data you want by simply typing its name/header:
mydata['your header']
(Pandas also has a similar convenient way of grabbing data in an organized manner from CSV or similar files.)
However, if you want to do it the long way and learn:
Simply put, you want to write a class for the data that you are using, complete with its own access, modify, read, #dosomething functions. Instead of giving code for this, I think you would benefit more from going in and reading, for example, the iris class, or an introduction to a simple class from any beginner's guide to object-oriented programming.
To do what you want, for an object MyData, you could have for example
a read(#file) function that reads from a given file of some expected format and returns some specified structure. For reading from csv files, you can simply use numpy's loadtxt method.
a modify(#some attribute) function
etc. (see the sketch below)
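A minimal sketch of such a MyData class follows; it assumes a comma-separated file with a header row, purely numeric columns, and the label in the last column (those details, and the file name, are assumptions for illustration):
import numpy as np

class MyData:
    def __init__(self, data, target, feature_names):
        self.data = data
        self.target = target
        self.feature_names = feature_names

    @classmethod
    def read(cls, filepath):
        # names=True turns the header row into field names of a structured array
        raw = np.genfromtxt(filepath, delimiter=',', names=True)
        names = list(raw.dtype.names)
        data = np.column_stack([raw[name] for name in names[:-1]])
        target = raw[names[-1]]
        return cls(data, target, names[:-1])

my_data = MyData.read('my_features.csv')  # hypothetical file name
print(my_data.data.shape, my_data.target.shape)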
