How to read a vcf.gz file in Python? - python

I have a file in vcf.gz format (e.g. file_name.vcf.gz) and I need to read it in Python.
I understood that I first have to decompress it and then read it. I found this solution, but unfortunately it doesn't work for me. Even the first line (bgzip file_name.vcf or tabix file_name.vcf.gz) gives SyntaxError: invalid syntax.
Could you help me please?

Both cyvcf and pyvcf can read vcf files, but cyvcf is much faster and is more actively maintained.

The best approach is to use a program that does this for you, as mentioned by basesorbytes. However, if you want your own code, you could use this approach:
# Import libraries
import gzip
import io
import pandas as pd


class ReadFile():
    '''
    This class reads a VCF file
    and does some data manipulation.
    The output is the full data found
    in the input of this class;
    the filtering process happens
    in the following step.
    '''
    def __init__(self, file_path):
        '''
        This is the built-in constructor method
        '''
        self.file_path = file_path

    def load_data(self):
        '''
        1) Convert VCF file into data frame
        Read header of the body dynamically and assign dtype
        '''
        # Open the gzipped VCF in text mode and drop the '##' meta lines
        with gzip.open(self.file_path, 'rt') as f:
            lines = [l for l in f if not l.startswith('##')]
        # Identify the column names from the '#CHROM' line
        # (taken from lines, because f is already exhausted and closed)
        dynamic_header_as_key = []
        for line in lines:
            if line.startswith('#CHROM'):
                dynamic_header_as_key = line.rstrip('\n').split('\t')
        # Declare dtypes; QUAL (6th column) is read as int here,
        # switch it to str if your file contains '.' or decimals
        values = [str, int, str, str, str, int, str, str, str, str]
        columns2dtype = dict(zip(dynamic_header_as_key, values))
        vcf_df = pd.read_csv(
            io.StringIO(''.join(lines)),
            dtype=columns2dtype,
            sep='\t'
        ).rename(columns={'#CHROM': 'CHROM'})
        return vcf_df
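On the SyntaxError in the question: bgzip and tabix are command-line programs, so typing them into Python will not work. If you only need the records, a minimal sketch with just the standard library could look like the following. The helper name read_vcf_records is made up for illustration, and it assumes a tab-separated body with a #CHROM header line. Files compressed with bgzip can be opened with the gzip module as well.

```python
import gzip


def read_vcf_records(path):
    """Return (column_names, rows) from a gzip-compressed VCF (hypothetical helper)."""
    columns, rows = [], []
    # 'rt' decompresses and decodes to text in one step
    with gzip.open(path, 'rt') as f:
        for line in f:
            if line.startswith('##') or not line.strip():
                continue  # skip meta-information and blank lines
            fields = line.rstrip('\n').split('\t')
            if line.startswith('#CHROM'):
                # Header line: drop the leading '#' from the first name
                columns = [fields[0].lstrip('#')] + fields[1:]
            else:
                rows.append(fields)
    return columns, rows
```

Every value comes back as a string, so convert POS and friends yourself if you need numbers.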

Related

How to display the csv file name which is read by pandas read_csv() function?

I want to display the csv file name which is read by pandas.read_csv() function. I tried the below code but I couldn't display the csv file name.
import pandas as pd
df=pd.read_csv("abc.csv")
print(df.info())
I want to display "abc". How can I do this? Thanks in advance.
The pandas.read_csv() function accepts a file object (actually any file-like object with a read() method).
And a file object has a name attribute that holds the name of the opened file.
I see this code and situation as absolutely meaningless since you already know the file name beforehand, but for the sake of completeness, here you go:
import pandas as pd
csv_file = open("your_csv_filename.csv")
print(csv_file.name)
df = pd.read_csv(csv_file)
When you use the pandas read_csv function, you get a dataframe that does not include the file name. So the solution is to store the name of the .csv file in a variable and then print it. You can read more about dataframes in the pandas.DataFrame documentation.
import pandas as pd
name = "abc.csv"
df=pd.read_csv(name)
print(name.split(".")[0])
You can use something like this, as read_csv does not save the file name.
Using glob lets you use shell-style wildcards (not regular expressions) to pick up all the CSV files in a folder for reading.
import glob
import pandas as pd
data = {}
for filename in glob.glob("/path/of/the/csv/files/*.csv"):
    data[filename.split("/")[-1].split(".")[0]] = pd.read_csv(filename)
for key, value in data.items():
    print(key)
    value.info()  # info() prints its report itself and returns None
    print("\n\n")
filename.split("/")[-1].split('.')[0]
The above line may look complicated, but it just splits the file name twice: first on "/" to drop the directories, then on "." to drop the extension.
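As an alternative to splitting the string by hand, the standard library's pathlib can pull out the base name. This is only a sketch, and csv_label is a made-up helper name:

```python
from pathlib import Path


def csv_label(path):
    """Return the file name without its directory and final extension."""
    return Path(path).stem


print(csv_label("/path/of/the/csv/files/abc.csv"))
```

Note that for a name with several dots, such as a.b.csv, stem keeps "a.b", whereas split(".")[0] keeps only "a".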

Convert pkl file to json file

I'm new on stack-overflow.
I'm trying to convert pkl file into json file using python. Below is my sample code
import pickle
import pandas as pd
# Load pickle file
input_file = open('file.pkl', 'rb')
new_dict = pickle.load(input_file)
input_file()
# Create a Pandas DataFrame
data_frame = pd.DataFrame(new_dict)
# Copy DataFrame index as a column
data_frame['index'] = data_frame.index
# Move the new index column to the front of the DataFrame
index = data_frame['index']
data_frame.drop(labels=['index'], axis=1, inplace = True)
data_frame.insert(0, 'index', index)
# Convert to json values
json_data_frame = data_frame.to_json(orient='values', date_format='iso', date_unit='s')
with open('data.json', 'w') as js_file:
js_file.write(json_data_frame)
When I run this code I get TypeError: '_io.TextIOWrapper' object is not callable. Similar questions (this one and this one) suggested using the write method on the input_file() line, but then I get io.UnsupportedOperation: write. That looks like a method for writing, whereas I am reading, and I am unable to find a corresponding method for reading.
I also tried to read the pickle file in the following way:
with open('file.pkl', 'rb') as input_file:
    new_dict = pickle.load(input_file)
and I get this error:
DataFrame constructor not properly called!
How can I solve this problem?
Suggestions for other tools that can perform this task would also be appreciated. Thanks.
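One possible direction, sketched under assumptions since the content of file.pkl is not shown: the first error comes from calling the file object with input_file(), and using a with block removes the need for that line altogether. "DataFrame constructor not properly called!" usually means the unpickled object is not something pd.DataFrame can build a frame from (a plain string, for example), so it may be worth checking the type before converting. The helper name pickle_to_json below is hypothetical.

```python
import json
import pickle


def pickle_to_json(pkl_path, json_path):
    """Load a pickle file and write its content as JSON (hypothetical helper)."""
    # The with block closes the file, so no input_file() call is needed
    with open(pkl_path, 'rb') as input_file:
        obj = pickle.load(input_file)

    if hasattr(obj, 'to_json'):
        # Objects such as a pandas DataFrame can serialise themselves
        text = obj.to_json()
    else:
        # Plain dicts and lists go through the json module; default=str is
        # a fallback for values json cannot encode on its own (dates, say)
        text = json.dumps(obj, default=str)

    with open(json_path, 'w') as js_file:
        js_file.write(text)
    # Report what was actually stored in the pickle
    return type(obj).__name__
```

Calling pickle_to_json('file.pkl', 'data.json') returns the type name of the loaded object, which tells you whether building a DataFrame from it makes sense at all.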

Writing value to given field in csv file using pandas or csv module

Is there any way to write a value to a specific place in a given .csv file using pandas or the csv module?
I have tried using csv_reader to read the file and find the line that fits my requirements, but I couldn't figure out a way to replace the value in the file with mine.
What I am trying to achieve is this: I have a spreadsheet of names and values. I am using JSON to update the values from the server, and after that I want to update my spreadsheet as well.
The latest solution I came up with was to create a separate sheet from which I will get the updated data, but this is not working, and there is no fixed order in which the dict is written to the file.
def updateSheet(fileName, aValues):
    with open(fileName+".csv") as workingSheet:
        writer = csv.DictWriter(workingSheet,aValues.keys())
        writer.writeheader()
        writer.writerow(aValues)
I will appreciate any guidance and tips.
You can try writing the csv file this way:
import pandas as pd
a = ['one','two','three']
b = [1,2,3]
english_column = pd.Series(a, name='english')
number_column = pd.Series(b, name='number')
predictions = pd.concat([english_column, number_column], axis=1)
save = pd.DataFrame({'english':a,'number':b})
save.to_csv('b.csv',index=False,sep=',')
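To change a single value in an existing file, a common pattern is to read all rows, edit the matching one, and write the whole file back. Here is a rough sketch with the csv module. The helper update_value and its parameters are invented for illustration, and it assumes the file has a header row and fits in memory. Note that the file must be opened in 'w' mode for writing, which the updateSheet function in the question does not do.

```python
import csv


def update_value(file_name, key_column, key, value_column, new_value):
    """Rewrite one cell of a CSV file in place (hypothetical helper)."""
    # Read every row into memory; fine for spreadsheet-sized files
    with open(file_name, newline='') as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Change the value in each row whose key column matches
    changed = 0
    for row in rows:
        if row[key_column] == key:
            row[value_column] = new_value
            changed += 1

    # Write the whole table back, keeping the original column order
    with open(file_name, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return changed
```

For example, update_value('b.csv', 'english', 'two', 'number', 20) would set the number next to "two" to 20 and return how many rows were changed.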

Using Pandas read_table with list of files

I am pretty new to Python in general, but am trying to make a script that takes data from certain files in a folder and puts it into an Excel spreadsheet.
The code I have will find the file type that I want in my specified folder, and then make a list with the full file paths.
import os
file_paths = []
for folder, subs, files in os.walk('C://Users/Dir'):
    for filename in files:
        if filename.endswith(".log") or filename.endswith(".txt"):
            file_paths.append(os.path.abspath(os.path.join(folder,filename)))
It will also take a specific file path, pull data from the correct column, and put it into excel in the correct cells.
import pandas as pd
import numpy
for i in range(len(file_paths)):
    fields = ['RDCR']
    data = pd.read_table(file_paths[i], sep= "\s+", names = fields, usecols=[3],
Where I am having trouble is making the read_table iterate through my list of files and put the data into an excel sheet where every time it reads a new file it moves over one column in the spreadsheet.
Ideally, the for loop would see how long the file_paths list is, and use that as the range. It would then use the file_paths[i] to input the file names into the read_table one by one.
What happens is that it finds the length of file_paths, and instead of iterating through the files in it one by one, it just inputs the data from the last file on the list.
Any help would be much appreciated! Thank you!
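Try to concatenate all of them at once and write to Excel one time. Your loop overwrites data on every pass, which is why only the last file ends up in the sheet.
from glob import glob
import pandas as pd

# Unlike os.walk, these patterns only match files directly inside the folder
files = glob('C://Users/Dir/*.log') + glob('C://Users/Dir/*.txt')

def read_file(f):
    fields = ['RDCR']
    return pd.read_table(
        f, sep=r"\s+",
        names=fields, usecols=[3])

# One column per file, side by side
df = pd.concat([read_file(f) for f in files], axis=1)
# to_excel returns None, so call it separately to keep df
df.to_excel('out.xlsx')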
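If every column ends up with the same name, it can be hard to tell which file it came from. One option, sketched here with a made-up helper combine_columns, is to pass a dict to pd.concat so that its keys become the column labels. The sketch assumes whitespace-separated files with no header row and the value of interest in the fourth column, as in the question.

```python
import io

import pandas as pd


def combine_columns(sources):
    """Read the fourth column of each source and place them side by side.

    sources maps a label (a file name, say) to a path or file-like object;
    both the helper and the mapping are hypothetical.
    """
    frames = {
        # header=None keeps positional labels, so the kept column is named 3
        label: pd.read_table(src, sep=r"\s+", header=None, usecols=[3])[3]
        for label, src in sources.items()
    }
    # Concatenating a dict along axis=1 uses its keys as column labels
    return pd.concat(frames, axis=1)


# In-memory stand-ins for two log files (invented sample values)
demo = combine_columns({
    "a.log": io.StringIO("1 2 3 4\n5 6 7 8\n"),
    "b.txt": io.StringIO("9 8 7 6\n5 4 3 2\n"),
})
print(demo)
```

With real files you would build the mapping from your list of paths and then call to_excel on the result once.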

Parsing CSV File into Python into a contiguous block

I am trying to load in time series/Apple's stock price data (3000X5) into Python.
So date, open, high, low, close. I am running the following code in python spyder.
import matplotlib.pyplot as plt
import csv
datafile = open(r'C:\Users\Riemmman\Desktop\SAMPLE_AAPL_DATA_FOR_Python.csv')
datareader = csv.reader(datafile)
data = []
for row in datareader:
    data.append(row)
But 'data' still remains a list of rows. I want it separated into a contiguous block with the headers on top and the data in its respective columns, with the date at the far left, as one would see the data in R/Matlab. What am I missing? Thank you for your help.
You want to transpose the data; rows to columns. The zip() function, when applied to all rows, does this for you. Use *datareader to have Python pull all rows in and apply them as separate arguments to the zip() function:
filename = r'C:\Users\Riemmman\Desktop\SAMPLE_AAPL_DATA_FOR_Python.csv'
with open(filename, newline='') as datafile:
    datareader = csv.reader(datafile)
    columns = list(zip(*datareader))
This also uses some more best practices:
Using the file as a context manager with the with statement ensures it is closed automatically
Opening the file with newline='' lets the csv module manage line endings correctly (on Python 2, open it in binary mode, 'rb', instead)
Writing the Windows path as a raw string (r'...') keeps the backslashes from being read as escape sequences
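To see what the zip(*rows) transposition produces, here is a small self-contained sketch. The helper read_columns and the sample rows are invented for illustration, and it assumes the first row of the file holds the headers:

```python
import csv
import io


def read_columns(source):
    """Return a dict mapping each header to the tuple of values below it."""
    reader = csv.reader(source)
    header = next(reader)      # first row holds the column names
    columns = zip(*reader)     # transpose the remaining rows into columns
    return dict(zip(header, columns))


# Invented sample rows standing in for the real stock file
sample = io.StringIO("date,open,close\nd1,1,2\nd2,3,4\n")
cols = read_columns(sample)
print(cols["date"])
print(cols["close"])
```

The values stay strings, so convert the price columns with float() before plotting them.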
