How can I convert CSV data in the format ['x','y','width','height','tag'] to the following XML format using a Python script?
<annotation>
    <object>
        <tag>figure</tag>
        <bndbox>
            <x>1.0</x>
            <y>100.0</y>
            <width>303.0</width>
            <height>619.0</height>
        </bndbox>
    </object>
    <object>
        <tag>text</tag>
        <bndbox>
            <x>338.0</x>
            <y>162.0</y>
            <width>143.0</width>
            <height>423.0</height>
        </bndbox>
    </object>
    <object>
        <tag>text</tag>
        <bndbox>
            <x>85.0</x>
            <y>768.0</y>
            <width>554.0</width>
            <height>39.0</height>
        </bndbox>
    </object>
</annotation>
Note: this shows the output for the first rows of the CSV file; I want to convert all rows.
Good morning.
I was interested in reading and writing CSV files with Python, because I wasn't able to read a file on my own computer. I can't say more about other file types, because I have no experience with them.
Personally, I tried the IDLE Python editor, which can also run and debug Python applications. If you know another Python editor, that can be a choice too; it isn't necessary to use a different application for this task.
The Python program below reads your CSV file and writes an XML file. I haven't considered what happens if the CSV file has a header row.
I created a Py_2_xml.py file. It can be placed in the same folder as your movies2.csv. The first time, you can create the file as a text file with Notepad, rename the extension, and then edit it with the installed IDLE editor.
Here is my code:
import csv

csvfile = open('./movies2.csv')
reader = csv.reader(csvfile)

data = "<annotation>\n"
for line in reader:
    # each CSV row is ['x', 'y', 'width', 'height', 'tag']
    data = data + '    <object>\n'
    data = data + '        <tag>' + line[4] + '</tag>\n'
    data = data + '        <bndbox>\n'
    i = 0
    for item in line:
        if i == 0:
            data = data + '            <x>' + item + '</x>\n'
        elif i == 1:
            data = data + '            <y>' + item + '</y>\n'
        elif i == 2:
            data = data + '            <width>' + item + '</width>\n'
        elif i == 3:
            data = data + '            <height>' + item + '</height>\n'
        i = i + 1
    data = data + '        </bndbox>\n'
    data = data + '    </object>\n'
csvfile.close()

data = data + '</annotation>\n'
xmlfile = open('outputf.xml', 'w')
xmlfile.write(data)
xmlfile.close()

print(data)
data = ''

# To open the result in Notepad automatically:
# import subprocess
# subprocess.Popen(["notepad", "outputf.xml"])
I'm sorry, the first time I forgot to write the object tag, and then the last column was not read properly.
The items in the line object returned by the csv reader can also be referenced like elements of an array (so the columns can be read in any order). This simplifies the code further. Instead of iterating over all items in the line array, do it like this:
import csv

csvfile = open('./movies2.csv')
reader = csv.reader(csvfile)

data = "<annotation>\n"
for line in reader:
    data = data + '    <object>\n'
    data = data + '        <tag>' + line[4] + '</tag>\n'
    data = data + '        <bndbox>\n'
    data = data + '            <x>' + line[0] + '</x>\n'
    data = data + '            <y>' + line[1] + '</y>\n'
    data = data + '            <width>' + line[2] + '</width>\n'
    data = data + '            <height>' + line[3] + '</height>\n'
    data = data + '        </bndbox>\n'
    data = data + '    </object>\n'
csvfile.close()

data = data + '</annotation>\n'
xmlfile = open('outputf.xml', 'w')
xmlfile.write(data)
xmlfile.close()

print(data)
data = ''
Preferably, the file won't have to be opened in an application. You can use the os module to copy the file to another location, but first the destination path has to be obtained. Some people use the Tkinter library to show a save dialog.
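For example, here is a minimal sketch of showing a save dialog and copying the generated file to the chosen path (this assumes Python 3 module names; on Python 2 the modules are named Tkinter and tkFileDialog):
import shutil
from tkinter import Tk
from tkinter import filedialog

root = Tk()
root.withdraw()  # hide the empty main window
# ask the user for a destination path
dest = filedialog.asksaveasfilename(defaultextension='.xml',
                                    filetypes=[('XML files', '*.xml')])
if dest:
    shutil.copyfile('outputf.xml', dest)  # copy the generated file there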
Be careful with the indentation of each line in the Python code, because a wrong tab will result in an error.
This really works for me.
Best regards, Adrian Brinas.
You can use the Python pandas library for this:
https://pypi.org/project/pandas/
import pandas as pd

df = pd.read_csv('movies2.csv')
with open('outputf.xml', 'w') as myfile:
    myfile.write(df.to_xml())
Make sure your CSV has a header row.
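Note that DataFrame.to_xml() (available in pandas 1.3+) wraps everything in generic <data> and <row> elements by default. Here is a sketch of steering it closer to the desired format, assuming the header row is x,y,width,height,tag:
import pandas as pd

df = pd.read_csv('movies2.csv')
# rename the root and row elements and drop the index column
xml = df.to_xml(root_name='annotation', row_name='object', index=False)
with open('outputf.xml', 'w') as myfile:
    myfile.write(xml)
This still emits the fields flat inside each <object> rather than nested in a <bndbox>, so the exact target layout would need some post-processing.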
import pandas as pd

def convert_row(row):
    return """<annotation>
    <object>
        <tag>%s</tag>
        <bndbox>
            <x>%s</x>
            <y>%s</y>
            <width>%s</width>
            <height>%s</height>
        </bndbox>
    </object>""" % (
        row.tag, row.x, row.y, row.width, row.height)

df = pd.read_csv('untitled.txt', sep=',')
print '\n'.join(df.apply(convert_row, axis=1))
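To save the result to a file instead of printing it, a short sketch reusing the same df and convert_row from above:
with open('outputf.xml', 'w') as out:
    out.write('\n'.join(df.apply(convert_row, axis=1)))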
I'm trying to find a python solution to extract the length of a specific sequence within a fasta file using the full header of the sequence as the query. The full header is stored as a variable earlier in the pipeline (i.e. "CONTIG"). I would like to save the output of this script as a variable to then use later on in the same pipeline.
Below is an updated version of the script using code provided by Lucía Balestrazzi.
Additional information: The following with-statement is nested inside a larger for-loop that cycles through subsamples of an original genome. The first subsample fasta in my directory has a single sequence ">chr1:0-40129801" with a length of 40129801. I'm trying to write out a text file "OUTPUT" that has some basic information about each subsample fasta. This text file will be used as an input for another program downstream.
Header names in the original fasta file are chr1, chr2, etc... while the header names in the subsample fastas are something along the lines of:
batch1.fa >chr1:0-40k
batch2.fa >chr1:40k-80k
...etc...
import re
import Bio.SeqIO as IO

record_dict = IO.to_dict(IO.parse(ORIGINAL_GENOME, "fasta"))  # not the subsample
# OUTPUT is the text file opened earlier in the pipeline
with open(GENOME_SUBSAMPLE, 'r') as FIN:
    for LINE in FIN:
        if LINE.startswith('>'):
            # Example of "LINE"... >chr1:0-40129801
            HEADER = re.sub('>', '', LINE)
            # HEADER = chr1:0-40129801
            HEADER2 = re.sub('\n', '', HEADER)
            # HEADER2 = chr1:0-40129801 (no return character on the end)
            CONTIG = HEADER2.split(":")[0]
            # CONTIG = chr1
            PART2_HEADER = HEADER2.split(":")[1]
            # PART2_HEADER = 0-40129801
            START = int(PART2_HEADER.split("-")[0])
            # START = 0
            END = int(PART2_HEADER.split("-")[1])
            # END = 40129801
            LENGTH = END - START
            # LENGTH = 40129801 minus 0 = 40129801

            # This is where I'm stuck...
            ORIGINAL_CONTIG_LENGTH = len(record_dict[CONTIG])  # This returns "KeyError: 1"
            # ORIGINAL_CONTIG_LENGTH = 223705999 (this is from the full genome, not the subsample).

            OUTPUT.write(str(START) + '\t' + str(HEADER2) + '\t' + str(LENGTH) + '\t' + str(CONTIG) + '\t' + str(ORIGINAL_CONTIG_LENGTH) + '\n')
            # OUTPUT = 0 chr1:0-40129801 40129801 chr1 223705999
OUTPUT.close()
I'm relatively new to bioinformatics. I know I'm messing up on how I'm using the dictionary, but I'm not quite sure how to fix it.
Any advice would be greatly appreciated. Thanks!
You can do it this way:
import Bio.SeqIO as IO
record_dict = IO.to_dict(IO.parse("genome.fa", "fasta"))
print(len(record_dict["chr1"]))
or
import Bio.SeqIO as IO
record_dict = IO.to_dict(IO.parse("genome.fa", "fasta"))
seq = record_dict["chr1"]
print(len(seq))
EDIT: Alternative code
import Bio.SeqIO as IO

record_dict = IO.to_dict(IO.parse("genome.fa", "fasta"))
names = record_dict.keys()
for HEADER in names:
    # HEADER = chr1:0-40129801
    ORIGINAL_CONTIG_LENGTH = len(record_dict[HEADER])
    CONTIG = HEADER.split(":")[0]
    # CONTIG = chr1
    PART2_HEADER = HEADER.split(":")[1]
    # PART2_HEADER = 0-40129801
    START = int(PART2_HEADER.split("-")[0])
    END = int(PART2_HEADER.split("-")[1])
    LENGTH = END - START
The idea is that you define the dict once, get its keys (all the contig headers) and store them as a variable, and then loop through the headers extracting the info you need. There is no need to loop through the file.
Cheers
This works; I just changed the "CONTIG" variable to a string. Thanks Lucía for all your help over the last couple of days!
import re
import Bio.SeqIO as IO

record_dict = IO.to_dict(IO.parse(ORIGINAL_GENOME, "fasta"))  # not the subsample
with open(GENOME_SUBSAMPLE, 'r') as FIN:
    for LINE in FIN:
        if LINE.startswith('>'):
            # Example of "LINE"... >chr1:0-40129801
            HEADER = re.sub('>', '', LINE)
            # HEADER = chr1:0-40129801
            HEADER2 = re.sub('\n', '', HEADER)
            # HEADER2 = chr1:0-40129801 (no return character on the end)
            CONTIG = HEADER2.split(":")[0]
            # CONTIG = chr1
            PART2_HEADER = HEADER2.split(":")[1]
            # PART2_HEADER = 0-40129801
            START = int(PART2_HEADER.split("-")[0])
            # START = 0
            END = int(PART2_HEADER.split("-")[1])
            # END = 40129801
            LENGTH = END - START
            # LENGTH = 40129801 minus 0 = 40129801

            ORIGINAL_CONTIG_LENGTH = len(record_dict[str(CONTIG)])
            # ORIGINAL_CONTIG_LENGTH = 223705999 (this is from the full genome, not the subsample).

            OUTPUT.write(str(START) + '\t' + str(HEADER2) + '\t' + str(LENGTH) + '\t' + str(CONTIG) + '\t' + str(ORIGINAL_CONTIG_LENGTH) + '\n')
            # OUTPUT = 0 chr1:0-40129801 40129801 chr1 223705999
OUTPUT.close()
I'm able to fully scrape the material I need; the problem is that I can't get the data into Excel.
from lxml import html
import requests
import xlsxwriter

page = requests.get('website that gets mined')
tree = html.fromstring(page.content)
items = tree.xpath('//h4[@class="item-title"]/text()')
prices = tree.xpath('//span[@class="price"]/text()')
description = tree.xpath('//div[@class="description text"]/text()')
print 'items: ', items
print 'Prices: ', prices
print 'description', description
Everything works fine until this section, where I try to get the data into Excel. This is the error message:
for items,prices,description in (array):
ValueError: too many values to unpack
Exception Exception: Exception('Exception caught in workbook destructor. Explicit close() may be required for workbook.',) in <bound method Workbook.__del__ of <xlsxwriter.workbook.Workbook object at 0x104735e10>> ignored
This is what it was trying to do:
array = [items, prices, description]
workbook = xlsxwriter.Workbook('test1.xlsx')
worksheet = workbook.add_worksheet()
row = 0
col = 0
for items, prices, description in (array):
    worksheet.write(row, col, items)
    worksheet.write(row, col + 1, prices)
    worksheet.write(row, col + 2, description)
    row += 1
workbook.close()
Assuming that items, prices and description all have the same length, you could rewrite the final part of the code as:
for item, price, desc in zip(items, prices, description):
    worksheet.write(row, col, item)
    worksheet.write(row, col + 1, price)
    worksheet.write(row, col + 2, desc)
    row += 1
If the lists can have unequal lengths, you should look into alternatives to the zip method, but I would be worried about the data consistency.
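For example, a sketch using itertools.izip_longest (Python 2; it is itertools.zip_longest in Python 3), which pads the shorter lists instead of silently dropping rows, reusing the worksheet, row and col variables from above:
from itertools import izip_longest

for item, price, desc in izip_longest(items, prices, description, fillvalue=''):
    worksheet.write(row, col, item)
    worksheet.write(row, col + 1, price)
    worksheet.write(row, col + 2, desc)
    row += 1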
Inevitably, it will be easier to write to a CSV or text file rather than an Excel file.
import urllib2

listOfStocks = ["AAPL", "MSFT", "GOOG", "FB", "AMZN"]
urls = []
for company in listOfStocks:
    urls.append('http://real-chart.finance.yahoo.com/table.csv?s=' + company + '&d=6&e=28&f=2015&g=m&a=11&b=12&c=1980&ignore=.csv')

Output_File = open('C:/your_path_here/Data.csv', 'w')
New_Format_Data = ''
for counter in range(0, len(urls)):
    Original_Data = urllib2.urlopen(urls[counter]).read()
    if counter == 0:
        New_Format_Data = "Company," + urllib2.urlopen(urls[counter]).readline()
    rows = Original_Data.splitlines(1)
    for row in range(1, len(rows)):
        New_Format_Data = New_Format_Data + listOfStocks[counter] + ',' + rows[row]
Output_File.write(New_Format_Data)
Output_File.close()
OR
from bs4 import BeautifulSoup
import urllib2

var_file = urllib2.urlopen("http://www.imdb.com/chart/top")
var_html = var_file.read()
text_file = open("C:/your_path_here/Text1.txt", "wb")
var_file.close()
soup = BeautifulSoup(var_html)
for item in soup.find_all(class_='lister-list'):
    for link in item.find_all('a'):
        # print(link)
        z = str(link)
        text_file.write(z + "\r\n")
text_file.close()
As a developer, it's difficult to programmatically manipulate Excel files, since the Excel format is proprietary. This is especially true for languages other than .NET. CSV files, on the other hand, are easy to manipulate programmatically since, after all, they are simple text files.
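For instance, a minimal sketch of writing the scraped lists from the question with the standard csv module (assuming items, prices and description have already been collected):
import csv

# 'wb' suits Python 2; on Python 3 use open('output.csv', 'w', newline='')
with open('output.csv', 'wb') as out:
    writer = csv.writer(out)
    writer.writerow(['item', 'price', 'description'])  # header row
    for item, price, desc in zip(items, prices, description):
        writer.writerow([item, price, desc])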
I have a Python script (2.7) that is generating xml data. However, for each object in my array I need it to generate a new sequential number beginning with 1.
Example below (see sequenceOrder tag):
<objectArray>
    <object>
        <title>Title</title>
        <sequenceOrder>1</sequenceOrder>
    </object>
    <object>
        <title>Title</title>
        <sequenceOrder>2</sequenceOrder>
    </object>
    <object>
        <title>Title</title>
        <sequenceOrder>3</sequenceOrder>
    </object>
</objectArray>
With Python 2.7, how can I have my script generate a new number (+1 of the number preceding it) for the sequenceOrder part of each object in my xml array?
Please note that there will be hundreds of thousands of objects in my array, so something to consider.
I'm totally new to Python/coding in general, so any help is appreciated! Glad to provide additional information as necessary.
If you are creating objects to be serialized yourself, you may use itertools.count to get consecutive unique integers.
In a very abstract way it'll look like:
import itertools

counter = itertools.count(1)  # start the sequence at 1

o = create_object()
o.sequentialNumber = next(counter)
o2 = create_another_object()
o2.sequentialNumber = next(counter)

create_xml_doc(my_objects)
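If the objects already sit in a list, plain enumerate gives the same numbering without a separate counter (a sketch, assuming the my_objects list from above):
for sequence_order, obj in enumerate(my_objects, start=1):
    obj.sequentialNumber = sequence_order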
Yes, you can generate a new sequence number for each XML element. Here is how to produce your sample output using lxml:
import lxml.etree as et

root = et.Element('objectArray')
for i in range(1, 4):
    obj = et.Element('object')
    title = et.Element('title')
    title.text = 'Title'
    obj.append(title)
    sequenceOrder = et.Element('sequenceOrder')
    sequenceOrder.text = str(i)
    obj.append(sequenceOrder)
    root.append(obj)
print et.tostring(root, pretty_print=True)
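To write the tree straight to a file instead of printing it, one possible follow-up using the same root (the filename is just an example):
et.ElementTree(root).write('objects.xml', pretty_print=True,
                           xml_declaration=True, encoding='UTF-8')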
A colleague provided a solution: set
sequence_order = 1
then use it in the XML template:
<sequenceOrder>""" + str(sequence_order) + """</sequenceOrder>
and then later:
if test is False:
    with open('uniquetest.csv', 'a') as write:
        writelog = csv.writer(write, delimiter='\t', quoting=csv.QUOTE_ALL)
        writelog.writerow((title, ))
try:
    f = open(file + '_output.xml', 'r')
    f = open(file + '_output.xml', 'a')
    f.write(DASxml_bottom + DASxml_top + digital_objects)
    f.close()
    sequence_order = sequence_order + 1
    # f = open('log.txt', 'a')
    # f.write(title + str(roll) + label_flag + str(id) + str(file_size) + file_path + """
    # """)
    # f.close()
That might make no sense to some since I didn't provide the whole script, but it worked for my purposes! Thanks all for the suggestions
I have the following problem in the code below.
I open a file and load it into "csproperties" (comment #open path). In every opened file I want to make three changes (comment #change parameters), then write the three changes to the file and close it. I want to do this file by file.
When I open a changed file, it contains the same content three times: in copy one I can see my first change, in copy two the second, and so on.
I don't understand why my tool writes the full file content three times into the changed file.
I think it has something to do with the #write file block. I tried several things, but nothing worked the right way.
Any suggestions?
Kind regards
for instance in cs_id:
    cspath.append(cs_id[n] + '/mypath/conf/myfile.txt')

    # open path
    f = open(cspath[n], "r")
    csproperties = f.read()
    f.close()

    # change parameters
    CS_License_Key_New = csproperties.replace(oms + "CSLicenseKey=", oms + "CSLicenseKey=" + keystore[n])
    Logfile_New = csproperties.replace(oms + "LogFile=", oms + "LogFile=" + logs + 'ContentServer_' + cs_id[n] + '.log')
    Pse_New = csproperties.replace(oms + "PABName=", oms + "PABName=" + pse + 'ContentServer_' + cs_id[n] + '.PSE')

    # write File
    f = open(cspath[n], 'w')
    f.write(CS_License_Key_New)
    f.write(Logfile_New)
    f.write(Pse_New)
    f.close()

    n += 1
You're doing 3 different replaces on the same content. You should chain the replaces instead:
result = (csproperties
          .replace(oms + "CSLicenseKey=", oms + "CSLicenseKey=" + keystore[n])
          .replace(oms + "LogFile=",
                   oms + "LogFile=" + logs + 'ContentServer_' + cs_id[n] + '.log')
          .replace(oms + "PABName=",
                   oms + "PABName=" + pse + 'ContentServer_' + cs_id[n] + '.PSE'))
...
f.write(result)
CS_License_Key_New = csproperties.replace(...)
Logfile_New = csproperties.replace(...)
Pse_New = csproperties.replace(...)
These are three different copies of the content: each replace starts from the original csproperties and stores its result in a separate variable, and all three copies then get written to the file. You should apply all the replacements in one pass instead.
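A tiny demonstration of why this happens: str.replace never modifies the original string, it returns a new copy each time.
s = "a=1 b=2"
x = s.replace("a=1", "a=9")  # x == "a=9 b=2"; s is unchanged
y = s.replace("b=2", "b=9")  # y == "a=1 b=9", again built from the original s
# writing both x and y to the file stores two full copies, each with one change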
I am using Python 2.7.9. I'm working on a program that is supposed to produce the following output in a .csv file per loop:
URL,number
Here's the main loop of the code I'm using:
import json
import time
import urllib
from urllib import urlopen

# listfile, list, qualities, baseUrl, minprice, minvol and runtime
# are defined earlier in the script (the imports were not shown in the question)
csvlist = open(listfile, 'w')
f = open(list, "r")

def hasQuality(item):
    for quality in qualities:
        if quality in item:
            return True
    return False

for line in f:
    line = line.split('\n')
    line = line[0]
    # print line
    itemname = urllib.unquote(line).decode('utf8')
    # print itemhash
    if hasQuality(itemname):
        try:
            looptime = time.time()
            url = baseUrl + line
            results = json.loads(urlopen(url).read())
            # status = results.status_code
            content = results
            if 'median_price' in content:
                medianstr = str(content['median_price']).replace('$', '')
                medianstr = medianstr.replace('.', '')
                median = float(medianstr)
                volume = content['volume']
                print url + '\n' + itemname
                print 'Median: $' + medianstr
                print 'Volume: ' + str(volume)
                if (median > minprice) and (volume > minvol):
                    csvlist.write(line + ',' + medianstr + '\n')
                    print '+ADDED TO LIST'
            else:
                print 'No median price given for ' + itemname + '.\nGiving up on item.'
            print "Finished loop in " + str(round(time.time() - looptime, 3)) + " seconds."
        except ValueError:
            print "we blacklisted fool?? cause we skippin beats"
    else:
        print itemname + 'is a commodity.\nGiving up on item.'

csvlist.close()
f.close()
print "Finished script in " + str(round(time.time() - runtime, 3)) + " seconds."
It should be generating a list that looks like this:
AWP%20%7C%20Asiimov%20%28Field-Tested%29,3911
M4A1-S%20%7C%20Hyper%20Beast%20%28Field-Tested%29,4202
But it's actually generating a list that looks like this:
AWP%20%7C%20Asiimov%20%28Field-Tested%29
,3911
M4A1-S%20%7C%20Hyper%20Beast%20%28Field-Tested%29
,4202
Whenever it is run on a Windows machine, I have no issue. Whenever I run it on my EC2 instance, however, it adds that extra newline. Any ideas why? Running commands on the file like
awk 'NR%2{printf $0" ";next;}1' output.csv
do not do anything. I have transferred it to my Windows machine and it still reads the same. However, when I paste the output into Steam's chat client it concatenates it in the way that I want.
Thanks in advance!
This is where the problem occurs:
csvlist.write(line + ',' + medianstr + '\n')
This can be fixed by stripping the whitespace.
Modified code:
csvlist.write(line.strip() + ',' + medianstr + '\n')
Problem:
The problem is due to the fact that you are reading raw lines from the input file. A raw line ends with '\n' to indicate a new line for every line except the last, which simply ends with its final character.
For more details, just type print(repr(line)) before writing and look at the output.
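For example, a quick sketch of what repr reveals about a raw line (the input filename is hypothetical):
line = open('list.txt').readline()
print(repr(line))          # 'AWP%20%7C%20Asiimov%20%28Field-Tested%29\n' - note the trailing \n
print(repr(line.strip()))  # 'AWP%20%7C%20Asiimov%20%28Field-Tested%29'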