Compare two XML files, preferably using Python

I need to compare two XML files having the same structure for any extra lines. The example below provides some details:
File 1:
<Log1> Some text here </Log1>
<Log2>
<Id1> Some text here </Id1>
<Id2> Some text here </Id2>
</Log2>
File 2:
<Log1> Some text here </Log1>
<Log2>
<Id1> Some text here </Id1>
<Id2> Some text here </Id2>
</Log2>
<Log3> Some text here </Log3>
I need the diff to identify here that there is one extra tag in File 2. Is there an efficient way to do this in Python?
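One approach, sketched with the standard library: count the element paths in each file with `xml.etree.ElementTree` and diff the counts. Since the snippets above have several top-level tags, the sketch wraps each file's content in a synthetic root before parsing; adjust if your real files already have a single root.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_paths(xml_text):
    """Return a Counter of every element path in the document."""
    # Wrap in a synthetic root so fragments with multiple
    # top-level tags (as in the example) still parse.
    root = ET.fromstring('<root>' + xml_text + '</root>')
    counts = Counter()
    def walk(elem, prefix):
        for child in elem:
            path = prefix + '/' + child.tag
            counts[path] += 1
            walk(child, path)
    walk(root, '')
    return counts

file1 = '''<Log1> Some text here </Log1>
<Log2>
<Id1> Some text here </Id1>
<Id2> Some text here </Id2>
</Log2>'''

file2 = file1 + '\n<Log3> Some text here </Log3>'

extra = tag_paths(file2) - tag_paths(file1)
print(dict(extra))  # → {'/Log3': 1}
```

Counter subtraction keeps only paths that occur more often in File 2, which is exactly the "extra lines" you are after; it ignores text content and ordering by design.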

Related

Using a text file with strings and searching for the strings in files in folders

I'm working on searching for strings inside text files. What I have is a CSV file with multiple lines of a single word each. Now I need to search files in multiple folders and subfolders for the words in this CSV file. In the end I would like to dump the results into a text file; the results should have the original word and the name of the file the string was found in. How do you loop through a CSV file of strings while searching files for those strings? I've only come across individual Python programs that search for one string in a folder and then print the results. I've modified one of these to print to a file, but I am having trouble looping through a CSV search-string file.
I suggest the following approach: read the CSV file and create the list of search words. Then create a regular expression out of them, matching any of these words:
regexp = re.compile('(' + '|'.join(map(re.escape, words)) + ')')  # re.escape guards against regex metacharacters in the words
Then go through the files using os.walk and apply the regexp to them using re.search.

Splitting PDF files into Paragraphs

I have a question regarding splitting PDF files. Basically, I have a collection of PDF files that I want to split by paragraph, so that each paragraph of a PDF file becomes a file of its own. I would appreciate your help with this, preferably in Python, but if that is not possible any language will do.
You can use pdftotext for the above, wrapping it in a Python subprocess. Alternatively, you could use another library that already does this implicitly, such as textract. Here is a quick example. Note: I have used 4 spaces as the delimiter to convert the text to a paragraph list; you might want to use a different technique.
import re
import textract

# Read the content of the PDF as text (textract.process returns bytes).
text = textract.process('file_name.pdf').decode('utf-8')

# Use four or more whitespace characters as the paragraph delimiter
# to convert the text into a list of paragraphs.
print(re.split(r'\s{4,}', text))
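The pdftotext route mentioned above can be sketched as follows. It assumes the pdftotext CLI (from poppler-utils) is installed, and it splits on blank lines rather than four spaces; which delimiter works depends on your PDFs.

```python
import re
import subprocess

def split_paragraphs(text):
    """Split extracted text on blank lines, dropping empty chunks."""
    return [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]

def pdf_paragraphs(pdf_path):
    """Extract text with the pdftotext CLI and split it into paragraphs."""
    # '-' sends the extracted text to stdout instead of a file.
    completed = subprocess.run(
        ['pdftotext', pdf_path, '-'],
        capture_output=True, text=True, check=True,
    )
    return split_paragraphs(completed.stdout)
```

Writing each paragraph to its own file is then `for i, p in enumerate(pdf_paragraphs('doc.pdf')):` with one `open(f'para_{i}.txt', 'w')` per iteration.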

How to extract text from several .txt files with Python?

I'm relatively new to programming and using Python, and I couldn't find anything on here that quite answered my question. Basically what I'm looking to do is extract a certain section of about 150 different .txt files and collect each of these pieces into a single .txt file.
Each of the .txt files contains DNA sequence alignment data, and each file basically reads out several dozen different possible sequences. I'm only interested in one of the sequences in each file, and I want to be able to use a script to excise that sequence from all of the files and combine them into a single file that I can then feed into a program that translates the sequences into protein code. Really what I'm trying to avoid is having to go one by one through each of the 150 files and copy/paste the desired sequence into the software.
Does anyone have any idea how I might do this? Thanks!
Edit: I tried to post an image of one of the text files, but apparently I don't have enough "reputation."
Edit2: Hi y'all, I'm sorry I didn't get back to this sooner. I've uploaded the image, here's a link to the upload: http://imgur.com/k3zBTu8
I'm assuming you have 150 FASTA files, and in each file there is a sequence ID whose sequence you want. You can use the Biopython module to do this. Put all 150 files in a folder such as "C:\seq_folder" (the folder should not contain any other files, and the txt files should not be open).
import os
from Bio import SeqIO

os.chdir('C:\\seq_folder')  # change the working directory so Python can find the txt files
seq_id = 'x'  # replace with the sequence id whose sequence you want
txt_list = os.listdir('C:\\seq_folder')
result = open('result.fa', 'w')
for item in txt_list:
    with open(item, 'r') as file:
        for record in SeqIO.parse(file, 'fasta'):
            if record.id == seq_id:
                result.write('>' + record.id + '\n')
                result.write(str(record.seq) + '\n')
result.close()
This code will produce a FASTA file, 'result.fa', containing the sequence for your desired ID from all the files. You can also translate the sequences into protein using the Biopython module.
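For the protein step, Biopython's Seq objects have a .translate() method. If you want to see what that does under the hood, or avoid the dependency, a stdlib-only sketch using the standard codon table looks like this:

```python
from itertools import product

# Standard codon table, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = dict(zip((''.join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))

def translate(dna):
    """Translate a DNA sequence to protein, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == '*':  # stop codon
            break
        protein.append(aa)
    return ''.join(protein)

print(translate('ATGGCCATTGTAATG'))  # → MAIVM
```

With Biopython installed, the equivalent is simply `record.seq.translate()` on each record collected above.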

Importing Text Files To Python And Spintax Then Outputting Text

I'm stuck on how to accomplish this. What I have is a folder of, say, 10 text files, and in each text file is an article done in spintax, for example: {This is|Here is} {My|Your|Their} {Articles|Post}
Each text file contains a full article with paragraphs in spintax. What I want to do is randomly grab one article from that folder, spin it from the spintax, and then output/save it as a new text file or append it to a file.
I've tried to find some examples of how to do this but have had no success.
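One way to sketch the spinning step, assuming the usual {option|option|...} spintax format shown above: repeatedly replace each innermost group with a randomly chosen option.

```python
import random
import re

# Matches an innermost {option|option|...} group (no nested braces inside).
SPIN = re.compile(r'\{([^{}]*)\}')

def spin(text, rng=random):
    """Resolve spintax by replacing each group with a randomly chosen option."""
    # Loop so nested groups, if any, are resolved from the inside out.
    while SPIN.search(text):
        text = SPIN.sub(lambda m: rng.choice(m.group(1).split('|')), text)
    return text

print(spin('{This is|Here is} {My|Your|Their} {Articles|Post}'))
```

Picking a random article from the folder is then `random.choice(os.listdir(folder))`, after which you write `spin(open(path).read())` to a new file or append it to an existing one.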

Is it possible to compare the values of a csv and text file in python?

I have a CSV file and a text file. Is it possible to compare the values in both files? Or should I have the values of both in a CSV file to make it easier?
is it possible to compare the values in both files?
Yes. You can open them both in binary mode and compare the bytes, or in text mode and compare the characters. Neither will be particularly useful, though.
or should i have the values of both in a csv file to make it easier?
Convert them both to list-of-lists format. For the CSV file, use a csv.reader. For the text file, use [line.split('\t') for line in open('filename.txt')] or whatever the equivalent is for your file format.
Yes, you can compare values from any N sources. You have to extract the values from each and then compare them. If you make your question more specific (the format of the text file for instance), we might be able to help you more.
CSV itself is of course text as well. And that's basically the problem when "comparing": there's no "text file standard". Even CSV isn't that strictly defined, and there's no normal form. For example, should a header be included? Is column ordering relevant?
How are fields separated in the textfile? Fixed width records? Newlines? Special markers (like csv)?
If you know the format of the textfile, you can read/parse it and compare the result with the csv file (which you will also need to read/parse of course), or generate csv from the textfile and compare that using diff.
