want to merge specific code lines from 2 text files - python

For starters, I'm actually a medical student, so I don't know the first thing about programming, but I found myself in desperate need of this, so pardon my complete ignorance of the subject.
I have 2 XML files containing text, each one nearly 2 million lines long. The first one looks like this:
<TEXT>
<Unknown1>-65535</Unknown1>
<autoId>1</autoId>
<autoId2>0</autoId2>
<alias>Name2.Boast_Duel_Season01_sudden_death_1vs1</alias>
<original>Уникальная массовая дуэль: Битва один на один до полного уничтожения в один раунд</original>
</TEXT>
<TEXT>
<Unknown1>-65535</Unknown1>
<autoId>2</autoId>
<autoId2>0</autoId2>
<alias>Name2.Boast_Duel_Season01_sudden_death_3vs3</alias>
<original>Уникальная массовая дуэль: Битва трое на трое до полного уничтожения в один раунд</original>
and the second one looks like this:
<TEXT>
    <Unknown1>-65535</Unknown1>
    <autoId>1</autoId>
    <autoId2>0</autoId2>
    <alias>Name2.Boast_Duel_Season01_sudden_death_1vs1</alias>
<replacement>Unique mass duel one on one battle to the complete destruction of one round</replacement>
  </TEXT>
  <TEXT>
    <Unknown1>-65535</Unknown1>
    <autoId>2</autoId>
    <autoId2>0</autoId2>
    <alias>Name2.Boast_Duel_Season01_sudden_death_3vs3</alias>
    <replacement>Unique mass duel Battle three against three to the complete destruction of one round</replacement>
  </TEXT>
Those blocks are repeated throughout the files about half a million times, which is how the files end up at the 2 million lines I mentioned.
What I need to do is merge both files so the final product looks like this:
<TEXT>
    <Unknown1>-65535</Unknown1>
    <autoId>1</autoId>
    <autoId2>0</autoId2>
    <alias>Name2.Boast_Duel_Season01_sudden_death_1vs1</alias>
<original>Уникальная массовая дуэль: Битва один на один до полного уничтожения в один раунд</original>
    <replacement>Unique mass duel one on one battle to the complete destruction of one round</replacement>
  </TEXT>
  <TEXT>
    <Unknown1>-65535</Unknown1>
    <autoId>2</autoId>
    <autoId2>0</autoId2>
    <alias>Name2.Boast_Duel_Season01_sudden_death_3vs3</alias>
<original>Уникальная массовая дуэль: Битва трое на трое до полного уничтожения в один раунд</original>
    <replacement>Unique mass duel Battle three against three to the complete destruction of one round</replacement>
  </TEXT>
So, basically, I want to add the "replacement" line under each respective "original" line while the rest of the file is kept intact (it's the same in both files). Doing this manually would take me about 2 weeks, and I only have 1 day to do it!
Any help is appreciated, and again, sorry if I sound like a total idiot at this, because I kind of am!
P.S. I can't even choose a proper tag! I'll totally understand if I just get lashed in the answers now; this job is way too big for me!

The truth about "where to start" is to learn basic Python string manipulation. I was feeling nice and I like these sorts of problems, however, so here's a (quick and dirty) solution. The only things you'll need to change are the "original.xml" and "replacement.xml" file names. You'll also need a working Python installation, of course; that's up to you to figure out.
A couple of caveats about my code:
Parsing XML is a solved problem. Using regular expressions for it is frowned upon, but it works, and when the input is as simple and fixed as this, it really doesn't matter.
I made a few assumptions when building the output XML file (for example, an indentation style of 4 spaces), but it outputs valid XML. The application you're using should play nicely with it.
import re

def loadfile(filename):
    '''
    Returns a string containing all data from the file.
    '''
    infile = open(filename, 'r')
    infile_string = infile.read()
    infile.close()
    return infile_string

def main():
    # load the files into strings
    original = loadfile("original.xml")
    replacement = loadfile("replacement.xml")

    # grab all of the "replacement" lines from the replacement file
    replacement_regex = re.compile("(<replacement>.*?</replacement>)")
    replacement_list = replacement_regex.findall(replacement)

    # grab all of the "TEXT" blocks from the original file
    original_regex = re.compile("(<TEXT>.*?</TEXT>)", re.DOTALL)
    original_list = original_regex.findall(original)

    # a string to write out to the new file
    outfile_string = ""
    to_find = "</original>"  # the replacement text gets appended after this point
    additional_len = len(to_find)

    for i in range(len(original_list)):  # loop through all of the original text blocks
        # build a new string with the replacement text after the original
        build_string = ""
        build_string += original_list[i][:original_list[i].find(to_find) + additional_len]
        build_string += "\n" + " " * 4
        build_string += replacement_list[i]
        build_string += "\n</TEXT>\n"
        outfile_string += build_string

    # write the outfile string out to a file
    outfile = open("outfile.txt", 'w')
    outfile.write(outfile_string)
    outfile.close()

if __name__ == "__main__":
    main()
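Since I called regex parsing "frowned upon" above, here's roughly what a proper-parser version could look like. This is a sketch with tiny inline samples standing in for the real files; it assumes the files are bare sequences of <TEXT> blocks with no single root element, so the fragments get wrapped in a dummy root before parsing:

```python
import xml.etree.ElementTree as ET

# Tiny inline samples standing in for the real files (assumption: the real
# files are bare sequences of <TEXT> blocks with no single root element).
original_text = "<TEXT><autoId>1</autoId><alias>a</alias><original>ru</original></TEXT>"
replacement_text = "<TEXT><autoId>1</autoId><alias>a</alias><replacement>en</replacement></TEXT>"

def parse_blocks(text):
    # Wrap the fragment in a dummy root so ElementTree can parse it.
    return ET.fromstring("<root>" + text + "</root>").findall("TEXT")

originals = parse_blocks(original_text)
for block, rep in zip(originals, parse_blocks(replacement_text)):
    # Move the <replacement> element to the end of the matching <TEXT> block,
    # which puts it right after <original>.
    block.append(rep.find("replacement"))

# The merged block now contains both <original> and <replacement>.
print(ET.tostring(originals[0], encoding="unicode"))
```

For real files you would read each one with open(...).read() and write the result back out with ET.ElementTree; the wrapping trick is the only unusual part.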
Edit (reply to comment): The IndexError: list index out of range error means the regex isn't working properly (it's not finding exactly the right number of replacement lines and grabbing each one into the list). I tested what I wrote on the blurbs you provided, so there's a discrepancy between those blurbs and the full-blown XML files. If there aren't the same number of original/replacement tags, or anything like that, the code will break. It's impossible for me to figure out without access to the files themselves.
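One way to check that discrepancy yourself is to count the tags in each file before merging. A minimal sketch (the file names in the commented usage match the ones assumed by the script above):

```python
import re

def count_tags(text, tag):
    """Count occurrences of an opening XML tag in a string."""
    return len(re.findall("<%s>" % tag, text))

# Hypothetical usage against the real files; the merge can only line up
# if all three counts are equal:
# original = open("original.xml").read()
# replacement = open("replacement.xml").read()
# print(count_tags(original, "TEXT"),
#       count_tags(original, "original"),
#       count_tags(replacement, "replacement"))
```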

Here I present a straightforward way to do that (without XML parsing).
def parse_org(file_handle):
    record = None
    for line in file_handle:
        if "<TEXT>" in line:
            record = line  # start a new record on finding the tag <TEXT>
        elif "</TEXT>" in line:
            yield record  # end the record on finding the tag </TEXT>
            record = None
        elif record is not None:
            record += line

def parse_rep(file_handle):
    record = None
    for line in file_handle:
        if "<TEXT>" in line:
            record = None
        elif "</TEXT>" in line:
            yield record
            record = None
        elif "<replacement>" in line:
            record = line

if __name__ == "__main__":
    original_file = open("filepath/original.xml")
    replacement_file = open("filepath/replacement.xml")
    a_new_file = open("result_file", "w")
    # zip pairs the two generators up record by record and stops
    # automatically when either file runs out of blocks
    for org, rep in zip(parse_org(original_file), parse_rep(replacement_file)):
        a_new_file.write(org + rep + "</TEXT>\n")
    a_new_file.close()
    original_file.close()
    replacement_file.close()
The code is written in Python, and it uses the keyword yield. Use http://www.codecademy.com/ if you want to learn Python, and google "yield python" to learn how yield works. If you'd like to process text files like this in the future, you should learn a scripting language; Python may be the easiest one. If you run into questions you can post them on this website, but don't do nothing and just ask "write this program for me".
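To illustrate what yield does (a toy example of mine, unrelated to the XML files): a function containing yield returns a generator, which pauses at each yield and resumes when the next value is requested:

```python
def countdown(n):
    """Yields n, n-1, ..., 1, pausing between values."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
print(next(gen))   # 3  -- runs until the first yield
print(list(gen))   # [2, 1]  -- consumes the rest
```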

Related

Removing '\n' from a string without using .translate, .replace or strip()

I'm making a simple text-based game as a learning project. I'm trying to add a feature where the user can input 'save' and their stats will be written onto a txt file named 'save.txt' so that after the program has been stopped, the player can then upload their previous stats and play from where they left off.
Here is the code for the saving:
The user inputs 'save' and the class attributes are saved to the text file as text, one line at a time:
elif first_step == 'save':
    f = open("save.txt", "w")
    f.write(f'''{player1.name}
{player1.char_type}
{player1.life}
{player1.energy}
{player1.strength}
{player1.money}
{player1.weapon_lvl}
{player1.wakefulness}
{player1.days_left}
{player1.battle_count}''')
    f.close()
But, I also need the user to be able to load their saved stats next time they run the game. So they would enter 'load' and their stats will be updated.
I'm trying to read the text file one line at a time, with the value of each line becoming the value of the relevant class attribute, in order, one at a time. If I do this without converting it to a string first I get issues, such as some lines being skipped because Python reads 2 lines as one and puts them together as a list.
So, I tried the following:
In the example below, I'm only showing the data for the class attributes player1.name and player1.char_type (seen above), so as not to make this question too long.
elif first_step == 'load':
    f = open("save.txt", 'r')
    player1.name_saved = f.readline()  # reads the first line of the text file and assigns its value to player1.name_saved
    player1.name_saved2 = str(player1.name_saved)  # converts the value of player1.name_saved to a string and saves it in player1.name_saved2
    player1.name = player1.name_saved2  # assigns that value to the class attribute player1.name
    player1.char_type_saved = f.readlines(1)  # reads the second line of the txt file and saves it in player1.char_type_saved
    player1.char_type_saved2 = str(player1.char_type_saved)  # converts the value of player1.char_type_saved into a string and assigns it to player1.char_type_saved2
At this point, I would assign the value of player1.char_type_saved2 to the class attribute player1.char_type so that the player can load the character type from the last time they played the game. This should make player1.char_type equal to 'Wizard', but I'm getting '['Wizard\n']' instead.
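A note on where those brackets come from (my diagnosis, using io.StringIO as a stand-in for save.txt): f.readlines(1) returns a list of lines, not a single string, so str() turns it into "['Wizard\n']". f.readline() returns one line as a plain string, and strip() removes the trailing newline:

```python
import io

f = io.StringIO("Avery\nWizard\n100\n")  # stand-in for save.txt

first = f.readline()     # 'Avery\n' -- a plain string
second = f.readlines(1)  # ['Wizard\n'] -- a LIST holding one line
print(str(second))       # the brackets and quotes come from the list

# Reading every line with readline() and strip() avoids the problem:
f.seek(0)
name = f.readline().strip()       # 'Avery'
char_type = f.readline().strip()  # 'Wizard'
```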
I tried the following to remove the brackets and \n:
final_player1.char_type = player1.char_type_saved2.translate({ord(c): None for c in "[']\n" }) #this is intended to remove everything from the string except for Wizard
For some reason, the above only removes the square brackets and punctuation marks but not \n from the end.
I then tried the following to remove \n:
final_player1.char_type = final_player1.char_type.replace("\n", "")
final_player1.char_type is still 'Wizard\n'
I've also tried using strip() but I've been unsuccessful.
If anyone could help me with this I would greatly appreciate it. Sorry if I have overcomplicated this question but it's hard to articulate it without lots of info. Let me know if this is too much or if more info is needed to answer.
If '\n' is always at the end it may be best to use:
s = 'wizard\n'
s = s[:-1]
print(s, s)
Output:
wizard wizard
But I still think strip() is best:
s = 'wizard\n'
s = s.strip()
print(s, s)
Output:
wizard wizard
Normally it should work with just:
char_type = "Wizard\n"
char_type = char_type.replace("\n", "")
print(char_type)
The output will be "Wizard". Note that str.replace returns a new string rather than modifying the original (strings are immutable in Python), so you have to assign the result back; that's also why your own replace attempt appeared to do nothing.

Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a dataframe. For now I'm just trying to get the relevant data into an array and understand the line/readline() functionality in Python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the
            #    line that includes "Subject:". Ignore the "Subject:" line, stop the
            #    "Full text:" subloop, and add the concatenated full text to the list.
            # 2. Continue the for loop within which all of this sits.
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords, even though Python technically doesn't, just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions; it would, however, let you use a more classic while-loop for the whole thing. Myself, I would stick with the state-machine approach above and just make it sufficiently robust to deal with all your reasonably expected edge cases.
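For completeness, here is roughly what that iterator version could look like (my own sketch, collecting just the titles):

```python
def read_titles(path):
    """Collect 'Title:' lines using an explicit iterator and a while-loop."""
    titles = []
    with open(path, encoding="utf8") as f:
        it = iter(f)
        while True:
            try:
                line = next(it)
            except StopIteration:
                break  # end of file reached
            if line.startswith("Title:"):
                titles.append(line.rstrip("\n"))
    return titles
```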
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read all file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python

What is the most effective way to compare strings using python in two very large files?

I have two large text files with about 10k lines in each. Each line has a unique string in the same position that needs to be compared with all the other strings in the other file to see if it matches, and if not, print it out. I'm not sure how to do this in a way that makes sense time-wise since the files are so large. Here's an example of the files.
File 1:
https://www.exploit-db.com/exploits/10185/
https://www.exploit-db.com/exploits/10189/
https://www.exploit-db.com/exploits/10220/
https://www.exploit-db.com/exploits/10217/
https://www.exploit-db.com/exploits/10218/
https://www.exploit-db.com/exploits/10219/
https://www.exploit-db.com/exploits/10216/
file 2:
EXPLOIT:10201 CVE-2009-4781
EXPLOIT:10216 CVE-2009-4223
EXPLOIT:10217 CVE-2009-4779
EXPLOIT:10218 CVE-2009-4082
EXPLOIT:10220 CVE-2009-4220
EXPLOIT:10226 CVE-2009-4097
I want to check if the numbers at the end of the first file match any of the numbers after EXPLOIT:
As others have said, 10k lines aren't a problem for computers with gigabytes of memory. The important steps are:
figure out how to get the identifier out of lines in the first file
do the same for the second file
put them together to loop over the lines in each file and produce your output
Regular expressions are made for working with text like this. I get regexes that look like /([0-9]+)/$ and :([0-9]+) for the two files (services like https://regex101.com/ are great for experimenting).
you can put these together in Python by doing:
from sys import stderr
import re

# collect all exploits for easy matching
exploits = {}
for line in open('file_2'):
    m = re.search(r':([0-9]+) ', line)
    if not m:
        print("couldn't find an id in:", repr(line), file=stderr)
        continue
    [id] = m.groups()
    exploits[id] = line

# match them up
for line in open('file_1'):
    m = re.search(r'/([0-9]+)/$', line)
    if not m:
        print("couldn't find an id in:", repr(line), file=stderr)
        continue
    [id] = m.groups()
    if id in exploits:
        pass  # print(line, 'matched with', exploits[id])
    else:
        print(line)
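If you only need the unmatched entries, a set difference makes the same idea even shorter. A sketch under the same assumptions about the two formats (the function and its name are mine, not from the answer above):

```python
import re

def unmatched_urls(urls, exploit_lines):
    """Return URLs whose trailing ID has no matching EXPLOIT:<id> entry."""
    # Collect every ID mentioned in the exploit file into a set.
    known = set()
    for line in exploit_lines:
        m = re.search(r':([0-9]+)', line)
        if m:
            known.add(m.group(1))
    # Keep only the URLs whose final path segment is not in that set.
    result = []
    for url in urls:
        m = re.search(r'/([0-9]+)/$', url.rstrip())
        if m and m.group(1) not in known:
            result.append(url)
    return result
```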

extract a certain quote after a keyword has been detected in Python 3

I'm trying to make a multi-term definer to quicken the process of searching for the definitions individually.
After python loads a webpage, it saves the page as a temporary text file.
Sample of saved page: ..."A","Answer":"","Abstract":"Harriet Tubman was an American abolitionist.","ImageIs...
In this sample, I'm after the string that contains the definition, in this case Harriet Tubman. The string "Abstract": is the portion always before the definition of the term.
What I need is a way to scan the text file for "Abstract":. Once that has been detected, look for an opening ". Then, copy and save all text to another text file until reaching the end ".
If you just wanted to find the string following "Abstract:" you could take a substring.
page = '..."A","Answer":"","Abstract":"Harriet Tubman was an American abolitionist.","ImageIs...'
i = page.index("Abstract") + 11
defn = page[i: page.index("\"", i)]
If you wanted to extract multiple parts of the page you should try the following.
dict_str = '"Answer":"","Abstract":"Harriet Tubman was an American abolitionist."'
definitions = {}
for kv in dict_str.split(","):
    parts = kv.replace("\"", "").split(":")
    if len(parts) != 2:
        continue
    definitions[parts[0]] = parts[1]

definitions['Abstract']  # 'Harriet Tubman was an American abolitionist.'
definitions["Answer"]    # ''
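One more observation of mine: the saved page looks like a fragment of JSON, so if you can capture a complete object, the json module is more robust than hand-splitting on commas and quotes (the page string below is a hypothetical completed version of the sample):

```python
import json

# Hypothetical complete JSON object in the style of the saved page.
page = '{"Answer": "", "Abstract": "Harriet Tubman was an American abolitionist."}'
data = json.loads(page)
print(data["Abstract"])  # Harriet Tubman was an American abolitionist.
```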

Split txt file into multiple new files with regex

I am calling on the collective wisdom of Stack Overflow because I am at my wits end trying to figure out how to do this and I'm a newbie self-taught coder.
I have a txt file of Letters to the Editor that I need to split into their own individual files.
The files are all formatted in relatively the same way with:
For once, before offering such generous but the unasked for advice, put yourselves in...
Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...
Why is it that The Times does not urge totalitarian Arab slates and terrorist...
PAUL STONEHILL Los Angeles
There you go again. Your editorial again makes groundless criticisms of the Israeli...
On Dec. 7 you called proportional representation “bizarre," despite its use in the...
Proportional representation distorts Israeli politics? Huh? If Israel changes the...
MATTHEW SHUGART Laguna Beach
Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...
Although the mayor did not support Proposition U (the slow-growth initiative) his...
If West Los Angeles is any indication of the no-growth policy, where do we go from here?
MARJORIE L. SCHWARTZ Los Angeles
I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.
I have tried quite a few different approaches but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word (for example the answers posted here: how to split single txt file into multiple txt files by Python, and here: Python read through file until match, read until next pattern). None of them seem to work once I adjust them to accept my all-capital-words regex.
The closest I've managed to get is the code below. It creates the right number of files. But after the second file is created it all goes wrong. The third file is empty and in all the rest the text is all out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc or missing entirely.
import re

thefile = raw_input('Filename to split: ')
name_occur = []
full_file = []
pattern = re.compile("^[A-Z]{4,}")

with open(thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line)

totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)", "", thefile)

while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
        if pattern.search(line):
            doIHaveToCopyTheLine = True
            ignoreLines = True
    letters += 1
    f1.close()
I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.
I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went for first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. This will then write each letter to new files with the same name as the original file (note: it assumes your file has an extension like .txt or such) but with an incremented integer appended. Try it out and see how it works for you.
import string

def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)
    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1
        else:
            current_letter.append(line)
I tested it on your sample input and it worked fine.
While the other answer is suitable, you may still be curious about using a regex to split up a file.
import re

smallfile = None
buf = ""
with open('input_file.txt', 'rt') as f:
    for line in f:
        buf += str(line)
        if re.search(r'^([A-Z\s\.]+\b)', line) is not None:
            if smallfile:
                smallfile.close()
            match = re.findall(r'^([A-Z\s\.]+\b)', line)
            smallfile_name = '{}.txt'.format(match[0])
            smallfile = open(smallfile_name, 'w')
            smallfile.write(buf)
            buf = ""
if smallfile:
    smallfile.close()
If you run on Linux, use csplit.
Otherwise, check out these two threads:
How can I split a text file into multiple text files using python?
How to match "anything up until this sequence of characters" in a regular expression?
