Removing duplicate lines and splitting a parallel corpus - python

I have a parallel translated corpus in English-French (text.en, text.fr);
each file contains around 500K lines (sentences in the source and target language). What I want to do is:
1- Remove the duplicated lines in both texts using a Python command, while avoiding any alignment problem between the two files. E.g.: if the command deletes line 32 in text.en, it must of course delete it in text.fr as well.
2- Then split both files into train/dev/test data: only 1K lines for dev, 1K for test, and the rest for train.
I need to split text.en and text.fr with the same command, so I can keep the alignment and the corresponding sentences in both files.
It would be better if I could extract the test and dev data randomly, as that would help get better results.
How can I do that? Please write the commands.
I appreciate any help, thank you!

If by "lines" you mean grammatical sentences, then you first need to split the text into sentences, and then delete each duplicate sentence from both lists at the same index so the two stay aligned:
Eng = "..."
Frn = "..."
GEngLines = Eng.split(".")
GFrnLines = Frn.split(".")

# When an English sentence has already been seen, delete it and its French
# counterpart at the same index, so the two lists stay aligned.
seen = set()
i = 0
while i < len(GEngLines):
    if GEngLines[i] in seen:
        del GEngLines[i]
        del GFrnLines[i]
    else:
        seen.add(GEngLines[i])
        i += 1

DevLinesNumber = 500
TestLinesNumber = 500
EngDevLines = []
EngTestLines = []
EngTrainLines = []
FrnDevLines = []
FrnTestLines = []
FrnTrainLines = []

for i in range(len(GEngLines)):
    if i < DevLinesNumber:
        EngDevLines.append(GEngLines[i])
        FrnDevLines.append(GFrnLines[i])
    elif i < DevLinesNumber + TestLinesNumber:
        EngTestLines.append(GEngLines[i])
        FrnTestLines.append(GFrnLines[i])
    else:
        EngTrainLines.append(GEngLines[i])
        FrnTrainLines.append(GFrnLines[i])
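For the full task in the question (remove duplicate lines from text.en/text.fr while keeping them aligned, then split off a random 1K dev set and a 1K test set), a minimal sketch along these lines may help; the output file names and the fixed random seed are just placeholder choices:

import random

# Read the two aligned files; line i of text.en corresponds to line i of text.fr.
with open("text.en", encoding="utf8") as f_en, open("text.fr", encoding="utf8") as f_fr:
    pairs = list(zip(f_en.read().splitlines(), f_fr.read().splitlines()))

# Drop duplicate sentence pairs while keeping the original order and alignment.
seen = set()
unique_pairs = []
for pair in pairs:
    if pair not in seen:
        seen.add(pair)
        unique_pairs.append(pair)

# Shuffle once, then carve off 1K for dev, 1K for test, and the rest for train.
random.seed(42)  # arbitrary seed, only so the split is reproducible
random.shuffle(unique_pairs)
dev, test, train = unique_pairs[:1000], unique_pairs[1000:2000], unique_pairs[2000:]

# Write each split back out as a pair of aligned .en/.fr files.
for name, split in [("dev", dev), ("test", test), ("train", train)]:
    with open(name + ".en", "w", encoding="utf8") as f_en, \
         open(name + ".fr", "w", encoding="utf8") as f_fr:
        for en, fr in split:
            f_en.write(en + "\n")
            f_fr.write(fr + "\n")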


Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a dataframe. For now I'm just trying to get the relevant data into an array and understand the line / readline() functionality in Python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
for line in unstr:
if a in line:
titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
for line in unstr:
if a in line:
list.append(line)
if b in line:
1. Concatenate this line with each line after it, until i reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.<br>
2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []

with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even though Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
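For completeness, a rough sketch of that iterator idea (not the approach recommended above), assuming the same sample.txt format:

texts = []
with open('sample.txt', encoding="utf8") as f:
    it = iter(f)
    try:
        line = next(it)
        while True:
            if line.startswith("Full text:"):
                # Collect lines until the "Subject:" line closes the article.
                full_text = line
                line = next(it)
                while not line.startswith("Subject:"):
                    full_text += line
                    line = next(it)
                texts.append(full_text)
            line = next(it)
    except StopIteration:
        # End of file reached while advancing the iterator.
        pass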
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np

# read the whole file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()

keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)

# split text on keys
chunks = re.split(regex, text)[1:]

# reshape the flat list of (key, value) records to group the infos of the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python

Is there a better way than using python-docx to extract text chunks from a large number of unstructured MS Word documents?

For a text-classification problem I need to extract a huge number of text chunks from Word documents. I need to write these chunks of text into a jsonlines file so the text can be annotated using an annotation tool. The function that I wrote is this one:
def writeParagraphsToFile(filename):
    # path2, writer and getDocID are defined elsewhere in the script
    textblocks = []
    paragraphs = []
    txtblock = ""
    para_ID = 0
    lineNr = 0
    startNum = re.compile(r'\A\d+')
    startsWithWhitespaces = re.compile(r'\A\s+')
    d = Document(path2 + filename)
    amountOfPara = len(d.paragraphs)
    for p in d.paragraphs:
        lineNr += 1
        if lineNr == 1:
            # the first line is the document title
            titel = p.text
        elif lineNr > 1 and startNum.match(p.text):
            # a paragraph starting with a number is a subtitle: close the current text block
            TuTitel = ""
            if txtblock != "":
                textblocks.append(txtblock)
                txtblock = ""
            for run in p.runs:
                if run.bold and run.underline:
                    TuTitel += " " + run.text
        elif lineNr > 1:
            if p.text == "":
                txtblock += "\n"
            elif len(p.text) > 6:
                txtblock += " " + p.text
        if lineNr == amountOfPara - 1:
            textblocks.append(txtblock)
    for tb in textblocks:
        paragraphs.append(tb.strip().splitlines())
    paragraphs_new = [p for p in paragraphs if p]
    for r in paragraphs_new:
        for t in r:
            if t != "":
                para_ID += 1
                writer.write({"text": t.lstrip(), "meta": {"Bestandsnaam": filename, "Doc_id": getDocID(filename), "Para_id": para_ID}})
This code works for one specific kind of Word document, namely those that start with one line containing the document title and whose subtitles start with a number and are written in "bold" and "underline" font.
The problem is: I have MANY documents, with MANY layouts. Some start with a title, some don't; some start with a title spanning two lines; some have actual "heading 1" / "heading 2" attributes for subtitles, but most of them don't. Most of the subtitles are manually put in "bold" and/or "underline" font (sometimes only one or the other). Some documents have a subtitle starting with a number and a sub-subtitle starting with a letter.
And then there is me, who barely has experience with all of this. I think I'll need to write a lot of "if-else" statements to be able to extract the text chunks from all of these documents, no matter which document the program gets as input, but I'm really wondering if there would be another (better) point of view from someone who is more experienced?
Any help would be greatly appreciated.
Here are some sample documents, all with a (slightly) different layout:
From this file I can already parse the text blocks with the code you can find above.
The next links are from other 'types of' documents. These are not the only ones but there are too many different kinds because different people worked on these documents and everybody has his way / knowledge on how to write a document.
https://www.scribd.com/document/436242588/example-1
https://www.scribd.com/document/436242590/example-2
https://www.scribd.com/document/436242592/example-3
https://www.scribd.com/document/436242589/example-4
https://www.scribd.com/document/436242591/example-5
https://www.scribd.com/document/436242593/example-6
I don't need subtitles or sub-subtitles in my JSON strings. I just need chunks of normal text, small "blocks" of text basically. For example in B, from subtitle 3, I would get 3 JSON strings with 1 text block in each of them. In example 4, all the parts that are separated by an empty line could each be 1 text block and so 1 JSON string. I hope this explanation makes my original question more clear.
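As a rough illustration of that last idea (grouping on empty paragraphs rather than on formatting), a sketch along these lines could work; the output file name, the jsonlines writer and the skipping of bold+underlined paragraphs are all assumptions, not a general solution:

import jsonlines
from docx import Document

def extract_blocks(path):
    """Split a .docx into text blocks separated by empty paragraphs."""
    blocks, current = [], []
    for p in Document(path).paragraphs:
        text = p.text.strip()
        runs = [r for r in p.runs if r.text.strip()]
        # Treat a paragraph whose runs are all bold+underlined as a (sub)title.
        is_title = bool(runs) and all(r.bold and r.underline for r in runs)
        if not text:
            # An empty paragraph closes the current block.
            if current:
                blocks.append(" ".join(current))
                current = []
        elif not is_title:
            current.append(text)
    if current:
        blocks.append(" ".join(current))
    return blocks

# Write one JSON line per text block, roughly like the writer.write(...) calls above.
with jsonlines.open("chunks.jsonl", mode="w") as writer:
    for para_id, block in enumerate(extract_blocks("example-4.docx"), start=1):
        writer.write({"text": block, "meta": {"Para_id": para_id}})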

How can I pull out text snippets around specific words?

I have a large txt file and I'm trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above.
def occurs(word1, word2, filename):
    import os
    infile = open(filename, 'r')   # opens the file, reads it, splits it into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2]      # this list allows for multiple words
    wordsString = ' '.join(lines)  # joins the lines back into one string
    words = wordsString.split()    # splits the string into individual words
    f = open(filename, 'w')
    f.write("start")
    f.write(os.linesep)
    for word in wordlist:
        matches = [i for i, w in enumerate(words) if w.lower().find(word) != -1]
        for m in matches:
            l = " ".join(words[m - 15:m + 16])
            f.write(f"...{l}...")  # writes the data to the external file
            f.write(os.linesep)
    f.close()
So far, when two of the same word are too close together, the program just doesn't run on one of them. Instead, I want to get out a longer chunk of text that extends 15 words behind the earliest of the nearby occurrences and 15 words in front of the latest one.
This snippet will get a given number of words around the chosen keyword. If several keywords appear close together, it will join them into one snippet:
s = '''xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above. xxx'''

words = s.split()

from itertools import groupby, chain

word = 'xxx'

def get_snippets(words, word, l):
    snippets, current_snippet, cnt = [], [], 0
    for v, g in groupby(words, lambda w: w != word):
        w = [*g]
        if v:
            if len(w) < l:
                current_snippet += [w]
            else:
                current_snippet += [w[:l] if cnt % 2 else w[-l:]]
                snippets.append([*chain.from_iterable(current_snippet)])
                current_snippet = [w[-l:] if cnt % 2 else w[:l]]
                cnt = 0
            cnt += 1
        else:
            if current_snippet:
                current_snippet[-1].extend(w)
            else:
                current_snippet += [w]

    if current_snippet[-1][-1] == word or len(current_snippet) > 1:
        snippets.append([*chain.from_iterable(current_snippet)])

    return snippets

for snippet in get_snippets(words, word, 15):
    print(' '.join(snippet))
Prints:
xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15
other, which I'm trying to get as one large snippet of text. I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working
topic. So far, I have working code for all instances except the scenario mentioned above. xxx
With the same data and a different length:
for snippet in get_snippets(words, word, 2):
    print(' '.join(snippet))
Prints:
xxx and I'm
I have xxx trying to
trying to xxx get chunks
mentioned above. xxx
As always, there is a variety of solutions available here. A fun one would be a recursive wordFind, where it searches the next 15 words and, if it finds the target word, calls itself.
A simpler, though perhaps not efficient, solution would be to add words one at a time:
for m in matches:
    l = words[max(m - 15, 0):m]           # start with the 15 words before the match
    i = 0
    while i < 16 and m + i < len(words):
        l.append(words[m + i])
        if i > 0 and words[m + i].lower() == word:
            # another target word found: restart the 15-word countdown from it
            m, i = m + i, 0
        i += 1
    f.write("..." + " ".join(l) + "...")  # writes the data to the external file
    f.write(os.linesep)
Or if you want the subsequent occurrences to be absorbed into the current snippet rather than starting new ones...
extend = False
for m in matches:
    if not extend:
        l = words[max(m - 15, 0):m]       # start a new snippet with the 15 preceding words
        f.write("...")
    extend = False
    i = 1
    while i < 16 and m + i < len(words):
        if words[m + i].lower() == word:
            # the next occurrence is within 15 words: keep the snippet open and extend it
            l.extend(words[m:m + i])
            extend = True
            break
        i += 1
    if not extend:
        l.extend(words[m:m + 16])
        f.write(" ".join(l) + "...")      # close the snippet
        f.write(os.linesep)
Note: I have not tested this, so it may require a bit of debugging. But the gist is clear: add words piecemeal, and extend the addition process when a target word is encountered. This also allows you to extend with other target words besides the current one, with a small addition to the second conditional.

Python: is there a maximum number of values the write() function can process?

I'm new to Python, so I would be thankful for any help...
My problem is the following:
I wrote a program in Python analysing gene sequences from a huge database (more than 600 genes). With the help of the write() function, the program should write the results to a text file, one result per gene. When I open my output file, there are only the first genes, followed by "...", followed by the last genes.
Is there a maximum this function can process? How do I make Python write all the results?
relevant part of code:
fasta_df3 = pd.read_table(fasta_out3, delim_whitespace=True,
                          names=('qseqid', 'sseqid', 'evalue', 'pident'))
fasta_df3_sorted = fasta_df3.sort_values(by='qseqid', ascending=True)
fasta_df3_grouped = fasta_df3_sorted.groupby('qseqid')

for qseqid, fasta_df3_sorted in fasta_df3_grouped:
    subj3_pident_max = str(fasta_df3_grouped['pident'].max())
    subj3_pident_min = str(fasta_df3_grouped['pident'].min())
    current_gene = str(qseqid)
    with open(dir_output + outputall_file + ".txt", "a") as gene_list:
        gene_list.write("\n" + "subj3: {} \t {} \t {}".format(current_gene,
                        subj3_pident_max, subj3_pident_min))
    gene_list.close()
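One hedged observation, not confirmed by the thread: write() itself has no such limit, and the "..." looks like pandas truncating the string form of a whole Series, because fasta_df3_grouped['pident'].max() returns the maxima of every group at once rather than of the current group. A sketch of a per-group variant, reusing the names from the code above:

import pandas as pd

# Assumption: the intent is one line per gene, so take max/min of the current
# group's 'pident' values instead of the whole grouped object.
fasta_df3 = pd.read_table(fasta_out3, delim_whitespace=True,
                          names=('qseqid', 'sseqid', 'evalue', 'pident'))
fasta_df3_grouped = fasta_df3.sort_values(by='qseqid').groupby('qseqid')

# Open the output file once and write one line per gene.
with open(dir_output + outputall_file + ".txt", "a") as gene_list:
    for qseqid, group in fasta_df3_grouped:
        gene_list.write("\nsubj3: {} \t {} \t {}".format(
            qseqid, group['pident'].max(), group['pident'].min()))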

Trying to read text file and count words within defined groups

I'm a novice Python user. I'm trying to create a program that reads a text file and searches that text for certain words that are grouped (which I predefine by reading them from a CSV). For example, if I wanted to create my own definition for "positive" containing the words "excited", "happy", and "optimistic", the CSV would contain those terms. I know the code below is messy: the txt file I am reading contains 7 occurrences of the three "positive" tester words I read from the CSV, yet the result prints out as 25. I think it's returning a character count, not a word count. Code:
import csv
import string
import re
from collections import Counter

remove = dict.fromkeys(map(ord, '\n' + string.punctuation))

# Read the .txt file to analyze.
with open("test.txt", "r") as f:
    textanalysis = f.read()
    textresult = textanalysis.lower().translate(remove).split()

# Read the CSV list of terms.
with open("positivetest.csv", "r") as senti_file:
    reader = csv.reader(senti_file)
    positivelist = list(reader)

# Convert term list into flat chain.
from itertools import chain
newposlist = list(chain.from_iterable(positivelist))

# Convert chain list into string.
posstring = ' '.join(str(e) for e in newposlist)
posstring2 = posstring.split(' ')
posstring3 = ', '.join('"{}"'.format(word) for word in posstring2)

# Count number of words as defined in list category
def positive(str):
    counts = dict()
    for word in posstring3:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    total = sum(counts.values())
    return total

# Print result; will write to CSV eventually
print("Positive: ", positive(textresult))
I'm a beginner as well, but I stumbled upon a process that might help. After you read in the file, split the text at every space, tab, and newline. In your case, I would keep all the words lowercase and include punctuation in your split call. Save this as an array and then parse it with some sort of loop to get the number of instances of each 'positive' (or other) word.
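A rough sketch of that counting idea, assuming the same test.txt and positivetest.csv files as in the question; Counter is one possible way to tally the instances:

import csv
import string
from collections import Counter

# Split the text into lowercase words, stripping punctuation and newlines.
remove = str.maketrans('', '', '\n' + string.punctuation)
with open("test.txt", "r") as f:
    words = f.read().lower().translate(remove).split()

# Flatten the CSV of "positive" terms into a plain set of words.
with open("positivetest.csv", "r") as senti_file:
    positive_words = {term.strip().lower()
                      for row in csv.reader(senti_file) for term in row}

# Count how often each positive word occurs, then total them.
counts = Counter(w for w in words if w in positive_words)
print("Positive:", sum(counts.values()))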
Look at this, specifically the "train" function:
https://github.com/G3Kappa/Adjustable-Markov-Chains/blob/master/markovchain.py
Also look at this link; ignore the JSON stuff at the beginning, as the article talks about sentiment analysis:
https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-
Same applies with this link:
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Good luck!
I looked at your code and ran some of my own as a sample.
I have two ideas for you, based on what I think you may want.
First assumption: you want a basic sentiment count?
Getting to 'textresult' is great. Then you did the same with the positive lexicon, producing [positivelist], which I thought was exactly the right move? But then you converted [positivelist] into what is essentially one big sentence.
Would you not just (a rough sketch follows below):
1. Pass a 'stop_words' list over [textresult]
2. Merge the two dataframes [textresult (less stopwords) and positivelist] on common words, as in an 'inner join'
3. Then basically do your term frequency
4. It is much easier to aggregate the score that way
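A minimal sketch of that inner-join idea, assuming textresult is the word list from the question and the lexicon has already been flattened to plain words (the question's newposlist); the stop-word list here is just a placeholder:

from collections import Counter

# Placeholder stop-word list; in practice a fuller list (e.g. NLTK's) would be used,
# but any iterable of words works here.
stop_words = {"the", "a", "an", "and", "or", "to", "of", "is", "i", "it"}

# Drop stop words from the text.
filtered = [w for w in textresult if w not in stop_words]

# "Inner join": keep only the words that also appear in the positive lexicon.
positive_set = set(newposlist)
common = [w for w in filtered if w in positive_set]

# Term frequency of the matched words, and an aggregate score.
term_freq = Counter(common)
print(term_freq)
print("Positive score:", sum(term_freq.values()))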
Second assumption: you are focusing on "excited", "happy", and "optimistic", and you are trying to isolate text themes into those three categories?
1. Again, stop at [textresult]
2. Download the 'nrc' and/or 'syuzhet' emotional valence dictionaries; they break emotive words down into 8 emotional groups, so if you only want 3 of the 8 groups you can take a subset
3. Process it like you did to get [positivelist]
4. Do another join
Sorry, this is a bit hashed up, but if I was anywhere near what you were thinking, let me know and we can make contact.
Second apology: I'm also a novice Python user; in the above I am adapting what I use in R to Python (and it's not subtle either :) ).
