I have the text below. How can I extract the text between a given time range? I already have code that extracts all the values.
s = '''00:00:14,099 --> 00:00:19,100
a classic math problem a
00:00:17,039 --> 00:00:28,470
will come from an unexpected place
00:00:18,039 --> 00:00:19,470
00:00:20,039 --> 00:00:21,470
00:00:22,100 --> 00:00:30,119
binary numbers first I'm going to give
00:00:30,119 --> 00:00:35,430
puzzle and then you can try to solve it
00:00:32,489 --> 00:00:37,170
like I said you have a thousand bottles'''
Can I extract the text between 00:00:17,039 --> 00:00:28,470 and 00:00:30,119 --> 00:00:35,430?
Here is my code that extracts all the values:
import re

lines = s.split('\n')
dict = {}
current_key = None  # avoids a NameError if the text does not start with a timestamp line
for line in lines:
    is_key_match_obj = re.search(r'([\d:,]{12})(\s-->\s)([\d:,]{12})', line)
    if is_key_match_obj:
        current_key = is_key_match_obj.group()
        print(current_key)
        continue
    if current_key:
        if current_key in dict:
            if not line:
                dict[current_key] += '\n'
            else:
                dict[current_key] += line
        else:
            dict[current_key] = line
print(dict.values())
Expected output, from 00:00:17,039 --> 00:00:28,470 to 00:00:30,119 --> 00:00:35,430:
dict_values(['will come from an unexpected place ', '', '', "binary numbers first I'm going to give", ' puzzle and then you can try to solve it'])
No need to iterate line by line. Try the below code. It will give you a dictionary as you wanted.
import re

subtitle_dict = dict(re.findall(r'(\d{2}:\d{2}.*)\n(.*)', s))
print(subtitle_dict.values())
Output
dict_values(['a classic math problem a', 'will come from an unexpected place', '', '', "binary numbers first I'm going to give", 'puzzle and then you can try to solve it', 'like I said you have a thousand bottles'])
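If you only need the entries between the two timestamps from the question, you can then filter the keys of that dictionary. A minimal sketch (assuming the keys keep the 'start --> end' format, so plain string comparison of the zero-padded timestamps works):
start, end = '00:00:17,039', '00:00:30,119'
wanted = {key: text for key, text in subtitle_dict.items()
          if start <= key.split(' --> ')[0] <= end}
print(wanted.values())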
import re
line = re.sub(r'\d{2}[:,\d]+[ \n](-->)*', "", s)
print(line)
will print:
" a classic math problem a\n\n will come from an unexpected place\n\n
\n \n binary numbers first I'm going to give\n\n puzzle and then you
can try to solve it\n\n like I said you have a thousand bottles"
Explanation
\d{2}[:,\d]+ : captures two digits followed by any combination of colons, commas and digits - this matches both the start and end timestamps
[ \n] : matches the space after the start timestamp or the line break after the end timestamp
(-->)* : matches zero or more occurrences of -->
As someone else suggested in the comments, you might want to look at a parser that does this for you by building a parse tree. They are more foolproof. A Google search leads me to this srt Python library.
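For illustration, a minimal sketch using that third-party srt library (pip install srt). It assumes a well-formed .srt file, i.e. numbered cues separated by blank lines (which the sample string above does not quite have), and the file name is just a placeholder:
from datetime import timedelta
import srt

with open('subtitles.srt', encoding='utf-8') as f:
    subs = list(srt.parse(f.read()))

# keep only the cues whose start time falls in the requested range
start = timedelta(seconds=17, milliseconds=39)
end = timedelta(seconds=30, milliseconds=119)
for sub in subs:
    if start <= sub.start <= end:
        print(sub.content)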
Related
I have a translation file that looks like this:
Apple=Apfel
Apple pie=Apfelkuchen
Banana=Banane
Bananaisland=Bananen Insel
Cherry=Kirsche
Train=Zug
...500+ more lines like that
Now I have a file with text that I need to work on. Only certain parts of the text need to be replaced, for example:
The [[Apple]] was next to the [[Banana]]. Meanwhile the [[Cherry]] was chilling by the [[Train]].
The [[Apple pie]] tastes great on the [[Bananaisland]].
Result needs to be
The [[Apfel]] was next to the [[Banane]]. Meanwhile the [[Kirsche]] was chilling by the [[Zug]].
The [[Apfelkuchen]] tastes great on the [[Bananen Insel]].
There are way too many instances to copy/paste manually. What is an easy way to search for [[XXX]] and replace it from another file as described?
I tried to get help with this for many hours, but to no avail. The closest I have gotten was this script:
import re

separators = "=", "\n"

def custom_split(sepr_list, str_to_split):
    # create regular expression dynamically
    regular_exp = '|'.join(map(re.escape, sepr_list))
    return re.split(regular_exp, str_to_split)

with open('D:/_working/paired-search-replace.txt') as f:
    for l in f:
        s = custom_split(separators, l)
        editor.replace(s[0], s[1])
However, this replaces too much or is inconsistent. E.g. [[Apple]] gets correctly replaced by [[Apfel]], but [[File:Apple.png]] gets wrongly replaced by [[File:Apfel.png]] and [[Apple pie]] gets replaced by [[Apfel pie]], so I tried tweaking the regular expression for hours on end, to no avail. Does anyone have any info - in very simple terms please - on how I can fix this/achieve my goal?
This is a little tricky because [ is a meta character in regex.
I'm sure there is a more efficient way to do it but this works:
replaces="""Apple=Apfel
Apple pie=Apfelkuchen
Banana=Banane
Bananaisland=Bananen Insel
Cherry=Kirsche
Train=Zug"""
text = """
The [[Apple]] was next to the [[Banana]]. Meanwhile the [[Cherry]] was chilling by the [[Train]].
The [[Apple pie]] tastes great on the [[Bananaisland]].
"""
if __name__ == '__main__':
    import re

    for replace in replaces.split('\n'):
        english, german = replace.split('=')
        # re.escape guards against regex metacharacters in the English phrase
        text = re.sub(rf'\[\[{re.escape(english)}\]\]', f'[[{german}]]', text)
    print(text)
outputs:
The [[Apfel]] was next to the [[Banane]]. Meanwhile the [[Kirsche]] was chilling by the [[Zug]].
The [[Apfelkuchen]] tastes great on the [[Bananen Insel]].
First, read in the file with translations:
translations = {}
with open('file/with/translations.txt', 'r', encoding='utf-8') as f:
    for line in f:
        items = line.strip().split('=', 1)
        translations[items[0]] = items[1]
I assume the phrases/words are unique in the file.
Then, you need to match all substrings between [[ and ]], capturing the text in between (with a regex like \[\[(.*?)]], see the online demo), check if there is a key equal to the group 1 value in the translations dictionary, and replace with [[ + dictionary value + ]] if there is such a key, or return the whole match if there is no such translation:
text = """The [[Apple]] was next to the [[Banana]]. Meanwhile the [[Cherry]] was chilling by the [[Train]].
The [[Apple pie]] tastes great on the [[Bananaisland]]."""
import re
translated_text = re.sub(r"\[\[(.*?)]]", lambda x: f'[[{translations[x.group(1)]}]]' if x.group(1) in translations else x.group(), text)
Output:
>>> translated_text
'The [[Apfel]] was next to the [[Banane]]. Meanwhile the [[Kirsche]] was chilling by the [[Zug]]. \nThe [[Apfelkuchen]] tastes great on the [[Bananen Insel]].'
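To tie it together, here is a sketch that applies the same substitution to the file you need to work on and writes the result back out; the file paths are just placeholders:
import re

with open('file/to/translate.txt', 'r', encoding='utf-8') as f:
    text = f.read()

translated_text = re.sub(
    r"\[\[(.*?)]]",
    lambda x: f'[[{translations[x.group(1)]}]]' if x.group(1) in translations else x.group(),
    text)

with open('file/to/translate.translated.txt', 'w', encoding='utf-8') as f:
    f.write(translated_text)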
So, from a text file with this content:
Lemonade juice whiskey beer soda vodka
In Python, reading that same .txt file, I would like to output word pairs in the following order:
juice-lemonade
whiskey-juice
beer-whiskey
soda-beer
vodka-soda
I managed to output something like that by using a list instead of opening the file in Python, but with a larger .txt file that is not really a handy solution.
Also, the bonus task for this would be to output the probability for each of those pairs. Any kind of hint would be highly appreciated.
To read large files efficiently, you should read them line-by-line, or (if you have really long lines, which is what the snippet below assumes) token-by-token.
A clean way to do this while keeping an open handle on a file is by using generators that yield a word at a time.
You can have another generator that combines 2 words at a time and yields pairs.
from typing import Iterator

def memory_efficient_word_generator(text_file: str) -> Iterator[str]:
    word = ''
    with open(text_file) as text:
        while True:
            character = text.read(1)
            if not character:
                if word:
                    # flush the last word if the file does not end with whitespace
                    yield word.lower()
                return
            if character.isspace():
                yield word.lower()
                word = ''
            else:
                word += character

def pair_generator(text_file: str) -> Iterator[str]:
    previous_word = ''
    for word in memory_efficient_word_generator(text_file):
        if previous_word and word:
            yield f'{previous_word}-{word}'
        previous_word = word or previous_word

for pair in pair_generator('filename.txt'):
    print(pair)
Assuming filename.txt contains:
Lemonade juice whiskey beer soda vodka
cola tequila lemonade juice
You should see something like:
lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice
Of course, there's a lot more you should handle depending on your desired behaviour (for example, handling non-alphabetic characters in your input).
Thank you very much for the feedback.
That's pretty much it; I just added encoding='utf-8' here:
with open(text_file, encoding='utf-8') as text:
since it raised a 'charmap' codec error for me otherwise.
And just one more thing: I also wanted to output the number of elements (words) in the text file, by using:
file = open("filename.txt", "rt", encoding="utf8")
data = file.read()
words = data.split()
print('Number of words :', len(words))
which I did. Now I'm trying to do the same with the word-pairs you sent; basically, each of those pairs would be one element, for example:
lemonade-juice ---> one element
So if we were to count all of these from a text file:
lemonade-juice
juice-whiskey
whiskey-beer
beer-soda
soda-vodka
vodka-cola
cola-tequila
tequila-lemonade
lemonade-juice
we would get the output of 9 elements or
Number of word-pairs: 9
I was now thinking of trying to do that with the len function, calling it on text_file.
Correct me if I'm looking in the wrong direction.
Once again, thank you for your time.
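If you want the number of word-pairs (and the probability of each pair, from the bonus task in your question) without writing them to a file first, one option is to feed pair_generator from above into collections.Counter. A minimal sketch:
from collections import Counter

pair_counts = Counter(pair_generator('filename.txt'))
total_pairs = sum(pair_counts.values())

print('Number of word-pairs:', total_pairs)
for pair, count in pair_counts.items():
    # relative frequency of each pair among all pairs
    print(pair, count / total_pairs)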
I originally posted this question here but was then told to post it to code review; however, they told me that my question needed to be posted here instead. I will try to better explain my problem so hopefully there is no confusion. I am trying to write a word-concordance program that will do the following:
1) Read the stop_words.txt file into a dictionary (use the same type of dictionary that you’re timing) containing only stop words, called stopWordDict. (WARNING: Strip the newline(‘\n’) character from the end of the stop word before adding it to stopWordDict)
2) Process the WarAndPeace.txt file one line at a time to build the word-concordance dictionary(called wordConcordanceDict) containing “main” words for the keys with a list of their associated line numbers as their values.
3) Traverse the wordConcordanceDict alphabetically by key to generate a text file containing the concordance words printed out in alphabetical order along with their corresponding line numbers.
I tested my program on a small file with a short list of stop words and it worked correctly (I have provided an example of this below). The outcome was what I expected: a list of the main words with their line numbers, not including words from the stop_words_small.txt file. The only difference between the small file I tested and the main file I actually need to process is that the main file is much longer and contains punctuation. So the problem I am running into is that, when I run my program with the main file, I get way more results than expected, because the punctuation is not being removed from the file.
For example, below is a section of the outcome where my code counted the word Dmitri as four separate words because of the different capitalization and punctuation that follows the word. If my code were to remove the punctuation correctly, the word Dmitri would be counted as one word followed by all the locations found. My output is also separating upper and lower case words, so my code is not making the file lower case either.
What my code currently displays:
Dmitri : [2528, 3674, 3687, 3694, 4641, 41131]
Dmitri! : [16671, 16672]
Dmitri, : [2530, 3676, 3685, 13160, 16247]
dmitri : [2000]
What my code should display:
dmitri : [2000, 2528, 2530, 3674, 3676, 3685, 3687, 3694, 4641, 13160, 16671, 16672, 41131]
Words are defined to be sequences of letters delimited by any non-letter. There should also be no distinction made between upper and lower case letters, but my program splits those up as well; however, blank lines are to be counted in the line numbering.
Below is my code and I would appreciate it if anyone could take a look at it and give me any feedback on what I am doing wrong. Thank you in advance.
import re

def main():
    stopFile = open("stop_words.txt", "r")
    stopWordDict = dict()
    for line in stopFile:
        stopWordDict[line.lower().strip("\n")] = []

    hwFile = open("WarAndPeace.txt", "r")
    wordConcordanceDict = dict()
    lineNum = 1
    for line in hwFile:
        wordList = re.split(" |\n|\.|\"|\)|\(", line)
        for word in wordList:
            word.strip(' ')
            if (len(word) != 0) and word.lower() not in stopWordDict:
                if word in wordConcordanceDict:
                    wordConcordanceDict[word].append(lineNum)
                else:
                    wordConcordanceDict[word] = [lineNum]
        lineNum = lineNum + 1

    for word in sorted(wordConcordanceDict):
        print(word, " : ", wordConcordanceDict[word])

if __name__ == "__main__":
    main()
Just as another example and reference, here is the small file I tested with the small list of stop words, which worked perfectly.
stop_words_small.txt file
a, about, be, by, can, do, i, in, is, it, of, on, the, this, to, was
small_file.txt
This is a sample data (text) file to
be processed by your word-concordance program.
The real data file is much bigger.
correct output
bigger: 4
concordance: 2
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
word: 2
your: 2
You can do it like this:
import re
from collections import defaultdict

wordConcordanceDict = defaultdict(list)

with open('stop_words_small.txt') as sw:
    words = (line.strip() for line in sw)
    stop_words = set(words)

with open('small_file.txt') as f:
    for line_number, line in enumerate(f, 1):
        words = (re.sub(r'[^\w\s]', '', word).lower() for word in line.split())
        good_words = (word for word in words if word not in stop_words)
        for word in good_words:
            wordConcordanceDict[word].append(line_number)

for word in sorted(wordConcordanceDict):
    print('{}: {}'.format(word, ' '.join(map(str, wordConcordanceDict[word]))))
Output:
bigger: 4
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
wordconcordance: 2
your: 2

I will add explanations tomorrow, it's getting late here ;). Meanwhile, you can ask in the comments if some part of the code isn't clear for you.
Here is my code
import re

with open('newfiles.txt') as f:
    k = f.read()

p = re.compile(r'[\w\:\-\.\,\']+|[^[\w\:\-\.\'\,]\s]')
originaltext = p.findall(k)

uniquelist = []
for word in originaltext:
    if word not in uniquelist:
        uniquelist.append(word)

indexes = ' '.join(str(uniquelist.index(word)+1) for word in originaltext)
n = p.findall(indexes)

file = open("newfiletwo.txt", "w")
file.write(' '.join(str(e) for e in n))
file.close()

file = open("newfilethree.txt", "w")
file.write(' '.join(uniquelist))
file.close()

with open('newfiletwo.txt') as f:
    indexess = f.read()
with open('newfilethree.txt') as f:
    differentwords = f.read()

differentwords = p.findall(differentwords)
indexess = [uniquelist.index(word) for word in originaltext]

for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
    i = differentwords.index(word)
    indexess.append(i)

s = ""  # the reconstructed sentence
for i in indexess:
    s = s + differentwords[i] + " "
print(s)
The program basically takes an external text file, returns the indexes of the words' positions (if a word repeats, the first position is taken) and then saves the positions as an external file. While doing this, I have split up the text file (including splitting off punctuation) and saved the distinct words and punctuation that occur in the file as an external file too. Now for the hard part: using both of these external files - the indexes and the separated distinct words - I am trying to recreate the original text file, including the punctuation. But the error shown in the title occurs:
Traceback (most recent call last):
  File "E:\Python\Index.py", line 31, in <module>
    s = s + differentwords[i] + " "
IndexError: list index out of range
Not trying to sound rude, but I am something of a beginner; please try to change as little as possible, in a simple way, as I have created this myself. You may well know a far shorter way to do this, but this is the level of simplicity I can handle, as the length of the code shows. I have tried shortening the original text file, but that was no use. Does anyone know why the error occurs and how to fix it? I am not looking for efficiency right now, maybe after another couple of months of learning, but the simplest (I don't mind long) answer will be the best. Sorry if I have repeated myself a lot :-)
'newfiles' - A bunch of sentences with punctuation
UPDATE
The code no longer shows the error, but it prints the original sentence twice. The error went away after removing the +1 on line 23. Does anyone know why the output repeats twice, though?
The problem is how you decide what is a word and what is not. For instance, is a comma part of a word? In your case it is not treated as such, while it is also not a separator, so you end up with the comma, or the dot, and so on, as separate words. I have no access to your input, so I can only provide a sample:
p = re.compile(r'[\w\:\-\.\,]+|[^[\w\:\-\.\,]\s]')
There is one catch - in this case 'Word', 'word', 'Word.' and 'word,' are all separate words, since the dot and comma are treated as parts of the word. You can't have your cake and eat it too. To fix that, you would need to store, for each token, whether there is whitespace before it.
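A rough sketch of that idea (not your original code): record, for every token, whether it was preceded by whitespace, so the sentence can be reassembled exactly. Note that it collapses runs of whitespace into a single space:
import re

def tokenize_with_spacing(text):
    tokens = []
    for match in re.finditer(r"\s+|[\w:\-.,']+|[^\w\s]", text):
        piece = match.group()
        if piece.isspace():
            continue  # whitespace itself is not stored, only remembered below
        preceded_by_space = match.start() > 0 and text[match.start() - 1].isspace()
        tokens.append((piece, preceded_by_space))
    return tokens

def reassemble(tokens):
    return ''.join((' ' if spaced else '') + piece for piece, spaced in tokens)

sample = "Hello, world. It's fine."
print(reassemble(tokenize_with_spacing(sample)) == sample)  # prints True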
UPDATE:
Oh, yes - the double output. The files stored along the way are OK, so something failed after that. Look at these two lines:
i = differentwords.index(word)
indexess.append(i)
They need to be inside the preceding if statement.
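In other words, a sketch of the fix - the end of the loop in the question becomes:
for word in originaltext:
    if not word in differentwords:
        differentwords.append(word)
        i = differentwords.index(word)
        indexess.append(i)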
I have been having difficulty organizing a function that will handle strings in the manner I want. I have looked into a handful of previous questions 1, 2, 3, among others that I have sorted through. Here is the setup: I have well-structured but variable data that needs to be split from a string read from the file into an array of strings. The following showcases some examples of the data I am dealing with:
('Vdfbr76','gsdf','gsfd','',NULL),
('Vkdfb23l','gsfd','gsfg','ggg#df.gf',NULL),
('4asg0124e','Lead Actor/SFX MUA/Prop designer','John Smith','jsmith#email.com',NULL),
('asdguIux','Director, Camera Operator, Editor, VFX','John Smith','',NULL),
...
(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),
I want to split these strings based on commas; however, there are commas occasionally contained within the strings, which causes problems. In addition to this, developing an accurate re.split(regex, line) becomes difficult because the number of items in each line changes throughout the read.
Here are some solutions that I have tried up to this point.
def splitLine(text, fields, delimiter):
    return_line = []
    regex_string = "(.*?),"
    for i in range(0, len(fields)-1):
        regex_string += "(.*)"
        if i < len(fields)-2:
            regex_string += delimiter
    return_line = re.split(regex_string, text)
    return return_line
This gives a result with the following output:
regex_string
(.*?),(.*),(.*),(.*),(.*),(.*)
return_line
['', '\t(222', "'Vy1asdfnuJkA','Ndfbyz3_YMD'", "'14541242640005471'", '2', "'Hello World!')", '', '\n']
However, the main problem with this is that it occasionally lumps two fields together - in this case, the 3rd value in the array.
Where the ideal result would look like:
['', '\t(222', "'Vy1asdfnuJkA'", "'Ndfbyz3_YMD'", "'14541242640005471'", '2', "'Hello World!')", '', '\n']
It is a small change, but it has a huge influence on the result. I tried manipulating the regex string to better suit what I was trying to do, but with each case I solved, another broke, unfortunately.
Another case which I played around with came from user Aaron Cronin in this post 4, and looks like this:
def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    result = []
    buff = ""
    level = 0
    is_quoted = False

    for char in text:
        if char in delimiter and level == 0 and not is_quoted:
            result.append(buff)
            buff = ""
        else:
            buff += char
            if char in opens:
                level += 1
            if char in closes:
                level -= 1
            if char in quotes:
                is_quoted = not is_quoted

    if not buff == "":
        result.append(buff)

    return result
The results of this look like so:
["\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n"]
The main problem is that it comes out as the same string, which puts me in a feedback loop.
The ideal result would look like:
[\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n]
Any help is appreciated, I am not sure what the best approach is in this scenario. I am happy to clarify any questions that arise as well. I tried to be as complete as possible.
Use ast's literal_eval!
from ast import literal_eval
s = """('Vdfbr76','gsdf','gsfd','',NULL),
('Vkdfb23l','gsfd','gsfg','ggg#df.gf',NULL),
('4asg0124e','Lead Actor/SFX MUA/Prop designer','John Smith','jsmith#email.com',NULL),
('asdguIux','Director, Camera Operator, Editor, VFX','John Smith','',NULL),
(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),
"""
for line in s.split("\n"):
    line = line.strip().rstrip(",").replace("NULL", "None")
    if line:
        print(list(literal_eval(line)))  # list(...) is just an example
Output:
['Vdfbr76', 'gsdf', 'gsfd', '', None]
['Vkdfb23l', 'gsfd', 'gsfg', 'ggg#df.gf', None]
['4asg0124e', 'Lead Actor/SFX MUA/Prop designer', 'John Smith', 'jsmith#email.com', None]
['asdguIux', 'Director, Camera Operator, Editor, VFX', 'John Smith', '', None]
[492, 'E1asegaZ1ox', 'Nysdag_5YmD', '145872325372620', 1, 'long, string, with, commas']