Add missing full-stops at the end of a text block [closed] - python

Currently, I'm trying to prepare some texts for my machine learning task in python3.
The input data is a single long string and has the following format:
<SPEAKER gender="female" id="1" name="unknown"> sentence_1. sentence_2? ... sentence_n, </SPEAKER><SPEAKER gender="male" id="2" name="unknown"> sentence_1. sentence_2? ... sentence_n </SPEAKER><SPEAKER gender="female" id="1" name="unknown"> sentence_1. sentence_2? ... sentence_n; </SPEAKER> ...
It consists of multiple "text blocks", each starting with a <SPEAKER ...> tag and ending with a </SPEAKER> tag.
As you can see, sometimes the last sentence within a block (sentence_n) is missing a full-stop . or the sentence ends with a comma , or semicolon ;.
The current problem is that when I cleanse the provided string and delete the tags, the last sentence (sentence_n) of a block and the first sentence (sentence_1) of the following block merge. I want to avoid this. I want the sentences to end with punctuation so that I can split the total string sentence-wise in my later text preprocessing steps.
Therefore, I would like to check the LAST character of the LAST sentence (sentence_n) of every block and:
add a full-stop if it's missing;
replace a comma or semicolon with a full-stop;
if a full-stop already exists, just keep it.
Thank you very much in advance!
Edit1: It does not have to be a regex solution. Since I handle thousands of such strings, performance is still important.
Edit2: Specified the question.

You can indeed use a regular expression:
import re
s = re.sub(r"([;,.])?(\s*</SPEAKER>)", r".\2", s)
This captures the ;, , or . when it is the last non-white-space character in the block, or, if there is none, matches the empty string at the spot where the full-stop should occur. In either case it replaces that capture with a full-stop.
Then apply your solution for removing the tags.
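As a rough sketch of the two steps together: the first pattern below is the one from the answer, while the tag-removal pattern, the helper names and the sample string are assumptions made only for illustration, since the question does not show its own cleaning code.

```python
import re

# Pattern from the answer: an optional ';', ',' or '.' followed by
# optional whitespace and the closing tag.
BLOCK_END = re.compile(r"([;,.])?(\s*</SPEAKER>)")
# Assumed tag-removal pattern (hypothetical; the question does not show one).
SPEAKER_TAG = re.compile(r"</?SPEAKER[^>]*>")

def terminate_blocks(text):
    """Ensure a full stop precedes every </SPEAKER> tag."""
    return BLOCK_END.sub(r".\2", text)

def strip_tags(text):
    """Drop the SPEAKER tags and collapse the leftover whitespace."""
    return " ".join(SPEAKER_TAG.sub(" ", text).split())

# Made-up sample with the three cases: trailing comma, nothing, full stop.
sample = (
    '<SPEAKER gender="female" id="1" name="unknown"> Hello there. How are you, </SPEAKER>'
    '<SPEAKER gender="male" id="2" name="unknown"> Fine </SPEAKER>'
    '<SPEAKER gender="female" id="1" name="unknown"> Good. </SPEAKER>'
)

cleaned = strip_tags(terminate_blocks(sample))
print(cleaned)
```

Note that a block ending in ? or ! would also get a full stop appended after it; if that matters for your data, the character class would need to be extended.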

Related

How to remove matching letters in a string that come after a comma? Python [closed]

I am working on a coding challenge that takes user input, removes the letters that match those after a comma, and scrubs the white space. For the life of me I can’t figure it out.
For example the user inputs:
Hello World, wea
Output would be:
Hllo orld
Any direction on how to solve this would be greatly appreciated; it is driving me crazy.
Try using a simple for loop that iterates across the phrase and places characters that don't appear after the comma into a separate string. That separate string is the result once the for loop has finished.
There are many different ways of achieving this, but this one is fairly easy to understand.
text = "Hello World, wea"
phrase, chars = text.split(",")  # split the text by the comma
chars = chars.strip()  # remove whitespace from second part
output = ""  # separate string to collect chars
for letter in phrase:
    if letter.lower() not in chars:  # check lowercased letters
        output += letter
print(output)
output
Hllo orld
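One possible variation on the same idea, wrapped in a function: it uses str.partition so that input without a comma does not raise, and a set for the lookup. The function name remove_letters is made up for this sketch.

```python
def remove_letters(text):
    """Remove from the part before the first comma every letter listed after it."""
    # partition splits on the first comma only and never raises.
    phrase, _, chars = text.partition(",")
    # Compare case-insensitively by lowercasing both sides.
    banned = set(chars.strip().lower())
    return "".join(ch for ch in phrase if ch.lower() not in banned)

print(remove_letters("Hello World, wea"))
```

Unlike the loop above, this version also matches uppercase letters given after the comma, because both sides are lowercased before comparing.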

Python check string pattern follows xxxx-yyyy-zzzz [closed]

I am trying to build a check for strings that allows a pattern "xxxx-yyyy-zzzz". The string needs to have three blocks separated by "-". Each block can contain "a-z", "A-Z", "_" (underscore) and "." (dot).
This is what I have so far:
import re
import sys

file_name: str = "actors-zero-This_Is_Fine.exe.yml"
ab = re.compile("^([A-Z][0-9])-([A-Z][0-9])-([A-Z][0-9])+$")
if ab.match(file_name):
    pass
else:
    print(f"WARNING wrong syntax in {file_name}")
    sys.exit()
Output is:
WARNING wrong syntax in actors-zero-This_Is_Fine.exe.yml
If I understand the question correctly, you want 3 groups of alphanumerical characters (plus .) separated by dashes:
Regex for this would be ^([\w.]+)-([\w.]+)-([\w.]+)$
^ matches the start of a string
Then we have 3 repeating groups:
([\w.]+) will match letters, numbers, underscores (\w) and dots (.) at least one time (+)
We make 3 of these, then separate each with a dash.
We finish off the regex with a $ to match the end of the string, making sure you're matching the whole thing.
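A minimal sketch of that pattern in use. It calls fullmatch, which makes the ^ and $ anchors unnecessary; the function name split_blocks is made up here. Keep in mind that \w also accepts digits and non-ASCII letters, which is broader than the a-z, A-Z, _ and . listed in the question.

```python
import re

# Three groups of word characters or dots, separated by dashes.
THREE_BLOCKS = re.compile(r"([\w.]+)-([\w.]+)-([\w.]+)")

def split_blocks(name):
    """Return the three blocks as a tuple, or None if the name does not fit."""
    match = THREE_BLOCKS.fullmatch(name)
    return match.groups() if match else None

print(split_blocks("actors-zero-This_Is_Fine.exe.yml"))
print(split_blocks("actors-zero"))
```

If only the characters from the question should be allowed, replacing [\w.] with [A-Za-z_.] would be one way to tighten it.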
What exactly is your question?
This looks alright so far. Your file name triggers the warning because [A-Z][0-9] matches exactly one uppercase letter followed by one digit, so lowercase letters, underscores and dots are never accepted.
Instead of ([A-Z][0-9]), you could use character classes with a quantifier:
Like so: ^([\w.]+)-([\w.]+)-([\w.]+)$
Generally, I found the chapter in Automate The Boring Stuff on regex Pattern matching very concise and helpful:
https://automatetheboringstuff.com/2e/chapter7/
You will find the above table in this chapter also.

Convert a big table of value into string [closed]

I have a table of values that I want to insert in my Python code as a string:
I tried this, but it is not working:
str(3.354219E-03 3.584506E-03 3.830603E-03 4.093597E-03 4.374646E-03
4.674992E-03 4.995957E-03 5.338959E-03 5.705510E-03 6.097227E-03
6.515837E-03 6.963188E-03 7.441252E-03 7.952137E-03 8.498098E-03
9.081543E-03 9.705044E-03 1.037135E-02 1.108341E-02 1.184435E-02
1.265753E-02 1.352655E-02 1.445522E-02 1.544766E-02 1.650823E-02
1.764162E-02 1.885282E-02 2.014718E-02 2.153040E-02 2.300859E-02
2.458826E-02 2.627639E-02 2.808042E-02 3.000831E-02 3.206855E-02
3.427025E-02 3.662310E-02 3.913749E-02 4.182451E-02 4.469601E-02
4.776465E-02 5.104398E-02 5.454845E-02 5.829352E-02 6.229571E-02
6.657268E-02 7.114329E-02 7.602769E-02 8.124744E-02 8.682555E-02)
Another way would be to put quotation marks at the beginning and end of each line, but that takes too much time. If there are some options in Vim or Notepad, I would be glad to hear about them.
One way would of course be to do it programmatically and immediately split your string so that you retrieve the desired table. To write a multi-line string literal, use three quotation marks """.
If you are looking for a way to do it in a text editor, here are two solutions that work in VS Code (and many more editors)
Column-selection mode. Decently fast, because your inputs are all the same length. You can highlight all rows and then just add quotes accordingly.
Reg-ex replacement. You can use the following regex, see explanations here. Note that I highlighted the reg-ex mode (next to "1 of 200").
Regardless of the method, by using three quotation marks here as well, Python should be able to read your string including the "inner" quotation marks.
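For the programmatic route, a small sketch using only the first two rows of the table from the question; the variable names are arbitrary.

```python
# Triple quotes let the string literal span several lines unchanged.
raw = """3.354219E-03 3.584506E-03 3.830603E-03 4.093597E-03 4.374646E-03
4.674992E-03 4.995957E-03 5.338959E-03 5.705510E-03 6.097227E-03"""

# split() with no argument splits on any whitespace, including newlines.
values = [float(token) for token in raw.split()]

print(len(values))
print(values[0], values[-1])
```

Pasting the whole table between the triple quotes works the same way, with no per-line quoting needed.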

Excluding/including strings in one re.compile statement in python to extract urls of interest [closed]

So I'm trying to get urls that contain specific strings, but also avoiding urls that contain a bad string.
So I don't want any urls that contain the string "/inventory/all/", and I only want urls that contain either the string "/inventories/" or "/inventory/2017/"
So I've managed to exclude at least the urls with strings that contain "/inventory/all/" by:
get_urls = soup.findAll('a', href=re.compile('^(?!.*/inventory/all/).*$'))
But when I try to include the strings I do want to get, then it no longer works, I tried:
get_urls = soup.findAll('a', href=re.compile('^(?!.*/inventory/all/).*$'|/inventories/|/inventory/2017/'))
Thanks for the help, I'm quite the novice
You can use the following regex:
^(?=.*inventor(?:ies|y/2017))^(?:(?!inventory/all).)+$
^(?=.*inventor(?:ies|y/2017)) is a lookahead that ensures we only match strings containing either inventories or inventory/2017. To reduce backtracking it is anchored with ^, so the check starts at the beginning of the string. If these two were the only requirement, ^.*inventor(?:ies|y/2017).*$ alone would be enough.
^(?:(?!inventory/all).)+$ applies a negative lookahead at every position to assert that inventory/all does not occur anywhere between the start and the end of the string. I added this part in case you come across a string of the form inventory/2017/inventory/all, which will be dropped.
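To see how the pattern behaves, here is a sketch that applies it to a few made-up href values. The list and the names URL_FILTER and wanted are assumptions for illustration; with BeautifulSoup, the compiled pattern would be passed as the href argument, as in the question.

```python
import re

# Regex from the answer: must contain inventories or inventory/2017,
# and must not contain inventory/all anywhere.
URL_FILTER = re.compile(r"^(?=.*inventor(?:ies|y/2017))^(?:(?!inventory/all).)+$")

# Hypothetical hrefs, not taken from any real page.
hrefs = [
    "/inventories/new/",
    "/inventory/2017/cars/",
    "/inventory/all/",
    "/inventory/2017/inventory/all/",
    "/about/",
]

wanted = [href for href in hrefs if URL_FILTER.search(href)]
print(wanted)
```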

How do I extract strings between the substrings at their first instance (readable string at all times) [closed]

I am working on a hobby project where I need to extract certain info between two substrings and there may be more than one occurrence.
Example:
Sample Text:
import re
s = "FooFoo name=JohnSmith and BarBar name=JackSmith"
start="name="
end="Smith"
Sample Code:
result = re.findall('name=(.*)Smith', s)
Sample Output: result array with all extracted substrings
print(result)
>>>['John', 'Jack']
I have tried regex and str.sub which have worked, but I have had trouble to put my search results into an array that can be called back later. I can post my attempted solutions if they help at all.
NOTE: It should only start the substring at the first occurrence of Start and end at the first occurrence of End, and then continue parsing the string after that location (don't want repeats or nested substrings)
PS: I am unsure of the encoding of the input, would there be a way to make sure it is always in a readable format for this method?
Ex:
MyString.decode() or MyString.encode('utf-8') or
MyString.encode('ascii','ignore') or unicodedata.normalize(MyString)?
Please let me know and any help is appreciated! Thank you!
All you need is a non-greedy match:
In [7]: result = re.findall('name=(.*?)Smith', s)
                                     ↑ THIS
In [8]: result
Out[8]: ['John', 'Jack']
Without the ?, the * grabs as many characters as it can. As a result, the match starts at the first name= and ends at the last Smith.
With the ?, the * grabs as few characters as it can, resulting in separate matches for each name=/Smith pair.
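The difference can be checked side by side. The helper between is a hypothetical generalization for arbitrary start and end markers; it uses re.escape so that markers containing regex metacharacters are treated literally.

```python
import re

s = "FooFoo name=JohnSmith and BarBar name=JackSmith"

# Greedy: one match stretching from the first name= to the last Smith.
greedy = re.findall(r"name=(.*)Smith", s)
# Non-greedy: one match per name=/Smith pair.
lazy = re.findall(r"name=(.*?)Smith", s)

def between(text, start, end):
    """Return every shortest substring found between start and end."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text)

print(greedy)
print(lazy)
print(between(s, "name=", "Smith"))
```

Because findall returns a plain list, the results can be stored and reused later like any other list.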
