How to wrap all words in a document with vim?

Here is the file; the lines from line 97 onward are the ones I want to put in a Python dictionary. The idea is that the words on the left side of the colon ':' become keys and those on the right side become values. All keys and values must be strings, so I need to wrap every word (from line 97 onward) in quotation marks. So the question is: how do I wrap all words in a document in quotes?
My purpose in doing this is to obtain column names for preprocessing for machine learning. If you are interested, you can find the columns without names here.

Difficult to know exactly what you mean if you don't post the code (and no, I am not going to download and open a file called adult.names from a random person on the internet). However, if all you want is for every word to be wrapped in quotes, you can use a global substitution:
:%s/\w\+\ze[ \t\n,:]\+/"\0"/g
Explanation:
:s/regex/text will replace whatever is matched by regex with text on the current line.
Add a % at the beginning and it will do it for all lines.
If you only want to do this for a section of your document, make a visual selection and then run this command without the % (or give an explicit range, e.g. :97,$s/... for line 97 to the end of the file).
\w matches a word character, and the \+ after it means one or more of them.
\ze ends the match (so you can specify what comes after whatever you're matching)
The [ \t\n,:] means match a space, tab, newline, comma, or colon (shorthand classes such as \s are not recognized inside [] in Vim, so the characters are listed explicitly), and the \+ following that means match a non-zero number of those.
All of that together means it's matching each word individually.
Then, for each of those matched words, it is replacing it with a quotation mark, then \0, which stands for the whole matched text (here, the word), and another quotation mark.
The /g at the end means that it will do this substitution as many times as it finds the regex on each line. Without that it would only substitute the first match on each line.
The result should be that it wraps every word in quotes. But again, it's difficult to test and find the right solution without seeing what you're working with. In the future please put the relevant pieces of code in your post.
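Since the end goal is a Python dictionary, the same substitution can also be tried in Python itself. This is a minimal sketch and not part of the original answer; the function name quote_words and the sample line are made up for illustration.

```python
import re

def quote_words(line):
    # \w+ matches each run of word characters; \g<0> re-inserts the whole match.
    return re.sub(r"\w+", r'"\g<0>"', line)

quoted = quote_words("age: continuous")
print(quoted)  # "age": "continuous"
```

Unlike the Vim command, this wraps every word regardless of what follows it, which may or may not be what you want for your file.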

Related

Regex (in Python) for matching whitespace that is preceded by an even number of quotation marks

I need to match the whitespace (for re.split), but only when it is outside of quotation marks.
The method I'm trying is to match it only when there is an even number of quotation marks before the whitespace.
"(?<=\".*?\")\s+"
This doesn't work, as look-behinds have to be of constant size.
EDIT: Turns out a look-around only checks the text immediately adjacent to the match position, whereas I want the quoted text to be allowed either immediately before the whitespace or many characters earlier.
I'd like to be able to use vim's \zs \ze (or some equivalent) to do this:
"\".*?\".*\zs\s\ze"
I only want to match the whitespace, after all.
EDIT 2:
I haven't solved the question as asked, but I have solved the underlying problem.
If anyone is interested:
'("[^"]*")\s+|([^" ]+)\s+|("[^"]*")$|([^" ]+)$'
This can probably be reduced a fair bit, but it works.
re.split returns a list of Nones, empty strings, and the strings I actually want, so I just have to remove the first two kinds and I'm sorted.

What about ^[^\"]*(\"[^\"]*\"[^\"]*)*(\s+)?
This is using [^\"]* for "any number of characters that are not quotation marks" to prevent a "great" 42" monitor from being matched in three possible ways (first to second quotation mark, first to third, or second to third).
So basically it says:
0 or more times non-quotation-marks
0 or more times a group consisting of
"
0 or more non-quotation-marks
"
0 or more non-quotation-marks
then, optionally, one or more whitespace characters (the trailing (\s+)? group)
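If the goal is simply to get the tokens rather than to match the separating whitespace, matching the tokens directly tends to be easier than splitting. The following is a minimal sketch of that swapped-in approach, assuming quotation marks are balanced and tokens are separated by whitespace; the name split_outside_quotes is hypothetical.

```python
import re

def split_outside_quotes(text):
    # A token is either a complete "quoted string" or a run of non-whitespace.
    return re.findall(r'"[^"]*"|\S+', text)

tokens = split_outside_quotes('say "hello world" twice')
print(tokens)  # ['say', '"hello world"', 'twice']
```

The quoted alternative is listed first so that a token starting with a quotation mark is consumed up to its closing mark, spaces included.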

How do I remove texts in brackets only from the beginning of lines with regex in Python?

I would like to remove all the line codes in brackets placed at the beginning of lines but want to preserve the other words in brackets.
NOTE:
In the application that I use I cannot import any Python library but can use Python regexes. The regex and the replacement value in the substitution have to be separated by a comma. For example, I use ([^\s\d])(-\s+),\1 to merge hyphenated words at the end of lines. So I would need something similar.
\([^()]*\) finds every text in brackets.
^\h*\([^()]*\) finds only the first one but not the rest.
How should I modify it?
The sample text is the following:
(#p0340r#) This is a sentence. This is another one but I need more sentences to fill the space to start a new line.
(#p0350q#) Why? (this text should be left unchanged)
(#p0360r#) Because I need to remove these codes from interview texts.
The expected outcome should be:
This is a sentence. This is another one but I need more sentences
to fill the space to start a new line.
Why? (this text should be left unchanged)
Because I need to remove these codes from interview texts.
Thank you!

To remove a pattern at the start of any line with Python re.sub (or any re.sub powered search and replace), you need to use the ^ before the pattern (that is what you already have) and pass the multiline (?m) flag (if you have access to code you could use flags=re.M).
Also, \h is not Python re compliant, you need to use a construct like [ \t] or [^\S\n] (in some rare cases, also [^\S\r\n], usually when you read a file in binary mode) to match any horizontal whitespace.
So you can use
(?m)^[^\S\n]*\([^()]*\)[^\S\n]*
and replace with an empty string.
Note: if you ever want to remove one or more substrings inside parentheses at the start of a line, group the pattern and apply the + quantifier to it:
(?m)^(?:[^\S\n]*\([^()]*\))+[^\S\n]*
#    ^^^                  ^^
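For anyone who does have access to code, the first pattern can be checked with re.sub. This is only a sketch using a shortened version of the sample text from the question; the variable names are made up for illustration.

```python
import re

text = (
    "(#p0340r#) This is a sentence.\n"
    "(#p0350q#) Why? (this text should be left unchanged)\n"
    "(#p0360r#) Because I need to remove these codes from interview texts."
)

# (?m) makes ^ match at the start of every line, not only the start of the text.
pattern = re.compile(r"(?m)^[^\S\n]*\([^()]*\)[^\S\n]*")
cleaned = pattern.sub("", text)
print(cleaned)
```

Only the bracketed code at the start of each line is removed; the parenthesized text in the middle of the second line is not preceded by a line start, so it stays.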

regex for matching german characters in python

Could someone help me with a regex to match German words/sentences in
Python? It does not work in a Jupyter notebook. I tried the same in JSFiddle
and it works fine. I tried the script below, but it does not work:
import re
pattern = re.compile(r'\[^a-zA-Z0-9äöüÄÖÜß]\\', re.UNICODE)
print(pattern.search(text))

Your expression will always fail:
\[^a-zA-Z0-9äöüÄÖÜß]\\
Broken down, you require
[ # literally
^ # start of the line / text
a-z # literally, etc.
The problem is that you require a literal [ right before the start of a line, which can never be true (before the start there is either nothing or a newline). So remove the backslashes to get a proper character class, as in:
[^a-zA-Z0-9äöüÄÖÜß]+
But this will surely not match the words you're looking for (quite the opposite, since the ^ now negates the class). So use something as simple as \w+ or the solution proposed by @Wiktor in the comments section.
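To illustrate the difference, here is a minimal sketch with a made-up sample sentence. It assumes Python 3 str patterns (where \w is Unicode-aware by default) and text in composed (NFC) form, so that each umlaut is a single character.

```python
import re

text = "Die Straße ist schön, 2 Häuser weiter."

# Explicit, non-negated class: letters only, including the German ones.
words = re.findall(r"[a-zA-ZäöüÄÖÜß]+", text)
print(words)

# \w+ also matches digits and underscores.
tokens = re.findall(r"\w+", text)
print(tokens)
```

The first call returns only the alphabetic words, while the second also returns the digit 2 as a token.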

The square brackets define a range of characters you want to look for, however the '^' negates these characters if it appears within the character class.
If you want to specify the beginning of the line you need to put the '^' before the brackets.
Also, you need to add a quantifier after the class to match more than just one character in this case:
r'^[a-zA-Z0-9äöüÄÖÜß]+'
One or more characters contained in the brackets are matched as long as they are not separated by any other character not listed between '[]'.
Here's the link to the official documentation

How to use regex to match uppercase words that are non-consecutive duplicates of lowercase words elsewhere in a file

I am a noob to regex and need some help. I have a basic understanding of regex character classes, anchors, and lookarounds but this specific use-case is proving difficult for me.
I am trying to parse the output of a monthly script that produces a list of users for inventory purposes. I want to use regex to parse the file to match these conditions:
Regex to find lowercase words that are repeated as uppercase words. The uppercase words are not on the same line, are non-consecutive occurrences, and can be on new lines or elsewhere in the file.
I need a regex that is able to display the duplicate uppercase matches
Another regex that deletes the duplicate uppercase matches
Here is an example of the file output I am trying to parse:
"hello","2018-11-19","unitelife"
"world","2018-11-09","unitelife"
"foo","2018-11-16","unitelife"
"bar","2018-10-05","unitelife"
"hello123","2018-09-06","unitelife"
"HELLO123","2018-11-18","unitelife"
"FOO","2018-11-20","unitelife"
"WOWMUCHHAPPY","2018-10-20","unitelife"
"suchjoy","2017-11-28","unitelife"
The desired matches I am looking for are:
HELLO123
FOO
I have tried a number of combinations suggested by the referenced URL discussions below but nothing seems to work for me. Either I am trying incorrect combinations or this feature is not possible. In addition, most topics are discussing consecutive words or letters/characters.
Regex for non-consecutive upper-case words
Regex for non-consecutive upper-case words PART DEUX
PCRE Regex non-consecutive repeating
Regex for non-consecutive uppercase with lowercase
Regex non-consecutive chars
I want to apologize in advance if this was discussed before or if it has been categorized incorrectly. Please let me know rather than downvoting it so that I may edit, close, or reclassify the question to meet the standards accordingly.
Sincerely,
A humbly committed student

You can use the pattern
(?sm)^"([a-z\d]+)"(?i)(?=(?:[^\n]*\n)+?"(?=\1")(?-i)[A-Z\d]+")
https://regex101.com/r/nM3iBH/2
The idea is, start by being case-sensitive (no i), and match the lower-cased string at the beginning of the line inside the quotes. Then, enable the case-insensitive flag, and lookahead for the same string at the beginning of the line inside quotes. Re-enable case sensitivity, and match that same string again inside the lookahead, permitting only upper-case letters (and digits).
Note that this (like any regex algorithm for this problem) has O(N^2) complexity, because with any match, you need to check the entire remaining substring for the (uppercased) match.
Also note the use of "s rather than the \bs you were thinking of - using "s instead is more precise given this sort of input and will result in significantly fewer steps overall.
The pattern is quite strict, for the sake of reducing steps. Separated out:
(?sm)^"([a-z\d]+)" - Initial flags, capture quoted lower-cased word at the beginning of a line
(?i) - Enable case-insensitivity so that the future \1 backreference will work properly
(?=(?:[^\n]*\n)+?"(?=\1")(?-i)[A-Z\d]+") Big lookahead for:
(?:[^\n]*\n)+? - Lazily match one or more whole lines (non-newline characters, followed by a newline character)
"(?=\1") Backreference the original matched word inside quotes at the beginning of a line
(?-i) Re-enable case-sensitivity so the capitalization can be checked
[A-Z\d]+" - Match capital letters and digits, followed by a "
In order to replace the capital word line with the empty string, instead of using a big lookahead, match everything between the lower-case word and the upper-case word in a group (so you have two groups, the lower-case word, and everything following it up until the upper-case word), then match the upper-case word, and replace with only the first two groups (thereby replacing out the upper-case word's line):
(?sm)^("[a-z\d]+")(?i)((?:[^\n]*\n)*[^\n]*)\n(?=\1)(?-i)"[A-Z\d]+"[^\n]*
replace with
\1\2
(or the equivalent in your environment)
https://regex101.com/r/nM3iBH/3
Note that if you have overlapping matches, you'll have to do this iteratively until no matches remain.

Here's a solution using just bash, with no regular expressions:
> cat filter.sh
#!/bin/bash
declare -A lower=()
declare -A upper=()
while IFS= read -r line; do
    # Turn the quoted CSV fields into array elements (see the eval warning below).
    eval "words=( $(tr ',' ' ' <<< "$line") )"
    for w in "${words[@]}"; do
        [[ "${w^^}" = "$w" ]] && upper["$w"]=1 || lower["$w"]=1
    done
done
for u in "${!upper[@]}"; do
    exists=${lower["${u,,}"]+foo}
    [[ -n "$exists" ]] && echo "$u"
done
There are a couple tricks I'm using here.
First, I'm using associative arrays to weed out repeats. For instance, if "HELLO123" appears multiple times in the file, it will only be counted once.
Secondly, I'm parsing the CSV by using tr to replace the commas with spaces, and then using eval to parse the string into an array, taking advantage of the fact that the individual words are always enclosed in double quotes.
Lastly, I'm using [[ "${w^^}" = "$w" ]] as a test to check if a word contains all uppercase letters. The ${w^^} syntax is a bash-ism that converts a variable to uppercase. I also use ${u,,} in the second loop, which converts $u to lowercase. Note that if you have a word with a mix of uppercase and lowercase letters, it will be counted as a lowercase word. If that doesn't match your expectations, you could change the logic.
The first loop just reads from stdin, splits each line into individual words, and then categorizes each word as either uppercase or lowercase. Again, I'm using associative arrays so that each word (regardless of case) only gets counted once.
The second loop just loops through the keys of the upper associative array (${!upper[@]}), which are just all the uppercase words encountered in the input. For each word, it checks to see if the matching lowercase word was also encountered. The ${lower["${u,,}"]+foo} syntax is just checking to see if the lowercase word exists in the lower array. The foo part is just an arbitrary string. You could just as well use bar or exists or abc. This is how you check for the existence of a key in an associative array in bash. If the key exists in the array, then the expression will evaluate to "foo", otherwise it will evaluate to an empty string. That's what the subsequent -n test is checking for.
Example:
> cat input.txt
"hello","2018-11-19","unitelife"
"world","2018-11-09","unitelife"
"foo","2018-11-16","unitelife"
"bar","2018-10-05","unitelife"
"hello123","2018-09-06","unitelife"
"HELLO123","2018-11-18","unitelife"
"FOO","2018-11-20","unitelife"
"WOWMUCHHAPPY","2018-10-20","unitelife"
"suchjoy","2017-11-28","unitelife"
> cat input.txt | ./filter.sh
FOO
HELLO123
NOTE: Please don't use eval in production code. It is prone to all sorts of abuse and mishaps due to unexpected things appearing in the input. For example, consider what would happen if you inserted the following line into the input:
"); rm -rf *; foo=("
Then eval would end up evaluating the string "words=(); rm -rf *; foo=()". Definitely not good. I only used eval here as a quick and dirty way to parse the CSV. There are much better (and safer) ways to parse CSV in bash. The point of this solution is the use of associative arrays to keep track of uppercase and lowercase words, while filtering out duplicates.
Edit: Also note that FOO and HELLO123 appear out of order in the output. This is due to the fact that associative arrays don't store the keys in the order in which you created them. So when you do ${!hash[@]}, it's a crapshoot what order the keys will be in. If that's an issue for you, you could keep a separate regular array to preserve the order.

Not with regex, but using awk's toupper() and tolower() functions:
$ awk -F, '{lower=tolower($1)} lower==$1 {a[$1];next}
toupper($1)==$1 && lower in a{print $1}' file
"HELLO123"
"FOO"
If the first field is lowercase, it is added to the set; if it is uppercase and its lowercase form is already in the set, it is printed.
Here the order is important (the lowercase word has to appear before the uppercase one), as in your example. If that is not guaranteed, this needs to be converted to a two-pass version. It is also easy to remove the quotes if they are not needed.
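The two-pass idea can be sketched in Python, with the csv module doing the parsing instead of eval or field splitting. This is a minimal sketch under the assumption that the name is always the first column; the function name uppercase_duplicates is hypothetical, and mixed-case names are ignored in the second pass.

```python
import csv
import io

def uppercase_duplicates(lines):
    names = [row[0] for row in csv.reader(lines) if row]
    # First pass: every name that is already all lowercase.
    lowercase = {name for name in names if name == name.lower()}
    # Second pass: uppercase names whose lowercase form was seen anywhere.
    found = []
    for name in names:
        if name.isupper() and name.lower() in lowercase and name not in found:
            found.append(name)
    return found

sample = io.StringIO(
    '"hello","2018-11-19","unitelife"\n'
    '"foo","2018-11-16","unitelife"\n'
    '"hello123","2018-09-06","unitelife"\n'
    '"HELLO123","2018-11-18","unitelife"\n'
    '"FOO","2018-11-20","unitelife"\n'
    '"WOWMUCHHAPPY","2018-10-20","unitelife"\n'
)
result = uppercase_duplicates(sample)
print(result)  # ['HELLO123', 'FOO']
```

Because the lowercase set is built before any uppercase name is checked, the result does not depend on which of the two spellings comes first in the file, and the matches are reported in file order.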

Preserve key:value values in text while regex replacing non-word characters in keys (Notepad++)

I'm trying, without luck, to use Notepad++ to replace every non-word character \W with an underscore _ in a block of multi-line text, except for the colon : and everything to the right of it (a colon doesn't occur on every line; the text is something of a space-delineated hierarchy, terminating in a key-value pair). A Python solution would be useful as well, as I'm trying to do other things with the text once it's reformatted. Example:
This 100% isn't what I want
Yet, it's-what-I've got currently: D#rnit :(
This_100_is_what_I_d_like: See?
Indentation_isn_t_necessary
_to_maintain_but_would_be_nice: :)<-preserved!
I_m_Mr_Conformist_over_here: |Whereas, I'm like whatever's clever.|
If_you_can_help: Thanks 100.1%!

I admit that I'm answering an off-topic question; I just liked the problem. Press Ctrl+H, enable Regular Expressions in N++, then search for:
(:[^\r\n]*|^\s+)|\W(?<![\r\n])
And replace with:
(?1\1:_)
The regex has two main parts. The first side of the outer alternation matches either the leading spaces of a line (indentation) or everything from the first colon to the end of the line. The second side matches a non-word character other than a carriage return \r or newline \n (excluded by the negative lookbehind), so that line breaks are preserved. The replacement string is a conditional that says: if the first capturing group matched, replace with the group itself; otherwise, replace with a _.

Seeing a better description of what you're trying to do, I don't think you'll be able to do it from inside Notepad++ using a single regular expression. However, you could write a Python script that goes through your document, one line at a time, and sanitizes anything to the left of a colon (if one exists).
Here's a quick and dirty example (untested). This assumes doc is an open file object for the file you want to sanitize:
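import re

sanitized_lines = []
for line in doc:
    # Split into indentation, the text left of the first colon, and the rest.
    # re.DOTALL lets the last group keep the trailing newline.
    line_match = re.match(r"^(\s*)([^:\n]*)(.*)", line, re.DOTALL)
    indentation = line_match.group(1)
    left_of_colon = line_match.group(2)
    remainder = line_match.group(3)
    # Replace non-word characters only in the part left of the colon.
    left_of_colon = re.sub(r"\W", "_", left_of_colon)
    sanitized_lines.append("".join((indentation, left_of_colon, remainder)))
sanitized_doc = "".join(sanitized_lines)
print(sanitized_doc)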

You may try this python script,
ss="""This 100% isn't what I want
Yet, it's-what-I've got currently: D#rnit :(
If you can help: Thanks 100.1%!"""
import re
splitcapture=re.compile(r'(?m)^([^:\n]+)(:[^\n]*|)$')
subregx=re.compile(r'\W+')
print(splitcapture.sub(lambda m: subregx.sub('_', m.group(1))+m.group(2), ss))
Here I first match each line and capture two parts separately: the part that contains no ':' character is captured in group 1, and the optional part that starts with ':' and runs to the end of the line is captured in group 2. The replacement is then applied only to the group 1 string, and the result is joined back with group 2.
And output is
This_100_isn_t_what_I_want_
_Yet_it_s_what_I_ve_got_currently: D#rnit :(
If_you_can_help: Thanks 100.1%!
