regex to fix csv quotes - python

I have a simple CSV with quoted fields, something like:
"something","something","something","something",...
But sometimes I get a CSV with stray quotes inside the fields:
"something","som"ething"","s"omething",...
I want to create a regex that fixes this problem. Does anyone have a suggestion?
It should remove everything from each string that is not a number or text, but when it removes a " it has to leave the ones that bound the string, because I need those.
So from "som"ething"","s"ometh8 ing" I'd expect => "something","someth8 ing"
I'm using Scala, but any solution would be great!
Thanks!

Simple solution
A simple solution in Scala:
scala> val input = """"som"ething"","s"ometh8 ing""""
input: String = "som"ething"","s"ometh8 ing"
scala> val values = input.split("\",\"").map(_.filter(c => c.isLetterOrDigit || c.isWhitespace))
values: Array[String] = Array(something, someth8 ing)
scala> val output = values.mkString("\"", "\",\"", "\"")
output: String = "something","someth8 ing"
This assumes you never have "," inside your values; if you do, there's no way to fix your CSV unambiguously anyway.
This isn't the most efficient solution in terms of speed or memory, but it's short and simple.
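Since the question title also mentions Python, here is a rough Python equivalent of the split-and-filter approach above. It is only a sketch: the function name clean_csv_line is made up for illustration, and it rests on the same assumption that "," never appears inside a value.

```python
def clean_csv_line(line):
    """Keep only letters, digits and whitespace inside each quoted field."""
    # Split on the "," separators. The outermost quotes stay attached to the
    # first and last fields, but the filter below removes them as well.
    fields = line.split('","')
    cleaned = [
        "".join(c for c in field if c.isalnum() or c.isspace())
        for field in fields
    ]
    # Re-add the bounding quotes and the separators.
    return '"' + '","'.join(cleaned) + '"'


print(clean_csv_line('"som"ething"","s"ometh8 ing"'))
# prints "something","someth8 ing"
```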
EDIT: Regex solution
In case you really want some regexes, enjoy:
scala> input.replaceAll("""(^"|"$|","|[\p{IsAlphabetic}\p{Digit}\p{Space}])|.""", "$1")
res17: String = "something","someth8 ing"
This tries to match " at the beginning or end of input OR "," anywhere else OR any of your approved characters. If any of these match, it goes to the first capturing group. Otherwise, it matches any character (.), but doesn't capture it in a group, so the first group stays empty. Then, the matched substring is replaced with $1, which is the content of the first capturing group.
I still think the first solution is cleaner and easier to understand.
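A hedged Python port of the regex solution above. Python's re module does not support the \p{...} classes, so this sketch swaps in [^\W_] (letters and digits) and \s (whitespace) as an approximation. The name fix_quotes is hypothetical, and the code relies on Python 3.5 or later, where an unmatched group in the replacement is substituted with an empty string.

```python
import re

# Group 1 holds everything worth keeping: a quote at the very start or end,
# a "," separator, or a single letter, digit or whitespace character.
# Any other character is matched by the trailing '.' and dropped.
KEEP = re.compile(r'(^"|"$|","|[^\W_]|\s)|.')


def fix_quotes(line):
    # Replace every match with the content of group 1 (empty if unmatched).
    return KEEP.sub(r"\1", line)


print(fix_quotes('"som"ething"","s"ometh8 ing"'))
# prints "something","someth8 ing"
```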

import re
csv_string = '"something","som"ething"","s"omething"'
# Append a newline so the last field is also followed by a delimiter.
for each_str in re.findall(r'(.*?)[,\n]', csv_string + '\n'):
    # Remove every double quote from the field.
    print(re.sub(r'"', '', each_str))
Add a line feed to the end of the string so that re.findall also picks up the last part of the string.

Related

Delete specific duplicated punctuation from string

I have this string s = "(0|\\+33)[1-9]( *[0-9]{2}){4}", and I want to delete just one of the duplicated ' \ ' characters, so that the result looks like (0|\+33)[1-9]( *[0-9]{2}){4}.
When I used this code, all the duplicated characters were removed:
result = "".join(dict.fromkeys(s))
But in my case I just want to remove the duplicated ' \ '. Any help is highly appreciated.
A solution using the re module:
import re
s = r"(0|\\+33)[1-9]( *[0-9]{2}){4}"
s = re.sub(r"\\(?=\\)", "", s)
print(s)
I look for every backslash that is followed by another backslash and replace it with an empty string.
Output: (0|\+33)[1-9]( *[0-9]{2}){4}
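One detail worth checking: because the lookahead removes every backslash that is followed by another one, a longer run of backslashes also collapses to a single backslash. A small sketch (the helper name and the second sample string are made up for illustration):

```python
import re


def collapse_backslashes(text):
    # Delete each backslash that is immediately followed by another backslash.
    return re.sub(r"\\(?=\\)", "", text)


print(collapse_backslashes(r"(0|\\+33)"))  # (0|\+33)
print(collapse_backslashes(r"a\\\b"))      # a\b
```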
The function you need is replace
s = "(0|\\+33)[1-9]( *[0-9]{2}){4}"
result = s.replace("\\","")
EDIT
I see now that you want to remove just one \ and not both.
To do this, modify the call to replace like this:
result = s.replace("\\", "", 1) # last argument is the number of occurrences to replace
or
result = s.replace("\\\\", "\\") # replace each pair of backslashes with a single one
EDIT of the EDIT
Backslashes are special in Python.
I'm using Python 3.10.5. If I do
x = "ab\c"
y = "ab\\c"
print(len(x)==len(y))
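I get True.
That's because the backslash is used to escape special characters, which makes the backslash itself a special character :) In "ab\\c" the \\ is an escaped single backslash, while \c is not a recognized escape sequence, so the backslash in "ab\c" is kept as-is; both strings end up 4 characters long.
I suggest experimenting a bit with replace until you get what you need.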
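To make the point about escaping concrete, this sketch compares the literal from the question written as a normal string and as a raw string. The variable names plain and raw are only illustrative.

```python
plain = "(0|\\+33)[1-9]( *[0-9]{2}){4}"  # normal literal: \\ is ONE backslash
raw = r"(0|\\+33)[1-9]( *[0-9]{2}){4}"   # raw literal: \\ is TWO backslashes

print(plain.count("\\"))  # 1
print(raw.count("\\"))    # 2

# Replacing each pair of backslashes with a single one only changes the raw version.
print(raw.replace("\\\\", "\\") == plain)    # True
print(plain.replace("\\\\", "\\") == plain)  # True (nothing to replace)
```

So whether anything needs to be removed depends on how the string was created in the first place.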

Parse a string using regex to obtain matches beginning with a certain word

I tried to search but the information that I am getting seems to be kinda overwhelming and far from what I need. I can't seem to get it to work.
The requirement is to get the function that starts with "meta" and its parentheses.
input:
one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)
output:
[ metaOmph , (uno) ]
[ metaAsdf, (dos) ]
[ metaPoil, (tres)]
The one that I currently have just gets the entire line if it starts with "meta", so I get the whole "one meta<>" when it matches. Would it be possible to do what I'm aiming for?
Edit: It's one input/line at a time.
I'd love to post what I did earlier but I closed repl.it due to my frustration. I'll keep it in mind on my next post. (quite new here)
import re
s = """one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)"""
print(re.findall(r".+(meta\w+)(\(\w+\))", s))
Outputs:
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]
re.findall() approach with valid regex pattern:
import re
s = '''
one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)
'''
result = re.findall(r'\b(meta\w+)(\([^()]+\))', s)
print(result)
The output:
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]
If you are going to pass a multiline string, it is simplest to use the module-level re.findall function.
import re
text = '''one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)'''
r = re.findall(r'\b(meta.*?)(\(.*?\))', text, re.M)
print(r)
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]
If you are going to be passing 1-line strings as input to a loop, it might make more sense to compile the pattern beforehand, using re.compile and re.search inside a function:
pat = re.compile(r'\b(meta.*?)(\(.*?\))')
def find(text):
    return pat.search(text)
for text in list_of_texts: # assuming you're passing in your strings from a list, or elsewhere
    m = find(text)
    if m:
        print(list(m.groups()))
['metaOmph', '(uno)']
['metaAsdf', '(dos)']
['metaPoil', '(tres)']
Note that find might return a match object or None, depending on whether a match was found. You'll want to check the return value; otherwise you'll get AttributeError: 'NoneType' object has no attribute 'groups', or something along those lines.
Alternatively, if you want to append the result to a list, you might instead use:
r_list = []
for text in list_of_texts:
    m = find(text)
    if m:
        r_list.append(list(m.groups()))
print(r_list)
[['metaOmph', '(uno)'], ['metaAsdf', '(dos)'], ['metaPoil', '(tres)']]
Regex Details
\b # word boundary (thought to add this in thanks to Roman's answer)
(
meta # literal 'meta'
.*? # non-greedy matchall
)
(
\( # literal opening parenthesis (escaped)
.*? # non-greedy matchall
\) # literal closing parenthesis (escaped)
)
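The annotated pattern above can be used almost verbatim with the re.VERBOSE flag, which makes Python ignore whitespace and # comments inside the pattern. A minimal sketch (the sample lines, including the non-matching one, are illustrative):

```python
import re

pat = re.compile(r"""
    \b          # word boundary
    (meta.*?)   # literal 'meta' plus a non-greedy run of any characters
    (\(.*?\))   # parenthesised argument, parentheses included
""", re.VERBOSE)

for line in ["one metaOmph(uno)", "one metaAsdf(dos)", "two words only"]:
    m = pat.search(line)
    # search returns None when nothing matches, so check before using groups().
    print(list(m.groups()) if m else None)
```

The last line has no "meta" call, so it prints None instead of raising an error.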

Python split before a certain character

I have following string:
BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6
I am trying to split it in a way I would get back the following dict / other data structure:
BUCKET1 -> /dir1/dir2/, BUCKET1 -> /dir3/dir4/, BUCKET2 -> /dir5/dir6/
I can somehow split it if I only have one BUCKET, not multiple, like this:
res.split(res.split(':', 1)[0].replace('.', '').upper()) -> it's not perfect
Input: ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/
Output: [(ADRIAN, /dir1/dir11), (DANIEL, /dir2/), (CULEA, /dir3/), (ADRIAN, /dir5/), (ADRIAN, /dir6/)]
As per Wiktor Stribiżew's comments, the following regex does the job:
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"
If you're experienced, I'd recommend learning Regex just as the others have suggested. However, if you're looking for an alternative, here's a way of doing such without Regex. It also produces the output you're looking for.
string = input("Enter:") # Put your own input here.
tempList = string.replace("BUCKET", ':').split(":")
outputList = []
for i in range(1, len(tempList) - 1, 2):
    someTuple = ("BUCKET" + tempList[i], tempList[i+1])
    outputList.append(someTuple)
print(outputList) # Put your own output here.
This will produce:
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
This code is hopefully easier to understand and manipulate if you're unfamiliar with Regex, although I'd still personally recommend Regex to solve this if you're familiar with how to use it.
Use the re.findall() function:
import re
s = "ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/"
result = re.findall(r'(\w+):([^:]+\/)', s)
print(result)
The output:
[('ADRIAN', '/dir1/dir11/'), ('DANIEL', '/dir2/'), ('ADI_BUCKET', '/dir3/'), ('CULEA', '/dir4/'), ('ADRIAN', '/dir5/'), ('ADRIAN', '/dir6/')]
Use regex instead?
import re
test = 'BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6'
output = re.findall(r'(?P<bucket>[A-Z0-9]+):(?P<path>[/a-z0-9]+)', test)
print(output)
Which gives
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
It appears you have a list of predefined "buckets" that you want to use as boundaries for the records inside the string.
That means, the easiest way to match these key-value pairs is by matching one of the buckets, then a colon and then any chars not starting a sequence of chars equal to those bucket names.
You may use
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"
Compile with re.S / re.DOTALL if your values span multiple lines.
Details:
(BUCKET1|BUCKET2) - capture group one that matches and stores in .group(1) any of the bucket names
: - a colon
(.*?) - any 0+ chars, as few as possible (as *? is a lazy quantifier), up to the first occurrence of (but not including)...
(?=(?:BUCKET1|BUCKET2)|$) - any of the bucket names or end of string.
Build it dynamically while escaping bucket names (just to play it safe in case those names contain * or + or other special chars):
import re
buckets = ['BUCKET1','BUCKET2']
rx = r"({0}):(.*?)(?=(?:{0})|$)".format("|".join([re.escape(bucket) for bucket in buckets]))
print(rx)
# => (BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)
s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
print(re.findall(rx, s))
# => [('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
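Since the question asks for a dict and the same bucket can occur more than once, one option is to collect the paths of each bucket in a list. This is only a sketch building on the regex above; group_by_bucket is a hypothetical helper, not part of any library.

```python
import re
from collections import defaultdict


def group_by_bucket(s, buckets):
    # Escape the names in case they contain regex metacharacters.
    names = "|".join(re.escape(b) for b in buckets)
    rx = re.compile(r"({0}):(.*?)(?=(?:{0})|$)".format(names), re.S)
    grouped = defaultdict(list)
    for bucket, path in rx.findall(s):
        grouped[bucket].append(path)
    return dict(grouped)


s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
print(group_by_bucket(s, ["BUCKET1", "BUCKET2"]))
# {'BUCKET1': ['/dir1/dir2/', '/dir3/dir4/'], 'BUCKET2': ['/dir5/dir6']}
```

A plain dict with one value per key would silently drop the first BUCKET1 path, which is why the sketch uses lists.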

How to find a string between two special characters in python?

I have a set of strings like this:
uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;
I need to extract the sub-string between two semicolons and I only need the first occurrence.
The result should be: C1orf159
I have tried this code, but it does not work:
import re
info = "uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;"
name = re.search(r'\;(.*)\;', info)
print(name.group())
Please help me.
Thanks
You can split the string and limit it to two splits.
x = info.split(';',2)[1]
import re
pattern=re.compile(r".*?;([a-zA-Z0-9]+);.*")
print(pattern.match(info).groups())
This lazily consumes everything up to the first ; with .*?, then captures the alphanumeric string up to the next ;, and finally consumes the rest of the string. The captured text is retrieved through .groups().
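For comparison, the pattern from the question returns too much for two reasons: .* is greedy, so it runs up to the last semicolon, and group() returns the whole match including the semicolons. A small sketch of a corrected version that uses a lazy quantifier and group(1):

```python
import re

info = "uc001acu.2;C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159;"

greedy = re.search(r";(.*);", info)   # .* grabs as much as possible
lazy = re.search(r";(.*?);", info)    # .*? stops at the next semicolon

print(greedy.group(1))  # C1orf159;chr1:1046736-1056736;uc001act.2;C1orf159
print(lazy.group(1))    # C1orf159
```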

How should I use the non-greedy match in this case

Assume I have a string which includes some data fields that are separated by "|", like
|1|2|3|4|5|6|7|8|
My purpose is to get the 8th field. This is what I'm doing:
pattern = re.compile(r'^\s+(\|.*?\|){8}')
match = pattern.match(test_line)
if match:
    print(match.group(8))
But it looks like it cannot match. I know that in this case I need to use ? for a non-greedy match, but why can't I get the 8th field?
Thanks
Regex might be complicating this problem rather than simplifying it. A simple way to get an eighth item from a | delimited string is using split():
a = '|here|is|some|data|separated|by|bars|hooray!|'
print(a.split('|')[8])
RETURNS
hooray!
Using regex, one way to get it would be:
import re
a = '|here|is|some|data|separated|by|bars|hooray!|'
pattern = re.compile(r'([^\|]+)')
match = pattern.findall(a)
print(match[7])
RETURNS
hooray!
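As for why the pattern in the question cannot match: ^\s+ demands at least one whitespace character at the start of the line, each repetition of (\|.*?\|) consumes both an opening and a closing bar (so eight repetitions would need sixteen bars), and a repeated group only keeps its last match, so there is no group 8 to ask for. A hedged sketch of one possible fix, which skips seven fields with a non-capturing group and captures the eighth:

```python
import re

test_line = "|1|2|3|4|5|6|7|8|"

# (?:\|[^|]*){7}  skips seven "|field" chunks without capturing them
# \|([^|]*)\|     captures the eighth field between its two bars
pattern = re.compile(r"^\s*(?:\|[^|]*){7}\|([^|]*)\|")

match = pattern.match(test_line)
if match:
    print(match.group(1))  # 8
```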
