how to split records with non-standard delimiters

In my CSV file I have the following records, each wrapped in brackets and separated by a comma:
(a1,a2,a3),(b1,b2,b3),(c1,c2,c3),(d1,d2,d3)
How do I split the data into a list so that I get something more like this:
a1,a2,a3
b1,b2,b3
c1,c2,c3
d1,d2,d3
Currently my Python code looks like this:
dump = open('sample_dump.csv','r').read()
splitdump = dump.split('\n')
print(splitdump)

You could do something along the lines of:
Remove first and last brackets
Split by ),( character sequence
To split by a custom string, just add it as a parameter to the split method, e.g.:
line.split("),(")
It's a bit hacky, so you'll have to generalize based on any expected variations in your input data format (e.g. will your first/last chars always be brackets?).
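The two steps above can be sketched as a small function. This is a minimal sketch, assuming the whole input is one line that starts with ( and ends with ); the name split_records is made up for illustration, and the explicit check covers the caveat about the first and last characters.

```python
def split_records(line):
    """Split '(a,b),(c,d)' into ['a,b', 'c,d'].

    Hypothetical helper: assumes the records never contain the
    sequence '),(' themselves.
    """
    line = line.strip()  # drop a trailing newline left by file.read()
    if not (line.startswith("(") and line.endswith(")")):
        raise ValueError("expected input wrapped in brackets")
    # Remove the outer brackets, then split on the record separator.
    return line[1:-1].split("),(")


records = split_records("(a1,a2,a3),(b1,b2,b3),(c1,c2,c3),(d1,d2,d3)\n")
for record in records:
    print(record)
```

Raising an error on unexpected input is only one possible choice; returning the line unchanged would work too if malformed lines are expected.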

Try this: split first by ")," and then join and split again by "(" to leave the tuples without brackets:
_line = dump.strip().split("),")
_line = ''.join(_line).split("(")
print(_line)
>> ['', 'a1,a2,a3', 'b1,b2,b3', 'c1,c2,c3', 'd1,d2,d3)']
# drop the first empty element and strip the bracket left on the last one
_line = [item.rstrip(")") for item in _line[1:]]
print(_line)
>> ['a1,a2,a3', 'b1,b2,b3', 'c1,c2,c3', 'd1,d2,d3']

First you need to work out the steps you need to perform in order to get your result. Here's a hacky solution:
remove first and last brackets
use the ),( as the group separator, split
split each group by ,
line = '(a1,a2,a3),(b1,b2,b3),(c1,c2,c3),(d1,d2,d3)'
[group.split(',') for group in line[1:-1].split('),(')]
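If slicing off the outer brackets feels too fragile, a regular expression can pull out each bracketed group directly. This is an alternative to the split-based steps above, not what the answer itself proposes, and it assumes the groups are never nested.

```python
import re

line = '(a1,a2,a3),(b1,b2,b3),(c1,c2,c3),(d1,d2,d3)'

# Capture the text between a '(' and the next ')' that contains no brackets.
groups = re.findall(r'\(([^()]*)\)', line)

# Split each captured group on commas to get the individual fields.
rows = [group.split(',') for group in groups]

print(groups)
print(rows)
```

Because the pattern looks for the brackets themselves, it does not matter whether the line starts or ends with extra characters such as a newline.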

Related

Extract strings between brackets and nested brackets

So I have a file of text and titles, (titles indicated with the starting ";")
;star/stellar_(class(ification))_(chart)
Hertz-sprussels classification of stars is shows us . . .
What I want to do is split it by "_" into
['star/stellar','(class(ification))','(chart)'], iterating through them and extracting what's in the brackets, e.g. '(class(ification))' to {'class':'ification'} and (chart) to just ['chart'].
All I've done so far is the splitting part:
for ln in open(file,"r").read().split("\n"):
    if ln.startswith(";"):
        keys=ln[1:].split("_")
I have ways to extract bits in brackets, but I have had trouble finding a way that supports nested brackets in order.
I've tried things like re.findall('\(([^)]+)',ln) but that returns ['star/stellar', '(class', 'chart']. Any ideas?

You can do this with splits. If you separate the string using '_(' instead of only '_', the second part onward will be an enclosed keyword. You can strip the closing parentheses and split those parts on the '(' to get either one component (if there was no nested parenthesis) or two components. You then form either a one-element list or a dictionary, depending on the number of components.
line = ";star/stellar_(class(ification))_(chart)"
if line.startswith(";"):
    parts = [ part.rstrip(")") for part in line.split("_(")[1:]]
    parts = [ part.split("(",1) for part in parts ]
    parts = [ part if len(part)==1 else dict([part]) for part in parts ]
    print(parts)
[{'class': 'ification'}, ['chart']]
Note that I assumed that the first part of the string is never included in the process and that there can only be one nested group at the end of the parts. If that is not the case, please update your question with relevant examples and expected output.

You can split (again) on the parentheses then do some cleaning:
x = ['star/stellar','(class(ification))','(chart)']
for v in x:
    y = v.split('(')
    y = [a.replace(')','') for a in y if a != '']
    if len(y) > 1:
        print(dict([y]))
    else:
        print(y)
Gives:
['star/stellar']
{'class': 'ification'}
['chart']

If all of the title lines have the same format, that is they all have these three parts ;some/title_(some(thing))_(something), then you can catch the different parts to separate variables:
first, second, third = ln.split("_")
From there, you know that:
for the first item you need to drop the ;:
first = first[1:]
for the second item, you want to extract the stuff in the parentheses and then merge it into a dict (this needs import re at the top of the file):
k, v = filter(bool, re.split('[()]', second))
second = {k:v}
for the third item, you want to drop the surrounding parentheses
third = third[1:-1]
Then you just need to put them all together again:
[first, second, third]
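Put together, the steps above might look like the following. This is a sketch under the same assumption as the answer, namely that every title line has exactly three "_"-separated parts; the function name parse_title is made up for illustration.

```python
import re


def parse_title(ln):
    """Parse ';name_(key(value))_(item)' into [name, {key: value}, item].

    Hypothetical helper: assumes exactly three '_'-separated parts.
    """
    first, second, third = ln.split("_")
    first = first[1:]  # drop the leading ';'
    # re.split on brackets leaves empty strings; filter(bool, ...) drops them.
    k, v = filter(bool, re.split(r"[()]", second))
    third = third[1:-1]  # drop the surrounding parentheses
    return [first, {k: v}, third]


print(parse_title(";star/stellar_(class(ification))_(chart)"))
```

A title with extra underscores, or a second part without a nested group, makes the unpacking raise ValueError, so this only fits files where the format really is fixed.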

How to remove first part of URL string in column value with Pandas?

I'm struggling to remove the first part of the URLs in the myID column of my CSV file.
my.csv
myID
https://mybrand.com/trigger:open?Myservice=Email&recipient=brn:zib:b1234567-9ee6-11b7-b4a2-7b8c2344daa8d
desired output for myID
b1234567-9ee6-11b7-b4a2-7b8c2344daa8d
my code:
df['myID'] = df['myID'].map(lambda x: x.lstrip('https://mybrand.com/trigger:open?Myservice=Email&recipient=brn:zib:'))
output in myID (first letter 'b' is missing in front of the string):
1234567-9ee6-11b7-b4a2-7b8c2344daa8d
The above code removes https://mybrand.com/trigger:open?Myservice=Email&recipient=brn:zib: as intended. However, it also removes the first character of the ID when that character is a letter; if it is a digit, the ID is left unchanged.
Could someone help with this? thanks!

You could try a regex replacement here:
df['myID'] = df['myID'].str.replace('^.*:', '', regex=True)
This approach simply removes all content from the start of myID up to, and including, the final colon. This leaves behind the ID you want to keep.
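The same pattern can be checked outside pandas with the standard re module. A minimal sketch using the sample URL from the question:

```python
import re

url = ("https://mybrand.com/trigger:open?Myservice=Email"
       "&recipient=brn:zib:b1234567-9ee6-11b7-b4a2-7b8c2344daa8d")

# '.*' is greedy, so '^.*:' runs from the start of the string
# to the LAST colon, not the first one.
my_id = re.sub(r'^.*:', '', url)
print(my_id)  # b1234567-9ee6-11b7-b4a2-7b8c2344daa8d
```

This relies on the ID itself never containing a colon.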

With lstrip you remove all leading characters that are in the set of characters you pass as an argument; it does not remove a literal prefix. So:
string = 'abcd'
test = string.lstrip('ad')
print(test)  # prints 'bcd': only the leading 'a' is stripped
If the part you want to keep has a fixed length, you can just slice the string like an array. For you, that would be something like:
df['myID'] = df['myID'].map(lambda x: x[-37:])
However, for this to work, the part you want to get from the string must have a constant size (37 characters here).
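To see why the leading 'b' disappears in the question, and how to drop the prefix without relying on a fixed length, here is a minimal sketch. The helper remove_prefix is a made-up name for illustration.

```python
PREFIX = "https://mybrand.com/trigger:open?Myservice=Email&recipient=brn:zib:"
url = PREFIX + "b1234567-9ee6-11b7-b4a2-7b8c2344daa8d"

# lstrip treats its argument as a SET of characters. The prefix contains
# a 'b', so the leading 'b' of the ID is stripped as well.
print(url.lstrip(PREFIX))


def remove_prefix(text, prefix):
    """Hypothetical helper: drop `prefix` only if `text` starts with it."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


print(remove_prefix(url, PREFIX))
```

With pandas this could be applied as df['myID'].map(lambda x: remove_prefix(x, PREFIX)). On Python 3.9 and later, the built-in str.removeprefix does the same job.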

You can use re (if the part before what you want to extract is always the same)
import re
idx = re.search(r':zib:', myID)
myNewID = myID[idx.end():]
Then you will have :
myNewID
'b1234567-9ee6-11b7-b4a2-7b8c2344daa8d'

How to execute it correctly?

list1 = ['192,3.2', '123,54.2']
yx = ([float(i) for i in list1])
print(list1)
This is the code I have. I am trying to learn, for future reference, how to remove the , within a list of strings. I tried various things like mapping, but the mapping would not work because of the comma within the number.

If you want to remove commas from a string, use:
list1 = string.split(",")
The string variable contains your input string, and you get the output as a list. Join the list if you want the original string without the commas:
string_joined = "".join(list1)
string_joined will contain your string without the commas.
If you want to remove the comma but keep an empty space at that position, use:
string = string.replace(","," ")
The first two steps can also be shortened to a single call:
string = string.replace(",","")
Now, if you want to iterate over your list of strings, handle each element (string) in the list one at a time:
for string in list1:
    ...  # your code goes here
Hope this answers what you are looking for.

We can use a regex to remove everything except digits and the decimal point:
import re
print([float(re.sub("[^0-9.]", "", s)) for s in list1])
without regex:
[float(s.replace(',','')) for s in list1 ]
output:
[1923.2, 12354.2]

how to split findall result which contain "," in data

x = re.findall(r'FROM\s(.*?\s)(WHERE|INNER|OUTER|JOIN|GROUP)', data, re.DOTALL)
I am using the above expression to parse an Oracle SQL query and get the result.
I get multiple matches and want to print each of them on its own line.
How can I do that?
Some results even have "," in them.

You can try this:
for elt in x:
    print('\n'.join(elt.split(',')))
split returns a list of the comma-separated elements, which are then joined again with \n (new line). Therefore, you get one result per line.

Your result is returned in a list.
from https://docs.python.org/2/library/re.html:
re.findall(pattern, string, flags=0) Return all non-overlapping
matches of pattern in string, as a list of strings.
You should be able to iterate over the returned list easily with a for loop:
for matchedString in x:
    # remove the commas
    n = matchedString.replace(',','')
    # add to a new list, print, or apply any other logic
    print(n)
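One caveat worth checking: when a pattern contains more than one capturing group, re.findall returns a list of tuples rather than a list of strings, so calling .split or .replace directly on each result would fail. The sketch below assumes the pattern from the question was meant to close right after GROUP, and uses a made-up query string as data.

```python
import re

# Hypothetical query, only for illustration.
data = "SELECT a FROM emp e, dept d WHERE e.id = d.id"

# Two capturing groups, so each match is a (tables, keyword) tuple.
matches = re.findall(r'FROM\s(.*?\s)(WHERE|INNER|OUTER|JOIN|GROUP)',
                     data, re.DOTALL)

for tables, keyword in matches:
    # Split the table list on commas and print one entry per line.
    for table in tables.split(','):
        print(table.strip())
```

If only the text after FROM is needed, turning the second group into a non-capturing one, (?:WHERE|...), makes findall return plain strings again.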

How to remove brackets from python string?

I know from the title you might think that this is a duplicate but it's not.
for id,row in enumerate(rows):
    columns = row.findall("td")
    teamName = columns[0].find("a").text, # Lag
    playedGames = columns[1].text, # S
    wins = columns[2].text,
    draw = columns[3].text,
    lost = columns[4].text,
    dif = columns[6].text, # GM-IM
    points = columns[7].text, # P - last column
    dict[divisionName].update({id :{"teamName":teamName, "playedGames":playedGames, "wins":wins, "draw":draw, "lost":lost, "dif":dif, "points":points }})
This is what my Python code looks like. Most of the code is removed, but essentially I am extracting some information from a website and saving it as a dictionary. When I print the dictionary, every value has brackets around it, like ["blbal"], which causes trouble in my iPhone application. I know that I can convert the variables to strings, but I want to know if there is a way to get the information DIRECTLY as a string.

That looks like you have a string inside a list:
["blbal"]
To get the string, just index the list:
l = ["blbal"]
print(l[0])  # blbal
If it is a string, use str.strip, '["blbal"]'.strip("[]"), or slicing, '["blbal"]'[1:-1], if the brackets are always present.
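It may also be worth looking at the trailing commas in the question's code. In Python, a comma after the right-hand side of an assignment creates a one-element tuple, and a tuple is written out as a list by serializers such as json, which would explain the brackets. A minimal sketch, where the cell text and the use of json are assumptions for illustration:

```python
import json

cell_text = "blbal"  # stand-in for columns[0].find("a").text

with_comma = cell_text,     # trailing comma: one-element tuple
without_comma = cell_text   # no comma: plain string

print(type(with_comma).__name__, type(without_comma).__name__)

# A tuple is serialized as a JSON array, hence the brackets.
print(json.dumps({"teamName": with_comma}))
print(json.dumps({"teamName": without_comma}))
```

If that is what is happening, dropping the trailing commas gives the values directly as strings, with no indexing or stripping needed afterwards.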

You can also use replace to swap the text or symbols that you don't want for the empty string.
text = ["blbal","test"]
strippedText = str(text).replace('[','').replace(']','').replace('\'','').replace('\"','')
print(strippedText)  # blbal, test

import re
text = "some (string) [another string] in brackets"
re.sub(r"\(.*?\)", "", text)
# 'some  [another string] in brackets'
# Works for () and will work for [] if you replace \( and \) with \[ and \].
The \(.*?\) pattern matches round brackets together with the text inside them, whatever its length. The \[.*?\] pattern does the same for square brackets.
The output will contain neither the matched brackets nor the text inside them.
To match a different kind of bracket, swap the bracket characters in the pattern for the ones you need.
To match both () and [] in one go, join the two patterns with the | character: (\(.*?\)|\[.*?\])
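A short sketch of the combined pattern on the sample text. Collapsing the leftover whitespace is an extra step added here for tidiness, not part of the answer above.

```python
import re

text = "some (string) [another string] in brackets"

# Remove (...) and [...] groups in a single pass.
no_brackets = re.sub(r"\(.*?\)|\[.*?\]", "", text)

# The removal leaves extra spaces behind; collapse runs of whitespace.
cleaned = " ".join(no_brackets.split())

print(cleaned)  # some in brackets
```

The non-greedy .*? stops at the first closing bracket, so nested brackets of the same kind are not handled by this pattern.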
