Convert string from `ABCD0011` to `ABCD_11` - python

I have the following CSV file:
ABCD0011
ABCD1404
ABCD1255
There are many such rows in the CSV file which I want to convert as follows:
Input      Expected Output  Actual Output
ABCD0011   ABCD_11_0        ABCD_0011_0
ABCD1404   ABCD_1404_0      ABCD_144_0
ABCD1255   ABCD_1255_0      ABCD_1255_0
Basically, it takes the leading zeros after the letters and replaces them with an underscore ("_").
Code
import numpy as np
import pandas as pd
df = pd.read_csv('Book1.csv')
df.A = df.A.str.replace('[0-9]+', '')+'_'+df.A.str.replace('([A-Z])+', '')+'_0'
Actual Output and Issues
I got the values without leading zeros converted correctly, e.g.
from ABCD1255 to ABCD_1255_0.
But for values with leading zeros it failed; for example,
ABCD0011 became ABCD_0011_0, i.e. the leading zeros were kept.
It even failed for values with zeros inside, e.g.
ABCD1404 became ABCD_144_0: the zero in the middle was deleted.
Question
How can I fix this issue?

If we know the input strings will always be eight characters, with the first four being letters and the last four being digits, we could:
>>> s = "ABCD0011"
>>> f"{s[:4]}_{int(s[4:])}_0"
'ABCD_11_0'
If we don't know the lengths for sure, we can use re.sub with a lambda that transforms the two capture groups:
>>> import re
>>> re.sub(r'([a-zA-Z]+)(\d+)', lambda m: f"{m.group(1)}_{int(m.group(2))}_0", s)
'ABCD_11_0'
>>> re.sub(r'([a-zA-Z]+)(\d+)', lambda m: f"{m.group(1)}_{int(m.group(2))}_0", 'A709')
'A_709_0'
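Since the question actually uses pandas, the same transformation can be applied to the whole column with Series.str.replace. A minimal sketch, assuming the column is named A as in the question (the sample frame here is made up to match the question's data):
import pandas as pd

df = pd.DataFrame({'A': ['ABCD0011', 'ABCD1404', 'ABCD1255']})
# let 0* swallow the leading zeros so the captured number keeps none of them
df.A = df.A.str.replace(r'([A-Za-z]+)0*(\d+)', r'\1_\2_0', regex=True)
print(df.A.tolist())  # ['ABCD_11_0', 'ABCD_1404_0', 'ABCD_1255_0']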

Ignoring the apparent requirement for a dataframe, this is how you could parse the file and generate the strings you need. It uses re and does not need numpy or pandas.
import re
FILENAME = 'well.csv'
PATTERN = re.compile(r'^([a-zA-Z]+)(\d+)$')
with open(FILENAME) as csv_data:
    next(csv_data)  # skip header(s)
    for line in csv_data:
        if m := PATTERN.search(line):
            print(f'{m.group(1)}_{int(m.group(2))}_0')
This will work for the data shown in the question. Other data structures may cause this to fail.
Output:
ABCD_11_0
ABCD_1404_0
ABCD_1255_0

Related

Split List Elements in byte format to separate bytes in python

I have a list with byte elements like this:
list = [b'\x00\xcc\n', b'\x14I\x8dy_\xeb\xbc1C']
Now I want to separate all the bytes like the following:
list_new =[b'\x00', b'\xcc', b'\x14I', b'\x8dy_', b'\xeb', b'\xbc1C']
I am assuming here that you want to split the data on '\x'; this seems to match your desired output, but let me know otherwise. Also, I am not sure why you got this type of string; it is a little awkward to work with, and a bigger context for the question would be helpful. Nevertheless, I tried to get your desired output in the following way (maybe not efficient, but it gets the job done):
import re
from codecs import encode

lists = [b'\x00\xcc\n', b'\x14I\x8dy_\xeb\xbc1C']

## splitting with assumption of \x, using lookarounds here
split = [re.split(r'(?=\\x)', str(item)) for item in lists]

output = []  ## container to save the final items
for item in split:  ## split is a list of lists, hence the two for loops
    for nitem in item:
        if nitem != "b'":  ## remove anything which is only "b'"
            output.append(nitem.replace('\\n', '').replace("'", '').encode())  ## finally appending every item

## Note that the output items contain two backslashes; to remove them we use the
## encode function from the codecs module, like below
[encode(itm.decode('unicode_escape'), 'raw_unicode_escape') for itm in output]
Output:
[b'\x00', b'\xcc', b'\x14I', b'\x8dy_', b'\xeb', b'\xbc1C']
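The same repr-based idea can be written a bit more compactly; a sketch (the variable names are mine, not from the answer):
import re

lists = [b'\x00\xcc\n', b'\x14I\x8dy_\xeb\xbc1C']
out = []
for item in lists:
    body = repr(item)[2:-1]  # drop the leading b' and the trailing '
    for chunk in re.split(r'(?=\\x)', body):
        if chunk:  # the split yields an empty string before the first \x
            cleaned = chunk.replace('\\n', '')
            out.append(cleaned.encode().decode('unicode_escape').encode('raw_unicode_escape'))
print(out)  # [b'\x00', b'\xcc', b'\x14I', b'\x8dy_', b'\xeb', b'\xbc1C']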

Reading irregular column data into python 3.X using pandas or numpy

Below is my piece of code.
import numpy as np
filename1=open(f)
xf = np.loadtxt(filename1, dtype=float)
Below is my data file.
0.14200E+02 0.18188E+01 0.44604E-03
0.14300E+02 0.18165E+01 0.45498E-03
0.14400E+02-0.17694E+01 0.44615E+03
0.14500E+02-0.17226E+01 0.43743E+03
0.14600E+02-0.16767E+01 0.42882E+03
0.14700E+02-0.16318E+01 0.42033E+03
0.14800E+02-0.15879E+01 0.41196E+03
As one can see, there are negative values whose minus sign takes up the space between two values, and this causes numpy to give:
ValueError: Wrong number of columns at line 3
This is just a small snippet of my code. I want to read this data using numpy or pandas. Any suggestion would be great.
Edit 1:
@ZarakiKenpachi I used your suggestion of sep=' |-' but it gives me an extra 4th column with NaN values.
Edit 2:
@Serge Ballesta nice suggestion, but all these are some kind of pre-processing. I want some kind of inbuilt function to do this in pandas or numpy.
Edit 3:
Important note: there are also negative signs inside the numbers themselves, e.g. in 0.4373E-03.
Thank you!
np.loadtxt can read from a (byte string) generator, so you can filter the input file while loading it to add an additional space before each minus sign:
import re
import numpy as np

def filter(fd):
    rx = re.compile(rb'(\d)-')  # a '-' directly preceded by a digit
    for line in fd:
        yield rx.sub(rb'\1 -', line)  # keep the digit, insert a space

xf = np.loadtxt(filter(open(f, 'rb')), dtype=float)
This does not require preloading everything into memory, so it is expected to be memory efficient.
The regex is required to avoid changing something like 0.16545E-012.
In my tests on 10k lines, this is at most about 10% slower than loading everything into memory, but it requires far less memory.
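As a quick self-contained check of the filter (the helper name and the in-memory sample are mine; the rows are taken from the question's data):
import io
import re
import numpy as np

def add_space_before_minus(fd):
    rx = re.compile(rb'(\d)-')
    for line in fd:
        yield rx.sub(rb'\1 -', line)

sample = (b"0.14400E+02-0.17694E+01 0.44615E+03\n"
          b"0.14500E+02-0.17226E+01 0.43743E+03\n")
xf = np.loadtxt(add_space_before_minus(io.BytesIO(sample)), dtype=float)
print(xf.shape)  # (2, 3)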
You can pre-process your data to add an additional space before the - signs. While there are many ways of doing it, the best approach in my opinion (in order to avoid adding whitespace at the start of a line) is to use regex re.sub:
import re
import numpy as np

with open(f) as file:
    raw_data = file.read()

# put a space between a digit and a '-' that directly follows it
processed_data = re.sub(r'(\d)-', r'\1 -', raw_data)
xf = np.loadtxt(processed_data.splitlines(), dtype=float)
This replaces every - that is directly preceded by a digit with " -", so the columns become separated while each minus sign stays attached to its number.
Try the below code:
import re
import numpy as np

with open('app.txt') as f:
    data = f.read()

data_mod = []
for number in data.split('\n')[:-1]:
    # does this line contain a '-' squeezed between two values?
    num = re.findall(r'[\w\.-]+-[\w\.-]', number)
    for n in num:
        # if so, put a space before every '-' in the line
        number = number.replace('-', ' -')
    data_mod.append(number)

with open('mod_text.txt', 'w') as f:
    for data in data_mod:
        f.write(data + "\n")

filename1 = 'mod_text.txt'
xf = np.loadtxt(filename1, dtype=float)
Actually you have to pre-process the data using regex; after that you can load the data as you require.
I hope this helps.
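Regarding Edit 2 and the wish for a built-in function: if the columns really are fixed width, as the sample rows suggest, pandas.read_fwf can read them directly with no pre-processing. A sketch; the file name and the widths (counted from the rows shown above) are assumptions:
import pandas as pd

# 11 characters for the first column, 12 for each of the other two
xf = pd.read_fwf('data.txt', widths=[11, 12, 12], header=None).values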

How to use regex as delimiter function while reading a file into numpy array or similar

Well I have this txt file:
.xsh 1:
..sxi
..kuxz
...iucdb
...khjub
..kjb
.hjub 2:
..ind
..ljnasdc
...kicd
...lijnbcd
.split 3:
..asd
I want to load this file into a numpy array (because numpy is fast to work with), and to make it faster I want to begin parsing while loading. So to speak, I want the file split on every delimiter:
delim = '(^\.\w+\s\d+\:)'
Now I have tried to do it like this:
import numpy as np
import os,re
path = 'C:\\temp'
filename = 'file.txt'
delim = '(^\.\w+\s\d+\:)'
delimFunc = (lambda s: re.split(delim,s))
fname = os.path.join(path,filename)
ar=np.loadtxt(fname, dtype = str, delimiter = delimFunc)
print len(ar)
Here it does not split the way I want to; instead it splits on every newline. Is it possible to make numpy, pandas or any other fast library behave the way I want here?
I want this result:
[[.xsh 1:
..sxi
..kuxz
...iucdb
...khjub
..kjb]
[.hjub 2:
..ind
..ljnasdc
...kicd
...lijnbcd]
[.split 3:
..asd]]
I think pandas supports this out of the box, if that is an option for you.
Have a look at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html,
in particular the sep arg:
sep : str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'
You can also turn pandas dataframes back into numpy arrays without too much trouble using the .values attribute, iirc
(https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html).
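A tiny illustration of the mechanism (the file content and separator here are made up; this shows the regex separator and the .values round trip, not the block-grouping itself):
import io
import pandas as pd

# a two-column file with a multi-character separator, which pandas
# treats as a regular expression and parses with the Python engine
txt = "a::1\nb::2\n"
df = pd.read_csv(io.StringIO(txt), sep='::', engine='python', header=None)
print(df.values)
# [['a' 1]
#  ['b' 2]]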
I had to solve it differently, but it is still faster than before:
import numpy as np
import os, re
import time

t1 = time.time()
path = 'C:\\temp'
filename = 'file.txt'
delim = '(^\.\w+\s\d+\:)'
fname = os.path.join(path, filename)

ar = np.loadtxt(fname, dtype=str, delimiter='\n')
x = np.array([], np.int32)
for (i, v) in enumerate(ar):
    if re.search(delim, v):
        x = np.append(x, i)
t2 = time.time()

print np.split(ar, x)[1]
print 'Length of array:{0:d},took as long as {1:.2f} to complete'.format(len(x), (t2 - t1))
I would go about it like this:
...
d = re.compile(delim)
# np.nonzero in this case returns a 1-tuple of arrays, so we have to unwrap it
ixs = np.nonzero([d.search(item) for item in ar])[0]
splitted = np.split(ar, ixs if ixs[0] else ixs[1:])
...
The ixs if ixs[0] else ixs[1:] expression takes care of the possibility of a valid "delimiter" in the first record, to achieve the type of result shown in the original question (i.e., no empty leading record).
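Filling in the elided pieces with sample data from the question, a runnable version of this sketch looks like:
import re
import numpy as np

ar = np.array(['.xsh 1:', '..sxi', '..kuxz', '.hjub 2:', '..ind'])
d = re.compile(r'^\.\w+\s\d+:')
# indices of the records that are "delimiter" lines
ixs = np.nonzero([d.search(item) is not None for item in ar])[0]
splitted = np.split(ar, ixs if ixs[0] else ixs[1:])
print(splitted)  # [array(['.xsh 1:', '..sxi', '..kuxz']), array(['.hjub 2:', '..ind'])]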
Is this anything like what you're looking for?
s = """
.xsh 1:
..sxi
..kuxz
...iucdb
...khjub
..kjb
.hjub 2:
..ind
..ljnasdc
...kicd
...lijnbcd
.split 3:
..asd
"""
print(re.findall(r"(\.\w+ \d+:)\s*((?:.(?!\.\w+ \d+:))+)", s, re.M|re.DOTALL))
>>> [('.xsh 1:', '..sxi \n..kuxz \n...iucdb \n...khjub \n..kjb '), ('.hjub 2:', '..ind \n..ljnasdc \n...kicd \n...lijnbcd '), ('.split 3:', '..asd\n')]

Formatting form data?

I need to convert the form data below to a slightly different format to be able to submit it correctly.
I have this form data.
PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6j
NWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp
CEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY
YMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrc
pDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ
FfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehE
d4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:
Wanted format:
PaReq=eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz%2FlO6en5PrLZCs6j%0D%0ANWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA%2F6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp%0D%0ACEXF7Pg8X9JRgAIICbhCWz9wMY%2Boj%2FEYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY%0D%0AYMu12NOtlOUUgKZp3N%2BikGUsRbF3WeHWO0CAVphXgMdnkFWtiap%2FY5sldBGFjf1Yuzzv0PL8evrc%0D%0ApDMCtMLqk1hyiqCHoT%2F0HIimCE%2FHmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ%0D%0AFfC2LHKuzqg%2B3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ%2BtakbbhOfr6h9sjC65rpSehE%0D%0Ad4Yy1TXkQb9zlNkWEmD%2Br642A6n71A0vHRBwP9j%2F7TDLBQ%3D%3D%0D%0A&TermUrl=https%3A%2F%2Fwww.footpatrol.co.uk%2Fcheckout%2F3d&MD=
I have tried this, but it seems to produce a different format from the one I need to submit:
Code:
import urllib.parse
print(urllib.parse.quote_plus('''PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6j
NWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNp
CEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQY
YMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrc
pDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQ
FfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehE
d4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:'''))
Is this obtainable with Python? And what do I need to do to achieve the wanted end result?
If your parameters are separated by newlines, you can use the splitlines method to get a list of parameters and use re.split on each item to get a [name, value] pair.
Then apply quote_plus to each name and value, '='.join them, and '&'.join all the parameters.
import urllib.parse
import re
data = '''PaReq:eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6jNWe14HTgOGTBqypOuZMls6cydrGHgwn2UOA/6bISrMIvfrzsFfrjosqKnHoudBEBBpryggu2jXNpCEXF7Pg8X9JRgAIICbhCWz9wMY+oj/EYDyfwugi40FaWxwdOPyJnXRZCVgR02JZZUedSnKiPJgQYYMu12NOtlOUUgKZp3N+ikGUsRbF3WeHWO0CAVphXgMdnkFWtiap/Y5sldBGFjf1Yuzzv0PL8evrcpDMCtMLqk1hyiqCHoT/0HIimCE/HmICO78V10OapNxy5QaDiukBbL7WT8CbSmj7VS6QWgufMRGKQFfC2LHKuzqg+3vY9v7xidBg5VTcryqfGt4QeAyEv73c9Z1J1LwxZ+takbbhOfr6h9sjC65rpSehEd4Yy1TXkQb9zlNkWEmD+r642A6n71A0vHRBwP9j/7TDLBQ==
TermUrl:https://www.footpatrol.co.uk/checkout/3d
MD:'''
data = [re.split(':(?!//)', line) for line in data.splitlines()]
data = '&'.join('='.join(urllib.parse.quote_plus(i) for i in l) for l in data)
If your data is split by newlines arbitrarily, you could join the lines and split by name. Then zip names and values, quote and join.
data = ''.join(data.splitlines())
data = zip(['PaReq', 'TermUrl', 'MD'], re.split('PaReq:|TermUrl:|MD:', data)[1:])
data = '&'.join('='.join(urllib.parse.quote_plus(i) for i in l) for l in data)
If you want to keep the newline character, use only the last two lines in the second code snippet.
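As a side note, once the three fields are parsed into (name, value) pairs, urllib.parse.urlencode (which quotes with quote_plus by default) can do the quoting and the '='/'&' joining in one call. A sketch; the truncated PaReq value is a stand-in:
import urllib.parse

pareq = 'eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz/lO6en5PrLZCs6j'  # stand-in for the full value
pairs = [('PaReq', pareq),
         ('TermUrl', 'https://www.footpatrol.co.uk/checkout/3d'),
         ('MD', '')]
print(urllib.parse.urlencode(pairs))
# PaReq=eJxdUt1ugjAYvfcpyB6AlvpTMLUJG1vmEp2Z7mKXpHRIVMBSBvr0a9FatAlJz%2FlO6en5PrLZCs6j&TermUrl=https%3A%2F%2Fwww.footpatrol.co.uk%2Fcheckout%2F3d&MD=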

Python: How split data into different data types into 2D array

I'm trying to split downloaded data into a 2D array with different data types. The downloaded data looks like this:
000|17:40
000|17:45
010|17:50
025|17:55
056|18:00
178|18:05
202|18:10
203|18:15
190|18:20
072|18:25
013|18:30
002|18:35
000|18:40
000|18:45
000|18:50
000|18:55
000|19:00
000|19:05
000|19:10
000|19:15
000|19:20
000|19:25
000|19:30
000|19:35
000|19:40
I’m using the following code to parse this into a two dimensional array:
#!/usr/bin/python
import urllib2
response = urllib2.urlopen('http://gps.buienradar.nl/getrr.php?lat=52&lon=4')
html = response.read()
htmlsplit = []
for record in html.split("\r\n"):
    htmlsplit.append(record.split("|"))
print htmlsplit
This is working great but, as expected, it treats everything as a string. I've found some examples that split into integers; that's great if both sides were integers, but in my case it's an integer | string (or maybe some kind of Python time format).
How can I split this directly into different data types?
Something like this?
for record in html.split("\r\n"):  # beware, newlines are treacherous!
    s = record.split("|")
    htmlsplit.append((int(s[0]), s[1]))
Just write a parser for each record, if you have data this simple. However, I would add a try/except clause to catch errors for non-conforming lines, empty lines, etc., which may be present in the data; the code above is very fragile. Also, you might want to break only at \n and then clean your strings with strip() (i.e. replace s[1] by s[1].strip()); the integer conversion takes care of surrounding whitespace automatically.
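A sketch of the more defensive loop described above (the sample payload and variable names are mine):
html = "000|17:40\r\n010|17:50\r\n\r\n"  # sample payload with a trailing empty line
htmlsplit = []
for record in html.splitlines():
    try:
        count, clock = record.split("|")
        htmlsplit.append((int(count), clock.strip()))
    except ValueError:
        pass  # skip empty or malformed lines instead of crashing
print(htmlsplit)  # [(0, '17:40'), (10, '17:50')]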
Use str.splitlines instead of splitting on \r\n
Use the csv module to iterate over the lines:
import csv
txt = '000|17:40\n000|17:45\n000|17:50\n000|17:55\n000|18:00\n000|18:05\n000|18:10\n000|18:15\n000|18:20\n000|18:25\n000|18:30\n000|18:35\n000|18:40\n000|18:45\n000|18:50\n000|18:55\n000|19:00\n000|19:05\n000|19:10\n000|19:15\n000|19:20\n000|19:25\n000|19:30\n000|19:35\n000|19:40\n'
reader = csv.reader(txt.splitlines(), delimiter='|')
column1 = []
column2 = []
for c1, c2 in reader:
    column1.append(c1)
    column2.append(c2)
You can also use the DictReader
import StringIO
reader2 = csv.DictReader(StringIO.StringIO(txt),
                         fieldnames=['int', 'time'],
                         delimiter='|')
column1 = []
column2 = []
for row in reader2:
    column1.append(row['int'])
    column2.append(row['time'])
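If the second column should become a real Python time rather than a string, datetime.strptime can promote it; a small sketch with made-up values:
from datetime import datetime

# turn the 'HH:MM' text into datetime.time objects
times = [datetime.strptime(s, '%H:%M').time() for s in ['17:40', '17:45']]
print(times)  # [datetime.time(17, 40), datetime.time(17, 45)]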
