Regex: replace comma in between a string in Python

I have a list of rows like the ones below.
There should be 10 columns in total, but the CSV generated using tabula has only 9:
ELECSERV FINALED (the string values may vary) comes out as a single column. I want to split it into two columns separated by a comma, and then remove the trailing comma at the end of each row.
D12-1234,041-260-32,714 EL DFRO ST,ELECSERV FINALED,0,$0.00,10/15/2009 ,CONSTRUCTION,Electrical service upgrade from 100 amp to 200 amp (same location),
D12-1235,037-071-07,127 S HORN DR,ELECSERV ISSUED,0,$0.00,10/22/2009 ,"AGANS & ELLIOTT, INC, A&E ELECTRIC",Service upgrade (same location),
Output should be like this:
D12-1234,041-260-32,714 EL DFRO ST,ELECSERV,FINALED,0,$0.00,10/15/2009 ,CONSTRUCTION,Electrical service upgrade from 100 amp to 200 amp (same location)
D12-1235,037-071-07,127 S HORN DR,ELECSERV,ISSUED,0,$0.00,10/22/2009 ,"AGANS & ELLIOTT, INC, A&E ELECTRIC",Service upgrade (same location)

I hope this is what you want. Also make sure the last line ends with a \n for this to work properly; you can add an if statement to check whether the last character is \n.
with open("t.csv",'r') as myfile ,open ('out.csv','w') as outputfile:
for line in myfile:
outputfile.write(line[:-2]+"\n")
Edit: at the end, you can write the result back to the old file.
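For example, a guarded version of that write (a sketch only: it strips the newline if present and then the trailing comma, so it also works if the last line lacks a \n):
with open("t.csv") as myfile, open("out.csv", "w") as outputfile:
    for line in myfile:
        # remove the newline if there is one, then the trailing comma
        stripped = line.rstrip("\n").rstrip(",")
        outputfile.write(stripped + "\n")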

You can first split the line with , as the delimiter.
cols = line.split(',')
Now take the 4th element of the resulting list and replace its space with a comma.
cols[3] = cols[3].replace(' ', ',')
Join the list back into a string, and remove the last comma using rstrip.
','.join(cols).rstrip(',')
Edit 1:
Please refer to the following snippet; it works for me.
line = 'D12-1234,041-260-32,714 EL DFRO ST,ELECSERV FINALED,0,$0.00,10/15/2009 ,CONSTRUCTION,Electrical service upgrade from 100 amp to 200 amp (same location),'
cols = line.split(',')
cols[3] = cols[3].replace(' ', ',')
print(','.join(cols).rstrip(','))
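Note that a plain split(',') will break the quoted "AGANS & ELLIOTT, INC, A&E ELECTRIC" field in your second sample row. A minimal sketch using the csv module instead (assuming the input is t.csv and the merged column is always the 4th field):
import csv

with open('t.csv', newline='') as src, open('out.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        if row and row[-1] == '':
            row = row[:-1]  # drop the empty field left by the trailing comma
        # split e.g. "ELECSERV FINALED" at its space into two fields
        first, _, second = row[3].partition(' ')
        row[3:4] = [first, second]
        writer.writerow(row)
csv.writer re-quotes any field that contains a comma on the way out, so the AGANS & ELLIOTT field stays intact.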

Related

Python to remove extra delimiter

We have a 100 MB pipe-delimited file with 5 columns (4 delimiters) per row, each separated by a pipe. However, there are a few rows where the second column contains an extra pipe, so those rows have 5 delimiters in total.
For example, of the 4 rows below, the 3rd is problematic because it has an extra pipe.
1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6
Is there any way to remove the extra pipe from the second column wherever a row's delimiter count is 5? After correction, the file should look like this:
1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6
Please note that the file size is 100 MB. Any help is appreciated.
Source: my_file.txt
1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6
E|1 |9 |2 |8 |Not| a |text|!!!|3|7|4
Code
# If using Python 3.10+, this can be written with parenthesized context managers:
# https://docs.python.org/3.10/whatsnew/3.10.html#parenthesized-context-managers
with open('./my_file.txt') as file_src, open('./my_file_parsed.txt', 'w') as file_dst:
    # Iterate line by line rather than with readlines(), so the 100 MB file
    # is never held in memory all at once.
    for line in file_src:
        # Split the line by the character '|'
        line_list = line.split('|')
        if len(line_list) <= 5:
            # If the number of columns doesn't exceed 5, just write the original line as is.
            file_dst.write(line)
        else:
            # If the number of columns exceeds 5, count how many columns should be merged.
            to_merge_columns_count = (len(line_list) - 5) + 1
            # Merge the columns from index 1 up to and including the last one to merge.
            merged_column = "".join(line_list[1:1 + to_merge_columns_count])
            # Replace all the items from index 1 onward with the single merged column.
            line_list[1:1 + to_merge_columns_count] = [merged_column]
            # Write the updated line.
            file_dst.write("|".join(line_list))
Result: my_file_parsed.txt
1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6
E|1 9 2 8 Not a text!!!|3|7|4
A simple regular expression pattern like this works on Python 3.7.3:
import re

bad_pipe_re = re.compile(r"[ \w]+\|[ \w]+(\|)[ \w]+\|[ \w]+\|[ \w]+\|[ \w]+\n")
with open("input", "r") as fp_1, open("output", "w") as fp_2:
    line = fp_1.readline()
    while line != "":
        mo = bad_pipe_re.fullmatch(line)
        if mo is not None:
            # splice out the captured extra pipe
            line = line[:mo.start(1)] + line[mo.end(1):]
        fp_2.write(line)
        line = fp_1.readline()

Remove initial space from lines, keeping other spaces, in python/pandas

I need to remove the initial space from lines, as shown below:
From
(space)09/Mar/21 16,997,520.6
To
09/Mar/21 16,997,520.6
I've tried this: remove spaces from beginning of lines, but that removed all the spaces.
Assuming from your question that you have a file loaded with multiple lines (where 09/Mar/21 16,997,520.6 is one of those lines), have you tried something like:
for line in file:
    line = line.strip()
    # <--- do stuff with the line --->
Use .lstrip(' ')
string = ' 09/Mar/21 16,997,520.6'
print(string.lstrip(' '))
>>> 09/Mar/21 16,997,520.6
The .lstrip() method treats its argument as a set of characters and removes any of them from the beginning of the string.
For example:
print(string.lstrip(' 09'))
>>> /Mar/21 16,997,520.6
In plain Python:
" 09/Mar/21 16,997,520.6".lstrip()
or pandas-specific:
import pandas as pd

df = pd.DataFrame([" 09/Mar/21 16,997,520.6"], columns=['dummy_column'])
df['dummy_column'].str.lstrip()

Correct mistakes in a Python program dealing with CSV

I'm trying to edit a CSV file using information from another one. That doesn't seem simple to me, as I have to filter on multiple things. Let me explain my problem.
I have two CSV files, let's say patch.csv and origin.csv. The output CSV file should have the same layout as origin.csv, but with corrected values.
I want to replace the trip_headsign column fields in origin.csv using the forward_line_name column of patch.csv if the direction_id field of the origin.csv row is 0, or using backward_line_name if direction_id is 1.
I want to do this only if the part of the line_id value in patch.csv between the two ":" symbols is the same as the part of the route_id value in origin.csv before the ":" symbol.
I know how to replace a whole line, but not only parts of it, especially since I sometimes have to look at only part of a value.
Here is a sample of origin.csv:
route_id,service_id,trip_id,trip_headsign,direction_id,block_id
210210109:001,2913,70405957139549,70405957,0,
210210109:001,2916,70405961139553,70405961,1,
and a sample of patch.csv:
line_id,line_code,line_name,forward_line_name,forward_direction,backward_line_name,backward_direction,line_color,line_sort,network_id,commercial_mode_id,contributor_id,geometry_id,line_opening_time,line_closing_time
OIF:100110010:10OIF439,10,Boulogne Pont de Saint-Cloud - Gare d'Austerlitz,BOULOGNE / PONT DE ST CLOUD - GARE D'AUSTERLITZ,OIF:SA:8754700,GARE D'AUSTERLITZ - BOULOGNE / PONT DE ST CLOUD,OIF:SA:59400,DFB039,91,OIF:439,metro,OIF,geometry:line:100110010:10,05:30:00,25:47:00
OIF:210210109:001OIF30,001,FFOURCHES LONGUEVILLE PROVINS,Place Mérot - GARE DE LONGUEVILLE,,GARE DE LONGUEVILLE - Place Mérot,OIF:SA:63:49,000000 1,OIF:30,bus,OIF,,05:39:00,19:50:00
Each file has hundreds of lines I need to parse and edit this way.
The separator is a comma in both csv files.
Based on mhopeng's answer to a previous question, I obtained this code:
#!/usr/bin/env python2
from __future__ import print_function
import fileinput
import sys

# first get the route info from patch.csv
f = open(sys.argv[1])
d = open(sys.argv[2])
# ignore header line
#line1 = f.readline()
#line2 = d.readline()
# get line of data
for line1 in f.readline():
    line1 = f.readline().split(',')
    route_id = line1[0].split(':')[1]  # '210210109'
    route_forward = line1[3]
    route_backward = line1[5]
    line_code = line1[1]
    # process origin.csv and replace lines in-place
    for line in fileinput.input(sys.argv[2], inplace=1):
        line2 = d.readline().split(',')
        num_route = line2[0].split(':')[0]
        # prevent lines with the same route_id but a different line_code
        # from being considered the same line
        if line.startswith(route_id) and (num_route == line_code):
            if line.startswith(route_id):
                newline = line.split(',')
                if newline[4] == 0:
                    newline[3] = route_backward
                else:
                    newline[3] = route_forward
                print('\t'.join(newline), end="")
        else:
            print(line, end="")
But unfortunately, that doesn't put the right forward or backward_line_name into trip_headsign (forward is always used), the condition comparing patch.csv's line_code to the end of route_id in origin.csv (after the ":") doesn't work, and the script finally triggers this error before finishing parsing the file:
Traceback (most recent call last):
  File "./GTFS_enhancer_headsigns.py", line 28, in <module>
    if newline[4] == 0:
IndexError: list index out of range
Could you please help me fix these three problems?
Thanks for your help :)
You really should consider using the Python csv module instead of split().
In my experience, everything is much easier when you work with CSV files through the csv module.
That way you can iterate through the dataset in a structured way, without the risk of index-out-of-range errors.
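For instance, a minimal sketch of that approach (assuming the column names shown in the samples above, that both files fit in memory, and that the middle segment of line_id uniquely identifies a route; origin_fixed.csv is just a name chosen here for the output):
import csv

# build a lookup from the middle segment of line_id to its patch row
with open('patch.csv', newline='') as f:
    patches = {row['line_id'].split(':')[1]: row for row in csv.DictReader(f)}

with open('origin.csv', newline='') as f:
    reader = csv.DictReader(f)
    fields = reader.fieldnames
    rows = list(reader)

for row in rows:
    patch = patches.get(row['route_id'].split(':')[0])
    if patch is not None:
        # direction_id 0 -> forward_line_name, 1 -> backward_line_name
        key = 'forward_line_name' if row['direction_id'] == '0' else 'backward_line_name'
        row['trip_headsign'] = patch[key]

with open('origin_fixed.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)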

Filling tabs until the maximum length of column

I have a tab-delimited txt that looks like
11 22 33 44
53 25 36 25
74 89 24 35 and
But there is no tab after 44 and 25. So the 1st and 2nd rows have 4 columns, while the 3rd row has 5 columns.
To rewrite it so that tabs are shown,
11\t22\t33\t44
53\t25\t36\t25
74\t89\t24\t35\tand
I need to have a tool to mass-add tabs where there are no entries.
If the maximum number of columns is n (n=5 in the above example), then I want to pad every row with tabs up to that nth column, to make
11\t22\t33\t44\t
53\t25\t36\t25\t
74\t89\t24\t35\tand
I tried to do it with Notepad++, and with Python using replacer code like
map_dict = {'': '\t'}
but it seems I need more logic than that.
I am assuming your file also contains newlines so it would actually look like this:
11\t22\t33\t44\n
53\t25\t36\t25\n
74\t89\t24\t35\tand\n
If you know for sure that the maximum number of columns is 5, you can do it like this:
with open('my_file.txt') as my_file:
    y = lambda x: len(x.strip().split('\t'))
    a = [line if y(line) == 5 else '%s%s\n' % (line.strip(), '\t' * (5 - y(line)))
         for line in my_file.readlines()]
    # ['11\t22\t33\t44\t\n', '53\t25\t36\t25\t\n', '74\t89\t24\t35\tand\n']
This will add trailing tabs until you reach 5 columns. You will get a list of lines that you need to write back to a file (I used 'my_file2.txt', but you can write back to the original one if you want).
with open('my_file2.txt', 'w+') as out_file:
    for line in a:
        out_file.write(line)
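If the maximum column count is not known in advance, a sketch along these lines (assuming the file fits in memory, since it needs two passes over the lines) computes n first and then pads:
with open('my_file.txt') as my_file:
    lines = my_file.readlines()

# n is the largest column count found in the file
n = max(len(line.rstrip('\n').split('\t')) for line in lines)

with open('my_file2.txt', 'w') as out_file:
    for line in lines:
        cols = line.rstrip('\n').split('\t')
        # pad each row with tabs up to the nth column
        out_file.write('\t'.join(cols) + '\t' * (n - len(cols)) + '\n')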
If I understood it correctly, you can achieve this in Notepad++ alone using its Find and Replace dialog.
And yes, if you have several files on which you want to perform this, you can record it as a macro and bind it to a key as a shortcut.

Python: Read large file in chunks

Hey there, I have a rather large file that I want to process using Python and I'm kind of stuck as to how to do it.
The format of my file is like this:
0 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
So I basically want to read in the chunk from 0 to 1, do my processing on it, then move on to the chunk between 1 and 2.
So far I've tried using a regex to match the number and then keep iterating, but I'm sure there has to be a better way of going about this. Any suggestion/info would be greatly appreciated.
If they are all within the same line, that is, there are no line breaks between "1." and "2.", then you can iterate over the lines of the file like this:
for line in open("myfile.txt"):
    # do stuff
The line will be disposed of and overwritten at each iteration, meaning you can handle large file sizes with ease. If they're not on the same line:
for line in open("myfile.txt"):
    if ...:  # regex to match start of new string
        parsed_line = line
    else:
        parsed_line += line
and the rest of your code.
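A runnable version of that sketch (assuming, per the sample format, that a new chunk starts with a line beginning "N " for some number N; process() is a hypothetical stand-in for your per-chunk work):
import re

chunk_start = re.compile(r"\d+ ")  # a line beginning "N " opens a new chunk

def process(chunk):
    # hypothetical placeholder for the real per-chunk processing
    print(repr(chunk))

parsed_line = ""
with open("myfile.txt") as f:
    for line in f:
        if chunk_start.match(line):
            if parsed_line:
                process(parsed_line)
            parsed_line = line
        else:
            parsed_line += line
if parsed_line:
    process(parsed_line)  # flush the final chunk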
Why don't you just read the file char by char using file.read(1)?
Then, in each iteration, you could check whether you have arrived at the character '1'. Then you have to make sure that accumulating the string is fast.
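As a rough sketch of that idea (assuming, as in the sample, that the chunk data itself contains no digits, so a digit marks the start of a new chunk; joining a list keeps the string accumulation fast):
chunks = []
current = []
with open("myfile.txt") as f:
    while True:
        ch = f.read(1)
        if ch == "":  # end of file
            break
        if ch.isdigit() and current:
            chunks.append("".join(current))  # close the previous chunk
            current = []
        current.append(ch)
if current:
    chunks.append("".join(current))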
If the "N " can only start a line, then why not use use the "simple" solution? (It sounds like this already being done, I am trying to reinforce/support it ;-))
That is, just reading a line at a time, and build up the data representing the current N object. After say N=0, and N=1 are loaded, process them together, then move onto the next pair (N=2, N=3). The only thing that is even remotely tricky is making sure not to throw out a read line. (The line read that determined the end condition -- e.g. "N " -- also contain the data for the next N).
Unless seeking is required (or IO caching is disabled or there is an absurd amount of data per item), there is really no reason not to use readline AFAIK.
Happy coding.
Here is some off-the-cuff code, which likely contains multiple errors. In any case, it shows the general idea using a minimized side-effect approach.
import re

# given an input and previous item data, return either
# [item_number, data, next_overflow] if another item is read
# or None if there are no more items
def read_item(inp, overflow):
    data = overflow or ""

    # this can be replaced with any method to "read the header";
    # the regex is just the easiest. the contract is: given "N ....",
    # return N; given anything else, return None
    def get_num(d):
        m = re.match(r"(\d+) ", d)
        return int(m.group(1)) if m else None

    for line in inp:
        if data and get_num(line) is not None:
            # already in an item (have data); current line "overflows".
            # item number is still at start of current data
            return [get_num(data), data, line]
        # not in item, or new item not found yet
        data += line

    # at end of input, with data. only returns above
    # if a "new" item was encountered; this covers the case of
    # no more items (or no items at all)
    if data:
        return [get_num(data), data, None]
    else:
        return None
And usage might be akin to the following, where f represents an open file:
# check for error conditions (e.g. None returned)
# note feed-through of "overflow"
num1, data1, overflow = read_item(f, None)
num2, data2, overflow = read_item(f, overflow)
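And a sketch of draining a whole file that way (a hypothetical driver loop; pair up consecutive items as described above if that is what the processing needs):
with open("myfile.txt") as f:
    overflow = None
    while True:
        item = read_item(f, overflow)
        if item is None:
            break  # no more items
        num, data, overflow = item
        print(num, len(data))  # stand-in for real processing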
If the format is fixed, why not just read 3 lines at a time with readline()?
If the file is small, you could read the whole file in and split() on the number digits (you might want to use strip() to get rid of whitespace and newlines), then fold over the list, processing each string in it. You'll probably have to check that the resulting string is not empty before processing it, in case two digits were next to each other.
If the file's content can be loaded in memory, and that's what you answered, then the following code (needs to have filename defined) may be a solution.
import re
regx = re.compile(r'^((\d+).*?)(?=^\d|\Z)', re.DOTALL | re.MULTILINE)

with open(filename) as f:
    text = f.read()

def treat(inp, regx=regx):
    m1 = regx.search(inp)
    numb, chunk = m1.group(2, 1)
    li = [chunk]
    for mat in regx.finditer(inp, m1.end()):
        n, ch = mat.group(2, 1)
        if int(n) == int(numb) + 1:
            # a consecutive number starts the next chunk: flush the current one
            yield ''.join(li)
            numb = n
            li = []
        li.append(ch)
    yield ''.join(li)

for y in treat(text):
    print(repr(y))
This code, run on a file containing:
1 mountain
orange 2
apple
produce
2 gas
solemn
enlightment
protectorate
3 grimace
song
4 snow
wheat
51 guludururu
kelemekinonoto
52asabi dabada
5 yellow
6 pink
music
air
7 guitar
blank 8
8 Canada
9 Rimini
produces:
'1 mountain\norange 2\napple\nproduce\n'
'2 gas\nsolemn\nenlightment\nprotectorate\n'
'3 grimace\nsong\n'
'4 snow\nwheat\n51 guludururu\nkelemekinonoto\n52asabi dabada\n'
'5 yellow\n'
'6 pink \nmusic\nair\n'
'7 guitar\nblank 8\n'
'8 Canada\n'
'9 Rimini'
