Python to remove extra delimiter

We have a 100 MB pipe-delimited file with 5 columns, i.e. 4 delimiters per row. However, there are a few rows where the second column contains an extra pipe; for those rows the total delimiter count is 5.
For example, of the 4 rows below, the 3rd is the problematic one, as it has an extra pipe.
1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6
Is there any way we can remove the extra pipe from the second column wherever the delimiter count for the row is 5? Post correction, the file needs to look like below.
1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6
Please note that the file size is 100 MB. Any help is appreciated.

Source: my_file.txt
1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6
E|1 |9 |2 |8 |Not| a |text|!!!|3|7|4
Code
# With Python 3.10+, these could be written as parenthesized context managers:
# https://docs.python.org/3.10/whatsnew/3.10.html#parenthesized-context-managers
with open('./my_file.txt') as file_src, open('./my_file_parsed.txt', 'w') as file_dst:
    # Iterate lazily instead of calling readlines(), so the whole
    # 100 MB file is never held in memory at once.
    for line in file_src:
        # Split the line on the character '|'
        line_list = line.split('|')
        if len(line_list) <= 5:
            # If the number of columns doesn't exceed 5, write the original line as is.
            file_dst.write(line)
        else:
            # If the number of columns exceeds 5, count the columns that should be merged.
            to_merge_columns_count = (len(line_list) - 5) + 1
            # Merge the columns from index 1 up to index 1 + count.
            merged_column = "".join(line_list[1:1 + to_merge_columns_count])
            # Replace those items with the single merged column.
            line_list[1:1 + to_merge_columns_count] = [merged_column]
            # Write the updated line.
            file_dst.write("|".join(line_list))
Result: my_file_parsed.txt
1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6
E|1 9 2 8 Not a text!!!|3|7|4
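For reference, the same merge can also be expressed with the maxsplit argument of split and rsplit; a minimal sketch, assuming (as above) that every row has at least 5 columns and only the second column can contain stray pipes:
with open('./my_file.txt') as src, open('./my_file_parsed.txt', 'w') as dst:
    for line in src:
        # Pin down the first column, then split the remainder from the
        # right so the last three columns survive intact.
        first, rest = line.rstrip('\n').split('|', 1)
        middle, c3, c4, c5 = rest.rsplit('|', 3)
        # Any pipes left in `middle` belong to the overlong second column.
        dst.write('|'.join([first, middle.replace('|', ''), c3, c4, c5]) + '\n')
Splitting once from the left and three times from the right means that whatever remains in the middle must be the second column, however many pipes it contains.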

A simple regular expression pattern like this works on Python 3.7.3:
import re

bad_pipe_re = re.compile(r"[ \w]+\|[ \w]+(\|)[ \w]+\|[ \w]+\|[ \w]+\|[ \w]+\n")
with open("input", "r") as fp_1, open("output", "w") as fp_2:
    line = fp_1.readline()
    while line != "":
        mo = bad_pipe_re.fullmatch(line)
        if mo is not None:
            # Cut out the captured extra pipe
            line = line[:mo.start(1)] + line[mo.end(1):]
        fp_2.write(line)
        line = fp_1.readline()

Related

Regex: replace comma in between a string Python

I have a list of rows like the ones below. There should be 10 columns in total, but the CSV generated from tabula has only 9: the two string values ELECSERV and FINALED (the exact strings may vary) come out as a single column. I want them separated into two columns by a comma, and I then want to remove the comma at the end of each row.
D12-1234,041-260-32,714 EL DFRO ST,ELECSERV FINALED,0,$0.00,10/15/2009 ,CONSTRUCTION,Electrical service upgrade from 100 amp to 200 amp (same location),
D12-1235,037-071-07,127 S HORN DR,ELECSERV ISSUED,0,$0.00,10/22/2009 ,"AGANS & ELLIOTT, INC, A&E ELECTRIC",Service upgrade (same location),
Output should be like this:
D12-1234,041-260-32,714 EL DFRO ST,ELECSERV,FINALED,0,$0.00,10/15/2009 ,CONSTRUCTION,Electrical service upgrade from 100 amp to 200 amp (same location)
D12-1235,037-071-07,127 S HORN DR,ELECSERV,ISSUED,0,$0.00,10/22/2009 ,"AGANS & ELLIOTT, INC, A&E ELECTRIC",Service upgrade (same location)
I hope this is what you want. Also make sure the last line ends with a \n for this to work properly; you can also add an if statement to check whether the last char is \n (see the sketch after the edit note below).
with open("t.csv",'r') as myfile ,open ('out.csv','w') as outputfile:
for line in myfile:
outputfile.write(line[:-2]+"\n")
Edit: at the end you can write the result back to the old file.
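A sketch of that if-statement check, using the same file names as the snippet above and assuming every data row ends with a comma:
with open("t.csv", 'r') as myfile, open('out.csv', 'w') as outputfile:
    for line in myfile:
        if line.endswith("\n"):
            outputfile.write(line[:-2] + "\n")  # drop the trailing ",\n"
        else:
            outputfile.write(line[:-1] + "\n")  # last line: drop only the ","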
You can first split the line with , as the delimiter:
cols = line.split(',')
Now change the 4th element of the resulting list, replacing the space with a comma:
cols[3] = cols[3].replace(' ', ',')
Join the list back into a string, then remove the last comma using rstrip:
','.join(cols).rstrip(',')
Edit 1:
Please refer to the following; it's working for me:
line = 'D12-1234,041-260-32,714 EL DFRO ST,ELECSERV FINALED,0,$0.00,10/15/2009 ,CONSTRUCTION,Electrical service upgrade from 100 amp to 200 amp (same location),'
cols = line.split(',')
cols[3] = cols[3].replace(' ', ',')
print(','.join(cols).rstrip(','))

Python: How to split txt file data into csv according to user input

So I am reading in a .txt file that looks largely like this: TTACGATATACGA etc., but contains thousands of characters. I can read in a file and output it as a CSV according to user input that decides the characters per column and the number of columns; however, it writes a new file each time.
Ideally I would like to have a format such as the following per file:
The user enters 4 and 3.
Output: TCAG, TGCT, TACG,
My current output is this:
TCAGTGCTTACG
I have tried looking at string splitting but I don't seem to be able to get it to work.
Here is what I've written thus far; apologies if it's poor:
import os

# user input for parameters
user_input_character = int(input("Enter how many characters you'd like per column"))
user_input_column = int(input("Enter how many columns you'd like"))
character_per_column = user_input_character
columns_per_entry = user_input_column
characters_to_read = int(character_per_column * columns_per_entry)
print("Total characters: " + str(characters_to_read))
# counts used to set letters to be taken into intake
index_start = 0
index_finish = characters_to_read
count = 1
# open the file to be read
lines = []
test_file = open("dna.txt", "r")
for line in test_file:
    line = line.strip()
    if not line:
        continue
    lines.append(',')
# read the file and take note of its size for index purposes
read_file = test_file.read()
file_size = len(read_file)
print(file_size)
i = 1
index = 0
# use loop to make more than one file output
while index < 50:
    # print count used to measure progress for testing
    print('the count is', count)
    count += 1
    index += characters_to_read
    print('index: ', index)
    # intake only uses letters from index count per file
    intake = read_file[index_start:index_finish]
    print(intake)
    index_start += characters_to_read
    index_finish += characters_to_read
    # output the letters from intake as an individually numbered csv file
    text_file_output = open("Output%i.csv" % i, 'w')
    i += 1
    text_file_output.write(intake)
    text_file_output.close()
# define path to print to console for file saving
path = os.path.abspath("Output%i")
directory = os.path.dirname(path)
print(path)
test_file.close()
Here's a simple way to split your DNA data into rows consisting of columns and chunks of specified sizes. It assumes that the DNA data is in a single string with no white space characters (spaces, tabs, newlines, etc).
To test this code, I create some fake data using the random module.
from random import seed, choice

seed(42)
# Make some random DNA data
num = 66
data = ''.join([choice('ACGT') for _ in range(num)])
print(data, '\n')

# Split the data into chunks, columns and rows
chunksize, cols = 4, 3
row = []
for i in range(0, len(data), chunksize):
    chunk = data[i:i+chunksize]
    row.append(chunk)
    if len(row) == cols:
        print(' '.join(row))
        row = []
if row:
    print(' '.join(row))
output
AAGCCCAATAAACCACTCTGACTGGCCGAATAGGGATATAGGCAACGACATGTGCGGCGACCCTTG
AAGC CCAA TAAA
CCAC TCTG ACTG
GCCG AATA GGGA
TATA GGCA ACGA
CATG TGCG GCGA
CCCT TG
On my old 2GHz 32 bit machine, running Python 3.6.0, this code can process and save to disk around 100000 chars per second (that includes the time taken to generate the random data).
Here's a version of the above code that handles spaces and blank lines in the input data. It reads the input data from a file and writes the output to a CSV file.
Firstly, here's the code I used to create some fake test data, which I saved to "dnatest.txt".
from random import seed, choice, randrange

seed(123)
# Make some random DNA data containing spaces
pool = 'ACGT' * 5 + ' '
for _ in range(15):
    # Choose a random line length
    size = randrange(50, 70)
    data = ''.join([choice(pool) for _ in range(size)])
    print(data)
    # Randomly add a blank line
    if randrange(5) < 2:
        print()
Here's the file it created:
dnatest.txt
AGCATCACCGGCCAGCGTCACGTAGAGGTCGAAACCGTATCCGATGT AGG
ACC TTACTAC CGTACGGCAGGAGGAGGG TATTACAC CT TCTCACGAGCAAGGAATA
ATTGATGGCACAGC AAGATCCGCTA CCGATTG CAACCA CATACGAT CGACCAGATGG
ACAGAACAGATCTTGGGAATGGAACAGGAGAGAGTGTGGGCCACATTAAAGTGATAAT ATTT
TCTGTCGTGGGGCACCAAACCATGCTAATGCACGACTGGGT GAGGGTTGAGAGCCTACTATCCTCAG
TCGATCGAGATGACCCTCCTATCGCAACAGCTGTCAGTGTCCAGAG ACGTCGC CA
TAGGTCTGGAAAC GCACTCCCCTC GGAATAGTCTACACGAGTCCATTATGTC
GATCTGACTATGGGGACCATAACGGCTATGCGACCATGGACTGGTTCGAG
GATTCCCGTTCTACAT CACCTT ACCTCTGATAA CGACTGGTTCGA GGGTCTC CC
AAA CGTCTATTATGTCATAACGTAACTCTGC CGTAGTTTGATCAAACGTACAGCCACCAC
TGAAGC CGCCTCGAACCGCGTCCGACCCTGGGGAGCCTGGGGCCCAGCA
CCTTAGC ACTGCGA AGCTACACCCCACGAGTAATTTG T CTATCGT CCG
GCCTCGTTTCCTTGTGAAATTAT ATGGT C AGTCTTCAATCAA CACCTA CTAATAA
GTGCTAGC CCGGGGATCTTGTCCTGGTCCA GGTC AT AATCCGTGCTCAAATTACATGGCTT
TTAGTAATGAGTTCGGGC GCGCCCTCAAAGTTGGTCTAGAAGCGCGCAGTTTTCCTTAGGT
Here's the code that processes that data:
# Input & output file names
iname = 'dnatest.txt'
oname = 'dnatest.csv'
# Read the data and eliminate all whitespace
with open(iname) as f:
data = ''.join(f.read().split())
# Split the data into chunks, columns and rows
chunksize, cols = 4, 3
with open(oname, 'w') as f:
row = []
for i in range(0, len(data), chunksize):
chunk = data[i:i+chunksize]
row.append(chunk)
if len(row) == cols:
f.write(', '.join(row) + '\n')
row = []
if row:
f.write(', '.join(row) + '\n')
And here's the file it creates:
dnatest.csv
AGCA, TCAC, CGGC
CAGC, GTCA, CGTA
GAGG, TCGA, AACC
GTAT, CCGA, TGTA
GGAC, CTTA, CTAC
CGTA, CGGC, AGGA
GGAG, GGTA, TTAC
ACCT, TCTC, ACGA
GCAA, GGAA, TAAT
TGAT, GGCA, CAGC
AAGA, TCCG, CTAC
CGAT, TGCA, ACCA
CATA, CGAT, CGAC
CAGA, TGGA, CAGA
ACAG, ATCT, TGGG
AATG, GAAC, AGGA
GAGA, GTGT, GGGC
CACA, TTAA, AGTG
ATAA, TATT, TTCT
GTCG, TGGG, GCAC
CAAA, CCAT, GCTA
ATGC, ACGA, CTGG
GTGA, GGGT, TGAG
AGCC, TACT, ATCC
TCAG, TCGA, TCGA
GATG, ACCC, TCCT
ATCG, CAAC, AGCT
GTCA, GTGT, CCAG
AGAC, GTCG, CCAT
AGGT, CTGG, AAAC
GCAC, TCCC, CTCG
GAAT, AGTC, TACA
CGAG, TCCA, TTAT
GTCG, ATCT, GACT
ATGG, GGAC, CATA
ACGG, CTAT, GCGA
CCAT, GGAC, TGGT
TCGA, GGAT, TCCC
GTTC, TACA, TCAC
CTTA, CCTC, TGAT
AACG, ACTG, GTTC
GAGG, GTCT, CCCA
AACG, TCTA, TTAT
GTCA, TAAC, GTAA
CTCT, GCCG, TAGT
TTGA, TCAA, ACGT
ACAG, CCAC, CACT
GAAG, CCGC, CTCG
AACC, GCGT, CCGA
CCCT, GGGG, AGCC
TGGG, GCCC, AGCA
CCTT, AGCA, CTGC
GAAG, CTAC, ACCC
CACG, AGTA, ATTT
GTCT, ATCG, TCCG
GCCT, CGTT, TCCT
TGTG, AAAT, TATA
TGGT, CAGT, CTTC
AATC, AACA, CCTA
CTAA, TAAG, TGCT
AGCC, CGGG, GATC
TTGT, CCTG, GTCC
AGGT, CATA, ATCC
GTGC, TCAA, ATTA
CATG, GCTT, TTAG
TAAT, GAGT, TCGG
GCGC, GCCC, TCAA
AGTT, GGTC, TAGA
AGCG, CGCA, GTTT
TCCT, TAGG, T

Python: a way to ignore/account for newlines with read()

So I am having a problem extracting text from a large (>1 GB) text file. The file is structured as follows:
>header1
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
andEnds
>header2
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAlineAtPosition_80
MaybeAnotherTargetBBBBBBBBBBBrestText
andEndsSomewhereHere
Now I have the information that in the entry with header2 I need to extract the text from position X to position Y (the A's in this example), starting with 1 as the first letter in the line below the header.
BUT: the positions do not account for newline characters. So basically when it says from 1 to 95 it really means just the letters from 1 to 80 and the following 15 of the next line.
My first solution was to use file.read(X-1) to skip the unwanted part in front and then file.read(Y-X) to get the part I want, but when that stretches over newline(s) I get too few characters extracted.
Is there a way to solve this with a Python function other than read(), maybe? I thought about just replacing all newlines with empty strings, but the file may be quite large (millions of lines).
I also tried to account for the newlines by adding extractLength // 80 to the length, but this is problematic in cases like the example: when 95 characters are spread 2-80-13 over 3 lines I actually need 2 additional positions, but 95 // 80 is 1.
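For reference, the newline arithmetic can be made exact instead of approximated from the length: with a fixed line width w, the 1-indexed newline-free position p sits at byte offset (p - 1) + (p - 1) // w from the first letter of the record. A minimal sketch, where rec_start is a hypothetical byte offset of the record's first letter that you would have to know:
def extract(f, rec_start, x, y, w=80):
    # x and y are 1-indexed positions that ignore newlines; w is the line width
    start = rec_start + (x - 1) + (x - 1) // w  # byte offset of character x
    end = rec_start + (y - 1) + (y - 1) // w    # byte offset of character y
    f.seek(start)
    # Read the span including any embedded newlines, then drop them
    return f.read(end - start + 1).replace('\n', '')
This assumes Unix-style \n line endings, so that text-mode offsets line up with byte offsets.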
UPDATE:
I modified my code to use Biopython:
for s in SeqIO.parse(sys.argv[2], "fasta"):
    # foundClusters stores the information for substrings I want extracted
    currentCluster = foundClusters.get(s.id)
    if currentCluster is not None:
        for i in range(len(currentCluster)):
            outputFile.write(">" + s.id + "|cluster" + str(i) + "\n")
            flanking = 25
            start = currentCluster[i][0]
            end = currentCluster[i][1]
            left = currentCluster[i][2]
            if start - flanking < 0:
                start = 0
            else:
                start = start - flanking
            if end + flanking > end + left:
                end = end + left
            else:
                end = end + flanking
            # for debugging only
            print(currentCluster)
            print(start)
            print(end)
            outputFile.write(s.seq[start, end+1])
But I get the following error:
[[1, 55, 2782]]
0
80
Traceback (most recent call last):
  File "findClaClusters.py", line 92, in <module>
    outputFile.write(s.seq[start, end+1])
  File "/usr/local/lib/python3.4/dist-packages/Bio/Seq.py", line 236, in __getitem__
    return Seq(self._data[index], self.alphabet)
TypeError: string indices must be integers
UPDATE2:
Changed outputFile.write(s.seq[start, end+1]) to:
outRecord = SeqRecord(s.seq[start: end+1], id=s.id+"|cluster"+str(i), description="Repeat-Cluster")
SeqIO.write(outRecord, outputFile, "fasta")
and it's working :)
With Biopython:
from Bio import SeqIO

X = 66
Y = 130
for s in SeqIO.parse("test.fst", "fasta"):
    if "header2" == s.id:
        print(s.seq[X:Y+1])
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Biopython lets you parse a fasta file and access its id, description and sequence easily. You then have a Seq object and can manipulate it conveniently without recoding everything (reverse complement and so on).
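For instance, a small illustration of that convenience, with a made-up sequence (assuming Biopython is installed):
from Bio.Seq import Seq

s = Seq("ATGCAAA")              # hypothetical sequence
print(s[0:3])                  # ATG -- slicing returns another Seq
print(s.reverse_complement())  # TTTGCAT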

Convert a Column oriented file to CSV output using shell

I have a file that comes from map-reduce output, in the format below, that needs conversion to CSV using a shell script:
25-MAY-15
04:20
Client
0000000010
127.0.0.1
PAY
ISO20022
PAIN000
100
1
CUST
API
ABF07
ABC03_LIFE.xml
AFF07/LIFE
100000
Standard Life
================================================
==================================================
AFF07-B000001
2000
ABC Corp
..
BE900000075000027
AFF07-B000002
2000
XYZ corp
..
BE900000075000027
AFF07-B000003
2000
3MM corp
..
BE900000075000027
I need the output in the CSV format below, where I want to repeat some of the values from the file and add the TRANSACTION ID, as in this format:
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000002,2000,XYZ Corp,..,BE900000075000027
The TRANSACTION IDs are AFF07-B000001, AFF07-B000002, AFF07-B000003, which have different values; I have put a marker line at the point where the transaction IDs start. Before the demarcation, the values should repeat, and the transaction ID column needs to be added alongside the repeating values, as given in the format above.
A BASH shell script is what I need, and CentOS is the flavour of Linux.
I am getting the error below when I execute the code:
Traceback (most recent call last):
  File "abc.py", line 37, in <module>
    main()
  File "abc.py", line 36, in main
    createTxns(fh)
  File "abc.py", line 7, in createTxns
    first17.append( fh.readLine().rstrip() )
AttributeError: 'file' object has no attribute 'readLine'
Can someone help me out?
Is this a correct description of the input file and output format?
The input file consists of:
17 lines, followed by
groups of 10 lines each - each group holding one transaction id
Each output row consists of:
29 common fields, followed by
5 fields derived from each of the 10-line groups above
So we just translate this into some Python:
def createTxns(fh):
    """fh is the file handle of the input file"""
    # 1. Read 17 lines from fh
    first17 = []
    for i in range(17):
        first17.append(fh.readline().rstrip())
    # 2. Form the common fields.
    commonFields = first17 + first17[0:12]
    # 3. Process the rest of the file in groups of ten lines.
    while True:
        # read 10 lines
        group = []
        for i in range(10):
            x = fh.readline()
            if x == '':
                break
            group.append(x.rstrip())
        if len(group) != 10:
            break  # we've reached the end of the file
        fields = commonFields + [group[2], group[4], group[6], group[7], group[9]]
        row = ",".join(fields)
        print(row)

def main():
    with open("input-file", "r") as fh:
        createTxns(fh)

main()
This code shows how to:
open a file handle
read lines from a file handle
strip off the ending newline
check for end of input when reading from a file
concatenate lists together
join strings together
I would recommend reading Input and Output if you are going down the Python route.
You just have to break the problem down and try it. For the first 17 lines, use f.readline() and concatenate them into a string. Then use the replace method to get the beginning of the string that you want in the CSV:
str.replace("\n", ",")
Then use the split method to break them down into a list:
str.split("\n")
Then write the file out in a loop, using a counter to make your life easier. First write out the header string:
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Then write each item in the list, preceded by a ",":
,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
At a count of 5, write the "\n" with the header again, and don't forget to reset your counter so it can begin again:
\n25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Give it a try and let us know if you need more assistance. I assumed that you have some scripting background :) Good luck!! A minimal sketch of this counter approach follows.
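This is only a sketch of that approach, not a tested implementation: the file names are hypothetical, each transaction is assumed to contribute exactly 5 lines after the header block (the other answer shows how to pick 5 fields out of 10-line groups instead), and the ===== demarcation lines are skipped:
with open("input-file") as fh, open("out.csv", "w") as out:
    # Build the repeated prefix from the first 17 lines plus the first 12 again
    first17 = [fh.readline().rstrip() for _ in range(17)]
    header = ",".join(first17 + first17[:12])
    fields = []
    for line in fh:
        line = line.strip()
        if not line or line.startswith("="):  # skip blanks and demarcation lines
            continue
        fields.append(line)
        if len(fields) == 5:  # one transaction collected; emit a row
            out.write(header + "," + ",".join(fields) + "\n")
            fields = []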

Filling tabs until the maximum length of column

I have a tab-delimited txt that looks like
11 22 33 44
53 25 36 25
74 89 24 35 and
But there is no "tab" after 44 and 25. So the 1st and 2nd rows have 4 columns, 3rd row has 5 columns.
To rewrite it so that tabs are shown,
11\t22\t33\t44
53\t25\t36\t25
74\t89\t24\t35\tand
I need a tool to mass-add tabs where there are no entries.
If the maximum number of columns is n (n=5 in the above example), then I want to fill in tabs up to that nth column for all rows, to make:
11\t22\t33\t44\t
53\t25\t36\t25\t
74\t89\t24\t35\tand
I tried to do it in Notepad++, and in Python using replacement code like
map_dict = {'': '\t'}
but it seems I need more logic to do it.
I am assuming your file also contains newlines so it would actually look like this:
11\t22\t33\t44\n
53\t25\t36\t25\n
74\t89\t24\t35\tand\n
If you know for sure that the maximum length of your columns is 5, you can do it like this:
with open('my_file.txt') as my_file:
    y = lambda x: len(x.strip().split('\t'))
    a = [line if y(line) == 5 else '%s%s\n' % (line.strip(), '\t' * (5 - y(line)))
         for line in my_file.readlines()]
# ['11\t22\t33\t44\t\n', '53\t25\t36\t25\t\n', '74\t89\t24\t35\tand\n']
This will add trailing tabs until you reach 5 columns. You will get a list of lines that you need to write back to a file (I use 'my_file2.txt' below, but you can write back to the original one if you want):
with open('my_file2.txt', 'w+') as out_file:
    for line in a:
        out_file.write(line)
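If the maximum column count n is not known in advance, a two-pass sketch (same hypothetical file names, and assuming the file fits in memory) can compute it first:
with open('my_file.txt') as my_file:
    lines = my_file.read().splitlines()
# First pass: find the widest row
n = max(len(line.split('\t')) for line in lines)
# Second pass: pad every narrower row with trailing tabs
with open('my_file2.txt', 'w') as out_file:
    for line in lines:
        missing = n - len(line.split('\t'))
        out_file.write(line + '\t' * missing + '\n')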
If I understood it correctly, you can achieve this in Notepad++ alone using a find-and-replace (the original answer showed the settings in a screenshot, not reproduced here).
And yes, if you have several files on which you want to perform this, you can record it as a macro and bind it to a key as a shortcut.
