[Help] Extract from csv file and output txt (python)
I read the file 'average-latitude-longitude-countries.csv' and want to find the countries in the Southern Hemisphere, then print each country's name to the file 'result.txt'.

Question: I want to fix my code so that each Southern Hemisphere country name is printed on its own line.
infile = open("average-latitude-longitude-countries.csv","r")
outfile = open("average-latitude-longitude-countries.txt","w")
joined = []
infile.readline()
for line in infile:
    splited = line.split(",")
    if len(splited) > 4:
        if float(splited[3]) < 0:
            joined.append(splited[2])
            outfile.write(str(joined) + "\n")
    else:
        if float(splited[2]) < 0:
            joined.append(splited[1])
            outfile.write(str(joined) + '\n')
It's hard to tell without seeing the head/first few lines of the CSV. However, assuming your code works and the countries list is successfully populated, then
you can replace the line
outfile.write(str(joined) + '\n')
with:
outfile.write("\n".join(joined))
or with these two lines:
for country in joined:
    outfile.write("%s\n" % country)
Keep in mind that these approaches just do the job; they are not optimal.
Extra hints:

You can have a look at the csv module in Python's standard library; it can make your parsing easier.

Also, splited = line.split(",") can lead to wrong output if there is a quoted field that contains a ",", like this: field1_value,"field 2,value",field3, field4 , ...
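For instance, here is a minimal sketch with csv.reader, assuming (as your index checks suggest) that the columns are country code, country name, latitude, longitude, and that the output goes to 'result.txt' as stated in the question:

import csv

southern = []
with open("average-latitude-longitude-countries.csv", newline="") as infile:
    reader = csv.reader(infile)
    next(reader)                     # skip the header row
    for row in reader:
        # csv.reader keeps a quoted name like "Korea, South" in one field
        if float(row[2]) < 0:        # latitude south of the equator
            southern.append(row[1])  # country name

with open("result.txt", "w") as outfile:
    outfile.write("\n".join(southern))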
Update:

Now I see what you mean. First of all, you are dumping the whole aggregated list to the file for each line you read. You should keep appending to the list inside the loop, and then after the whole loop dump it once (joining the accumulated list as shown above).
Here is your code slightly modified:
infile = open("average-latitude-longitude-countries.csv","r")
outfile = open("average-latitude-longitude-countries.txt","w")
joined = []
infile.readline()
for line in infile:
    splited = line.split(",")
    if len(splited) > 4:
        if float(splited[3]) < 0:
            joined.append(splited[2])
            #outfile.write(str(joined) + "\n")
    else:
        if float(splited[2]) < 0:
            joined.append(splited[1])
            #outfile.write(str(joined) + '\n')
outfile.write("\n".join(joined))
Related
Remove linebreak in csv
I have a CSV file that has errors. The most common one is a too-early line break. But now I don't know how to remove it ideally. If I read the file line by line with

with open("test.csv", "r") as reader:
    test = reader.read().splitlines()

the wrong structure is already in my variable. Is this still the right approach, and do I use a for loop over test and create a copy, or can I manipulate directly in the test variable while iterating over it? I can identify the corrupt lines by the semicolon: some rows end with a ; and others start with it. So maybe counting would be an alternative way to solve it?

EDIT: I replaced reader.read().splitlines() with reader.readlines() so I could handle the rows which end with a ;:

for line in lines:
    if("Foobar" in line):
        line = line.replace("Foobar", "")
    if(";\n" in line):
        line = line.replace(";\n", ";")

The only thing that remains is rows that begin with a ;, since I need to go back one entry in the list. Example:

Col_a;Col_b;Col_c;Col_d
2021;Foobar;Bla
;Blub

Blub belongs in the row above.
Here's a simple Python script to merge lines until you have the desired number of fields.

import sys

sep = ';'
fields = 4
collected = []
for line in sys.stdin:
    new = line.rstrip('\n').split(sep)
    if collected:
        collected[-1] += new[0]
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) < fields:
        continue
    print(';'.join(collected))
    collected = []

This simply reads from standard input and prints to standard output. If the last line is incomplete, it will be lost. The separator and the number of fields can be edited into the variables at the top; exposing these as command-line parameters is left as an exercise. If you wanted to keep the newlines, it would not be too hard to only strip a newline from the last field, and use csv.writer to write the fields back out as properly quoted CSV.
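As a sketch of that last idea (same assumptions: semicolon separator, four fields), csv.writer takes care of the quoting:

import csv
import sys

sep = ';'
fields = 4
# csv.writer quotes any field that itself contains the separator or a newline
writer = csv.writer(sys.stdout, delimiter=sep)
collected = []
for line in sys.stdin:
    new = line.rstrip('\n').split(sep)
    if collected:
        collected[-1] += new[0]
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) >= fields:
        writer.writerow(collected)
        collected = []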
This is how I deal with this. This function fixes the line if there are more columns than needed or if there is a line break in the middle. Parameters of the function are:

message - content of the file (reader.read() in your case)
columns - number of expected columns
filename - filename (I use it for logging)

def pre_parse(message, columns, filename):
    parsed_message = []
    i = 0
    temp_line = ''
    for line in message.splitlines():
        #print(line)
        split = line.split(',')
        if len(split) == columns:
            parsed_message.append(line)
        elif len(split) > columns:
            print(f'Line {i} has been truncated in file {filename} - too many columns')
            split = split[:columns]
            line = ','.join(split)
            parsed_message.append(line)
        elif len(split) < columns and temp_line == '':
            temp_line = line.replace('\n', '')
            print(temp_line)
        elif temp_line != '':
            line = temp_line + line
            if line.count(',') == columns - 1:
                print(f'Line {i} has been fixed in file {filename} - extra line feed')
                parsed_message.append(line)
                temp_line = ''
            else:
                temp_line = line.replace('\n', '')
        i += 1
    return parsed_message

Make sure you use the proper split character and the proper line feed character.
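A minimal usage sketch (the filename and the column count of 4 are placeholders for your own values):

# read the raw file content, fix it, and write the cleaned lines back out
with open('test.csv', 'r') as reader:
    content = reader.read()

fixed = pre_parse(content, 4, 'test.csv')

with open('test_fixed.csv', 'w') as writer:
    writer.write('\n'.join(fixed))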
python3 split big file by delimiter into small files (not size, lines)
Newbie here. The ultimate mission is to learn how to take two big yaml files and split them into several hundred small files. I haven't yet figured out how to use the ID # as the filename, so one thing at a time. First: split the big files into many. Here's a tiny bit of my test data file test-file.yml. Each post has a - delimiter on a line by itself:

-
ID: 627
more_post_meta_data_and_content
-
ID: 628

And here's my code that isn't working. So far I don't see why:

with open('test-file.yml', 'r') as myfile:
    start = 0
    cntr = 1
    holding = ''
    for i in myfile.read().split('\n'):
        if (i == '-\n'):
            if start == 1:
                with open(str(cntr) + '.md', 'w') as opfile:
                    opfile.write(op)
                    opfile.close()
                holding = ''
                cntr += 1
            else:
                start = 1
        else:
            if holding == '':
                holding = i
            else:
                holding = holding + '\n' + i
myfile.close()

All hints, suggestions, pointers welcome. Thanks.
Reading the entire file into memory and then splitting the memory regions is very inefficient if the input files are large. Try this instead:

with open('test-file.yml', 'r') as myfile:
    opfile = None
    cntr = 1
    for line in myfile:
        if line == '-\n':
            if opfile is not None:
                opfile.close()
            opfile = open('{0}.md'.format(cntr), 'w')
            cntr += 1
        opfile.write(line)
    opfile.close()

Notice also, you don't close things you have opened in a with context manager; the very purpose of the context manager is to take care of this for you.
As a newbie myself, at first glance you're trying to write an undeclared variable op to your output. You were nearly spot on; you just need to write out the contents you have been collecting in holding:

with open('test-file.yml', 'r') as myfile:
    start = 0
    cntr = 1
    holding = ''
    for i in myfile.read().split('\n'):
        if (i == '-\n'):
            if start == 1:
                with open(str(cntr) + '.md', 'w') as opfile:
                    opfile.write(holding)
                holding = ''
                cntr += 1
            else:
                start = 1
        else:
            if holding == '':
                holding = i
            else:
                holding = holding + '\n' + i
myfile.close()

Hope this helps!
When you are working in a with context on an open file, the with will automatically take care of closing it for you when you exit the block. So you don't need file.close() anywhere. You can also iterate over the open file object itself; it yields one line at a time, which works much more efficiently than read() followed by a split(). Think about it: you are loading a massive file into memory, and then asking the CPU to split that ginormous text by the \n character. Not very efficient. You wrote opfile.write(op). Where is this op defined? Don't you want to write the content in holding that you've defined? Try the following.

with open('test.data', 'r') as myfile:
    counter = 1
    content = ""
    start = True
    for line in myfile:
        if line == "-\n" and not start:
            with open(str(counter) + '.md', 'w') as opfile:
                opfile.write(content)
            content = ""
            counter += 1
        else:
            if not start:
                content += line
        start = False

    # write the last file if test-file.yml doesn't end with a dash
    if content != "":
        with open(str(counter) + '.md', 'w') as opfile:
            opfile.write(content)
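For the other half of the mission, naming each small file after its ID, here is a hedged sketch; it assumes every record contains a line of exactly the form ID: 627, as in the sample data above:

with open('test-file.yml', 'r') as myfile:
    content = ""
    post_id = None
    for line in myfile:
        if line == "-\n":
            if post_id is not None:
                with open(post_id + '.md', 'w') as opfile:
                    opfile.write(content)
            content = ""
            post_id = None
        else:
            if line.startswith("ID:"):
                post_id = line.split(":", 1)[1].strip()  # e.g. "627"
            content += line
    # write the last record, which has no trailing dash
    if post_id is not None:
        with open(post_id + '.md', 'w') as opfile:
            opfile.write(content)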
Data comes out shifted using python
What this code is supposed to do is transfer weird looking .csv files written in one line into a multilined csv:

import csv
import re

filenmi = "original.csv"
filenmo = "data-out.csv"

infile = open(filenmi, 'r')
outfile = open(filenmo, 'w+')

for line in infile:
    print('read data :', line)
    line2 = re.sub('[^0-9|^,^.]', '', line)
    line2 = re.sub(',,', ',', line2)
    print('clean data: ', line2)
    wordlist = line2.split(",")
    n = (len(wordlist)) / 2
    print('num data pairs: ', n)
    i = 0
    print('data paired :')
    while i < n * 2:
        pairlst = wordlist[i:i+2]
        pairstr = ','.join(pairlst)
        print('  ', i/2+1, ' ', pairstr)
        pairstr = pairstr + '\n'
        outfile.write(pairstr)
        i = i + 2

infile.close()
outfile.close()

What I want this code to do is change a messed up .txt file

L,39,100,50.5,83,L,50.5,83

into a normally formatted csv file like the example below

39,100
50.5,83
50.5,83

but my data comes out like this

,39
100,50.5
83,50.5
83,

I'm not sure what went wrong or how to fix this, so it would be great if someone could help.

::Data Set::

L,39,100,50.5,83,L,50.5,83,57.5,76,L,57.5,76,67,67.5,L,67,67.5,89,54,L,89,54,100.5,49,L,100.5,49,111.5,45.5,L,111.5,45.5,134,42,L,134,42,152.5,44,L,152.5,44,160,46.5,L,160,46.5,168,52,L,168,52,170,56.5,L,170,56.5,162,64.5,L,162,64.5,152.5,70,L,152.5,70,126,85.5,L,126,85.5,113.5,94,L,113.5,94,98,105.5,L,98,105.5,72.5,132,L,72.5,132,64.5,145,L,64.5,145,57.5,165.5,L,57.5,165.5,57,176,L,57,176,63.5,199.5,L,63.5,199.5,69,209,L,69,209,76,216.5,L,76,216.5,83.5,222,L,83.5,222,90.5,224.5,L,90.5,224.5,98,225.5,L,98,225.5,105.5,225,L,105.5,225,115,223,L,115,223,124.5,220,L,124.5,220,133.5,216.5,L,133.5,216.5,142,212,L,142,212,149,207,L,149,207,156.5,201.5,L,156.5,201.5,163.5,195.5,L,163.5,195.5,172.5,185.5,L,172.5,185.5,175,180.5,L,175,180.5,177,173,L,177,173,177.5,154,L,177.5,154,174.5,142.5,L,174.5,142.5,168.5,133.5,L,168.5,133.5,150,131.5,L,150,131.5,135,136.5,L,135,136.5,120.5,144.5,L,120.5,144.5,110.5,154,L,110.5,154,104,161.5,L,104,161.5,99.5,168.5,L,99.5,168.5,98,173,L,98,173,97.5,176,L,97.5,176,99.5,178,L,99.5,178,105,179.5,L,105,179.5,112.5,179,L,112.5,179,132,175.5,L,132,175.5,140.5,175,L,140.5,175,149.5,175,L,149.5,175,157,176.5,L,157,176.5,169.5,181.5,L,169.5,181.5,174,185.5,L,174,185.5,178,206,L,178,206,176.5,214.5,L,176.5,214.5,161,240.5,L,161,240.5,144.5,251,L,144.5,251,134.5,254,L,134.5,254,111.5,254.5,L,111.5,254.5,98,253,L,98,253,71.5,248,L,71.5,248,56,246,
Your code fails because when you tried line2 = re.sub('[^0-9|^,^.]','',line), it outputs ,39,100,50.5,83,,50.5,83. In that line you are using re to replace any char that isn't a number, dot, or comma with nothing (''). This will remove the L in your input, but the comma that followed it will stay. I've just fixed that and made a little modification to how you create a csv list. The below code works.

import csv
import re

filenmi = "original.csv"
filenmo = "data-out.csv"

with open(filenmi, 'r') as infile:
    # get a list of words that must be split
    for line in infile:
        # remove any char which isn't a number, dot, or comma
        line2 = re.sub('[^0-9|^,^.]', '', line)
        # replace ",," with ","
        line2 = re.sub(',,', ',', line2)
        # remove the first char which is a ","
        line2 = line2[1:]
        # get a list of individual values, sep by ","
        wordlist = line2.split(",")

        parsed = []
        for i, val in enumerate(wordlist):
            # for every even index, get the word pair
            try:
                if i % 2 == 0:
                    parstr = wordlist[i] + "," + wordlist[i+1] + '\n'
                    parsed.append(parstr)
            except:
                print("Data set needs cleanup\n")

with open(filenmo, 'w+') as f:
    for item in parsed:
        f.write(item)
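One more side note on the pattern itself: inside a character class, | and a non-leading ^ are literal characters, so '[^0-9|^,^.]' also keeps any | or ^ in the input. If that isn't intended, the cleanup can be written more plainly; a small sketch on the sample line:

import re

line = "L,39,100,50.5,83,L,50.5,83"
line2 = re.sub(r'[^0-9.,]', '', line)  # keep only digits, dots and commas
line2 = re.sub(r',{2,}', ',', line2)   # collapse runs of commas left behind
line2 = line2.lstrip(',')              # drop the leading comma, the bug fixed above
print(line2)                           # 39,100,50.5,83,50.5,83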
Changing a text file and making a bigger text file in python
I have a tab separated text file like this example:

infile:

chr1    +    1071396    1271396    LOC
chr12   +    1101483    1121483    MIR200B

I want to divide the difference between columns 3 and 4 in infile into 100 and make 100 rows per row in infile, and make a new file named newfile: a final tab separated file with 6 columns. The first 5 columns would be like infile; the 6th column would be (5th column)_part number (number is 1 to 100). This is the expected output file:

expected output:

chr1    +    1071396    1073396    LOC    LOC_part1
chr1    +    1073396    1075396    LOC    LOC_part2
.
.
.
chr1    +    1269396    1271396    LOC    LOC_part100
chr12   +    1101483    1101683    MIR200B    MIR200B_part1
chr12   +    1101683    1101883    MIR200B    MIR200B_part2
.
.
.
chr12   +    1121283    1121483    MIR200B    MIR200B_part100

I wrote the following code to get the expected output, but it does not return what I expect.

file = open('infile.txt', 'rb')
cont = []
for line in file:
    cont.append(line)

newfile = []
for i in cont:
    percent = (i[3]-i[2])/100
    for j in percent:
        newfile.append(i[0], i[1], i[2], i[2]+percent, i[4], i[4]_'part'percent[j])

with open('output.txt', 'w') as f:
    for i in newfile:
        for j in i:
            f.write(i + '\n')

Do you know how to fix the problem?
Try this:

file = open('infile.txt', 'r')
cont = []
for line in file:
    cont.append(line.split())

newfile = []
for i in cont:
    diff = (int(i[3]) - int(i[2])) / 100
    left = int(i[2])
    right = left + diff
    for j in range(100):
        newfile.append([i[0], i[1], str(int(left)), str(int(right)), i[4], i[4] + '_part' + str(j + 1)])
        left = right
        right = right + diff

with open('output.txt', 'w') as f:
    for row in newfile:
        f.write('\t'.join(row) + '\n')

In your code, each i in cont is a whole line string, so i[3] is a single character and not a field. To fix that, I split each line into fields.
Here are some suggestions: when you open the file, open it as a text file, not a binary file:

open('infile.txt', 'r')

Now, when you read it line by line, you should strip the newline character at the end by using strip(). Then, you need to split your input text line by tabs into a list of strings, instead of just a long string containing your line, by using split('\t'):

line.strip().split('\t')

Now you have:

file = open('infile.txt', 'r')
cont = []
for line in file:
    cont.append(line.strip().split('\t'))

Now cont is a list of lists, where each list contains your tab separated data, i.e. cont[1][0] = 'chr12'. You will probably be able to take it from here.
Others have answered your question with respect to your own code; I thought I would leave my attempt at solving your problem here.

import os

directory = "C:/Users/DELL/Desktop/"
filename = "infile.txt"
path = os.path.join(directory, filename)

# open input and output files
with open(path, "r") as f_in, open(directory + "outfile.txt", "w") as f_out:
    for line in f_in:
        # split line into fields stored in the list 'contents'
        contents = line.rstrip().split("\t")
        diff = (int(contents[3]) - int(contents[2])) / 100
        for i in range(100):
            temp = f"{contents[0]}\t+\t{int(int(contents[2]) + diff*i)}\t{contents[3]}\t{contents[4]}\t{contents[4]}_part{i+1}"
            f_out.write(temp + "\n")

This code doesn't follow Python style conventions well (excessively long lines, for example), but it works. The line temp = ... uses f-strings to format the output string conveniently, which you could read more about here.
Replace Line with New Line in Python
I am reading a text file and searching data line by line; based on some condition, I change some values in the line and write it back into another file. The new file should not contain the old line. I have tried the following, but it did not work. I think I am missing a very basic thing. In C++ we can increment the line, but in Python I am not sure how to achieve this. So as of now, I am writing the old line and then the new line, but in the new file I want only the new line.

Example:

M0 38 A 19 40 DATA2 L=4e-08 W=3e-07 nf=1 m=1 $X=170 $Y=140 $D=8
M0 VBN A 19 40 TEMP2 L=4e-08 W=3e-07 nf=1 m=1 $X=170 $Y=140 $D=8

The code which I tried is the following:

def parsefile():
    fp = open("File1", "rb+")
    update_file = "File1" + "_update"
    fp_latest = open(update_file, "wb+")
    for line in fp:
        if line.find("DATA1") == -1:
            fp_latest.write(line)
        if line.find("DATA1") != -1:
            line = line.split()
            pin_name = find_pin_order(line[1])
            update_line = "DATA " + line[1] + " " + pin_name
            fp_latest.write(update_line)
            line = ''.join(line)
        if line.find("DATA2") != -1:
            line_data = line.split()
            line_data[1] = "TEMP2"
            line_data = ' '.join(line_data)
            fp_latest.write(line_data)
        if line.find("DATA3") != -1:
            line_data = line.split()
            line_data[1] = "TEMP3"
            line_data = ' '.join(line_data)
            fp_latest.write(line_data)
    fp_latest.close()
    fp.close()
The main problem with your current code is that your first if block, which checks for "DATA1" and writes the line out if it is not found, runs when "DATA2" or "DATA3" is present. Since those have their own blocks, the line ends up being duplicated in two different forms. Here's a minimal modification of your loop that should work:

for line in fp:
    if line.find("DATA1") != -1:
        data = line.split()
        pin_name = find_pin_order(data[1])
        line = "DATA " + data[1] + " " + pin_name
    if line.find("DATA2") != -1:
        data = line.split()
        data[1] = "TEMP2"
        line = ' '.join(data)
    if line.find("DATA3") != -1:
        data = line.split()
        data[1] = "TEMP3"
        line = ' '.join(data)
    fp_latest.write(line)

This ensures that only one line is written because there's only a single write() call in the code. The special cases simply modify the line that is to be written. I'm not sure I understand the modifications you want to have done in those cases, so there may be more bugs there. One thing that might help would be to make the second and third if statements into elif statements instead. This would ensure that only one of them would be run (though if you know your file will never have multiple DATA entries on a single line, this may not be necessary).
If you want to write a new line in a file, replacing the old content that was read last time, you can use the file.seek() method for moving around the file. Here is an example:

with open("myFile.txt", "r+") as f:
    offset = 0
    lines = f.readlines()
    for oldLine in lines:
        # ... calculate the new line value ...
        f.seek(offset)
        f.write(newLine)
        offset += len(newLine)
    f.seek(offset)
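One caveat with this pattern: if the rewritten content ends up shorter than the original, stale bytes from the old file remain past the last write. A common variant (a sketch; the replace() call is only a placeholder transformation) reads everything first, rewinds, rewrites, and truncates:

with open("myFile.txt", "r+") as f:
    lines = f.readlines()
    f.seek(0)                # rewind to the start of the file
    for old_line in lines:
        new_line = old_line.replace("DATA2", "TEMP2")  # placeholder transformation
        f.write(new_line)
    f.truncate()             # drop any leftover bytes from the old content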