[Help] Extract from csv file and output txt (python)
I read the file 'average-latitude-longitude-countries.csv' and want to find the countries in the Southern Hemisphere, then print each country's name to the file 'result.txt'.

Question: I want to fix my code so that each Southern Hemisphere country name is printed on its own line.
infile = open("average-latitude-longitude-countries.csv","r")
outfile = open("average-latitude-longitude-countries.txt","w")
joined = []
infile.readline()
for line in infile:
    splited = line.split(",")
    if len(splited) > 4:
        if float(splited[3]) < 0:
            joined.append(splited[2])
            outfile.write(str(joined) + "\n")
    else:
        if float(splited[2]) < 0:
            joined.append(splited[1])
            outfile.write(str(joined) + '\n')
It's hard to tell without seeing the head/first few lines of the CSV. However, assuming your code works and the countries list is successfully populated, then
you can replace the line
outfile.write(str(joined) + '\n')
with:
outfile.write("\n".join(joined))
or with these two lines:
for country in joined:
    outfile.write("%s\n" % country)
Keep in mind that these approaches just do the job; they are not optimal.
Extra hints:

You can have a look at the csv module in Python's standard library; it can make your parsing easier.

Also, splited = line.split(",") can lead to wrong output if there is a quoted field that contains a ",", like this: field1_value,"field 2,value",field3, field4 , ...
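For instance, here is a minimal sketch with csv.reader, assuming (as your index checks suggest) that the columns are country code, country name, latitude, longitude, and that the output goes to 'result.txt' as stated in the question:

import csv

southern = []
with open("average-latitude-longitude-countries.csv", newline="") as infile:
    reader = csv.reader(infile)
    next(reader)                     # skip the header row
    for row in reader:
        # csv.reader keeps a quoted name like "Korea, South" in one field
        if float(row[2]) < 0:        # latitude south of the equator
            southern.append(row[1])  # country name

with open("result.txt", "w") as outfile:
    outfile.write("\n".join(southern))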
Update:

Now I see what you mean. First of all, you are dumping the whole aggregated list to the file for each line you read. You should keep appending to the list inside the loop, and then after the whole loop dump it once (joining the accumulated list as shown above).
Here is your code slightly modified:
infile = open("average-latitude-longitude-countries.csv","r")
outfile = open("average-latitude-longitude-countries.txt","w")
joined = []
infile.readline()
for line in infile:
    splited = line.split(",")
    if len(splited) > 4:
        if float(splited[3]) < 0:
            joined.append(splited[2])
            #outfile.write(str(joined) + "\n")
    else:
        if float(splited[2]) < 0:
            joined.append(splited[1])
            #outfile.write(str(joined) + '\n')
outfile.write("\n".join(joined))
Related
Remove linebreak in csv
I have a CSV file that has errors. The most common one is a too-early line break. But now I don't know how to remove it ideally. If I read the file line by line with

with open("test.csv", "r") as reader:
    test = reader.read().splitlines()

the wrong structure is already in my variable. Is this still the right approach, and do I use a for loop over test and create a copy, or can I manipulate directly in the test variable while iterating over it? I can identify the corrupt lines by the semicolon: some rows end with a ; and others start with it. So maybe counting would be an alternative way to solve it?

EDIT: I replaced reader.read().splitlines() with reader.readlines() so I could handle the rows which end with a ;:

for line in lines:
    if("Foobar" in line):
        line = line.replace("Foobar", "")
    if(";\n" in line):
        line = line.replace(";\n", ";")

The only thing that remains is rows that begin with a ;, since I need to go back one entry in the list. Example:

Col_a;Col_b;Col_c;Col_d
2021;Foobar;Bla
;Blub

Blub belongs in the row above.
Here's a simple Python script to merge lines until you have the desired number of fields.

import sys

sep = ';'
fields = 4
collected = []
for line in sys.stdin:
    new = line.rstrip('\n').split(sep)
    if collected:
        collected[-1] += new[0]
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) < fields:
        continue
    print(';'.join(collected))
    collected = []

This simply reads from standard input and prints to standard output. If the last line is incomplete, it will be lost. The separator and the number of fields can be edited into the variables at the top; exposing these as command-line parameters is left as an exercise. If you wanted to keep the newlines, it would not be too hard to only strip a newline from the last field, and use csv.writer to write the fields back out as properly quoted CSV.
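As a sketch of that last idea (same assumptions: semicolon separator, four fields), csv.writer takes care of the quoting:

import csv
import sys

sep = ';'
fields = 4
# csv.writer quotes any field that itself contains the separator or a newline
writer = csv.writer(sys.stdout, delimiter=sep)
collected = []
for line in sys.stdin:
    new = line.rstrip('\n').split(sep)
    if collected:
        collected[-1] += new[0]
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) >= fields:
        writer.writerow(collected)
        collected = []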
This is how I deal with this. This function fixes the line if there are more columns than needed or if there is a line break in the middle. Parameters of the function are:

message - content of the file (reader.read() in your case)
columns - number of expected columns
filename - filename (I use it for logging)

def pre_parse(message, columns, filename):
    parsed_message = []
    i = 0
    temp_line = ''
    for line in message.splitlines():
        #print(line)
        split = line.split(',')
        if len(split) == columns:
            parsed_message.append(line)
        elif len(split) > columns:
            print(f'Line {i} has been truncated in file {filename} - too many columns')
            split = split[:columns]
            line = ','.join(split)
            parsed_message.append(line)
        elif len(split) < columns and temp_line == '':
            temp_line = line.replace('\n', '')
            print(temp_line)
        elif temp_line != '':
            line = temp_line + line
            if line.count(',') == columns - 1:
                print(f'Line {i} has been fixed in file {filename} - extra line feed')
                parsed_message.append(line)
                temp_line = ''
            else:
                temp_line = line.replace('\n', '')
        i += 1
    return parsed_message

Make sure you use the proper split character and the proper line feed character.
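A minimal usage sketch (the filename and the column count of 4 are placeholders for your own values):

# read the raw file content, fix it, and write the cleaned lines back out
with open('test.csv', 'r') as reader:
    content = reader.read()

fixed = pre_parse(content, 4, 'test.csv')

with open('test_fixed.csv', 'w') as writer:
    writer.write('\n'.join(fixed))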
python3 split big file by delimiter into small files (not size, lines)
Newbie here. The ultimate mission is to learn how to take two big yaml files and split them into several hundred small files. I haven't yet figured out how to use the ID # as the filename, so one thing at a time. First: split the big files into many. Here's a tiny bit of my test data file test-file.yml. Each post has a - delimiter on a line by itself:

-
ID: 627
more_post_meta_data_and_content
-
ID: 628

And here's my code that isn't working. So far I don't see why:

with open('test-file.yml', 'r') as myfile:
    start = 0
    cntr = 1
    holding = ''
    for i in myfile.read().split('\n'):
        if (i == '-\n'):
            if start == 1:
                with open(str(cntr) + '.md', 'w') as opfile:
                    opfile.write(op)
                    opfile.close()
                holding = ''
                cntr += 1
            else:
                start = 1
        else:
            if holding == '':
                holding = i
            else:
                holding = holding + '\n' + i
myfile.close()

All hints, suggestions, pointers welcome. Thanks.
Reading the entire file into memory and then splitting the memory regions is very inefficient if the input files are large. Try this instead:

with open('test-file.yml', 'r') as myfile:
    opfile = None
    cntr = 1
    for line in myfile:
        if line == '-\n':
            if opfile is not None:
                opfile.close()
            opfile = open('{0}.md'.format(cntr), 'w')
            cntr += 1
        opfile.write(line)
    opfile.close()

Notice also, you don't close things you have opened in a with context manager; the very purpose of the context manager is to take care of this for you.
As a newbie myself, at first glance you're trying to write an undeclared variable op to your output. You were nearly spot on; you just need to write out the contents you have been collecting in holding:

with open('test-file.yml', 'r') as myfile:
    start = 0
    cntr = 1
    holding = ''
    for i in myfile.read().split('\n'):
        if (i == '-\n'):
            if start == 1:
                with open(str(cntr) + '.md', 'w') as opfile:
                    opfile.write(holding)
                holding = ''
                cntr += 1
            else:
                start = 1
        else:
            if holding == '':
                holding = i
            else:
                holding = holding + '\n' + i
myfile.close()

Hope this helps!
When you are working in a with context on an open file, the with will automatically take care of closing it for you when you exit the block. So you don't need file.close() anywhere. You can also iterate over the open file object itself; it yields one line at a time, which works much more efficiently than read() followed by a split(). Think about it: you are loading a massive file into memory, and then asking the CPU to split that ginormous text by the \n character. Not very efficient. You wrote opfile.write(op). Where is this op defined? Don't you want to write the content in holding that you've defined? Try the following.

with open('test.data', 'r') as myfile:
    counter = 1
    content = ""
    start = True
    for line in myfile:
        if line == "-\n" and not start:
            with open(str(counter) + '.md', 'w') as opfile:
                opfile.write(content)
            content = ""
            counter += 1
        else:
            if not start:
                content += line
        start = False

    # write the last file if test-file.yml doesn't end with a dash
    if content != "":
        with open(str(counter) + '.md', 'w') as opfile:
            opfile.write(content)
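For the other half of the mission, naming each small file after its ID, here is a hedged sketch; it assumes every record contains a line of exactly the form ID: 627, as in the sample data above:

with open('test-file.yml', 'r') as myfile:
    content = ""
    post_id = None
    for line in myfile:
        if line == "-\n":
            if post_id is not None:
                with open(post_id + '.md', 'w') as opfile:
                    opfile.write(content)
            content = ""
            post_id = None
        else:
            if line.startswith("ID:"):
                post_id = line.split(":", 1)[1].strip()  # e.g. "627"
            content += line
    # write the last record, which has no trailing dash
    if post_id is not None:
        with open(post_id + '.md', 'w') as opfile:
            opfile.write(content)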
Data comes out shifted using python
What this code is supposed to do is transfer weird looking .csv files written in one line into a multilined csv:

import csv
import re

filenmi = "original.csv"
filenmo = "data-out.csv"

infile = open(filenmi, 'r')
outfile = open(filenmo, 'w+')

for line in infile:
    print('read data :', line)
    line2 = re.sub('[^0-9|^,^.]', '', line)
    line2 = re.sub(',,', ',', line2)
    print('clean data: ', line2)
    wordlist = line2.split(",")
    n = (len(wordlist)) / 2
    print('num data pairs: ', n)
    i = 0
    print('data paired :')
    while i < n * 2:
        pairlst = wordlist[i:i+2]
        pairstr = ','.join(pairlst)
        print('  ', i/2+1, ' ', pairstr)
        pairstr = pairstr + '\n'
        outfile.write(pairstr)
        i = i + 2

infile.close()
outfile.close()

What I want this code to do is change a messed up .txt file

L,39,100,50.5,83,L,50.5,83

into a normally formatted csv file like the example below

39,100
50.5,83
50.5,83

but my data comes out like this

,39
100,50.5
83,50.5
83,

I'm not sure what went wrong or how to fix this, so it would be great if someone could help.

::Data Set::

L,39,100,50.5,83,L,50.5,83,57.5,76,L,57.5,76,67,67.5,L,67,67.5,89,54,L,89,54,100.5,49,L,100.5,49,111.5,45.5,L,111.5,45.5,134,42,L,134,42,152.5,44,L,152.5,44,160,46.5,L,160,46.5,168,52,L,168,52,170,56.5,L,170,56.5,162,64.5,L,162,64.5,152.5,70,L,152.5,70,126,85.5,L,126,85.5,113.5,94,L,113.5,94,98,105.5,L,98,105.5,72.5,132,L,72.5,132,64.5,145,L,64.5,145,57.5,165.5,L,57.5,165.5,57,176,L,57,176,63.5,199.5,L,63.5,199.5,69,209,L,69,209,76,216.5,L,76,216.5,83.5,222,L,83.5,222,90.5,224.5,L,90.5,224.5,98,225.5,L,98,225.5,105.5,225,L,105.5,225,115,223,L,115,223,124.5,220,L,124.5,220,133.5,216.5,L,133.5,216.5,142,212,L,142,212,149,207,L,149,207,156.5,201.5,L,156.5,201.5,163.5,195.5,L,163.5,195.5,172.5,185.5,L,172.5,185.5,175,180.5,L,175,180.5,177,173,L,177,173,177.5,154,L,177.5,154,174.5,142.5,L,174.5,142.5,168.5,133.5,L,168.5,133.5,150,131.5,L,150,131.5,135,136.5,L,135,136.5,120.5,144.5,L,120.5,144.5,110.5,154,L,110.5,154,104,161.5,L,104,161.5,99.5,168.5,L,99.5,168.5,98,173,L,98,173,97.5,176,L,97.5,176,99.5,178,L,99.5,178,105,179.5,L,105,179.5,112.5,179,L,112.5,179,132,175.5,L,132,175.5,140.5,175,L,140.5,175,149.5,175,L,149.5,175,157,176.5,L,157,176.5,169.5,181.5,L,169.5,181.5,174,185.5,L,174,185.5,178,206,L,178,206,176.5,214.5,L,176.5,214.5,161,240.5,L,161,240.5,144.5,251,L,144.5,251,134.5,254,L,134.5,254,111.5,254.5,L,111.5,254.5,98,253,L,98,253,71.5,248,L,71.5,248,56,246,
Your code fails because when you tried line2 = re.sub('[^0-9|^,^.]','',line), it outputs ,39,100,50.5,83,,50.5,83. In that line you are using re to replace any char that isn't a number, dot, or comma with nothing (''). This will remove the L in your input, but the comma that followed it will stay. I've just fixed that and made a little modification to how you create a csv list. The below code works.

import csv
import re

filenmi = "original.csv"
filenmo = "data-out.csv"

with open(filenmi, 'r') as infile:
    # get a list of words that must be split
    for line in infile:
        # remove any char which isn't a number, dot, or comma
        line2 = re.sub('[^0-9|^,^.]', '', line)
        # replace ",," with ","
        line2 = re.sub(',,', ',', line2)
        # remove the first char which is a ","
        line2 = line2[1:]
        # get a list of individual values, sep by ","
        wordlist = line2.split(",")

        parsed = []
        for i, val in enumerate(wordlist):
            # for every even index, get the word pair
            try:
                if i % 2 == 0:
                    parstr = wordlist[i] + "," + wordlist[i+1] + '\n'
                    parsed.append(parstr)
            except:
                print("Data set needs cleanup\n")

with open(filenmo, 'w+') as f:
    for item in parsed:
        f.write(item)
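One more side note on the pattern itself: inside a character class, | and a non-leading ^ are literal characters, so '[^0-9|^,^.]' also keeps any | or ^ in the input. If that isn't intended, the cleanup can be written more plainly; a small sketch on the sample line:

import re

line = "L,39,100,50.5,83,L,50.5,83"
line2 = re.sub(r'[^0-9.,]', '', line)  # keep only digits, dots and commas
line2 = re.sub(r',{2,}', ',', line2)   # collapse runs of commas left behind
line2 = line2.lstrip(',')              # drop the leading comma, the bug fixed above
print(line2)                           # 39,100,50.5,83,50.5,83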
Changing a text file and making a bigger text file in python
I have a tab separated text file like this example:

infile:

chr1    +    1071396    1271396    LOC
chr12   +    1101483    1121483    MIR200B

I want to divide the difference between columns 3 and 4 in infile into 100 and make 100 rows per row in infile, and make a new file named newfile: a final tab separated file with 6 columns. The first 5 columns would be like infile; the 6th column would be (5th column)_part number (number is 1 to 100). This is the expected output file:

expected output:

chr1    +    1071396    1073396    LOC    LOC_part1
chr1    +    1073396    1075396    LOC    LOC_part2
.
.
.
chr1    +    1269396    1271396    LOC    LOC_part100
chr12   +    1101483    1101683    MIR200B    MIR200B_part1
chr12   +    1101683    1101883    MIR200B    MIR200B_part2
.
.
.
chr12   +    1121283    1121483    MIR200B    MIR200B_part100

I wrote the following code to get the expected output, but it does not return what I expect.

file = open('infile.txt', 'rb')
cont = []
for line in file:
    cont.append(line)

newfile = []
for i in cont:
    percent = (i[3]-i[2])/100
    for j in percent:
        newfile.append(i[0], i[1], i[2], i[2]+percent, i[4], i[4]_'part'percent[j])

with open('output.txt', 'w') as f:
    for i in newfile:
        for j in i:
            f.write(i + '\n')

Do you know how to fix the problem?
Try this:

file = open('infile.txt', 'r')
cont = []
for line in file:
    cont.append(line.split())

newfile = []
for i in cont:
    diff = (int(i[3]) - int(i[2])) / 100
    left = int(i[2])
    right = left + diff
    for j in range(100):
        newfile.append([i[0], i[1], str(int(left)), str(int(right)), i[4], i[4] + '_part' + str(j + 1)])
        left = right
        right = right + diff

with open('output.txt', 'w') as f:
    for row in newfile:
        f.write('\t'.join(row) + '\n')

In your code, each i in cont is a whole line string, so i[3] is a single character and not a field. To fix that, I split each line into fields.
Here are some suggestions: when you open the file, open it as a text file, not a binary file:

open('infile.txt', 'r')

Now, when you read it line by line, you should strip the newline character at the end by using strip(). Then, you need to split your input text line by tabs into a list of strings, instead of just a long string containing your line, by using split('\t'):

line.strip().split('\t')

Now you have:

file = open('infile.txt', 'r')
cont = []
for line in file:
    cont.append(line.strip().split('\t'))

Now cont is a list of lists, where each list contains your tab separated data, i.e. cont[1][0] = 'chr12'. You will probably be able to take it from here.
Others have answered your question with respect to your own code; I thought I would leave my attempt at solving your problem here.

import os

directory = "C:/Users/DELL/Desktop/"
filename = "infile.txt"
path = os.path.join(directory, filename)

# open input and output files
with open(path, "r") as f_in, open(directory + "outfile.txt", "w") as f_out:
    for line in f_in:
        # split line into fields stored in the list 'contents'
        contents = line.rstrip().split("\t")
        diff = (int(contents[3]) - int(contents[2])) / 100
        for i in range(100):
            temp = f"{contents[0]}\t+\t{int(int(contents[2]) + diff*i)}\t{contents[3]}\t{contents[4]}\t{contents[4]}_part{i+1}"
            f_out.write(temp + "\n")

This code doesn't follow Python style conventions well (excessively long lines, for example), but it works. The line temp = ... uses f-strings to format the output string conveniently, which you could read more about here.
Replace Line with New Line in Python
I am reading a text file and searching data line by line; based on some condition, I change some values in the line and write it back into another file. The new file should not contain the old line. I have tried the following, but it did not work. I think I am missing a very basic thing. In C++ we can increment the line, but in Python I am not sure how to achieve this. So as of now, I am writing the old line and then the new line, but in the new file I want only the new line.

Example:

M0 38 A 19 40 DATA2 L=4e-08 W=3e-07 nf=1 m=1 $X=170 $Y=140 $D=8
M0 VBN A 19 40 TEMP2 L=4e-08 W=3e-07 nf=1 m=1 $X=170 $Y=140 $D=8

The code which I tried is the following:

def parsefile():
    fp = open("File1", "rb+")
    update_file = "File1" + "_update"
    fp_latest = open(update_file, "wb+")
    for line in fp:
        if line.find("DATA1") == -1:
            fp_latest.write(line)
        if line.find("DATA1") != -1:
            line = line.split()
            pin_name = find_pin_order(line[1])
            update_line = "DATA " + line[1] + " " + pin_name
            fp_latest.write(update_line)
            line = ''.join(line)
        if line.find("DATA2") != -1:
            line_data = line.split()
            line_data[1] = "TEMP2"
            line_data = ' '.join(line_data)
            fp_latest.write(line_data)
        if line.find("DATA3") != -1:
            line_data = line.split()
            line_data[1] = "TEMP3"
            line_data = ' '.join(line_data)
            fp_latest.write(line_data)
    fp_latest.close()
    fp.close()
The main problem with your current code is that your first if block, which checks for "DATA1" and writes the line out if it is not found, runs when "DATA2" or "DATA3" is present. Since those have their own blocks, the line ends up being duplicated in two different forms. Here's a minimal modification of your loop that should work:

for line in fp:
    if line.find("DATA1") != -1:
        data = line.split()
        pin_name = find_pin_order(data[1])
        line = "DATA " + data[1] + " " + pin_name
    if line.find("DATA2") != -1:
        data = line.split()
        data[1] = "TEMP2"
        line = ' '.join(data)
    if line.find("DATA3") != -1:
        data = line.split()
        data[1] = "TEMP3"
        line = ' '.join(data)
    fp_latest.write(line)

This ensures that only one line is written because there's only a single write() call in the code. The special cases simply modify the line that is to be written. I'm not sure I understand the modifications you want to have done in those cases, so there may be more bugs there. One thing that might help would be to make the second and third if statements into elif statements instead. This would ensure that only one of them would be run (though if you know your file will never have multiple DATA entries on a single line, this may not be necessary).
If you want to write a new line in a file, replacing the old content that was read last time, you can use the file.seek() method for moving around the file. Here is an example:

with open("myFile.txt", "r+") as f:
    offset = 0
    lines = f.readlines()
    for oldLine in lines:
        # ... calculate the new line value ...
        f.seek(offset)
        f.write(newLine)
        offset += len(newLine)
    f.seek(offset)
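One caveat with this pattern: if the rewritten content ends up shorter than the original, stale bytes from the old file remain past the last write. A common variant (a sketch; the replace() call is only a placeholder transformation) reads everything first, rewinds, rewrites, and truncates:

with open("myFile.txt", "r+") as f:
    lines = f.readlines()
    f.seek(0)                # rewind to the start of the file
    for old_line in lines:
        new_line = old_line.replace("DATA2", "TEMP2")  # placeholder transformation
        f.write(new_line)
    f.truncate()             # drop any leftover bytes from the old content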