I have a text file with data in it. I am trying to create a python code that will format this data in a particular way so another code I have can read it as an input. So, I am trying to remove specific lines and columns, etc. The text file contains some information at the top of the file, and then columns and columns of numerical data below it. However, I only want the numerical data. I do not want the first 34 lines of other information written in the text file.
So, my current question is: How can I remove specific lines of a text file? And how do I print this result so I can see if it worked or not?
Before you come at me: Yes, I have looked up my question on StackOverflow. I still don't get it, and I don't know what I'm doing wrong. I need your help! :)
(Image of my text file for reference)
Since each of the lines I want to remove in the text file begins with a '#', maybe it would be possible to tell the code to remove all lines that begin with '#'. Or maybe I should just specify that I want the first 34 lines deleted. Either way, I'm confused about how to do this.
Separated below are all my different attempts: (Don't worry, I have imported pandas for my attempts, it's just not written here.)
with open('EBSD_data.txt','r') as f:
    lines = f.readlines()
with open('EBSD_data.txt','w') as f:
    for lines in lines:
        if line.strip("\n").startswith('#'):
            f.write(lines)

with open('EBSD_data.txt', 'r') as E:
    data = E.read().splitlines(True)
with open('EBSD_data.txt', 'w') as EB:
    EB.writelines(data[34:])
print(EB.read)

with open('EBSD_data.txt', 'w') as data:
    for lines in data:
        if not lines.startswith('#'):
            data.write(lines)
with open('EBSD_data.txt', 'r') as data2:
    print(data2)

x = open('EBSD_data.txt')
for line in x:
    if line.startswith('#'):
        del(line)
x.head()

lines = []
with open(r'EBSD_data.txt','r') as fp:
    lines = fp.readlines()
with open(r'EBSD_data.txt','w') as data:
    for number, line in enumerate(lines):
        if number not in [34]:
            fp.write(line)
With some of these attempts, it'll look like it runs fine, but then I have issues with printing my results. The typical error I am encountering after trying to print my result is:
<_io.TextIOWrapper name='EBSD_data.txt' mode='w' encoding='UTF-8'>
Any guidance you could give would be great! Thank you! :)
I personally would always write to a new file and keep the original one.
This should work:
with open('EBSD_data.txt', 'r') as data:
    with open('EBSD_data_filtered.txt', 'w') as outfile:
        for line in data:
            if not line.startswith('#'):
                outfile.write(line)
with open('EBSD_data_filtered.txt', 'r') as data2:
    for line in data2:
        # each line already ends with a newline, so don't add another one
        print(line, end='')
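Both ideas from the question (dropping every line that starts with '#', or dropping a fixed number of leading lines) can be wrapped in small helpers. This is only a minimal sketch: the function names `drop_comment_lines` and `drop_first_lines` are made up for illustration, and it assumes the comment marker is the very first character of the line. Both helpers write to a separate output file so the original stays untouched.

```python
from itertools import islice


def drop_comment_lines(src_path, dst_path, prefix="#"):
    """Copy src to dst, leaving out every line that starts with prefix."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            if not line.startswith(prefix):
                dst.write(line)


def drop_first_lines(src_path, dst_path, count):
    """Copy src to dst, leaving out the first `count` lines."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        # islice(src, count, None) skips `count` lines, then yields the rest
        dst.writelines(islice(src, count, None))
```

To check the result, open the output file for reading and print what you read from it, for example `print(f.read())` inside `with open('EBSD_data_filtered.txt') as f:`. Printing the file object itself only shows something like `<_io.TextIOWrapper name=...>`, which is the object's description, not an error.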
Instead of manually reading each line with a Python context manager, you can use read_csv, skipping the first 34 rows of the file (skiprows=34), and without reading any column names from the file (header=None):
import pandas as pd
df = pd.read_csv('EBSD_data.txt', skiprows=34, header=None)
# print first five rows to confirm expected result
print(df.head())
# Optional: write result back to a CSV file, without the default
# pandas integer index and without a header row of column numbers
df.to_csv('EBSD_data_filtered.txt', index=False, header=False)
Related
I have a text file which contains text in the first 20 or so lines, followed by CSV data. Some of the text in the text section contains commas and so trying csv.reader or csv.dictreader doesn't work well.
I want to skip past the text section and only then start to parse the CSV data.
Searches don't yield much other than instructions to either use csv.reader/csv.dictreader and iterate through the rows that are returned (which doesn't work because of the commas in the text), or to read the file line-by-line and split the lines using ',' as the delimiter.
The latter works up to a point, but it produces strings, not numbers. I could convert the strings to numbers but I'm hoping that there's a simple way to do this either with the csv or numpy libraries.
As requested - Sample data:
This is the first line. This is all just text to be skipped.
The first line doesn't always have a comma - maybe it's in the third line
Still no commas, or was there?
Yes, there was. And there it is again.
and so on
There are more lines but they finally stop when you get to
EndOfHeader
1,2,3,4,5
8,9,10,11,12
3, 6, 9, 12, 15
Thanks for the help.
Edit#2
A suggested answer gave the following link entitled Read file from line 2...
That's kind of what I'm looking for, but I want to be able to read through the lines until I find the "EndOfHeader" and then call on the CSV library to handle the remainder of the file.
The reply by saimadhu.polamuri is part of what I've tried, specifically
with open(filename , 'r') as f:
    first_line = f.readline()
    for line in f:
        #test if line equals EndOfHeader. If true then parse as CSV
But that's where it comes apart - I can't see how to have CSV work with the data from this point forward.
With thanks to @Mike for the suggestion, the code is actually reasonably straightforward.
import csv

with open('data.csv', newline='') as f:   # open the file
    for i in range(7):                    # loop over the first 7 lines
        f.readline()                      # just read and discard them; next(f) also works
    r = csv.reader(f, delimiter=',')      # now pass the file handle to a csv reader
    for row in r:                         # and loop over the resulting rows
        print(row)                        # print the row, or do something else
In my actual code, it will search for the EndOfHeader line and use that to decide where to start parsing the CSV
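The EndOfHeader search mentioned above could look roughly like the following. This is a sketch rather than the actual code from the answer: the function name `read_after_marker` is made up, and it assumes the marker sits alone on its own line and that every value after it is numeric. Because the values are converted with `float()`, it also covers the wish for numbers instead of strings.

```python
import csv


def read_after_marker(path, marker="EndOfHeader"):
    """Skip lines up to and including the marker line, then parse the rest as CSV."""
    with open(path, newline="") as f:
        for line in f:
            if line.strip() == marker:
                break
        else:
            # the loop ended without finding the marker
            raise ValueError(f"{marker!r} not found in {path}")
        # The file handle is now positioned just after the marker line,
        # so the csv reader only sees the data rows.
        reader = csv.reader(f, skipinitialspace=True)
        return [[float(value) for value in row] for row in reader if row]
```

With the sample data from the question, the rows `1,2,3,4,5`, `8,9,10,11,12` and `3, 6, 9, 12, 15` come back as lists of floats, and the commas in the text section never reach the CSV parser.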
I'm posting this as an answer, as the question that this one supposedly duplicates doesn't explicitly consider this issue of the file handle and how it can be passed to a CSV reader, and so it may help someone else.
Thanks to all who took time to help.
I am unfamiliar with the csv library and with the "with open" syntax that online sources, including Stack Overflow, use for processing CSV files.
Here is the "with" syntax I was talking about; the correct number of lines only seems to be processed with this code.
How would I do it with the first block of code shown?
Your second code snippet using the csv module gives you list of lists. To get the same functionality, you should read each line of the csv file, strip the line endings, split the line with the separation character, and append to your list.
def file(file_name):
    f = open(file_name, "r")
    f.readline()  # read and discard the first line
    data = []
    for row in f:
        # strip the line ending, then split on the separator
        values = row.strip().split(",")
        data.append(values)
    f.close()
    return data
I have the following data in a file called data.txt and would like to be able to add to the numbers at the end and replace them in the file without creating a new one:
Alfreda,art,2015,35
brook,biology,2015,3
charlie,chemistry,2015,140
dolly,Design,2015,120
Emilia,English,2015,150
Fiona,french,2015,40
Grace,Greek,2015,12
Hanna,history,2015,15
Here is the code I currently have:
with open("data.txt", "r") as f:
    newline=[]
    for word in f.line():
        newline.append(word.replace(35,str(New))
with open("data.txt", "w") as f:
    for line in newline :
        f.writelines(line)
If you just want to append a string to each line and then update the file, this code solves your problem, although it is not optimal.
with open("data.txt", "r") as myFile:
    newline = []
    # Use the readlines method to get all the lines
    for line in myFile.readlines():
        # Remove the \n character with the rstrip method
        line = line.rstrip('\n')
        newline.append(line + ",35\n")  # Don't forget to add \n
    # Test
    print(newline)
# No explicit close() is needed: the with block closes the file
with open("data.txt", "w") as myFile:
    for line in newline:
        myFile.write(line)
If this is not your problem, try the pickle module and work with objects; it will be easier.
I'm going to have to make some of your question up. If you have a file and you want to update it, the updates have to come from somewhere. The code in the question has a New variable but there is no indication of how New is supposed to get a value, or how the program is supposed to know which row to update.
I'm going to assume you have a file of updates called updates.txt that looks like this (and it is deliberately not in alphabetical order):
Emilia,45
Alfreda,35
So after your program runs the resulting file will have two rows different:
Alfreda,art,2015,70 ...this one
brook,biology,2015,3
charlie,chemistry,2015,140
dolly,Design,2015,120
Emilia,English,2015,195 ...and this one
Fiona,french,2015,40
Grace,Greek,2015,12
Hanna,history,2015,15
But the rest the same.
Since your sample data file is a .csv file I am using the Python csv module, rather than picking the data apart by hand. It doesn't matter much with simple data like this but it's a good module to know about.
import csv
marks = {}
# Read in existing data into a dictionary:
# key is name, value is a list [subject, year, score]
# like this: {"Alfreda": ["art",2015,35], ... }
# This is to make it easy to do random updates based on name
with open("data.txt", "r") as f:
    for row in csv.reader(f):
        name,subject,year,score = row
        marks[name] = [subject,int(year),int(score)]
# Read in updates and apply each line to the corresponding entry in marks
with open("updates.txt", "r") as f:
    for row in csv.reader(f):
        name,added_score = row
        try:
            marks[name][2] += int(added_score) # for example marks["Alfreda"][2] += int("35")
        except KeyError:
            print(f"Name {name} not found to update, nothing done")
# Write out updated dictionary:
with open("data.txt", "w") as f:
    writer = csv.writer(f,lineterminator="\n")
    for name in sorted(marks.keys(), key=lambda n: n.lower()):
        row=[name]+marks[name] # for example ["Alfreda"] + ["art",2015,70]
        writer.writerow(row)
This line:
for name in sorted(marks.keys(), key=lambda n: n.lower()):
looks complicated but it is needed because you obviously expect the names Alfreda brook charlie dolly Emilia Fiona Grace Hanna to be in that order. But just doing the obvious
for name in sorted(marks.keys()):
will put them in the order Alfreda Emilia Fiona Grace Hanna brook charlie dolly.
In the interests of keeping the code simple and as close to your original as possible, it does no validity checks, so if this line
charlie,chemistry,2015,140
was wrongly entered as
charlie,chemistry,2015,14O
(with the letter O instead of a zero), the program will just fail. Ditto if the update file is missing a comma somewhere.
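If you do want the program to survive bad rows like that, one option is to collect them instead of crashing. The following is only a sketch under my own assumptions: the function name `load_marks` and the shape of the `problems` list are invented for illustration, and a row counts as bad if it does not have exactly four fields or if its year or score is not an integer.

```python
import csv


def load_marks(path):
    """Read name,subject,year,score rows; report bad rows instead of crashing."""
    marks, problems = {}, []
    with open(path, newline="") as f:
        for line_number, row in enumerate(csv.reader(f), start=1):
            try:
                name, subject, year, score = row
                marks[name] = [subject, int(year), int(score)]
            except ValueError as exc:
                # Raised for a wrong number of fields or a non-numeric year/score
                problems.append((line_number, row, str(exc)))
    return marks, problems
```

The caller can then decide what to do: print the entries in `problems` and stop before anything is written, or carry on with the good rows only.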
This works and will do what I think you want. But...
There are issues with the design. Your program reads in the data from data.txt, then overwrites it with new data. But suppose your program fails just after this line:
with open("data.txt", "w") as f:
Then you won't have your original data (because the call to open() truncated it), and you won't have the new data either (because you haven't written it out yet). Or suppose you accidentally run the program twice. There will be no way to tell you have done that.
You can provide some insurance against this sort of mishap by using the fileinput module, like this:
import fileinput
# Read in existing data
with fileinput.input("data.txt", inplace=True, backup=".bkp") as f:
    for row in csv.reader(f):
        name,subject,year,score = row
        marks[name] = [subject,int(year),int(score)]
With this change, your updates will be in data.txt as before, but your original data will still be around, in a file called data.txt.bkp.
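A different way to get the same kind of insurance, without fileinput, is to copy the original to a backup and write the new contents to a temporary file that is only swapped into place once it is complete. This is a sketch of that alternative, not part of the code above: the name `rewrite_safely` is made up, and it assumes the temporary file can be created in the same directory as the data file.

```python
import os
import shutil
import tempfile


def rewrite_safely(path, new_lines, backup_suffix=".bkp"):
    """Back up `path`, then replace it with `new_lines` in one step."""
    shutil.copy2(path, path + backup_suffix)  # keep the original around
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.writelines(new_lines)
        os.replace(tmp_path, path)  # swap the finished file into place
    except BaseException:
        os.remove(tmp_path)  # don't leave a half-written temp file behind
        raise
```

If the program dies while writing, data.txt still holds the old contents, because the only step that touches it is the final `os.replace`. It does not solve the "ran it twice by accident" problem, though.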
But that is just a fix. It avoids the real issue, which is that you really have a database application and you are trying to implement it using text files. The code above is all very well for an exercise, but it's not robust and it won't scale.
I'm new here and also new to programming with Python.
As an exercise I have to read data (lat & lon) from a txt file with many rows and convert it into a shapefile with QGIS.
After some reading I found a way to extract the data into arrays as step 1, but I have some issues.
I use the following code:
X=[]
Y=[]
f = open('D:/test_data/test.txt','r')
for line in f:
    triplets=f.readline().split() #error
    X=X.append(triplets[0])
    Y=Y.append(triplets[1])
f.close()
for i in X:
    print X[i]
with error:
ValueError: Mixing iteration and read methods would lose data
Probably it's a warning about losing the remaining rows, but I really don't need them for now.
for line in f: already iterates through the lines in the file, reading as it goes along. As such, it should be:
for line in f:
    triplets = line.split()
Alternatively, you could do as below, though I recommend the method above.
with open('D:/test_data/test.txt','r') as f:
    content = f.readlines()
for line in content:
    triplets = line.split()
    # append()
See Reading and Writing Files in python for more info.
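Also, append() does what it sounds like: it modifies the list in place and returns None, so you don't need the assignment (assigning the result would replace X with None).
X.append(triplets[0]) # not X=X.append(triplets[0])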
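Putting those fixes together, a complete version might look like the sketch below. It makes a few assumptions that are not in the question: the function name `read_lat_lon` is made up, the columns are assumed to be separated by whitespace, and the first two columns are converted to float on the guess that they hold numeric coordinates.

```python
def read_lat_lon(path):
    """Return two lists of floats taken from the first two columns of each line."""
    X, Y = [], []
    with open(path, "r") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            X.append(float(fields[0]))
            Y.append(float(fields[1]))
    return X, Y
```

When printing afterwards, loop over the values directly (`for x in X: print(x)`). In `for i in X`, the variable `i` is already an element of the list, not an index, so `X[i]` would fail.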
line already is the line. Get the triplets by
triplets = line.split()
I am trying to remove duplicates from a 3-column tab-delimited txt file. As long as the first two columns of two rows are identical, the duplicate should be removed, even if the two rows have different third columns.
from operator import itemgetter
import sys
input = sys.argv[1]
output = sys.argv[2]
#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)
seen = set()
data = []
for line in input.splitlines():
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)
    file = open(output, "w")
    file.write(data)
    file.close()
First, I get error
key = ig(line.split())
IndexError: list index out of range
Also, I can't see how to save the result to output.txt
People say saving to output.txt is a really basic matter. But no tutorial helped.
I tried methods that use codec, those that use with, those that use file.write(data) and all didn't help.
I could learn MatLab quite easily. The online tutorial was fantastic and a series of Googling always helped a lot.
But I can't find a helpful tutorial for Python yet. This is obviously because I am a complete novice. For complete novices like me, what would be the best tutorial with 1) comprehensiveness, 2) lots of examples, and 3) line-by-line explanation that doesn't leave any line without explanation?
And why is the above code causing error and not saving result?
I'm assuming that, since you assign input to the first command line argument with input = sys.argv[1] and output to the second, you intend those to be your input and output file names. But you're never opening any file for the input data, so you're calling .splitlines() on a file name, not on file contents.
Next, splitlines() is the wrong approach here anyway. To iterate over a file line by line, simply use for line in f, where f is an open file. Those lines will include the newline at the end of the line, so it needs to be stripped if it's not supposed to be part of the third column's data.
Then you're opening and closing the file inside your loop, which means you'll try to write the entire contents of data to the file every iteration, effectively overwriting any data written to the file before. Therefore I moved that block out of the loop.
It's good practice to use the with statement for opening files. with open(out_fn, "w") as outfile will open the file named out_fn and assign the open file to outfile, and close it for you as soon as you exit that indented block.
input is a builtin function in Python. I therefore renamed your variables so no builtin names get shadowed.
You're trying to write data directly to the output file. This won't work since data is a list of lines. You need to join those lines first in order to turn them into a single string again before writing it to a file.
So here's your code with all those issues addressed:
from operator import itemgetter
import sys
in_fn = sys.argv[1]
out_fn = sys.argv[2]
getkey = itemgetter(0, 1)
seen = set()
data = []
with open(in_fn, 'r') as infile:
    for line in infile:
        line = line.strip()
        key = getkey(line.split())
        if key not in seen:
            data.append(line)
            seen.add(key)
with open(out_fn, "w") as outfile:
    outfile.write('\n'.join(data))
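Since the file is tab-delimited, the same idea can also be written with the csv module, which writes each kept row straight away instead of collecting everything in a list first. This is a sketch of an alternative, not a replacement for the code above: the function name `dedupe_on_columns` and its `key_columns` parameter are invented for illustration, and it assumes the fields contain no quoting that the csv module would reinterpret.

```python
import csv


def dedupe_on_columns(in_path, out_path, key_columns=(0, 1)):
    """Keep the first row for each distinct combination of key_columns."""
    seen = set()
    with open(in_path, newline="") as infile, \
         open(out_path, "w", newline="") as outfile:
        reader = csv.reader(infile, delimiter="\t")
        writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
        for row in reader:
            if len(row) <= max(key_columns):
                continue  # too few columns to build a key (e.g. blank line)
            key = tuple(row[i] for i in key_columns)
            if key not in seen:
                seen.add(key)
                writer.writerow(row)
```

Skipping rows that are too short also avoids the IndexError from the question when the input contains blank lines.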
Why is the above code causing error?
Because you haven't opened the file, you are trying to work with the string input.txt rather than with the file. Then when you try to access your item, you get a list index out of range because line.split() returns ['input.txt'], which has no second element.
How to fix that: open the file and then work with it, not with its name.
For example, you can do (I tried to stay as close to your code as possible)
input = sys.argv[1]
infile = open(input, 'r')
(...)
lines = infile.readlines()
infile.close()
for line in lines:
    (...)
Why is this not saving result?
Because you are opening/closing the file inside the loop. What you need to do is write the data once you're out of the loop. Also, you cannot write a list directly to a file. Hence, you need to do something like this (outside of your loop):
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
All together
There are other ways of reading/writing files, and they are pretty well documented on the internet, but I tried to stay close to your code so that you would understand better what was wrong with it.
from operator import itemgetter
import sys
input = sys.argv[1]
infile = open(input, 'r')
output = sys.argv[2]
#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)
seen = set()
data = []
lines = infile.readlines()
infile.close()
for line in lines:
    print(line)
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)
print(data)
# write the result once, after the loop has finished
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
PS: it seems to produce the result that you needed in "Python to remove duplicates using only some, not all, columns".