Python: removing specific long lines of text in a file

I've only programmed in Python for the past 8 months, so please excuse my probably noobish approach.
My problem is the following, and I hope someone can help me solve it.
I have lots of data in a file, for instance something like this (just a snip):
SWITCH MGMT IP;SWITCH HOSTNAME;SWITCH MODEL;SWITCH SERIAL;SWITCH UPTIME;PORTS NOT IN USE
10.255.240.1;641_HX_3560X;WS-C3560X-24P-S;FDO1601V031;12 weeks, 3 days, 23 hours, 33 minutes;1
10.255.240.7;641_HX_LEFT_2960x;WS-C2960X-24PS-L;FOC1750S2E5;12 weeks, 4 days, 7 minutes;21
10.255.240.8;641_UX_BASEMENT_2960x;WS-C2960X-24PS-L;FOC1750S2AG;12 weeks, 4 days, 7 minutes;12
10.255.240.9;641_UX_SPECIAL_2960x;WS-C2960X-24PS-L;FOC1750S27M;12 weeks, 4 days, 8 minutes;25
10.255.240.2;641_UX_OFFICE_3560;WS-C3560-8PC-S;FOC1202U24E;2 years, 30 weeks, 3 days, 16 hours, 43 minutes;2
10.255.240.3;641_UX_SFO_2960x;WS-C2960X-24PS-L;FOC1750S2BR;12 weeks, 4 days, 7 minutes;14
10.255.240.65;641_HX_3560X;WS-C3560X-24P-S;FDO1601V031;12 weeks, 3 days, 23 hours, 34 minutes;1
10.255.240.5;641_HX_RIGHT_2960s;WS-C2960S-24PS-L;FOC1627X1BF;12 weeks, 4 days, 12 minutes;16
10.255.240.6;641_HX_LEFT_2960x-02;WS-C2960X-24PS-L;FOC1750S2C4;12 weeks, 4 days, 7 minutes;15
10.255.240.4;641_UX_BASEMENT_2960s;WS-C2960S-24PS-L;FOC1607Z27T;12 weeks, 4 days, 8 minutes;3
10.255.240.62;641_UX_OFFICE_3560CG;WS-C3560CG-8PC-S;FOC1646Y0U2;15 weeks, 5 days, 12 hours, 15 minutes;6
I want to run through all the data in the file and check whether a serial number occurs more than once. If it does, I want to remove the duplicate. The reason the result might contain the same switch or router multiple times is that it might have several layer 3 interfaces from which it can be managed.
So in the above example, after I've run through the data it should remove the line:
10.255.240.65;641_HX_3560X;WS-C3560X-24P-S;FDO1601V031;12 weeks, 3 days, 23 hours, 34 minutes;1
since the second line in the file already contains the same switch and serial number.
I've spent several days trying to figure out how to achieve this, and it is starting to give me a headache.
My base code looks like this:
import os

if os.stat("output.txt").st_size != 0:
    with open('output.txt', 'r') as file:
        header_line = next(file)  # Start from line 2 in the file.
        data = []  # Contains the data from the file.
        sn = []    # Contains the serial numbers to check up against.
        ok = []    # Will contain the clean data with no duplicates.
        data.append(header_line.split(";"))  # Write the header to data.
        for line in file:  # Run through the file data line by line.
            serialchk = line.split(";")  # Split the data into a list.
            data.append(serialchk)       # Write the data to the data list.
            sn.append(serialchk[3])      # Write the serial number to the sn list.
        end = len(data)  # Save the length of the data list, so I can run through the data.
        i = 0  # For my while loop, so I know when to stop.
        while i != end:  # From here on out I am pretty lost on how to achieve my goal.
            found = 0
            for x in range(len(data)):
                if sn[i] == data[x][3]:
                    found += 1
                    print data[x]
                    ok.append(data[x])
                elif found > 1:
                    print "Removing:\r\n"
                    print data[x-1]
                    del ok[-1]
                    found = 0
            i += 1
Is there a more pythonic way to do this? I am pretty sure that, with all the talented people here, someone can give me clues on how to make this happen.
Thank you very much in advance.

You're making it way more complicated than it has to be, and it's not memory-friendly (you don't have to load the whole file into memory to filter duplicates).
The simple way is to read your file line by line, and for each line check if the serial number has already been seen. If yes, skip the line, else store the serial number and write the line to your output file:
seen = set()
with open('output.txt', 'r') as source, open("cleaned.txt", "w") as dest:
    dest.write(next(source))  # Copy the header, so we start from line 2 in the file.
    for line in source:
        sn = line.split(";")[3]
        if sn not in seen:
            seen.add(sn)
            dest.write(line)
        # else: well, we just ignore the line ;)
NB: I assume you want to write the deduplicated lines back to a file. If you want to keep them in memory the algorithm is mostly the same, just append your deduplicated lines to a list instead - but beware of memory usage if you have huge files.
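For example, a minimal sketch of that in-memory variant, under the same assumptions about the input file:
seen = set()
cleaned = []  # The deduplicated lines, kept in memory.
with open('output.txt', 'r') as source:
    cleaned.append(next(source))  # Keep the header.
    for line in source:
        sn = line.split(";")[3]
        if sn not in seen:
            seen.add(sn)
            cleaned.append(line)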

My suggestion:
import os

if os.stat("output.txt").st_size != 0:
    with open('output.txt', 'r') as file:
        header_line = next(file)  # Start from line 2 in the file.
        srn = set()  # Create a set where the seen serial numbers will be stored.
        ok = []      # Will contain the clean data with no duplicates.
        ok.append(header_line.split(";"))  # Write the header to ok.
        for line in file:  # Run through the file data line by line.
            serialchk = line.split(";")  # Split the data into a list.
            if serialchk[3] not in srn:  # If the serial number hasn't been seen,
                ok.append(serialchk)     # add the row to ok
                srn.add(serialchk[3])    # and add the serial number to the seen set.
            else:  # If the serial number has already been seen,
                print "Removing: " + ";".join(serialchk)  # notify the user it has been skipped.
You'll end up with ok containing only rows with unique serial numbers, and the removed rows will be printed.
Hopefully this helps.

I will walk you through the changes I would make.
The first thing I would do is use the csv module to parse the input. Since you can iterate over the DictReader, I opt for that for brevity. The list data will contain the final (cleaned) results.
from csv import DictReader
import os

if os.stat("output.txt").st_size != 0:
    with open('output.txt', 'r') as f:
        reader = DictReader(f, delimiter=';')  # Create the reader instance.
        serial_numbers = set()
        data = []
        for row in reader:
            if row["SWITCH SERIAL"] not in serial_numbers:
                data.append(row)
                serial_numbers.add(row["SWITCH SERIAL"])
            # else: duplicate serial, skip the row.
With my approach the format of the data changes from a list of lists to a list of dicts, but if you want to save the cleaned data into a new csv file, the DictWriter class should make that easy.
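For example, a minimal sketch of writing the cleaned rows back out (the output file name cleaned.txt is just a placeholder; the field names are reused from the reader above):
from csv import DictWriter

with open('cleaned.txt', 'w') as out:
    writer = DictWriter(out, fieldnames=reader.fieldnames, delimiter=';')
    writer.writeheader()    # Re-emit the original header row.
    writer.writerows(data)  # Write the deduplicated rows.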

Related

Trying to skip over several lines but the skipped lines are still being worked on

My program takes an input file, reads the file using whitespace as the delimiter, and puts the data into an array. I then want to iterate over each line, and if certain strings are found, write that info to another file.
When a specific string is found, I want to skip over several lines, meaning that those lines are NOT iterated over. I thought that if I increased the 'line' variable (i) that would do it, but despite the fact that i is increased by 50, those 50 lines are still being worked on, which is not what I want.
Hopefully I have explained this problem well. Thank you in advance for your feedback.
def create_outfile(infile):
    gto_found = 0
    outfile = "output.txt"        # Output file
    outfile = open(outfile, 'w')  # Open output file for writing
    for i in range(len(infile)):  # Iterate over each line
        if len(infile[i]) == 6:
            if (infile[i][4][1:-1]) == "GTO" and gto_found == 0:  # Now skip
                print(i)
                print(infile[i])
                debugPause = input("\nPausing to debug...\n")
                i = i + 50  # Skip over the GTO section
                gto_found = 1
                print(i)
                debugPause = input("\nPausing to debug...\n")
                print(infile[i])
        for j in range(len(infile[i])):  # Iterate over each element
            # Command section
            if (infile[i][j])[:5] == "#ACS_":
                pass  # Do some work
Unfortunately, Python does not allow a for loop to jump ahead like that: rebinding the loop variable i inside the loop body has no effect, because range simply produces the next value on the following iteration. This is the same as this question here, so check it out. This other topic shows some workarounds that you could use.
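For example, a minimal sketch of the usual workaround, using a while loop whose index you control yourself (assuming infile is the list of pre-split lines from the question; the gto_found flag is omitted for brevity):
i = 0
while i < len(infile):
    if len(infile[i]) == 6 and infile[i][4][1:-1] == "GTO":
        i += 50  # In a while loop this jump actually takes effect.
        continue
    # ... process infile[i] as before ...
    i += 1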

Reading numbers off a list from a txt file, but only up to a comma

This is data from a lab experiment (around 717 lines of data). Rather than trying to deal with it in Excel, I want to import and graph it in either Python or MATLAB. I'm new here btw... and am a student!
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
more numbers: see screenshot of more data from my file
I just can't figure out how to read the line up until a comma. Specifically, I need the Load numbers for one of my arrays/list, so for example on the first line I only need 62.638 (which would be the first number on my first index on my list/array).
How can I get an array/list of this, something that iterates/reads the list and ignores strings?
Thanks!
NOTE: I use Anaconda + Jupyter Notebooks for Python & Matlab (school provided software).
EDIT: Okay, so I came home today and worked on it again. I hadn't dealt with CSV files before, but after some searching I was able to learn how to read my file, somewhat.
import csv
from itertools import islice

with open('Blue_bar_GroupD.txt', 'r') as BB:
    BB_csv = csv.reader(BB)
    x = 0
    BB_lb = []
    while x < 7:  # To skip the string data
        next(BB_csv)
        x += 1
    for row in islice(BB_csv, 0, 758):
        print(row[0])  # Testing if I can read row data
Okay, here is where I am stuck. I want to make an array/list that has the 0th index value of each row. Sorry if I'm a freaking noob!
Thanks again!
You can skip all lines up to the first data row and then parse the data into a list for later use - 700+ lines can easily be processed in memory.
Therefore you need to:
read the file line by line
remember the last non-empty line before the numbers start (== header)
check if the line consists only of numbers/commas/dots, else increase a skip-counter (== data)
seek back to 0
skip enough lines to get to the header or the data
read the rest into a data structure
Create test file:
text = """
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
"""
with open("t.txt", "w") as w:
    w.write(text)
Some helpers and the skipping/reading logic:
import re
import csv

def convert_row(row):
    """Convert one row of data into a list of mixed floats and others.
    Float is the preferred data type; otherwise the value is kept as a string."""
    d = []
    for v in row:
        try:
            # Try to convert to float and append.
            d.append(float(v))
        except ValueError:
            # Not a float, append as is.
            d.append(v)
    return d

def count_to_first_data(fh):
    """Count lines in fh not consisting solely of numbers, dots and commas.
    Side effect: resets the position in fh to 0."""
    skiplines = 0
    header_line = 0
    fh.seek(0)
    for line in fh:
        if re.match(r"^[\d.,]+$", line):
            fh.seek(0)
            return skiplines, header_line
        else:
            if line.strip():
                header_line = skiplines
            skiplines += 1
    raise ValueError("File does not contain pure number rows!")
Usage of helpers / data conversion:
data = []
with open("t.txt", "r") as csvfile:
    skip_to_data, skip_to_header = count_to_first_data(csvfile)
    for _ in range(skip_to_header):  # Use skip_to_data if you do not want the headers.
        next(csvfile)
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        row_data = convert_row(row)
        if row_data:
            data.append(row_data)
print(data)
Output (reformatted):
[['Load (lbf)', 'Time (s)', 'Crosshead (in)', 'Extensometer (in)'],
[62.638, 0.9, 0.0, 8e-05],
[122.998, 1.7, 0.001, 0.00012]]
Docs:
re.match
csv.reader
Methods of file objects (e.g. seek())
With this you now have "clean" data that you can use for further processing - including your headers.
For visualization you can have a look at matplotlib
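For example, a minimal plotting sketch, assuming data exactly as printed above (first row headers, remaining rows floats) and that matplotlib is installed:
import matplotlib.pyplot as plt

header, rows = data[0], data[1:]
time = [r[1] for r in rows]  # The "Time (s)" column.
load = [r[0] for r in rows]  # The "Load (lbf)" column.
plt.plot(time, load)
plt.xlabel(header[1])
plt.ylabel(header[0])
plt.show()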
I would recommend reading your file with python
data = []
with open('my_txt.txt', 'r') as fd:
    # Suppress header lines
    for i in range(6):
        fd.readline()
    # Read data lines up to the first column
    for line in fd:
        index = line.find(',')
        if index >= 0:
            data.append(float(line[0:index]))
which leads to a list containing the data of the first column:
>>> data
[62.638, 122.998]
The MATLAB solution is less nice, since you have to know the number of data lines in your file (which you do not need to know in the Python solution):
n_header = 6
n_lines = 2 % Insert here 717 (as you mentioned)
M = csvread('my_txt.txt', n_header, 0, [n_header 0 n_header+n_lines-1 0])
leads to:
>> M
M =
62.6380
122.9980
For the sake of clarity: you can also use MATLAB's textscan function to achieve what you want without knowing the number of lines, but still, the Python code would be the better choice in my opinion.
Based on your format, you will need three steps: one, read all lines; two, determine which lines to use; last, get the floats and assign them to a list.
Assuming your file name is name.txt, try:
f = open("name.txt", "r")
all_lines = f.readlines()
grid = []
for line in all_lines:
    if ('"' not in line) and (line != '\n'):
        grid.append(list(map(float, line.strip('\n').split(','))))
f.close()
The grid will then contain a series of lists, each holding one of your groups of floats.
Explanation for fun:
In the for loop, I searched for the double quote to eliminate any string rows, since all strings are enclosed in quotes. The other check skips empty lines.
Based on your needs, you can use the list grid as you please. For example, to fetch the first line's first number, do grid[0][0], as Python's lists count from 0 to n-1 for n elements.
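And since the question specifically asks for the Load numbers, a one-line sketch of pulling the whole first column out of grid:
loads = [row[0] for row in grid]  # e.g. [62.638, 122.998, ...]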
This is super simple in MATLAB, just 2 lines:
data = dlmread('data.csv', ',', 6,0);
column1 = data(:,1);
Where 6 and 0 should be replaced by the row and column offset you want. So in this case, the data starts at row 7 and you want all the columns, then just copy over the data in column 1 into another vector.
As another note, try typing doc dlmread in MATLAB - it brings up the help page for dlmread. This is really useful when you're looking for MATLAB functions, as it has suggestions for similar functions at the bottom.

Pandas: how to read csv with multiple lines on the same cell?

I have a csv that I am not able to read using read_csv
Opening the csv with sublime text shows something like:
col1,col2,col3
text,2,3
more text,3,4
HELLO
THIS IS FUN
,3,4
As you can see, the text HELLO THIS IS FUN takes three lines, and pd.read_csv is confused as it thinks these are three new observations. How can I parse that correctly in Pandas?
Thanks!
It looks like you'll have to preprocess the data manually:
with open('data.csv', 'r') as f:
    lines = f.read().splitlines()
processed = []
buffer = ''
for line in lines:
    buffer += line  # Append the current line to a buffer.
    c = buffer.count(',')
    if c == 2:  # A complete row always contains exactly 2 commas.
        processed.append(buffer)
        buffer = ''
    elif c > 2:
        raise ValueError('Row has too many fields')  # This should never happen.
This assumes that your data only contains unwanted newlines, e.g. if you had data with say, 3 elements in one row, 2 elements in the next, then the next row should either be blank or contain only 1 element. If it has 2 or more, i.e. it's missing a necessary newline, then an error is thrown. You can accommodate this case if necessary with a minor modification.
Actually, it might be more efficient to remove newlines instead, but it shouldn't matter unless you have a lot of data.
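From there, a sketch of handing the repaired rows to pandas without writing a temporary file (assuming pandas is imported as pd):
import io
import pandas as pd

df = pd.read_csv(io.StringIO('\n'.join(processed)))
print(df)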

Want to skip the first and last 5 lines while reading a file in Python

I want to see only the data from line 5 up to what comes before the last 5 rows in a file, while reading that particular file.
Code I have used as of now:
f = open("/home/auto/user/ip_file.txt")
lines = f.readlines()[5:]  # This will start from line 5, but how do I set the end?
for line in lines:
    print("print line ", line)
Please advise; I am a newbie at Python.
Any suggestions are most welcome too.
You could use a neat feature of slicing: you can count from the end with a negative slice index (see also this question):
lines = f.readlines()[5:-5]
just make sure there are more than 10 lines:
all_lines = f.readlines()
lines = [] if len(all_lines) <= 10 else all_lines[5:-5]
(this is a conditional expression, Python's version of a ternary operator)
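If the file is too big to read into memory at once, a streaming sketch instead, using only the standard library (assuming 5 lines skipped on each end, as in the question):
from collections import deque
from itertools import islice

def middle_lines(path, skip=5):
    """Yield the file's lines, dropping the first and last `skip` lines."""
    with open(path) as f:
        it = islice(f, skip, None)                     # Drop the first `skip` lines.
        window = deque(islice(it, skip), maxlen=skip)  # Buffer the next `skip` lines.
        for line in it:
            yield window.popleft()  # Safe to emit: `skip` lines still follow it.
            window.append(line)
        # The lines left in `window` are the last `skip`; they are never yielded.

for line in middle_lines("/home/auto/user/ip_file.txt"):
    print("print line ", line)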

Convert CSV to txt and start new line every 10 values using Python

I have a csv file with an array of values: 324 rows and 495 columns. All the values for each row and column are the same.
I need to have this array split up so that every 10 values are put in a new row. So for each of the 324 rows, there will be 49 full columns with 10 values and 1 column with 5 values (495 columns / 10 values = 49 new rows with 10 values and 1 new row with 5 values). Then go to the next row, and so on for all 324 rows.
The trouble I'm having is as follows:
line.split(",") does not seem to be doing anything
everything after the line.split doesn't seem to do anything either
I'm not sure my for newrow in range... is correct
I haven't put in the write-output-to-text-file part yet; I think it should be outFile.write(something goes here, not sure what)
I put "\n" after the print statement, but it just printed it out
I'm a beginner programmer.
Script:
import string
import sys

# Open csv file...in read mode
inFile = open("CSVFile", 'r')
outFile = open("TextFile.txt", 'w')
for line in inFile:
    elmCellSize = line.split(",")
    for newrow in range(0, len(elmCellSize)):
        if (newrow / 10) == int(newrow / 10):
            print elmCellSize[0:10]
outFile.close()
inFile.close()
You should really be using the csv module, but I can give some advice anyway.
One problem you're having is that when you say print elmCellSize[0:10], you are always taking the first 10 elements, not the most recent 10 elements. Depending on how you want to do this, you could keep track of the most recent 10 elements instead. I'll show an example below, after mentioning a few things that you can fix in your code.
First note that line.split(',') returns a list. So your choice of variable name elmCellSize is a little misleading. If you were to say lineList = line.split(',') it might make more sense? Or if you were to say lineSize = len(line.split(',')) and use that?
Also (although I don't know anything about Python 2.x) I think xrange is a function for Python 2.x which is more efficient than range, although it works exactly the same way.
Instead of saying if (newrow/10) == int(newrow/10), you can actually say if index % 10 == 0, to check if index is a multiple of 10. % can be thought of as 'remainder', so it will give the remainder of newrow when divided by 10. (Ex: 5 % 10 = 5; 17 % 10 = 7; 30 % 10 = 0)
Now instead of printing [0:10], which will always print the first 10 elements, you want to print from the current index back 10 spaces. So you can say print lineList[index-10:index] in order to print the most recent 10 elements.
In the end you'll have something like
...
lineList = line.split(',')  # Really, you should use csv reader
# Open the file to write to
with open('yourfile.ext', 'w') as f:
    # Iterate through the line
    for index, value in enumerate(lineList):
        if index % 10 == 0 and index != 0:
            # Write the last 10 values to the file, separated by commas
            f.write(','.join(lineList[index-10:index]))
            # New line
            f.write('\n')
            # Print
            print lineList[index-10:index]
I'm certainly not an expert, but I hope this helps!
Ok, this script almost works, I think.
The problem right now is that it stops writing to the outFile after the first 49 rows. It makes the 10 columns for 49 rows, but there should be a 50th row with only 5 columns, because each row from the CSV file is 495 columns. So the current script writes out the last 10 values to a new row 49 times, but it doesn't get those extra 5. Plus, it has to do this another 323 times, because the original CSV file has 324 rows.
So I think the problem now is possibly in the last if statement; maybe an else statement is needed, but my elif statement did not do anything. I want it to say: if the 6th value in the list is an end-of-line character ('\n'), then write the 5 values in the list prior to the end of line... it didn't work.
Thanks for all the help so far, I appreciate it!
Here is the script:
import string
#import sys
#import csv

# Open csv file...in read mode
inFile = open("CSVFile.csv", 'r')
outFile = open("TextFile.txt", 'w')
for line in inFile:
    lineList = line.split(',')  # Really, you should use csv reader
    # Open the file to write to
    with open('outFile', 'w') as outFile:
        # Iterate through the line
        for index, value in enumerate(lineList):
            if index % 10 == 0 and index != 0:
                # Write the last 10 values to the file, separated by tabs
                outFile.write('\t'.join(lineList[index-10:index]))
                # New line
                outFile.write('\n')
                # Print
                print lineList[index-10:index]
            elif lineList[6] == '\n':
                # Write the last 5 values to the file, separated by spaces
                outFile.write(' '.join(lineList[index-5:index]))
                # New line
                outFile.write('\n')
                # Print
                print lineList[index-5:index]
outFile.close()
inFile.close()
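For what it's worth, a minimal sketch of one way around both remaining problems (the missing final 5 values, and the output file being reopened for every input row): step through each row 10 values at a time, so the final slice automatically holds whatever is left over. The file names are the ones from the script above.
with open("CSVFile.csv", 'r') as inFile, open("TextFile.txt", 'w') as outFile:
    for line in inFile:
        values = line.strip('\n').split(',')
        # Step through the row 10 values at a time; the last slice
        # simply contains the leftover 5 values.
        for start in range(0, len(values), 10):
            outFile.write('\t'.join(values[start:start + 10]))
            outFile.write('\n')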
