python tab delimited retrieve column and delete empty lines

I have a tab-delimited text file that consists of two columns, something like:
Apple123 2
Orange933 2
Banana33334 2
There may be empty lines at the bottom. How can I:
1. Strip the empty lines, and
2. Write out a file that contains only the first column?
My problem right now is that if I use line.strip(), the line ends up as a list with length 10 (for example, for the first line), not 2. If I use csv.reader(..., dialect="excel-tab"), then I can't use strip(), so I can't get rid of the empty lines.

This should do the trick:
with open(infilename) as infile, open(outfilename, "w") as outfile:  # the output file must be opened for writing
    for line in infile:
        line = line.strip()
        if line:
            outfile.write("{}\n".format(line.split("\t")[0]))

You could maybe do this with Python's basic string manipulation (str.split and so on):
infile = open("/path/to/myfile.txt")
outfile = open("/path/to/output.txt", "w")  # Clears existing file, opens it for writing
for line in infile:
    if len(line.strip()) == 0:
        # skip blank lines
        continue
    # Get first column, write it to file
    col1 = line.split("\t")[0]
    outfile.write(col1 + "\n")
outfile.close()
infile.close()
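If you want to stay with the csv module from the question, note that the dialect name has to be quoted: csv.reader(infile, dialect="excel-tab"). Empty lines then come back as empty rows (empty lists), so they can be skipped with a simple truth test. A minimal sketch along those lines (file names are placeholders):
import csv

with open(infilename, newline="") as infile, open(outfilename, "w") as outfile:
    for row in csv.reader(infile, dialect="excel-tab"):
        if row and row[0].strip():  # empty lines parse as empty rows; skip them
            outfile.write(row[0] + "\n")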

Related

Issue removing multiple duplicate lines from a text file

I am trying to remove duplicate lines from a text file and keep facing issues... The output file keeps putting the first two accounts on the same line. Each account should have a different line... Does anyone know why this is happening and how to fix it?
with open('accounts.txt', 'r') as f:
    unique_lines = set(f.readlines())
with open('accounts_No_Dup.txt', 'w') as f:
    f.writelines(unique_lines)
accounts.txt:
#account1
#account2
#account3
#account4
#account5
#account6
#account7
#account5
#account8
#account4
accounts_No_Dup.txt:
#account4#account3
#account4
#account8
#account5
#account7
#account1
#account2
#account6
print(unique_lines)
{'#account4', '#account7\n', '#account3\n', '#account6\n', '#account5\n', '#account8\n', '#account4\n', '#account2\n', '#account1\n'}
The last line in your file is missing a newline (technically a violation of POSIX standards for text files, but so common you have to account for it), so "#account4\n" earlier on is interpreted as unique relative to "#account4" at the end. I'd suggest unconditionally stripping newlines, and adding them back when writing:
with open('accounts.txt', 'r') as f:
    unique_lines = {line.rstrip("\r\n") for line in f}  # Remove newlines for consistent deduplication
with open('accounts_No_Dup.txt', 'w') as f:
    f.writelines(f'{line}\n' for line in unique_lines)  # Add newlines back
By the by, on modern Python (CPython/PyPy 3.6+, 3.7+ for any interpreter), you can preserve order of first appearance by using a dict rather than a set. Just change the read from the file to:
unique_lines = {line.rstrip("\r\n"): None for line in f}
and you'll see each line the first time it appears, in that order, with subsequent duplicates being ignored.
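Putting that together, the whole order-preserving version might look like this (same file names as above):
with open('accounts.txt', 'r') as f:
    unique_lines = {line.rstrip("\r\n"): None for line in f}  # dict keys keep insertion order on 3.7+
with open('accounts_No_Dup.txt', 'w') as f:
    f.writelines(f'{line}\n' for line in unique_lines)  # iterating a dict yields its keys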
Your problem is that a set changes the order of your lines, and your last element doesn't end with \n because there is no empty line at the end of your file.
Just add the separator yourself, or don't use a set.
with open('accounts.txt', 'r') as f:
    unique_lines = set()
    for line in f.readlines():
        if not line.endswith('\n'):
            line += '\n'
        unique_lines.add(line)
with open('accounts_No_Dup.txt', 'w') as f:
    f.writelines(unique_lines)
You can do this easily with pandas' unique method.
The code is as below:
import pandas as pd
data = pd.read_csv('d:\\test.txt', header=None)  # the sample lines contain no commas, so each line reads as one column
df = pd.DataFrame(data[0].unique())
with open('d:\\testnew.txt', 'a') as f:
    f.write(df.to_string(header=False, index=False))
Result: the output file contains the test data with the duplicate lines removed.
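One caveat: to_string pads values to a common width, so lines of different lengths pick up extra spaces. If you want the surviving lines written back exactly as they were, a Series-based sketch (same example paths; order of first appearance is preserved):
import pandas as pd

data = pd.read_csv('d:\\test.txt', header=None)
data[0].drop_duplicates().to_csv('d:\\testnew.txt', index=False, header=False)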

Searching rows of a file in another file and printing appropriate rows in python

I have a csv file like this: (no headers)
aaa,1,2,3,4,5
bbb,2,3,4,5,6
ccc,3,5,7,8,5
ddd,4,6,5,8,9
I want to search another csv file: (no headers)
bbb,1,2,3,4,5,,6,4,7
kkk,2,3,4,5,6,5,4,5,6
ccc,3,4,5,6,8,9,6,9,6
aaa,1,2,3,4,6,6,4,6,4
sss,1,2,3,4,5,3,5,3,5
and print the rows of the second file whose first column matches a row of the first file. So the results will be:
bbb,1,2,3,4,5,,6,4,7
ccc,3,4,5,6,8,9,6,9,6
aaa,1,2,3,4,6,6,4,6,4
I have following code, but it does not print anything:
labels = []
with open("csv1.csv", "r") as f:
    f.readline()
    for line in f:
        labels.append((line.strip("\n")))
with open("csv2.csv", "r") as f:
    f.readline()
    for line in f:
        if (line.split(",")[1]) in labels:
            print (line)
If possible, could you tell me how to do this, please? What is wrong with my code? Thanks in advance!
This is one solution, although you may also look into csv-specific tools and pandas as suggested:
labels = []
with open("csv1.csv", "r") as f:
    lines = f.readlines()
    for line in lines:
        labels.append(line.split(',')[0])
with open("csv2.csv", "r") as f:
    lines = f.readlines()
with open("csv_out.csv", "w") as out:
    for line in lines:
        temp = line.split(',')
        if any(temp[0].startswith(x) for x in labels):
            out.write((',').join(temp))
The program first collects only the labels from csv1.csv - note that you used readline, where the program seems to expect all the lines of the file to be read at once. One way to do that is readlines. The program also has to collect what readlines returns - here it stores the lines in a list named lines. To collect the labels, the program loops through each line, splits it on a , and appends the first element to the list of labels, labels.
In the second part, the program reads all the lines from csv2.csv while also opening the output file, csv_out.csv, for writing. It processes the lines from csv2.csv one by one, writing the matching lines to the output file as it goes.
To do that, the program again splits each line on , and checks whether the label from csv2 is found in the labels list. If it is, that line is written to csv_out.csv.
Try using pandas; it's a very effective way to read csv files into a data structure called a DataFrame.
EDIT
labels = []
with open("csv1.csv", "r") as f:
    f.readline()
    for line in f:
        labels.append(line.split(',')[0])
with open("csv2.csv", "r") as f:
    f.readline()
    for line in f:
        if (line.split(",")[0]) in labels:
            print (line)
I changed it so that labels only contains the first part of each line, i.e. ['aaa', 'bbb', ...].
Then you want to check whether line.split(",")[0] is in labels.
Since you want to match only on the first column, you should split and then take the first item of the split, which is at index 0.
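For the pandas route suggested above, a minimal sketch (file names as in the question; header=None because the files have no header row):
import pandas as pd

df1 = pd.read_csv("csv1.csv", header=None)
df2 = pd.read_csv("csv2.csv", header=None)
# keep the rows of the second file whose first column appears in the first file
df2[df2[0].isin(df1[0])].to_csv("csv_out.csv", header=False, index=False)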

How to filter lines by column in python

I need to filter some lines of a .csv file:
2017/06/07 10:42:35,THREAT,url,192.168.1.100,52.25.xxx.xxx,Rule-VWIRE-03,13423523,,web-browsing,80,tcp,block-url
2017/06/07 10:43:35,THREAT,url,192.168.1.101,52.25.xxx.xxx,Rule-VWIRE-03,13423047,,web-browsing,80,tcp,allow
2017/06/07 10:43:36,THREAT,end,192.168.1.102,52.25.xxx.xxx,Rule-VWIRE-03,13423047,,web-browsing,80,tcp,block-url
2017/06/07 10:44:09,TRAFFIC,end,192.168.1.101,52.25.xxx.xxx,Rule-VWIRE-03,13423111,,web-browsing,80,tcp,allow
2017/06/07 10:44:09,TRAFFIC,end,192.168.1.103,52.25.xxx.xxx,Rule-VWIRE-03,13423111,,web-browsing,80,tcp,block-url
I want to filter lines containing the string "THREAT" in the second column AND lines containing the ips 192.168.1.100 and 192.168.1.101 in the fourth column.
This is my implementation so far:
import csv
file = open(file.log, 'r')
f = open(column, 'w')
lines = file.readlines()
for line in lines:
    input = raw_input()
    col = line.split(',')
    if line.find(col[1])=="THREAT":
        f.write (line)
    if line.find(col[3]==192.168.1.100 && 192.168.101:
        f.write (line)
    else:
        pass
f.close()
file.close()
What is wrong with the code? This is the output I'm expecting to get:
2017/06/07 10:42:35,THREAT,url,192.168.1.100,52.25.xxx.xxx,Rule-VWIRE-03,13423523,,web-browsing,80,tcp,block-url
2017/06/07 10:43:35,THREAT,url,192.168.1.101,52.25.xxx.xxx,Rule-VWIRE-03,13423047,,web-browsing,80,tcp,allow
You use the str.find method, which returns an index if found and -1 otherwise. In your case - if, for example, "THREAT" is in the line - it will return some number, but then you compare that number with a string, which obviously returns False.
Also, you can combine those if statements.
So, taking the above into account, your if statement should be:
if col[1] == "THREAT" or col[3] in ["192.168.1.100", "192.168.1.101"]:
f.write(line)
In addition, I don't understand why you call raw_input on each iteration and never use that value again.
I suggest this slightly optimized code:
import csv  # not used in the provided snippet, could be deleted
file_log = open("file.log", 'r')  # better to use an absolute path
filtered_log = open("column", 'w')  # same as previous
for line in file_log:  # no need to read the entire file, just iterate over it line by line
    col = line.split(',')
    if col and (col[1] == "THREAT" or col[3] in ["192.168.1.100", "192.168.1.101"]):
        filtered_log.write(line)
file_log.close()
filtered_log.close()
Python's csv module provides a reader object which can be used to iterate over the lines of a .csv file.
In each line, you can extract a column by its index and apply some comparison logic before printing the line.
This implementation will filter the file as needed:
import csv
ip_list = ['192.168.1.100', '192.168.1.101']
with open('file.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for line in reader:
        if (line[1] == "THREAT") and (line[3] in ip_list):
            print(','.join(line))
As you can see, this implementation stores the ips in a list and compares against it using Python's in operator.
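If you want the filtered lines in a file rather than printed, the same loop can feed a csv.writer; the output file name below is just a placeholder:
import csv

ip_list = ['192.168.1.100', '192.168.1.101']
with open('file.csv', 'r', newline='') as csvfile, open('filtered.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for line in csv.reader(csvfile):
        if line[1] == "THREAT" and line[3] in ip_list:
            writer.writerow(line)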

python writelines from a list made from .split()

I have a very long string with vertical and horizontal delimiters in this format:
[|Bob Hunter|555-5555|B|Polycity|AK|55555||#|Rob Punter|999-5555|B|Bolycity|AZ|55559|rpunter#email.com|#|....and so on...]
I would like to generate a list from this long string using split('#') and then write each element as a line to a new text file like so:
|Bob Hunter|555-5555|B|Polycity|AK|55555||
|Rob Punter|999-5555|B|Bolycity|AZ|55559|rpunter#email.com|
I will then import it into excel and delimit by the pipes.
f1 = open(r'C:\Documents\MyData.html', 'r')
f2 = open(r'C:\Documents\MyData_formatted.txt', 'w')
lines = f1.read().split("#")
for i in lines:
    f2.writelines(i)
f2.close()
f1.close()
However, the txt file remains one line and only a partial amount of the data is written to the file (only about 25% is there). How can I get python to split the data by the # symbol and write each element of the resulting list to a file as a new line?
This is your corrected code. I changed the lines variable to records, because we're not dealing with lines here, and just to avoid confusion:
records = f1.read()
records = records[1:]   # remove [
records = records[:-1]  # remove ]
records = records.split("#")
for rec in records:
    f2.write(rec + "\n")
And since you mentioned you need this data in Excel: write a csv file instead, then open that output file from Excel, and Excel will split the columns for you without you having to do it manually:
import csv
w = csv.writer(f2, dialect="excel")
for rec in records:
    w.writerow(rec.split("|"))  # one list element per pipe-separated field
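Note that writerow expects a sequence of fields, one per column; handing it rec.split("|") lets the csv module insert the commas and do any quoting itself. Replacing the pipes with commas by hand and passing the whole string as a single field would make the writer quote the entire line as one cell.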
I think that before every # we should also delete the |, because otherwise, after every split record, we get || as the first characters of the line. That's why we should split on |#, not only on #.
Try this:
with open('input.txt', 'r') as f1:
    text = f1.read().lstrip('[').rstrip(']').split("|#")  # remove '[' and ']' from each side, then split records
with open('output.txt', 'w') as f2:
    for line in text:
        f2.write('%s\n' % line)  # write each record as a string with a newline

for loop file read line and filter based on list remove unnecessary empty lines

I am reading a file, taking the first element from the start of each line, and comparing it to my list; if it is found, I append that line to the new output file, which is supposed to be exactly like the input file in terms of structure.
my_id_list = [
    '4985439',
    '5605471',
    '6144703',
]
input file:
4985439 16:0.0719814
5303698 6:0.09407 19:0.132581
5605471 5:0.0486076
5808678 8:0.130536
6144703 5:0.193785 19:0.0492507
6368619 3:0.242678 6:0.041733
my attempt:
import numpy as np

output_file = []
input_file = open('input_file', 'r')
for line in input_file:
    my_line = np.array(line.split())
    id = str(my_line[0])
    if id in my_id_list:
        output_file.append(line)
np.savetxt("output_file", output_file, fmt='%s')
The question is:
It is currently adding an extra empty line after each line written to the output file. How can I fix it? Or is there another, more efficient way to do this?
Update:
The output file for this example should be:
4985439 16:0.0719814
5605471 5:0.0486076
6144703 5:0.193785 19:0.0492507
Try something like this:
# read lines and strip trailing newline characters
with open('input_file', 'r') as f:
    input_lines = [line.strip() for line in f.readlines()]

# collect all the lines that match your id list
output_file = [line for line in input_lines if line.split()[0] in my_id_list]

# write to output file
with open('output_file', 'w') as f:
    f.write('\n'.join(output_file))
I don't know what numpy does to the text when reading it, but this is how you could do it without numpy:
my_id_list = {'4985439', '5605471', '6144703'}  # a set is faster for membership testing; strings, to match the split tokens

with open('input_file') as input_file:
    # Your problem is most likely related to line endings, so here
    # we read the input file into a list of lines with intact line endings.
    # To preserve the input exactly, you would need to open the files
    # in binary mode ('rb' for the input file, and 'wb' for the output
    # file below).
    lines = input_file.read().splitlines(keepends=True)

with open('output_file', 'w') as output_file:
    for line in lines:
        first_word = line.split()[0]
        if first_word in my_id_list:
            output_file.write(line)
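The binary-mode variant hinted at in the comment would look something like this (a sketch, not part of the original answer; note the ids must then be bytes):
# bytes in, bytes out, so the original line endings are preserved exactly
my_id_list = {b'4985439', b'5605471', b'6144703'}

with open('input_file', 'rb') as input_file, open('output_file', 'wb') as output_file:
    for line in input_file:
        if line.split()[0] in my_id_list:
            output_file.write(line)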
Getting the first word of each line that way is wasteful, since this:
first_word = line.split()[0]
creates a list of all "words" in the line when we just need the first one.
If you know that the columns are separated by spaces you can make it more efficient by only splitting on the first space:
first_word = line.split(' ', 1)[0]
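str.partition does the same job without building a list at all, which some find reads more clearly:
first_word = line.partition(' ')[0]  # returns (head, sep, tail); take the head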
