Python Script Not Finding String That Is Clearly Present in a File

I am attempting to analyze a CSV file containing a TCP stack trace. I'm checking whether each line of the file contains a certain string and, if so, adding that line to a dictionary.
The strings I'm looking for are:
[SYN]
[SYN, ACK]
I have checked the file multiple times. Python can find the first string no problem, but cannot find the second. Here's the code that checks:
# variable declaration
synString = '[SYN]'
ackString = '[SYN, ACK]'

# some code

# iterate through the csv:
with open('EECS325Hw3Lab3', newline='') as captureFile:
    captureReader = csv.reader(captureFile, delimiter=' ')
    for row in captureReader:
        # code that doesn't work:
        if synString in row or ackString in row:
            serverDict[currentServer].append(row)
And I know this doesn't work because when I print serverDict, I only see the [SYN] expression. What is happening here?

When you read it with a csv.reader, each row is a list of strings, and '[SYN,' and 'ACK]' will be consecutive elements of the list, so the membership test `ackString in row` never matches a single element. Just use open.
for line in open('EECS325Hw3Lab3'):
    if synString in line or ackString in line:
        serverDict[currentServer].append(line)
Each row will be a string, as you want it to be. You may want to strip the newlines, though.
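The difference is easy to reproduce in isolation. With delimiter=' ', the flag text is split across list elements, so a membership test on the row never sees the full '[SYN, ACK]' string (the sample line below is made up for illustration):

```python
import csv
import io

# Hypothetical capture line containing the flag text.
sample = 'frame1 [SYN, ACK] more fields'

# With delimiter=' ', csv.reader splits the flags across two elements,
# so the whole string '[SYN, ACK]' never appears as one list item.
row = next(csv.reader(io.StringIO(sample), delimiter=' '))
print(row)                      # ['frame1', '[SYN,', 'ACK]', 'more', 'fields']
print('[SYN, ACK]' in row)      # False: no single element equals the string
print('[SYN, ACK]' in sample)   # True: substring test on the raw line works
```

This is why the substring test on the raw line succeeds where the list-membership test fails.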


'\n' not sending file.write() to next line using seek() while looping

My problem is a simple one (too simple...). I am opening a new text file via with and attempting to write each row from a pandas.DataFrame to the file. Specifically, I'm trying to place column entries at very specific character positions on each line, as that is the required format for the people receiving my file.
df represents my pandas.DataFrame in the code below.
with open(os.path.join(a_directory_var, 'folder/myfile.txt'), 'x') as file:
    for index, row in df.iterrows():
        file.seek(1)
        file.write(row['col1'])
        file.seek(56)
        file.write('|')
        file.seek(61)
        file.write(row['col2'])
        file.seek(76)
        file.write('|')
        file.seek(81)
        file.write('col3')
        file.seek(96)
        file.write('|\n')
Expected Output:
I expected my last line to place a pipe, and send file to the next line with '\n', so that the next call to file.write() would begin writing entries to the next line.
Actual Output: Characters from each row being written over themselves on the first line, over and over again. It may be worth noting that the resulting text file does have an empty second line.
In summation, I'm simply trying to write to a line, go to the next, write to that line, go to the next, etc, etc.
It looks like you're trying to write a fixed-width column format, with additional | characters as separators. (The overwriting you see happens because every file.seek() call moves the write position back to the same absolute offset on each iteration, so every row writes over the previous one.) As that is not a simple option in Pandas (such as df.to_csv(fp, sep='|')), you have to iterate over the rows, as you do, and write them one by one. But don't write each part separately: format the lines using Python formatting.
For example, something like this should get close to what you want (give or take a slight offset due to me not counting properly):
sep = "|"
with open(os.path.join(a_directory_var, 'folder/myfile.txt'), 'x') as fp:
    for index, row in df.iterrows():
        fp.write("{:55s}{:1s}{:19s}{:1s}{:19s}{:1s}\n".format(
            row['col1'], sep, row['col2'], sep, row['col3'], sep))
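If the exact column positions matter, the padding idea can be tried without pandas first. This is a minimal sketch with made-up rows and widths, so the format string would still need adjusting to the real spec:

```python
# Illustrative stand-ins for the DataFrame rows; the widths are arbitrary.
rows = [
    {'col1': 'alpha', 'col2': 'beta', 'col3': 'gamma'},
    {'col1': 'one',   'col2': 'two',  'col3': 'three'},
]

lines = []
for row in rows:
    # Pad each field to a fixed width, then place '|' separators between them.
    line = "{:<10s}|{:<10s}|{:<10s}|".format(row['col1'], row['col2'], row['col3'])
    lines.append(line)

print("\n".join(lines))
# alpha     |beta      |gamma     |
# one       |two       |three     |
```

Every line comes out the same length, which is the point of a fixed-width format.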

Remove duplicates in text file line by line

I'm trying to write a Python script that will remove duplicate strings in a text file. However, the de-duplication should only occur within each line.
For example, the text file might contain:
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;12 EFG\EFG;þ
Thus, in the above example, the script should only remove the duplicated strings: the second 10 ABC\ABCD\ABCDE on the first line, and the second 12 EFG\EFG on the other two lines.
I've searched Stack Overflow and elsewhere to try to find a solution, but haven't had much luck. There seem to be many solutions that will remove duplicate lines, but I'm trying to remove duplicates within a line, line-by-line.
Update: Just to clarify - þ is the delimiter for each field, and ; is the delimiter for each item within each field. Within each line, I'm attempting to remove any duplicate strings contained between semicolons.
Update 2: Example edited to reflect that the duplicate value may not always follow directly after the first instance of the value.
@Prune's answer gives the idea, but it needs to be modified like this:
input_file = """þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ"""

for line in input_file.split("\n"):
    seen_item = []
    for item in line.split(";"):
        if item not in seen_item or item == "þ":
            seen_item.append(item)
    print(";".join(seen_item))
import re

with open('file', 'r') as f:
    for line in f:
        print(re.sub(r'([^;]+;)(\1)', r'\1', line))
Read the file by lines; then replace the duplicates using re.sub.
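Note that this regex only collapses repeats that sit directly next to each other, while Update 2 says duplicates may appear anywhere in the line. A first-occurrence filter (the same seen-list idea as above, wrapped in a function, with a shortened sample line for illustration) handles that case:

```python
# Keep only the first occurrence of each ';'-separated item, always
# preserving the 'þ' field markers; duplicates are dropped wherever
# they appear in the line, not just when adjacent.
def dedupe_line(line):
    seen = []
    for item in line.split(';'):
        if item == 'þ' or item not in seen:
            seen.append(item)
    return ';'.join(seen)

print(dedupe_line('þ;ABC;þ;;þ;10 X;12 Y;09 Z;12 Y;þ'))
# þ;ABC;þ;;þ;10 X;12 Y;09 Z;þ
```

The second '12 Y' is removed even though '09 Z' sits between the two occurrences.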

Unexpected output from textfile - cleaning read in lines correctly

I am trying to use a very basic text file as a settings file. Three lines repeat in this order/format that govern some settings/input for my program. Text file is as follows:
Facebook
1#3#5#2
Header1#Header2#Header3#Header4
...
This is read in using the following Python code:
f = open('settings.txt', 'r')
for row in f:
    platform = f.readline()
    rows_to_keep = int(f.readline().split('#'))
    row_headers = f.readline().split('#')
    clean_output(rows_to_keep, row_headers, platform)
I would expect single string to be read in platform, an array of ints in the second and an array of strings in the third. These are then passed to the function and this is repeated numerous times.
However, the following three things are happening:
Int doesn't convert and I get a TypeError
First line in text file is ignored and I get rows to keep in platform
\n at the end of each line
I suspect these are related and so am only posting one question.
You cannot call int on a list; you need to do some kind of list comprehension like
rows_to_keep = [int(a) for a in f.readline().split('#')]
You're reading a line, then reading another line from the file. You should either do some kind of slicing (see Python how to read N number of lines at a time) or call a function with the three lines after every third iteration.
use .strip() to remove end of lines and other whitespace.
Try this:
with open('settings.txt', 'r') as f:
    platform, rows_to_keep, row_headers = f.read().splitlines()

rows_to_keep = [int(x) for x in rows_to_keep.split('#')]
row_headers = row_headers.split('#')
clean_output(rows_to_keep, row_headers, platform)
There are several things going on here. First, when you do the split on the second line, you're trying to cast a list to type int. That won't work. You can, instead, use map.
rows_to_keep = map(int,f.readline().strip().split("#"))
Additionally, you see the strip() method above. That removes trailing whitespace chars from your line, ie: \n.
Try that change and also using strip() on each readline() call.
With as few changes as possible, I've attempted to solve your issues and show you where you went wrong. #Daniel's answer is how I would personally solve the issues.
f = open('settings.txt', 'r')

# See 1. We remove the unnecessary for loop
platform = f.readline().strip()

# See 4. We make sure there are no unwanted leading or trailing characters by stripping them out
rows_to_keep = f.readline().strip().split('#')

# See 3. The enumerate function yields (index, value) pairs
for row in enumerate(rows_to_keep):
    rows_to_keep[row[0]] = int(row[1])

row_headers = f.readline().strip().split('#')

# See 2. We close the file when we're done reading
f.close()

clean_output(rows_to_keep, row_headers, platform)
1. You don't need (and don't want) a for loop on f as well as calls to readline. You should pick one or the other.
2. You need to close f with f.close().
3. You cannot convert a list to an int; you want to convert the elements in the list to int. This can be accomplished with a for loop.
4. You probably want to call .strip() to get rid of trailing newlines.
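Since the question says the three-line pattern repeats throughout the file, the per-line fixes above can be combined into a loop over three-line groups. This is a hedged sketch assuming the platform/ints/headers pattern holds for the whole file:

```python
from itertools import islice

# Read the settings file in repeating three-line groups, assuming every
# group is: platform name, '#'-separated ints, '#'-separated headers.
def read_settings(path):
    with open(path) as f:
        while True:
            group = list(islice(f, 3))        # pull the next three lines
            if len(group) < 3:
                break                         # stop on a partial/empty group
            platform, keep, headers = (s.strip() for s in group)
            yield platform, [int(x) for x in keep.split('#')], headers.split('#')
```

Each group can then be handed to the existing function: for platform, rows_to_keep, row_headers in read_settings('settings.txt'): clean_output(rows_to_keep, row_headers, platform).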

Python: Read in Data from File

I have to read data from a text file from the command line. It is not too difficult to read in each line, but I need a way to separate each part of the line.
The file contains the following in order for several hundred lines:
String (Sometimes more than 1 word)
Integer
String (Sometimes more than 1 word)
Integer
So for example the input could have:
Hello 5 Sample String 10
The current implementation I have for reading in each line is as follows... how can I modify it to separate it into what I want? I have tried splitting the line, but I always end up getting only one character of the first string this way with no integers or any part of the second string.
with open(sys.argv[1], "r") as f:
    for line in f:
        print(line)
The desired output would be:
Hello
5
Sample String
10
and so on for each line in the file. There could be thousands of lines in the file. I just need to separate each part so I can work with them separately.
The program can't magically split lines the way you want. You will need to read in one line at a time and parse it yourself based on the format.
Since there are two integers and an indeterminate number of (what I assume are) space-delimited words, you may be able to use a regular expression to find the integers then use them as delimiters to split up the line.
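For instance, assuming the words themselves never contain digits, the runs of digits can double as delimiters; re.split with a capturing group keeps the integers in the result (the sample line is the one from the question):

```python
import re

line = 'Hello 5 Sample String 10'
# Split on runs of digits while keeping them (the capturing group puts
# them back into the result); \s* swallows the surrounding whitespace.
parts = [p for p in re.split(r'\s*(\d+)\s*', line) if p]
print(parts)  # ['Hello', '5', 'Sample String', '10']
```

The multi-word string survives intact because splitting happens only at the integers, not at every space.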

Exact number of characters in a JSON

I have a file with several JSON objects, one per line, and I need to know the length of each object, meaning the exact number of characters. But when I run the code below, it reports fewer characters than expected.
jsonFile = open(File, 'r')
line = jsonFile.readline()
len(line)
It considers some elements like "\n" as one character, but I want it to say there are two. Do you have any idea, please?
print(os.path.getsize('myfile.json'))
http://devdocs.io/python/library/os.path#os.path.getsize
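The distinction is between characters (what len counts on a string) and bytes on disk (what getsize reports). A small sketch with a hypothetical JSON line shows where the two diverge:

```python
import os
import tempfile

# Hypothetical JSON line; 'é' is one character but two bytes in UTF-8.
line = '{"name": "café"}\n'
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.json',
                                 delete=False) as f:
    f.write(line)
    path = f.name

size = os.path.getsize(path)    # bytes actually stored on disk
os.unlink(path)

print(len(line))                  # 17 characters as Python sees the string
print(len(line.encode('utf-8')))  # 18 bytes once encoded
print(size)                       # 18: file size matches the byte count
```

So if the goal is the on-disk size, getsize (or len of the encoded bytes) is the right measure, not len of the decoded string.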
