How to ignore a line when using readlines in Python?

a=open('D:/1.txt','r')
b=a.readlines()
Now b contains all the lines in 1.txt.
In Python we can put a # mark in front of a line of code so that it is ignored.
Is there a similar marker we could put in the TXT file so that readlines ignores a specific line?

Text files have no specific syntax; they are just a sequence of characters. It is up to your program to decide whether any of those characters have particular meaning for your program.
For example, if you wanted to read all lines but discard those starting with '#', you could filter them out with a list comprehension:
with open('D:/1.txt', 'r') as a:
    lines = [line for line in a if not line.startswith('#')]
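For example, assuming D:/1.txt hypothetically contains a '#' line between two data lines, only the data lines survive the filter:
# Suppose D:/1.txt contains:
#   first line
#   # skip this one
#   third line
with open('D:/1.txt', 'r') as a:
    lines = [line for line in a if not line.startswith('#')]
print(lines)  # ['first line\n', 'third line\n']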

Related

Return the exact lines of a Huge file after pattern matching without using FOR in Python3

I am new to Python. My problem is this:
I want to match a pattern against a large file and return the matching lines (not just the matched strings) from it. I DO NOT want a FOR loop for this as my file is huge. I am using mmap for reading the file.
In the above file, if I search for bhuvi, I should get 2 rows, bhuvi and bhuvi Kumar.
I used re.findall() for this, but it just returns the substrings, not the whole lines.
Can someone please suggest what I can do here?
If your input file is huge, you cannot use readlines, but nothing
prevents you from reading one line at a time in a loop.
As the file object is iterable, you can write the loop as:
for line in fh:
and process the content of each input line inside the loop.
The file size is not important, as you do not attempt to read all lines at once.
To check for the presence of your string (bhuvi) in a line, use
re.search, not re.findall.
Actually, you don't need a list of all matches; it is enough to find
a single match (and it is quicker).
Below is an example program (Python 3.7) that prints the lines containing your
string, along with their line numbers:
import re

cnt = 0
with open('input.txt') as fh:
    for line in fh:
        line = line.rstrip()
        cnt += 1
        if re.search('bhuvi', line):
            print(f'{cnt}: {line}')
Note that I used rstrip() to remove the trailing newline, if any.
Edit after your comment:
You wrote that the file to check is huge, so there is a risk that
if you try to read it into memory whole, the program will run out of memory.
In such a case you would have to read the file chunk by chunk and
perform the search in each chunk separately.
There is also a risk that a row with the text you are looking for will be
read partially into one chunk and the rest into the next,
so you would have to take some measures to handle matches that straddle a chunk boundary.
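A minimal sketch of such chunked reading, assuming the hypothetical file name input.txt; the last partial line of each chunk is carried over to the next read, so a line that straddles a chunk boundary is still searched exactly once:
import re

pattern = re.compile(r'[^\n]*bhuvi[^\n]*')
tail = ''  # partial last line carried over from the previous chunk

with open('input.txt') as fh:
    while True:
        chunk = fh.read(1 << 20)  # about 1 MB per read
        if not chunk:
            break
        # search only up to the last complete line; the remainder
        # is prepended to the next chunk
        head, _, tail = (tail + chunk).rpartition('\n')
        for match in pattern.finditer(head):
            print(match.group())

# the file may not end with a newline, so check the leftover too
if pattern.search(tail):
    print(tail)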
On the other hand, if there is no other way but using mmap,
try something like re.finditer(r'[^\n]*bhuvi[^\n]*', map), i.e. create
an iterator looking for:
A sequence of chars other than \n.
Your string.
Another sequence of chars other than \n.
This way the match object returned by the iterator will match the
whole line, not your string alone.
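A sketch of that mmap variant (renaming map to mm, since map shadows a built-in; input.txt is a placeholder name). Note that mmap exposes the file as bytes, so the pattern must be a bytes pattern:
import mmap
import re

with open('input.txt', 'rb') as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    # each match covers the whole line around the search string
    for match in re.finditer(rb'[^\n]*bhuvi[^\n]*', mm):
        print(match.group().decode())
    mm.close()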

Amend list from file - Correct syntax and file format?

I currently have a list hard-coded into my Python code. As it keeps expanding, I wanted to make it more dynamic by reading the list from a file. I have read many articles about how to do this, but in practice I can't get it working. So firstly, here is an example of the existing hard-coded list:
serverlist = []
serverlist.append(("abc.com", "abc"))
serverlist.append(("def.com", "def"))
serverlist.append(("hji.com", "hji"))
When I enter the command 'print serverlist' the output is shown below and my list works perfectly when I access it:
[('abc.com', 'abc'), ('def.com', 'def'), ('hji.com', 'hji')]
Now I've replaced the above code with the following:
serverlist = []
with open('/server.list', 'r') as f:
    serverlist = [line.rstrip('\n') for line in f]
With the contents of server.list being:
'abc.com', 'abc'
'def.com', 'def'
'hji.com', 'hji'
When I now enter the command print serverlist, the output is shown below:
["'abc.com', 'abc'", "'def.com', 'def'", "'hji.com', 'hji'"]
And the list is not working correctly. So what exactly am I doing wrong? Am I reading the file incorrectly or am I formatting the file incorrectly? Or something else?
The contents of the file are not interpreted as Python code. When you read a line in f, it is a string; and the quotation marks, commas etc. in your file are just those characters as parts of a string.
If you want to create some other data structure from the string, you need to parse it. The program has no way to know that you want to turn the string "'abc.com', 'abc'" into the tuple ('abc.com', 'abc'), unless you instruct it to.
This is the point where the question becomes "too broad".
If you are in control of the file contents, then you can simplify the data format to make this more straightforward. For example, if a line of the file is just abc.com abc, so that your string ends up as 'abc.com abc', you can simply .split() it; this assumes that you don't need to represent whitespace inside either of the two items. If necessary, you could instead split on another character (like the comma, in your case) with .split(','). If you need a general-purpose hammer, you might want to look into JSON. There is also ast.literal_eval, which can be used to treat text as simple Python literal expressions - in this case, you would need the lines of the file to include the enclosing parentheses as well.
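To illustrate the ast.literal_eval option (a sketch, assuming you rewrite each line of server.list as a tuple literal with enclosing parentheses):
import ast

# assumed file format, one tuple literal per line:
# ('abc.com', 'abc')
# ('def.com', 'def')
with open('/server.list') as f:
    serverlist = [ast.literal_eval(line) for line in f]
# serverlist == [('abc.com', 'abc'), ('def.com', 'def')]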
If you are willing to let go of the quotes in your file and rewrite it as
abc.com, abc
def.com, def
hji.com, hji
the code to load it can be reduced to a one-liner, using the fact that files are iterable:
with open('servers.list') as f:
    servers = [tuple(line.rstrip('\n').split(', ')) for line in f]
Note that iterating over a file keeps the trailing newline on each line, hence the rstrip('\n').
You can allow arbitrary whitespace by doing something like
servers = [tuple(word.strip() for word in line.split(',')) for line in f]
It might be easier to use something like regex to parse the original format. You could use an expression that captures the parts of the line you care about and matches but discards the rest:
import re
pattern = re.compile(r"'(.+)',\s*'(.+)'")
You could then extract the names from the matched groups:
with open('servers.list') as f:
    servers = [pattern.fullmatch(line.rstrip('\n')).groups() for line in f]
Note the rstrip('\n'): fullmatch must match the entire string, so the trailing newline has to go.
This is just a trivialized example. You can make it as complicated as you wish for your real file format.
Try this:
serverlist = []
with open('/server.list', 'r') as f:
    for line in f:
        serverlist.append(tuple(line.rstrip('\n').split(',')))
Explanation
You want an explicit for loop so you cycle through each line as expected.
You need list.append for each line to append to your list.
You need to use split(',') in order to split by commas.
Convert to tuple as this is your desired output.
List comprehension method
The for loop can be condensed as below:
with open('/server.list', 'r') as f:
    serverlist = [tuple(line.rstrip('\n').split(',')) for line in f]

Retain only specified content in a string

I have data in the following form in a file:
<string1> abc:string2 <http://yago-knowledge.org/resource/wikicategory_Sports_clubs_established</text\u003e\n______<sha1\u003eqwjfowt5my8t6yuszdb88k2ehskjuh0</sha1\u003e\n____</revision\u003e\n__</page\u003e\n__<page\u003e\n____<title\u003ePortal:Tropical_cyclones/Anniversaries/August_22</title\u003e\n____<ns\u003e100</ns\u003e\n____<id\u003e7957689</id\u003e\n____<revision\u003e\n______<id\u003e446349886</id\u003e\n______<timestamp\u003e2011-08-23T17:38:19Z</timestamp\u003e\n______<contributor\u003e\n________<username\u003eLightbot</username\u003e\n________<id\u003e7178666</id\u003e\n______</contributor\u003e\n______<comment\u003eDelink_non-obscure_units._Conversions._Report_bugs_to_[[User_talk:Lightmouse>.
The delimiter in the above file is a tab (\t), i.e. string1 is separated from abc:string2 by \t. Similarly for the rest of the strings.
Now I want to retain just letters, numbers, /, :, '.' and _ within the strings which are enclosed within <>, and replace all the unwanted characters by an underscore. The desired output is:
<string1> abc:string2 <http://yago-knowledge.org/resource/wikicategory_Sports_clubs_established_text_u003e_n_______sha1_u003eqwjfowt5my8t6yuszdb88k2ehskjuh0_sha1_u003e_n_____revision_u003e_n___/page_u003e_n___page_u003e_n_____title_u003ePortal:Tropical_cyclones/Anniversaries/August_22_/title_u003e_n_____ns_u003e100_/ns_u003e_n_____id_u003e7957689_/id_u003e_n_____revision_u003e_n_______id_u003e446349886_/id_u003e_n_______timestamp_u003e2011-08-23T17:38:19Z_/timestamp_u003e_n_______contributor_u003e_n_________username_u003eLightbot_/username_u003e_n_________id_u003e7178666_/id_u003e_n_______/contributor_u003e_n_______comment_u003eDelink_non-obscure_units._Conversions._Report_bugs_to___User_talk:Lightmouse>.
Is there some way by which I may achieve this using Linux commands or Python?
You can probably achieve this just with UNIX tools and some crazy regular expression, but I would write a small Python script for this:
Open two files (input and output) with open()
Iterate over the input file line by line: for line in input_file:
Split the line at tab: for part in line.split('\t'):
Check if a part is enclosed in <>: if part.startswith('<') and part.endswith('>'):
Filter characters with a regular expression: filtered_part = re.sub(r'[^a-zA-Z0-9/:._]', '_', part)
Join the filtered parts back together: filtered_line = '\t'.join(filtered_parts)
Write the filtered line to the output file: output_file.write(filtered_line + '\n')
Following this outline, it should be easy for you to write a working script.
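Putting the outline together, a minimal sketch might look like this (input.txt and output.txt are placeholder names; the enclosing brackets are kept and only the characters between them are filtered, matching the desired output above):
import re

with open('input.txt') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        filtered_parts = []
        for part in line.rstrip('\n').split('\t'):
            if part.startswith('<') and part.endswith('>'):
                # keep letters, digits, /, :, . and _ between the brackets;
                # everything else becomes an underscore
                inner = re.sub(r'[^a-zA-Z0-9/:._]', '_', part[1:-1])
                part = '<' + inner + '>'
            filtered_parts.append(part)
        output_file.write('\t'.join(filtered_parts) + '\n')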

Find word in file, return definition

I'm making a small dictionary application just to learn Python. I have the function for adding words done (just need to add a check to prevent duplicates) but I'm trying to create the function for looking words up.
This is what my text file looks like after I append words to it:
{word|Definition}
And I can check if the word exists by doing this,
if word in open("words/text.txt").read():
But how do I get the definition? I assume I need to use regex (which is why I split it up and placed it inside curly braces), I just have no idea how.
read() would read the entire file contents. You could do this instead:
for line in open("words/text.txt", 'r').readlines():
    split_lines = line.strip().strip('{}').split('|')  # strip the newline first, then the braces
    if word == split_lines[0]:  # or use `word in line` to look for the word anywhere in the line
        return split_lines[1]
You can use a dictionary if you want efficient lookups.
with open("words/text.txt") as fr:
    dictionary = dict(line.strip()[1:-1].split('|') for line in fr)
print(dictionary.get(word))
Also try to avoid syntax like the below:
if word in open("words/text.txt").read()
Use a context manager (the with syntax) to ensure that the file will be closed.
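A minimal rewrite of that check using a context manager:
with open("words/text.txt") as f:
    found = word in f.read()  # the file is closed as soon as the with block exits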
To get all definitions:
f = open("words/text.txt")
for line in f:
    print(line.strip().strip('{}').split('|')[1])

Search for strings listed in one file from another text file?

I want to find strings listed in list.txt (one string per line) in another text file; if a string is found, print 'string,one_sentence', and if not, print 'string,another_sentence'. I'm using the following code, but it finds only the last string in the list from list.txt. I cannot understand what the reason could be.
data = open('c:/tmp/textfile.TXT').read()
for x in open('c:/tmp/list.txt').readlines():
    if x in data:
        print(x, ',one_sentence')
    else:
        print(x, ',another_sentence')
When you read a file with readlines(), the resulting list elements have trailing newline characters. Likely, these are the reason why you get fewer matches than you expected.
Instead of writing
for x in list:
write
for x in (s.strip() for s in list):
This removes leading and trailing whitespace from the strings in list. Hence, it removes trailing newline characters from the strings.
In order to consolidate your program, you could do something like this:
import sys

with open('c:/tmp/textfile.TXT') as f:
    haystack = f.read()

if not haystack:
    sys.exit("Could not read haystack data :-(")

with open('c:/tmp/list.txt') as f:
    for needle in (line.strip() for line in f):
        if needle in haystack:
            print(needle, ',one_sentence')
        else:
            print(needle, ',another_sentence')
I did not want to make the changes too drastic. The most important difference is that I am using the context manager here, via the with statement. It ensures proper file handling (mainly closing) for you. Also, the 'needle' lines are stripped on the fly using a generator expression. The above approach reads and processes the needle file line by line instead of loading the whole file into memory at once. Of course, this only makes a difference for large files.
readlines() keeps a newline character at the end of each string read from your list file. Call strip() on those strings to remove those (and any other whitespace) characters.
