I am parsing a bunch of HTML and am encountering a lot of "\n" and "\t" inside the code. So I am using
"something\t\n here".replace("\t","").replace("\n","")
This works, but I'm using it often. Is there a way to define a string function, along the lines of replace itself (or find, index, format, etc.) that will pretty my code a little, something like
"something\t\n here".noTabsOrNewlines()
I tried
class str:
    def noTabNewline(self):
        self.replace("\t","").replace("\n","")
but that was no good. Thanks for any help.
While you could do something along these lines (https://stackoverflow.com/a/4698550/1867876), the more Pythonic thing to do would be:
myString = "something\t\n here"
' '.join(myString.split())
You can see this thread for more information:
Strip spaces/tabs/newlines - python
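For the original goal of a reusable cleanup step, a plain helper function is probably the simplest route, since methods cannot be added to the built-in str type directly. The sketch below is only an illustration; the names no_tabs_or_newlines and collapse_whitespace are made up for this example.

```python
def no_tabs_or_newlines(text):
    """Return text with every tab and newline character removed."""
    return text.replace("\t", "").replace("\n", "")


def collapse_whitespace(text):
    """Replace each run of whitespace with a single space and trim the ends."""
    return " ".join(text.split())


print(no_tabs_or_newlines("something\t\n here"))  # something here
print(collapse_whitespace("  a\tb\n\nc  "))       # a b c
```

The two helpers are not interchangeable: the first deletes tabs and newlines without inserting anything, so "a\tb" becomes "ab", while the second turns it into "a b".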
You can try encoding='utf-8'. Otherwise, in my opinion, there is no way other than replacing it. Python also shows non-breaking spaces as '\xa0', so you have to replace those anyway. Or you can read the file line by line with readline() instead of reading it all at once with read().
I have a python program that reads a csv file, makes some changes, then writes to an HTML file. The issue is a block of code where I'm trying to search for a string assigned to one variable, then replace it with another string assigned to another variable. I am able to read a line in the csv file that looks like this:
Link:,www.google.com
And I am successful in writing an html file with the following:
<tr><td>Link:</td><td>www.google.com</td></tr>
Essentially I want to go further with an added step to find www.google.com between the anchor tags and replace it with "GOOGLE".
I've researched 'find and replace' functions built into python, and I came up with the substitution function inside the regular expressions module (re.sub()). This might not be the best way to do it and I'm trying to figure out if there's a better function/module out there I should look into.
for line in file:
    newHTML.write(re.sub(var1,var2,line,flags=re.MULTILINE), end='')
    newHTML.write(re.sub(var3,var4,line,flags=re.MULTILINE), end='')
The error I am receiving is:
newHTML.write(re.sub(var1,var2,line,flags=re.MULTILINE), end='')
TypeError: write() takes no keyword arguments
If I comment out this code, the rest of the program runs fine albeit without finding and replacing these variables.
Perhaps re.sub() doesn't go well with write()?
The error says what the problem is: as @furas commented, write() is not the same as print() and doesn't accept the end='' keyword argument. file.write() doesn't add a newline unless you explicitly include a \n, so it should work if you change the line to:
newHTML.write(re.sub(var1,var2,line,flags=re.MULTILINE))
Also, regex and HTML aren't the best of friends... Your case is simple enough that using regex is fine, but you mentioned looking for a better module to generate HTML. This SO question had some good suggestions in the answers. Notable mentions for creating HTML templates on there were xml.etree, jinja2 (Flask's default engine), and yattag.
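A minimal sketch of how the loop could look, under a few assumptions: the helper name rewrite_lines and the sample row are invented here, and io.StringIO stands in for the real output file. Note that the loop in the question calls write() twice per line, which would emit every line twice; chaining the substitutions and writing once avoids that.

```python
import io
import re


def rewrite_lines(lines, out, replacements):
    """Apply every (pattern, replacement) pair to each line, then write it once."""
    for line in lines:
        for pattern, replacement in replacements:
            line = re.sub(pattern, replacement, line)
        out.write(line)  # write() takes a single string and adds no newline


rows = ["<tr><td>Link:</td><td>www.google.com</td></tr>\n"]
new_html = io.StringIO()  # stand-in for the real output file
# re.escape keeps the dots in the URL from matching arbitrary characters
rewrite_lines(rows, new_html, [(re.escape("www.google.com"), "GOOGLE")])
print(new_html.getvalue(), end="")  # <tr><td>Link:</td><td>GOOGLE</td></tr>
```

For a fixed string like this, line.replace("www.google.com", "GOOGLE") would do the same job without a regex.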
I am new to programming and have already checked other people's questions to make sure that I am using a good method to replace tabs with spaces, that my regex is correct, and that I understand what exactly my error is ("Unhashable type 'list'"). But even so, I'm at a loss as to what to do. Any help would be great!
I have a large file that I have broken up into lines. Ultimately I will need to access the first 3 elements of each line. Currently when I print a line, without the additional re.sub line of code, I get something like this: ['blah\tblah\tblah'], when I want ['blah blah blah'].
My code to do this is
f = open(text.txt)
raw = f.read()
raw = raw.lower()
lines = raw.splitlines()
lines = re.sub(r'\t', lines, '\s')
print lines[0:2] #just to see the first few examples
f.close()
When I print the first few lines without the regex sub bit, it works fine. Then, when I add that line in an attempt to change the lines, I get the error. I understand that lists are changeable and thus can't be hashed... but I'm not trying to work with a hash. I'm just trying to replace \t with \s in a large text file to make the program easier to work with. I don't think there is a problem with how I am changing \t's to \s's, because according to this error, any way I change it will break my code. What do I do?! Any help is super appreciated. :')
You need to change the order of the parameters passed to re.sub. Also note that you can't use the regex \s as the replacement argument. The syntax of re.sub is re.sub(regex, replacement, string).
lines = raw.splitlines()
lines = [re.sub(r'\t', ' ', line) for line in lines]
raw.splitlines() returns a list, which was then assigned to the variable lines. re.sub works on a single string, not on a list, so you need to apply it to each item in the list.
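As a rough, self-contained illustration of the per-line approach: the sample data below is made up, and str.replace is swapped in for re.sub because a literal tab needs no regex.

```python
raw = "Blah\tBlah\tBlah\tExtra\nFoo\tBar\tBaz\tQux"  # made-up sample data

# Lowercase, split into lines, then replace tabs in each line separately
lines = [line.replace("\t", " ") for line in raw.lower().splitlines()]

# The first three whitespace-separated elements of each line
first_three = [line.split()[:3] for line in lines]

print(lines[0])        # blah blah blah extra
print(first_three[1])  # ['foo', 'bar', 'baz']
```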
I am using a simple '.replace()' function on a string to replace some text with nothing as so:
.replace("('ws-stage-stat', ", '')
I have also tried using a regex to do this, like so:
match3a = re.sub("\(\'ws-stage-stat\', ", "", match3a)
This string is extracted from the source code for the following webpage at line 684:
http://www.whoscored.com/Regions/252/Tournaments/26
I have extracted and cleaned up the rest of the code into some usable data, but this one last bit won't co-operate and stubbornly refuses to be replaced. This seems like a very straight forward problem, but it just won't work for me.
Any ideas?
Thanks
The first replacement should work. Make sure that you're assigning the result of the replacement somewhere, for example:
mystring = mystring.replace("('ws-stage-stat', ", '')
I think you aren't escaping the regex correctly.
This is code my "Patterns" app spit out:
re.sub("\\(\\'ws-stage-stat\\', ", "", match3a)
A quick test showed me that it works correctly.
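If hand-escaping the pattern is the worry, re.escape can build it from the literal text. This is only a sketch: the sample string is invented, since the real page source is not shown here.

```python
import re

prefix = "('ws-stage-stat', "
sample = "('ws-stage-stat', 'Goals')"  # made-up input for illustration

# re.escape backslash-escapes the characters that are special in a regex
pattern = re.escape(prefix)
cleaned = re.sub(pattern, "", sample)

print(cleaned)                              # 'Goals')
print(cleaned == sample.replace(prefix, ""))  # True
```

In both cases the result has to be assigned to a name; neither re.sub nor str.replace modifies the original string.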
So suppose I have a text file of the following contents:
Hello what is up. ^M
^M
What are you doing?
I want to remove the ^M and replace it with the line that follows. So my output would look like:
Hello what is up. What are you doing?
How do I do the above in Python? Or if there's any way to do this with unix commands then please let me know.
''.join(somestring.split('\r'))
or
somestring.replace('\r','')
This assumes you have carriage return characters in your string, and not the literal "^M". If it is the literal string "^M", then substitute "^M" for '\r'. Note that the raw string r'\r' would not work here: it is a backslash followed by the letter r, not a carriage return.
If you want the newlines gone as well, then use '\r\n'.
This is very basic string manipulation in python and it is probably worth looking at some basic tutorials http://mihirknows.blogspot.com.au/2008/05/string-manipulation-in-python.html
And as the first commenter said, it's always helpful to give some indication of what you have tried so far and what you don't understand about the problem, rather than asking for a straight answer.
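One detail worth checking when removing carriage returns is the difference between '\r' and the raw string r'\r'. The sketch below assumes the file uses "\r\n" line endings (which editors often display as ^M); the helper name strip_line_breaks and the sample text are made up.

```python
def strip_line_breaks(text):
    """Remove every carriage return and newline character from text."""
    return text.replace("\r", "").replace("\n", "")


sample = "Hello what is up. \r\n\r\nWhat are you doing?"

# '\r' is one character (carriage return); r'\r' is two (backslash, then r)
print(len("\r"), len(r"\r"))      # 1 2
print(strip_line_breaks(sample))  # Hello what is up. What are you doing?
```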
Try:
>>> mystring = mystring.replace("\r", "").replace("\n", "")
(where "mystring" contains your text)
Use replace:
x='Hello what is up. ^M\
^M\
What are you doing?'
print(x.replace('^M','')) # the second parameter is what you want to replace it with
I have a bunch of legacy code for encoding raw emails that contains a lot of print statements such as
print >>f, "Content-Type: text/plain"
This is all well and good for emails, but we're now leveraging the same code for outputting HTTP requests. The problem is that the Python print statement outputs '\n' whilst HTTP requires '\r\n'.
It looks like Python (2.6.4 at least) generates a trailing PRINT_NEWLINE byte code for a print statement which is implemented as
ceval.c:1582: err = PyFile_WriteString("\n", w);
Thus it appears there's no easy way to override the default newline behaviour of print. I have considered the following solutions
1. After writing the output, simply do a .replace('\n', '\r\n'). This will interfere with HTTP messages that use multipart encoding.
2. Create a wrapper around the destination file object and proxy the .write method:
   def write(self, data):
       if data == '\n':
           data = '\r\n'
       return self._file.write(data)
3. Write a regular expression that translates print >>f, text to f.write(text + line_end), where line_end can be '\n' or '\r\n'.
I believe the third option would be the most appropriate. It would be interesting to hear what your Pythonic approach to the problem would be.
You should solve your problem now and forever by defining a new output function. Were print a function, this would have been much easier.
I suggest writing a new output function, mimicking as much of the modern print function signature as possible (because reusing a good interface is good), for example:
def output(*items, end="\n", file=sys.stdout):
    pass
Once you have replaced all prints in question, you no longer have this problem -- you can always change the behavior of your function instead! This is a big reason why print was made a function in Python 3 -- because in Python 2.x, "all" projects invariably go through the stage where all the print statements are no longer flexible, and there is no easy way out.
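One way the body of such a function might be filled in is sketched below. It is an illustration rather than the answerer's implementation: the sep parameter and the file=None default are additions made here, and keyword-only arguments after *items require Python 3.

```python
import io
import sys


def output(*items, sep=" ", end="\n", file=None):
    """Write items to file, separated by sep and terminated by end."""
    if file is None:
        file = sys.stdout  # looked up at call time, so redirection still works
    file.write(sep.join(str(item) for item in items) + end)


request = io.StringIO()  # stand-in for the real destination
output("Content-Type:", "text/plain", end="\r\n", file=request)
print(repr(request.getvalue()))  # 'Content-Type: text/plain\r\n'
```

Changing the default value of end in this one place would then switch every caller to HTTP-style line endings.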
(Not sure how/if this fits with the wrapper you intend to use, but in case...)
In Python 2.6 (and many preceding versions), you can suppress the newline by adding a comma at the end of the print statement, as in:
data = 'some msg\r\n'
print data, # note the comma
The downside of this approach, however, is that the print syntax and behavior changed in Python 3.
In Python 2.x, I think you can do:
print >>f, "some msg\r\n",
to suppress the trailing newline.
In Python 3.x, it's a lot simpler:
print("some msg", end = "\r\n", file = f)
I think I would define a new function writeline in an inherited file/stream class and update the code to use writeline instead of print. The file object itself can hold the line ending style as a member. That should give you some flexibility in behavior and also make the code a little clearer i.e. f.writeline(text) as opposed to f.write(text+line_end).
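A rough sketch of the writeline idea follows. It wraps the stream instead of inheriting from a file class, which is a swapped-in simplification; the class name LineWriter and the "\r\n" default are assumptions made for this example.

```python
import io


class LineWriter:
    """Wrap a writable stream and remember which line ending to use."""

    def __init__(self, stream, line_end="\r\n"):
        self._stream = stream
        self.line_end = line_end

    def write(self, data):
        # Pass data through unchanged
        return self._stream.write(data)

    def writeline(self, text=""):
        # Append the configured line ending to every line
        return self._stream.write(text + self.line_end)


buffer = io.StringIO()  # stand-in for the real file or socket file
f = LineWriter(buffer)
f.writeline("Content-Type: text/plain")
f.writeline()  # blank line that ends the header section
print(repr(buffer.getvalue()))  # 'Content-Type: text/plain\r\n\r\n'
```

The same code path can serve the email output by constructing the writer with line_end="\n".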
I also prefer your third solution, but there is no need to use f.write; any user-written function or callable would do. That way, later changes become easy. If you use an object, you can even hide the target file inside it, removing syntactic noise such as the file argument or the kind of newline.
Too bad print is a statement in Python 2.x; in Python 3.x, print could simply be overridden by something user-defined.
Python has modules to handle both email and HTTP headers in an easy, standards-compliant way. I suggest you use them instead of solving already solved problems again.
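As one hedged illustration of that suggestion, the standard library's email package (Python 3) can serialize headers with "\r\n" line endings when given the SMTP policy. The subject and body below are placeholders, and whether this fits the legacy code in question is an open assumption.

```python
from email.message import EmailMessage
from email.policy import SMTP

# The SMTP policy uses "\r\n" as its line separator when serializing
msg = EmailMessage(policy=SMTP)
msg["Subject"] = "hello"               # placeholder header value
msg.set_content("plain text body\n")   # also sets a text/plain Content-Type

raw = msg.as_bytes()
print(b"Subject: hello\r\n" in raw)
```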