Remove initial space from lines, keeping other spaces in python/pandas - python

I need to remove the initial space from lines, as shown below:
From
(space)09/Mar/21 16,997,520.6
To
09/Mar/21 16,997,520.6
I've tried this: "remove spaces from beginning of lines", but it removed all spaces.

Assuming from your question that you have a file loaded with multiple lines (where 09/Mar/21 16,997,520.6 is one of those lines), have you tried something like:
for line in file:
    # strip() removes leading and trailing whitespace only;
    # the spaces inside the line are kept
    line = line.strip()
    # ... do stuff with the line ...

Use .lstrip(' ')
string = ' 09/Mar/21 16,997,520.6'
print(string.lstrip(' '))
>>> 09/Mar/21 16,997,520.6
The .lstrip() method removes any leading characters that appear in the string passed to it. The argument is treated as a set of characters, not as a prefix.
For example:
print(string.lstrip(' 09'))
>>> /Mar/21 16,997,520.6
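To make the "set of characters" behaviour concrete, here is a small sketch contrasting lstrip with removing an exact prefix. The helper remove_prefix and the sample string "0909/Mar" are made up for illustration and are not part of the question.

```python
s = " 09/Mar/21 16,997,520.6"

# With no argument, lstrip() removes leading whitespace only;
# the spaces inside the string are left alone.
stripped = s.lstrip()

# With an argument, every leading ' ', '0' or '9' is removed,
# in any order and any number of times.
as_set = s.lstrip(" 09")


def remove_prefix(text, prefix):
    """Hypothetical helper: drop `prefix` only if `text` starts with exactly it."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


print(repr(stripped))
print(repr(as_set))
print(repr("0909/Mar".lstrip("09")))          # character-set semantics
print(repr(remove_prefix("0909/Mar", "09")))  # exact-prefix semantics
```

For the original problem, the plain lstrip() call is all that is needed; the prefix helper only matters when the text to remove must match exactly.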

In plain Python:
" 09/Mar/21 16,997,520.6".lstrip()
or, pandas-specific (assign the result back to keep it):
import pandas as pd
df = pd.DataFrame([" 09/Mar/21 16,997,520.6"], columns=['dummy_column'])
df['dummy_column'] = df['dummy_column'].str.lstrip()

Related

prevent new line character from being read in literally to python script

I have a string that I want to pass to a python script, e.g.
$printf "tas\nty\n"
yields
tas
ty
However, when I pipe it (e.g. printf "tas\nty\n" | ./pumpkin.py), where pumpkin.py is:
#!/usr/bin/python
import sys
data = sys.stdin.readlines()
print data
I get the output
['tas\n', 'ty\n']
How do I prevent the newline character from being read by python?
You can strip all whitespace (at the beginning and at the end) using strip:
data = [s.strip() for s in sys.stdin.readlines()]
If you need to strip only the \n at the end, you can do:
data = [s.rstrip('\n') for s in sys.stdin.readlines()]
Or use the splitlines method:
data = sys.stdin.read().splitlines()
http://www.tutorialspoint.com/python/string_splitlines.htm
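The three options differ slightly in what they remove. The sketch below compares them in Python 3; io.StringIO stands in for sys.stdin so that it runs without a pipe, and the sample input (with extra spaces around the first word) is an assumption added to show the difference.

```python
import io

# Stand-in for sys.stdin; the first line has spaces around the word on purpose.
fake_stdin = io.StringIO("  tas \nty\n")

raw = fake_stdin.readlines()                   # newlines are kept
only_newline = [s.rstrip("\n") for s in raw]   # drops just the trailing '\n'
both_ends = [s.strip() for s in raw]           # drops all surrounding whitespace

fake_stdin.seek(0)
split = fake_stdin.read().splitlines()         # splits on line breaks, keeps other spaces

print(raw)
print(only_newline)
print(both_ends)
print(split)
```

rstrip('\n') and splitlines() preserve the spaces inside and around each line, while strip() also removes leading and trailing spaces.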

Casting str in Scientific Notation to Float fails only on Positive numbers in Python

I am reading in a couple of columns from a text file (delimited by 3 spaces for some reason)
The columns are in scientific notation. The first column has a mixture of positive and negative numbers.
When casting to float in this segment:
count = 0
curfile = open(curFile, "r")
for row in curfile.readlines():
    if count > (row_first-1) and count < row_last:
        line = row.split(' ')
        x.append(float(line[0]))
        print line[0]
        y.append(float(line[1]))
        count = count + 1
    else:
        count = count + 1
it fails when switching from a negative row to a positive row for the first column
-3.0000000E+00 5.5100000E+02 2.6600000E+02 0.0000000E+00
-1.0000000E+00 5.5100000E+02 2.6600000E+02 0.0000000E+00
1.0000000E+00 5.5100000E+02 2.6600000E+02 0.0000000E+00
So in this case it will successfully convert -1 (the second row) but not 1 (the third row).
When appending values to y, which are all positive, this problem does not occur.
I thought maybe there was a blank space before the positive numbers where the
"-" sign was but using lstrip() on the string did not help.
I am totally perplexed by this issue and would appreciate any ideas.
Edit: The exception that gets thrown when I run the program:
Traceback (most recent call last):
  File "U:\scripts\flow3d_script\flow_3d.py", line 93, in <module>
    x.append(float(line[0]))
ValueError: could not convert string to float:
This is mostly a guess, but...
Could there be three spaces at the start of the negative lines too? That would cause the split to return a list with an empty string at the start.
To resolve this you should lstrip() the entire line before splitting. And by the way, if you just do split() with no arguments it defaults to "split by any number of spaces", so you don't have to worry about the number of spaces.
Based on my understanding of the problem you are trying to solve, the code below does it in a clearer manner:
x = []
y = []
# Using `with` to ensure that the resources
# are cleaned up after execution
with open('test.txt', 'r') as curfile:
    # Looping through each line of
    # the opened file
    for row in curfile:
        # Check if row/line is empty
        # and only execute the code
        # if it is not
        if row not in ['\n', '\r\n']:
            # The output of row.split() is
            # a list, we store this list in
            # split_row
            split_row = row.split(' ')
            # Here we're appending the first
            # column's value into the list
            # x
            x.append(float(split_row[0]))
            # printing the output for debugging
            # purposes
            print split_row[0]
            # Appending the second column's value
            # into the list y
            y.append(float(split_row[1]))
            # printing the output for
            # debugging purposes
            print split_row[1]
As for why you are facing this issue with your code: I believe it's due to reading an extra blank line at the bottom of the original file. This is handled by if row not in ['\n', '\r\n']: in the code above.
Without having access to your full source code and your dataset it's a bit hard to tell exactly what's going wrong from that traceback. It seems from the samples you've posted that your code should work; however, here is what I think you're trying to achieve:
Example:
#!/usr/bin/env python
from __future__ import print_function
raw = """\
-3.0000000E+00 5.5100000E+02 2.6600000E+02 0.0000000E+00\n
-1.0000000E+00 5.5100000E+02 2.6600000E+02 0.0000000E+00\n
1.0000000E+00 5.5100000E+02 2.6600000E+02 0.0000000E+00\n
"""
lines = filter(None, map(str.strip, raw.split("\n")))
data = [map(float, filter(None, line.split())) for line in lines]
for row in data:
    print(row)
Demo:
$ python foo.py
[-3.0, 551.0, 266.0, 0.0]
[-1.0, 551.0, 266.0, 0.0]
[1.0, 551.0, 266.0, 0.0]
Note that a lot of what I'm doing here is actually cleaning up the data into a form that can be more easily manipulated. I would, however, use the csv module here (even if you specify a delimiter of 3 spaces).
You might try just using split() vs split(' ').
The documentation of split:
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns [].
Now try with your data with leading or trailing spaces:
>>> txt='''\
... -3.0000000E+00 5.5100000E+02 2.6600000E+02 0.0000000E+00
... -1.0000000E+00 5.5100000E+02 2.6600000E+02 0.0000000E+00
... 1.0000000E+00 5.5100000E+02 2.6600000E+02 0.0000000E+00 '''
Now try split():
>>> [line.split() for line in txt.splitlines()]
[['-3.0000000E+00', '5.5100000E+02', '2.6600000E+02', '0.0000000E+00'], ['-1.0000000E+00', '5.5100000E+02', '2.6600000E+02', '0.0000000E+00'], ['1.0000000E+00', '5.5100000E+02', '2.6600000E+02', '0.0000000E+00']]
Note the leading and trailing spaces have been removed.
Now try with split(' '):
>>> [line.split(' ') for line in txt.splitlines()]
[['', '', '', ' -3.0000000E+00', '5.5100000E+02', '2.6600000E+02', '0.0000000E+00', ''], ['-1.0000000E+00', '5.5100000E+02', '2.6600000E+02', '0.0000000E+00', ''], ['', ' 1.0000000E+00', '5.5100000E+02', '2.6600000E+02', '0.0000000E+00', '']]
The leading spaces or trailing spaces are preserved, and this is screwing up a call to float.
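As a rough sketch of how the parsing could look with split() and no separator (Python 3), assuming each data row holds only whitespace-separated numbers; the function name parse_row and the sample rows are illustrative, not taken from the asker's script:

```python
def parse_row(row):
    """Split on runs of whitespace and convert every field to float."""
    return [float(field) for field in row.split()]


rows = [
    "  -3.0000000E+00   5.5100000E+02   2.6600000E+02   0.0000000E+00\n",
    "   1.0000000E+00   5.5100000E+02   2.6600000E+02   0.0000000E+00\n",
    "\n",  # blank line, skipped below
]

# r.strip() is empty for blank lines, so they are filtered out.
parsed = [parse_row(r) for r in rows if r.strip()]
for values in parsed:
    print(values)

# For comparison: splitting on exactly three spaces leaves an empty first
# field when the line starts with three spaces, and float('') raises ValueError.
fields = "   1.0000000E+00   5.5100000E+02".split("   ")
print(fields)
```

The empty string in the last result matches the empty message after "could not convert string to float:" in the traceback.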

How to remove newlines and indents from a string in Python?

In my Python script, I have a SQL statement that goes on forever like so:
query = """
SELECT * FROM
many_many
tables
WHERE
this = that,
a_bunch_of = other_conditions
"""
What's the best way to get this to read like a single line? I tried this:
def formattedQuery(query):
    lines = query.split('\n')
    for line in lines:
        line = line.lstrip()
        line = line.rstrip()
    return ' '.join(lines)
and it did remove newlines but not spaces from the indents. Please help!
You could do this:
query = " ".join(query.split())
but it will not work very well if your SQL queries contain strings with spaces or tabs (for example select * from users where name = 'Jura X'). This is a problem of other solutions which use string.replace or regular expressions. So your approach is not too bad, but your code needs to be fixed.
What is actually wrong with your function: you return the original lines, and the return values of lstrip and rstrip are discarded, because rebinding line inside the loop does not change the list. You could fix it like this:
def formattedQuery(query):
    lines = query.split('\n')
    r = []
    for line in lines:
        line = line.lstrip()
        line = line.rstrip()
        r.append(line)
    return ' '.join(r)
Another way of doing it:
def formattedQuery(q): return " ".join([s.strip() for s in q.splitlines()])
Another one-liner:
>>> import re
>>> re.sub(r'\s+', ' ', query).strip()
'SELECT * FROM many_many tables WHERE this = that, a_bunch_of = other_conditions'
This replaces each run of whitespace characters in the string query with a single ' ', then strips the space left at each end.
In Python 2, string.translate can remove characters (just pass None as the second argument so it doesn't also translate characters):
import string
string.translate(query, None, "\n\t")
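For Python 3, where the module-level string.translate function no longer exists, a rough equivalent uses str.maketrans with a third argument listing the characters to delete. The sketch below also shows the split/join approach for comparison; the indentation of the sample query is assumed.

```python
query = """
    SELECT * FROM
        many_many
        tables
    WHERE
        this = that,
        a_bunch_of = other_conditions
"""

# Collapse every run of whitespace (newlines, tabs, indentation) into one space.
# Caveat noted above: this also rewrites whitespace inside quoted SQL literals.
one_line = " ".join(query.split())

# Delete only newline and tab characters; the spaces used for indentation remain.
no_breaks = query.translate(str.maketrans("", "", "\n\t"))

print(one_line)
print(repr(no_breaks))
```

Because translate only deletes the listed characters, it leaves the indentation spaces in place, so the split/join form is usually closer to what the question asks for.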

delete only lines after match1 up to match2

I have checked and played with various examples and it appears that my problem is a bit more complex than what I have been able to find. What I need to do is search for a particular string and then delete the following line and keep deleting lines until another string is found. So an example would be the following:
a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z
Here, "color [" is match1, and "] #color" is match2. So then what is desired is the following:
a
b
color [
] #color
y
z
This "simple to follow" code example will get you started .. you can tweak it as needed. Note that it processes the file line-by-line, so this will work with any size file.
start_marker = 'startdel'
end_marker = 'enddel'
with open('data.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print line,
            ignoreLines = True
        if end_marker in line:
            ignoreLines = False
        if not ignoreLines:
            print line,
It uses startdel and enddel as "markers" for starting and ending the ignoring of data.
Update:
Modified the code based on a request in the comments; it now includes/prints the lines that contain the "markers".
Given this input data (borrowed from @drewk):
Beginning of the file...
stuff
startdel
delete this line
delete this line also
enddel
stuff as well
the rest of the file...
it yields:
Beginning of the file...
stuff
startdel
enddel
stuff as well
the rest of the file...
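The same line-by-line idea can be written as a Python 3 generator, which makes it easy to reuse on a file object or a list of lines. This is only a sketch: the name drop_between is made up, and it assumes the markers are matched by substring, as in the code above.

```python
def drop_between(lines, start_marker, end_marker):
    """Yield lines, skipping those strictly between a start and an end marker line."""
    skipping = False
    for line in lines:
        if skipping:
            # Inside the block: emit nothing until the end marker shows up.
            if end_marker in line:
                skipping = False
                yield line
        else:
            yield line
            if start_marker in line:
                skipping = True


text = """a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z"""

result = list(drop_between(text.splitlines(), "color [", "] #color"))
print("\n".join(result))
```

Since the generator consumes one line at a time, passing an open file instead of text.splitlines() keeps memory use flat for large inputs.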
You can do this with a single regex by using nongreedy *. E.g., assuming you want to keep both the "look for this line" and the "until this line is found" lines, and discard only the lines in between, you could do:
>>> import re
>>> my_regex = re.compile("(look for this line)"+
...     ".*?"+ # match as few chars as possible
...     "(until this line is found)",
...     re.DOTALL)
>>> new_str = my_regex.sub(r"\1\n\2", old_str)
A few notes:
The re.DOTALL flag tells Python that "." can match newlines -- by default it matches any character except a newline.
The parentheses define numbered match groups, which are then used in the replacement r"\1\n\2" so that the first and last lines are not discarded. The replacement must be a raw string (otherwise "\1" is the control character \x01), and the \n restores the line break that .*? consumed. If you did want to discard either or both of those lines, just drop the \1 and/or the \2. E.g., to keep the first but not the last use my_regex.sub(r"\1", old_str); or to get rid of both use my_regex.sub("", old_str).
For further explanation, see: http://docs.python.org/library/re.html or search for "non-greedy regular expression" in your favorite search engine.
This works:
s="""Beginning of the file...
stuff
look for this line
delete this line
delete this line also
until this line is found
stuff as well
the rest of the file... """
import re
print re.sub(r'(^look for this line$).*?(^until this line is found$)',
r'\1\n\2',s,count=1,flags=re.DOTALL | re.MULTILINE)
prints:
Beginning of the file...
stuff
look for this line
until this line is found
stuff as well
the rest of the file...
You can also use list slices to do this:
mStart='look for this line'
mStop='until this line is found'
li=s.split('\n')
print '\n'.join(li[0:li.index(mStart)+1]+li[li.index(mStop):])
Same output.
I like re for this (being a Perl guy at heart...)

Python Textwrap - forcing 'hard' breaks

I am trying to use textwrap to format an import file that is quite particular in how it is formatted. Basically, it is as follows (line length shortened for simplicity):
abcdef <- Ok line
abcdef
ghijk <- Note leading space to indicate wrapped line
lm
Now, I have got code to work as follows:
wrapper = TextWrapper(width=80, subsequent_indent=' ', break_long_words=True, break_on_hyphens=False)
for l in lines:
    wrapline = wrapper.wrap(l)
This works nearly perfectly, however, the text wrapping code doesn't do a hard break at the 80 character mark, it tries to be smart and break on a space (at approx 20 chars in).
I have got round this by replacing all spaces in the string list with a unique character (#), wrapping them and then removing the character, but surely there must be a cleaner way?
N.B Any possible answers need to work on Python 2.4 - sorry!
A generator-based version might be a better solution for you, since it wouldn't need to load the entire string in memory at once:
def hard_wrap(input, width, indent=' '):
    for line in input:
        indent_width = width - len(indent)
        yield line[:width]
        line = line[width:]
        while line:
            yield '\n' + indent + line[:indent_width]
            line = line[indent_width:]
Use it like this:
from StringIO import StringIO # Makes strings look like files
s = """abcdefg
abcdefghijklmnopqrstuvwxyz"""
for line in hard_wrap(StringIO(s), 12):
    print line,
Which prints:
abcdefg
abcdefghijkl
mnopqrstuvw
xyz
It sounds like you are disabling most of the functionality of TextWrapper, and then trying to add a little of your own. I think you'd be better off writing your own function or class. If I understand you right, you're simply looking for lines longer than 80 chars, and breaking them at the 80-char mark, and indenting the remainder by one space.
For example, this:
s = """\
This line is fine.
This line is very long and should wrap, It'll end up on a few lines.
A short line.
"""
def hard_wrap(s, n, indent):
    wrapped = ""
    n_next = n - len(indent)
    for l in s.split('\n'):
        first, rest = l[:n], l[n:]
        wrapped += first + "\n"
        while rest:
            next, rest = rest[:n_next], rest[n_next:]
            wrapped += indent + next + "\n"
    return wrapped
print hard_wrap(s, 20, " ")
produces:
This line is fine.
This line is very lo
 ng and should wrap,
  It'll end up on a
 few lines.
A short line.
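For readers on Python 3, the same hard-break idea can be sketched per line, returning a list of pieces instead of one string. The name hard_wrap_line and the width check are additions for illustration, and this has not been tested on Python 2.4.

```python
def hard_wrap_line(line, width, indent=" "):
    """Break `line` every `width` columns; continuation pieces start with `indent`."""
    if width <= len(indent):
        raise ValueError("width must be larger than the indent")
    pieces = [line[:width]]
    rest = line[width:]
    step = width - len(indent)  # room left on a continuation line
    while rest:
        pieces.append(indent + rest[:step])
        rest = rest[step:]
    return pieces


wrapped = hard_wrap_line("abcdefghijklmnopqrstuvwxyz", 12)
print("\n".join(wrapped))
```

Every piece, including the indent, is at most width characters long, and no attempt is made to break on spaces.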
