reading an array with missing data and spaces in the first column - python

I have a .txt file I want to read using Python. The file is an array containing data on comets. I copied 3 rows out of the 3000 rows:
P/2011 U1 PANSTARRS 1.54 0.5 14.21 145.294 352.628 6098.07
P/2011 VJ5 Lemmon 4.12 0.5 2.45 139.978 315.127 5904.20 *
149P/Mueller 4 3.67 0.1 5.32 85.280 27.963 6064.72
I am reading the array using the following code:
import numpy as np
list_comet = np.genfromtxt('jfc_master.txt', dtype=None)
I am facing 2 different problems:
First, in row 1 the name of the comet is: P/2011 U1 PANSTARRS. If I type:
list_comet[0][1], the result will be P/2011. How should I tell Python to read the full name of each comet? Note that the longest name is 31 characters, so what is the command to tell Python that column 1 is 31 characters long?
Second, in row 2 the value of the last column is *. When I read the file I receive an error which says:
Line #2941 (got 41 columns instead of 40)
(Note that the above data is not the complete data; the total number of columns in my original data is 38.) I guess I am receiving this error due to the * found in certain rows. How can I fix this problem?

You didn't mention what data structure you're looking for, i.e. what operations you intend to perform on the parsed data. In the simplest case, you could massage the file into a list of 8-tuples - the last element being either '*' or an empty string. That is as simple as
import string
def tokenize(s):
    if s[-1] == '*':
        return string.rsplit(s, None, 7)
    else:
        return string.rsplit(s, None, 6) + ['']
tokens = (tokenize(line.rstrip()) for line in open('so21712204.txt'))
To be fair, this doesn't make tokens a list of 8-tuples but rather a generator (which is more space efficient) of lists, each of which has 8 elements.
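If the file really is laid out in fixed-width columns (the question notes the name field is at most 31 characters), another option is to let np.genfromtxt split on field widths instead of whitespace, since its delimiter argument also accepts a sequence of field widths. This is only a sketch; the widths after the name column are placeholders you would need to adjust to the real file layout:
import numpy as np

# Column widths are illustrative: 31 characters for the name, then the
# numeric columns and a final short field for the optional '*' flag.
widths = [31, 6, 5, 7, 9, 9, 9, 2]
comets = np.genfromtxt('jfc_master.txt', delimiter=widths, dtype=None,
                       autostrip=True)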

Reading input lines for int objects separated with whitespace?

I'm trying to solve a programming problem that involves checking an uploaded profile picture's resolution against a dimension I provide as input and printing a verdict, as described below. This is one such test case that is giving me errors:
180
3
640 480 CROP IT
320 200 UPLOAD ANOTHER
180 180 ACCEPTED
The first line gives the dimension that needs to be matched, the second line gives the number of test cases, and the remaining lines contain resolutions separated by whitespace. For each resolution, the output shown on that line needs to be printed.
I've tried this, since it was the most natural thing I could think of, being very new to Python I/O:
from sys import stdin, stdout
dim = int(input())
n = int(input())
out = ''
for cases in range(0, n):
    in1 = int(stdin.readline().rstrip('\s'))
    in2 = int(stdin.readline().rstrip('\s'))
    out += str(prof_pic(in1, in2, dim))+'\n'
stdout.write(out)
ValueError: invalid literal for int() with base 10: '640 480\n'
prof_pic is the function that I'm not describing here, to keep the post from getting too long, but I've written it so that the width and height params are both compared with dim and an output is returned. The problem is with reading those lines. What is the best way to read such lines with differing separators?
You can try this; it is in Python 3.x:
dimention=int(input())
t=int(input())
for i in range(t):
    a=list(map(int,input().split()))
Instead of:
in2 = int(stdin.readline().rstrip('\s'))
you may try:
in2 = map( int, stdin.readline().split()[:2])
and you get
in2 = [640, 480]
You're calling readline. As the name implies, this reads in a whole line. (If you're not sure what you're getting, you should try printing it out.) So, you get something like this:
640 480 CROP IT
You can't call int on that.
What you want to do is split that line into separate pieces like this:
['640', '480', 'CROP IT']
For example:
line = stdin.readline().rstrip()
in1, in2, rest = line.split(None, 2)
Now you can convert those first two into ints:
in1 = int(in1)
in2 = int(in2)
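Putting that together, a minimal sketch of the corrected reading loop might look like the following; prof_pic here is only a placeholder for the asker's real function:
from sys import stdin, stdout

def prof_pic(width, height, dim):
    # Placeholder only -- the asker's real function compares the resolution
    # against dim and returns one of the verdict strings.
    return '{} {} vs {}'.format(width, height, dim)

dim = int(stdin.readline())
n = int(stdin.readline())
out = ''
for _ in range(n):
    # split() copes with any amount of whitespace; take the first two
    # fields as the resolution and ignore the rest of the line.
    width, height = map(int, stdin.readline().split()[:2])
    out += prof_pic(width, height, dim) + '\n'
stdout.write(out)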

Python MemoryError: cannot allocate array memory

I've got a 250 MB CSV file I need to read with ~7000 rows and ~9000 columns. Each row represents an image, and each column is a pixel (greyscale value 0-255)
I started with a simple np.loadtxt("data/training_nohead.csv",delimiter=",") but this gave me a memory error. I thought this was strange since I'm running 64-bit Python with 8 gigs of memory installed and it died after using only around 512 MB.
I've since tried SEVERAL other tactics, including:
import fileinput and read one line at a time, appending them to an array
np.fromstring after reading in the entire file
np.genfromtxt
Manual parsing of the file (since all data is integers, this was fairly easy to code)
Every method gave me the same result. MemoryError around 512 MB. Wondering if there was something special about 512MB, I created a simple test program which filled up memory until python crashed:
str = " " * 511000000 # Start at 511 MB
while 1:
    str = str + " " * 1000 # Add 1 KB at a time
Doing this didn't crash until around 1 gig. I also, just for fun, tried: str = " " * 2048000000 (fill 2 gigs) - this ran without a hitch. Filled the RAM and never complained. So the issue isn't the total amount of RAM I can allocate, but seems to be how many TIMES I can allocate memory...
I google'd around fruitlessly until I found this post: Python out of memory on large CSV file (numpy)
I copied the code from the answer exactly:
def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)
    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data
Calling iter_loadtxt("data/training_nohead.csv") gave a slightly different error this time:
MemoryError: cannot allocate array memory
Googling this error I only found one, not so helpful, post: Memory error (MemoryError) when creating a boolean NumPy array (Python)
As I'm running Python 2.7, this was not my issue. Any help would be appreciated.
With some help from @J.F. Sebastian I developed the following answer:
train = np.empty([7049,9246])
row = 0
for line in open("data/training_nohead.csv"):
    train[row] = np.fromstring(line, sep=",")
    row += 1
Of course this answer assumed prior knowledge of the number of rows and columns. Should you not have this information beforehand, the number of rows will always take a while to calculate, as you have to read the entire file and count the \n characters. Something like this will suffice:
num_rows = 0
for line in open("data/training_nohead.csv"):
    num_rows += 1
For the number of columns, if every row has the same number of columns then you can just count the first row; otherwise you need to keep track of the maximum:
num_rows = 0
max_cols = 0
for line in open("data/training_nohead.csv"):
    num_rows += 1
    tmp = line.split(",")
    if len(tmp) > max_cols:
        max_cols = len(tmp)
This solution works best for numerical data, as a string containing a comma could really complicate things.
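Putting the pieces together, a rough two-pass sketch (first pass to size the array, second pass to fill it) could look like this; the file name and comma delimiter come from the question, the rest is illustrative:
import numpy as np

filename = "data/training_nohead.csv"

# First pass: size the array without loading everything into memory.
num_rows = 0
num_cols = 0
for line in open(filename):
    num_rows += 1
    num_cols = max(num_cols, len(line.split(",")))

# Second pass: fill a preallocated array one row at a time.
train = np.empty((num_rows, num_cols))
for row, line in enumerate(open(filename)):
    train[row] = np.fromstring(line, sep=",")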
This is an old discussion, but it might still help people today.
I think I know why str = str + " " * 1000 fails faster than str = " " * 2048000000.
When running the first one, I believe the OS needs to allocate the new object str + " " * 1000 in memory, and only after that does it bind the name str to it. Before the name 'str' is rebound to the new object, it cannot get rid of the old one.
This means the OS has to hold roughly two copies of the 'str' object at the same time, so it can only manage about 1 gig instead of 2 gigs.
I believe using the following code will get the same maximum memory out of your OS as a single allocation:
str = " " * 511000000
while(1):
    l = len(str)
    str = " "
    str = " " * (l + 1000)
Feel free to correct me if I am wrong.

Padding Function (Python) string.zfill

I would like to change the Python function below to cover all situations in which my business_code will need padding. Python's string.zfill handles this, padding on the left until a given width is reached, but I have never used it before.
#function for formatting business codes
def formatBusinessCodes(code):
    """ Function that formats business codes. Pass in a business code which will convert to a string with 6 digits """
    busCode=str(code)
    if len(busCode)==1:
        busCode='00000'+busCode
    elif len(busCode)==2:
        busCode='0000'+busCode
    else:
        if len(busCode)==3:
            busCode='000'+busCode
    return busCode
#pad extra zeros
df2['business_code']=df2['business_code'].apply(lambda x: formatBusinessCodes(x))
businessframe['business_code']=businessframe['business_code'].apply(lambda x: formatBusinessCodes(x))
financialframe['business_code']=financialframe['business_code'].apply(lambda x: formatBusinessCodes(x))
The code above handles a business_code of length up to 6, but I'm finding that the business_codes vary in length both below and above 6. I'm validating data state by state, and each state varies in its business_code length (IL - 6, OH - 8). All codes must be padded evenly, so a code for IL that is 10 should produce 000010, etc. I need to handle all cases, using a command-line parsing parameter (argparse) and string.zfill.
You could use str.format:
def formatBusinessCodes(code):
""" Function that formats business codes. Pass in a business code which will convert to a string with 6 digits """
return '{:06d}'.format(code)
In [23]: formatBusinessCodes(1)
Out[23]: '000001'
In [26]: formatBusinessCodes(10)
Out[26]: '000010'
In [27]: formatBusinessCodes(123)
Out[27]: '000123'
The format {:06d} can be understood as follows:
{...} means replace the following with an argument from format (e.g. code).
: begins the format specification.
0 enables zero-padding.
6 is the width of the string. Note that numbers larger than 6 digits will NOT be truncated, however.
d means the argument (e.g. code) should be of integer type.
Note that in Python 2.6 the format string needs an extra 0:
def formatBusinessCodes(code):
""" Function that formats business codes. Pass in a business code which will convert to a string with 6 digits """
return '{0:06d}'.format(code)
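Since the question specifically mentions str.zfill and codes whose target width varies by state, a width-aware variant is another option; this is only a sketch, and the default width of 6 is an assumption:
def formatBusinessCodes(code, width=6):
    """Pad a business code with zeros on the left to the requested width."""
    return str(code).zfill(width)

# formatBusinessCodes(10)    -> '000010'
# formatBusinessCodes(10, 8) -> '00000010'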
parser.add_argument('-b',help='Specify length of the district code')
businessformat=args.d
businessformat=businessformat.strip()
df2['business_code']=df2['business_code'].apply(lambda x: str(x))
def formatBusinessCodes(code):
    bus=code
    bus=bus.zfill(4)
    return bus
formatBusinessCodes(businessformat)
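For completeness, here is a hedged sketch of how the -b argument could drive the padding width; the script name, default width, and example value are made up rather than taken from the asker's actual code:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-b', type=int, default=6,
                    help='Specify length of the business code')
args = parser.parse_args()

def formatBusinessCodes(code, width):
    return str(code).zfill(width)

# e.g. python pad_codes.py -b 8 prints '00000010'
print(formatBusinessCodes(10, args.b))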

How should I use Numpy's vstack method?

Firstly, here is the relevant part of the code:
stokes_list = np.zeros(shape=(numrows,1024)) # 'numrows' defined earlier
for i in range(numrows):
    epoch_name = y['filename'][i] # 'y' is an array from earlier
    os.system('pdv -t {0} > temp.txt '.format(epoch_name)) # 'pdv' is a command from another piece of software - here I copy the output into a temporary file
    stokes_line = np.genfromtxt('temp.txt', usecols=3, dtype=[('stokesI','float')], skip_header=1)
    stokes_list = np.vstack((stokes_line,stokes_line))
So, basically, every time the code loops around, stokes_line pulls one of the columns (4th one) from the file temp.txt, and I want it to add a line to stokes_list each time.
For example, if the first stokes_line is
1.1 2.2 3.3
and the second is
4.4 5.5 6.6
then stokes_list will be
1.1 2.2 3.3
4.4 5.5 6.6
and will keep growing...
It's not working at the moment, because I think that the line:
stokes_list = np.vstack((stokes_line,stokes_line))
is not correct. It's only stacking 2 lists - which makes sense as I only have 2 arguments. I basically would like to know how I keep stacking again and again.
Any help would be very gratefully received!
If it is needed, here is an example of the format of the temp.txt file:
File: t091110_065921.SFTC Src: J1903+0925 Nsub: 1 Nch: 1 Npol: 4 Nbin: 1024 RMS: 0.00118753
0 0 0 0.00148099 -0.00143755 0.000931365 -0.00296775
0 0 1 0.000647476 -0.000896698 0.000171287 0.00218597
0 0 2 0.000704697 -0.00052846 -0.000603842 -0.000868739
0 0 3 0.000773361 -0.00234724 -0.0004112 0.00358033
0 0 4 0.00101559 -0.000691062 0.000196023 -0.000163109
0 0 5 -0.000220367 -0.000944024 0.000181002 -0.00268215
0 0 6 0.000311783 0.00191545 -0.00143816 -0.00213856
vstacking again and again is not good, because it copies the whole arrays.
Create a normal Python list, .append to it and then pass it whole to np.vstack to create a new array once.
stokes_list = []
for i in xrange(numrows):
    ...
    stokes_line = ...
    stokes_list.append(stokes_line)
big_stokes = np.vstack(stokes_list)
You already know the final size of the stokes_list array since you know numrows. So it seems you don't need to grow an array (which is very inefficient). You can simply assign the correct row at each iteration.
Simply replace your last line by:
stokes_list[i] = stokes_line
By the way, about your non-working line, I think you meant:
stokes_list = np.vstack((stokes_list, stokes_line))
where you're replacing stokes_list by its new value.
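A minimal sketch of the preallocated version of the loop, combining both suggestions, might look like this (numrows, y, and the pdv call are assumed to be defined as in the question, and the dtype is changed to plain float so each row can be assigned directly):
import os
import numpy as np

stokes_list = np.zeros(shape=(numrows, 1024))   # numrows and y as in the question
for i in range(numrows):
    epoch_name = y['filename'][i]
    os.system('pdv -t {0} > temp.txt '.format(epoch_name))
    # Plain float dtype so the row can be assigned straight into the 2-D array.
    stokes_line = np.genfromtxt('temp.txt', usecols=3, dtype=float, skip_header=1)
    stokes_list[i] = stokes_line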

Unable to have a command line parameter in Python

I run
import sys
print "x \tx^3\tx^3+x^3\t(x+1)^3\tcube+cube=cube+1"
for i in range(sys.argv[2]): # mistake here
    cube=i*i*i
    cube2=cube+cube
    cube3=(i+1)*(i+1)*(i+1)
    truth=(cube2==cube3)
    print i, "\t", cube, "\t", cube + cube, "\t", cube3, "\t", truth
I get
Traceback (most recent call last):
File "cube.py", line 5, in <module>
for i in range(sys.argv[2]):
IndexError: list index out of range
How can I use a command-line parameter as follows in the code?
Example of the use
python cube.py 100
It should give
x x^3 x^3+x^3 (x+1)^3 cube+cube=cube+1
0 0 0 1 False
1 1 2 8 False
2 8 16 27 False
--- cut ---
97 912673 1825346 941192 False
98 941192 1882384 970299 False
99 970299 1940598 1000000 False
Use:
sys.argv[1]
also note that arguments are always strings, and range expects an integer.
So the correct code would be:
for i in range(int(sys.argv[1])):
You want int(sys.argv[1]), not [2].
Ideally you would check the length of sys.argv first and print a useful error message if the user doesn't provide the proper arguments.
Edit: See http://www.faqs.org/docs/diveintopython/kgp_commandline.html
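For example, a corrected cube.py with a basic argument check might look like this; the usage message is just a suggestion:
import sys

if len(sys.argv) < 2:
    print "usage: python cube.py <count>"
    sys.exit(1)

print "x \tx^3\tx^3+x^3\t(x+1)^3\tcube+cube=cube+1"
for i in range(int(sys.argv[1])):
    cube = i * i * i
    cube3 = (i + 1) * (i + 1) * (i + 1)
    print i, "\t", cube, "\t", cube + cube, "\t", cube3, "\t", (cube + cube == cube3)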
Here are some tips on how you can often solve this type of problem yourself:
Read what the error message is telling you: "list index out of range".
What list? Two choices: (1) the list returned by range, (2) sys.argv.
In this case, it can't be (1); it's impossible to get that error out of
for i in range(some_integer) ... but you may not know that, so in general, if there are multiple choices within a line for the source of an error, and you can't see which is the cause, split the line into two or more statements:
num_things = sys.argv[2]
for i in range(num_things):
and run the code again.
By now we know that sys.argv is the list. What index? Must be 2. How come that's out of range? Knowledge-based answer: Because Python counts list indexes from 0. Experiment-based answer: Insert this line before the failing line:
print list(enumerate(sys.argv))
So you need to change the [2] to [1]. Then you will get another error, because in range(n) the n must be an integer, not a string ... and you can work through this new problem in a similar fashion -- extra tip: look up range() in the docs.
I'd like to suggest having a look at Python's argparse module, which is a giant improvement in parsing commandline parameters - it can also do the conversion to int for you including type-checking and error-reporting / generation of help messages.
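As a rough illustration of that suggestion (the argument name and description are made up, not from the question):
import argparse

parser = argparse.ArgumentParser(description='Tabulate cubes up to a limit.')
parser.add_argument('limit', type=int, help='upper bound for x (exclusive)')
args = parser.parse_args()   # e.g. python cube.py 100

for i in range(args.limit):
    print i, "\t", i * i * i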
It's sys.argv[1] instead of [2]. You also want to make sure that you convert it to an integer if you're doing math with it.
so instead of
for i in range(sys.argv[2]):
you want
for i in range(int(sys.argv[1])):
