I have a large CSV that is the output of a Python script I wrote. Each row contains a list of entries that were strings or ints when I wrote them. Please note that the files from my script are sometimes created on Linux and sometimes on Windows (which might be the problem, hence the mention; I'm new at multi-platform Python, so please forgive me).
Now I'm trying to read the .csv in, but some of the ints come in as long objects, according to type(whatiwant). I've tried everything I and my Google-fu can think of to convert these objects to int (int(), str(), and replace() for " ", "L", "\r", and "\n"). Nevertheless, when I test the list via a for loop and type(), the output says some values are still long objects.
What am I missing here? I tried looking for background info on long objects but couldn't find anything useful, hence the post.
I'm new at all this, so again, please forgive my ignorance.
When it rains, it pours. Sorry for screwing up the edit rofl:
I'm reading in the values like this (they are written in rows, as a list containing values that are ints and strings):
Input = [["header"|"subheader"], [15662466|2831811638],
[5662466|27044023]...]
data = []
people_list = []
for entry in Input:
    data.append(entry)
for row in data:
    holder = row.split("|")
    person = str(holder[1])
    people_list.append(person.replace("\r", "").replace("\n", "").replace("L", ""))
people_list.pop(0)  # drop the header row
for person in people_list:
    strperson = str(person)
    intperson = int(strperson)
    print intperson
    print type(intperson)
output:
2831811683
<type 'long'>
27044023
<type 'int'>
They are being treated as longs. Python 2 has two integer types: ints, which have a maximum and minimum value, and longs, which are unbounded. It's not really a problem if the numerical data is a long instead of an int.
A long is a datatype that is just longer than an int.
More formally, long is the datatype used when an int would otherwise overflow, so any value greater than sys.maxint is automatically stored as a long instead of an int.
Docs: https://docs.python.org/2/library/stdtypes.html#typesnumeric
Note that in Python 3 there is no distinction between the two, as Python 3 unifies them into a single int type.
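As a small illustration of the point above, here is a minimal sketch, assuming Python 3, where int is unbounded and the long type no longer exists. The sample string is a stand-in for one field read from the CSV, including a Windows line ending; it is not taken from the real file.

```python
# Stand-in for one raw CSV field (assumption: Windows line ending attached).
raw = "2831811638\r\n"

# int() ignores surrounding whitespace, so strip() is only for clarity.
value = int(raw.strip())

# The value is larger than a signed 32-bit integer can hold, which is why
# Python 2 on a 32-bit build reports it as a long.
exceeds_32_bit = value > 2**31 - 1

print(value, type(value), exceeds_32_bit)
```

Either way the number itself is intact: arithmetic and comparisons behave the same whether Python 2 labels it int or long.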
I'm trying to get some data stored in a .bson file into a jupyter notebook.
Per this answer and this answer, the accepted answer is basically to use the bson module from PyMongo, and then the following code
import bson  # the bson module that ships with PyMongo

FILE = "file.bson"
with open(FILE, 'rb') as f:
    data = bson.decode_all(f.read())
Now, data is a list of length 1.
data[0] is a dictionary.
The first key in this dictionary is a
data[0]["a"] is a dictionary with keys tag and data, and
data[0]["a"]["data"] is exactly what it should be, a list of integers that I can work with in Python.
On the other hand, the second key in this dictionary is b
but now data[0]["b"] is a dictionary with keys tag, type, size, and data
and
data[0]["b"]["data"] is type bytes, and I'm not sure how to work with it.
I have never worked with bson before, so any input is appreciated. However, some of my questions are
Does anyone have a good ref on how to work with bson in python?
Does anyone know why a gets read in a readable way (not bytes), but b gets read in with more keys and is not readable (bytes as opposed to integers)?
I was really hoping decode_all would take care of everything; does anyone know why it doesn't / what I should do differently? I've tried applying decode_all again to the part that is still in bytes, but I get the error message InvalidBSON: invalid message size
Does anyone have a solution for my goal, of getting the information from data[0]["b"]["data"] in a usable format (i.e. a list of integers)?
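One possible direction, offered only as a sketch: if the bytes turn out to be a packed array of fixed-width integers, the standard struct module can unpack them. The layout used here (little-endian, 4-byte signed integers) is purely an assumption for illustration; the tag, type, and size fields of the real document would have to confirm the actual width and byte order. The payload below is synthetic, not data from the file.

```python
import struct

# Synthetic stand-in for data[0]["b"]["data"]: four little-endian int32 values.
payload = struct.pack("<4i", 1, 2, 3, 4)

width = 4  # assumed bytes per integer; check against the "type"/"size" fields
count = len(payload) // width

# "<" = little-endian, "i" = 4-byte signed integer, repeated `count` times.
values = list(struct.unpack("<%di" % count, payload))

print(values)
```

If the real width or signedness differs, only the format character changes (for example "h" for 2-byte or "q" for 8-byte signed integers).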
I am trying to figure out a way to write 21*(10^21) in Python. I've tried two approaches. The first one is simply by doing the following:
>>> print(21*(10**21))
21000000000000000000000
This works just fine. However, the problem I'm trying to solve requires me to get to this number by iteration, i.e, by building up from 1*(10^1) all the way to 21*(10^21). So I tried the following:
>>> temp = 20*(10**20)
>>> print(21*temp*10/20)
2.1e+22
Now, I want the entire number to show up, and not in the 'e' form, so I converted it to int. But this prints the wrong answer:
>>> print(int(21*temp*10/20))
20999999999999997902848
I know that integers don't have a limit in Python 3 (which is what I'm using) so this baffles me. I thought this may be because the /20 part causes conversion to float, but the number 21*(10^21) falls within the limits of float, so converting back to int shouldn't be a problem.
I've tried searching for this error online with no luck. Any help would be appreciated.
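A short sketch of what seems to be going on, assuming Python 3: the / operator always produces a float, and a float carries only 53 bits of precision, so a number can be within the range of float and still not be exactly representable. Using floor division (//) keeps the whole computation in exact integer arithmetic. The variable names via_float and via_int are just illustrative.

```python
temp = 20 * (10**20)

# True division converts to float, which rounds to the nearest representable value.
via_float = int(21 * temp * 10 / 20)

# Floor division stays in arbitrary-precision int, so nothing is rounded.
via_int = 21 * temp * 10 // 20

print(via_float)
print(via_int)
print(via_int == 21 * (10**21))
```

This works here because 21 * temp * 10 is an exact multiple of 20; when the division is not exact, // rounds down rather than to the nearest value.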
I have read in a csv file ('Air.csv') and have performed some operations to get rid of the header (not important). Then, I used dB_a.append(row[1]) to put this column of the csv data into an array which I could later plot.
This data is dB data, and I want to convert it to power using the simple equation P = 10^(dB/10) for every value. I am new to Python, so I don't quite understand how operations within arrays, lists, etc. work. I think there is something I need to do to iterate over the full data set, which was my attempt at a for loop, but I am still receiving errors. Any suggestions?
Thank you!
import csv
import itertools

frequency_a = []
dB_a = []
a = csv.reader(open('Air.csv'))
for row in itertools.islice(a, 18, 219):
    frequency_a.append(row[0])
    dB_a.append(row[1])
#print(frequency_a)
print(dB_a)
for item in dB_a:
    power_a = 10**(dB_a/10)
    print(power_a)
In your for loop, item is the loop variable that holds the current element, so that is what you need to use. So instead of:
power_a = 10**(dB_a/10)
use:
power_a = 10**(item/10)
A nicer way to create a new list with that data could be:
power_a = [10**(db/10) for db in dB_a]
EDIT: The other issue, as pointed out in the comment, is that the values are strings. The .csv file is essentially a text file, so it yields a collection of strings rather than numbers. What you can do is convert them to numeric values using int(db) or float(db), depending on whether you have whole or floating point numbers.
EDIT2: As pointed out by @J. Meijers, I was using multiplication instead of exponentiation - this has been fixed in the answer.
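Putting both fixes together, a minimal sketch might look like this. The sample values are hypothetical stand-ins for what csv.reader returns (every field arrives as a string), not data from Air.csv.

```python
# Hypothetical dB readings as strings, the way csv.reader yields them.
dB_a = ["-3.0", "0", "10", "20"]

# Convert each string to float first, then apply P = 10^(dB/10).
power_a = [10 ** (float(db) / 10) for db in dB_a]

print(power_a)
```

float() is used rather than int() here because dB readings often have a fractional part, and int("-3.0") would raise a ValueError.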
To build on the answer @Ed Jaras posted.
power_a = [10*(db/10) for db in dB_a]
is not correct, since this divides by 10, and then multiplies by the same.
It should be:
power_a = [10**(db/10) for db in dB_a]
Credits still go to @Ed Jaras though
Note:
If you're wondering what this [something for something in a list] is, it is a list comprehension. They are amazingly elegant constructs that Python allows.
What it basically means is [..Add this element to the result.. for ..my element.. in ..a list..].
You can even add conditionals to them if you want.
If you want to read more about them, I suggest checking out:
http://www.secnetix.de/olli/Python/list_comprehensions.hawk
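As a quick illustration of the conditional form mentioned in the note, here is a minimal sketch with made-up values; the trailing if keeps only the elements that pass the test.

```python
values = [-5.0, 0.0, 3.0, 10.0]  # made-up sample data

# Keep only the non-negative values.
non_negative = [v for v in values if v >= 0]

# Transform and filter in one pass: convert the kept values with 10**(v/10).
converted = [10 ** (v / 10) for v in values if v >= 0]

print(non_negative)
print(converted)
```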
Addition:
@k-schneider: You are probably doing numerical operations (division, powers, etc.) on a string, because when a csv is imported its fields can come in as strings.
To make sure that you are working with numbers, you can cast db to a number by doing:
float(db)  # or int(db) if the values are whole numbers
I want the following records (currently displaying as 3.200000e+18 but actually (hopefully) each a different long integer), created using pd.read_excel(), to be interpreted differently:
ipdb> self.after['class_parent_ref']
class_id
3200000000000515954 3.200000e+18
3200000000000515951 NaN
3200000000000515952 NaN
3200000000000515953 NaN
3200000000000515955 3.200000e+18
3200000000000515956 3.200000e+18
Name: class_parent_ref, dtype: float64
Currently, they seem to 'come out' as scientifically notated strings:
ipdb> self.after['class_parent_ref'].iloc[0]
3.2000000000005161e+18
Worse, though, it's not clear to me that the number has been read correctly from my .xlsx file:
ipdb> self.after['class_parent_ref'].iloc[0] -3.2e+18
516096.0
The number in Excel (the data source) is 3200000000000515952.
This is not about the display, which I know I can change here. It's about keeping the underlying data in the same form it was in when read (so that if/when I write it back to Excel, it'll look the same and so that if I use the data, it'll look like it did in Excel and not Xe+Y). I would definitely accept a string if I could count on it being a string representation of the correct number.
You may notice that the number I want to see is in fact (incidentally) one of the labels. Pandas correctly read those in as strings (perhaps because Excel treated them as strings?) unlike this number which I entered. (Actually though, even when I enter ="3200000000000515952" into the cell in question before redoing the read, I get the same result described above.)
How can I get 3200000000000515952 out of the dataframe? I'm wondering if pandas has a limitation with long integers, but the only thing I've found on it is 1) a little dated, and 2) doesn't look like the same thing I'm facing.
Thank you!
Convert the NaN values in your column into 0, then typecast that column as integer:
df[['class_parent_ref']] = df[['class_parent_ref']].fillna(value = 0)
df['class_parent_ref'] = df['class_parent_ref'].astype(int)
Or in reading your file, specify keep_default_na = False for pd.read_excel() and na_filter = False for pd.read_csv()
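Separately from the NaN handling, it may help to see why the value in the question comes back changed. This is a plain-Python sketch (no pandas or Excel involved) showing that a 64-bit float cannot hold 3200000000000515952 exactly, while a string round trip can. The variable names are illustrative.

```python
exact = 3200000000000515952

# Near 3.2e18 adjacent float64 values are 512 apart, so the integer is
# rounded to the nearest representable value when stored as a float.
round_tripped = int(float(exact))
error = round_tripped - exact

# A string keeps every digit, so converting back gives the original number.
as_text = str(exact)
text_is_exact = int(as_text) == exact

print(round_tripped, error, text_is_exact)
```

The rounded value ends in 516096, which matches the 516096.0 offset seen in the question. In pandas, asking pd.read_excel for text via its dtype or converters arguments for that column is one way to avoid the float step, though whether all the digits survive also depends on how Excel itself stored the cell.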
Forgive me for the poor title, I really can't come up with a proper title.
Here is my problem. Say I was given a list of strings:
['2010.01.01',
'1b',
'`abc',
'12:20:33.000']
And I want to do a 'type check' so that given the first string it returns type date, for the second one boolean, for the third one a symbol, for the fourth one a time, etc. The returned value can be a string or anything, since all I want to do is cast to the correct ctypes.
Is there any way to do it?
ps: my python is 2.5
>>> str = ['2010.01.01',
... '1b',
... '`abc',
... '12:20:33.000']
>>> [type(x) for x in str]
[<type 'str'>, <type 'str'>, <type 'str'>, <type 'str'>]
Suppose that you use eval to determine the types in this list.
If you are completely certain you can trust the content -- that it's not, say, coming from a user who could sneak code into the list somehow -- you could map the list onto eval, which will catch native types like numbers. However, there is no simple way to know what those strings should all mean -- for example, if you try to eval '2010.01.01', Python will think you're trying to parse a number and then fail because of the extra decimal point.
So you could try a two stage strategy: first cast the list to strings vs numbers using eval:
def try_cast(input_string):
    try:
        val = eval(input_string)  # only safe on trusted input
        val_type = type(val)
        return val, val_type
    except Exception:
        # anything eval cannot parse stays a plain string
        return input_string, type('')

cast_list = map(try_cast, original_list)
That would give a list of tuples where the first item is the converted value and the second is its type. For more specialized things like dates, you'd need to use the same strategy on the strings left over after the first pass, using a try/except block to attempt converting them with time.strptime(). You'll need to figure out what time formats to expect and write a format string for each one (you can check the Python docs or something like http://www.tutorialspoint.com/python/time_strptime.htm). You'd have to try all the options and see which ones convert correctly: if one works, the value is a date; if not, it's just a string.
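The second stage described above could be sketched roughly as follows, written for Python 3. Several things here are assumptions rather than anything from the question: the function name guess_type, the labels it returns, the two format strings, and the rules that "0b"/"1b" mean boolean and a leading backtick means symbol are all inferred from the four example strings. It also uses datetime.strptime in place of time.strptime, since its handling of fractional seconds (%f) is well documented.

```python
from datetime import datetime

# Hypothetical format table, inferred from the example strings only.
FORMATS = [
    ("date", "%Y.%m.%d"),
    ("time", "%H:%M:%S.%f"),
]

def guess_type(text):
    """Return a label for the first rule that matches, else 'string'."""
    if text in ("0b", "1b"):        # assumed boolean notation
        return "boolean"
    if text.startswith("`"):        # assumed symbol notation
        return "symbol"
    for label, fmt in FORMATS:
        try:
            datetime.strptime(text, fmt)
            return label
        except ValueError:
            continue                # this format did not fit, try the next
    return "string"

samples = ["2010.01.01", "1b", "`abc", "12:20:33.000"]
labels = [guess_type(s) for s in samples]
print(labels)
```

Adding a new kind of value then means adding one entry to FORMATS or one more check at the top, without touching the rest.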