I am running into a weird inconsistency. I had to learn the difference between immutable and mutable data types. For my purposes, I need to convert my pandas DataFrame to a NumPy array, apply my operations, and convert it back, as I do not wish to alter my input.
So I am converting as follows:
mix = pd.DataFrame(array, columns=columns)

def mix_to_pmix(mix, p_tank):
    previous = 0
    columns, mix_in = np.array(mix)  # <---
    mix_in *= p_tank
    previous = 0
    for count, val in enumerate(mix_in):
        mix_in[count] = val + previous
        previous += val
    return pd.DataFrame(mix_in, columns=columns)
This works perfectly fine, but the line:
columns, mix_in = np.array(mix)
does not seem to behave consistently, because in this case:
def to_molfrac(mix):
    columns, mix_in = np.array(mix)
    shape = mix_in.shape
    for i in range(shape[0]):
        mix_in[i, :] *= 1 / max(mix_in[i, :])
    for k in range(shape[1] - 1, 0, -1):
        mix_in[:, k] += -mix_in[:, k - 1]
    mix_in = mix_in / mix_in.sum(axis=1)[:, np.newaxis]
    return pd.DataFrame(mix_in, columns=columns)
I receive the error:
ValueError: too many values to unpack (expected 2)
The input of the latter function is the output of the former, so it should be the same kind of object.
It's impossible to be sure without an example of the input to to_molfrac and mix_to_pmix. Note, though, that unpacking columns, mix_in = np.array(mix) can only succeed when the array has exactly two rows, so it fails with "too many values to unpack" as soon as the DataFrame has more rows.
In any case, pandas objects have a .values attribute which gives access to the underlying NumPy array, so it is probably better to use mix_in = mix.values instead:
columns, values = df.columns, df.values
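For illustration, a minimal sketch of mix_to_pmix rewritten this way, assuming the loop in the question is meant to compute a running sum down the rows:
def mix_to_pmix(mix, p_tank):
    columns = mix.columns
    mix_in = mix.values * p_tank     # new array, so the input DataFrame stays untouched
    mix_in = mix_in.cumsum(axis=0)   # running sum down the rows, like the original loop
    return pd.DataFrame(mix_in, columns=columns)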
My code:
import pandas as pd
datos=pd.read_csv('/Users/rafaelsuarez/Documents/Data/UCELL.csv', sep=',' , encoding='latin-1')
df=pd.DataFrame(datos)
df['RNC']=df['RNC'].map('RNC_{}'.format)
h=hex(df['LAC'])
The data is:
MCC,MNC,LAC,CELLID,CELLNAME,RNC,NODEBNAME,AZIMUTH_ANTENNA,LON,LAT
730,09,119,20011,AIS_3G_001_1_B1,PTM01,AIS_3G_001,20,-72.6906,-45.4044
730,09,119,20014,AIS_3G_001_1_B2,PTM01,AIS_3G_001,20,-72.6906,-45.4044
I need to convert 'LAC' to hexadecimal.
The error arises because the function hex() operates on integers, whereas you are applying it to a pandas Series object.
Compare the behaviour of math.sin(df['LAC']) which throws an error similar to yours as math.sin() operates on single numbers (in this case floating point numbers), and np.sin(df['LAC']), which operates on a numpy array (or similar object like a pandas Series).
One way to achieve what you want is to apply hex() to each element of the Series using a list comprehension:
h = [hex(x) for x in df['LAC']]
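As a hedged alternative, you could also let pandas do the element-wise application for you, which returns a Series of hex strings rather than a list:
h = df['LAC'].apply(hex)  # e.g. 119 -> '0x77'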
I have a large DataFrame object (1,440,000,000 rows) and I am operating at the memory (swap included) limit.
I need to extract a subset of the rows with a certain value in a field. However, if I do it like this:
>>> SUBSET = DATA[DATA.field == value]
I end up with either a MemoryError exception or a crash.
Is there any way to filter rows explicitly - without calculating the intermediate mask (DATA.field == value)?
I have found DataFrame.filter() and DataFrame.select() methods, but they operate on column labels/row indices rather than on the row data.
Use query; it should be a bit faster:
df = df.query("field == @value")
(The @ prefix tells query to use the Python variable value instead of looking for a column of that name.)
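A minimal usage sketch, with hypothetical column and variable names:
import pandas as pd

df = pd.DataFrame({"field": [1, 2, 2, 3]})
value = 2
subset = df.query("field == @value")  # keeps only the rows where field equals 2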
If by any chance all the data in the DataFrame are of the same type, use a numpy array instead; it's more memory efficient and faster. You can convert your dataframe to a numpy array with df.as_matrix() (deprecated in newer pandas, where df.to_numpy() or df.values does the same thing).
Also, you might want to check how much memory the DataFrame already takes:
import sys
sys.getsizeof(df)
which returns the size in bytes.
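As a hedged alternative, pandas can also report memory usage per column, which tends to be more informative than sys.getsizeof for object columns:
df.memory_usage(deep=True).sum()  # total bytes, including the contents of object columns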
I have a function that takes a row of the dataframe (pd.Series) and returns one list. The idea is to apply it to the dataframe and generate a new pd.Series of lists, one per row:
sale_candidats = closings.apply(get_candidates_3, axis=1,
    sales=sales_ts,
    settings=settings,
    reduce=True)
However, it seems that pandas tries to map the list it returns (for the first row, probably) onto the original row, and raises an error (even despite reduce=True):
ValueError: Shape of passed values is (10, 8), indices imply (10, 23)
When I convert the function to return a set instead of a list, the whole thing starts working - except that it returns a data frame with the same shape and index/column names as the original data frame, where every cell is filled with the corresponding row's set().
Looks a lot like a bug to me... how can I return one pd.Series instead?
It seems that this behaviour is, indeed, a bug in the latest version of pandas. Take a look at the issue:
https://github.com/pandas-dev/pandas/pull/18577
You could just apply the function in a for loop, because that's all that apply does. You wouldn't notice a large speed penalty.
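For example, a minimal sketch of that workaround, assuming get_candidates_3, closings, sales_ts and settings as in the question:
sale_candidats = pd.Series(
    [get_candidates_3(row, sales=sales_ts, settings=settings) for _, row in closings.iterrows()],
    index=closings.index,
)  # one list per row, collected into a Series explicitly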
My current code is
from numpy import *

def buildRealDataObject(x):
    loc = array(x[0])
    trueClass = x[1]
    evid = ones(len(loc))
    evid[isnan(loc)] = 0
    loc[isnan(loc)] = 0
    return DataObject(location=loc, trueClass=trueClass, evidence=evid)

if trueClasses is None:
    trueClasses = zeros(len(dataset), dtype=int8).tolist()
realObjects = list(map(lambda x: buildRealDataObject(x), zip(dataset, trueClasses)))
and it is working. What I expect is to create one realObject for each row of the DataFrame dataset, combined with the corresponding entry of trueClasses. I am not really sure, though, why it is working, because if I run list(zip(dataset, trueClasses)) I just get something like [(0, 0.0), (1, 0.0)]. The two columns of dataset are called 0 and 1. So my first question is: why is this working, and what is happening here?
However, I think this might still be wrong on some level, because it might only work due to a "clever implicit transformation" on the side of pandas. Also, for the line evid[isnan(loc)] = 0 I now get the error
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
How should I rewrite this code instead?
Currently the zip works on columns instead of rows. Use one of the methods from Pandas convert dataframe to array of tuples to make the zip work on rows instead of columns. For example, substitute
zip(dataset, trueClasses)
with
zip(dataset.values, trueClasses)
Considering this post, if you already have l = list(data_train.values) for some reason, then zip(l, eClass) is faster than zip(dataset.values, trueClasses). However, if you don't, then the conversion takes too much time to make it worth it in my tests.
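For illustration, a minimal sketch (with a made-up two-column DataFrame) of why this matters: iterating over a DataFrame yields its column labels, while iterating over .values yields the rows:
import numpy as np
import pandas as pd

dataset = pd.DataFrame({0: [1.0, np.nan], 1: [3.0, 4.0]})
trueClasses = [0, 1]

print(list(zip(dataset, trueClasses)))         # [(0, 0), (1, 1)] - column labels, not rows
print(list(zip(dataset.values, trueClasses)))  # pairs each row array with its class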
I am reading in census data using the matplotlib csv2rec function - works fine and gives me a nice ndarray.
But there are several columns where all the values are None, with dtype |O4. This is causing problems when I load the data into ATpy: "TypeError: object of NoneType has no len()". Something like '9999' or another missing-value marker would work for me. A mask is not going to work in this case because I am passing the real array to ATpy and it will not convert a masked array. The put function in numpy, which is otherwise the best way to change values (I think), will not work with None values. I think some sort of boolean array is the way to go, but I can't get it to work.
So what is a good/fast way to change None values and/or an uninitialized numpy array to something like '9999' or another recode? No masking.
Thanks,
Matthew
Here is a solution to this problem, although if your data is a record array you should only apply this operation to your column, rather than the whole array:
import numpy as np
# initialise some data with None in it
a = np.array([1, 2, 3, None])
a = np.where(a == np.array(None), 9999, a)
Note that you need to cast None into a numpy array (np.array(None)) for the element-wise comparison to work.
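And a minimal sketch of the record-array case mentioned above, using a hypothetical column name 'origin':
# hypothetical structured array with an object column containing None
rec = np.array([(1, None), (2, 'US')], dtype=[('id', int), ('origin', object)])
rec['origin'] = np.where(rec['origin'] == np.array(None), '9999', rec['origin'])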
You can use a masked array when you do the calculation, and when you pass the array to ATpy, you can call the filled(9999) method of the masked array to convert it to a normal array with the invalid values replaced by 9999.
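A minimal sketch of that approach, assuming NaN marks the missing entries in a plain numeric array:
import numpy as np

a = np.ma.masked_invalid(np.array([1.0, 2.0, np.nan]))  # mask the invalid entries
filled = a.filled(9999)                                  # array([1., 2., 9999.])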