How can a specific cell be accessed in a vaex dataframe? (Python)

Vaex is a library similar to pandas that provides a dataframe class.
I'm looking for a way to access a specific cell by row and column, for example:
import vaex
df = vaex.from_dict({'a': [1,2,3], 'b': [4,5,6]})
df.a[0] # this works in pandas but not in vaex

In this specific case you could do df.a.values[0], but if this were a virtual column, it would lead to the whole column being evaluated. What would be faster (say, with more than 1 billion rows and a virtual column) is:
df['r'] = df.a + df.b
df.evaluate('r', i1=2, i2=3)[0]
This will evaluate the virtual column/expression r from row 2 up to (but not including) row 3 (an array of length 1), and take its first element.
This is rather clunky, and there is an issue open on this: https://github.com/vaexio/vaex/issues/238
Maybe you are surprised that vaex does not have something as 'basic' as this, but vaex is often used for really large datasets, where you don't access individual rows that often, so we don't run into this a lot.
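Putting the pieces together, a minimal end-to-end sketch of the evaluate approach (assuming a reasonably recent vaex; i1/i2 are the begin-inclusive, end-exclusive row bounds):
import vaex

df = vaex.from_dict({'a': [1, 2, 3], 'b': [4, 5, 6]})
df['r'] = df.a + df.b  # virtual column, not materialized in memory
# Evaluate only rows [2, 3), a length-1 array, and take its first element
value = df.evaluate('r', i1=2, i2=3)[0]  # 3 + 6 = 9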

Maarten Breddels is the creator of Vaex, so I would take his word. But it's possible he wrote that answer before Vaex added slicing, which makes this much less "clunky" than described.
import vaex
df = vaex.example()
df.x[:1].values # Access row 0
df.x[1:3].values # Access rows 1 and 2
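Presumably, to get a single scalar back you would then index the length-1 array the slice produces, e.g.:
df.x[1:2].values[0]  # scalar value of column x at row 1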


Filter nan values out of rows in pandas

I am working on a calculator to determine what to feed your fish as a fun project to learn python, pandas, and numpy.
My data is organized with my fishes as rows and the different foods as columns (shown as a screenshot in the original post).
What I am hoping to do is have the user (me) input a food, and have the program output all the values that are not NaN.
The reason I would prefer to leave them as NaN rather than 0 is that I use different numbers in different spots to indicate preference: 1 is natural diet, 2 is OK but not ideal, 3 is live only.
Is there any way to do this using pandas? Everywhere I look online helps me filter rows out of columns, but it is quite difficult to find info on filtering columns out of rows.
Currently, my code looks like this:
import pandas as pd
import numpy as np
df = pd.read_excel(r'C:\Users\Daniel\OneDrive\Documents\AquariumAiMVP.xlsx')
clownfish = df[0:1]
angelfish = df[1:2]
damselfish = df[2:3]
So, as you can see, I haven't really gotten anywhere yet. I tried filtering out the nulls using the following idea:
clownfish_wild_diet = pd.isnull(df.clownfish)
But it results in an error, saying:
AttributeError: 'DataFrame' object has no attribute 'clownfish'
Thanks for the help guys. I'm a total pandas noob so it is much appreciated.
You can use masks in pandas:
food = 'Amphipods'
mask = df[food].notnull()
result_set = df[mask]
df[food].notnull() returns a mask (a Series of boolean values indicating if the condition is met for each row), and you can use that mask to filter the real DF using df[mask].
Usually you can combine these two lines into more Pythonic code, but that's up to you:
result_set = df[df[food].notnull()]
This returns a new DF with the subset of rows that meet the condition (including all columns from the original DF), so you can apply further operations to this new DF (e.g. selecting a subset of columns, dropping other missing values, etc.).
See more about .notnull(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html
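For a self-contained sketch (the fish and food names here are invented to mirror the question's layout):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Amphipods': [1, np.nan, 2],
                   'Bloodworms': [np.nan, 3, 1]},
                  index=['clownfish', 'angelfish', 'damselfish'])

food = 'Amphipods'
result_set = df[df[food].notnull()]  # keeps only fish with a value for that food
print(result_set)  # clownfish and damselfish rows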

How can I add an incomplete row in a pandas DataFrame?

I'm trying to copy a row from one DataFrame to another. The issue is that the origin does not have as many columns as the destination, leading to a situation like:
import pandas as pd
origin = pd.DataFrame([[1, 2],
                       [3, 4]], columns=['A', 'B'])
destination = pd.DataFrame(columns=['A', 'B', 'C'])
copy = origin[0:1].to_dict()
destination.loc[0] = copy
I'm getting a 'ValueError: cannot set a row with mismatched columns'
I tested with two identical DFs, and it worked fine. What would be the best way to do what I'm trying? I was thinking of dynamically adding NaNs for the additional destination columns, but it doesn't seem very Pythonic.
Please note that I'm trying to avoid any append(), as I will perform the task frequently, and I read in the pandas docs that it would probably cause performance issues.
Thanks for your help!
Insert a Series:
destination.loc[0] = pd.DataFrame(copy).iloc[0]
destination
Out[672]:
     A    B    C
0  1.0  2.0  NaN
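An alternative sketch that skips the dict round-trip: reindex the source row against the destination's columns, so missing columns come back as NaN automatically:
import pandas as pd

origin = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
destination = pd.DataFrame(columns=['A', 'B', 'C'])
# Align row 0 of origin to the destination's columns; 'C' becomes NaN
destination.loc[0] = origin.iloc[0].reindex(destination.columns)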

Multiplying matrices of the same shape

I have two dataframes with the exact same index titles and the exact same column titles, but different values within those tables. The number of rows and columns is also exactly the same. Let's call them df1 and df2.
df1 = {'A':['a1','a2','a3','a4'],'B':['b1','b2','b3','b4'],'C':['c1','c2','c3','c4']}
df2 = {'A':['d1','d2','d3','d4'],'B':['e1','e2','e3','e4'],'C':['f1','f2','f3','f4']}
I want to perform several operations on these matrices, e.g.:
Multiplication - create the following matrix:
result = {'A':['a1*d1','a2*d2','a3*d3','a4*d4'],'B':['b1*e1','b2*e2','b3*e3','b4*e4'],'C':['c1*f1','c2*f2','c3*f3','c4*f4']}
as well as Addition, Subtraction, and Division using the exact same logic.
Please note that the question is more about the generic code which can be replicated since the matrix which I am using has hundreds of rows and columns.
This is pretty trivial to achieve using the pandas library. The data type of the columns is unclear from the OP's question, but if they are numeric, the code below will run.
Try:
import pandas as pd
pd.DataFrame(df1) * pd.DataFrame(df2)
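The same element-wise logic covers the other operations the question mentions; a short sketch with numeric toy data:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

product  = df1 * df2  # element-wise multiplication, or df1.mul(df2)
total    = df1 + df2  # addition, or df1.add(df2)
diff     = df1 - df2  # subtraction, or df1.sub(df2)
quotient = df1 / df2  # division, or df1.div(df2)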
If you don't want to import pandas just for this operation, you can do it with a dict comprehension:
df1_2 = {key: [x * y for x, y in zip(df1[key], df2[key])] for key in df1.keys()}
NOTE: This works only if the values are numeric. If not, concatenate the strings instead (e.g. x + '*' + y), and swap in whichever operation you need.

How to add values to a new column in pandas dataframe?

I want to create a new named column in a pandas dataframe, insert the first value into it, and then add more values to the same column.
Something like:
import pandas
df = pandas.DataFrame()
df['New column'].append('a')
df['New column'].append('b')
df['New column'].append('c')
etc.
How do I do that?
If I understand correctly, you want to append a value to an existing column in a pandas data frame. The thing is, with DFs you need to maintain a matrix-like shape, so the number of rows is equal for each column. What you can do is add a column with a default value and then update this value with:
for index, row in df.iterrows():
    df.at[index, 'new_column'] = new_value
Don't do it, though, because it's slow:
updating an empty frame a single row at a time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so updating a row at a time will always be very slow. Much better to create new structures and concat.
It is better to build a list of data and create the DataFrame via the constructor:
vals = ['a','b','c']
df = pandas.DataFrame({'New column':vals})
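If more values arrive later, the same advice applies: build a new structure and concat (new_vals here is a hypothetical later batch):
import pandas as pd

df = pd.DataFrame({'New column': ['a', 'b', 'c']})
new_vals = ['d', 'e']  # hypothetical follow-up data
df = pd.concat([df, pd.DataFrame({'New column': new_vals})],
               ignore_index=True)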
If you need to fill the newly created column with random values instead, you could also use:
import numpy as np
df['new_column'] = np.random.randint(1, 9, len(df))

MultiIndexing rows vs. columns in pandas DataFrame

I am working with a MultiIndex dataframe in pandas and am wondering whether I should multiindex the rows or the columns.
My data looks something like this:
Code:
import numpy as np
import pandas as pd
colidxs = pd.MultiIndex.from_product(
    [['condition1', 'condition2'],
     ['patient1', 'patient2'],
     ['measure1', 'measure2', 'measure3']],
    names=['condition', 'patient', 'measure'])
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)
Here I chose to multiindex the columns, with the rationale that a pandas dataframe consists of series, and my data ultimately is a bunch of time series (hence row-indexed by time here).
I have this question because there seems to be some asymmetry between rows and columns for multiindexing. For example, the documentation shows how query works for a row-multiindexed dataframe, but if the dataframe is column-multiindexed, that command has to be replaced by something like df.T.query('color == "red"').T.
My question might seem a bit silly, but I'd like to see if there is any difference in convenience between multiindexing rows vs. columns for dataframes (such as the query case above).
Thanks.
A rough personal summary of what I call the row/column-propensity of some common operations for DataFrame:
[]: column-first
get: column-only
attribute accessing as indexing: column-only
query: row-only
loc, iloc, ix: row-first
xs: row-first
sortlevel: row-first
groupby: row-first
"row-first" means the operation expects row index as the first argument, and to operate on column index one needs to use [:, ] or specify axis=1;
"row-only" means the operation only works for row index and one has to do something like transposing the dataframe to operate on the column index.
Based on this, it seems multiindexing rows is slightly more convenient.
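To make the asymmetry concrete, a small self-contained sketch (xs needs axis=1 to reach the column index, while query needs a double transpose):
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([['condition1', 'condition2'],
                                   ['patient1', 'patient2']],
                                  names=['condition', 'patient'])
df = pd.DataFrame(np.random.randn(3, 4), columns=cols)

# xs is "row-first": axis=1 is needed to operate on the column index
p1 = df.xs('patient1', level='patient', axis=1)

# query is "row-only": a column-level query requires transposing first
c1 = df.T.query('condition == "condition1"').T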
A natural question of mine: why don't the pandas developers unify the row/column propensity of DataFrame operations? For example, [] and loc/iloc/ix are the two most common ways of indexing dataframes, but it seems a bit odd that one slices columns while the others slice rows.
