Currently, when using the lookup method on a DataFrame one obtains a zip-like selection, e.g.
df.lookup(["one","two"],["a","b"])
will select two values: one with row label "one" and column label "a", and another with "two" and "b".
Now, when using the method, a warning appears saying that it will not be available in future versions and that one should use loc instead.
I really don't know how to obtain the same "zip-like" behavior with loc. Can anyone explain/help?
Consider the following alternatives (equivalent in result) to the deprecated df.lookup:
[df.loc[p] for p in zip(["one","two"], ["a","b"])]
list(map(df.at.__getitem__, zip(["one","two"], ["a","b"])))
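For example, on a toy frame with the labels from the question (the values are made up for illustration), both alternatives pick the zipped pairs ("one", "a") and ("two", "b"):
import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["one", "two"])
print([df.loc[p] for p in zip(["one", "two"], ["a", "b"])])           # [1, 4]
print(list(map(df.at.__getitem__, zip(["one", "two"], ["a", "b"]))))  # [1, 4]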
The pandas isin method (on DataFrame and Series) requires a set or a "list-like" container as an argument.
I need to test elements of a Series for membership in a set of words, a small minority of which are not entirely defined: a star replaces one or more letters at a given position in the word. An example set would be: {'abcd', 'dbca', 'd1*'}
I figured that list-like meant that the container implements the __contains__ method, so I created a custom container containing a set for entirely defined words (that's to benefit from the speed of look-ups) and a list for the rest of the words with stars that are tested separately. That logic is handled by a custom __contains__ method.
But my custom_set is rejected because it is not deemed "list-like".
Any idea what pandas considers list-like?
Or an alternative approach that also relies on built-in pandas functions (my understanding is that DataFrame methods are more performant than applying custom functions)?
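For what it's worth, pandas ships a helper, pandas.api.types.is_list_like, that shows how a container is classified; in the sketch below (class names and wildcard handling are simplified placeholders of mine), an object that only implements __contains__ is not considered list-like, but adding __iter__ is enough to pass that check. Be aware, though, that isin will then iterate over the values, so a custom __contains__ would most likely be bypassed anyway.
import pandas as pd

class CustomSet:
    # exact words in a set for fast lookups; wildcard patterns kept aside
    def __init__(self, exact, patterns):
        self.exact, self.patterns = set(exact), list(patterns)
    def __contains__(self, word):
        # wildcard matching elided; only the exact-set path is sketched here
        return word in self.exact

print(pd.api.types.is_list_like(CustomSet({'abcd', 'dbca'}, ['d1*'])))  # False

class IterableCustomSet(CustomSet):
    def __iter__(self):  # being iterable is what makes the object "list-like"
        return iter(self.exact | set(self.patterns))

print(pd.api.types.is_list_like(IterableCustomSet({'abcd', 'dbca'}, ['d1*'])))  # True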
In PySpark one can use column objects and strings to select columns. Both ways return the same result. Is there any difference? When should I use column objects instead of strings?
For example, I can use a column object:
import pyspark.sql.functions as F
df.select(F.lower(F.col('col_name')))
# or
df.select(F.lower(df['col_name']))
# or
df.select(F.lower(df.col_name))
Or I can use a string instead and get the same result:
df.select(F.lower('col_name'))
What are the advantages of using column objects instead of strings in PySpark?
Read the PySpark style guide from Palantir, which explains when to use F.col() and when not to, along with other best practices.
In many situations the first style (referring to columns directly through the dataframe, e.g. df.colA or df1['colA']) can be simpler, shorter and visually less polluted. However, we have found that it faces a number of limitations that lead us to prefer the second style (F.col('colA')):
If the dataframe variable name is large, expressions involving it quickly become unwieldy;
If the column name has a space or other unsupported character, the bracket operator must be used instead. This generates inconsistency, and df1['colA'] is just as difficult to write as F.col('colA');
Column expressions involving the dataframe aren't reusable and can't be used for defining abstract functions;
Renaming a dataframe variable can be error-prone, as all column references must be updated in tandem.
Additionally, the dot syntax encourages use of short and non-descriptive variable names for the dataframes, which we have found to be harmful for maintainability. Remember that dataframes are containers for data, and descriptive names are a helpful way to quickly set expectations about what's contained within.
By contrast, F.col('colA') will always reference a column designated colA in the dataframe being operated on, named df, in this case. It does not require keeping track of other dataframes' states at all, so the code becomes more local and less susceptible to "spooky interaction at a distance," which is often challenging to debug.
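To make the reusability point concrete, here is a minimal sketch (the dataframes and column names are invented for illustration): a helper built from F.col never mentions a specific dataframe variable, so it can be applied to any dataframe that has the named column.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def normalized(col_name):
    # abstract column expression: works on whatever dataframe it is selected from
    return F.lower(F.trim(F.col(col_name)))

customers = spark.createDataFrame([("  Alice ",), ("BOB",)], ["name"])
orders = spark.createDataFrame([(" Widget ",)], ["product"])

customers.select(normalized("name")).show()
orders.select(normalized("product")).show()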
It depends on how the functions are implemented in Scala.
In Scala, a function's signature includes its parameter types, so func(foo: String) and func(bar: Int) are two different functions, and Scala can tell which one you are calling based on the type of the argument you pass.
F.col('col_name'), df['col_name'] and df.col_name are the same type of object: a Column. Using one syntax or another makes almost no difference. One small difference is that you could write, for example:
df_2.select(F.lower(df.col_name)) # Where the column is from another dataframe
# Spoiler alert : It may raise an error !!
When you call df.select(F.lower('col_name')), if there is no lower(smth: String) signature defined in Scala, then you will get an error. Some functions are defined with a string as input, others accept only Column objects. Try it to know whether it works and then use it; otherwise, you can make a pull request on the Spark project to add the new signature.
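If you would rather not depend on a string overload existing on the Scala side, one option is to normalize to a Column yourself; a small sketch (the as_column helper is mine, not part of PySpark):
from pyspark.sql import Column, functions as F

def as_column(c):
    # accept either an existing Column object or a column name string
    return c if isinstance(c, Column) else F.col(c)

# Both of these end up handing a Column to F.lower, which is always supported:
# df.select(F.lower(as_column('col_name')))
# df.select(F.lower(as_column(df['col_name'])))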
I created 2 named selections
df.select(df.x >= 2, name='bigger')
df.select(df.x < 2, name='smaller')
and it's cool, I can use the selection parameter that so many (i.e. statistical) functions offer, for example
df.count('*',selection='bigger')
but is there also a way to use the named selection in a filter? Something like
df['bigger']
Well, that syntax, df['bigger'], accesses a column (or expression) in vaex that is called 'bigger'.
However, you can do df.filter('bigger'), and it will give you a filtered dataframe.
Note that, while similar in some ways, filters and selections are a bit different, and each has its own place when using vaex.
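Putting the question and the answer together, here is a small sketch against the built-in vaex example dataset (df.filter('bigger') is taken on faith from the answer above):
import vaex

df = vaex.example()                       # demo dataset that includes an x column
df.select(df.x >= 2, name='bigger')       # the named selection from the question
print(df.count('*', selection='bigger'))  # aggregation restricted to the selection
df_big = df.filter('bigger')              # filtered dataframe, per the answer above
print(len(df_big))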
I have found that I have a problem understanding when I should access data from a dataframe (df) using df[data] versus df.data.
I mostly use the [] method to create new columns, but I can also access data using both df[] and df.data. What's the difference, and how can I better grasp those two ways of selecting data? When should one be used over the other?
If I understand the Docs correctly, they are pretty much equivalent, except in these cases:
You can use [the .] access only if the index element is a valid python identifier, e.g. s.1 is not allowed.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
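A quick illustration of those cases (the labels are chosen purely to trigger them):
import pandas as pd

s = pd.Series({"min": 1, "1": 2, "index": 3})

print(s["min"])    # 1, the element
print(s.min)       # the Series.min method, not the element
print(s["1"])      # 2; s.1 would be a SyntaxError
print(s["index"])  # 3; s.index is the index object, not the element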
However, while
indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases [...]
in production you should really use the optimized pandas data access methods such as .loc, .iloc, and .ix, because
[...] since the type of the data to be accessed isn't known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods.
Using [] you can pass a variable holding the column name:
a = "hello"
df[a]  # gives you the column named "hello"
Using . the name is fixed when you write the code:
df.a  # gives you the column literally named "a"
The difference is that with the first one you can use a variable.
In [60]: print(row.index)
Int64Index([15], dtype='int64')
I already know that the row number is 15 in this example, but I don't need to know its type or anything else; the variable storing it would do something based on what that number is.
There was something else about pandas I was wondering about. Suppose that this only returns one row:
row = df[df['username'] == "unique name"]
Is it any less proper to use methods like loc, iloc, etc.? They still work and everything, but I was curious whether it would still be done this way in larger projects. Are there preferred methods if it's just one row, as opposed to a list of rows?
If you write
row.index[0]
then you will get, in your case, the integer 15.
Note that if you check
dir(pd.Int64Index)
you can see that it includes the list-like method __getitem__.
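Putting it together (usernames and index values are made up to mirror the example):
import pandas as pd

df = pd.DataFrame({"username": ["unique name", "someone else"]}, index=[15, 16])

row = df[df["username"] == "unique name"]  # a one-row DataFrame
print(row.index)     # Int64Index([15], dtype='int64') (a plain Index in newer pandas)
print(row.index[0])  # 15, the bare integer label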