I've found that I have trouble understanding when I should access data in a dataframe (df) using df['data'] versus df.data.
I mostly use the [] method to create new columns, but I can access data with both df[] and df.data. What's the difference, and how can I better grasp these two ways of selecting data? When should one be used over the other?
If I understand the Docs correctly, they are pretty much equivalent, except in these cases:
You can use [the .] access only if the index element is a valid Python identifier, e.g. s.1 is not allowed.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
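A minimal sketch of those cases (the data is made up):
import pandas as pd

s = pd.Series([10, 20, 30], index=['1', 'min', 'index'])

# s.1 is a SyntaxError, and s.min / s.index resolve to the method / attribute,
# but standard indexing still reaches the underlying elements:
print(s['1'], s['min'], s['index'])   # 10 20 30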
However, while
indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases [...]
in production you should really use the optimized pandas data access methods such as .loc, .iloc, and .ix, because
[...] since the type of the data to be accessed isn't known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods.
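For instance, a minimal sketch of the label- and position-based accessors (the column names are made up):
import pandas as pd

df = pd.DataFrame({"hello": [1, 2, 3], "world": [4, 5, 6]})

print(df.loc[0, "hello"])                  # label-based: row label 0, column "hello"
print(df.iloc[0, 1])                       # position-based: first row, second column
print(df.loc[df["world"] > 4, "hello"])    # boolean mask on rows, label on column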
Using [] looks the column up by the value of whatever expression you put between the brackets, so you can use a variable:
a = "hello"
df[a]  # gives you the column named "hello"
Using . accesses the column by its literal name:
df.a  # gives you the column named "a", regardless of the variable a
The difference is that with the first one you can use a variable (or any expression) to pick the column.
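A minimal sketch of why that matters (the column names are made up): bracket indexing lets you pick columns programmatically, e.g. in a loop, which attribute access can't do:
import pandas as pd

df = pd.DataFrame({"hello": [1, 2], "min": [3, 4]})   # "min" clashes with a DataFrame method

for col in df.columns:
    print(col, df[col].sum())   # works for any label, including "min"

# df.min is the DataFrame.min method, not the column; df["min"] is the column
# df.hello works only because "hello" is a valid identifier and not a method name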
The pandas isin method (on DataFrame and Series) requires a set or a "list-like" container as an argument.
I need to test elements of a Series for membership in a set of words, a small minority of which are not entirely defined: a star replaces one or more letters at a given position in the word. An example set would be: { 'abcd', 'dbca', 'd1*' }
I figured that "list-like" meant that the container implements the __contains__ method, so I created a custom container holding a set for the fully defined words (to benefit from fast look-ups) and a list for the remaining starred words, which are tested separately. That logic is handled by a custom __contains__ method.
But my custom_set is rejected because it is not deemed "list-like".
Any idea what pandas considers "list-like"?
Or an alternative approach that also relies on built-in pandas functions (my understanding is that DataFrame methods are more performant than applying custom functions)?
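For what it's worth, pandas exposes its own notion of "list-like" via pd.api.types.is_list_like, which (roughly speaking) requires the object to be iterable and not be a plain string, so a container that only implements __contains__ won't qualify; as far as I can tell, isin materialises the values you pass rather than probing them with in, so a clever __contains__ would not be used anyway. A minimal sketch of an alternative, assuming the starred words can be rewritten as regular expressions and pandas >= 1.1 for str.fullmatch (the data is made up):
import pandas as pd

words = pd.Series(['abcd', 'd123', 'zzzz'])

# pandas' own notion of "list-like": iterable and not a plain string
print(pd.api.types.is_list_like({'abcd', 'dbca'}))  # True

exact = {'abcd', 'dbca'}          # fully defined words: hash look-up via isin
patterns = [r'd1.*']              # 'd1*' rewritten as a regex (one entry per starred word)

mask = words.isin(exact) | words.str.fullmatch('|'.join(patterns))
print(mask.tolist())              # [True, True, False]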
In PySpark one can use column objects and strings to select columns. Both ways return the same result. Is there any difference? When should I use column objects instead of strings?
For example, I can use a column object:
import pyspark.sql.functions as F
df.select(F.lower(F.col('col_name')))
# or
df.select(F.lower(df['col_name']))
# or
df.select(F.lower(df.col_name))
Or I can use a string instead and get the same result:
df.select(F.lower('col_name'))
What are the advantages of using column objects instead of strings in PySpark?
Read the PySpark style guide from Palantir here, which explains when to use F.col() (and when not to) along with other best practices. In short, it prefers F.col('colA') (the "second style" in the quote below) over referencing the dataframe explicitly with df.colA or df['colA'] (the "first style").
Git Link here
In many situations the first style can be simpler, shorter and visually less polluted. However, we have found that it faces a number of limitations that lead us to prefer the second style:
If the dataframe variable name is large, expressions involving it quickly become unwieldy;
If the column name has a space or other unsupported character, the bracket operator must be used instead. This generates inconsistency, and df1['colA'] is just as difficult to write as F.col('colA');
Column expressions involving the dataframe aren't reusable and can't be used for defining abstract functions;
Renaming a dataframe variable can be error-prone, as all column references must be updated in tandem.
Additionally, the dot syntax encourages the use of short and non-descriptive variable names for the dataframes, which we have found to be harmful for maintainability. Remember that dataframes are containers for data, and descriptive names are a helpful way to quickly set expectations about what's contained within.
By contrast, F.col('colA') will always reference the column designated colA in whatever dataframe it is being applied to. It does not require keeping track of other dataframes' states at all, so the code becomes more local and less susceptible to "spooky interaction at a distance," which is often challenging to debug.
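A minimal sketch of the reusability point (the dataframe and column names below are made up): an expression built with F.col() only mentions the column label, so it can live in a helper function and be applied to any dataframe, whereas the same expression written against a specific dataframe variable cannot:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
customers = spark.createDataFrame([(" Alice ",), ("BOB",)], ["name"])

# Reusable: no dataframe variable is referenced, only the column label
def normalized_name():
    return F.trim(F.lower(F.col("name")))

customers.select(normalized_name().alias("name")).show()
# Writing customers.name instead would tie the expression to the `customers`
# variable, and could silently pick up the wrong column after a join or rename.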
It depends on how the functions are implemented in Scala.
In Scala, the signature of a function is part of the function itself. For example, func(foo: String) and func(bar: Int) are two different functions, and Scala can tell which one you are calling depending on the type of the argument you pass.
F.col('col_name'), df['col_name'] and df.col_name are all the same type of object: a Column. It is almost the same to use one syntax or the other. One small difference is that you could write, for example:
df_2.select(F.lower(df.col_name))  # where the column comes from another dataframe
# Spoiler alert: it may raise an error!
When you call df.select(F.lower('col_name')), if the function lower(smth: str) is not defined in Scala, then you will get an error. Some functions are defined to take a str as input, while others only take Column objects. Try it to see whether it works, and then use it; otherwise, you can open a pull request on the Spark project to add the new signature.
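As a small illustration (assuming an active SparkSession named spark), F.lower is one of the functions documented to accept either a Column or a column-name string, so both of the following should give the same result:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("HELLO",)], ["col_name"])

df.select(F.lower("col_name")).show()           # the string is resolved against df
df.select(F.lower(F.col("col_name"))).show()    # explicit Column object, same result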
I created 2 named selections
df.select(df.x >= 2, name='bigger')
df.select(df.x < 2, name='smaller')
and it's cool, I can use the selection parameter that so many (e.g. statistical) functions offer, for example
df.count('*', selection='bigger')
but is there also a way to use the named selection in a filter? Something like
df['bigger']
Well, that syntax df['bigger'] is accessing a column (or expression) in vaex that is called 'bigger'.
However, you can do df.filter('bigger'), and that will give you a filtered dataframe.
Note that, while similar in some ways, filters and selections are a bit different, and each has its own place when using vaex.
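A minimal sketch of the distinction, based on the answer above (the exact API may vary a bit between vaex versions):
import numpy as np
import vaex

df = vaex.from_arrays(x=np.arange(5))

# Named selection: the dataframe itself is unchanged, aggregations can opt in to it
df.select(df.x >= 2, name='bigger')
print(df.count('*', selection='bigger'))   # counts only the selected rows

# Filter: returns a new dataframe object containing only the matching rows
df_big = df.filter(df.x >= 2)
print(len(df_big))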
If I have a "reference" to a dataframe, there appears to be no way to append to it in pandas because neither append nor concat support the inplace=True parameter.
An (overly) simple example:
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df = chosen_df.append(chosen_row)
Now, because Python does something akin to copying references by value, chosen_df will initially be a reference to whichever candidate dataframe was chosen by some_test.
But the update semantics of pandas mean that the referenced dataframe is not updated by the result of the append call; a new dataframe is created and the name chosen_df is rebound to it instead. I believe that if it were possible to use inplace=True this would work, but it looks like that isn't likely to happen, given the discussion here: https://github.com/pandas-dev/pandas/issues/14796
It's worth noting that a simpler example using lists rather than dataframes does work, because the contents of lists are directly mutated by append().
So my question is --- How could an updatable abstraction over N dataframes be achieved in Python?
The idiom is commonplace, useful and trivial in languages that allow references, so I'm guessing I'm missing a Pythonic trick, or thinking about the whole problem with the wrong hat on!
Obviously the purely illustrative example can be resolved by duplicating the append in the body of an if...else and concretely referencing each underlying dataframe in turn. But that doesn't scale to more complex examples, and it's a generic solution akin to references that I'm looking for.
Any ideas?
There is a simple way to do this specifically for pandas dataframes - so I'll answer my own question.
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df.loc[max_idx+1] = chosen_row
The calculation of max_idx very much depends on the structure of chosen_df. In the simplest case, when it is a dataframe with a sequential index starting at 0, you can simply use the length of the index to calculate it.
If chosen_df has a non-sequential index, you'll need to call max() on the index rather than rely on its length.
If chosen_df is a slice or a groupby result, you'll need to calculate the index from the parent dataframe's maximum to ensure it's truly the max across all rows.
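A minimal sketch of that approach, assuming a plain integer index (the variable names match the example above):
import pandas as pd

candidate_a_df = pd.DataFrame({"x": [1, 2]})
candidate_b_df = pd.DataFrame({"x": [10]})
candidate_a_row = pd.Series({"x": 3})
candidate_b_row = pd.Series({"x": 30})
some_test = True

chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)

# index.max() also works for non-sequential indexes; fall back to -1 when empty
max_idx = chosen_df.index.max() if len(chosen_df) else -1
chosen_df.loc[max_idx + 1] = chosen_row    # mutates the chosen dataframe in place

print(candidate_a_df)                      # candidate_a_df now contains the extra row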
The function names() in R gets or sets the names of an object. What is the Python equivalent to this function, including import?
Usage:
names(x)
names(x) <- value
Arguments:
x: an R object.
value: a character vector of up to the same length as x, or NULL.
Details:
names() is a generic accessor function, and names<- is a generic replacement function. The default methods get and set the "names" attribute of a vector (including a list) or pairlist.
See the R documentation on names() for more details.
In Python (pandas) we have the .columns attribute, which is the equivalent of R's names() function:
Ex:
# Import pandas package
import pandas as pd
# making data frame
data = pd.read_csv("Filename.csv")
# Extract column names
list(data.columns)
Not sure there is anything directly equivalent, especially for getting names. Some objects, like dicts, provide a .keys() method that lets you get things out.
Sort of relevant are the getattr and setattr built-ins, but it's pretty rare to use these in production code.
I was going to talk about Pandas, but I see user2357112 has just pointed that out already!
There is no equivalent. The concept does not exist in Python. Some specific types have roughly analogous concepts, like the index of a Pandas Series, but arbitrary Python sequence types don't have names for their elements.
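A minimal sketch of the closest pandas equivalents (the data is made up): getting and setting DataFrame column labels mirrors names(x) and names(x) <- value, and a Series index plays the role of element names:
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(list(df.columns))            # get the "names": ['a', 'b']
df.columns = ["first", "second"]   # set the "names", like names(x) <- value

s = pd.Series([1, 2], index=["one", "two"])
print(list(s.index))               # element "names" of a Series: ['one', 'two']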