What is the Python equivalent for the R function names()?

The function names() in R gets or sets the names of an object. What is the Python equivalent to this function, including import?
Usage:
names(x)
names(x) <- value
Arguments:
x: an R object.
value: a character vector of up to the same length as x, or NULL.
Details:
names() is a generic accessor function, and names<- is a generic replacement function. The default methods get and set the "names" attribute of a vector (including a list) or pairlist.

In Python (pandas), the .columns attribute is the equivalent of the names() function in R:
Ex:
# Import pandas package
import pandas as pd
# making data frame
data = pd.read_csv("Filename.csv")
# Extract column names
list(data.columns)
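R's names(x) <- value also has a counterpart: assigning a list to .columns renames the columns. A minimal sketch, using a small hypothetical dataframe:
import pandas as pd
# making a small data frame
data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# rename the columns, like names(x) <- value in R
data.columns = ['col_1', 'col_2']
list(data.columns)  # ['col_1', 'col_2']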

Not sure if there is anything directly equivalent, especially for getting names. Some objects, like dicts, provide a .keys() method that lets you get things out.
Also somewhat relevant are the getattr and setattr primitives, but it's pretty rare to use these in production code.
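A minimal sketch of both ideas; the dict contents and the Point class are made up for illustration:
# dict keys are the closest thing to "names" for a dict
d = {'a': 1, 'b': 2}
list(d.keys())  # ['a', 'b']

# getattr/setattr read and write attributes by name
class Point:
    def __init__(self):
        self.x = 1

p = Point()
getattr(p, 'x')      # 1 - read an attribute by name
setattr(p, 'x', 42)  # set an attribute by name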
I was going to talk about Pandas, but I see user2357112 has just pointed that out already!

There is no equivalent. The concept does not exist in Python. Some specific types have roughly analogous concepts, like the index of a Pandas Series, but arbitrary Python sequence types don't have names for their elements.
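To make the Series analogy concrete: the index plays roughly the role of R's names. A minimal sketch:
import pandas as pd
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
list(s.index)              # ['a', 'b', 'c'] - roughly names(x)
s.index = ['x', 'y', 'z']  # roughly names(x) <- value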

Related

What are the advantages of using column objects instead of strings in PySpark

In PySpark one can use column objects and strings to select columns. Both ways return the same result. Is there any difference? When should I use column objects instead of strings?
For example, I can use a column object:
import pyspark.sql.functions as F
df.select(F.lower(F.col('col_name')))
# or
df.select(F.lower(df['col_name']))
# or
df.select(F.lower(df.col_name))
Or I can use a string instead and get the same result:
df.select(F.lower('col_name'))
Read the PySpark style guide from Palantir, which explains when to use F.col() (and when not to) and other best practices.
In many situations the first style can be simpler, shorter and visually less polluted. However, we have found that it faces a number of limitations that lead us to prefer the second style:
If the dataframe variable name is large, expressions involving it quickly become unwieldy;
If the column name has a space or other unsupported character, the bracket operator must be used instead. This generates inconsistency, and df1['colA'] is just as difficult to write as F.col('colA');
Column expressions involving the dataframe aren't reusable and can't be used for defining abstract functions;
Renaming a dataframe variable can be error-prone, as all column references must be updated in tandem.
Additionally, the dot syntax encourages use of short and non-descriptive variable names for the dataframes, which we have found to be harmful for maintainability. Remember that dataframes are containers for data, and descriptive names are a helpful way to quickly set expectations about what's contained within.
By contrast, F.col('colA') will always reference a column designated colA in the dataframe being operated on, named df, in this case. It does not require keeping track of other dataframes' states at all, so the code becomes more local and less susceptible to "spooky interaction at a distance," which is often challenging to debug.
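As a minimal sketch of the reusability point (the helper name and the 'price' column are hypothetical):
import pyspark.sql.functions as F
# Works on any dataframe that has a 'price' column, because F.col()
# doesn't bind the expression to a particular dataframe variable.
def with_discounted_price(df, rate=0.9):
    return df.withColumn('discounted_price', F.col('price') * rate)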
It depends on how the functions are implemented in Scala.
In Scala, the signature of a function is part of the function itself. For example, func(foo: String) and func(bar: Int) are two different functions, and Scala can tell which one you are calling from the type of the argument you pass.
F.col('col_name'), df['col_name'] and df.col_name are the same type of object, a Column, so it is almost the same to use one syntax or another. A small difference is that you could write, for example:
df_2.select(F.lower(df.col_name))  # Where the column is from another dataframe
# Spoiler alert: it may raise an error!
When you call df.select(F.lower('col_name')), if no overload lower(smth: String) is defined in Scala, you will get an error. Some functions are defined with str as input, others take only Column objects. Try it to find out whether it works, then use it; otherwise, you can make a pull request on the Spark project to add the new signature.

String representation of attribute in Python

Is there a way to get the string representation of a single NamedTuple (or perhaps another type's) attribute?
The snippet below instantiates a pandas.Series, indexing a value with a str key that matches where the value came from.
If you later want to change the name of attr_1 to some_useful_attribute_name, and want this reflected in the pandas series index, the code below requires two changes.
import pandas as pd
from typing import NamedTuple

class MyTuple(NamedTuple):
    attr_1: int
    attr_2: float

mt = MyTuple(1, 1)
ser = pd.Series(data=[mt.attr_1], index=['attr_1'])
People often miss the second change when the series instantiation is far away from the class definition.
Refactoring tools such as PyCharm's help somewhat, as they can identify all the attr_1 strings. However, if the string is something more common than attr_1, such as level, it becomes rather tedious to identify the correct strings to change.
Instead of the last line of code above I'd therefore like to do something like
ser = pd.Series(data=[mt.attr_1], index=[repr(mt.attr_1)])
except that of course repr() doesn't give me the name of the attribute, but rather the string representation of its contents.
I am also aware of the NamedTuple._asdict() method. However the use case here is one where only selected attributes go into the series rather than the entire dictionary.
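For what it's worth, a minimal sketch of how _asdict() could still be used with only selected attributes going into the series (the selection list is hypothetical, and mt is the instance from the snippet above):
wanted = ['attr_1']  # only selected attributes
d = mt._asdict()
ser = pd.Series({k: d[k] for k in wanted})
This still repeats the attribute name as a string, though, so it only centralizes the strings rather than eliminating them.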
EDIT:
Might something be achieved by combining Enum and NamedTuple in a clever way?
Thanks for the help.

What are the differences between list(df['column']) and df['column'].to_list()?

When I want a list from a DataFrame column (pandas 1.0.1), I can do:
df['column'].to_list()
or I can use:
list(df['column'])
The two alternatives work well, but what are the differences between them?
Is one method better than the other?
list receives an iterable and returns a pure Python list. It is the built-in Python way to convert any iterable into a pure Python list.
to_list is a method on the core pandas object classes that converts their objects to pure Python lists. The difference is that the implementation is done by the pandas core developers, who may optimize the process according to their understanding, and/or add extra functionality in the conversion that a plain list(...) wouldn't have.
For example, the source_code for this piece is:
def tolist(self):
    '''(...)
    '''
    if self.dtype.kind in ["m", "M"]:
        return [com.maybe_box_datetimelike(x) for x in self._values]
    elif is_extension_array_dtype(self._values):
        return list(self._values)
    else:
        return self._values.tolist()
Which basically means to_list will end up using either a plain list comprehension (analogous to list(...), but enforcing that the final objects are of pandas' datetime type instead of Python's datetime), a straight list(...) conversion, or numpy's tolist() implementation.
The differences between the latter and python's list(...) can be found in this thread.
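One observable consequence of the tolist() branch, shown in a minimal sketch: list(...) keeps numpy scalar types, while to_list() converts them to native Python types.
import pandas as pd
s = pd.Series([1, 2, 3])
type(list(s)[0])      # <class 'numpy.int64'> - numpy scalars are kept
type(s.to_list()[0])  # <class 'int'> - converted to a native Python int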

Selecting data from dataframes by [] and by . (attribute)

I have trouble understanding when I should access data from a dataframe (df) using df['data'] or df.data.
I mostly use [] to create new columns, and I can access data both ways, but what is the difference, and how can I better grasp these two ways of selecting data? When should one be used over the other?
If I understand the Docs correctly, they are pretty much equivalent, except in these cases:
You can use the . access only if the index element is a valid Python identifier, e.g. s.1 is not allowed.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
However, while
indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases [...]
in production you should really use the optimized pandas data access methods such as .loc, .iloc, and .ix, because
[...] since the type of the data to be accessed isn't known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods.
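A minimal sketch of the cases above, using a small hypothetical dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'min': [3, 4]})
df.a              # works: 'a' is a valid identifier with no name conflict
df['min']         # brackets required: df.min is the DataFrame method
df.loc[:, 'min']  # explicit label-based access, preferred for production code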
Using [] lets you pass the column name through a variable:
a = "hello"
df[a]  # gives you the content of column "hello"
Using . :
df.a  # gives you the content of column "a"
The difference is that with the first one you can use a variable.

Python function argument list from a dictionary

I'm still relatively new to Python, and sometimes something that should be relatively simple escapes me.
I'm storing the results of a POST operation to a database table as a character string formatted as a dictionary definition. I'm then taking that value and using eval() to convert it to an actual dict object, which is working great as it preserves the data types (dates, datetimes, integers, floats, strings etc.) of the dictionary data elements.
What has me flummoxed is using the resulting dictionary to construct a set of keyword arguments that can then be passed to a function or method. So far I haven't been able to make this work, let alone figure out the best/most Pythonic way to approach it. The dictionary makes it easy to iterate over the elements and identify key/value pairs, but I'm stuck at that point, not knowing how to use these pairs as keyword arguments in the function or method call.
Thanks!
I think you're just looking for func(**the_dict)?
Understanding kwargs in Python
You are looking for **kwargs. It unpacks a dictionary into keyword arguments, just like you want. In the function call, just use this:
some_func(**my_dict)
Where my_dict is the dictionary you mentioned.
@tzaman and @Alex_Thornton - thanks - your answers led me to the solution, but they weren't clear that **kwargs goes in the function call, not the function definition. It took me a while to figure that out. I had only seen **kwargs used in the function/method definition before, so this usage was new to me. The link that @tzaman included triggered the "aha" moment.
Here is the code that implements the solution:
import datetime  # needed so eval() can resolve datetime.date below

def do_it(model=None, mfg_date=None, mileage=0):
    # Proceed with whatever you need to do with the arguments
    print('Model: {} Mfg date: {} Mileage: {}'.format(model, mfg_date, mileage))

dict_string = ("{'model': 'Mustang',"
               "'mfg_date': datetime.date(2012, 11, 24),"
               "'mileage': 23824}")
dict_arg = eval(dict_string)
do_it(**dict_arg)  # <--- Here is where the ** unpacking goes - IN THE CALL
