When calculating basic statistics, the following works well:
import pandas as pd
max(df['Price'])
min(df['Price'])
But this returns an error:
mean(df['Price'])
NameError: name 'mean' is not defined
I'm just trying to understand the logic of this.
This one works well:
df['Price'].mean()
Which statistics work as methods after the dot, and which must wrap the column as functions?
min() and max() are functions provided as Python built-ins.
You can use them on any iterable, including a pandas Series, which is why your code works.
Pandas also provides .min() and .max() as methods on Series and DataFrames, so e.g. df["Price"].min() would also work. The full lists of Series and DataFrame methods are in the pandas documentation.
If you do want to use a free function called mean(), e.g. when you have something that's not a Pandas series and you don't want to convert it to one, one actually does exist in the Python standard library, but you will have to import it:
from statistics import mean
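A minimal sketch of the free-function form, using a plain list of made-up prices in place of a DataFrame column:

```python
from statistics import mean

prices = [3.50, 4.25, 5.00]   # stand-in for a column of prices
print(mean(prices))           # 4.25
```

The same call works on a pandas Series, since a Series is iterable, though the built-in Series.mean() method is the more idiomatic choice there.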
Related
I have some code that I want to use in a reusable Python function. The code adds a field to an existing DataFrame and makes some adjustments to it. However, I get the error "TypeError: string indices must be integers". What am I doing incorrectly?
See below the function plus the code for calling the function.
import pandas as pd
#function
def create_new_date_column_in_df_based_on_other_date_string_column(df, df_field_existing_str, df_field_new):
    df[df_field_new] = df[df_field_existing_str]
    df[df_field_new] = df[df_field_existing_str].str.replace('12:00:00 AM', '')
    df[df_field_new] = df[df_field_existing_str].str.strip()
    df[df_field_new] = pd.to_datetime(df[df_field_existing_str]).dt.strftime('%m-%d-%Y')
    return df[df_field_new]
#calling the function
create_new_date_column_in_df_based_on_other_date_string_column(df='my_df1',df_field_existing_str='existingfieldname',df_field_new='newfieldname')
The parameter df you are giving the function is of type str, and so is df_field_existing_str.
What you are basically doing is trying to index a string with another string via [] (the .__getitem__() method), which is impossible.
You are not using a DataFrame here, only strings, hence the TypeError.
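The fix is to pass the DataFrame object itself, not a string naming it. A minimal sketch, where the sample data and the condensed helper (same logic as the question's function) are illustrative assumptions:

```python
import pandas as pd

def create_new_date_column(df, existing_col, new_col):
    # Condensed version of the question's function: clean the string
    # column, parse it as a date, and store a reformatted copy.
    cleaned = df[existing_col].str.replace('12:00:00 AM', '').str.strip()
    df[new_col] = pd.to_datetime(cleaned).dt.strftime('%m-%d-%Y')
    return df[new_col]

# A real DataFrame object -- illustrative data
my_df1 = pd.DataFrame({'existingfieldname': ['1/2/2020 12:00:00 AM',
                                             '3/4/2021 12:00:00 AM']})

# Pass the DataFrame itself, not the string 'my_df1'
create_new_date_column(my_df1, 'existingfieldname', 'newfieldname')
print(my_df1['newfieldname'].tolist())  # ['01-02-2020', '03-04-2021']
```

Calling the function with df='my_df1' hands it a string, and 'my_df1'['existingfieldname'] is exactly the string-indexing operation the TypeError complains about.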
In PySpark one can use column objects and strings to select columns. Both ways return the same result. Is there any difference? When should I use column objects instead of strings?
For example, I can use a column object:
import pyspark.sql.functions as F
df.select(F.lower(F.col('col_name')))
# or
df.select(F.lower(df['col_name']))
# or
df.select(F.lower(df.col_name))
Or I can use a string instead and get the same result:
df.select(F.lower('col_name'))
What are the advantages of using column objects instead of strings in PySpark?
Read the PySpark style guide from Palantir, which explains when to use F.col() and related best practices.
In many situations the first style can be simpler, shorter and visually less polluted. However, we have found that it faces a number of limitations, that lead us to prefer the second style:
If the dataframe variable name is large, expressions involving it quickly become unwieldy;
If the column name has a space or other unsupported character, the bracket operator must be used instead. This generates inconsistency, and df1['colA'] is just as difficult to write as F.col('colA');
Column expressions involving the dataframe aren't reusable and can't be used for defining abstract functions;
Renaming a dataframe variable can be error-prone, as all column references must be updated in tandem.
Additionally, the dot syntax encourages use of short and non-descriptive variable names for the dataframes, which we have found to be harmful for maintainability. Remember that dataframes are containers for data, and descriptive names are a helpful way to quickly set expectations about what's contained within.
By contrast, F.col('colA') will always reference a column designated colA in the dataframe being operated on, named df, in this case. It does not require keeping track of other dataframes' states at all, so the code becomes more local and less susceptible to "spooky interaction at a distance," which is often challenging to debug.
It depends on how the functions are implemented in Scala.
In Scala, the signature of a function is part of the function itself. For example, func(foo: String) and func(bar: Int) are two different functions, and Scala can tell which one you are calling from the type of the argument you pass.
F.col('col_name'), df['col_name'] and df.col_name all yield the same type of object, a Column. It is almost the same to use one syntax or another. One small difference is that you could write, for example:
df_2.select(F.lower(df.col_name)) # Where the column is from another dataframe
# Spoiler alert : It may raise an error !!
When you call df.select(F.lower('col_name')), if lower(smth: String) is not defined in Scala, you will get an error. Some functions are defined to accept a string as input; others take only Column objects. Try it to see whether it works, and use it if it does. Otherwise, you can open a pull request on the Spark project to add the new signature.
I am trying to create a pandas DataFrame with the hypothesis library for code testing purposes, using the following code:
from hypothesis import given
from hypothesis.extra.pandas import columns, data_frames
from hypothesis.extra.numpy import datetime64_dtypes

@given(data_frames(index=datetime64_dtypes(max_period='Y', min_period='s'),
                   columns=columns("A B C".split(), dtype=int)))
The error I receive is the following:
E TypeError: 'numpy.dtype' object is not iterable
I suspect this is because, for index=, I only pass a dtype rather than something that generates an actual index (a pd.Series-like index of datetimes, for example). Even if that is the case (I am not sure), I still don't know how to use the hypothesis library to achieve my goal.
Can anyone tell me what's wrong with the code and what the solution would be?
The reason for the above error is that data_frames requires its index= argument to be a strategy that generates an index, such as indexes(). datetime64_dtypes only provides a dtype strategy, not an index strategy.
To fix this, we wrap the element strategy in indexes() like so:
from hypothesis import given, strategies
from hypothesis.extra.pandas import columns, data_frames, indexes

@given(data_frames(index=indexes(strategies.datetimes()),
                   columns=columns("A B C".split(), dtype=int)))

Note that to generate datetime values we use strategies.datetimes().
The function names() in R gets or sets the names of an object. What is the Python equivalent to this function, including import?
Usage:
names(x)
names(x) <- value
Arguments:
x: an R object.
value: a character vector of up to the same length as x, or NULL.
Details:
names() is a generic accessor function, and names<- is a generic replacement function. The default methods get and set the "names" attribute of a vector (including a list) or pairlist.
See the R documentation on names() for details.
In Python (pandas) we have the .columns attribute, which is the equivalent of R's names() function:
Ex:
# Import pandas package
import pandas as pd
# making data frame
data = pd.read_csv("Filename.csv")
# Extract column names
list(data.columns)
I'm not sure there is anything directly equivalent, especially for getting names. Some objects, like dicts, provide a .keys() method that lets you get things out.
Somewhat relevant are the getattr and setattr built-ins, but it's pretty rare to use these in production code.
I was going to talk about Pandas, but I see user2357112 has just pointed that out already!
There is no equivalent. The concept does not exist in Python. Some specific types have roughly analogous concepts, like the index of a Pandas Series, but arbitrary Python sequence types don't have names for their elements.
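To mirror both the getter form names(x) and the setter form names(x) <- value in pandas, one can read and assign df.columns; a minimal sketch with an illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Getter -- like names(x) in R
print(list(df.columns))           # ['a', 'b']

# Setter -- like names(x) <- value in R
df.columns = ["col_one", "col_two"]
print(list(df.columns))           # ['col_one', 'col_two']

# For a pandas Series, the closest analogue is its index
s = pd.Series([10, 20], index=["x", "y"])
print(list(s.index))              # ['x', 'y']
```

Unlike R's names(), this only applies to pandas objects; arbitrary Python sequences have no named-elements concept to get or set.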
I have a data frame and I want to create a new column whose values are defined by values located in other columns (in the same row). It is very simple if I use simple operations (+, -, * and even abs). For example:
df['new_col'] = abs(df['col1']*df['col2'] - df['col3'])
Then I defined my own function in Python and tried to do the following:
df['new_col'] = my_func(df['col1'], df['col2'], df['col3'], df['col4'])
Unfortunately it did not work. I think the reason is that my_func contains asin, atan and other functions that cannot be applied to a Series. For example, abs(df['col1']) raises no complaints, but asin(df['col1']) gives an error message:
TypeError: only length-1 arrays can be converted to Python scalars
Is there a trick that will let me use asin (or my own function my_func) in the same way abs or + are used?
Use NumPy's universal functions (ufuncs), which operate elementwise on arrays and Series.
For your case, use numpy.arcsin instead of math.asin.
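A minimal sketch of the difference, with an illustrative column of values in [-1, 1]:

```python
import math
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [0.0, 0.5, 1.0]})

# math.asin expects a single scalar, so math.asin(df["col1"]) raises
# "TypeError: only length-1 arrays can be converted to Python scalars".
# np.arcsin is a ufunc: it applies elementwise across the whole Series.
df["asin_col1"] = np.arcsin(df["col1"])
print(df["asin_col1"].tolist())   # approximately [0.0, pi/6, pi/2]
```

The same applies to a custom my_func: build it from ufuncs (np.arcsin, np.arctan, ...) instead of the math module, and it will accept whole Series as arguments.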