Xarray: Filtering on float - python

This subject refers to this one I closed earlier:
NetCDF4 file with Python - Filter before dataframing
After applying the solution from the other topic to reduce the xarray's size
data_9 = ds.sel(time=datetime.time(9))
I now have a smaller xarray, but I still need to reduce it further on latitude and longitude.
For example I want only longitude between -4 and 44
I tried to apply the sel function again, but it doesn't seem to work this time :'(
data_9 = ds.sel(time=datetime.time(9)).sel(lon>-4).sel(lon<44)
Doing this, it can't recognise lon:
NameError: name 'lon' is not defined
Can someone help with this too?
Thanks

It seems you have to use where instead of sel here. We can create a condition array just like in numpy and give it to where. The second parameter drop=True removes the data where our condition is falsy. Without it, you would get nans there instead of getting a trimmed dataset.
I am using the same demo dataset used in the other question you linked.
import xarray as xr
import datetime

# Load a demo dataset.
ds = xr.tutorial.load_dataset('air_temperature')

# Select the 09:00 time steps, as in the other question.
data_9 = ds.sel(time=datetime.time(9))

# Boolean condition on the lon coordinate; drop=True trims everything outside it.
cond = (-4 < data_9.lon) & (data_9.lon < 44)
data_9 = data_9.where(cond, drop=True)
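One caveat about the demo data: the air_temperature tutorial dataset stores longitude on a 0-360 grid, so a window from -4 to 44 matches nothing there. If your own longitudes are also on a 0-360 grid, a common idiom is to shift them to the -180..180 convention first (a sketch, assuming the coordinate is named lon):
# Shift a 0-360 longitude coordinate to -180..180 and re-sort,
# so conditions like (-4 < lon) & (lon < 44) can match.
ds = ds.assign_coords(lon=(ds.lon + 180) % 360 - 180).sortby('lon')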

Xarray's sel method can take multiple selectors at once, and coordinate windows can be expressed as slices:
ds_subset = ds.sel(time=datetime.time(9), lon=slice(-4, 44))
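Note that sel with a slice follows the storage order of the coordinate, so for a descending coordinate the bounds go high-to-low. A sketch on the same demo dataset, whose lat runs from 75 down to 15:
# lat is stored descending, so the slice bounds go from high to low:
ds_subset = ds.sel(time=datetime.time(9), lat=slice(60, 30))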


How to fix "DeprecationWarning: DataFrames with non-bool types result in worse computationalperformance..."

I have been trying to implement the Apriori algorithm in Python. There are several examples online; they all use similar methods and mostly the same example dataset. The reference link: https://www.kaggle.com/code/rockystats/apriori-algorithm-or-market-basket-analysis/notebook
(starting from the line [26])
I have a different dataset that has the same structure as the example datasets online. I keep getting the
"DeprecationWarning: DataFrames with non-bool types result in worse
computationalperformance and their support might be discontinued in
the future.Please use a DataFrame with bool type"
error.
Here is my code:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules

df1 = pd.read_csv(r'C:\Users\USER\dataset', sep=';')
df = df1.fillna(0)

# One row per customer, one column per product, counting purchases.
basket = pd.pivot_table(data=df, index='cust_id', columns='Product',
                        values='quantity', aggfunc='count', fill_value=0.0)

# Convert counts to 0/1 flags.
def convert_into_binary(x):
    if x > 0:
        return 1
    else:
        return 0

basket_sets = basket.applymap(convert_into_binary)

frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
print(frequent_itemsets)

# Association rules from the frequent itemsets.
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
print(rules)
In addition, in the last step of my code, I get an empty dataframe; I can see the column headings of the dataset but the output is empty.
Empty DataFrame
Columns: [antecedents, consequents, antecedent support, consequent support, support, confidence, lift, leverage, conviction]
Index: []
I am not sure if this issue is related to this error that I am having. I am new to python and I would really appreciate assistance and support on this issue.
I ran into the same issue even after converting my dataframe fields to 0 and 1.
The fix was just making sure the apriori module knows the dataframe is of boolean type, so in your case you should run this:
frequent_itemsets = apriori(basket_sets.astype('bool'), min_support=0.07, use_colnames=True)
In addition, in the last step of my code, I get an empty dataframe; I can see the column headings of the dataset but the output is empty.
Try using a smaller min_support
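As a side note (a sketch reusing the basket pivot from the question), the 0/1 conversion step can be skipped entirely, since a comparison already produces the boolean frame apriori wants:
# Comparing against zero yields a boolean DataFrame in one step,
# replacing convert_into_binary + astype('bool').
basket_sets = basket > 0
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)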

Replace unknown values (with different median values)

I have a particular problem: I would like to clean and prepare my data, and I have a lot of unknown values in the "highpoint_metres" column of my dataframe (members). Since there is no missing information for "peak_id", I calculated the median height per peak_id to be more accurate.
I would like to do two steps: 1) add a new column to my "members" dataframe holding the median value, which differs depending on the "peak_id" (the value calculated by the code below); 2) have the code check whether the value in highpoint_metres is null, and if it is, put the value of the new column there instead. I don't know if this is clear enough.
Code:
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id","highpoint_metres"]].groupby("peak_id",as_index=False).median()
And I don't know how to continue from there (my level of python is very bad ;-))
I believe that's what you're looking for:
import numpy as np
import pandas as pd

members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")

# Median highpoint per peak_id, broadcast back onto every row of that peak.
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")

# Where highpoint_metres is missing, take the peak's median; otherwise keep the value.
is_highpoint_missing = np.isnan(members.highpoint_metres)
members["highpoint_meters_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)
So one way to go about replacing 0 with the median could be:
import numpy as np
df[col_name] = df[col_name].replace({0: np.median(df[col_name])})
You can also use the apply function:
df[col_name] = df[col_name].apply(lambda x: np.median(df[col_name]) if x == 0 else x)
Let me know if this helps.
So, adding a little more info based on Marie's question:
One way to get the median is through groupby, then a left join back onto the original dataframe.
df_gp = df.groupby(['peak_id']).agg(Median=('highpoint_metres', 'median')).reset_index()
df = pd.merge(df, df_gp, on='peak_id')
df['highpoint_metres'] = df.apply(lambda x: x['Median'] if np.isnan(x['highpoint_metres']) else x['highpoint_metres'], axis=1)
Let me know if this solves your issue.

Heat map from pandas DataFrame - 2D array

I have a data visualisation-based question. I basically want to create a heatmap from a pandas DataFrame, where I have the x,y coordinates and the corresponding z value. The data can be created with the following code -
import numpy as np
import pandas as pd

# x, y coordinates and the corresponding z value.
data = [[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]]
data = np.array(data)
df = pd.DataFrame(data, columns=['X','Y','Z'])
Please note that I have converted an array into a DataFrame just so that I can give an example of an array. My actual data set is quite large and I import into python as a DataFrame. After processing the DataFrame, I have it available as the format given above.
I have seen other questions on the same problem, but their solutions do not seem to work for my particular case, or maybe I am not applying them correctly. I want my results to be similar to what is given here https://plot.ly/python/v3/ipython-notebooks/cufflinks/#heatmaps
Any help would be welcome.
Thank you!
Found one way of doing this, using Seaborn:
import numpy as np
import pandas as pd
import seaborn as sns

data = [[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]]
data = np.array(data)
df = pd.DataFrame(data, columns=['X','Y','Z'])

# Pivot into a 2D grid: X down the rows, Y across the columns, Z as the values.
df = df.pivot(index='X', columns='Y', values='Z')
display_df = sns.heatmap(df)
This returns a heatmap image (not reproduced here).
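One gotcha worth mentioning (a sketch, assuming the pivoted df from above): sns.heatmap draws the first row of the pivot at the top, so the index appears to decrease downwards; the underlying matplotlib Axes can flip it:
import matplotlib.pyplot as plt

# sns.heatmap returns a matplotlib Axes; invert the vertical axis so the
# index increases upwards like a regular plot, and annotate the cells.
ax = sns.heatmap(df, annot=True)
ax.invert_yaxis()
plt.show()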
How about using plotnine, a Grammar of Graphics library for Python?
Data
import numpy as np
import pandas as pd
from plotnine import *

data = [[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]]
data = np.array(data)
df = pd.DataFrame(data, columns=['X','Y','Z'])
Prepare data
# Label each row, then melt to long form: rows x variable x value.
df['rows'] = ['row' + str(n) for n in range(0, len(df.index))]
dfMelt = pd.melt(df, id_vars='rows')
Make heatmap
ggplot(dfMelt, aes('variable', 'rows', fill='value')) + \
    geom_tile(aes(width=1, height=1)) + \
    theme(axis_ticks=element_blank(),
          panel_background=element_blank()) + \
    labs(x='', y='', fill='')
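Since the data already carries its own coordinates, you could also skip the melt and put Z straight onto the X/Y grid. A minimal sketch (assuming plotnine is installed, and treating X and Y as discrete levels via factor()):
# Plot Z directly on the X/Y grid; factor() makes the sparse
# coordinate values behave as evenly spaced categories.
p = (ggplot(df, aes('factor(X)', 'factor(Y)', fill='Z'))
     + geom_tile()
     + labs(x='X', y='Y', fill='Z'))
p.save('heatmap.png')  # or simply display p in a notebook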

Replicate statsmodels' RegressionResults.predict functionality

Here's my sample program:
import numpy as np
import pandas as pd
import statsmodels
from statsmodels.formula.api import ols

df = pd.DataFrame({"z": [1,1,1,2,2,2,3,3,3],
                   "x": [0,1,2,0,1,2,0,1,2],
                   "y": [0,2,4,3,5,7,7,9,11]})

model = ols("y ~ x + z + I(z**2)", df).fit()
model.params

newdf = pd.DataFrame({"z": [4,4,4,5,5,5],
                      "x": [0,1,2,0,1,2]})

model.predict(newdf)
You'll notice, if you run this, that model.params is a pandas Series with indices the same as the right-hand side of the formula, except with an additional entry: "Intercept"
> Out[2]:
> Intercept -2.0
> x 2.0
> z 1.5
> I(z ** 2) 0.5
> dtype: float64
And, using some internal functionality I can't determine, the RegressionResults object's .predict() can recognize the column headers from newdf, match them up (including the patsy syntax "I(z**2)"), add the intercept, and return an answer Series. (this is the last line of my sample code)
This seems convenient! Better than writing out my formula again in python/numpy code whenever I want to evaluate slight variations on it. I feel like there should be some way for me to construct a similar pd.Series for formula coefficients, instead of having created it through a model and fit. Then I should be able to apply this to an appropriate dataframe as a way of evaluating functions.
My attempts to figure out how statsmodels does this haven't worked. I haven't found anything obvious in the related function doc pages or in patsy, nor can I seem to step into that section of the source code while debugging.
Anyone have any idea how to set this up?
I eventually pieced together one way of doing this.
import numpy as np
import pandas as pd
import patsy

def predict(coeffs, datadf: pd.DataFrame) -> np.ndarray:
    """Apply a series (or df) of coefficients indexed by model terms to new data.

    :param coeffs: a series whose elements are coefficients and whose index
        holds the formula terms, or a df whose column names are formula terms
        and whose rows are sets of coefficients
    :param datadf: the new data to predict on
    """
    # Rebuild a patsy model description from the term names; an empty Term is the intercept.
    desc = patsy.ModelDesc([], [patsy.Term([]) if column == "Intercept"
                                else patsy.Term([patsy.EvalFactor(column)])
                                for column in coeffs.index])
    dmat = patsy.dmatrix(desc, datadf)
    return np.dot(dmat, coeffs.T)

newdf["y"] = predict(model.params, newdf)
The reason this seemed so appealing to me, in case anyone is baffled, is that I was fitting data piecewise using df.groupby("column").apply(FitFunction). It seemed like having FitFunction() return the model.params series would be the cleanest approach within the pandas paradigm.

python pandas: how to avoid chained assignment

I have a pandas dataframe with two columns: x and value.
I want to find all the rows where x == 10, and for all these rows set value = 1,000. I tried the code below but I get the warning that
A value is trying to be set on a copy of a slice from a DataFrame.
I understand I can avoid this by using .loc or .ix, but I would first need to find the location or the indices of all the rows which meet my condition of x ==10. Is there a more direct way?
Thanks!
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['x'] = np.arange(10, 14)
df['value'] = np.arange(200, 204)
print(df)

df[df['x'] == 10]['value'] = 1000  # this doesn't work: chained assignment
print(df)
You should use loc here; it guarantees the assignment happens on the original DataFrame rather than on a copy. On your example the following works and does not raise a warning:
df.loc[df['x'] == 10, 'value'] = 1000
So the general form is:
df.loc[<mask or index label values>, <optional column>] = <new scalar value or array-like>
The docs highlight this error, and there is an introduction to indexing; granted, some of the function docs are sparse, so feel free to submit improvements.
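For completeness, a minimal runnable sketch of the pattern on the question's own frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(10, 14), 'value': np.arange(200, 204)})
mask = df['x'] == 10          # boolean mask selecting the rows to update
df.loc[mask, 'value'] = 1000  # single .loc call: no chained-assignment warning
print(df)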
