Python xarray: grouping by multiple parameters

When using the xarray package for Python 2.7, is it possible to group over multiple parameters like you can in pandas? In essence, an operation like:
data.groupby(['time.year','time.month']).mean()
if you wanted to get mean values for each year and month of a dataset.

Unfortunately, xarray does not support grouping with multiple arguments yet. This is something we would like to support and it would be relatively straightforward, but nobody has had the time to implement it yet (contributions would be welcome!).

An easy way around is to construct a multiindex and group by that "new" coordinate:
da_multiindex = da.stack(my_multiindex=['time.year', 'time.month'])
da_mean = da_multiindex.groupby('my_multiindex').mean()
da_mean.unstack()  # go back to a normal index
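If stacking on 'time.year' and 'time.month' doesn't work in your xarray version, another common workaround (a sketch, assuming da is a DataArray with a 'time' coordinate) is to build a single composite year-month key, attach it as a coordinate, and group by that:
# composite key, e.g. 2015 * 100 + 1 -> 201501 for January 2015
year_month = da['time.year'] * 100 + da['time.month']
da = da.assign_coords(year_month=year_month)
monthly_means = da.groupby('year_month').mean()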

Related

How to use named selection for filtering in Vaex

I created 2 named selections
df.select(df.x >= 2, name='bigger')
df.select(df.x < 2, name='smaller')
and it's cool, I can use the selection parameter so many (ie statistical) functions offer, for example
df.count('*',selection='bigger')
but is there also a way to use the named selection in a filter? Something like
df['bigger']
Well, the syntax df['bigger'] accesses a column (or expression) in vaex called 'bigger'.
However, you can do df.filter('bigger'), which will give you a filtered dataframe.
Note that, while similar in some ways, filters and selections are a bit different, and each has its own place when using vaex.
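A small sketch pulling the pieces together (the column x and the toy data are assumptions, not from the question; passing the selection name to filter is taken from the answer above):
import numpy as np
import vaex

df = vaex.from_arrays(x=np.arange(5))

df.select(df.x >= 2, name='bigger')           # named selection
df.select(df.x < 2, name='smaller')
print(df.count('*', selection='bigger'))      # many statistical functions accept selection=

df_big = df.filter('bigger')                  # per the answer above, filter by the selection name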

Find row based on multiple conditions (column values greater than)

My issue is that I need to identify the patient "ID" if anything critical (high conc. XT or increase in Crea) is observed in their blood sample.
Ideally, the sick patients' "ID"s should be categorized into one of three "Bad" groups (named along the lines of Bad_30 and Bad_40). If the patients don't make it into one of the "Bad" groups, then they are non-critical.
This might be the way:
critical = df[(df['hour36_XT']>=2.0) | (df['hour42_XT']>=1.5) | (df['hour48_XT']>=0.5)]
not_critical = df[~df.index.isin(critical.index)]
Before using this, you will have to convert the values to a float data type. You can do that by passing dtype=np.float32 when defining the dataframe.
You can put multiple conditions within one df.loc bracket. I tried this on your dataset and it worked as expected:
newDf = df.loc[(df['hour36_XT'] >= 2.0) & (df['hour42_XT'] >= 1.0) & (df['hour48_XT'] >= 0.5)]
print(newDf['ID'])
Explanation: I'm creating a new dataframe using your conditions and then printing out the IDs of the resulting dataframe.
Words of advice: you should avoid iterating over Pandas dataframe rows, and once you learn to use Pandas properly you'll be surprised how rarely you need to. This should be one of the first lessons when starting with Pandas, but looping is so ingrained in us programmers that we tend to skip over the package's vectorized abilities and reach straight for row iteration. If you rely on row iteration, you'll likely hit annoying slowness once you start working with larger datasets and/or more complex operations. I recommend reading up on this; I'm a beginner myself and have found this article to be a good reference point.
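As a rough illustration of that advice (the column names mirror the question, but the values here are made up), a single boolean mask does the work a row loop would do element by element:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'hour36_XT': [2.5, 0.1, 3.0],
                   'hour42_XT': [0.2, 0.3, 2.0],
                   'hour48_XT': [0.1, 0.0, 0.4]})

# vectorized: one mask over the whole frame, no explicit iteration
mask = (df['hour36_XT'] >= 2.0) | (df['hour42_XT'] >= 1.5) | (df['hour48_XT'] >= 0.5)
print(df.loc[mask, 'ID'])      # critical patients
print(df.loc[~mask, 'ID'])     # non-critical patients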

How to do a calculation on each line of a pandas dataframe in python?

I am brand new to Python. I am attempting to convert a function I made in R to Python; the R function is described here:
How to optimize this process?
From my reading it looks like the best way to do this in Python would be to use a for loop of the following form:
for each line in probe test:
    find the user in U_lookup
    find the movie in M_lookup
    take the value found in U_lookup and retrieve that line number from knn_text
    take the values found in that row of knn_text, and retrieve those line numbers from dfm
    for those line numbers in dfm, retrieve column=U_lookup
    take the average of the non-zero values found
    save the value into the pandas dataframe in a new column for that line
Is this the most efficient (in terms of calculation speed) way to complete an operation like this? Coming from R, I wasn't sure whether there was better functionality for something like this within the pandas package.
As a follow-up, is there an equivalent in Python to the dput() function in R? dput essentially generates code that makes it easy to share a subset of data for questions like this.
You can use df.apply(my_func, axis=1) to apply the function/calculation to each row of a dataframe.
where my_func contains the required calculations.
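A minimal sketch of that pattern (the dataframe, column names, and calculation are placeholders, not from the question):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def my_func(row):
    # with axis=1, apply passes each row to the function as a Series
    return row['a'] * row['b']

df['result'] = df.apply(my_func, axis=1)
print(df)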

Pandas: Applying custom aggregation function (not w/ groupby)

We can think of applying two types of functions to a Pandas Series: transformations and aggregations. They make this distinction in the documentation; transformations map individual values in the Series while aggregations somehow summarize the entire Series (e.g. mean).
It is clear how to apply transformations using apply, but I have not been successful in implementing a custom aggregation. Note that groupby is not involved, and aggregation does not require a groupby.
I am working with the following case: I have a Series in which each row is a list of strings. One way I could aggregate this data is to count up the number of appearances of each string, and return the 5 most common terms.
def top_five_strings(series):
    counter = {}
    for row in series:
        for s in row:
            if s in counter:
                counter[s] += 1
            else:
                counter[s] = 1
    return sorted(counter.items(), key=lambda x: x[1], reverse=True)[:5]
If I call this function as top_five_strings(series), it works fine, analogous to calling np.mean(series) on a numeric series. However, the difference is that I can also do series.agg(np.mean) and get the same result. If I do series.agg(top_five_strings), I instead get the top five letters in each row of the Series (which makes sense if a single row becomes the argument of the function).
I think the critical difference is that np.mean is a NumPy ufunc, but I haven't been able to work out how the _aggregate helper function works in the Pandas source.
I'm left with 2 questions:
1) Can I implement this by making my Python function a ufunc (and if so, how)?
2) Is this a stupid thing to do? I haven't found anyone else out there trying to do something like this. It seems to me like it would be quite nice, however, to be able to implement custom aggregations as well as custom transformations within the Pandas framework (e.g. I get a Series as a result as one might with df.describe).
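Not from the original thread, but one way to see the behaviour described above (the example Series is an assumption): calling the function directly, or routing the Series through Series.pipe, aggregates over the whole Series, while series.agg(top_five_strings) applies the function to each row as the question observes:
import pandas as pd

series = pd.Series([['a', 'b', 'a'], ['b', 'c'], ['a', 'c', 'c']])

top_five_strings(series)        # counts strings across the whole Series
series.pipe(top_five_strings)   # equivalent: pipe hands the Series itself to the function
series.agg(top_five_strings)    # per the question, this ends up operating row by row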

Plot Subset of Dataframe without Being Redundant

A bit of a Python newb here. As a beginner, it's easy to learn different functions and methods from training classes, but it's another thing to learn how to "best" code in Python.
I have a simple scenario where I'm looking to plot a portion of a dataframe spdf. I only want to plot instances where speed is greater than 0 and use datetime as my X-axis. The way I've managed to get the job done seems awfully redundant to me:
ts = pd.Series(spdf[spdf['speed']>0]['speed'].values, index=spdf[spdf['speed']>0]['datetime'])
ts.dropna().plot(title='SP1 over Time')
Is there a better way to plot this data without specifying the subset of my dataframe twice?
You don't need to build a new Series. You can plot using your original df
df[df['col'] > 0].plot()
In your case:
spdf[spdf['speed'] > 0].dropna().plot(title='SP1 over Time')
I'm not sure what your spdf object is or how it was created. If you'll often need to plot using the 'datetime' column, you can set it as the index of the df. If you're reading the data from a csv, you can do this with the parse_dates keyword argument, or if you already have the df you can change the index using df.set_index('datetime'). You can use df.info() to see what is currently being used as your index and its datatype.
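A short sketch of that suggestion (assuming spdf has 'datetime' and 'speed' columns, as in the question; the csv filename is only illustrative):
import pandas as pd

# either parse dates and set the index when reading, e.g.
# spdf = pd.read_csv('data.csv', parse_dates=['datetime'], index_col='datetime')
# or fix up an existing dataframe:
spdf = spdf.set_index('datetime')
spdf.loc[spdf['speed'] > 0, 'speed'].dropna().plot(title='SP1 over Time')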
