I'm not sure if the title was well chosen, sorry for that. If this has already been covered, please point me to it; I couldn't find it.
For an analysis I am doing, I am working in JupyterLab, mainly with scanpy. I want to see the number of cells coexpressing certain genes in a Leiden clustering. So far I have been trying the pandas crosstab function, and I get the number for each cluster.
However, I have two conditions, and I'm struggling to separate the samples so I can get the cell counts for each one.
This is the code I am using to get the total cell number, which works fine:
pd.crosstab(adata_proc.obs['leiden_r05'], adata_proc.obs['CoEx'])
This is the code where I am struggling to get the numbers per sample. I know that aggfunc=','.join is not the correct approach; it is only there to illustrate the problem.
pd.crosstab(adata_proc.obs['leiden_r05'], adata_proc.obs['CoEx'], adata_proc.obs['sample'], aggfunc=','.join)
I can get the names of the conditions into the table, but that is not what I want; I want the counts for the two conditions. How can this be done? Maybe there is a way to do it with a separate function?
Edit:
Using crosstab, you'll need to add the 'CoEx' column to the index and use 'sample' as the column of interest:
pd.crosstab(index=[adata_proc.obs['leiden_r05'], adata_proc.obs['CoEx']], columns=[adata_proc.obs['sample']])
I suggest using the .groupby function:
adata_proc.obs.groupby(['leiden_r05','CoEx'])["sample"].value_counts()
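If you prefer the result as a table rather than a long Series, the counts can be unstacked (a small sketch; fill_value=0 fills in combinations that don't occur):
counts = adata_proc.obs.groupby(['leiden_r05', 'CoEx'])['sample'].value_counts()
counts.unstack('sample', fill_value=0)  # one column per sample, zeros where a combination is absent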
Another option (a bit of an abuse) is the pivot_table interface. In your case it would be:
pd.pivot_table(adata_proc.obs, index=["leiden_r05"], columns=["CoEx", "sample"], values='barcode', aggfunc=len, fill_value=0)
*The 'values' argument is here only to reduce the number of columns, an artifact of using an ill-fitting method.
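If your pandas version supports it, a variant that avoids the dummy 'values' column is aggfunc='size', which simply counts the rows in each cell:
pd.pivot_table(adata_proc.obs, index=["leiden_r05"], columns=["CoEx", "sample"], aggfunc='size', fill_value=0)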
There are SO questions similar to this one; however, I can't find a way to make it work here.
My data frame is:
I just want to get the given dataframe in the form:
I did this:
new = pd.pivot_table(ratings,values='Book-Rating',index='User-ID',columns='ISBN')
The issue is that the duplicate ISBNs are apparently being dropped here. However, I want all the data; no ISBN shows more than one rating because the duplicates have been dropped.
I also tried pivot, but from an SO answer I found out that pivot_table with aggfunc will handle the duplicates. However, I am not sure what value to pass to aggfunc.
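For reference, a minimal sketch of what that could look like, assuming the ratings are numeric and duplicate (User-ID, ISBN) pairs should be averaged ('first' or 'max' are alternatives, depending on which rating should win):
new = pd.pivot_table(ratings, values='Book-Rating', index='User-ID', columns='ISBN', aggfunc='mean')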
Also, the data has more than 400k records, and I get an 'index is out of bounds with axis of size' error as well.
Any guidance in the right direction is appreciated; I can provide the dataframe as well if required. Let me know.
I have data frames that include actors and movies, and my goal is to find the correlation between them.
I have to make adjacency matrices between the actors and movies. I want to make a new dataframe with actors as the column names and movies as the index, with '1' if the actor plays in the movie and '0' if not. This is the output I'm trying to reach:
I don't know how to do that; I only found the pandas.crosstab function but did not understand it. I'm open to any suggestions. Thank you.
Edit: I may not have been able to phrase the question properly because I was so nervous. I can edit it if there is any mistake.
If there are no duplicates in your data, you can use value_counts to find the existing combinations (and their occurrences) and then unstack them. Non-existing combinations have to be filled with zeros:
df[["Movie_name", "Actor_Name"]].value_counts().unstack().fillna(0)
I'm new to coding and having a hard time finding the correct terms to search for to help me with this task. In my work I get some pretty large Excel files from people out in the field monitoring birds. The results need to be prepared for databases, reports, tables, and more. I was hoping to use Python to automate some of these tasks.
How can I use Python (pandas?) to find rows/columns that share a common name/ID but have unique suffixes, and aggregate/sum the results that belong together under that common name? As an example, in the table provided I need to get all the results from the sub-localities, e.g. AA3_f, AA3_lf and AA3_s, expressed as the sum (the total of gulls for each species) of the subs in a new row for the main locality AA3.
Can someone please provide some code for this task, or help me in some other way? I have searched and watched many tutorials on Python, numpy, pandas, and also matplotlib, and I am still clueless on how to set this up.
Any help is appreciated.
Thanks!
Update:
@Harsh Nagouda, thanks for your reply. I tried your example using the groupby function, but I'm having trouble dividing the data into the correct groups. The "Locality" column has only unique values/IDs because they all have a suffix (they are subcategories).
I tried to solve this by slicing the strings:
eng.Locality.str.slice(0,4,1)
I managed to slice off the suffixes so that the remainders were AA3_, AA4_, and so on.
Then I tried to do this slicing inside the groupby call. That failed. Then I tried to slice using pandas.DataFrame.apply(). That failed as well.
eng["Locality"].apply(eng.Locality.str.slice(0,4,1))
sum = eng.groupby(["Locality"].str.slice(0,4,1)).sum()
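For reference, a working variant of that idea (a sketch): groupby accepts a Series as a key, so the sliced strings can be passed to it directly without editing the column first:
totals = eng.groupby(eng["Locality"].str.slice(0, 4)).sum()  # group on the sliced IDs, leaving the column intact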
Any more help out there? As you can see above - I need it :-)
In your case, the groupby option seems to be a good fit for the problem. The groupby function does exactly what its name suggests: it groups the parts of the dataframe you want it to.
Since you mentioned a case based on grouping by localities and finding the sum of those values, this snippet should help you out:
sum = eng.groupby(["Locality"]).sum()
Additional commands and sorting styles can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
I finally figured out a way to get it done. Maybe not the smoothest way, but at least I get the end result I need:
Edited the Locality ID to remove the suffix: eng["Locality"] = eng["Locality"].str.slice(0, 4, 1)
Used the groupby function: sum = eng.groupby(["Locality"]).sum()
End result: (summary table omitted)
I'm learning dataframes now. I've been stuck on how to get a subset of a dataframe or table by its label index. I know it's a very simple question, but I couldn't find the solution in the pandas documentation. I hope someone can help me. I appreciate your help.
So, I have a dataframe named df_teams with team information, including a 'nickname' column (screenshot omitted).
If I want to get a subtable for a specific team, 'Warriors', I can use df_teams[df_teams['nickname']=='Warriors'], which returns that row as a dataframe. My question is, what if I want a subtable of more teams, say the information for both 'Warriors' and 'Hawks', to form a new table? Can I do something similar using a logical index, in one line of code?
You could do a bitwise OR on the two conditions using the '|' operator.
df_teams[(df_teams['nickname']=='Warriors')|(df_teams['nickname']=='Hawks')]
Alternatively, if you have a list of values to check against, you could use the isin method to return the rows whose value is present in the list.
E.g.:
df_teams[df_teams['nickname'].isin(['Warriors','Hawks'])]
I need to do an apply on a dataframe using inputs from multiple rows. As a simple example, I can do the following if all the inputs are from a single row:
df['c'] = df[['a','b']].apply(lambda x: x['a'] + x['b'], axis=1)  # awesome stuff goes here
# or
df['d'] = df[['b','c']].shift(1).apply(...) # to get the values from the previous row
However, if I need 'a' from the current row, and 'b' from the previous row, is there a way to do that with apply? I could add a new 'bshift' column and then just use df[['a','bshift']] but it seems there must be a more direct way.
Related but separate: when accessing a specific value in the df, is there a way to combine label indexing with an integer offset? E.g. I know the label of the current row but need the row before. Something like df.at['labelIknow'-1, 'a'] (which of course doesn't work). This is for when I'm forced to iterate through rows. Thanks in advance.
Edit: Some info on what I'm doing. I have a pandas store containing tables of OHLC bars (one table per security). When backtesting, I currently pull the full date range I need for a security into memory and then resample it to a frequency that makes sense for the test at hand. Then I do some vectorized operations for things like trade entry signals. Finally, I loop over the data from start to finish doing the actual backtest, e.g. checking for trade entry/exit, drawdown, etc. - this looping part is what I'm trying to speed up.
This should directly answer your question and let you use apply, although I'm not sure it's ultimately any better than a two-line solution. It does avoid creating extra variables at least.
df['c'] = pd.concat([df['a'], df['a'].shift()], axis=1).apply(np.mean, axis=1)
That will put the mean of 'a' values from the current and previous rows into 'c', for example.
This isn't as general, but for simpler cases you can do something like this (continuing the mean example):
df['c'] = ( df['a'] + df['a'].shift() ) / 2
That is about 10x faster than the concat() method on my tiny example dataset. I imagine that's as fast as you could do it, if you can code it in that style.
You could also look into reshaping the data with stack() and hierarchical indexing. That would be a way to get all your variables into the same row but I think it will likely be more complicated than the concat method or just creating intermediate variables via shift().
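For completeness, the intermediate-variable approach mentioned in the question would look roughly like this (a sketch; 'b_prev' is a hypothetical helper column name):
df['b_prev'] = df['b'].shift()  # previous row's 'b'
df['c'] = df.apply(lambda row: row['a'] + row['b_prev'], axis=1)  # 'a' from the current row, 'b' from the previous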
For the first part, I don't think such a thing is possible. If you update the question with what you actually want to achieve, I can update this answer.
Also, looking at the second part, your data structure seems to rely an awful lot on the order of rows. This is typically not how you want to manage your databases. Again, if you tell us what your overall goal is, we may be able to point you towards a solution (and potentially a better way to structure the database).
Anyhow, one way to get the row before, if you know a given index label, is to do:
df.loc[:'labelYouKnow'].iloc[-2]  # .loc slices by label inclusively (.ix is deprecated)
Note that this is not the optimal thing to do efficiency-wise, so you may want to improve your db structure in order to prevent the need for doing such things.
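As an alternative to slicing, you can translate the label into an integer position with Index.get_loc and then offset it (a sketch, assuming the index labels are unique):
i = df.index.get_loc('labelYouKnow')  # integer position of the known label
previous_row = df.iloc[i - 1]         # the row just before it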