I have encountered the following problem when trying to make a boxplot of one column in a pandas.DataFrame vs another one. Here is the code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(60))
df.columns = ['Values']
names = ('one','two','three')*int(df.shape[0]/3)
df['Names'] = names
df.plot(x='Names', y='Values', kind='box')
df.boxplot(column='Values', by='Names')
I expected the two plots to be the same, but they are not.
Is this expected behavior and, if so, how should the expression for the first plot be changed to match the second one?
.boxplot() and .plot(kind='box')/.plot.box() are separate implementations. The problem with .plot(kind='box')/.plot.box() is that, although the by argument exists, it is not implemented and is therefore ignored (see this issue for example; it was never properly documented either), so you cannot reproduce the result you get with .boxplot().
TL;DR: .plot(kind='box')/.plot.box() is poorly implemented; use .boxplot() instead.
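If you still want to go through .plot(kind='box'), one possible workaround is to reshape the data first so that each group becomes its own column, since that method draws one box per column. This is only a sketch under that assumption; the name wide and the fixed random seed are illustrative and not part of the original question.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame({'Values': rng.random(60)})
df['Names'] = ('one', 'two', 'three') * (len(df) // 3)

# Spread the groups into columns: one column per name, NaN where a row
# belongs to a different group.
wide = df.pivot(columns='Names', values='Values')

# Each column now holds the values of exactly one group.
counts = wide.notna().sum()

# wide.plot(kind='box')  # one box per column (requires matplotlib)
```

The commented-out last line is where the plot would be drawn; the reshaping itself is what stands in for the missing by support.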
I have dataframes whose names follow the pattern 'df#', and I would like to loop through them inside a function. In the code below, if the function "testing" is removed, the loop works as expected. When I add the function, it fails on the "test" variable with KeyError: 'iris1'.
import statistics
import seaborn as sns
iris1 = sns.load_dataset('iris')
iris2 = sns.load_dataset('iris')
def testing():
    rows = []
    for i in range(2):
        test = vars()['iris' + str(i + 1)]  # raises KeyError inside the function
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])
testing()
The reason this will be valuable is because I am subsetting my dataframe df multiple times to create quick visualizations. So in Jupyter, I have one cell where I create visualizations off of df1,df2,df3. In the next cell, I overwrite df1,df2,df3 based on different subsetting rules. This is advantageous because I can quickly do this by calling a function each time, so the code stays quite uniform.
Store the datasets in a dictionary and pass that to the function.
import statistics
import seaborn as sns
datasets = {'iris1': sns.load_dataset('iris'), 'iris2': sns.load_dataset('iris')}
def testing(data):
    rows = []
    for i in range(1, 3):
        test = data[f'iris{i}']
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])
    return rows  # hand the collected means back to the caller
testing(datasets)
No...
You should never end up writing a sentence like "I have dataframes that follow name syntax of 'df#'".
What you have then is a list of dataframes, or a dict of dataframes, depending on how you want to index them.
Here I would say a list.
Then you can forget about vars(); trust me, you don't need it... :)
EDIT:
And with a list comprehension, your code could fit in three lines:
import statistics
import seaborn as sns
list_iris = [sns.load_dataset('iris'), sns.load_dataset('iris')]
rows = [
    (statistics.mean(test['sepal_length']), statistics.mean(test['sepal_width']))
    for test in list_iris
]
Storing the dataframes in a list or dictionary allowed me to create the function. There is still the problem that the number of dataframes in the list varies. It would be nice to be able to just pass an argument n specifying how many objects are in the list (I guess I could add a bunch of if statements to define the list based on such an argument). EDIT: I am changing my code so that I don't use the df# syntax and instead put the dataframes directly into a list.
The problem I was experiencing is still perplexing. I can't for the life of me figure out why the "test" variable performs as expected outside of a function, but inside of a function it fails. I'm going to go the route of creating a list of dataframes, but am still curious to understand why it fails inside of the function.
I agree with @Icarwiz that it might not be the best way to go about it, but you can make it work with:
test=eval('iris'+str(i+1))
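As for why the lookup fails inside the function: called with no argument, vars() behaves like locals(), so inside a function it only sees that function's own local names, not the names defined at the top level of the notebook or module. The sketch below illustrates this with a plain string standing in for a dataframe; the function names are made up for the illustration.

```python
iris1 = "first dataset"  # stand-in for a dataframe defined at the top level

def lookup_with_vars():
    # Inside a function, vars() is the function's local namespace,
    # which does not contain the top-level name 'iris1'.
    return 'iris1' in vars()

def lookup_with_globals():
    # globals() is the module namespace, where top-level names live.
    return globals()['iris1']

found_at_top_level = 'iris1' in vars()   # True: same scope as the assignment
found_in_function = lookup_with_vars()   # False: different scope
```

Replacing vars() with globals() would make the original loop work, and eval succeeds for the same reason (it resolves names through the normal local-then-global lookup), but a list or dict of dataframes remains the cleaner design.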
I have a pandas dataframe named df which has several columns, one of which is called 'height'. I want to replace heights greater than 3 with 2. I used this code:
df.loc[df['height']>3,'height'] = 2
This does not make any modification.
When I use this code instead:
df.loc[df['height']>3,['height']] = 2
it works perfectly fine. I sort of understand that in the first version a copy of df was created, but it is not clear why, and I find it weird that the [] makes such a difference.
Could someone explain what is happening?
The problem is that you didn't provide a minimal, reproducible example, so trying to help you becomes a guessing game:
Assuming you have a simple dataframe like
import pandas as pd
df = pd.DataFrame({'height':range(6)})
then both of your code lines work perfectly fine, except that the additional indexing operator [ ] in the second version is completely unnecessary, and no copy of df is involved in either case.
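A quick way to check this claim is to run both assignments on separate copies of the simple dataframe and compare the results. The names a and b are only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'height': range(6)})

# First form: column given as a single label.
a = df.copy()
a.loc[a['height'] > 3, 'height'] = 2

# Second form: column given as a one-element list.
b = df.copy()
b.loc[b['height'] > 3, ['height']] = 2

same_result = a.equals(b)
```

If the first form really does nothing on your own data, the cause is likely somewhere else (for example, how df was created), which is why a reproducible example matters.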
I am stuck on an issue with taking slices of my data in Python (I come from Matlab).
Here is the code I'm using:
import scipy.io as sc
import math as m
import numpy as np
from scipy.linalg import expm, sinm, cosm
import matplotlib.pyplot as plt
import pandas as pd
import sys
data = pd.read_excel('DataDMD.xlsx')
print(data.shape)
print(data)
The output looks like this:
[Image: Output]
I wish to take only certain rows (or, from my understanding, slices in Python) of this data matrix. The other problem is that the top row of my matrix is treated as the titles of the columns instead of as actual data points. So I have three questions:
1) I don't need the top of the matrix to have any 'titles' or anything of that sort, because it is all numeric and all of it represents data.
2) I only need to take every 6th row of the whole matrix as a new data matrix.
3) I plan on using matrix multiplication later, so is pandas enough or do I need numpy?
So this is what I've tried,
data.iloc[0::6,:]
This gives me something like this:
[Image: Output2]
which is wrong, because I don't need the values of 24.8 to be the 'title' but to be the first row of the new matrix.
I've also tried using np.array for this, but when I try to use iloc on it, it says (which makes sense)
'numpy.ndarray' object has no attribute 'iloc'
If anyone has any ideas, please let me know! Thanks!
To avoid loading the first record as the header, try using the following:
pd.read_excel('DataDMD.xlsx', header=None)
The read_excel function has a header argument, which indicates which row of the data should be used as the header. Its default value is 0. Pass None if none of the rows in your data serves as the header.
There are many useful arguments, all described in the documentation of the function.
This should also help with number 2.
Hope this helps.
Good luck!
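To make the remaining steps concrete, here is a small sketch of taking every 6th row and then moving to numpy for matrix multiplication. Since the Excel file is not available here, a made-up 12x3 numeric frame stands in for the result of pd.read_excel('DataDMD.xlsx', header=None); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for: data = pd.read_excel('DataDMD.xlsx', header=None)
data = pd.DataFrame(np.arange(36, dtype=float).reshape(12, 3))

# Every 6th row, starting with the first one (positions 0, 6, ...).
every_sixth = data.iloc[::6, :]

# Convert to a plain numpy array for linear algebra.
matrix = every_sixth.to_numpy()
product = matrix @ matrix.T
```

Converting with to_numpy() is one option for question 3; it also explains the iloc error, because iloc belongs to pandas objects, while a numpy array is sliced directly, as in array[::6, :].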
I've been experimenting with Seaborn's lmplot() and Statsmodels .ols() functions for simple linear regression plots and their associated p-values, r-squared, etc.
I've noticed that when I specify which columns I want to use for lmplot, I can specify a column even if it has multiple words for it:
import seaborn as sns
import pandas as pd
input_csv = pd.read_csv('./test.csv',index_col = 0,header = 0)
input_csv
sns.lmplot(x='Age',y='Count of Specific Strands',data = input_csv)
<seaborn.axisgrid.FacetGrid at 0x2800985b710>
However, if I try to use ols, I get an error when passing "Count of Specific Strands" as my dependent variable (I've only listed the last few lines of the error):
import statsmodels.formula.api as smf
test_results = smf.ols('Count of Specific Strands ~ Age',data = input_csv).fit()
File "<unknown>", line 1
Count of Specific Strands
^
SyntaxError: invalid syntax
Conversely, if I select the "Count of Specific Strands" column by position as shown below, the regression works:
test_results = smf.ols('input_csv.iloc[:,1] ~ Age',data = input_csv).fit()
test_results.summary()
Does anyone know why this is? Is it just because of how Statsmodels was written? Is there an alternative to specify the dependent variable for regression analysis that doesn't involve iloc or loc?
This is due to the way the formula parser patsy is written: see this link for more information
The authors of patsy have, however, thought of this problem: (quoted from here)
This flexibility does create problems in one case, though – because we
interpret whatever you write in-between the + signs as Python code,
you do in fact have to write valid Python code. And this can be tricky
if your variable names have funny characters in them, like whitespace
or punctuation. Fortunately, patsy has a builtin “transformation”
called Q() that lets you “quote” such variables
Therefore, in your case, you should be able to write:
smf.ols('Q("Count of Specific Strands") ~ Age',data = input_csv).fit()
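As a self-contained check of the Q() approach, the sketch below fits the same kind of formula on a tiny made-up dataset (the numbers are invented for illustration and are not from test.csv).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented data with a column name that contains spaces.
input_csv = pd.DataFrame({
    'Age': [20, 30, 40, 50, 60],
    'Count of Specific Strands': [41, 62, 79, 102, 121],
})

# Q("...") quotes the column name so the formula parser accepts it.
test_results = smf.ols('Q("Count of Specific Strands") ~ Age', data=input_csv).fit()
slope = test_results.params['Age']
```

Another option, if the quoting becomes cumbersome, is to rename the columns to valid Python identifiers with DataFrame.rename before building the formula.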
I'm currently following a tutorial in which you get data from quandl. I know the basics of python, but I can't figure out what type is returned by quandl.get. I have searched for it, but everything I found refers to pandas. Do you know what type it is and where I can read some documentation about it?
Here is the code I don't understand:
df = quandl.get('WIKI/GOOGL')
df = df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume']]
I also don't get why there are double brackets in the second line, but I hope it will be clear once I can read something about it.
According to Quandl's documentation, it is by default a pandas data frame, and you can set returns = "numpy" to get a numpy object as an alternative.
Definition:
quandl.get(":database_code/:dataset_code", returns = ":return_format")
Parameters:
The parameter returns can take the following values:
"numpy" : returns a numpy object
"pandas" : returns a pandas data frame
When returns is omitted, a pandas dataframe is returned.
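The two sets of square brackets serve different purposes: the outer pair is the indexing operator on the dataframe, and the inner pair is an ordinary Python list of column names. Passing a list of column labels selects those columns and returns a new dataframe containing only them, whereas a single label such as df['Adj. Close'] returns one column as a Series.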
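The difference between single and double brackets is easy to see on a small made-up frame (the values and the extra column 'Other' are invented for illustration; no Quandl download is needed).

```python
import pandas as pd

df = pd.DataFrame({
    'Adj. Open': [1.0, 2.0],
    'Adj. Close': [1.5, 2.5],
    'Other': [0, 1],
})

# Single label: one column, returned as a Series.
single = df['Adj. Open']

# List of labels: a new DataFrame with just those columns, in that order.
subset = df[['Adj. Open', 'Adj. Close']]
```

Here single is a pandas Series, while subset is a DataFrame with two columns, which is what the second line of the tutorial code relies on.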