I am trying to automate the plotting of a large dataframe. The goal is to plot each column against every other column, where each column represents a variable. See also the image below.
For example: sex vs age, sex vs BMI, sex vs smoke, sex vs type, and so on.
For the sake of clarity, I have simplified the problem to the image below:
Initially, I tried to plot each combination by hand, but this is a rather time-consuming exercise and not what I want.
I also tried this (not working):
variables = ["Sex", "Age", "BMI"]
for variable in variables:
    plt.scatter(df.variable, df.variable)
    plt.xlabel('variable')
    plt.ylabel('variable')
    plt.title('variable vs. variable')
    plt.show()
Any help is welcome!
PS: If it would be a simple exercise to incorporate a linear regression on each combination of variables as well, that would also be appreciated.
Greetings,
Nadia
What you coded plots each column against itself. What you described is a nested loop. A simple upgrade is
col_choice = ["Sex", "Age", "BMI"]
for pos, axis1 in enumerate(col_choice):   # Pick a first col
    for axis2 in col_choice[pos+1:]:       # Pick a later col
        plt.scatter(df.loc[:, axis1], df.loc[:, axis2])
        plt.xlabel(axis1)
        plt.ylabel(axis2)
        plt.show()
df.loc[:, name] returns a Series, which scatter accepts.
Does that help? If you want to be more "Pythonic", then look into itertools.combinations to generate your column pairs.
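As a small sketch of the pairing step only: itertools.combinations yields each unordered pair of distinct columns exactly once, so (unlike itertools.product) it skips self-pairs and reversed duplicates. The helper name column_pairs is hypothetical.

```python
from itertools import combinations

def column_pairs(columns):
    """Return every unordered pair of distinct column names, in order."""
    return list(combinations(columns, 2))

pairs = column_pairs(["Sex", "Age", "BMI"])
print(pairs)  # [('Sex', 'Age'), ('Sex', 'BMI'), ('Age', 'BMI')]
```

Each (x, y) pair can then be passed to plt.scatter(df[x], df[y]) inside a single loop.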
You could do something like this:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Create dummy dataframe, or load your own with pd.read_csv()
columns = ["sex", "age", "BMI", "smoke", "type"]
data = pd.DataFrame(np.array([[1,0,0,1,0], [23,16,94,18,24], [32, 26, 28, 23, 19], [0,1,1,1,0], [1,2,2,2,1]]).T, columns=columns)
x_col = "sex"
y_columns = ["age", "BMI", "smoke"]
for y_col in y_columns:
    fig, ax = plt.subplots()  # one new figure per pair of columns
    ax.scatter(data[x_col], data[y_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title("{} vs {}".format(x_col, y_col))
plt.show()  # display all figures
Basically, if you have your dataset saved as a .csv file, you can load it with pandas using pd.read_csv(), use the column names as keys to access the corresponding columns, and iterate over those (here I created a dummy dataframe just for illustration).
Regarding the linear regression part, you should check out the scikit-learn library. It has models for many different tasks, such as regression, classification and clustering.
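For a single pair of columns, a lighter-weight option than scikit-learn is np.polyfit with degree 1, which fits a least-squares straight line. This is only a minimal sketch: the technique is swapped in here to avoid an extra dependency, the helper name fit_line is hypothetical, and the numbers are made up for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares straight line y = slope * x + intercept."""
    slope, intercept = np.polyfit(np.asarray(x, dtype=float),
                                  np.asarray(y, dtype=float), deg=1)
    return slope, intercept

# Made-up illustrative values that lie exactly on y = 2x + 1
x = [0, 1, 2, 3]
y = [1, 3, 5, 7]
slope, intercept = fit_line(x, y)
```

To overlay the fit on a scatter plot, something like ax.plot(xs, slope * xs + intercept) after ax.scatter(...) should work, where xs is a NumPy array of x values.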
I need to check the data distributions of all the numeric columns in a dataset. I chose the Fitter library to do so. I loop over all columns but get only one plot and one summary table as the outcome. What is wrong with my code?
from fitter import Fitter
import numpy as np
df_numeric = df.select_dtypes(include=np.number).sample(n=5000)
num_cols = df_numeric.columns.tolist()
distr = ['cauchy',
'chi2',
'expon',
'exponpow',
'gamma',
'beta',
'lognorm',
'logistic',
'norm',
'powerlaw',
'rayleigh',
'uniform']
for col in num_cols:
modif_col = df_numeric[col].fillna(0).values
dist_fitter = Fitter(modif_col, distributions=distr)
dist_fitter.fit()
dist_fitter.summary()
Maybe there is another approach to check distributions in a loop?
It looks like your code is correctly looping over all the numeric columns in the dataframe, fitting different distributions to each column using the Fitter library, and then printing a summary of the fitting results. However, you're only seeing one plot and summary table as the outcome because you're overwriting the plot and summary table for each iteration of the loop.
To see a separate plot and summary table for each column, keep the call to dist_fitter.summary() inside the loop, open a new figure for each column, and give each plot a unique title so you can tell them apart.
Here is a code example you can use:
import matplotlib.pyplot as plt
for col in num_cols:
    modif_col = df_numeric[col].fillna(0).values
    dist_fitter = Fitter(modif_col, distributions=distr)
    dist_fitter.fit()
    plt.figure()                      # new figure per column
    summary = dist_fitter.summary()   # draws the histogram with the best fits and returns a table
    plt.title(col)
    plt.show()
    print(col)
    print(summary)
I have daily crude oil prices downloaded from FRED, about 10k observations; some values are blank (the code cleans them). I believe that I cannot share Excel sheets here, so I will just give you a screenshot of what the data looks like:
I calculate the differences and returns and clean up the data but I am kind of stuck.
Here is what the code looks like to get you started:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("DCOILWTICO.csv")
nan_value = float("NaN")
data.replace("", nan_value, inplace=True)
data.replace(".", nan_value, inplace=True)
data['Previous'] = data['DCOILWTICO'].shift(1)
data.dropna(subset=['Previous'],inplace=True)
data.replace("", nan_value, inplace=True)
data.replace(".", nan_value, inplace=True)
data['DCOILWTICO'] = data['DCOILWTICO'].astype(float)
data['Previous'] = data['Previous'].astype(float)
data['Diff'] = data['DCOILWTICO'] - data['Previous']
data['Return'] = (data['DCOILWTICO'] - data['Previous'])/data['Previous']
Here comes the question: I am trying to duplicate the graph below (which I believe was generated using Mathematica). The difficult part is creating the bins in the right way. Looking at the graph, there appear to be around 200 bins. The returns are on the x-axis and the binned frequencies are on the y-axis.
I think you are asking how to make equally spaced bins in log space. If so, then use the np.geomspace function (geometric spacing) rather than np.linspace (linear spacing).
plt.figure()
bins = np.geomspace(data['Return'].min(), data['Return'].max(), 200)  # the column created above is named 'Return'
plt.hist(data['Return'], bins=bins)
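Daily returns are both negative and positive, while geometric spacing is only well defined between two nonzero endpoints of the same sign. One possible workaround, offered as an assumption about the intended plot rather than something the question confirms, is to keep linearly spaced bins and put the counts on a log scale instead. The sketch below uses synthetic returns as a stand-in for data['Return'], and the helper name linear_bins is hypothetical.

```python
import numpy as np

def linear_bins(values, n_bins=200):
    """Return the edges of n_bins equal-width bins spanning the finite values."""
    values = np.asarray(values, dtype=float)
    values = values[np.isfinite(values)]          # drop NaN/inf before binning
    return np.linspace(values.min(), values.max(), n_bins + 1)

# Synthetic stand-in for data['Return'] (made-up numbers, not FRED data)
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=0.02, size=10_000)

edges = linear_bins(returns, n_bins=200)
counts, _ = np.histogram(returns, bins=edges)
```

With matplotlib, plt.hist(returns, bins=edges, log=True) would draw the same bins with a logarithmic count axis.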
I have a survey dataset covering people of different ages and their use of various social media platforms. I want to calculate the average age of users and non-users for each social media app. Here is what the example data looks like:
Here is a reproducible pandas dataframe:
df=pd.DataFrame({'age': np.random.randint(10,100,size=10),
'web1a': np.random.choice([1, 2], size=(10,)),
'web1b': np.random.choice([1, 2], size=(10,), p=[1./3, 2./3]),
'web1c': np.random.choice([1, 2], size=(10,)),
'web1d': np.random.choice([1, 2], size=(10,))})
here is what I tried:
df.pivot_table(df, values='web1a', index='age', aggfunc='mean')
but it is not efficient and didn't produce my desired output. Any idea how to get this done? Thanks
Update:
To me, the way to do this is to first select the categorical values in each column and compute the mean for each, which can be repeated for the other columns. If I do that, how can I plot them nicely?
Note that in columns web1a, web1b, web1c and web1d, 1 means user and 2 means non-user. I want to compute the average age of users and non-users. How can I do that? Can anyone give me an idea of how to make this happen? Thanks!
Using melt and a groupby on the index levels (mean(level=...) was removed in pandas 2.0):
df.melt('age').set_index(['variable','value']).groupby(level=[0,1]).mean().unstack().plot(kind='bar')
This can be done using groupby method:
df.groupby(['web1a', 'web1b', 'web1c', 'web1d']).mean()
You can groupby the 'web*' columns and calculate the mean on the 'age' column.
You can also plot bar charts (colors can be defined in the subplot). I'm not sure pie charts make sense in this case.
I tried with your data, taking only the columns starting with 'web'. There are more values than '1's and '2's, so I assumed you only wanted to analyze the users and non-users and nothing else. You can change the values or add other values to the chart in the same way, as long as you know which values you want to draw.
import numpy as np
from matplotlib import pyplot as plt
df = df.filter(regex='web|age', axis=1)
userNr = '1'       # use the integers 1 and 2 instead if your columns hold numbers
nonUserNr = '2'
users = list()
nonUsers = list()
labels = [x for x in df.columns.tolist() if 'web' in x]
for col in labels:
    mean_age = df.groupby(col)['age'].mean()   # mean age per answer value
    users.append(mean_age.loc[userNr])
    nonUsers.append(mean_age.loc[nonUserNr])
x = np.arange(1, len(labels)+1)
ax = plt.subplot(111)
ax.bar(x-0.1, users, width=0.2, color='g')
ax.bar(x+0.1, nonUsers, width=0.2, color='r')
plt.xticks(x, labels)
plt.legend(['users','non-users'])
plt.show()
df.melt(id_vars='age').groupby(['variable', 'value']).mean()
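To make the melt-then-groupby idea concrete, here is a minimal sketch on a tiny made-up survey (the values are illustrative only, not the asker's data). melt turns the platform columns into rows, so a single groupby gives the mean age per platform and answer.

```python
import pandas as pd

# Small made-up survey: 1 = user, 2 = non-user (values are illustrative only)
df = pd.DataFrame({
    "age":   [20, 30, 40, 50],
    "web1a": [1, 1, 2, 2],
    "web1b": [1, 2, 1, 2],
})

# Long format: one row per (respondent, platform) with the answer in 'value'
long_df = df.melt(id_vars="age")

# Mean age for every platform / answer combination
mean_age = long_df.groupby(["variable", "value"])["age"].mean()
print(mean_age)
```

Calling mean_age.unstack().plot(kind="bar") would then draw one group of bars per platform.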
I have a bunch of data files, each containing a number of samples per time frame, where that number depends on the system. For instance, at time t=1 one file might have 10 items and another 20, but within a file every later frame always has the same number of items. The format is time, x, y, z in columns, loaded into a numpy array. The time values show which frame a row belongs to; let's go with 10 items per frame as an example. So each frame is a (10,4) numpy array whose time values are identical, and with, say, 100 frames in the file, the whole array is (1000,4).
I want to plot the data with time on the x-axis and manipulations of the other data on the y-axis, but I am unsure how to do this with the line plot methods in matplotlib. Normally, to provide both x and y values, I believe I need a scatter plot, so I'm hoping there's a better way to do this.
Ideally, I want to treat each row that shares a time code as a different series (so it is coloured differently), and the data for that same row number in the next frame (time value) should get the same colour, giving nice contiguous lines. We can look at the time column and figure out how many items share a time code; let's call it "n". Sample code:
a = numpy.loadtxt('sampledata.txt')
plt.plot(a[:0,:,n],a[:1,:1])
plt.show()
I think this code expresses what I'm going for, though it doesn't work.
I hope this is what you wanted.
seaborn's scatterplot can group data that share the same code (the time code in this case) and give each group its own colour.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"E:\Programming\Python\Matplotlib\timecodes.csv",
                 names=["time","x","y","z","code"])  # use your file
df["time"] = pd.to_datetime(df["time"])  # parse the column as datetimes
df["x"] = df["time"].dt.day  # keep only the day of the month in the x column; easier to see on the graph
# I just used random numbers for y and z in my data.
sns.scatterplot(x="x", y="y", data=df, hue="code")  # hue does the grouping
plt.show()
I used a csv file here, but you can do the same with your text file by adding sep="\t" to the arguments. I also added a code column to the file. If you have one, the code can group the data in the graph, so you don't have to split the data or build a hierarchical index. If you want to change colors or grouping, please see the seaborn website.
Hope this helps.
Alternatively, here is the method I used, though Tim's answer is still accurate as well. Since the time codes are not date/time information, I modified my own code to add tags as a second column that I call "p" (they're polymers).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
datain = np.loadtxt('somefile.txt')
df = pd.DataFrame(data=datain, columns=["t","p","x","y","z"])
ax = sns.scatterplot(x="t", y="x", data=df, hue="p")
plt.show()
And of course the other columns can be plotted similarly if desired.
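If contiguous lines are preferred over a scatter plot, one possible sketch is to reshape the (frames*n, 4) array into (frames, n, 4) so that each item becomes its own column. This assumes the rows are stored frame by frame with increasing time values and that every frame lists its n items in the same order; the helper name split_frames and the sample numbers are hypothetical.

```python
import numpy as np

def split_frames(a):
    """Split a (frames*n, 4) array of [time, x, y, z] rows into
    times of shape (frames,) and values of shape (frames, n, 3)."""
    a = np.asarray(a, dtype=float)
    times, counts = np.unique(a[:, 0], return_counts=True)
    n = counts[0]
    if not np.all(counts == n):
        raise ValueError("every frame must contain the same number of items")
    framed = a.reshape(len(times), n, 4)   # relies on frame-by-frame row order
    return times, framed[:, :, 1:]

# Tiny made-up example: 3 frames, 2 items per frame
sample = np.array([
    [1, 10, 0, 0], [1, 20, 0, 0],
    [2, 11, 0, 0], [2, 21, 0, 0],
    [3, 12, 0, 0], [3, 22, 0, 0],
])
times, values = split_frames(sample)
x_per_item = values[:, :, 0]   # shape (frames, n): one column per item
```

Because matplotlib draws each column of a 2D y array as a separate line, plt.plot(times, x_per_item) would give one coloured line per item.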
I'm trying to create a histogram from grouped data in pandas.
So far I was able to create a standard line plot, but I can't figure out how to do the same to get a histogram (bar chart). I would like to get two age histograms, one for people who survived the Titanic disaster and one for those who didn't, to see if there is a difference in age distribution.
Source data:
https://www.udacity.com/api/nodes/5454512672/supplemental_media/titanic-datacsv/download
So far my code:
import pandas as pn
titanic = pn.DataFrame.from_csv('titanic_data.csv')
SurvivedAge= titanic.groupby(['Survived','Age']).size()
SurvivedAge=SurvivedAge.reset_index()
SurvivedAge.columns=['Survived', 'Age', 'Num']
SurvivedAge.index=(SurvivedAge['Survived'])
del SurvivedAge['Survived']
SurvivedAget=SurvivedAge.reset_index().pivot('Age', 'Survived','Num')
SurvivedAget.plot()
When I try to plot a histogram from this data set, I get strange results.
SurvivedAget.hist()
I would be grateful for help with that.
You can:
import pandas as pd
titanic = pd.read_csv('titanic_data.csv')
survival_by_age = titanic.groupby(['Age', 'Survived']).size().unstack('Survived')
survival_by_age.columns = ['No', 'Yes']
survival_by_age.plot.bar(title='Survival by Age')
This gives a bar chart with one pair of bars per age, which you can further tweak. You could also consolidate the fractional ages so you can use integer indices, or bin the data into, say, 5-year age spans to get more user-friendly output. And then there is seaborn, with various types of distribution plots.
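The 5-year binning idea could look roughly like the sketch below, which uses pd.cut on a few made-up passengers standing in for titanic_data.csv (the values, the bin range of 0 to 80 and the label format are all assumptions for illustration).

```python
import pandas as pd

# Made-up passengers standing in for titanic_data.csv (illustrative values only)
titanic = pd.DataFrame({
    "Age":      [2.0, 4.0, 22.0, 24.5, 23.0, 61.0],
    "Survived": [1,   1,   0,    1,    0,    0],
})

# Left-closed 5-year spans: [0, 5), [5, 10), ..., [75, 80)
edges = list(range(0, 85, 5))
labels = ["{}-{}".format(lo, lo + 5) for lo in edges[:-1]]
age_group = pd.cut(titanic["Age"], bins=edges, labels=labels, right=False)

# Passenger counts per age span, one column per outcome
survival_by_span = (
    titanic.groupby([age_group, "Survived"], observed=True)
    .size()
    .unstack("Survived", fill_value=0)
    .rename(columns={0: "No", 1: "Yes"})
)
```

survival_by_span.plot.bar(title='Survival by Age') would then show one pair of bars per age span instead of one per individual age.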