Stacked Histogram not working in Pandas - python

I'm playing with Pandas and have the following code:
tips.hist(stacked=True, column="total_bill", by="time")
The resulting graph looks nice:
However, it is not stacked! I want both histograms on one plot, stacked on top of each other, like the example in the docs: http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms
Any help would be greatly appreciated.

You need the values in separate columns.
>>> import pandas as pd
>>> tips = pd.read_csv('https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv')
>>> tips[['time', 'tip']].pivot(columns='time').plot(kind='hist', stacked=True)
>>> tips[['time', 'tip']].pivot(columns='time').head()
        tip      
time Dinner Lunch
0      1.01   NaN
1      1.66   NaN
2      3.50   NaN
3      3.31   NaN
4      3.61   NaN
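Putting the answer together with the question's column, a self-contained sketch might look like this (the bins value is just illustrative):

import pandas as pd
import matplotlib.pyplot as plt

tips = pd.read_csv('https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv')

# pivot so each 'time' value (Dinner/Lunch) becomes its own column,
# then plot those columns as one stacked histogram
pivoted = tips[['time', 'total_bill']].pivot(columns='time')
pivoted.plot(kind='hist', stacked=True, bins=20)
plt.show()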

Related

Rolling Correlation of a Multi-Column Pandas DataFrame

I am trying to calculate and then visualize the rolling correlation between multiple columns in a 180-day window (3 days in this example).
My data is formatted like this (the original file has 12 columns plus the timestamp and thousands of rows):
import numpy as np
import pandas as pd
df = pd.DataFrame({"Timestamp": ['1993-11-01', '1993-11-02', '1993-11-03', '1993-11-04', '1993-11-15'],
                   "Austria": [6.18, 6.18, 6.17, 6.17, 6.40],
                   "Belgium": [7.05, 7.05, 7.2, 7.5, 7.6],
                   "France": [7.69, 7.61, 7.67, 7.91, 8.61]},
                  index=[1, 2, 3, 4, 5])
    Timestamp  Austria  Belgium  France
1  1993-11-01     6.18     7.05    7.69
2  1993-11-02     6.18     7.05    7.61
3  1993-11-03     6.17     7.20    7.67
4  1993-11-04     6.17     7.50    7.91
5  1993-11-15     6.40     7.60    8.61
I can't simply use the following, because I get an error due to the Timestamp column:
df.rolling(2).corr(df)
ValueError: could not convert string to float: '1993-11-01'
When I drop the Timestamp column I get a result of 1.0 for every cell, which is also not right, and in addition I lose the Timestamp, which I will need for the visualization in the end.
df_drop = df.drop(columns=['Timestamp'])
df_drop.rolling(2).corr(df_drop)
   Austria  Belgium  France
1      NaN      NaN     NaN
2      NaN      NaN     1.0
3      1.0      1.0     1.0
4     -inf      1.0     1.0
5      1.0      1.0     1.0
Does anyone have experience with doing a rolling correlation over multiple columns while keeping a date index?
Building on the answer of Shreyans Jain I propose the following. It should work with an arbitrary number of columns:
import itertools as it

# omit the timestamp column
cols = list(df.columns)[1:]
# -> ['Austria', 'Belgium', 'France']
col_pairs = list(it.combinations(cols, 2))
# -> [('Austria', 'Belgium'), ('Austria', 'France'), ('Belgium', 'France')]

res = pd.DataFrame()
for pair in col_pairs:
    # use the first three letters of each column name of the pair
    corr_name = f"{pair[0][:3]}_{pair[1][:3]}_corr"
    res[corr_name] = df[list(pair)].\
        rolling(min_periods=1, window=3).\
        corr().iloc[0::2, -1].reset_index(drop=True)

print(str(res))
   Aus_Bel_corr  Aus_Fra_corr  Bel_Fra_corr
0           NaN           NaN           NaN
1           NaN           NaN           NaN
2     -1.000000     -0.277350      0.277350
3     -0.755929     -0.654654      0.989743
4      0.693375      0.969346      0.849167
The NaN values at the beginning result from the windowing.
Update: I uploaded a notebook with detailed explanations for what happens inside the loop.
https://github.com/cknoll/demo-material/blob/main/pandas/pandas_rolling_correlation_iloc.ipynb
You can probably calculate the pair-wise correlations like this, instead of going for all 3 at once.
Once you have the correlations, you can add them directly as columns, preserving the timestamp.
df['Aus_Bel_corr'] = df[['Austria', 'Belgium']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
df['Bel_Fra_corr'] = df[['Belgium', 'France']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
df['Aus_Fra_corr'] = df[['Austria', 'France']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
I guess there is another way.
df['Aus_Bel_corr'] = df['Austria']\
    .rolling(min_periods=1, window=3)\
    .corr(df['Belgium'])
For me, I think it is a little simpler than the previous answer.
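Combining the two answers above, here is a rough sketch (using the column names from the question) that computes every pairwise rolling correlation with the simpler .corr(other) form while keeping the Timestamp column for the later visualization:

import itertools as it

value_cols = [c for c in df.columns if c != 'Timestamp']
res = df[['Timestamp']].copy()  # keep the timestamps for plotting later
for a, b in it.combinations(value_cols, 2):
    # e.g. 'Aus_Bel_corr', 'Aus_Fra_corr', 'Bel_Fra_corr'
    res[f'{a[:3]}_{b[:3]}_corr'] = df[a].rolling(min_periods=1, window=3).corr(df[b])
print(res)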

How to update a Pandas Panel without duplicates

Currently I'm working on live-timing software for a motorsport application. For that I have to crawl a live-timing webpage and copy the data into a big DataFrame. This DataFrame is the source of several diagrams I want to make. To keep my DataFrame up to date, I have to crawl the webpage very often.
I can download the data and save it as a pandas DataFrame. My problem is the step from the freshly downloaded DataFrame to the big DataFrame that includes all the data.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6], 'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': ['1:30,000', '1:45,000', '1:50,000', '1:25,333', '1:13,366', '1:17,000'],
                    'Laps': ['1', '1', '1', '1', '1', '1']})
df2 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6], 'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                    'Laps': ['2', '2', '2', '2', '2', '2']})
df3 = pd.DataFrame({'Pos': [1, 2, 3, 4, 5, 6], 'CLS': ['V5', 'V5', 'V5', 'V4', 'V4', 'V4'],
                    'Nr.': ['13', '700', '30', '55', '24', '985'],
                    'Zeit': ['1:31,000', '1:41,000', '1:51,000', '1:21,333', '1:11,366', '1:11,000'],
                    'Laps': ['2', '2', '2', '2', '2', '2']})

df1.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df2.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df3.set_index(['CLS', 'Nr.', 'Laps'], inplace=True)
df1 shows a DataFrame from previous laps.
df2 shows a DataFrame during the second lap. The lap is not completed yet, so I have a NaN.
df3 shows a DataFrame after the second lap is completed.
My target is to have just one row for each lap per car per class.
Either I end up with duplicates for incomplete laps, or all data gets overwritten.
I hope that someone can help me with this problem.
Thank you so far.
MrCrunsh
If I understand your problem correctly, your issue is that you have overlapping data for the second lap: information while the lap is still in progress and information after it's over. If you want to put all the information for a given lap in one row, I'd suggest using multi-index columns or changing the column names to reflect the difference between measurements taken during and after a lap.
df = pd.concat([df1, df3])
df = pd.concat([df, df2], axis=1, keys=['after', 'during'])
The result will look like this:
                after           during     
                  Pos      Zeit    Pos Zeit
CLS Nr. Laps                               
V4  24  1           5  1:13,366    NaN  NaN
        2           5  1:11,366    5.0  NaN
    55  1           4  1:25,333    NaN  NaN
        2           4  1:21,333    4.0  NaN
    985 1           6  1:17,000    NaN  NaN
        2           6  1:11,000    6.0  NaN
V5  13  1           1  1:30,000    NaN  NaN
        2           1  1:31,000    1.0  NaN
    30  1           3  1:50,000    NaN  NaN
        2           3  1:51,000    3.0  NaN
    700 1           2  1:45,000    NaN  NaN
        2           2  1:41,000    2.0  NaN
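With the multi-indexed result you can then address a single car and lap directly; a quick usage sketch with the keys from the output above:

# 'after' and 'during' values for car 13 in class V5, lap 2
print(df.loc[('V5', '13', '2')])

# only the completed-lap times for every car
print(df[('after', 'Zeit')])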

Pandas (and seaborn) violinplot of state vs. year

I'm learning Pandas, (watching these helpful videos) and currently playing around with a UFO sighting table
import pandas as pd
ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo.head()
ufo.Time = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year
ufo.head()
Now, I'd like to use Seaborn to make a violinplot of each state (on the x-axis) and the year (on the y-axis). Hence the plot shows the frequency density of sightings at any given year, in any given state.
If I use
ufo.State.value_counts()
I can get a Pandas Series of the counts for each state. But how do I separate this data by year? I somehow need the number of sightings per year per state.
Am I on the right track to create a Seaborn violinplot? Or going in completely the wrong direction?
According to the example shown in the violinplot documentation:
ax = sns.violinplot(x="day", y="total_bill", data=tips)
you can directly assign your desired columns to the axes by supplying the column names to the x= and y= parameters. The following shows the data structure of the tips variable.
In [ ]: tips.head()
Out[ ]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Your question is to plot a violinplot with the x-axis showing ufo.State and the y-axis showing ufo.Year. Therefore, I believe ufo.State.value_counts() is unnecessary, and so is a groupby, since the ufo data is already in a shape that satisfies violinplot's parameter format.
You can achieve it by directly supplying the ufo columns to x= and y=. See the code below:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo.head()
City Colors Reported Shape Reported State \
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
2 Holyoke NaN OVAL CO
3 Abilene NaN DISK KS
4 New York Worlds Fair NaN LIGHT NY
Time Year
0 1930-06-01 22:00:00 1930
1 1930-06-30 20:00:00 1930
2 1931-02-15 14:00:00 1931
3 1931-06-01 13:00:00 1931
4 1933-04-18 19:00:00 1933
ufo.Time = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.violinplot(x=ufo.State, y=ufo.Year)
# ax = sns.violinplot(x='State', y='Year', data=ufo) # Works the same with the code one line above
plt.show()
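If you do also want the raw number of sightings per state and year (not needed for the violin plot itself), a short sketch:

counts = ufo.groupby(['State', 'Year']).size()
print(counts.head())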

iterating through a dataframe and doing an operation on every element

I have a pandas dataframe df
Red Green Yellow Purple
Basket1 1 2 0 10
Basket2 4 5 0 0
Basket3 9 10 11 12
I want to iterate through this dataframe and divide each element by the total of its column. For example, the first element would be 1/14. I know many pieces of code but am unable to put them together. For iterating I use
for idx, row in df.iterrows():
and for the column totals I use df.sum(axis=0).
Please help me out with the intermediate code.
This ought to do the trick:
>>> df/df.sum()
Red Green Yellow Purple
Basket1 0.071429 0.117647 0.0 0.454545
Basket2 0.285714 0.294118 0.0 0.000000
Basket3 0.642857 0.588235 1.0 0.545455
As for your serial "iterating through a dataframe and doing an operation on every element" approach, just know that while a for loop is sometimes the easiest and most intuitive way to get the job done, pandas is built for vectorization (i.e. doing things really quickly). When you have lots of data, a built-in pandas operation is usually the best tool for the job.
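For comparison, here is a sketch of the explicit loop the question started with next to the vectorized one-liner; both give the same result, the vectorized form is just shorter and faster:

# vectorized: divide every column by its own total
normalised = df / df.sum(axis=0)

# explicit loop over rows, aligning each row against the column totals
col_totals = df.sum(axis=0)
out = df.astype(float)
for idx, row in df.iterrows():
    out.loc[idx] = row / col_totals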

Ordinary Least Squares Regression for multiple columns in Pandas Dataframe

I'm trying to find a way to iterate code for a linear regression over many, many columns, all the way up to Z3. Here is a snippet of the dataframe called df1:
Time A1 A2 A3 B1 B2 B3
1 1.00 6.64 6.82 6.79 6.70 6.95 7.02
2 2.00 6.70 6.86 6.92 NaN NaN NaN
3 3.00 NaN NaN NaN 7.07 7.27 7.40
4 4.00 7.15 7.26 7.26 7.19 NaN NaN
5 5.00 NaN NaN NaN NaN 7.40 7.51
6 5.50 7.44 7.63 7.58 7.54 NaN NaN
7 6.00 7.62 7.86 7.71 NaN NaN NaN
This code returns the slope coefficient of a linear regression for one single column only and concatenates the value to a numpy array called series. Here is what it looks like for extracting the slope of the first column:
from sklearn.linear_model import LinearRegression
series = np.array([])  # blank array to append results to
df2 = df1[~np.isnan(df1['A1'])]  # removes NaN values for this column so sklearn can fit
df3 = df2[['Time', 'A1']]
npMatrix = np.matrix(df3)
X, Y = npMatrix[:, 0], npMatrix[:, 1]
slope = LinearRegression().fit(X, Y)  # either this or the next line
m = slope.coef_[0]
series = np.concatenate((series, m), axis=0)
As it stands now, I am using this slice of code and replacing "A1" with a new column name all the way up to "Z3", which is extremely inefficient. I know there are many easy ways to do this with some modules, but the intermediate NaN values in the time series seem to limit me to this method, or something like it.
I tried using a for loop such as:
for col in df1.columns:
and replacing 'A1', for example, with col in the code, but this does not seem to work.
Is there any way I can do this more efficiently?
Thank you!
One liner (or three)
time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
             ['Slope'], df.columns)
Broken down with a bit of explanation
Using the closed form of OLS, the coefficients are beta = (XᵀX)⁻¹ Xᵀ Y.
In this case X is time, where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd used single brackets, I'd have gotten a series and its one dimension, and then the dot products aren't as pretty.
(XᵀX)⁻¹ Xᵀ is np.linalg.pinv(time.T.dot(time)).dot(time.T)
Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we can do it all together. You have to deal with the NaNs. How would you imagine dealing with them? Only doing it over the times you had data? That is equivalent to placing zeroes in the NaN spots. So, I did.
Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.
Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!
Looping is a decent strategy for a modest number (say, fewer than thousands) of columns. Without seeing your implementation, I can't say what's wrong, but here's my version, which works:
slopes = []
for c in df1.columns:
    if c == "Time":
        continue  # skip the time column itself
    mask = ~np.isnan(df1[c])  # keep only the rows where this column has data
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])
I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.
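To keep track of which slope belongs to which column, you could collect the results into a labelled Series instead of a bare list; a small sketch building on the loop above:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

slopes = {}
for c in df1.columns:
    if c == "Time":
        continue
    mask = ~np.isnan(df1[c])
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    slopes[c] = LinearRegression().fit(x, y).coef_[0][0]

print(pd.Series(slopes))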
