I have a data visualisation-based question. I basically want to create a heatmap from a pandas DataFrame, where I have the x,y coordinates and the corresponding z value. The data can be created with the following code -
import numpy as np
import pandas as pd
data = ([[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]])
data=np.array(data)
df=pd.DataFrame(data,columns=['X','Y','Z'])
Please note that I have converted an array into a DataFrame only to give a reproducible example. My actual data set is quite large and I import it into Python as a DataFrame. After processing, it is available in the format given above.
I have seen the other questions based on the same problem, but they do not seem to be working for my particular problem. Or maybe I am not applying them correctly. I want my results to be similar to what is given here https://plot.ly/python/v3/ipython-notebooks/cufflinks/#heatmaps
Any help would be welcome.
Thank you!
Found one way of doing this -
Using Seaborn.
import numpy as np
import pandas as pd
import seaborn as sns
data = ([[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]])
data=np.array(data)
df=pd.DataFrame(data,columns=['X','Y','Z'])
# pivot takes keyword arguments only in pandas 2.0 and later
df=df.pivot(index='X',columns='Y',values='Z')
display_df = sns.heatmap(df)
Returns the following image -
sorry for creating another question.
Also, thank you for the link to a related post.
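To make the pivot step in the seaborn answer easier to follow, here is a minimal sketch that only reshapes the example data and prints the result, without plotting. The variable name `grid` is my own; everything else comes from the question.

```python
import pandas as pd

data = [[0.2, 0.2, 24], [0.2, 0.6, 8], [0.2, 2.4, 26],
        [0.28, 0.2, 28], [0.28, 0.6, 48], [0.28, 2.4, 55],
        [0.36, 0.2, 34], [0.36, 0.6, 46], [0.36, 2.4, 55]]
df = pd.DataFrame(data, columns=["X", "Y", "Z"])

# Long format (one row per point) -> wide grid:
# X values become the row index, Y values the columns, Z fills the cells.
grid = df.pivot(index="X", columns="Y", values="Z")
print(grid)
```

This 3 x 3 grid is the shape a heatmap function expects. If the real data contains repeated (X, Y) pairs, `pivot` raises a ValueError; `df.pivot_table(index="X", columns="Y", values="Z", aggfunc="mean")` is one way to aggregate the duplicates instead.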
How about using plotnine, A Grammar of Graphics for Python?
data
import numpy as np
import pandas as pd
data = ([[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]])
data=np.array(data)
df=pd.DataFrame(data,columns=['X','Y','Z'])
Prepare data
df['rows'] = ['row' + str(n) for n in range(0,len(df.index))]
dfMelt = pd.melt(df, id_vars = 'rows')
Make heatmap
from plotnine import aes, element_blank, geom_tile, ggplot, labs, theme
ggplot(dfMelt, aes('variable', 'rows', fill='value')) + \
    geom_tile(aes(width=1, height=1)) + \
    theme(axis_ticks=element_blank(),
          panel_background=element_blank()) + \
    labs(x='', y='', fill='')
This subject refers to this one I closed earlier:
NetCDF4 file with Python - Filter before dataframing
After applying the solution of the other topic to reduce an xarray size
data_9 = ds.sel(time=datetime.time(9))
I have an xarray this way:
But I still can and need to reduce it on latitude and longitude
For example I want only longitude between -4 and 44
I tried to apply the function sel again but it doesn't seem to work this time :'(
data_9 = ds.sel(time=datetime.time(9)).sel(lon>-4).sel(lon<44)
Doing this it can't recognise lon...
NameError: name 'lon' is not defined
Can someone help with this too?
Thanks
It seems you have to use where instead of sel here. We can create a condition array just like in numpy and give it to where. The second parameter drop=True removes the data where our condition is falsy. Without it, you would get nans there instead of getting a trimmed dataset.
I am using the same demo dataset used in the other question you linked.
import xarray as xr
import datetime
# Load a demo dataset.
ds = xr.tutorial.load_dataset('air_temperature')
data_9 = ds.sel(time=datetime.time(9))
cond = (-4 < data_9.lon) & (data_9.lon < 44)
data_9 = data_9.where(cond, drop=True)
Xarray's sel methods can take multiple selectors and windows in the form of slices:
ds_subset = ds.sel(time=datetime.time(9), lon=slice(-4, 44))
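To compare the two approaches side by side, here is a small sketch on a synthetic one-dimensional array rather than the tutorial dataset. The array `da` and its longitude values are made up for illustration; only `where`, `drop=True`, `sel` and `slice` come from the answers above.

```python
import numpy as np
import xarray as xr

# Hypothetical data: seven values on longitudes -10, 0, 10, ..., 50.
lon = np.array([-10, 0, 10, 20, 30, 40, 50])
da = xr.DataArray(np.arange(7.0), coords={"lon": lon}, dims="lon")

cond = (da.lon > -4) & (da.lon < 44)

masked = da.where(cond)              # same size, NaN outside the window
trimmed = da.where(cond, drop=True)  # points outside the window are removed
sliced = da.sel(lon=slice(-4, 44))   # label-based window on a sorted coordinate

print(masked.sizes["lon"], trimmed.sizes["lon"], sliced.sizes["lon"])
```

One difference to keep in mind: `slice` bounds in `sel` are inclusive, while the `where` condition here uses strict inequalities, so the two only agree when no coordinate falls exactly on a bound.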
I need some guidance in working out how to plot a block of histograms from grouped data in a pandas dataframe. Here's an example to illustrate my question:
from pandas import DataFrame
import numpy as np
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
grouped = df.groupby('Letter')
In my ignorance I tried this code command:
df.groupby('Letter').hist()
which failed with the error message "TypeError: cannot concatenate 'str' and 'float' objects"
Any help most appreciated.
I'm on a roll, just found an even simpler way to do it using the by keyword in the hist method:
df['N'].hist(by=df['Letter'])
That's a very handy little shortcut for quickly scanning your grouped data!
For future visitors, the product of this call is the following chart:
One solution is to use matplotlib's histogram directly on each group. Iterating over the groupby object yields (key, group) pairs, where each group is a DataFrame, so you can create one histogram per group.
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
grouped = df.groupby('Letter')
# one figure per group
for letter, group in grouped:
    plt.figure()
    plt.hist(group['N'])
    plt.title(letter)
plt.show()
Your function is failing because the groupby dataframe you end up with has a hierarchical index and two columns (Letter and N) so when you do .hist() it's trying to make a histogram of both columns hence the str error.
This is the default behavior of pandas plotting functions (one plot per column) so if you reshape your data frame so that each letter is a column you will get exactly what you want.
df.reset_index().pivot(index='index', columns='Letter', values='N').hist()
The reset_index() is just to shove the current index into a column called index. Then pivot will take your data frame, collect all of the values of N for each Letter and make them a column. The resulting data frame has 1000 rows (one per original row, with NaN wherever the row belongs to a different letter) and three columns (A, B, C). hist() will then produce one histogram per column and you can format the plots as needed.
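As a quick check of what this reshaping produces, the sketch below builds the same example frame and inspects the pivoted result without plotting. The name `wide` and the seeded random generator are my own additions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Letter": ["A"] * 300 + ["B"] * 400 + ["C"] * 300,
    "N": rng.standard_normal(1000),
})

# Move the row index into a column, then spread N into one column per letter.
wide = df.reset_index().pivot(index="index", columns="Letter", values="N")

print(wide.shape)    # (1000, 3)
print(wide.count())  # non-NaN values per column: A 300, B 400, C 300
```

Each row holds exactly one real value, in the column of its own letter. `hist()` ignores the NaN entries, which is why calling it on `wide` gives one histogram per letter.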
With recent versions of pandas, you can do
df.N.hist(by=df.Letter)
Just like with the solutions above, the axes will be different for each subplot. I have not solved that one yet.
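One possible way around the differing axes is the `sharex` and `sharey` arguments of `DataFrame.hist`. This is a sketch on the example data from the question, not something from the answer above; the Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Letter": ["A"] * 300 + ["B"] * 400 + ["C"] * 300,
    "N": rng.standard_normal(1000),
})

# One histogram of N per letter, with both axes shared across the subplots.
axes = df.hist(column="N", by="Letter", sharex=True, sharey=True, bins=20)
```

With shared axes the bar heights and x ranges can be compared directly between letters. The bins are still computed per group, so passing explicit bin edges (for example `bins=np.linspace(-4, 4, 21)`) may be preferable if the bars should line up exactly.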
I am writing this answer because I was looking for a way to plot the histograms of different groups together. What follows is not very smart, but it works fine for me. I use NumPy to compute the histogram and Bokeh for plotting. I think it is self-explanatory, but feel free to ask for clarifications and I'll be happy to add details (and write it better).
import numpy as np
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show

# df_trips is my own DataFrame with 'locality', 'means' and 'speed' columns
figures = {
    'Transit': figure(title='Transit', x_axis_label='speed [km/h]', y_axis_label='frequency'),
    'Driving': figure(title='Driving', x_axis_label='speed [km/h]', y_axis_label='frequency')
}
cols = {'Vienna': 'red', 'Turin': 'blue', 'Rome': 'Orange'}
for (locality, means), group in df_trips.groupby(['locality', 'means']):
    fig = figures[means]
    # h holds the counts, b the bin edges
    h, b = np.histogram(group['speed'].values)
    fig.vbar(x=b[1:], top=h, width=(b[1]-b[0]), legend_label=locality, fill_color=cols[locality], alpha=0.5)
show(gridplot([
    [figures['Transit']],
    [figures['Driving']],
]))
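I find this even easier and faster, though note that it plots the distribution of the group sizes (one count per letter) rather than the values of N within each group.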
data_df.groupby('Letter').count()['N'].hist(bins=100)
I have a particular problem. I would like to clean and prepare my data, and I have a lot of unknown values in the "highpoint_metres" column of my dataframe (members). As there is no missing information for "peak_id", I calculated the median height per peak_id to be more accurate.
I would like to do two things: 1) add a new column to my "members" dataframe containing the median value, which differs depending on the "peak_id" (calculated with the code in the question); 2) have the code check whether the value in highpoint_metres is null and, if it is, use the value from the new column instead. I hope this is clear.
Code:
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id","highpoint_metres"]].groupby("peak_id",as_index=False).median()
I don't know how to continue from there (my level of Python is very bad ;-))
I believe that's what you're looking for:
import numpy as np
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")
is_highpoint_missing = np.isnan(members.highpoint_metres)
members["highpoint_meters_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)
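The same idea can be tried on a tiny made-up frame, which avoids downloading the CSV. This is a minimal sketch: the peak ids and heights are invented, and it uses `fillna` where the answer above uses `np.where`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the members table.
members = pd.DataFrame({
    "peak_id": ["A", "A", "A", "B", "B"],
    "highpoint_metres": [8000.0, np.nan, 8200.0, 6000.0, np.nan],
})

# transform("median") returns one value per row, aligned with the original
# index, so it can be used directly to fill the gaps.
peak_median = members.groupby("peak_id")["highpoint_metres"].transform("median")
members["highpoint_metres_filled"] = members["highpoint_metres"].fillna(peak_median)

print(members)
```

The median ignores missing values, so peak A gets 8100 (the median of 8000 and 8200) and peak B gets 6000. A peak with no known height at all would still end up with NaN.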
One way to go about replacing 0 with the median could be:
import numpy as np
df[col_name] = df[col_name].replace({0: np.median(df[col_name])})
You can also use apply function:
df[col_name] = df[col_name].apply(lambda x: np.median(df[col_name]) if x==0 else x)
Let me know if this helps.
So adding a little bit more info based on Marie's question.
One way to get median is through groupby and then left join it with the original dataframe.
df_gp = df.groupby(['peak_id']).agg(Median=('highpoint_metres', 'median')).reset_index()
df = pd.merge(df, df_gp, on='peak_id', how='left')
df['highpoint_metres'] = df.apply(lambda x: x['Median'] if pd.isna(x['highpoint_metres']) else x['highpoint_metres'], axis=1)
Let me know if this solves your issue
Sorry if I haven't explained things very well. I'm a complete novice, so please feel free to critique.
I've searched everywhere but haven't found anything close to subtracting a percentage. When it's done on its own (x - .10 = y) it works wonderfully. The only problem is that I'm trying to make 'x' stand for sample_.csv[0], which, from my understanding, is the numerical value from the first column.
import csv
import numpy as np
import pandas as pd
readdata = csv.reader(open("sample_.csv"))
x = input(sample_.csv[0])
y = input(x * .10)
print(x + y)
the column looks something like this
"20,a,"
"25,b,"
"35,c,"
"45,d,"
I think you should only need pandas for this task. I'm guessing you want to apply this operation on one column:
import pandas as pd
df = pd.read_csv('sample_.csv') # assuming columns within csv header.
df['new_col'] = df['20,a'] * 1.1 # Faster than adding to a percentage x + 0.1x = 1.1*x
df.to_csv('new_sample.csv', index=False) # Default behavior is to write index, which I personally don't like.
BTW: input is a built-in Python function that asks the user for input. I'm guessing you don't want this behavior, but I could be wrong.
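Since the question asks about subtracting a percentage and the sample file appears to have no header row, here is a sketch under those assumptions. The inline CSV text, the column names `value` and `label`, and the 10% figure are all illustrative stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for sample_.csv: two columns, no header row.
csv_text = "20,a\n25,b\n35,c\n45,d\n"

# header=None stops pandas from treating the first data row as column names.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=["value", "label"])

# Subtracting 10% is the same as multiplying by 0.9.
df["reduced"] = df["value"] * 0.9
print(df)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file path. The arithmetic is applied to the whole column at once, so no loop or `input` call is needed.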
import pandas as pd
df = pd.read_csv("sample_.csv")
df['newcolumn'] = df['column'].apply(lambda x : x * .10)
Please try this.