Annotate scatter plot with multiindex - python

I have constructed a scatter plot using data from a DataFrame with a MultiIndex. The index levels are country and year:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
rel_pib = welfare["rel_pib_pc"].loc[:, 1960:2010].groupby("country").mean()
rel_lambda = welfare["Lambda"].loc[:, 1960:2010].groupby("country").mean()
ax.scatter(rel_pib, rel_lambda)
ax.set_ylim(0, 2)
ax.set_ylabel('Bienestar(Lambda)')
ax.set_xlabel('PIBPc')
ax.plot([0, 1], 'red', linewidth=1)
I would like to annotate each point with the country name (and, if possible, the Lambda value). I have the following code:
for i, txt in enumerate(welfare.index):
    plt.annotate(txt, (welfare["rel_pib_pc"].loc[:, 1960:2010].groupby("country").mean()[i],
                       welfare["Lambda"].loc[:, 1960:2010].groupby("country").mean()[i]))
I am not sure how to indicate that I want the country names, since all the Lambda and rel_pib_pc values for a given country are reduced to a single value by the .mean() call.
I have tried using .xs(), but none of the combinations I tried work.

I used the following test data:
                rel_pib_pc  Lambda
country  year
Country1 2007          260    1.12
         2008          265    1.13
         2009          268    1.10
Country2 2007          230    1.05
         2008          235    1.07
         2009          236    1.04
Country3 2007          200    1.02
         2008          203    1.07
         2009          208    1.05
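For reproducibility, a sketch of how this test DataFrame can be constructed (the per-country means replace the 1960:2010 slice from the question, since the test years are 2007-2009):

import pandas as pd

idx = pd.MultiIndex.from_product(
    [['Country1', 'Country2', 'Country3'], [2007, 2008, 2009]],
    names=['country', 'year'])
welfare = pd.DataFrame(
    {'rel_pib_pc': [260, 265, 268, 230, 235, 236, 200, 203, 208],
     'Lambda':     [1.12, 1.13, 1.10, 1.05, 1.07, 1.04, 1.02, 1.07, 1.05]},
    index=idx)

# Per-country means used as the scatter coordinates
rel_pib = welfare['rel_pib_pc'].groupby('country').mean()
rel_lambda = welfare['Lambda'].groupby('country').mean()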
Then, to generate a scatter plot, I used the following code:
fig, ax = plt.subplots(1, 1)
ax.scatter(rel_pib, rel_lambda)
ax.set_ylabel('Bienestar(Lambda)')
ax.set_xlabel('PIBPc')
ax.set_xlim(190, 280)
annot_dy = 0.005
for i, txt in enumerate(rel_lambda.index):
    ax.annotate(txt, (rel_pib.loc[txt], rel_lambda.loc[txt] + annot_dy), ha='center')
plt.show()
and got the expected result: each point labeled with its country name, just above the marker.
The trick to correctly generate the annotations is:

- Enumerate the index of one of the already generated Series objects, so that txt contains the country name.
- Take values from the already generated Series objects (don't compute these values again).
- Locate both coordinates by the current index value.

To put these annotations just above their respective points:

- pass ha='center' (horizontal alignment),
- shift the y coordinate a little up (if needed, experiment with other values of annot_dy).

I also added ax.set_xlim(190, 280) in order to keep the annotations within the picture rectangle. Maybe you will not need it.
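To also include the Lambda value in each annotation, as the question asks, a minimal sketch building on the loop above (the two-line label format is an assumption; adjust to taste):

annot_dy = 0.005
for country in rel_lambda.index:
    # Country name on the first line, mean Lambda on the second
    label = f"{country}\n{rel_lambda.loc[country]:.2f}"
    ax.annotate(label, (rel_pib.loc[country], rel_lambda.loc[country] + annot_dy),
                ha='center')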

Related

Grouping values in a clustered pie chart

I'm working with a dataset about when certain houses were constructed, and my data stretches from 1873 to 2018 (143 slices). I'm trying to visualise this data as a pie chart, but because of the large number of individual slices the entire chart appears cluttered and messy.
To get around this, I'm trying to group the values into 15-year periods and display the periods on the pie chart instead. I saw a similar post on Stack Overflow where the suggested solution was to use a dictionary and define a threshold to group the values, but implementing a version of that on my own pie chart didn't work, and I was wondering how I could tackle this problem.
CODE
testing = df1.groupby("Year Built").size()
testing.plot.pie(autopct="%.2f",figsize=(10,10))
plt.ylabel(None)
plt.show()
(Screenshots of the testing DataFrame and the current pie chart omitted.)
For the future, always provide a reproducible example of the data you are working on (maybe use df.head().to_dict()). One solution to your problem is to use DataFrame.resample.
# Data used
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': np.arange(1890, 2018),
                   'built': np.random.randint(1, 150, size=(2018 - 1890))})

>>> df.head()
   year  built
0  1890     34
1  1891     70
2  1892     92
3  1893    135
4  1894     16

# First, convert the 'year' values into DateTime values, set them as the
# index, and resample into 15-year bins
df['year'] = pd.to_datetime(df['year'], format='%Y')
df_to_plot = df.set_index('year', drop=True).resample('15Y').sum()
>>> df_to_plot
            built
year
1890-12-31     34
1905-12-31    983
1920-12-31    875
1935-12-31   1336
1950-12-31   1221
1965-12-31   1135
1980-12-31   1207
1995-12-31   1168
2010-12-31   1189
2025-12-31    757
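To actually draw the grouped pie chart from df_to_plot, a minimal sketch could look like this (labelling each slice with the end year of its 15-year bin is an assumption; adjust the format to taste):

import matplotlib.pyplot as plt

labels = [str(ts.year) for ts in df_to_plot.index]  # end year of each 15-year bin
df_to_plot['built'].plot.pie(labels=labels, autopct="%.2f", figsize=(10, 10))
plt.ylabel(None)
plt.show()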
Alternatively, you could use pd.cut() on the original integer 'year' column:
# Bin the years into 15 equal-width groups and sum the house counts per group
df['group'] = pd.cut(df['year'], 15, precision=0)
df.groupby('group')[['built']].sum().plot(kind='pie', subplots=True, figsize=(10, 10), legend=False)

Python change the starting values on the plot

I have a data set which looks like this:
Hour_day  Profits
       7      645
       3      354
       5      346
      11      153
      23      478
       7      464
      12      356
       0      346
I created a line plot to visualize the hour on the x-axis and the profit values on the y-axis. My code works, but the problem is that the x-axis starts at 0, and I want it to start from 5 pm, for example.
hours = df.Hour_day.value_counts().keys()
hours = hours.sort_values()

# Get plot information from actual data
y_values = list()
for hr in hours:
    temp = df[df.Hour_day == hr]
    y_values.append(temp.Profits.mean())

# Plot comparison
plt.plot(hours, y_values, color='y')
From what I know you have two options:
Create a sub-DataFrame that excludes the rows that have an Hour_day value under 5 and proceed with the rest of your code as normal:
df_new = df[df['Hour_day'] >= 5]
or, you might be able to set the x_ticks:
default_x_ticks = range(5, 24)
plt.plot(hours, y_values, color='y')
plt.xticks(default_x_ticks, hours)
plt.show()
I haven't tested the x_ticks code so you might have to play around with it just a touch, but there are lots of easy to find resources on x_ticks.
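As a tested version of the filtering approach, a minimal self-contained sketch (using the sample data from the question) might look like this:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Hour_day': [7, 3, 5, 11, 23, 7, 12, 0],
                   'Profits':  [645, 354, 346, 153, 478, 464, 356, 346]})

# Keep only hours >= 5, average the profits per hour, and plot in hour order
means = (df[df['Hour_day'] >= 5]
         .groupby('Hour_day')['Profits']
         .mean()
         .sort_index())
plt.plot(means.index, means.values, color='y')
plt.xticks(means.index)  # the x-axis now starts at the first kept hour
plt.show()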

PANDAS-GEOPANDAS: Localization of points in a shapefile

Using pandas and geopandas, I would like to define a function to be applied to each row of a dataframe which operates as follows:
INPUT: column with coordinates
OUTPUT: zone in which the point falls.
I tried with this, but it takes very long.
def zone_assign(point,zones,codes):
try:
zone_label=zones[zones['geometry'].contains(point)][codes].values[0]
except:
zone_label=np.NaN
return(zone_label)
where:
point is the cell of the row which contains geographical coordinates;
zones is the shapefile imported with geopandas;
codes is the column of the shapefile which contains label to be assigned to the point.
Part of this answer is taken from another answer I wrote earlier that needed within rather than contains.
Your situation looks like a typical case where spatial joins are useful. The idea of spatial joins is to merge data using geographic coordinates instead of using attributes.
Three possibilities in geopandas:
intersects
within
contains
It seems like you want contains, which is possible using the following syntax:
geopandas.sjoin(polygons, points, how="inner", op='contains')
(In recent geopandas versions, the op keyword has been renamed to predicate.)
Note: you need to have rtree installed to be able to perform such operations. If you need to install this dependency, use pip (pip install rtree) or conda (conda install -c conda-forge rtree).
Example
As an example, let's take a random sample of cities and plot countries associated. The two example datasets are
import geopandas
import matplotlib.pyplot as plt
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
cities = geopandas.read_file(geopandas.datasets.get_path('naturalearth_cities'))
cities = cities.sample(n=50, random_state=1)
world.head(2)

    pop_est continent      name iso_a3  gdp_md_est                                           geometry
0    920938   Oceania      Fiji    FJI      8374.0  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1  53950935    Africa  Tanzania    TZA    150600.0  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
cities.head(3)

        name                    geometry
196   Bogota   POINT (-74.08529 4.59837)
95   Tbilisi   POINT (44.78885 41.72696)
173    Seoul  POINT (126.99779 37.56829)
world is a worldwide dataset and cities is a subset.
Both datasets need to be in the same projection system. If not, use .to_crs before merging.
data_merged = geopandas.sjoin(world, cities, how="inner", op='contains')
Finally, to see the result let's do a map
f, ax = plt.subplots(1, figsize=(20, 10))
data_merged.plot(ax=ax)
world.plot(ax=ax, alpha=0.25, linewidth=0.1)
plt.show()
and the underlying dataset merges together the information we need
data_merged.head(2)
    pop_est      continent         name_left iso_a3  gdp_md_est                                           geometry  index_right    name_right
7   6909701        Oceania  Papua New Guinea    PNG     28020.0  MULTIPOLYGON (((141.00021 -2.60015, 142.73525 ...           59  Port Moresby
9  44293293  South America         Argentina    ARG    879400.0  MULTIPOLYGON (((-68.63401 -52.63637, -68.25000...          182  Buenos Aires
Here, I used inner join method but that's a parameter you can change if, for instance, you want to keep all points, including those not within a polygon.
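Applied to the question's setup, a sketch along these lines could replace the row-by-row function entirely (points_gdf and the 'code' column are hypothetical names standing in for the question's data):

import geopandas

# One spatial join instead of one .contains() scan per row; 'points_gdf'
# holds the points and 'zones' the polygons with a hypothetical 'code' column
joined = geopandas.sjoin(points_gdf, zones[['geometry', 'code']],
                         how='left', op='within')
points_gdf['zone_label'] = joined['code']  # NaN where a point falls in no zone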

How to implement weights in folium Heatmap?

I'm trying to add weights to my folium heatmap layer, but I can't figure out how to correctly implement this.
I have a dataframe with 3 columns: LAT, LON and VALUE. Value being the total sales of that location.
self.map = folium.Map([mlat, mlon], tiles=tiles, zoom_start=8)
locs = zip(self.data.LAT, self.data.LON, self.data.VALUE)
HeatMap(locs, radius=30, blur=10).add_to(self.map)
I tried to use the absolute sales values and I also tried to normalize sales/sales.sum(). Both give me similar results.
The problem is:
The heatmap shows stronger red levels for regions with more stores, even if the total sales of those stores together are a lot smaller than the sales of a distant, isolated large store.
Expected behaviour:
I would expect the intensity of the heatmap to use the sales value of each store, since sales were passed in the zip object to the HeatMap plugin.
Let's say I have 2 regions: A and B.
In region A I have 3 stores: 10 + 15 + 10 = 35 total sales.
In region B I have 1 big store: 100 total sales
I'd expect a greater intensity for region B than for region A. I noticed that this expected behaviour only shows up when the difference is very large (if I try 35 vs 5000000, then region B becomes more prominent).
My CSV file is just a random sample, like this:
LAT,LON,VALUE,DATE,DIFFLAT1,DIFFLON1
-22.4056,-53.6193,14,2010,0.0242,0.4505
-22.0516,-53.7025,12,2010,0.3137,0.6636
-22.3239,-52.9108,100,2010,0.0514,0.0002
-22.6891,-53.7424,6,2010,0.0002,0.7887
-21.8762,-53.6866,16,2010,0.7283,0.6180
-22.1861,-53.5353,11,2010,0.1420,0.2924
import folium
from folium.plugins import HeatMap

heat_df = df.loc[:, ["lat", "lon", "weight"]]
map_hooray = folium.Map(location=[45.517999, -73.568184], zoom_start=12)

# Format: a list of [lat, lon, weight] lists
heat_data = heat_df.values.tolist()

# Plot it on the map
HeatMap(heat_data, radius=13).add_to(map_hooray)

# Save the map
map_hooray.save('heat_map.html')
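If dense clusters still drown out isolated high-value stores, one option is to normalize the weights yourself and shrink the radius/blur so fewer nearby points stack; a sketch assuming the LAT/LON/VALUE columns from the question (the file name is hypothetical):

import folium
import pandas as pd
from folium.plugins import HeatMap

df = pd.read_csv('stores.csv')  # hypothetical file with LAT, LON, VALUE columns

# Scale VALUE into 0..1 so the single largest store reaches full intensity
df['WEIGHT'] = df['VALUE'] / df['VALUE'].max()

m = folium.Map(location=[df.LAT.mean(), df.LON.mean()], zoom_start=8)
# A smaller radius/blur reduces how much nearby points stack on top of each other
HeatMap(df[['LAT', 'LON', 'WEIGHT']].values.tolist(),
        radius=15, blur=5).add_to(m)
m.save('heat_map.html')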

Plots do not appear when calling seaborn's pairplot on a pandas Dataframe

I have a Dataframe that looks like so
Price  Mileage  Age
 4250    71000    8
 6500    43100    6
26950    10000    3
 1295    78000   17
 5999    61600    8
This is assigned to dataset. I simply call sns.pairplot(dataset) and I'm left with just a single graph - the distribution of prices across my dataset. I expected a 3x3 grid of plots.
When I import a pre-configured dataset from seaborn I get the expected multiplot pair plot.
I'm new to seaborn so apologies if this is a silly question, but what am I doing wrong? It seems like a simple task.
From your comment, it seems like you're trying to plot on non-numeric columns. Try coercing them first:
dataset = dataset.apply(lambda x: pd.to_numeric(x, errors='coerce'))
sns.pairplot(dataset)
The errors='coerce' argument will replace non-coercible values (the reason your columns are objects in the first place) to NaN.
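A quick way to confirm that diagnosis before coercing, assuming the dataset DataFrame from the question:

print(dataset.dtypes)  # columns reported as 'object' are the non-numeric culprits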
