show text description in x axis rather than numbers using pandas matplotlib - python

I have written code to show my data set as a bar chart. I read my data from a .csv file like this:
names = ["Clinic Number","Question Text","Answer Text","Answer Date","Class"]
data = pd.read_csv('ADLCI.csv', names = names)
And then
grouped = data.groupby(['Question Text','Answer Text']).size().reset_index(name='counts')
import matplotlib.pyplot as plt
plt.figure()
grouped.plot(kind='bar', title ="Functional Status Count", figsize=(15, 10), legend=True, fontsize=12)
plt.show()
This is the resulting data frame, which I want to show as a bar chart:
Question Text Answer Text counts
0 CI function No 513
1 CI function Yes 373
2 bathing? No 2827
3 bathing? Yes 408
4 dressing? No 2824
5 dressing? Yes 423
6 feeding No 2851
7 feeding Yes 160
8 housekeeping No 2803
9 housekeeping Yes 717
10 preparing food No 2604
11 preparing food Yes 593
12 responsibility for own medications No 2793
13 responsibility for own medications Yes 625
14 shopping No 35
15 shopping Yes 49
16 toileting No 2843
17 toileting Yes 239
18 transferring No 2834
19 transferring Yes 904
20 using transportation No 2816
21 using transportation Yes 483
The first column of numbers was added automatically; it is not part of my data set.
Here is the bar chart created by this code.
As you can see in the bar chart, all bars have the same color, and the x axis shows the index numbers I mentioned. That is not what I want.
What I want looks like the chart in this link:
Let me explain the changes I want to make to the picture I have uploaded here.
Instead of 0, 1, ... on the x axis, the chart should show the Question Text column. In detail: the data frame has two CI function rows, one for Yes and one for No. I want a single CI function label in place of 0 and 1, with two bars in different colors, one for the count of No (1596) and one for the count of Yes (1376).
The next item would be bathing?, again with one bar at 17965 and another at 702.
That way I should have about ten groups, each made of two bars side by side, like in the link above.
I tried various approaches like the one in the link, but my chart does not look like that or I get an error.
Thanks :)
Update 1
when I applied your code:
import matplotlib.pyplot as plt
data.groupby(['Question Text','Answer Text']).sum().unstack().plot(kind='bar')
plt.show()
I got this error:
Traceback (most recent call last):
File "C:/Users/M193053/PycharmProjects/ADL-distribution/test.py", line 52, in <module>
data.groupby(['Question Text','Answer Text']).sum().unstack().plot(kind='bar')
File "C:\Users\M193053\Documents\Anaconda3\envs\conda3\lib\site-packages\pandas\plotting\_core.py", line 2941, in __call__
sort_columns=sort_columns, **kwds)
File "C:\Users\M193053\Documents\Anaconda3\envs\conda3\lib\site-packages\pandas\plotting\_core.py", line 1977, in plot_frame
**kwds)
File "C:\Users\M193053\Documents\Anaconda3\envs\conda3\lib\site-packages\pandas\plotting\_core.py", line 1804, in _plot
plot_obj.generate()
File "C:\Users\M193053\Documents\Anaconda3\envs\conda3\lib\site-packages\pandas\plotting\_core.py", line 258, in generate
self._compute_plot_data()
File "C:\Users\M193053\Documents\Anaconda3\envs\conda3\lib\site-packages\pandas\plotting\_core.py", line 373, in _compute_plot_data
'plot'.format(numeric_data.__class__.__name__))
TypeError: Empty 'DataFrame': no numeric data to plot
but when I use this code:
grouped = data.groupby(['Question Text','Answer Text']).size().reset_index(name='counts')
import matplotlib.pyplot as plt
grouped.groupby(['Question Text','Answer Text']).sum().unstack().plot(kind='bar')
plt.show()
It seems OK to me, like this:
but applying groupby twice does not seem logical, so I am still not sure what I should do.
Thanks for taking the time :)
Update two
This is my data frame, obtained with this code:
grouped = data.groupby(['Question Text','Answer Text']).size().reset_index(name='counts')
0 CI function No 513
1 CI function Yes 373
2 bathing? No 2827
3 bathing? Yes 408
4 dressing? No 2824
5 dressing? Yes 423
6 feeding No 2851
7 feeding Yes 160
8 housekeeping No 2803
9 housekeeping Yes 717
10 preparing food No 2604
11 preparing food Yes 593
12 responsibility for own medications No 2793
13 responsibility for own medications Yes 625
14 shopping No 35
15 shopping Yes 49
16 toileting No 2843
17 toileting Yes 239
18 transferring No 2834
19 transferring Yes 904
20 using transportation No 2816
21 using transportation Yes 483
And this is the data frame obtained from the combination of your code and mine:
grouped = data.groupby(['Question Text','Answer Text']).size().reset_index(name='counts')
print(grouped)
import matplotlib.pyplot as plt
final = grouped.groupby(['Question Text','Answer Text']).sum()
print(final)
Question Text Answer Text
CI function No 513
Yes 373
bathing? No 2827
Yes 408
dressing? No 2824
Yes 423
feeding No 2851
Yes 160
housekeeping No 2803
Yes 717
preparing food No 2604
Yes 593
responsibility for own medications No 2793
Yes 625
shopping No 35
Yes 49
toileting No 2843
Yes 239
transferring No 2834
Yes 904
using transportation No 2816
Yes 483
Update 3
The original data frame has 200000 rows like this:
1 bathing? No 3529933
2 dressing? No 3529933
3 feeding No 3529933
4 housekeeping No 3529933
5 responsibility for own medications No 3529933
6 using transportation No 3529933
7 toileting No 3529933
8 transferring No 3529933
10 preparing food No 3529933
11 bathing? NaN 2864155
12 dressing? NaN 2864155
13 feeding NaN 2864155
14 housekeeping NaN 2864155
15 responsibility for own medications NaN 2864155
16 toileting NaN 2864155
17 transferring NaN 2864155
19 preparing food NaN 2864155
20 using transportation Yes 2864155
21 bathing? NaN 2921299
22 dressing? NaN 2921299

You can do it like this (df is the data frame you posted):
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
df.groupby(['Question Text','Answer Text']).sum().unstack().plot(kind='bar')
plt.show()
Output:
You can also rotate the x tick labels this way:
plt.xticks(rotation=45)
but I suggest shortening the labels to make the chart clearer.
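A minimal sketch of a one-step variant, in case it helps with the "two groupby" concern from Update 1: sum() needs a numeric column to add up, while size() only counts rows, so it can be applied to the raw text columns directly. The small frame below is made-up illustration data, not the asker's file, and the Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the raw survey data (one row per answer)
data = pd.DataFrame({
    "Question Text": ["bathing?", "bathing?", "bathing?", "feeding", "feeding"],
    "Answer Text": ["No", "No", "Yes", "No", "Yes"],
})

# Count rows per (question, answer), then move the answers into columns
counts = data.groupby(["Question Text", "Answer Text"]).size().unstack(fill_value=0)

# One group of bars per question, one colored bar per answer
ax = counts.plot(kind="bar", title="Functional Status Count", rot=45)
ax.set_ylabel("count")
plt.tight_layout()
# plt.show()  # uncomment when running interactively
```

Because the questions end up in the index of counts, pandas uses them as the x tick labels, and each answer column gets its own color and legend entry.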

Related

Display all values on a matplotlib barplot

I have a data frame with 20 values, and I am trying to plot it as a bar chart using matplotlib. When I do, I see only 10 bars instead of 20. I have 5 NaN values in it and 4 of them.
Here is a sample of dataframe:
Name Bonus
Jack Carpenter 890
John Clegg 653
Mike Holiday 367
Rene Moukad 900
........... ...
my code is standard:
fig,ax = plt.subplots(figsize=(16,6))
plt.bar(df.Name, df.Bonus)
fig.autofmt_xdate(rotation=45)

annotate a single line from a multi-line plot with labels from another pandas column matplotlib

I have been looking around, and I can find examples of annotating a single line chart by using iterrows on the dataframe. What I am struggling with is:
a) selecting a single line in the plot (using ax.lines[#] is clearly not the proper way), and
b) annotating the points on that line with values from a different column.
The dataframe dfg has the following format (edited to provide a minimal, reproducible example):
week 2016 2017 2018 2019 2020 2021 min max avg WoW Change
1 8188.0 9052.0 7658.0 7846.0 6730.0 6239.0 6730 9052 7893.7
2 7779.0 8378.0 7950.0 7527.0 6552.0 6045.0 6552 8378 7588.0 -194.0
3 7609.0 7810.0 8041.0 8191.0 6432.0 5064.0 6432 8191 7529.4 -981.0
4 8256.0 8290.0 8430.0 7083.0 6660.0 6507.0 6660 8430 7687.0 1443.0
5 7124.0 9372.0 7892.0 7146.0 6615.0 5857.0 6615 9372 7733.7 -650.0
6 7919.0 8491.0 7888.0 6210.0 6978.0 5898.0 6210 8491 7455.3 41.0
7 7802.0 7286.0 7021.0 7522.0 6547.0 4599.0 6547 7802 7218.1 -1299.0
8 8292.0 7589.0 7282.0 5917.0 6217.0 6292.0 5917 8292 7072.3 1693.0
9 8048.0 8150.0 8003.0 7001.0 6238.0 5655.0 6238 8150 7404.0 -637.0
10 7693.0 7405.0 7585.0 6746.0 6412.0 5323.0 6412 7693 7135.1 -332.0
11 8384.0 8307.0 7077.0 6932.0 6539.0 6539 8384 7451.7
12 7748.0 8224.0 8148.0 6540.0 6117.0 6117 8224 7302.6
13 7254.0 7850.0 7898.0 6763.0 6047.0 6047 7898 7108.1
14 7940.0 7878.0 8650.0 6599.0 5874.0 5874 8650 7352.1
15 8187.0 7810.0 7930.0 5992.0 5680.0 5680 8187 7066.6
16 7550.0 8912.0 8469.0 7149.0 4937.0 4937 8912 7266.6
17 7660.0 8264.0 8549.0 7414.0 5302.0 5302 8549 7291.4
18 7655.0 7620.0 7323.0 6693.0 5712.0 5712 7655 6910.0
19 7677.0 8590.0 7601.0 7612.0 5391.0 5391 8590 7264.6
20 7315.0 8294.0 8159.0 6943.0 5197.0 5197 8294 7057.0
21 7839.0 7985.0 7631.0 6862.0 7200.0 6862 7985 7480.6
22 7705.0 8341.0 8346.0 7927.0 6179.0 6179 8346 7574.7
... ... ... ... ... ... ... ... ...
51 8167.0 7993.0 7656.0 6809.0 5564.0 5564 8167 7131.4
52 7183.0 7966.0 7392.0 6352.0 5326.0 5326 7966 6787.3
53 5369.0 5369 5369 5369.0
with the graph plotted by:
fig, ax = plt.subplots(1, figsize=[14,4])
ax.fill_between(dfg.index, dfg["min"], dfg["max"], label="5 Yr. Range", facecolor="oldlace")
ax.plot(dfg.index, dfg[2020], label="2020", c="grey")
ax.plot(dfg.index, dfg[2021], label="2021", c="coral")
ax.plot(dfg.index, dfg.avg, label="5 Yr. Avg.", c="goldenrod", ls=(0,(1,2)), lw=3)
I would like to label the dfg[2021] line with the values from dfg['WoW Change']. Additionally, if anyone knows how to calculate the first value in the WoW column based on the last value from 2020 and the first value from 2021, that would be wonderful! It's currently just dfg['WoW Change'] = dfg[2021].diff()
Thanks!
Figured it out. I zipped the index and the two columns into tuples. I ended up deciding I only wanted the last value to be shown, using the code below:
import math
# zip the index and the two columns into (week, value, change) tuples
labels = list(zip(dfg.index.values, dfg[2021], dfg['WoW Change']))
# remove tuples that contain a NaN value
labels_light = [i for i in labels if not any(isinstance(n, float) and math.isnan(n) for n in i)]
# label the last point using list accessors
# note: link is not defined in this snippet
week, value, change = labels_light[-1]
ax.annotate("w/w change: " + "{:,}".format(int(change)) + link[1], xy=(week, value))
I'm sure this could have been done much better by someone who knows what they're doing, any feedback is appreciated.
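One possible tidier variant, offered as a sketch rather than the definitive way: keep the Line2D handle that ax.plot returns instead of indexing ax.lines, and let pandas find the last row that has a change value with last_valid_index(). The four-week frame below reuses a few values from the table above; the Agg backend is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small slice of the data above: week 4 of 2021 is treated as not yet available
dfg = pd.DataFrame(
    {2020: [6730.0, 6552.0, 6432.0, 6660.0],
     2021: [6239.0, 6045.0, 5064.0, np.nan]},
    index=pd.Index([1, 2, 3, 4], name="week"),
)
dfg["WoW Change"] = dfg[2021].diff()

fig, ax = plt.subplots(1, figsize=[14, 4])
ax.plot(dfg.index, dfg[2020], label="2020", c="grey")
# ax.plot returns a list of Line2D objects; keep the handle for the 2021 line
(line_2021,) = ax.plot(dfg.index, dfg[2021], label="2021", c="coral")

# Last week that has a week-over-week change
last_week = dfg["WoW Change"].last_valid_index()
last_change = dfg.loc[last_week, "WoW Change"]

# Annotate that point, reusing the line's color for the text
note = ax.annotate(
    f"w/w change: {int(last_change):,}",
    xy=(last_week, dfg.loc[last_week, 2021]),
    color=line_2021.get_color(),
)
```

To label every point instead of only the last one, the same annotate call can be placed in a loop over dfg["WoW Change"].dropna().items().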

How can I draw circle on a map?

Here is my dataframe:
Boston
Zipcode Employees Latitude Longitude
0 02021 174 -71.131057 42.228065
1 02026 193 -71.143038 42.237719
3 02109 45 -71.054027 42.363498
4 02110 14 -71.053642 42.357649
5 02111 30 -71.060280 42.350586
6 02113 77 -71.054618 42.365215
8 02115 116 -71.095106 42.343330
10 02118 318 -71.072103 42.339342
11 02119 804 -71.085268 42.323002
12 02120 168 -71.097569 42.332539
13 02121 781 -71.086649 42.305792
15 02124 1938 -71.066702 42.281721
16 02125 859 -71.053049 42.310813
17 02126 882 -71.090424 42.272444
19 02128 786 -71.016037 42.375254
21 02130 886 -71.114080 42.309087
22 02131 1222 -71.121464 42.285216
23 02132 1348 -71.168150 42.280316
24 02134 230 -71.123323 42.355355
25 02135 584 -71.147046 42.357537
26 02136 1712 -71.125550 42.255064
28 02152 119 -70.960324 42.351129
29 02163 1 -71.120420 42.367263
30 02186 361 -71.113223 42.258883
31 02199 4 -71.082279 42.346991
32 02210 35 -71.044281 42.347148
33 02215 83 -71.103877 42.348709
34 02459 27 -71.187563 42.286356
35 02467 66 -71.157691 42.314277
I want to draw circles on my map. Each circle corresponds to one point, and the size of the circle depends on the number of Employees.
Here is my map code (I tried to use markers, but I think circles are better):
boston_map=folium.Map([Boston['Longitude'].mean(), Boston['Latitude'].mean()],zoom_start=12)
incidents2=plugins.MarkerCluster().add_to(boston_map)
for Latitude,Longitude,Employees in zip(Boston.Latitude,Boston.Longitude,Boston.Employees):
    folium.Marker(location=[Latitude,Longitude],icon=None,popup=Employees).add_to(incidents2)
boston_map.add_child(incidents2)
boston_map
Here is my map:
If the number of employees can show in the circle, it will be better! Thank you very much!
To draw circles you can use CircleMarker instead of Marker.
BTW: your column names are swapped. Boston is at lat: 42.361145, long: -71.057083, but you have values around 42 in the Longitude column and values around -71 in the Latitude column.
I don't use Jupyter, so I save the map to an HTML file and use webbrowser to open it automatically in a web browser.
The raw Employees values created big circles, so I divide Employees by 10 to create smaller circles. But now some circles are very small, and the map shows the number of circles instead of the circles. Maybe math.log() or another method could be used to make the sizes smaller (normalized).
I use tooltip=str(employees) to display the number when you hover over a circle.
text = '''
Zipcode Employees Longitude Latitude
0 02021 174 -71.131057 42.228065
1 02026 193 -71.143038 42.237719
3 02109 45 -71.054027 42.363498
4 02110 14 -71.053642 42.357649
5 02111 30 -71.060280 42.350586
6 02113 77 -71.054618 42.365215
8 02115 116 -71.095106 42.343330
10 02118 318 -71.072103 42.339342
11 02119 804 -71.085268 42.323002
12 02120 168 -71.097569 42.332539
13 02121 781 -71.086649 42.305792
15 02124 1938 -71.066702 42.281721
16 02125 859 -71.053049 42.310813
17 02126 882 -71.090424 42.272444
19 02128 786 -71.016037 42.375254
21 02130 886 -71.114080 42.309087
22 02131 1222 -71.121464 42.285216
23 02132 1348 -71.168150 42.280316
24 02134 230 -71.123323 42.355355
25 02135 584 -71.147046 42.357537
26 02136 1712 -71.125550 42.255064
28 02152 119 -70.960324 42.351129
29 02163 1 -71.120420 42.367263
30 02186 361 -71.113223 42.258883
31 02199 4 -71.082279 42.346991
32 02210 35 -71.044281 42.347148
33 02215 83 -71.103877 42.348709
34 02459 27 -71.187563 42.286356
35 02467 66 -71.157691 42.314277
'''
import pandas as pd
import io
import folium
import folium.plugins
# raw string avoids an invalid-escape warning for the regex separator
boston = pd.read_csv(io.StringIO(text), sep=r'\s+')
boston_map = folium.Map([boston.Latitude.mean(), boston.Longitude.mean()], zoom_start=12)
incidents2 = folium.plugins.MarkerCluster().add_to(boston_map)
for latitude, longitude, employees in zip(boston.Latitude, boston.Longitude, boston.Employees):
    print(latitude, longitude, employees)
    # one circle per zipcode; radius grows with the number of employees
    folium.vector_layers.CircleMarker(
        location=[latitude, longitude],
        tooltip=str(employees),
        radius=employees/10,
        color='#3186cc',
        fill=True,
        fill_color='#3186cc'
    ).add_to(incidents2)
boston_map.add_child(incidents2)
# display in web browser
import webbrowser
boston_map.save('map.html')
webbrowser.open('map.html')
EDIT: the answer to the question "how to add a label on each circle in a folium.circile map python" shows how to use Marker with icon=DivIcon(text) to add text, but it doesn't work as I expected.
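The math.log() idea mentioned above could be sketched like this. The helper name scaled_radius and its base and factor values are hypothetical choices for illustration, not part of folium; they would need tuning against the real map.

```python
import math


def scaled_radius(employees, base=3.0, factor=4.0):
    """Hypothetical helper: map an employee count to a circle radius in pixels.

    log1p(n) = log(1 + n) grows slowly, so 1938 employees no longer gives a
    circle hundreds of times larger than 1 employee, and 0 maps to `base`.
    """
    return base + factor * math.log1p(employees)


# Compare the linear rule from the code above (employees / 10) with the log rule
for employees in (1, 45, 804, 1938):
    print(employees, round(employees / 10, 1), round(scaled_radius(employees), 1))
```

In the loop above, radius=employees/10 would then become radius=scaled_radius(employees).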

Trying to lookup a value from a pandas dataframe within a range of two rows in the index dataframe

I have two dataframes - "grower_moo" and "pricing" in a Python notebook to analyze harvested crops and price payments to the growers.
pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load.
I need to pull the price per ton from the pricing index to a new column in the load data if the Fat of that load is not greater than the next Wet Fat.
Below is a .head() sample of each dataframe and the code I tried. I received a ValueError: Can only compare identically-labeled Series objects error.
pricing
Price_Per_Ton Wet_Fat
0 306 10
1 339 11
2 382 12
3 430 13
4 481 14
5 532 15
6 580 16
7 625 17
8 665 18
9 700 19
10 728 20
11 750 21
12 766 22
13 778 23
14 788 24
15 797 25
grower_moo
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat
0 L2019000011817 56660 833 1.448872 21.92
1 L2019000011816 53680 1409 2.557679 21.12
2 L2019000011815 53560 1001 1.834644 21.36
3 L2019000011161 62320 2737 4.207080 21.41
4 L2019000011160 57940 1129 1.911324 20.06
grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])
Example output - grower_moo['Fat'] of 13.60 is less than 14 Fat, therefore gets a price per ton of $430
grower_moo_with_price
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat price_per_ton
0 L2019000011817 56660 833 1.448872 21.92 750
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011160 57940 1129 1.911324 20.06 728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):
This is similar to a left-join except that we match on nearest key
rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A "backward" search [the default]
selects the last row in the right DataFrame whose ‘on’ key is less
than or equal to the left’s key.
In the following code, I use your example inputs, but with column names that use underscores (_) instead of spaces.
# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')
# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')
# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
# Print result
res
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton Wet_Fat
0 L2019000011160 57940 1129 1.911324 20.06 728 20.0
1 L2019000011816 53680 1409 2.557679 21.12 750 21.0
2 L2019000011815 53560 1001 1.834644 21.36 750 21.0
3 L2019000011161 62320 2737 4.207080 21.41 750 21.0
4 L2019000011817 56660 833 1.448872 21.92 750 21.0
# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton
0 L2019000011160 57940 1129 1.911324 20.06 728
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011817 56660 833 1.448872 21.92 750
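If keeping the original row order of grower_moo matters, the same backward lookup can be sketched with NumPy's searchsorted instead of a merge. This is an alternative technique, not what merge_asof does internally; the frames below are a reduced copy of the example inputs, and price_per_ton is the column name from the question.

```python
import numpy as np
import pandas as pd

# Reduced copies of the example inputs (underscore column names)
pricing = pd.DataFrame({
    "Price_Per_Ton": [700, 728, 750, 766],
    "Wet_Fat": [19, 20, 21, 22],
})
grower_moo = pd.DataFrame({
    "Load_Ticket": ["L2019000011817", "L2019000011816", "L2019000011815",
                    "L2019000011161", "L2019000011160"],
    "Fat": [21.92, 21.12, 21.36, 21.41, 20.06],
})

# searchsorted requires the lookup keys to be sorted
pricing = pricing.sort_values("Wet_Fat")

# Position of the last Wet_Fat that is <= each Fat value
pos = np.searchsorted(
    pricing["Wet_Fat"].to_numpy(), grower_moo["Fat"].to_numpy(), side="right"
) - 1

# A Fat below the smallest Wet_Fat gives pos == -1; mark those rows as missing
prices = pricing["Price_Per_Ton"].to_numpy()
grower_moo["price_per_ton"] = np.where(pos >= 0, prices[pos], np.nan)
print(grower_moo)
```

Unlike merge_asof, this leaves grower_moo unsorted, at the cost of handling the out-of-range case by hand.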
concat_df = pd.concat([grower_moo, pricing], axis=1)
concat_df = concat_df[concat_df['Wet_Fat'] < concat_df['Fat']]
del concat_df['Wet_Fat']

Cannot remove trend and seasonal components

I am trying to build a model for predicting energy production using an ARMA model.
The data I can use for training is as follows:
(https://github.com/soma11soma11/EnergyDataSimulationChallenge/blob/master/challenge1/data/training_dataset_500.csv)
ID Label House Year Month Temperature Daylight EnergyProduction
0 0 1 2011 7 26.2 178.9 740
1 1 1 2011 8 25.8 169.7 731
2 2 1 2011 9 22.8 170.2 694
3 3 1 2011 10 16.4 169.1 688
4 4 1 2011 11 11.4 169.1 650
5 5 1 2011 12 4.2 199.5 763
...............
11995 19 500 2013 2 4.2 201.8 638
11996 20 500 2013 3 11.2 234 778
11997 21 500 2013 4 13.6 237.1 758
11998 22 500 2013 5 19.2 258.4 838
11999 23 500 2013 6 22.7 122.9 586
As shown above, I can use data from July 2011 to May 2013 for training.
Using the training data, I want to predict energy production in June 2013 for each of the 500 houses.
The problem is that the time series is not stationary and has trend and seasonal components (I checked this as follows).
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data_train = pd.read_csv('../../data/training_dataset_500.csv')
rng=pd.date_range('7/1/2011', '6/1/2013', freq='M')
house1 = data_train[data_train.House==1][['EnergyProduction','Daylight','Temperature']].set_index(rng)
fig, axes = plt.subplots(nrows=1, ncols=3)
for i, column in enumerate(house1.columns):
    house1[column].plot(ax=axes[i], figsize=(14,3), title=column)
plt.show()
With this data, I cannot get a good prediction from an ARMA model. So I want to remove the trend and seasonal components and make the time series stationary. I tried, but I could not remove these components and make it stationary.
I would recommend the Hodrick-Prescott (HP) filter, which is widely used in macroeconometrics to separate the long-term trend component from short-term fluctuations. It is implemented as statsmodels.api.tsa.filters.hpfilter.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
df = pd.read_csv('/home/Jian/Downloads/data.csv', index_col=[0])
# get part of the data
x = df.loc[df.House==1, 'Daylight']
# HP filter; lamb=129600 follows the usual suggestion for monthly data
# hpfilter returns (cycle, trend): the cycle is the detrended series
x_cycle, x_trend = sm.tsa.filters.hpfilter(x, lamb=129600)
fig, axes = plt.subplots(figsize=(12,4), ncols=3)
axes[0].plot(x)
axes[0].set_title('raw x')
axes[1].plot(x_trend)
axes[1].set_title('trend')
axes[2].plot(x_cycle)
axes[2].set_title('cycle (detrended x)')
plt.show()
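The HP filter addresses the trend; for the seasonal component, a common alternative is differencing. The sketch below swaps in that technique on a synthetic monthly series (made up for illustration, not the challenge data): a 12-month difference removes a yearly pattern, and a further first difference removes what is left of a linear trend.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (illustration only): linear trend + 12-month cycle
months = pd.date_range("2011-07-01", periods=36, freq="MS")
t = np.arange(36)
series = pd.Series(700 + 2.0 * t + 50 * np.sin(2 * np.pi * t / 12), index=months)

# Seasonal difference: subtract the value from 12 months earlier
deseasonalized = series.diff(12).dropna()

# First difference of that result removes the remaining linear trend
stationary = deseasonalized.diff().dropna()

print(deseasonalized.head())
print(stationary.head())
```

Each difference costs observations: with only 24 months per house, a 12-month difference leaves 12 points, so this may be too expensive for the data in the question.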
