I am trying to create a plot which shows each individual's trajectory as well as the mean. This is working OK except that there appear to be extra lines and the lines go backwards, even after sorting the values.
Example:
import pandas as pd
import plotly.graph_objects as go
df = pd.DataFrame({"id": [1,1,1,1,2,2,2,2],
"months": [0,1,2,3,0,1,2,3],
"outcome":[5,2,7,11,18,3,15,3]})
#sort by each individual and the months ie. time column
df.sort_values(by=["id", "months"], inplace=True)
#create mean to overlay on plot
grouped = df.groupby("months")["outcome"].mean().reset_index()
#create plot
fig = go.Figure()
fig.add_trace(go.Scatter(x= df['months'], y= df['outcome'], name = "Individuals"))
fig.add_trace(go.Scatter(x=grouped['months'], y=grouped['outcome'], name = "Mean"))
fig.write_image("test.jpeg", scale = 2)
fig.show()
Now that I'm looking at it, it actually looks like it's just creating one giant line for all IDs together, whereas I'd like one line for ID 1 and one line for ID 2.
Any help much appreciated. Thanks in advance.
I believe the issue is in your x-values. In Pycharm, I looked at the dataframe and it looks like this:
Your months go from 0-3 and then back to 0-3. I'm a little unclear on what you want to do though - do you want to display only the ones with IDs that match? Such as all the ID with 1 and ID with 2?
Let us know what you expect to see given this dataframe I'm showing, it would be helpful.
EDIT: So, I couldn't read the original question properly at first. Looking at it more, I believe I can at least answer the first portion; however, that led me to another bug. The line in question should be changed like so:
fig.add_trace(go.Scatter(x=df['months'][df['id'] == 1], y=df['outcome'][df['id'] == 1], name="Individuals"))
This will pull from the dataframe only the rows where id == 1. However, this then won't show on your graph, since your grouped dataframe doesn't fall within the same bounds.
So I have this df; the columns that I'm interested in visualizing later with matplotlib are 'incident_date' and 'fatalities'. I want to create two diagrams. One will display the number of incidents with injuries (the column named 'fatalities' says whether it was a fatal accident, one with injuries, or neither); the other will display the dates with the most deaths. So, in order to do that, I need to somehow turn the data in the 'fatalities' column into numeric values.
This is my df's head, so you get an idea
I created dummy data based on the picture you provided:
import pandas as pd
data = {'incident_date': ['1-Mar-20','1-Mar-20','3-Mar-20','3-Mar-20','3-Mar-20','5-Mar-20','6-Mar-20','7-Mar-20','7-Mar-20'],
        'fatalities': ['Fatal','Fatal','Injuries','Injuries','Neither','Fatal','Fatal','Fatal','Fatal'],
        'conclusion_number': [1,1,3,23,23,34,23,24,123]}
df = pd.DataFrame(data)
All you need to do is group by incident_date and fatalities, and you will get the numerical values for each particular date and each particular type of incident.
df_grp = df.groupby(['incident_date','fatalities'],as_index=False)['conclusion_number'].count()
df_grp.rename({'conclusion_number':'counts'},inplace=True, axis=1)
The output of the above is a dataframe with one row per (incident_date, fatalities) pair and a counts column.
Once you have the counts column, you can create your matplotlib diagrams.
Let me know if you need help with diagrams as well
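If it helps with the diagrams, the same counts can also be laid out with one column per category, which makes it easy to pick out the series for each plot. This is just a sketch on the dummy data above; using pd.crosstab and parsing the dates with the '%d-%b-%y' format are my own assumptions about what fits the data.

```python
import pandas as pd

data = {'incident_date': ['1-Mar-20', '1-Mar-20', '3-Mar-20', '3-Mar-20', '3-Mar-20',
                          '5-Mar-20', '6-Mar-20', '7-Mar-20', '7-Mar-20'],
        'fatalities': ['Fatal', 'Fatal', 'Injuries', 'Injuries', 'Neither',
                       'Fatal', 'Fatal', 'Fatal', 'Fatal'],
        'conclusion_number': [1, 1, 3, 23, 23, 34, 23, 24, 123]}
df = pd.DataFrame(data)

# Parse the dates so they sort chronologically instead of alphabetically
df['incident_date'] = pd.to_datetime(df['incident_date'], format='%d-%b-%y')

# Rows are dates, columns are the categories, cells are incident counts
counts = pd.crosstab(df['incident_date'], df['fatalities'])

# Series for the two diagrams described in the question
injuries_per_day = counts['Injuries']
top_fatal_days = counts['Fatal'].sort_values(ascending=False)
```

From there, something like injuries_per_day.plot(kind='bar') or top_fatal_days.head(10).plot(kind='bar') should give the two charts.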
I have a dataframe, where I would like to make a time series plot with three different lines that each show the daily occurrences (the number of rows per day) for each of the values in another column.
To give an example, for the following dataframe, I would like to see the development for how many a's, b's and c's there have been each day.
df = pd.DataFrame({'date':pd.to_datetime(['2019-10-10','2019-10-14','2019-10-09','2019-10-10','2019-10-08','2019-10-14','2019-10-10','2019-10-08','2019-10-08','2019-10-13','2019-10-08','2019-10-12','2019-10-11','2019-10-09','2019-10-08']),
'letter':['a','b','c','a','b','b','b','b','c','b','b','a','b','a','c']})
When I try the command below (my best guess so far), however, it does not filter for the different dates (I would like three lines, one for each of the letters).
Any ideas on how to solve this?
df.groupby(['date']).count().plot()['letter']
I have also tried a solution in Matplotlib, though this one gives an error:
fig, ax = plt.subplots()
ax.plot(df['date'], df['letter'].count())
Based on your question, I believe you are looking for a line plot with dates on the x-axis and the counts of letters on the y-axis. To achieve this, these are the steps you will need to follow:
1. Group the dataframe by date and then letter, and get the number of rows for each group using size().
2. Flatten the grouped dataframe using reset_index(), rename the new column to Counts and sort by the letter column (so that the legend shows the letters in alphabetical order). These steps are mostly about keeping the new dataframe and the graph clean and presentable. I would suggest you run each step separately and print the result, so that you know what is happening at each step.
3. Plot each line separately by filtering the dataframe for each specific letter.
4. Show the legend and rotate the dates so that they are easier to read.
The code is shown below.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'date':pd.to_datetime(['2019-10-10','2019-10-14','2019-10-09','2019-10-10','2019-10-08','2019-10-14','2019-10-10','2019-10-08','2019-10-08','2019-10-13','2019-10-08','2019-10-12','2019-10-11','2019-10-09','2019-10-08']),
                   'letter':['a','b','c','a','b','b','b','b','c','b','b','a','b','a','c']})
df_grouped = df.groupby(by=['date', 'letter']).size().reset_index() ## New DF for grouped data
df_grouped.rename(columns = {0 : 'Counts'}, inplace = True)
df_grouped.sort_values(['letter'], inplace=True)
colors = ['r', 'g', 'b'] ## One color per letter, change as per your preference
for i, ltr in enumerate(df_grouped.letter.unique()):
    ## Plot only the rows belonging to the current letter
    plt.plot(df_grouped[df_grouped.letter == ltr].date, df_grouped[df_grouped.letter == ltr].Counts, '-o', label=ltr, c=colors[i])
plt.gcf().autofmt_xdate() ## Rotate X-axis so you can see dates clearly without overlap
plt.legend() ## Show legend
Output graph
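As an alternative sketch (not part of the answer above), the grouped counts can be pivoted so that each letter becomes its own column, and pandas then draws one line per column. Filling missing date/letter combinations with 0 is my own assumption; leave out fill_value if you would rather have gaps.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2019-10-10', '2019-10-14', '2019-10-09', '2019-10-10',
                                           '2019-10-08', '2019-10-14', '2019-10-10', '2019-10-08',
                                           '2019-10-08', '2019-10-13', '2019-10-08', '2019-10-12',
                                           '2019-10-11', '2019-10-09', '2019-10-08']),
                   'letter': ['a', 'b', 'c', 'a', 'b', 'b', 'b', 'b', 'c', 'b', 'b', 'a', 'b', 'a', 'c']})

# Rows are dates, columns are letters, cells are the number of rows per day
counts = df.groupby(['date', 'letter']).size().unstack(fill_value=0)

# One line per column (letter); requires matplotlib to be installed
ax = counts.plot(marker='o')
ax.set_ylabel('Counts')
```

With this layout the legend and the colors are handled automatically, at the cost of less control over each individual line.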
Updated with more info
I've seen this answered on here for single line plots, but I need help with a plot showing two variables, if that matters at all... I am fairly new to Python in general. My line graph shows two different departments' funding over the years. I just want to reformat the y axis to display as a number in the hundreds of millions.
Using a csv for the general public funding report of Minneapolis.
msp_df = pd.read_csv('Minneapolis_Data_Snapshot_v2.csv',error_bad_lines=False)
msp_df.info()
Saved just the two depts I was interested in, to a dataframe.
CPED_df = (msp_df['Unnamed: 0'] == 'CPED')
msp_df.iloc[CPED_df.values]
police_df = (msp_df['Unnamed: 0'] == 'Police')
msp_df.iloc[police_df.values]
("test" is the new name of my data frame containing all the info as seen below.)
test = pd.DataFrame({'Year': range(2014,2021),
'CPED': msp_df.iloc[CPED_df.values].T.reset_index(drop=True).drop(0,0)[5].tolist(),
'Police': msp_df.iloc[police_df.values].T.reset_index(drop=True).drop(0,0)[4].tolist()})
(The numbers from the original dataset were being read as strings because of the commas, so I had to fix that first.)
test['Police2'] = test['Police'].str.replace(',','').astype(int)
test['CPED2'] = test['CPED'].str.replace(',','').astype(int)
And here is my code for the plot. It executes; I just want to reformat the y axis number scale. Right now it just shows up as a decimal. (I've already imported pandas, seaborn and matplotlib.)
plt.plot(test.Year, test.Police2, test.Year, test.CPED2)
plt.ylabel('Budget in Hundreds of Millions')
plt.xlabel('Year')
Current plot
Any help super appreciated! Thanks :)
The easiest way to reformat the y axis, forcing it to show certain values, is to use
plt.yticks(ticks, labels)
For example, if you want to display only values from 0 to 1, you can do:
plt.yticks([0,0.2,0.5,0.7,1], ['a', 'b', 'c', 'd', 'e'])
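If you would rather not list the ticks by hand, another option is a tick formatter that rescales whatever ticks matplotlib picks. This is a rough sketch using matplotlib.ticker.FuncFormatter; the budget figures below are made up purely for illustration (they are not the Minneapolis numbers), and the helper name hundreds_of_millions is my own.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def hundreds_of_millions(value, pos):
    """Show a raw dollar amount as a number of hundreds of millions."""
    return f"{value / 1e8:.1f}"


# Made-up numbers, for illustration only
years = [2014, 2015, 2016]
police = [150_000_000, 160_000_000, 170_000_000]
cped = [90_000_000, 95_000_000, 100_000_000]

fig, ax = plt.subplots()
ax.plot(years, police, label="Police")
ax.plot(years, cped, label="CPED")

# Rescale the tick labels only; the plotted data is left unchanged
ax.yaxis.set_major_formatter(FuncFormatter(hundreds_of_millions))
ax.set_xlabel("Year")
ax.set_ylabel("Budget in Hundreds of Millions")
ax.legend()
```

With the asker's dataframe, the two plot calls would presumably use test.Year with test.Police2 and test.CPED2 instead of the made-up lists.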
I'm trying to create a stacked bar-graph which shows two transaction types for a customer. The graph is sorted into columns by week.
Sample code within my code structure is below:
%matplotlib inline
import pandas as pd
values = [('1','2019-07-28','retail',11),
('1','2019-07-28','wholesale',18),
('1','2019-08-04','retail',7),
('1','2019-08-04','wholesale',12),
('1','2019-08-11','retail',6),
('1','2019-08-11','wholesale',16)]
columns = ['customer_id','week',
'transaction_type',
'sale_count']
df = pd.DataFrame(values, columns=columns)
df.groupby(['week','transaction_type']).size()\
.unstack()\
.plot(sort_columns='week',
kind='bar', stacked=True);
The result I'm getting is a row count for each transaction_type as either 1 or 2
current:
What I need is a stacked bar graph that gives the sum of sale_count for each date listed in week like the one below
expected:
Can anyone tell me what I'm doing wrong here?
Similar to what was suggested in the comments:
(df.groupby(['week','transaction_type'])['sale_count']
.sum().unstack('transaction_type')
.plot.bar(stacked=True)
)
Output:
@Quang Hoang's answer is correct and should be accepted and upvoted. This is just a note about formatting the code. I think it is better to get rid of the extra round brackets and move the legend outside the plot, as in the following code:
df.groupby(['week','transaction_type'])['sale_count']\
.sum().unstack('transaction_type')\
.plot.bar(stacked=True, rot=0)\
.legend(bbox_to_anchor=(1.3, 1.0));
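To see why the original attempt gave row counts, it can help to compare size() and sum() side by side on the sample data. This is just an illustrative sketch; the variable names row_counts and sale_totals are mine.

```python
import pandas as pd

values = [('1', '2019-07-28', 'retail', 11),
          ('1', '2019-07-28', 'wholesale', 18),
          ('1', '2019-08-04', 'retail', 7),
          ('1', '2019-08-04', 'wholesale', 12),
          ('1', '2019-08-11', 'retail', 6),
          ('1', '2019-08-11', 'wholesale', 16)]
columns = ['customer_id', 'week', 'transaction_type', 'sale_count']
df = pd.DataFrame(values, columns=columns)

grouped = df.groupby(['week', 'transaction_type'])['sale_count']

# size() counts how many rows fall in each group, ignoring sale_count's values
row_counts = grouped.size().unstack('transaction_type')

# sum() adds up sale_count within each group, which is what the chart needs
sale_totals = grouped.sum().unstack('transaction_type')
```

In this sample every group has exactly one row, so row_counts is all ones, while sale_totals holds the actual sale counts per week.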
Good morning,
I am trying to iterate through a CSV to produce a title for each stock chart that I am making.
The CSV is formatted as: Ticker, Description spanning about 200 rows.
The code is shown below:
df_symbol_description = pd.read_csv('C:/TS/Combined/Tickers Desc.csv')
print(df_symbol_description['Description'])
for r in df_symbol_description['Description']:
    plt.suptitle(df_symbol_description['Description'][r],size = '20')
It is erroneous as it comes back with this error: "KeyError: 'iShrs MSCI ACWI ETF'"
This error is just showing me the first ticker description in the CSV. If anyone knows how to fix this, it is much appreciated!
Thank you
I don't know how to fix the error, since it's unclear what you are trying to achieve, but we can have a look at the problem itself.
Consider this example, which is essentially a scaled-down version of your code.
import pandas as pd
df=pd.DataFrame({"x" : ["A","B","C"]})
for r in df['x']:
    print(r, df['x'][r])  # raises KeyError: 'A' on the first iteration
The dataframe consists of one column, called x which contains the values "A","B","C". In the for loop you select those values, such that for the first iteration r is "A". You are then using "A" as an index to the column, which is not possible, since the column would need to be indexed by 0,1 or 2, but not the string that it contains.
So in order to print the column values, you can simply use
for r in df['x']:
    print(r)
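If the goal is one title per chart, a possible approach is to loop over the ticker and its description together, so that no lookup by value is needed. This is only a sketch: the small dataframe stands in for the CSV, the 'Ticker' column name is taken from the question's description of the file, and the ticker values and the second row are made up.

```python
import pandas as pd

# Stand-in for pd.read_csv(...); ticker values and the second row are placeholders
df_symbol_description = pd.DataFrame({
    "Ticker": ["ACWI", "XYZ"],
    "Description": ["iShrs MSCI ACWI ETF", "Hypothetical Example Fund"],
})

titles = []
# zip walks both columns in step, so each description is used directly
for ticker, description in zip(df_symbol_description["Ticker"],
                               df_symbol_description["Description"]):
    titles.append(f"{ticker}: {description}")
```

Inside the real charting loop, the description (or the combined string) could then be passed to plt.suptitle for the chart being drawn.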