I need a plot of aggregated data
import pandas as pd
basic_data= pd.read_csv('WHO-COVID-19-global-data _2.csv',parse_dates= ['Date_reported'] )
cum_daily_cases = basic_data.groupby('Date_reported')[['New_cases']].sum()
import pylab
x = cum_daily_cases['Date_reported']
y = cum_daily_cases['New_cases']
pylab.plot(x,y)
pylab.show()
Error: 'Date_reported'
Input: Date_reported, Country_code, Country, WHO_region, New_cases, Cumulative_cases, New_deaths, Cumulative_deaths 2020-01-03,AF,Afghanistan,EMRO,0,0,0,0
Expected output: the total number of "New cases" per day, shown on the plot.
What should I do to get this plot to work? (link to dataset)
The column names contain a leading space (this is easy to see by checking basic_data.dtypes). Fix that by adding the following line immediately after basic_data is read:
basic_data.columns = [s.strip() for s in basic_data.columns]
In addition, your x variable should be the index after groupby-sum, not a column Date_reported. Correction:
x = cum_daily_cases.index
The plot should show as expected.
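Putting both fixes together, a minimal sketch might look like the following. The inline CSV is made-up sample data standing in for the WHO file (only the header layout with leading spaces mirrors the question), and the Agg backend is an assumption so the script also runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows; note the leading spaces in the header, as in the question.
csv_text = """Date_reported, Country_code, Country, New_cases
2020-01-03,AF,Afghanistan,1
2020-01-03,AL,Albania,2
2020-01-04,AF,Afghanistan,3
2020-01-04,AL,Albania,4
"""

basic_data = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date_reported"])
# Strip the stray whitespace so " New_cases" becomes "New_cases".
basic_data.columns = [s.strip() for s in basic_data.columns]

# After groupby + sum, Date_reported is the index, not a column.
daily_cases = basic_data.groupby("Date_reported")[["New_cases"]].sum()

x = daily_cases.index
y = daily_cases["New_cases"]

fig, ax = plt.subplots()
ax.plot(x, y)
# In an interactive session, call plt.show() here.
```

If you would rather keep Date_reported as a regular column, calling reset_index() on the grouped result should also work.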
I have a dataframe that has 3 columns. I built it from a bigger dataframe like this:
new_df = df[['client_name', 'time_window_end', 'tag_count']]
Then I used groupby to find the number of tags for each client on each day, using this code:
new_df.groupby(['client_name' ,'time_window_end']) ['tag_count'].count()
In total I have 70 client names in a list, and I want to loop through the list to draw a line plot for each customer name. On the x-axis I want 'time_window_end' and on the y-axis I want 'tag_count'.
I want 70 plots, but the for loop I have written does not do that. I would be happy if you could help me fix it.
clients = new_df['client_name'].unique()
client_list = clients.tolist()
for client in client_list[:60]:
    temp = new_df.loc[new_df['client_name'] == client]
    x = temp.groupby(temp['time_window_end'].dt.floor('d'))['tag_count'].sum()
    df2 = x.to_frame()
    df2.reset_index(inplace=True)
    df2["time_window_end"]= pd.to_datetime(df2["time_window_end"])
    line_chart = df2.copy()
    plt.plot(line_chart.reset_index()["time_window_end"], x)
If I'm understanding this right, it sounds like the seaborn package might have what you need. The plotting functions take the argument 'hue' which splits plots up into multiple lines, based on the data in a column
import seaborn as sn
new_df = new_df.groupby(['client_name' ,'time_window_end']) ['tag_count'].count().reset_index()
sn.relplot(
data = new_df,
x = pd.to_datetime(new_df["time_window_end"]),
y = 'tag_count',
hue = 'client_name',
kind = 'line')
EDIT: to get multiple plots
import seaborn as sn
new_df["time_window_end"] = pd.to_datetime(new_df["time_window_end"])
g = sn.FacetGrid(
data = new_df,
row = 'client_name')
g.map(sn.lineplot, 'time_window_end', 'tag_count')
EDIT again: to get separate plot images
import matplotlib.pyplot as plt
for name in pd.unique(new_df.client_name):
    sn.lineplot(
        data = new_df.loc[new_df.client_name == name],
        x = 'time_window_end',
        y = 'tag_count',
        label = name)
    plt.show()  # inside the loop, so each client gets its own figure
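If you prefer to stay with plain matplotlib, one figure per client can be sketched roughly as below. The sample data is hypothetical (only the three column names come from the question), the figures dictionary is just an illustrative way to keep hold of the results, and the Agg backend is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data with the three columns from the question.
new_df = pd.DataFrame({
    "client_name": ["a", "a", "a", "b", "b"],
    "time_window_end": pd.to_datetime([
        "2021-01-01 08:00", "2021-01-01 17:00", "2021-01-02 09:00",
        "2021-01-01 10:00", "2021-01-03 11:00",
    ]),
    "tag_count": [1, 2, 3, 4, 5],
})

# Sum tag_count per client and per calendar day in a single groupby.
daily = (
    new_df
    .groupby(["client_name", new_df["time_window_end"].dt.floor("d")])["tag_count"]
    .sum()
    .reset_index()
)

figures = {}
for client, group in daily.groupby("client_name"):
    fig, ax = plt.subplots()  # a fresh figure for every client
    ax.plot(group["time_window_end"], group["tag_count"], marker="o")
    ax.set_title(client)
    ax.set_xlabel("time_window_end")
    ax.set_ylabel("tag_count")
    figures[client] = fig
```

With 70 clients it is probably worth saving each figure with fig.savefig(...) and then calling plt.close(fig) inside the loop, so that not all figures stay open at once.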
I have an assignment from my Python class to mine a data set consisting of CO2 emissions from (almost) all the countries in the world from 1960 to 2011. One of the tasks I've been working on is to produce a line graph that shows the growth of CO2 production in a specific country, and I'd like to avoid plotting zeros in the graph. Here is the code I've been using.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sn
import numpy as np
# Creating DataFrame
Data = pd.read_excel('CO2 Sorted Data.xlsx')
df = pd.DataFrame(Data, columns=['Year','CountryName','Region','IncomeType','CO2Emission','CO2EmCMT'])
df.replace(0,np.nan,inplace=True)
print(df)
# Creating the Pivot Table
pvt = df.pivot_table(index=['Year'],columns=['CountryName'],values='CO2Emission',aggfunc='sum')
# Creating the Graph
pvt2 = pvt.reindex()
CO2Country = input('Input Country Name = ')
remove_zero=pvt2[CO2Country]
rz1=[i for i in remove_zero if i !=0]
plt.plot(rz1,c='red')
plt.title('CO2 Emission of ' +CO2Country +' (1960-2011)', fontsize=10)
plt.xlabel('Year',fontsize=10)
plt.ylabel('CO2 Emission (kiloton)')
plt.grid(True)
plt.show
If I input Aruba, for example, the output looks like this.
[Figure: line graph of Aruba]
However, the x-axis only shows the count of years in the requested data, not the years themselves. I have no clue what triggers this other than changing the zeroes to NaN, but that doesn't make sense to me. How can I make the x-axis show the true years, as in 1986-2011?
Here is a glimpse of the data:
To get the x-axis to show years, pair each value with its year, for example by enumerating the data. Note that this labels the first value as 1960, so it assumes that no years were dropped from the start or the middle of the series.
So: data = list(enumerate(rz1, start=1960))
There are two ways to plot this new data: convert it to a NumPy array and transpose it, or use the zip function. Both produce the same plot.
data = list(zip(*data))
or
data = np.array(data).transpose()
The final code (in the "Creating the Graph" section) is:
# Creating the Graph
pvt2 = pvt.reindex()
CO2Country = 'Aruba'
remove_zero=pvt2[CO2Country]
rz1=[i for i in remove_zero if i !=0]
data = list(enumerate(rz1, start=1960))
# data= np.array(data).transpose()
data= list(zip(*data))
plt.plot(data[0], data[1],c='red')
Side note: call plt.show(), not plt.show
I don't know about the pivoting, but the following works fine:
df.Year = pd.to_datetime(df.Year, format="%Y")
aruba = df[df.CountryName == "Aruba"].set_index("Year")
aruba.CO2Emission.plot()
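The idea of letting the Year index drive the x-axis can be sketched as follows. The numbers are made up (only the column names come from the question), the variable names country and series are illustrative, and the Agg backend is an assumption so the script runs without a display. Because each value stays attached to its own year, dropping the empty years cannot shift the labels.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up values; zeros mark years without data, as in the question.
df = pd.DataFrame({
    "Year": [1984, 1985, 1986, 1987, 1988],
    "CountryName": ["Aruba"] * 5,
    "CO2Emission": [0, 0, 180.0, 450.0, 610.0],
})

country = "Aruba"
series = (
    df[df["CountryName"] == country]
    .set_index("Year")["CO2Emission"]
    .replace(0, np.nan)   # treat zeros as missing
    .dropna()             # drop them; the remaining years keep their labels
)

fig, ax = plt.subplots()
ax.plot(series.index, series.values, c="red")
ax.set_title("CO2 Emission of " + country, fontsize=10)
ax.set_xlabel("Year", fontsize=10)
ax.set_ylabel("CO2 Emission (kiloton)")
ax.grid(True)
# In an interactive session, call plt.show() here (with the parentheses).
```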
I am trying to make a linear regression plot for the data provided
import pandas
from pandas import DataFrame
import matplotlib.pyplot
data = pandas.read_csv('cost_revenue_clean.csv')
data.describe()
X = DataFrame(data,columns=['production_budget_usd'])
y = DataFrame(data,columns=['worldwide_gross_usd'])
When I try to plot it
matplotlib.pyplot.scatter(X,y)
matplotlib.pyplot.show()
the plot is completely empty,
and when I print the type of each element of X
for element in X:
    print(type(element))
it shows that the type is string. Where am I going wrong?
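No need to make new DataFrames for X and y. Note that iterating over a DataFrame yields its column labels, so your loop prints str regardless of what the column holds; check data.dtypes instead. Try astype(float) if you want the values as numeric: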
X = data['production_budget_usd'].astype(float)
y = data['worldwide_gross_usd'].astype(float)
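A small sketch of both points, using a hypothetical stand-in for cost_revenue_clean.csv in which the numbers happen to be stored as text (whether that is true of the real file is an assumption):

```python
import pandas as pd

# Hypothetical stand-in for the CSV, with numbers stored as text.
data = pd.DataFrame({
    "production_budget_usd": ["1000", "2500", "4000"],
    "worldwide_gross_usd": ["3000", "2000", "9000"],
})

# Iterating over a DataFrame yields column labels, not cell values,
# which is why the loop in the question reports str.
labels = [element for element in data[["production_budget_usd"]]]

# dtypes is the reliable way to see how the values are stored.
print(data.dtypes)

# Convert to floats before plotting.
X = data["production_budget_usd"].astype(float)
y = data["worldwide_gross_usd"].astype(float)
print(X.dtype, y.dtype)
```

If astype(float) raises a ValueError, the column probably contains non-numeric characters that need cleaning first.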
I have a dataset with mostly non-numeric columns. I would love to create a visualization for it, but I am getting an error message.
My data set looks like this
|plant_name|Customer_name|Job site|Delivery.Date|DeliveryQuantity|
|SN13|John|Sweden|01.01.2019|6|
|SN14|Ruth|France|01.04.2018|4|
|SN15|Jane|Serbia|01.01.2019|2|
|SN11|Rome|Denmark|01.04.2018|10|
|SN14|John|Sweden|03.04.2018|5|
|SN15|John|Sweden|04.09.2019|7|
I need to create a lineplot to show how many times John made a purchase using Delivery Date as my timeline (x-axis)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option("display.max_rows", 5)
hr_data = pd.read_excel("D:\data\Days_Calculation.xlsx", parse_dates = True)
x = hr_data['DeliveryDate']
y = hr_data ['Customer_name']
sns.lineplot(x,y)
Error: No numeric types to aggregate
My expected result should be a line graph like this
John's marker will be present on the timeline (Delivery Date) at "01.01.2019", "03.04.2018" and "04.09.2019".
Two further questions:
How can one plot a string against a float, for example the total quantity (DeliveryQuantity) against Customer Name?
How does one format the spacing of a plot's axes (not the labels)?
Why not make Delivery Date a timestamp object instead of a string?
hr_data["Delivery.Date"] = pd.to_datetime(hr_data["Delivery.Date"])
Now you have plotting options.
Working with John:
john_data = hr_data[hr_data["Customer_name"]=="John"]
sns.countplot(x=john_data["Delivery.Date"])
Generally speaking, you have to aggregate something when working with categorical data. Whether you are counting names in a column, adding up the number of orders, or ranking categories, the result is still numeric data.
plot_data = hr_data.pivot_table(index='Delivery.Date', columns='Customer_name', values='DeliveryQuantity', aggfunc='sum')
plot_data.plot(legend=False)
plt.xticks(LISTOFVALUESFORXRANGE)  # placeholder: replace with the tick positions you want
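As a rough, self-contained sketch of both aggregations, here is the sample table from the question typed in by hand. Two things are assumptions: the dates are read as day.month.year, and the Agg backend is used so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# The sample rows from the question.
hr_data = pd.DataFrame({
    "plant_name": ["SN13", "SN14", "SN15", "SN11", "SN14", "SN15"],
    "Customer_name": ["John", "Ruth", "Jane", "Rome", "John", "John"],
    "Job site": ["Sweden", "France", "Serbia", "Denmark", "Sweden", "Sweden"],
    "Delivery.Date": ["01.01.2019", "01.04.2018", "01.01.2019",
                      "01.04.2018", "03.04.2018", "04.09.2019"],
    "DeliveryQuantity": [6, 4, 2, 10, 5, 7],
})
# Assumption: the dates are day.month.year.
hr_data["Delivery.Date"] = pd.to_datetime(hr_data["Delivery.Date"], format="%d.%m.%Y")

# Number of purchases John made on each delivery date (sorted by date).
john = hr_data[hr_data["Customer_name"] == "John"]
john_counts = john.groupby("Delivery.Date").size()

# Total quantity per customer: a string column against a numeric one.
totals = hr_data.groupby("Customer_name")["DeliveryQuantity"].sum()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(john_counts.index, john_counts.values, marker="o")
ax1.set_title("John: purchases per delivery date")
ax1.tick_params(axis="x", labelrotation=45)
ax2.bar(totals.index, totals.values)
ax2.set_title("Total DeliveryQuantity per customer")
```

The design choice is to aggregate first (size() for counts, sum() for quantities) and only then plot, so the y-axis always receives numbers.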
I have a bunch of data files. Each file contains a number of samples per time frame, and that number depends on the system. For instance, at time t=1 a file might have 10 items or 20 items, but at later times in that same file there will always be the same number of items. The format is time, x, y, z in columns, loaded into a numpy array. The time values show which frame a row belongs to; as mentioned, the count per frame is constant, so let's use 10 as an example. So I'll have a (10,4) numpy array where the time values are identical, but there are many frames in the file, say 100, so really I have (1000,4).
I want to plot the data with time on the x-axis and manipulations of the other data on the y-axis, but I am unsure how to do this with the line plot methods in matplotlib. Normally, to provide both x and y values I believe I need to do a scatter plot, so I'm hoping there's a better way.
What I ideally want is to treat each row position within a frame as a different series (so it is coloured differently), with the data for that same row position in the next frame (time value) given the same colour, producing nice contiguous lines. We can look at the time column and figure out how many items share a time code; let's call it "n". Sample code:
a = numpy.loadtxt('sampledata.txt')
plt.plot(a[:0,:,n],a[:1,:1])
plt.show()
I think this code expresses what I'm going for, though it doesn't work.
Edit:
I hope this is what you wanted.
The seaborn scatterplot can group data points that share the same code (the time code in this case) and give each group its own color.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"E:\Programming\Python\Matplotlib\timecodes.csv",
names=["time","x","y","z","code"]) #use your file
df["time"]=pd.to_datetime(df["time"]) #recognize the data as Time
df["x"]=df["time"].dt.day # I changed the data into "Date only" and imported to x column. Easier to see on graph.
#just used random numbers in y and z in my data.
sns.scatterplot(x="x", y="y", data=df, hue="code") #hue does the grouping
plt.show()
I used a CSV file here, but you can do the same with your text file by adding sep="\t" to the arguments. I also added a code column to the file. If you have such a column, hue can use it to group the data in the graph, so you don't have to separate the data or build a hierarchical index. If you want to change the colors or the grouping, please see the seaborn website.
Hope this helps.
As an alternative, here is the method I used, though Tim's answer is accurate as well. Since the time codes are not date/time information, I modified my own code to add tags as a second column, which I call "p" (they're polymers).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
datain = np.loadtxt('somefile.txt')
df = pd.DataFrame(data = datain, columns = ["t","p","x","y","z"])
ax = sns.scatterplot(x="t", y="x", data = df, hue = "p")
plt.show()
And of course the other columns can be plotted similarly if desired.
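For completeness, the contiguous lines asked for in the question can also be drawn with plain matplotlib by reshaping the array, without adding a tag column. This is only a sketch under two assumptions: the rows are ordered frame by frame, and every frame holds the same n items in the same order. The array below is synthetic data standing in for the loaded file, and the Agg backend is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the loaded file: 100 frames of 10 items,
# columns time, x, y, z, rows ordered frame by frame.
n_frames, n_items = 100, 10
rng = np.random.default_rng(0)
t = np.repeat(np.arange(n_frames, dtype=float), n_items)
xyz = rng.normal(size=(n_frames * n_items, 3)).cumsum(axis=0)
a = np.column_stack([t, xyz])  # shape (1000, 4)

# n = number of rows that share the first time code.
n = int(np.count_nonzero(a[:, 0] == a[0, 0]))

# Reshape to (frames, items, columns).
frames = a.reshape(-1, n, 4)
times = frames[:, 0, 0]      # one time value per frame, shape (100,)
x_by_item = frames[:, :, 1]  # x values, shape (100, 10)

fig, ax = plt.subplots()
# With a 2D y array, plot draws one line per column, i.e. one per item,
# and each line gets its own colour.
lines = ax.plot(times, x_by_item)
ax.set_xlabel("time")
ax.set_ylabel("x")
```

Swapping the column index 1 for 2 or 3 plots y or z in the same way.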