Removing zero values from pivot table somehow creates another problem

I have an assignment from my Python class to mine a data set consisting of CO2 emissions from (almost) all the countries in the world from 1960 to 2011. One of the tasks I've been working on is to produce a line graph that represents the growth of CO2 production in a specific country, and I'd like to avoid plotting zeros in the graph. Here is the code I've been using.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sn
import numpy as np
# Creating DataFrame
Data = pd.read_excel('CO2 Sorted Data.xlsx')
df = pd.DataFrame(Data, columns=['Year','CountryName','Region','IncomeType','CO2Emission','CO2EmCMT'])
df.replace(0,np.nan,inplace=True)
print(df)
# Creating the Pivot Table
pvt = df.pivot_table(index=['Year'],columns=['CountryName'],values='CO2Emission',aggfunc='sum')
# Creating the Graph
pvt2 = pvt.reindex()
CO2Country = input('Input Country Name = ')
remove_zero=pvt2[CO2Country]
rz1=[i for i in remove_zero if i !=0]
plt.plot(rz1,c='red')
plt.title('CO2 Emission of ' +CO2Country +' (1960-2011)', fontsize=10)
plt.xlabel('Year',fontsize=10)
plt.ylabel('CO2 Emission (kiloton)')
plt.grid(True)
plt.show
If I input Aruba, for example, the output looks like this.
[Image: line graph of Aruba]
However, the x-axis only shows the position of each value in the requested data, not the year itself. I have no clue what triggers this other than changing the zeroes to NaN, but that doesn't make sense to me. How can I make the x-axis show the true years, as in 1986-2011?
Here is a glimpse of the data:

To get the output in proper year format, you must enumerate the data first (this assumes the first value in rz1 belongs to 1960).
So: data = list(enumerate(rz1, start=1960))
There are two ways to go about plotting this new data: one is to convert the data into a NumPy array and transpose it, the other is to use the zip function. Both give the same output.
data = list(zip(*data))
or
data = np.array(data).transpose()
The final code (in the "Creating the Graph" section) is:
# Creating the Graph
pvt2 = pvt.reindex()
CO2Country = 'Aruba'
remove_zero=pvt2[CO2Country]
rz1=[i for i in remove_zero if i !=0]
data = list(enumerate(rz1, start=1960))
# data= np.array(data).transpose()
data= list(zip(*data))
plt.plot(data[0], data[1],c='red')
Side note: call plt.show(), not plt.show.
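An alternative that avoids hard-coding the start year is to filter the pandas Series itself instead of turning it into a list, so each value keeps its Year label. The following is only a sketch: the series below is a hypothetical stand-in for pvt2[CO2Country], with made-up values and years.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pvt2[CO2Country]: index = Year, values = emissions.
series = pd.Series(
    [0, 0, 180.0, 450.0, 610.0, 650.0],
    index=range(1984, 1990),
    name="Aruba",
)

# Turn zeros into NaN and drop them; the Year index stays attached to each value.
nonzero = series.replace(0, np.nan).dropna()

print(nonzero.index.tolist())
```

Passing nonzero.index and nonzero.values to plt.plot (or calling nonzero.plot()) should then put the actual years on the x-axis, whatever year the data happens to start in.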

I don't know about the pivoting thing, but the following works fine:
df.Year = pd.to_datetime(df.Year.astype(str), format="%Y")
aruba = df[df.CountryName == "Aruba"].set_index("Year")
aruba.CO2Emission.plot()

Related

Matplotlib time-based heatmap [duplicate]

This question already has answers here: Normalize columns of a dataframe (23 answers). Closed 8 months ago.
Background: I picked up Python about a month ago, so my experience level is pretty slim. I'm pretty comfortable with VBA through years of data analysis in Excel and PI ProcessBook.
I have 27 thermocouples that I pull data for in 1s intervals. I would like to heatmap them from hottest to coldest at a given instance in time. I've leveraged seaborn heatmaps, but the problem with those is that they compare temperatures across time as well and the aggregate of these thermocouples changes dramatically over time. See chart below:
Notice how in the attached, the pink one is colder than the rest when all of them are cold, but when they all heat up, the cold spot transfers to the orange and green ones (and even the blue one for a little bit at the peak).
In excel, I would write a do loop to apply conditional formatting to each individual timestamp (row), however in Python I can't figure it out for the life of me. The following is the code that I used to develop the above chart, so I'm hoping I can modify this to make it work.
tsStartTime = pd.Timestamp(strStart_Time)
tsEndTime = pd.Timestamp(strEnd_Time)
t = np.linspace(tsStartTime.value,tsEndTime.value, 150301)
TimeAxis = pd.to_datetime(t)
fig,ax = plt.subplots(figsize=(25,5))
plt.subplots_adjust(bottom = 0.25)
x = TimeAxis
i = 1
while i < 28:
    globals()['y' + str(i)] = forceconvert_v(globals()['arTTXD' + str(i)])
    ax.plot(x,globals()['y' + str(i)])
    i += 1
I've tried to use seaborn heatmaps, but when I slice it by timestamps, the output array is size (27,) instead of (27,1), so it gets rejected.
Ultimately, I'm looking for an output that looks like this:
Notice how the values of 15 in the middle are blue despite being higher than the red 5s in the beginning. I didn't fill out every cell, but hopefully you get the gist of what I'm trying to accomplish.
This data is being pulled from OSISoft PI via the PIConnect library. PI leverages their own classes, but they are essentially either series or dataframes, but I can manipulate them into whatever they need to be if someone has any awesome ideas to handle this.
Here's the link to the data: https://file.io/JS0RoQvDL6AB
Thanks!

You are going the wrong way with globals. In this case, I suggest using pandas.DataFrame.
What you are looking for can be produced like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Settings
col_number = 5
start = '1/1/2022 10:00:00'
end = '1/1/2022 10:10:00'
# prepare a data frame
index = pd.date_range(start=start, end=end, freq="S")
columns = [f'y{i}' for i in range(col_number)]
df = pd.DataFrame(index=index, columns=columns)
# fill in the data
for n, col in enumerate(df.columns):
    df[col] = np.array([n + np.sin(2*np.pi*i/len(df)) for i in range(len(df))])
# drawing a heatmap
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 5))
ax1.plot(df)
ax1.legend(df.columns)
ax2.imshow(df.T, aspect='auto', cmap='plasma')
ax2.set_yticks(range(len(df.columns)))
ax2.set_yticklabels(df.columns)
plt.show()
Here:
Since you didn't supply data to reproduce your case, I use sine values for illustration.
Transposing with df.T is needed to lay the records out horizontally. Of course, the data could be written horizontally from the start; it's up to you.
set_yticks is there to avoid trouble when changing the y-labels on the second figure.
seaborn.heatmap(...) can be used as well:
import seaborn as sns
data = df.T
data.columns = df.index.strftime('%H:%M:%S')
plt.figure(figsize=(15,3))
sns.heatmap(data, cmap='plasma', xticklabels=60)
Update
To compare values at each point in time:
data = (data - data.min())/(data.max() - data.min())
sns.heatmap(data, cmap='plasma', xticklabels=60)
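To see what that normalization does, here is a small sketch with a hypothetical 3-sensor, 4-timestamp frame (sensors as rows and timestamps as columns, like data = df.T above). The readings are made up for illustration.

```python
import pandas as pd

# Hypothetical readings: 3 sensors (rows) at 4 timestamps (columns).
data = pd.DataFrame(
    [[5.0, 6.0, 15.0, 16.0],
     [6.0, 8.0, 17.0, 20.0],
     [7.0, 7.0, 19.0, 18.0]],
    index=["y0", "y1", "y2"],
    columns=["10:00:00", "10:00:01", "10:00:02", "10:00:03"],
)

# min() and max() reduce over the rows by default, giving one value per
# timestamp, so every column is rescaled to [0, 1] on its own.
normalized = (data - data.min()) / (data.max() - data.min())

print(normalized)
```

After this step the coldest sensor at each timestamp maps to 0 and the hottest to 1, regardless of how warm the whole group is at that moment. A timestamp where all sensors read the same value would divide by zero and produce NaN, which may need separate handling.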

Power law test using XY scatter plot

I have daily crude oil prices downloaded from FRED, about 10k observations; some values are blank (the code cleans them). I believe that I cannot share Excel sheets here, so I will just give you a screenshot of what the data looks like:
I calculate the differences and returns and clean up the data but I am kind of stuck.
Here is what the code looks like to get you started:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("DCOILWTICO.csv")
nan_value = float("NaN")
data.replace("", nan_value, inplace=True)
data.replace(".", nan_value, inplace=True)
data['Previous'] = data['DCOILWTICO'].shift(1)
data.dropna(subset=['Previous'],inplace=True)
data.replace("", nan_value, inplace=True)
data.replace(".", nan_value, inplace=True)
data['DCOILWTICO'] = data['DCOILWTICO'].astype(float)
data['Previous'] = data['Previous'].astype(float)
data['Diff'] = data['DCOILWTICO'] - data['Previous']
data['Return'] = (data['DCOILWTICO'] - data['Previous'])/data['Previous']
Here comes the question: I am trying to duplicate the graph below (which I believe was generated using Mathematica). The difficult part is creating the bins in the right way. Judging from the graph, there are around 200 bins. The returns are on the x-axis and the binned frequencies are on the y-axis.

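I think you are asking how to make equally spaced bins in log space. If so, use the np.geomspace function (geometric spacing) rather than np.linspace (linear spacing). Note that np.geomspace needs endpoints that are nonzero and of the same sign, so the example bins the absolute returns.
plt.figure()
abs_returns = data['Return'].abs()
abs_returns = abs_returns[abs_returns > 0]
bins = np.geomspace(abs_returns.min(), abs_returns.max(), 200)
plt.hist(abs_returns, bins = bins)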
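For a power-law check it usually helps to compare densities rather than raw counts, because logarithmic bins get wider toward the tail. The sketch below uses synthetic heavy-tailed numbers as a hypothetical stand-in for the return column (the real oil data is not reproduced here) and works on magnitudes, since logarithmic bin edges must be strictly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the return series: heavy-tailed, both signs.
returns = rng.standard_t(df=3, size=10_000) * 0.02

# Log-spaced edges must be strictly positive, so bin the magnitudes.
magnitudes = np.abs(returns)
magnitudes = magnitudes[magnitudes > 0]

edges = np.geomspace(magnitudes.min(), magnitudes.max(), 200)

# density=True divides each count by (total count * bin width), which
# compensates for the bins growing wider toward the tail.
density, edges = np.histogram(magnitudes, bins=edges, density=True)

# Geometric midpoints are the natural x positions on a log axis.
centres = np.sqrt(edges[:-1] * edges[1:])
```

Plotting centres against density with plt.loglog(centres, density, '.') should then give a scatter in which a power-law tail shows up as a roughly straight line. Whether the real returns behave that way is for the data to show.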

plot matplotlib aggregated data python

I need a plot of aggregated data.
import pandas as pd
basic_data= pd.read_csv('WHO-COVID-19-global-data _2.csv',parse_dates= ['Date_reported'] )
cum_daily_cases = basic_data.groupby('Date_reported')[['New_cases']].sum()
import pylab
x = cum_daily_cases['Date_reported']
y = cum_daily_cases['New_cases']
pylab.plot(x,y)
pylab.show()
Error: 'Date_reported'
Input: Date_reported, Country_code, Country, WHO_region, New_cases, Cumulative_cases, New_deaths, Cumulative_deaths 2020-01-03,AF,Afghanistan,EMRO,0,0,0,0
Output: the total quantity of "New cases" showed on the plot per day.
What should I do to run this plot? link to dataset

The column names contain a leading space (can be easily seen by checking basic_data.dtypes). Fix that by adding the following line immediately after basic_data was read:
basic_data.columns = [s.strip() for s in basic_data.columns]
In addition, your x variable should be the index after groupby-sum, not a column Date_reported. Correction:
x = cum_daily_cases.index
The plot should show as expected.
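Both fixes together, as a small self-contained sketch. The CSV text is a hypothetical excerpt in the question's layout (only the Afghanistan row comes from the question; the other rows and the reduced column set are made up), and the header has a space after each comma to reproduce the problem.

```python
import io

import pandas as pd

# Hypothetical excerpt; note the space after each comma in the header row.
csv_text = """Date_reported, Country_code, Country, New_cases
2020-01-03,AF,Afghanistan,0
2020-01-03,AL,Albania,2
2020-01-04,AF,Afghanistan,1
2020-01-04,AL,Albania,3
"""
basic_data = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date_reported"])

# Remove the leading spaces so "New_cases" can be looked up by name.
basic_data.columns = basic_data.columns.str.strip()

# After groupby + sum, the dates are the index, not a column.
daily_cases = basic_data.groupby("Date_reported")["New_cases"].sum()
x = daily_cases.index
y = daily_cases.values

print(daily_cases)
```

Passing x and y to a plot call then draws one point per day. Calling daily_cases.plot() is a shorter route that uses the index as the x-axis automatically.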

How to make a loop for multiple scatterplots in python?

I am trying to automate the plotting procedure for a large dataframe. The goal is to plot each column against every other column. Each column represents a variable. See also the image below.
E.g.: sex vs. age, sex vs. BMI, sex vs. smoke, sex vs. type, and so on.
For the sake of clarity, I have simplified the problem to the image below:
Initially, I tried to plot each combination by hand, but this is a rather time-consuming exercise and not what I want.
I also tried this (not working):
variables = ["Sex", "Age", "BMI"]
for variable in variables:
    plt.scatter(df.variable, df.variable)
    plt.xlabel('variable')
    plt.ylabel('variable')
    plt.title('variable vs. variable')
    plt.show()
Any help is welcome!
PS: If it would be a simple exercise to incorporate a linear regression on the combination of variables as well, that would also be appreciated.
Greetings,
Nadia

What you coded plots each column against itself. What you described is a nested loop. A simple upgrade is
col_choice = ["Sex", "Age", "BMI"]
for pos, axis1 in enumerate(col_choice): # Pick a first col
    for axis2 in col_choice[pos+1:]: # Pick a later col
        plt.scatter(df.loc[:, axis1], df.loc[:, axis2])
        plt.show() # One figure per pair
I think this generates a series acceptable to scatter.
Does that help? If you want to be more "Pythonic", then look into itertools.combinations to generate your column pairs.
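As a sketch of the itertools route: itertools.combinations yields each unordered pair of columns exactly once, which replaces the nested loop. The dataframe is a hypothetical stand-in with made-up values, and the Agg backend is selected only so the example runs without a display.

```python
from itertools import combinations

import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data using the column names from the question.
df = pd.DataFrame({
    "Sex": [1, 0, 0, 1, 0],
    "Age": [23, 16, 94, 18, 24],
    "BMI": [32, 26, 28, 23, 19],
})

# Each unordered pair of columns once: (Sex, Age), (Sex, BMI), (Age, BMI).
pairs = list(combinations(df.columns, 2))

# One subplot per pair, side by side.
fig, axes = plt.subplots(1, len(pairs), figsize=(4 * len(pairs), 4))
for ax, (x_col, y_col) in zip(axes, pairs):
    ax.scatter(df[x_col], df[y_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title(f"{x_col} vs. {y_col}")
fig.tight_layout()
```

With many columns the number of pairs grows quickly (n columns give n*(n-1)/2 plots), so a grid of subplots or one saved file per pair may be more practical than a single row.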

You could do something like this:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Create dummy dataframe, or load your own with pd.read_csv()
columns = ["sex", "age", "BMI", "smoke", "type"]
data = pd.DataFrame(np.array([[1,0,0,1,0], [23,16,94,18,24], [32, 26, 28, 23, 19], [0,1,1,1,0], [1,2,2,2,1]]).T, columns=columns)
x_col = "sex"
y_columns = ["age", "BMI", "smoke"]
for y_col in y_columns:
    plt.figure()  # start a new figure for each pair
    ax = plt.gca()
    ax.scatter(data[x_col], data[y_col], label=y_col)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title("{} vs {}".format(x_col, y_col))
    plt.legend()
    plt.show()
Basically, if you have your dataset saved as a .csv file, you can load it with pandas using pd.read_csv(), use the column names as keys to access the corresponding columns, and iterate over them (here I created a dummy dataframe just for the sake of it).
Regarding the linear regression part, you should check out the scikit-learn library. It has models for many different tasks, such as regression, classification, and clustering.
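For a plain straight-line fit between two columns, NumPy alone may be enough. The sketch below uses np.polyfit instead of scikit-learn, on hypothetical age and BMI values chosen to lie exactly on a line so the result is easy to check.

```python
import numpy as np

# Hypothetical values that lie exactly on bmi = 0.5 * age + 11.
age = np.array([16.0, 18.0, 23.0, 24.0, 30.0])
bmi = np.array([19.0, 20.0, 22.5, 23.0, 26.0])

# A degree-1 polynomial fit is an ordinary least-squares line.
slope, intercept = np.polyfit(age, bmi, deg=1)

# Fitted values, e.g. to draw on top of the scatter plot.
fitted = slope * age + intercept

print(slope, intercept)
```

Drawing the line on an existing scatter plot would then be ax.plot(age, fitted). For several predictors or other model types, scikit-learn is the more general tool.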

A line graph for non-numeric data

I have a dataset with mostly non-numeric columns. I would love to create a visualization for them, but I am getting an error message.
My data set looks like this:
|plant_name|Customer_name|Job site|Delivery.Date|DeliveryQuantity|
|SN13|John|Sweden|01.01.2019|6|
|SN14|Ruth|France|01.04.2018|4|
|SN15|Jane|Serbia|01.01.2019|2|
|SN11|Rome|Denmark|01.04.2018|10|
|SN14|John|Sweden|03.04.2018|5|
|SN15|John|Sweden|04.09.2019|7|
I need to create a lineplot to show how many times John made a purchase using Delivery Date as my timeline (x-axis)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option("display.max_rows", 5)
hr_data = pd.read_excel("D:\data\Days_Calculation.xlsx", parse_dates = True)
x = hr_data['DeliveryDate']
y = hr_data ['Customer_name']
sns.lineplot(x,y)
Error: No numeric types to aggregate
My expected result should be a line graph like this:
John's marker should appear on the timeline (Delivery Date) at "01.01.2019", "03.04.2018" and "04.09.2019".
Another instance:
How can one plot a string against a float, for example the total quantity (DeliveryQuantity) vs. Customer Name?
How does one format the spacing of a plot's axes (not the labels)?

Why not make Delivery Date a timestamp object instead of a string?
hr_data["Delivery.Date"] = pd.to_datetime(hr_data["Delivery.Date"])
Now you have plotting options.
Working with John:
john_data = hr_data[hr_data["Customer_name"]=="John"]
sns.countplot(x=john_data["Delivery.Date"])

Generally speaking, you have to aggregate something when working with categorical data. Whether you are counting names in a column, adding up the number of orders, or ranking categories, the result is still numeric data.
plot_data = hr_data.pivot_table(index='DeliveryDate', columns='Customer_name', values='DeliveryQuantity', aggfunc='sum')
plot_data.plot(legend=False)
plt.xticks(LISTOFVALUESFORXRANGE)  # placeholder: the tick positions you want
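Putting the aggregation idea into a runnable sketch with the sample table from the question: count John's purchases per delivery date, and total the quantity per customer. The day.month.year date format is an assumption, since the sample dates do not make it unambiguous.

```python
import pandas as pd

# The sample rows from the question.
hr_data = pd.DataFrame({
    "plant_name": ["SN13", "SN14", "SN15", "SN11", "SN14", "SN15"],
    "Customer_name": ["John", "Ruth", "Jane", "Rome", "John", "John"],
    "Delivery.Date": ["01.01.2019", "01.04.2018", "01.01.2019",
                      "01.04.2018", "03.04.2018", "04.09.2019"],
    "DeliveryQuantity": [6, 4, 2, 10, 5, 7],
})

# Assumption: dates are written day.month.year.
hr_data["Delivery.Date"] = pd.to_datetime(hr_data["Delivery.Date"], format="%d.%m.%Y")

# Number of purchases John made on each date (numeric, so it can be plotted).
john = hr_data[hr_data["Customer_name"] == "John"]
purchases = john.groupby("Delivery.Date").size().sort_index()

# Total quantity per customer: a string category against a number.
totals = hr_data.groupby("Customer_name")["DeliveryQuantity"].sum()

print(purchases)
print(totals)
```

From here, purchases.plot(marker="o") would draw John's timeline and totals.plot(kind="bar") the quantity per customer. Both work because the strings have been reduced to counts or sums first.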
