I need to create a chart like this:
X Axis: Dates. I can present it either as a categorical axis (ordered somehow, even by another field) or as a continuous axis.
Y Axis: Time of day (from 00:00 to 23:59)
The bars represent occurrences of events.
Emphasis points:
I might have some parallel occurrences (as you can see on the last date visualized in the graph). I can separate them with a field.
An event might start at any time on any day and continue into the next day (as you can see in the second and third days above). I want these visualized as in the chart above. (Solutions based on splitting such an event into 2 separate events are not relevant here because of the final target: clustering.)
Data can be arranged in any order either in Python or Tableau.
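For the Python route, here is a minimal matplotlib sketch (the sample data and the start/end column names are made up) in which each event stays a single row in the data, and only the drawing is clipped at midnight and continued in the next day's column:

import pandas as pd
import matplotlib.pyplot as plt

events = pd.DataFrame({
    'start': pd.to_datetime(['2023-05-01 08:00', '2023-05-02 22:00']),
    'end':   pd.to_datetime(['2023-05-01 11:30', '2023-05-03 03:00']),
})

fig, ax = plt.subplots()
for _, ev in events.iterrows():
    day = ev['start'].normalize()  # midnight of the start day
    start_h = (ev['start'] - day).total_seconds() / 3600
    end_h = (ev['end'] - day).total_seconds() / 3600
    # draw the part that falls on the start day, clipped at 24 h
    ax.bar(day, min(end_h, 24) - start_h, bottom=start_h, width=0.8)
    if end_h > 24:
        # the same event continues from 00:00 in the next day's column
        ax.bar(day + pd.Timedelta(days=1), end_h - 24, bottom=0, width=0.8)
ax.set_ylim(0, 24)
ax.set_ylabel('time of day (hours)')
plt.show()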
I'm trying to plot a bar chart of some de-identified transactional banking data using the pandas and matplotlib libraries.
The data looks like this:
The column named "day" stores the number of the day on which the transaction was made, the column named "tr_type" stores the number of transactions made on each day, and the column named "average_income" stores the average income for each of the different transaction types.
The task is to display the data from all three columns, for the rows with the largest average incomes, on one graph.
For definiteness, I took the top 5 rows of the sorted data.
slised_two = sliced_df_new.sort_values('average_income', ascending=False).head(5)
slised_two = slised_two.set_index('day')
For convenience in further plotting, I set the "day" column as the index. I get this:
Based on this data, I tried to build a single graph but, unfortunately, did not get the result I wanted; I had to build 2 subplots to display the data properly.
axes = slised_two.plot.bar(rot=0, subplots=True)
axes[1].legend(loc=2)
The question is: is it possible to build the bar chart in such a way that the days are displayed on the x-axis, the average income on the y-axis, and, at the same time, the transaction number is written on top of each column?
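A minimal sketch, assuming the slised_two frame from above (Axes.bar_label requires matplotlib >= 3.4):

import matplotlib.pyplot as plt

ax = slised_two['average_income'].plot.bar(rot=0)
# write the transaction number (tr_type) on top of each bar
ax.bar_label(ax.containers[0], labels=slised_two['tr_type'].astype(str))
ax.set_ylabel('average_income')
plt.show()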
I have a time series with daily data that I want to plot to see how it evolves over a year, compared to previous years. I have written the following code in Python:
import matplotlib.pyplot as plt

xindex = data['biljett'].index.month * 30 + data['biljett'].index.day
plt.plot(xindex, data['biljett'])
plt.show()
The graph looks as follows:
A graph of how the data evolves over a year compared to previous years. The line is continuous and does not end with the end of the year, which makes it fuzzy. What am I doing wrong?
From a technical perspective, it happens because your data points are not sorted with respect to date, so the line goes back and forth connecting the points in DataFrame order. Sort the data by xindex and you're good to go (first you need to put xindex into the data DataFrame as a new column):
data = data.sort_values(by='xindex').reset_index(drop=True)
From a visualization perspective, you might have several values for each day count, so plot is not a good option to begin with. IMHO you'd want to try plt.scatter() to visualize your data in a better way.
I have rewritten as follows:
xindex = data['biljett'].index.month*30 + data['biljett'].index.day
data['biljett'].sort_values('xindex').reset_index(drop=True)
plt.plot(xindex, data['biljett'])
plt.show()
but get the following error message:
ValueError: No axis named xindex for object type
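The error occurs because data['biljett'] is a Series, and Series.sort_values does not accept a column name. A sketch of what the answer meant, assuming data has a DatetimeIndex and a 'biljett' column:

import matplotlib.pyplot as plt

# put xindex into the frame as a column, sort the whole frame by it, then plot
data['xindex'] = data.index.month * 30 + data.index.day
data = data.sort_values(by='xindex').reset_index(drop=True)
plt.plot(data['xindex'], data['biljett'])
plt.show()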
I'm writing my own code to analyse/visualise COVID-19 data from the European CDC.
https://opendata.ecdc.europa.eu/covid19/casedistribution/csv
I've got some simple code to extract the data and make plots of cumulative deaths against time, and am trying to add functionality.
My aim is something like the attached graph, with all countries time-shifted to match at the nth death (in this case the 5th). I want to make a general bit of code to shift countries to match at the 'n'th death.
https://ourworldindata.org/grapher/covid-confirmed-deaths-since-5th-death
The current way I'm trying to do this is to have a maze of "if group is 'country', shift by ..." terms.
Where ... is a lookup to find the date for the particular 'country' when there were 'n' deaths, and to interpolate fractional dates where appropriate.
i.e. currently deaths are assigned to 00:00 on each day, but the data may need to be shifted by a fraction of a day, e.g. 2/3 of a day, as below.
datetime        cumulative deaths
00:00 15/02     80
00:00 16/02     110
so for, e.g., n = 100, my '...' should give 16:00 15/02 (linear interpolation: (100 - 80) / (110 - 80) = 2/3 of a day past 15/02).
I'm working on this right now but it doesn't feel very efficient and I'm sure there must be a much simpler way that I'm not seeing.
Essentially, despite copious googling, I can't seem to find a simple way of automatically shifting a bunch of time series to match at a particular y value, which feels like it should have built-in functionality, i.e. a lookup with interpolation.
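That lookup with interpolation can be a one-liner with numpy. A minimal sketch using the numbers from the table above (n = 100 is hypothetical, and the cumulative series must be non-decreasing for np.interp to behave):

import numpy as np

cum = np.array([80, 110])           # 15/02 and 16/02 from the table above
days = np.arange(len(cum))          # day index, 0 = 15/02
n = 100                             # hypothetical threshold
t_cross = np.interp(n, cum, days)   # 0.667 -> 16:00 on 15/02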
####Live url (I've downloaded my own csv and been calling that for code development)
url = 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'
dataraw = pd.read_csv(url)
# extract relevant columns
data = dataraw.loc[:,["dateRep","countriesAndTerritories","deaths"]]
####convert date format
data['dateRep'] = pd.to_datetime(data['dateRep'],dayfirst=True)
####sort by date
data = data.sort_values(["dateRep"],ascending=True)
data['cumdeaths'] = data.groupby('countriesAndTerritories')['deaths'].cumsum()
##### limit to countries with cumulative deaths > 500
data = data.groupby('countriesAndTerritories').filter(lambda x:x['cumdeaths'].max() >500)
###### remove China from data for now as it doesn't match so well with dates
data = data.groupby('countriesAndTerritories').filter(lambda x:(x['countriesAndTerritories'] != "China").any())
##### only recent dates
data = data[data['dateRep'] > '2020-03-01']
print(data)
You can use groupby('country') and transform to add a column that gives every row the date on which its country hit the nth death.
Then you can do a vectorized subtraction of the date column and the new column to get the number of days.
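A sketch of that suggestion (n and the column names follow the code in the question; groupby().min() plus map is used here to the same effect as transform):

import pandas as pd

n = 5
# for each country, the first date on which cumulative deaths reached n
first_hit = (
    data[data['cumdeaths'] >= n]
    .groupby('countriesAndTerritories')['dateRep']
    .min()
)
# broadcast it to every row, then a vectorized subtraction gives the shift
data['nth_death_date'] = data['countriesAndTerritories'].map(first_hit)
data['days_since_nth'] = (data['dateRep'] - data['nth_death_date']).dt.days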
I have a text data file with a time series whose entries are in the form (This is the first column):
20000101
20000102
20000103
...
20001231
20010101
...
20151231
Using these int values results in the points being clustered around each year with unequal spacing (this is logical, since it simply leaves the corresponding gap between 20001231 and 20010101).
Now, one solution to this is to use an array like this one (suppose I have the dates stored in an array called date):
xaxis= np.arange(0, len(date))
The problem is that, although the plot is correct, the x-axis ticks are then labeled 0, 1, 2, 3, ...
I have been trying to modify the xticks, but whatever I do changes the whole figure, resulting in a weird plot.
What is the best solution to this?
You have to pass the tick positions and the tick labels to the plt.xticks(positions, labels) function.
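Two sketches (the y-value array, here called values, is an assumption; date is the array from the question). Option 1 parses the YYYYMMDD integers into real datetimes so matplotlib spaces and labels the axis itself; option 2 keeps the evenly spaced index and labels only a subset of the ticks:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Option 1: convert the YYYYMMDD ints to datetimes and let matplotlib
# handle spacing and labels
dates = pd.to_datetime(pd.Series(date).astype(str), format='%Y%m%d')
plt.plot(dates, values)
plt.show()

# Option 2: evenly spaced index, with only ~8 tick labels taken from date
xaxis = np.arange(len(date))
plt.plot(xaxis, values)
step = max(1, len(date) // 8)
plt.xticks(xaxis[::step], [str(d) for d in date[::step]], rotation=45)
plt.show()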
This post is quite long and I will be very grateful to everybody who reads it to the end. :)
I am experiencing Python code execution (performance) issues and would like to know if you have a better way of doing what I want.
Let me explain my problem briefly. I have plenty of solar panel measurements, each taken every 3 minutes. Unfortunately, some measurements can fail. The goal is to compare the timestamps in order to keep only the values that were measured in the same minutes, and then retrieve them. A GUI is also included in my software, so each time the user changes the panels to compare, the calculation has to be done again. To do this, I have implemented 2 parts: the first creates a vector of True or False for each panel for each minute, and the second compares these vectors and keeps only the common measurements.
All the data is contained in the pandas DataFrame energiesDatas. The relevant columns are:
name: contains the name of the panel (length 1)
date: contains the day of the measurement (length 1)
list_time: contains a list of all time of measurement of a day (length N)
list_energy_prod: contains the corresponding measurements (length N)
The first part loops over all possible minutes from the beginning to the end of the measurements. If a measurement was taken at a given minute, it appends True; otherwise, False.
self.ListCompare2 = pd.DataFrame()
for n in self.NameList:  # loop over all my solar panels
    m = self.energiesDatas[self.energiesDatas['Name'] == n]  # all data for this panel
    # table_date contains all the possible dates from the 1st measurement, with an interval of 1 min
    table_list = [1 for i in range(len(table_date))]
    pointerDate = 0  # pointer to the current value of time
    # all the measurements of a given day are transformed into a str of hours-minutes
    DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
    # flags and counter
    changeDate = 0
    count = 0
    # store the currently pointed date
    m_date = m['Date'].iloc[pointerDate]
    # for all possible times
    for curr_date in table_date:
        # if the considered date is bigger, move the pointer to the next day
        while curr_date.date() > m_date:
            pointerDate += 1
            changeDate = 1
            m_date = m['Date'].iloc[pointerDate]
        # if the day changed, recompute the measurements for the new day
        if changeDate:
            DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
            changeDate = 0
        # check if a measurement was taken at the considered time
        table_list[count] = curr_date.strftime('%H-%M') in DateString
        count += 1
    # add to a dataframe
    self.ListCompare2[n] = table_list
l2 = self.ListCompare2
The second part is the following: given a list "ListOfName" of modules to compare, check whether they were measured at the same times and keep only the values measured in the same minute.
ListToKeep = self.ListCompare2[ListOfName[0]]  # list of True/False computed before
for i in ListOfName[1:]:  # for each other panel, check if True too
    ListToKeep = ListToKeep & self.ListCompare2[i]

for i in ListOfName:  # for each module, recover the values
    tmp = self.energiesDatas[self.energiesDatas['Name'] == i]
    count = 0
    # loop over the values we want to keep (also the energy produced and the interval of time)
    for j, k, l, m, n in zip(tmp['list_time'], tmp['Date'], tmp['list_energy_prod'],
                             tmp['list_energy_rec'], tmp['list_interval']):
        # calculation of the index
        delta_day = (k - self.dt.date()).days * (18 * 60)
        # if the value of ListToKeep corresponding to the index is True, we keep the value
        tmp['list_energy_prod'].iloc[count] = [l[index] for index, a in enumerate(j)
                                               if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_energy_rec'].iloc[count] = [m[index] for index, a in enumerate(j)
                                              if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_interval'].iloc[count] = [n[index] for index, a in enumerate(j)
                                            if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        count += 1
    self.store_compare = self.store_compare.append(tmp)
Actually, this part is the one that takes a very long time.
My question is: is there a way to save time, using built-in functions or anything else?
Thank you very much
Kilian
The answer from chris-sc solved my problem:
I believe your data structure isn't appropriate for your problem, especially the lists stored in fields of a DataFrame; they make loops or apply almost unavoidable. Could you, in principle, restructure the data? (For example, one df per solar panel with columns date, time, energy.)
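A sketch of that restructuring (panel_frame and common are hypothetical names; energiesDatas and ListOfName come from the question): explode each panel's per-day lists into a long frame indexed by timestamp, then an inner join keeps only the minutes measured by every selected panel, replacing both loops above.

import pandas as pd

def panel_frame(df, name):
    # one long frame per panel: a proper timestamp index, one energy column
    rows = df[df['Name'] == name]
    records = []
    for _, r in rows.iterrows():
        for t, e in zip(r['list_time'], r['list_energy_prod']):
            records.append({'timestamp': pd.Timestamp.combine(r['Date'], t), name: e})
    return pd.DataFrame(records).set_index('timestamp')

frames = [panel_frame(energiesDatas, n) for n in ListOfName]
# inner join keeps only the minutes present for every selected panel
common = pd.concat(frames, axis=1, join='inner')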