Make all columns (dates) index of data frame - python

My data is organized like this:
Where country code is the index of the data frame and the columns are the years for the data. First, is it possible to plot line graphs (using matplotlib.pyplot) over time for each country without transforming the data any further?
Second, if the above is not possible, how can I make the columns the index of the table so I can plot time series line graphs?
Trying df.T gives me this:
How can I make the dates the index now?

Transpose using df.T.
Plot as usual.
Sample:
import pandas as pd
df = pd.DataFrame({1990:[344,23,43], 1991:[234,64,23], 1992:[43,2,43]}, index = ['AFG', 'ALB', 'DZA'])
df = df.T
df
      AFG  ALB  DZA
1990  344   23   43
1991  234   64   23
1992   43    2   43
# transform index to dates
import datetime as dt
df.index = [dt.date(year, 1, 1) for year in df.index]
import matplotlib.pyplot as plt
df.plot()
plt.savefig('test.png')
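As an alternative to the datetime.date list comprehension, pd.to_datetime can build the same index directly; a small sketch using the sample frame above:

```python
import pandas as pd

df = pd.DataFrame({1990: [344, 23, 43], 1991: [234, 64, 23], 1992: [43, 2, 43]},
                  index=['AFG', 'ALB', 'DZA']).T
# convert the integer year labels to Jan-1 timestamps
df.index = pd.to_datetime(df.index.astype(str), format='%Y')
```

With a proper DatetimeIndex, rows can also be selected by date string, e.g. df.loc['1991-01-01'].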

Related

getting mean() used in groupby to use the right grouped values for calculation

Data import from csv:
Date        Item_1  Item 2
1990-01-01      34      78
1990-01-02      42      19
...
2020-12-31      41      23
df = pd.read_csv(r'Insert file directory')
df.index = pd.to_datetime(df.index)
gb = df.groupby([(df.index.year), (df.index.month)]).mean()
Issue:
So basically, the requirement is to group the data by year and month before processing, and I thought that the groupby function would group the data so that mean() calculates the average of all values under Jan-1990, Feb-1990 and so on. However, I was wrong: the output is a separate average for each column (e.g. Item_1) rather than a single mean per group.
My example is similar to the post below, but in my case it is calculating the mean. I am guessing it has to do with the way the data is arranged after groupby, or that some parameter of mean() has to be specified, but I have no idea which is the cause. Can someone enlighten me on how to correct the code?
Pandas groupby month and year
Update:
Hi all, I have created a sample .csv data file with 3 items and 3 months of data. I am wondering if the cause has to do with the conversion of the data into a DataFrame when it is imported from the .csv, because I have noticed some odd time data in the leftmost column.
Link to sample file is:
https://www.mediafire.com/file/t81wh3zem6vf4c2/test.csv/file
import pandas as pd
df = pd.read_csv( 'test.csv', index_col = 'date' )
df.index = pd.to_datetime( df.index )
df.groupby([(df.index.year),(df.index.month)]).mean()
Seems to do the trick from the provided data.
IIUC, you want to calculate the mean of all elements. You can use NumPy's mean function, which operates on the flattened array by default:
import numpy as np
df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
gb = df.groupby([(df.index.year), (df.index.month)]).apply(lambda d: np.mean(d.values))
output:
date date
1990 1 0.563678
2 0.489105
3 0.459131
4 0.755165
5 0.424466
6 0.523857
7 0.612977
8 0.396031
9 0.452538
10 0.527063
11 0.397951
12 0.600371
dtype: float64
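To make the distinction concrete, a self-contained sketch with made-up numbers: groupby(...).mean() returns one average per column, whereas stacking the columns into a single Series first yields one overall mean per group, which is effectively what the np.mean approach computes.

```python
import pandas as pd

# made-up sample: daily values for two items
idx = pd.to_datetime(['1990-01-01', '1990-01-02', '1990-02-01', '1990-02-02'])
df = pd.DataFrame({'Item_1': [34, 42, 10, 14], 'Item_2': [78, 19, 30, 26]}, index=idx)

# groupby(...).mean() aggregates per column: one value per item per group
per_column = df.groupby([df.index.year, df.index.month]).mean()

# for a single mean over all items per group, stack the columns into one Series first
s = df.stack()
dates = s.index.get_level_values(0)
overall = s.groupby([dates.year, dates.month]).mean()
```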

Python - Get percentage based on column values

I want to evaluate 'percent of number of releases in a year' as a parameter of popularity of a genre in the movieLens dataset.
Sample data is shown below:
I can set the index to be the year as
df1 = df.set_index('year')
then, I can find the total per row and then divide the individual cells to get a sense of percentages as:
df1= df.set_index('year')
df1['total'] = df1.iloc[:,1:4].sum(axis=1)
df2 = df1.drop('movie',axis=1)
df2 = df2.div(df2['total'], axis= 0) * 100
df2.head()
Now, what's the best way to get the % of the number of releases in a year? I believe using groupby and then a heatmap?
You can simply use the groupby method:
import pandas as pd
import numpy as np
df = pd.DataFrame({'movie': ['Movie1', 'Movie2', 'Movie3'], 'action': [1, 0, 0],
                   'com': [np.nan, np.nan, 1], 'drama': [1, 1, np.nan],
                   'year': [1994, 1994, 1995]})
df.fillna(0, inplace=True)
df = df.set_index('movie')  # set_index returns a new frame, so assign the result; this also keeps the non-numeric column out of the sum
print((df.groupby(['year']).sum()/len(df))*100)
Output:
action com drama
year
1994 33.333333 0.000000 66.666667
1995 0.000000 33.333333 0.000000
Also, you can use pandas built-in style for the colored representation of the dataframe (or just use seaborn):
df = df.groupby(['year']).sum()/len(df)*100
df.style.background_gradient(cmap='viridis')
Output: the same table rendered with a colour gradient (image not shown).

Stacked area chart with datetime axis

I am attempting to create a Bokeh stacked area chart from the following Pandas DataFrame.
An example of the of the DataFrame (df) is as follows;
date tom jerry bill
2014-12-07 25 12 25
2014-12-14 15 16 30
2014-12-21 10 23 32
2014-12-28 12 13 55
2015-01-04 5 15 20
2015-01-11 0 15 18
2015-01-18 8 9 17
2015-01-25 11 5 16
The above DataFrame is a snippet of the total df, which spans a number of years and contains additional names to the ones shown.
I am attempting to use the datetime column date as the x-axis, with the count information for each name as the y-axis.
Any assistance that anyone could provide would be greatly appreciated.
You can create a stacked area chart by using the patch glyph. I first used df.cumsum to stack the values in the dataframe by row. After that I append two rows to the dataframe with the max and min date and Y value 0. I plot the patches in a reverse order of the column list (excluding the date column) so the person with the highest values is getting plotted first and the persons with lower values are plotted after.
Another implementation of a stacked area chart can be found here.
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.palettes import inferno
from bokeh.models.formatters import DatetimeTickFormatter
df = pd.read_csv('stackData.csv')
df_stack = df[list(df)[1:]].cumsum(axis=1)
df_stack['date'] = df['date'].astype('datetime64[ns]')

# close the patch at the right edge (max date, y=0)
bot = {list(df)[0]: max(df_stack['date'])}
for column in list(df)[1:]:
    bot[column] = 0
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_stack = pd.concat([df_stack, pd.DataFrame([bot])], ignore_index=True)

# close the patch at the left edge (min date, y=0)
bot = {list(df)[0]: min(df_stack['date'])}
for column in list(df)[1:]:
    bot[column] = 0
df_stack = pd.concat([df_stack, pd.DataFrame([bot])], ignore_index=True)

p = figure(x_axis_type='datetime')
p.xaxis.formatter = DatetimeTickFormatter(days="%d/%m/%Y")  # recent Bokeh versions take a single format string
p.xaxis.major_label_orientation = 45
# plot in reverse column order so the largest (outermost) patch is drawn first
for person, color in zip(list(df_stack)[2::-1], inferno(len(list(df_stack)))):
    p.patch(x=df_stack['date'], y=df_stack[person], color=color, legend_label=person)
p.legend.click_policy = "hide"
show(p)

Python matplotlib - add trend line, make subplot and save to .pdf [duplicate]

I have a temperature file with many years of temperature records, in a format as below:
2012-04-12,16:13:09,20.6
2012-04-12,17:13:09,20.9
2012-04-12,18:13:09,20.6
2007-05-12,19:13:09,5.4
2007-05-12,20:13:09,20.6
2007-05-12,20:13:09,20.6
2005-08-11,11:13:09,20.6
2005-08-11,11:13:09,17.5
2005-08-13,07:13:09,20.6
2006-04-13,01:13:09,20.6
Every year has a different number of records, taken at different times, so the pandas datetime indices are all different.
I want to plot the different years' data in the same figure for comparison. The x-axis runs Jan to Dec and the y-axis is temperature. How should I go about doing this?
Try:
ax = df1.plot()
df2.plot(ax=ax)
If you are running a Jupyter/IPython notebook and having problems using
ax = df1.plot()
df2.plot(ax=ax)
run both commands inside the same cell. For some reason it won't work when they are separated into sequential cells, at least for me.
Chang's answer shows how to plot a different DataFrame on the same axes.
In this case, all of the data is in the same dataframe, so it's better to use groupby and unstack.
Alternatively, pandas.DataFrame.pivot_table can be used.
dfp = df.pivot_table(index='Month', columns='Year', values='value', aggfunc='mean')
When using pandas.read_csv, names= creates column headers when there are none in the file. The 'date' column must be parsed into datetime64[ns] Dtype so the .dt extractor can be used to extract the month and year.
import pandas as pd
# given the data in a file as shown in the op
df = pd.read_csv('temp.csv', names=['date', 'time', 'value'], parse_dates=['date'])
# create additional month and year columns for convenience
df['Year'] = df.date.dt.year
df['Month'] = df.date.dt.month
# group by month and year and aggregate mean on the value column
dfg = df.groupby(['Month', 'Year'])['value'].mean().unstack()
# display(dfg)
Year 2005 2006 2007 2012
Month
4 NaN 20.6 NaN 20.7
5 NaN NaN 15.533333 NaN
8 19.566667 NaN NaN NaN
Now it's easy to plot each year as a separate line. The OP only has one observation for each year, so only a marker is displayed.
ax = dfg.plot(figsize=(9, 7), marker='.', xticks=dfg.index)
To do this for multiple dataframes, you can loop over them with a for loop:
import matplotlib.pyplot as plt

fig = plt.figure(num=None, figsize=(10, 8))
ax = dict_of_dfs['FOO'].column.plot()
for BAR in dict_of_dfs.keys():
    if BAR == 'FOO':
        pass
    else:
        dict_of_dfs[BAR].column.plot(ax=ax)
This can also be implemented without the if condition:
fig, ax = plt.subplots()
for BAR in dict_of_dfs.keys():
    dict_of_dfs[BAR].plot(ax=ax)
You can make use of the hue parameter in seaborn. For example:
import seaborn as sns
df = sns.load_dataset('flights')
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
.. ... ... ...
139 1960 Aug 606
140 1960 Sep 508
141 1960 Oct 461
142 1960 Nov 390
143 1960 Dec 432
sns.lineplot(x='month', y='passengers', hue='year', data=df)

Dataframe Sorting

I am working with Python pandas, beginning by sorting a dataframe I have created from a csv file. Eventually I am trying to create a for loop, using values to compare. However, when I print the new values, they use the original dataframe instead of the sorted version. How do I properly do the below?
Original CSV data:
date fruit quantity
4/5/2014 13:34 Apples 73
4/5/2014 3:41 Cherries 85
4/6/2014 12:46 Pears 14
4/8/2014 8:59 Oranges 52
4/10/2014 2:07 Apples 152
4/10/2014 18:10 Bananas 23
4/10/2014 2:40 Strawberries 98
Code:
import pandas as pd
import numpy
df = pd.read_csv('example2.csv', header=0, dtype='unicode')
df_count = df['fruit'].value_counts()
x = 0 #starting my counter values or position in the column
df.sort_values(['fruit'], ascending=True, inplace=True)  # sorting the column fruit
print(df)
old_fruit = df.fruit[x]
new_fruit = df.fruit[x+1]
print(old_fruit)
print(new_fruit)
I believe you are still accessing the old index label x. After you sort, insert this to reindex:
df.reset_index(drop=True, inplace=True)
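A minimal sketch (with a few made-up rows) showing why positional access breaks after sorting, and two fixes: reset_index, or position-based .iloc:

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['Pears', 'Apples', 'Cherries'],
                   'quantity': [14, 73, 85]})
df.sort_values('fruit', inplace=True)

# df.fruit[x] looks up the *label* x, which still refers to the pre-sort row
assert df.fruit[0] == 'Pears'

# Fix 1: position-based access with .iloc
assert df.fruit.iloc[0] == 'Apples'

# Fix 2: rebuild the index after sorting, then labels match positions again
df.reset_index(drop=True, inplace=True)
assert df.fruit[0] == 'Apples'
```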
