Matplotlib histogram from x,y values with datetime months as bins - python

I have an array of datetime objects x and an array of y values corresponding to those datetimes. I'm trying to create a histogram which groups those y values into the same bin by month: basically, summing all y values which fall in the same month and plotting a histogram which shows the total for each month.
This is a simplified version of what my data looks like:
x = np.array([datetime.datetime(2014, 2, 1, 0, 0), datetime.datetime(2014, 2, 13, 0, 0),
              datetime.datetime(2014, 3, 4, 0, 0), datetime.datetime(2014, 3, 6, 0, 0)])
y = np.array([4, 3, 2, 6])
The end result should be a histogram showing month 2 in 2014 with y value 7 and month 3 in 2014 with y value 8.
The first thing I tried was creating a pandas DataFrame out of my two arrays like so:
frame = pd.DataFrame({'x':x,'y':y})
This worked fine, with x mapping to the datetime objects and y to the corresponding values. However, after creating this DataFrame I'm lost on how to sum the y values by month and create bins out of those months using plt.hist().

First of all, thanks for a well-posed question with an example of your data.
This seems to be what you want:
import pandas as pd
import numpy as np
import datetime
%matplotlib inline
x = np.array([datetime.datetime(2014, 2, 1, 0, 0),
datetime.datetime(2014, 2, 13, 0, 0),
datetime.datetime(2014, 3, 4, 0, 0),
datetime.datetime(2014, 3, 6, 0, 0)])
y = np.array([4,3,2,6])
frame = pd.DataFrame({'x':x,'y':y})
(frame.set_index('x')                         # use the datetimes as the index
      .assign(month=lambda x: x.index.month)  # add a new column with the month
      .groupby('month')                       # group by that column
      .sum()                                  # sum the only remaining column, 'y'
      .plot.bar())                            # make a bar plot
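One caveat of grouping on the bare month number is that the same month from different years would share a bin. A sketch of a variant using `dt.to_period('M')`, which keeps year-month pairs distinct (same data as above):

```python
import datetime

import numpy as np
import pandas as pd

x = np.array([datetime.datetime(2014, 2, 1), datetime.datetime(2014, 2, 13),
              datetime.datetime(2014, 3, 4), datetime.datetime(2014, 3, 6)])
y = np.array([4, 3, 2, 6])
frame = pd.DataFrame({'x': pd.to_datetime(x), 'y': y})

# group by year-month periods, so e.g. Feb 2014 and Feb 2015 would stay separate
monthly = frame.groupby(frame['x'].dt.to_period('M'))['y'].sum()
print(monthly)
```

The result is a Series indexed by periods like `2014-02` and `2014-03`, with totals 7 and 8, which can be passed to `.plot.bar()` exactly like the chained version above.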

Do This First
df = pd.DataFrame(dict(y=y), pd.DatetimeIndex(x, name='x'))
df
            y
x
2014-02-01  4
2014-02-13  3
2014-03-04  2
2014-03-06  6
Option 1
df.resample('M').sum().hist()
Option 2
df.groupby(pd.TimeGrouper('M')).sum().hist()
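Note that `pd.TimeGrouper` was deprecated and later removed (pandas 1.0); on current pandas the equivalent spelling is `pd.Grouper`. A runnable sketch against the small example from the question:

```python
import datetime

import pandas as pd

x = [datetime.datetime(2014, 2, 1), datetime.datetime(2014, 2, 13),
     datetime.datetime(2014, 3, 4), datetime.datetime(2014, 3, 6)]
y = [4, 3, 2, 6]
df = pd.DataFrame(dict(y=y), index=pd.DatetimeIndex(x, name='x'))

# 'M' bins by month-end, just like TimeGrouper('M') did
monthly = df.groupby(pd.Grouper(freq='M')).sum()
print(monthly)
```

This produces the same two month-end bins (2014-02-28 with 7, 2014-03-31 with 8) as the resample-based options.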
Or Do This First
df = pd.DataFrame(dict(x=pd.to_datetime(x), y=y))
df
x y
0 2014-02-01 4
1 2014-02-13 3
2 2014-03-04 2
3 2014-03-06 6
Option 3
df.resample('M', on='x').sum().hist()
Yields the histogram figure (image omitted).

Related

How to get a series of highest values out of equal portions of another pandas series python

I am working with a Series in Python. What I want to achieve is to get the highest value out of every n values in the series.
For example:
if n is 3
Series: 2, 1, 3, 5, 3, 6, 1, 6, 9
Expected Series: 3, 6, 9
I have tried the nlargest function in pandas, but it returns the largest values in descending order, while I need the values in the order of the original series.
There are various options. If the series is guaranteed to have a length of a multiple of n, you could drop down to numpy and do a .reshape followed by .max along an axis.
Otherwise, if the index is the default (0, 1, 2, ...), you can use groupby:
import pandas as pd
n = 3
ser = pd.Series([2, 1, 3, 5, 3, 6, 1, 6, 9])
out = ser.groupby(ser.index // n).max()
out:
0 3
1 6
2 9
dtype: int64
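The numpy route mentioned above, for when the length is an exact multiple of n, could look like this (a sketch):

```python
import numpy as np
import pandas as pd

n = 3
ser = pd.Series([2, 1, 3, 5, 3, 6, 1, 6, 9])

# only valid when len(ser) is an exact multiple of n:
# reshape into rows of n values, then take the max of each row
out = ser.to_numpy().reshape(-1, n).max(axis=1)
print(out)  # [3 6 9]
```

Unlike the groupby version, this returns a plain numpy array rather than a Series, which may be all you need.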

Group rows based on +- threshold on high dimensional object

I have a large df with coordinates in multiple dimensions. I am trying to create classes (Objects) based on threshold difference between the coordinates. An example df is as below:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
So based on this df I want to group each row to a class based on -+ 2 across all coordinates. So the df will have a unique group name added to each row. So the output for this threshold function is:
 x   y   z  group
 1  10   7    -
 2  14   6    -
 3   5   2   G1
 4  14  43    -
 5   3   1   G1
 6  12  40    -
It is similar to clustering, but I want to work with my own threshold functions. How can this be done in Python?
EDIT
To clarify the threshold is based on the similar coordinates. All rows with -+ threshold across all coordinates will be grouped as a single object. It can also be taken as grouping rows based on a threshold across all columns and assigning unique labels to each group.
As far as I understood, what you need is the apply function. It was not entirely clear from your statement whether you need all pairwise differences between the coordinates, or just the neighbouring differences (x-y and y-z): row 5 has a difference of 4 between the x and z coordinates, but is still assigned to class G1.
That's why I wrote it for both possibilities, so you can choose whichever you need:
import pandas as pd
import numpy as np
def your_specific_function(row):
    '''
    For all pairwise differences use this instead:
    diffs = np.array([abs(row.x-row.y), abs(row.y-row.z), abs(row.x-row.z)])
    '''
    # for only x - y, y - z use this:
    diffs = np.diff(row)
    statement = all(diffs <= 2)
    if statement:
        return 'G1'
    else:
        return '-'
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
df['group'] = df.apply(your_specific_function, axis = 1)
print(df.head())
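If instead the intent is to compare rows with each other (so rows 3 and 5 form one group because every coordinate differs by at most 2 between them), a brute-force sketch might look like the following. The nested loop is O(n²), so this is only a starting point for small frames:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [10, 14, 5, 14, 3, 12],
                   'z': [7, 6, 2, 43, 1, 40]})

coords = df.to_numpy()
n = len(df)
labels = ['-'] * n
next_id = 1
for i in range(n):
    for j in range(i + 1, n):
        # rows match when every coordinate-wise absolute difference is <= 2
        if np.all(np.abs(coords[i] - coords[j]) <= 2):
            if labels[i] == '-':
                labels[i] = f'G{next_id}'
                next_id += 1
            labels[j] = labels[i]
df['group'] = labels
print(df)
```

On the example data this assigns G1 to the rows (3, 5, 2) and (5, 3, 1) and '-' to the rest, matching the table in the question.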

Cumulative average in python

I'm working with csv files.
I'd like to create a continuously updated average of a sequence. For example:
I'd like to output the running average at each individual value of a list
list; [a, b, c, d, e, f]
formula:
(a)/1= ?
(a+b)/2=?
(a+b+c)/3=?
(a+b+c+d)/4=?
(a+b+c+d+e)/5=?
(a+b+c+d+e+f)/6=?
To demonstrate:
if I have the list [1, 4, 7, 4, 19]
my output should be [1, 2.5, 4, 4, 7]
explained:
(1)/1=1
(1+4)/2=2.5
(1+4+7)/3=4
(1+4+7+4)/4=4
(1+4+7+4+19)/5=7
As far as my python file it is a simple code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('somecsvfile.csv')
x = [] #has to be a list of 1 to however many rows are in the "numbers" column, will be a simple [1, 2, 3, 4, 5] etc...
#x will be used to divide the numbers selected in y to give us z
y = df['numbers']
z = #new dataframe derived from the continuous average of y
plt.plot(x, z)
plt.show()
If numpy is needed that is no problem.
pandas.DataFrame.expanding is what you need.
Using it you can just call df.expanding().mean() to get the result you want:
mean = df.expanding().mean()
print(mean)
Out[10]:
0 1.0
1 2.5
2 4.0
3 4.0
4 7.0
If you want to do it just in one column, use pandas.Series.expanding.
Just use the column instead of df:
df['column_name'].expanding().mean()
You can use cumsum to get cumulative sum and then divide to get the running average.
x = np.array([1, 4, 7, 4, 19])
z = np.cumsum(x) / np.arange(1, len(x) + 1)
print(z)
output:
[1. 2.5 4. 4. 7. ]
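For reference, the same running average can be computed with just the standard library, using itertools.accumulate for the cumulative sums:

```python
from itertools import accumulate

x = [1, 4, 7, 4, 19]
# divide each cumulative sum by the count of values seen so far
z = [s / i for i, s in enumerate(accumulate(x), start=1)]
print(z)  # [1.0, 2.5, 4.0, 4.0, 7.0]
```

This avoids the numpy dependency entirely when the data is already a plain list.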
To give a complete answer to your question, filling in the blanks of your code using numpy and plotting:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
#df = pd.read_csv('somecsvfile.csv')
#instead I just create a df with a column named 'numbers'
df = pd.DataFrame([1, 4, 7, 4, 19], columns = ['numbers',])
x = range(1, len(df)+1) #x will be used to divide the numbers selected in y to give us z
y = df['numbers']
z = np.cumsum(y) / np.array(x)
plt.plot(x, z, 'o')
plt.xticks(x)
plt.xlabel('Entry')
plt.ylabel('Cumulative average')
But as pointed out by Augusto, you can also just put the whole thing into a DataFrame. Adding a bit more to his approach:
n = [1, 4, 7, 4, 19]
df = pd.DataFrame(n, columns = ['numbers',])
#augment the index so it starts at 1 like you want
df.index = np.arange(1, len(df)+1)
# create a new column for the cumulative average
df = df.assign(cum_avg = df['numbers'].expanding().mean())
# numbers cum_avg
# 1 1 1.0
# 2 4 2.5
# 3 7 4.0
# 4 4 4.0
# 5 19 7.0
# plot
df['cum_avg'].plot(linestyle = 'none',
                   marker = 'o',
                   xticks = df.index,
                   xlabel = 'Entry',
                   ylabel = 'Cumulative average')

How to generate a regular time series from date X to date Y with intervals of Z units?

I am new to Python (from Matlab) and having some trouble with a simple task:
How can I create a regular time series from date X to date Y with intervals of Z units?
E.g. from 1st January 2013 to 31st December 2013, every 10 minutes.
In Matlab:
t = datenum(2013,1,1):datenum(0,0,0,0,10,0):datenum(2013,12,31);
If you can use pandas then use pd.date_range
Ex:
import pandas as pd
d = pd.date_range(start='2013/1/1', end='2013/12/31', freq="10min")
This function will make an interval of anything, similar to range, but working also for non-integers:
def make_series(begin, end, interval):
    x = begin
    while x < end:
        yield x
        x = x + interval
You can then do this:
>>> import datetime
>>> date_x = datetime.datetime(2013,1,1)
>>> date_y = datetime.datetime(2013,12,31)
>>> step = datetime.timedelta(days=50)
>>>
>>> list(make_series(date_x, date_y, step))
[datetime.datetime(2013, 1, 1, 0, 0), datetime.datetime(2013, 2, 20, 0, 0),
 datetime.datetime(2013, 4, 11, 0, 0), datetime.datetime(2013, 5, 31, 0, 0),
 datetime.datetime(2013, 7, 20, 0, 0), datetime.datetime(2013, 9, 8, 0, 0),
 datetime.datetime(2013, 10, 28, 0, 0), datetime.datetime(2013, 12, 17, 0, 0)]
I picked a random date range, but here is how you can essentially get it done:
idx = pd.date_range('2018-2-6', '2018-3-4', freq='10Min')
a = pd.Series(range(len(idx)), index=idx)  # placeholder data, one value per timestamp
print(a)
You can attach a date range when creating the Series or DataFrame. The first argument you enter into the date_range() method is the start date, and the second argument is the end date. The last argument in this example is the freq argument, which here is set to 10 minutes. You can set it to 'H' for 1-hour or 'T' (alias 'min') for 1-minute intervals as well; note that 'M' means month-end, not minutes. Another method worth noting is asfreq(), which lets you change the frequency or store a copy of another dataframe with a different frequency. Here is an example:
copy = a.asfreq('45Min', method='pad')
print(copy)
That's important if you want to study multiple frequencies. You can just copy your dataframe repeatedly in order to look at various time intervals. Hope that this answer helps.

Pandas Time Series - add a value to the first value in each year

I have a Pandas time series with unevenly spaced dates/datapoints. I want to add 1 to the value of each data point that is the first value for each year.
The time series is very sparse and the data is sorted.
Is there a better way to do this than by looping through all data points and checking when the year changes?
Example:
dates = [datetime(2012, 1, 1, 1, 1), datetime(2012, 1, 1, 1, 2), datetime(2012, 1, 2, 0 ,0), datetime(2013, 1, 2, 0, 0), datetime(2014, 1, 3, 1, 1)]
ts = Series(np.random.randn(len(dates)), dates)
Using the example above I want to add 1 to value on 2012-01-01 01:01:00, 2013-01-02 00:00:00 and 2014-01-03 01:01:00
Sure. You can extract the year:
ts.index.year
Find where the adjacent difference is nonzero:
np.diff(ts.index.year) != 0
Remember that you also want to select the very first data point:
np.concatenate(([True], np.diff(ts.index.year) != 0))
And then modify those data points:
ts[np.concatenate(([True], np.diff(ts.index.year) != 0))] += 1
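Putting those steps together into a runnable sketch (with zeros instead of random data so the result is deterministic):

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2012, 1, 1, 1, 1), datetime(2012, 1, 1, 1, 2),
         datetime(2012, 1, 2, 0, 0), datetime(2013, 1, 2, 0, 0),
         datetime(2014, 1, 3, 1, 1)]
ts = pd.Series(np.zeros(len(dates)), index=dates)

# True for the very first point and wherever the year changes
mask = np.concatenate(([True], np.diff(ts.index.year) != 0))
ts[mask] += 1
print(ts)
```

Only the three points noted in the question (the first of 2012, 2013 and 2014) end up incremented; since the data is sorted, no explicit loop over the points is needed.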
