I have an n x 6 matrix with the data: year, month, day, hour, minute, use.
I have to make a new matrix containing the measurements of use aggregated by hour, so all rows recorded within the same hour are combined.
Every time the hour value changes, the code needs to know that a new period starts.
I just tried something, but I don't know how to solve this.
Thank you. This is what I tried, plus a test:
import numpy as np

def groupby_measurements(data):
    count = -1
    for i in range(9):
        array = np.split(data, np.where(data[i,3] != data[i+1,3])[0][:1])
    return array
print(groupby_measurements(np.array([[2006,2,11,1,1,55],
                                     [2006,2,11,1,11,79],
                                     [2006,2,11,1,32,2],
                                     [2006,2,11,1,41,66],
                                     [2006,2,11,1,51,76],
                                     [2006,2,11,10,2,89],
                                     [2006,2,11,10,3,33],
                                     [2006,2,11,14,2,22],
                                     [2006,2,11,14,5,34]])))
In this test case, I expect the output to be:
np.array([[2006,2,11,1,1,55],
          [2006,2,11,1,11,79],
          [2006,2,11,1,32,2],
          [2006,2,11,1,41,66],
          [2006,2,11,1,51,76]]),
np.array([[2006,2,11,10,2,89],
          [2006,2,11,10,3,33]]),
np.array([[2006,2,11,14,2,22],
          [2006,2,11,14,5,34]])
The final output should be:
np.array([2006,2,11,1,0,278]),
np.array([2006,2,11,10,0,122]),
np.array([2006,2,11,14,0,56])
(the sum of use in the 3 hour periods)
I would recommend using pandas DataFrames, and then using groupby combined with sum:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array(
    [[2006,2,11,1,1,55],
     [2006,2,11,1,11,79],
     [2006,2,11,1,32,2],
     [2006,2,11,1,41,66],
     [2006,2,11,1,51,76],
     [2006,2,11,10,2,89],
     [2006,2,11,10,3,33],
     [2006,2,11,14,2,22],
     [2006,2,11,14,5,34]]),
    columns=['year','month','day','hour','minute','use'])
aggregated = data.groupby(['year','month','day','hour'])['use'].sum()
# you can also use .agg and pass which aggregation function you want as a string.
aggregated = data.groupby(['year','month','day','hour'])['use'].agg('sum')
year  month  day  hour
2006  2      11   1       278
                  10      122
                  14       56
Name: use, dtype: int64
aggregated is now a pandas Series; if you want it as an array, just do
aggregated.values
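To get the exact final output from the question (with the minute zeroed out), you can reset the index and rebuild the columns. A small sketch on top of the code above (the minute handling is my addition, not part of the original answer):

out = aggregated.reset_index()
out['minute'] = 0  # each aggregated row stands for the whole hour
out = out[['year','month','day','hour','minute','use']].values
# array([[2006,    2,   11,    1,    0,  278],
#        [2006,    2,   11,   10,    0,  122],
#        [2006,    2,   11,   14,    0,   56]])

For completeness, the np.split idea from the question can also be made to work in pure numpy. A minimal sketch, assuming the rows are already sorted in time:

arr = data.values  # back to the plain integer array
# split wherever the hour column (index 3) changes
groups = np.split(arr, np.where(np.diff(arr[:, 3]) != 0)[0] + 1)
# keep year/month/day/hour from each group's first row, zero the minute, sum the use
result = np.array([[g[0,0], g[0,1], g[0,2], g[0,3], 0, g[:,5].sum()] for g in groups])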
I have a dataframe:
df = pd.DataFrame({'year':[2000,2000,2000,2001,2001,2002,2002,2002],'ID':['a','b','c','a','b','a','b','c'],'values':[1,2,3,4,5,7,8,9]})
I would like to create a column that has the lag value for each ID-year. For example, ID 'a' in 2000 has a value of 1, so ID 'a' in 2001 should have a pre-value of 1. The key point is that if an ID doesn't have a value in the previous year (so the years are not continuous for some ID), then the pre-value should be NaN instead of the value from two years ago. For example, ID 'c' doesn't show up in 2001, so for 2002, ID 'c' should have pre-value = NaN.
Ideally, the final output should look like the following:
   year ID  values  pre_value
0  2000  a       1        NaN
1  2000  b       2        NaN
2  2000  c       3        NaN
3  2001  a       4        1.0
4  2001  b       5        2.0
5  2002  a       7        4.0
6  2002  b       8        5.0
7  2002  c       9        NaN
I tried df.groupby(['ID'])['values'].shift(1), but it gives the following:
0    NaN
1    NaN
2    NaN
3    1.0
4    2.0
5    4.0
6    5.0
7    3.0
Name: values, dtype: float64
The problem is that when ID 'c' doesn't have a value one year ago, the value from two years ago is used. I also tried a MultiIndex shift, which gives me the same result:
df.set_index(['year','ID'], inplace = True)
df.groupby(level=1)['values'].shift(1)
The thing that works is the answer mentioned here, but since my dataframe is fairly large, the merge kills the kernel. So far I haven't figured out a better way to do it. I hope I explained my problem clearly.
Suppose the year column is unique for each ID (i.e., there are no duplicated years for any specific ID). You can then shift the values first, and replace the shifted values with NaN wherever the difference between the year in the current row and the previous row is not equal to 1:
import pandas as pd
import numpy as np
df['pre_value'] = df.groupby('ID')['values'].shift(1)
df['pre_value'] = df.pre_value.where(df.groupby('ID').year.diff() == 1, np.nan)
df
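For the example frame, df.groupby('ID').year.diff() is NaN on each ID's first row, 1.0 where the years are consecutive, and 2.0 for ID 'c' in 2002, so the where call keeps exactly the consecutive-year lags:

df.groupby('ID').year.diff()
0    NaN
1    NaN
2    NaN
3    1.0
4    1.0
5    1.0
6    1.0
7    2.0
Name: year, dtype: float64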
A reindex approach: reindexing each ID onto its full year range inserts NaN rows for the missing years, so the shift naturally yields NaN right after a gap.
def reindex_min_max(df):
    mn = df.year.min()
    mx = df.year.max() + 1
    d = df.set_index('year').reindex(pd.RangeIndex(mn, mx, name='year'))
    return pd.concat([d, d['values'].shift().rename('pre_value')], axis=1)

df.groupby('ID')[['year', 'values']].apply(reindex_min_max) \
  .sort_index(level=[1, 0]).dropna(subset=['values']).reset_index()
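On the example frame this should produce (values become floats because of the NaN rows introduced by the reindex):

  ID  year  values  pre_value
0  a  2000     1.0        NaN
1  b  2000     2.0        NaN
2  c  2000     3.0        NaN
3  a  2001     4.0        1.0
4  b  2001     5.0        2.0
5  a  2002     7.0        4.0
6  b  2002     8.0        5.0
7  c  2002     9.0        NaN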
Currently I have a column of strings in a pandas dataframe which represents a particular year range in a "yyyy-yyyy" format; for example, "2004-2005" is a single string value in this column.
I wanted to know if there is any way to convert this from a string to something similar to a datetime format.
The purpose of this is to calculate the difference in years between the values of this column and another similar column. For example, something like the below:
col 1      col2       Answer (Total years)
2004-2005  2006-2007  3
Note: one of the ways I thought of doing this was to make a dictionary mapping each year to a unique integer value and then calculate the difference between them.
Although I was wondering if there is any simpler way of doing it.
It looks like you're subtracting the last year in column 2 from the first year in column 1. In that case I'd use str.extract (and convert the result to a number):
In [11]: pd.to_numeric(df['col 1'].str.extract(r'(\d{4})'))
Out[11]:
0 2004
Name: col 1, dtype: int64
In [12]: pd.to_numeric(df['col2'].str.extract(r'-(\d{4})')) - pd.to_numeric(df['col 1'].str.extract(r'(\d{4})'))
Out[12]:
0 3
dtype: int64
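One caveat if you are on a newer pandas: str.extract returns a DataFrame by default (expand=True became the default in 0.23), which pd.to_numeric won't accept, so pass expand=False to keep the Series behavior shown above:

years1 = pd.to_numeric(df['col 1'].str.extract(r'(\d{4})', expand=False))
years2 = pd.to_numeric(df['col2'].str.extract(r'-(\d{4})', expand=False))
years2 - years1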
What do you mean by "something similar to a datetime object"? Datetimes aren't designed to represent date ranges.
If you want to create a pair of datetime objects you could do something like this:
import datetime
[datetime.datetime.strptime(x, '%Y') for x in '2005-2006'.split('-')]
Alternatively you could try using a Pandas date_range object if that's closer to what you want.
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.date_range.html
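For example, a minimal sketch (freq='YS' means year-start in recent pandas; older versions spell it 'AS'):

import pandas as pd

start, end = '2005-2006'.split('-')
rng = pd.date_range(start=start, end=end, freq='YS')
# DatetimeIndex(['2005-01-01', '2006-01-01'], ...)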
If you are trying to find the difference between the lowest year and the highest year, here is a go at it:
col1="2004-2005"
col2="2006-2007"
col1=col1.split("-") # make a list of the years in col1 ['2004', '2005']
col2=col2.split("-") # make a list of the years in col2 ['2006', '2007']
biglist=col1+col2 #add the two list
biglist.sort() #sort the list from lowest year to highest year
Answer=int(biglist[len(biglist)-1])-int(biglist[0]) #find the difference between lowest and highest year
I have a pandas DataFrame with a TIMESTAMP column, which is of the datetime64 data type. Please keep in mind, initially this column is not set as the index; the index is just regular integers, and the first few rows look like this:
                TIMESTAMP  TYPE
0 2014-07-25 11:50:30.640     2
1 2014-07-25 11:50:46.160     3
2 2014-07-25 11:50:57.370     2
There is an arbitrary number of records for each day, and there may be days with no data. What I am trying to get is the average number of daily records per month, then plot it as a bar chart with months on the x-axis (April 2014, May 2014, etc.). I managed to calculate these values using the code below,
dfWIM.index = dfWIM.TIMESTAMP
for i in range(dfWIM.TIMESTAMP.dt.year.min(), dfWIM.TIMESTAMP.dt.year.max() + 1):
    for j in range(1, 13):
        print(dfWIM[(dfWIM.TIMESTAMP.dt.year == i) & (dfWIM.TIMESTAMP.dt.month == j)].resample('D', how='count').TIMESTAMP.mean())
which gives the following output:
nan
nan
3100.14285714
6746.7037037
9716.42857143
10318.5806452
9395.56666667
9883.64516129
8766.03225806
9297.78571429
10039.6774194
nan
nan
nan
This is OK as it is, and with some more work I can map the results to the correct month names and then plot the bar chart. However, I am not sure if this is the correct/best way, and I suspect there might be an easier way to get the results using pandas.
I would be glad to hear what you think. Thanks!
NOTE: If I do not set the TIMESTAMP column as the index, I get a "reduction operation 'mean' not allowed for this dtype" error.
I think you'll want to do two rounds of groupby: first group by day and count the instances, then group by month and compute the mean of the daily counts. You could do something like this.
First I'll generate some fake data that looks like yours:
import pandas as pd
import numpy as np

# make 1000 random times throughout the year
N = 1000
times = pd.date_range('2014', '2015', freq='min')
ind = np.random.permutation(np.arange(len(times)))[:N]
data = pd.DataFrame({'TIMESTAMP': times[ind],
                     'TYPE': np.random.randint(0, 10, N)})
data.head()
Now I'll do the two groupbys using pd.TimeGrouper and plot the monthly average counts:
import seaborn as sns # for nice plot styles (optional)
daily = data.set_index('TIMESTAMP').groupby(pd.TimeGrouper(freq='D'))['TYPE'].count()
monthly = daily.groupby(pd.TimeGrouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
The formatting along the x axis leaves something to be desired, but you can tweak that if necessary.
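One note if you are running a recent pandas: pd.TimeGrouper was deprecated and later removed in favor of pd.Grouper (or plain resample), so the equivalent there would be something like:

daily = data.set_index('TIMESTAMP').resample('D')['TYPE'].count()
monthly = daily.resample('M').mean()  # 'ME' on pandas >= 2.2
ax = monthly.plot(kind='bar')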
I have a dataframe in pandas with the columns Year (int), Loc (an ordered pair of ints), and Rain (boolean). There are many data points of Rain for each Year. For example, in the data you might see:
Year | Loc    | Rain
1700 | (0, 0) | 1
1700 | (0, 0) | 1
1700 | (5, 6) | 0
etc.
Is there a function that will combine these data points into a single data point when Year AND Loc are the same, with Rain as the sum of all the Rain values for that Year and Loc?
Do you mean to group by "Year" and "Loc" and show the SUM of Rain? Something like the following?
df.groupby(['Year', 'Loc']).sum().reset_index()
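On the sample rows above, that should give:

   Year     Loc  Rain
0  1700  (0, 0)     2
1  1700  (5, 6)     0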
Hi there. This should do the trick as well:
import pandas as pd

# Just a dict of your data
dd = {'year': (1700, 1700, 1700), 'loc': ((0, 0), (0, 0), (5, 6)), 'rain': (1, 1, 0)}
df = pd.DataFrame(dd)

# Set an index, group by it, and aggregate with sum.
adjusted_df = df.set_index(['year', 'loc']).groupby(level=['year', 'loc']).sum()
Though this is almost exactly the same as the first solution, which is probably better (less code).