I have data in the following format (see below).
I next perform recasting, groupby, and averaging (see code) to reduce the data's dimensionality.
import pandas as pd

df_mod = pd.read_csv('wet_bulb_hr.csv')

# Parse the date column
df_mod['wbt_date'] = pd.to_datetime(df_mod['wbt_date'])

# Convert the hour column to int and fold it into the timestamp
df_mod['wbt_time'] = df_mod['wbt_time'].astype('int')
df_mod['wbt_date'] = df_mod['wbt_date'] + \
    pd.to_timedelta(df_mod['wbt_time'] - 1, unit='h')

df_mod['wet_bulb_temperature'] = \
    df_mod['wet_bulb_temperature'].astype('float')

df = df_mod
df = df.drop(['wbt_time', '_id'], axis=1)

#df_novel = df.mean()

# Average per (year, month)
df = df.groupby([df.wbt_date.dt.year, df.wbt_date.dt.month]).mean()
After writing to an output file, I get an output that looks like this.
Investigating further, I can see why: all my processing leaves a DataFrame with a single data column, and the two wbt_date grouping levels end up in the index rather than being exported as columns, because of the groupby.
My question: how do I reset the index and turn the two groupby wbt_date levels into a single new column, so that the output is:
You can flatten the MultiIndex into a plain Index in YYYY-MM format with a list comprehension:
df = df.groupby([df.wbt_date.dt.year,df.wbt_date.dt.month]).mean()
df.index = [f'{y}-{m}' for y, m in df.index]
df = df.rename_axis('date').reset_index()
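If you prefer zero-padded months ('2019-01' rather than '2019-1'), a small tweak to the same comprehension should work (a minor variation, not from the original answer):
df.index = [f'{y}-{m:02d}' for y, m in df.index]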
Or group by month periods with Series.dt.to_period:
df = df.groupby(df.wbt_date.dt.to_period('m')).mean().reset_index()
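After reset_index, the wbt_date column holds Period objects; if plain 'YYYY-MM' strings are wanted instead (an assumption about the desired output), a cast is enough:
df['wbt_date'] = df['wbt_date'].astype(str)  # e.g. Period('2019-01', 'M') -> '2019-01'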
Try this,
# rename the existing index; on reset_index it will be added as new columns
df.index = df.index.set_names(["wbt_year", "wbt_month"])
df.reset_index(inplace=True)
df['month'] = df['wbt_year'].astype(str) + "-" + df['wbt_month'].astype(str)
Output,
>>> df['month']
0 2019-0
1 2018-1
2 2017-2
I am new to pandas, so I assume I must be missing something obvious...
Summary:
I have a DataFrame with 300K+ rows. I receive a row of new data that may or may not be related to an existing subset of rows in the DataFrame (identified by a group ID); I either look up the existing group ID or generate a new one, and finally insert the row with that group ID.
Pandas seems very slow for this.
Please advise: what am I missing, or should I be using something else?
Details:
Columns are (example):
columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
Each groupID can have many unique timeStamp values.
groupID is generated internally:
Either an existing one is reused (by matching the row to existing data, say on column 'D'),
or a new groupID is generated.
Thus (in my view at least) I cannot do updates/inserts in bulk; I have to do it row by row.
Following an SQL database analogy, I created an index as the concatenation of groupID and timeStamp (I have tried a MultiIndex, but it seems even slower).
Finally I insert/update using .loc[ind, columnName].
Code:
import pandas as pd
import numpy as np
import time
columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
columnTypeDict = {'groupID':'int64','timeStamp':'int64'}
startID = 1234567
df = pd.DataFrame(columns=columnList)
df = df.astype(columnTypeDict)
fID = list(range(startID,startID+300000))
df['groupID'] = fID
ts = [1000000000]*150000 + [10000000001]*150000
df['timeStamp'] = ts
indx = [str(i) + str(j) for i, j in zip(fID, ts)]
df['Index'] = indx
df['Index'] = df['Index'].astype('uint64')
df = df.set_index('Index')
startTime = time.time()
for groupID in range(startID + 49000, startID + 50000):
    timeStamp = 1000000003
    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))
    #print(ind)
    df.loc[ind, 'A'] = 1
print(df)
print(time.time() - startTime, "secs")
If the index label already exists it is fast, but if it doesn't, 10,000 inserts take 140 seconds.
I think accessing dataframes is a relatively expensive operation.
You can temporarily save these values and use them to create a DataFrame that is then merged with the original one, as follows:
startTime = time.time()
temporary_idx = []
temporary_values = []
for groupID in range(startID + 49000, startID + 50000):
    timeStamp = 1000000003
    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))
    temporary_idx.append(ind)
    temporary_values.append(1)
# create a dataframe with new values and apply a join with the original dataframe
df = df.drop(columns=["A"])\
.merge(
pd.DataFrame({"A": temporary_values}, index=temporary_idx).rename_axis("Index", axis="index"),
how="outer", right_index=True, left_index=True
)
print(df)
print(time.time()-startTime,"secs")
When I benchmarked it, this takes less than 2 seconds to execute.
I don't know exactly what your real use case is, but this works for the case of inserting into column A as stated in your example. If your use case is more complex than that, there might be a better solution.
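For instance, if whole rows need to be inserted rather than a single column, the same batching idea might still apply: collect the new rows in a list of dicts and concatenate once at the end. A rough sketch under that assumption (not from the original answer):
new_rows = []
for groupID in range(startID + 49000, startID + 50000):
    timeStamp = 1000000003
    ind = int(str(groupID) + str(timeStamp))
    # build the whole row up front instead of writing cell by cell
    new_rows.append({'Index': ind, 'groupID': groupID, 'timeStamp': timeStamp, 'A': 1})

# one concat instead of many .loc enlargements
df = pd.concat([df, pd.DataFrame(new_rows).set_index('Index')])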
I have a DataFrame on which I would like to perform some analysis. As an easy example of what I would like to achieve, given the DataFrame:
import pandas as pd

data = ['2017-02-13', '2017-02-13', '2017-02-13', '2017-02-15', '2017-02-16']
df = pd.DataFrame(data = data, columns = ['date'])
I would like to create a new DataFrame from this. The new DataFrame should contain 2 columns: the entire date span (so it should also include 2017-02-14) and the number of times each date appears in the original data.
I managed to construct a dataframe that includes all the dates as so:
dates = pd.to_datetime(df['date'], format = "%Y-%m-%d")
dateRange = pd.date_range(start = dates.min(), end = dates.max()).tolist()
df2 = pd.DataFrame(data = dateRange, columns = ['datum'])
My question is: how would I add the counts of each date from df to df2? I've been messing around trying to write my own functions but have not managed to achieve it. I assume this is a common task and that I am overthinking it...
Try this:
df2['counts'] = df2['datum'].map(pd.to_datetime(df['date']).value_counts()).fillna(0).astype(int)
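An equivalent way to get there (a sketch of an alternative, assuming the same df and dateRange as above) is to reindex the value counts over the full date range, which builds df2 in one step:
counts = pd.to_datetime(df['date']).value_counts()
df2 = counts.reindex(dateRange, fill_value=0).rename_axis('datum').reset_index(name='counts')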
I have a datetime-sorted DataFrame with "Headlines".
I want to group these "Headlines" by their date.
So my plan was to create a Series with all the "Headlines" for each day, and then append each Series to a new DataFrame with the date as the column name.
As the columns have different lengths, I used pd.concat() to add the Series to the DataFrame, but that doesn't work either.
Examples: input DataFrame and output DataFrame (shown in the original post).
df_sorted = df.sort_values('dates', ascending=False)
df_added = pd.DataFrame()
df1 = pd.DataFrame()
k = 0
for i in range(0, len(df_sorted)):
    # flush a group whenever the next row has a different date (or we reach the end)
    if i + 1 < len(df_sorted) and df_sorted.iloc[i]['dates'] == df_sorted.iloc[i + 1]['dates']:
        continue
    else:
        df1[df_sorted.iloc[i]['dates']] = df_sorted.iloc[k:i + 1]['headlines']
        df_added = pd.concat([df_added, df1], axis=1)
        df1 = pd.DataFrame()
        k = i + 1
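A sketch of the same idea using groupby (assuming columns named 'dates' and 'headlines' as in the snippet above; this is not taken from the original thread): collect the headlines per day as Series and concatenate them side by side, with the dates as column names.
groups = {date: grp['headlines'].reset_index(drop=True)
          for date, grp in df.groupby('dates')}
df_added = pd.concat(groups, axis=1)  # shorter columns are padded with NaN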
I'm working with futures time series where the trading day starts at 17:00:00 CT and ends at 15:15:00 CT the next day. To account for this, I shift the index; however, when pivoting the DataFrame this change is ignored.
Let's look at it with an example:
import datetime

import numpy as np
import pandas as pd

# Dummy data
rng = pd.date_range('1/1/2011', periods=5000, freq='min')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
df = pd.DataFrame(ts, columns=['ts'])
df_1 = df.resample('5min').last()
# Change index to account for futures hours
df_1.index = pd.to_datetime(df_1.index.values + np.where((df_1.index.time >= datetime.time(17)), pd.offsets.Day(1).nanos, 0))
# Pivoting df_1 and making some formatting changes
df_2 = pd.pivot_table(df_1, index=df_1.index.date, columns=df_1.index.time, values='ts')
df_2.columns = df_2.columns.map(lambda t: t.strftime('%H%M'))
df_2_cols = df_2.columns.tolist()
# Strip the leading zero from labels such as '0800' -> '800'
for i in range(len(df_2_cols)):
    if df_2_cols[i][0] == '0':
        df_2_cols[i] = df_2_cols[i][1:4]
df_2.columns = df_2_cols  # apply the reformatted labels back to the DataFrame
After doing all this, the DataFrame is in the shape and format I want, but the first column, corresponding to the first timestamp of the day, is 00:00:00 instead of 17:00:00 as I intended with the index modification.
How can I fix this?
After pivoting, the columns are sorted chronologically, but you can reorder them. Say the columns are already formatted, so we look for '1700':
pos = np.nonzero(df_2.columns == '1700')[0][0]
(np.nonzero returns a tuple of arrays, hence those double [0]). Then
new_cols = df_2.columns[pos:].append(df_2.columns[:pos])
df_2 = df_2.reindex(columns = new_cols)
This question already has an answer here:
Grouping a dataframe by X columns
(1 answer)
Closed 5 years ago.
I have a DataFrame with 40 columns (columns 0 through 39) and I want to group them four at a time:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.binomial(1, 0.2, (100, 40)))

new_df = pd.DataFrame()
new_df["0-3"] = df[0] + df[1] + df[2] + df[3]
new_df["4-7"] = df[4] + df[5] + df[6] + df[7]
...
new_df["36-39"] = df[36] + df[37] + df[38] + df[39]
Can I do this in a single statement (or in a better way than summing them separately)? The column names in the new DataFrame are not important.
You could select out the columns and sum on the row axis, like this.
df['0-3'] = df.loc[:, 0:3].sum(axis=1)
A couple of things to note:
Summing like this will ignore missing data, while df[0] + df[1] + ... propagates it. Pass skipna=False if you want that behavior.
There is not necessarily any performance benefit; it may actually be a little slower.
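To cover all ten groups in one statement, a dict comprehension over the column slices should also work (a sketch extending the line above, using group labels such as '0-3'):
new_df = pd.DataFrame({f"{i}-{i + 3}": df.loc[:, i:i + 3].sum(axis=1)
                       for i in range(0, 40, 4)})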
Here's another way to do it:
new_df = df.transpose()
new_df['Group'] = new_df.index // 4
new_df = new_df.groupby('Group').sum().transpose()
Note that // is integer (floor) division, so rows 0 through 3 map to group 0, rows 4 through 7 to group 1, and so on.
I don't know if it is the best way to go but I ended up using MultiIndex:
df.columns = pd.MultiIndex.from_product((range(10), range(4)))
new_df = df.groupby(level=0, axis=1).sum()
Update: probably because of the index, this was faster than the alternatives. The same can be done with df.groupby(df.columns // 4, axis=1).sum(), which is faster if you take the time needed to construct the index into account. However, the index change is a one-time operation, and I update the df and take the sum thousands of times, so using a MultiIndex was faster for me.
Consider a list comprehension:
df = # your data
df_slices = [df.iloc[:, x:x + 4] for x in range(0, 40, 4)]
Or more generally
df_slices = [df.iloc[:, x:x + 4] for x in range(0, len(df.columns), 4)]
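Each slice can then be summed and the results concatenated, roughly like this (a sketch building on the slices above; the resulting columns are just group positions rather than '0-3'-style names):
new_df = pd.concat([s.sum(axis=1) for s in df_slices], axis=1)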