New to pandas.
Have a DataFrame with columns
A B C Date1 Date2 D
and multiple rows of values. I want to divide the entire DataFrame into multiple dataframes based on quarters, i.e. (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec), using only the Date1 column values. I tried the following so far:
data_q = data.groupby(pandas.TimeGrouper(freq = '3M'))
The dates are in the form 2009-11-03.
There are a few ways to do this.
I would first ensure that the Date1 column is a datetime type by checking the .dtype attribute,
e.g. df['Date1'].dtype
If it's not, cast to datetime object using:
df.Date1 = pd.to_datetime(df.Date1)
Add a quarters column for eventual data frame slicing:
df['quarters'] = df.Date1.dt.quarter
Create your data frames:
q1 = df[df.quarters == 1]
q2 = df[df.quarters == 2]
q3 = df[df.quarters == 3]
q4 = df[df.quarters == 4]
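As a minimal end-to-end sketch (dummy data; column names assumed from the question):

import pandas as pd

df = pd.DataFrame({'Date1': ['2009-02-10', '2009-05-04', '2009-11-03'],
                   'A': [1, 2, 3]})
df.Date1 = pd.to_datetime(df.Date1)
df['quarters'] = df.Date1.dt.quarter
q4 = df[df.quarters == 4]  # only the 2009-11-03 row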
So the approach that appears easiest to me is to convert Date1 to your index, then groupby on the quarter.
df2 = df.set_index('Date1')
quardfs = list(df2.groupby(df2.index.quarter))
This will leave you with quardfs, which is a list of (quarter, DataFrame) pairs.
If you don't want to set Date1 to an index, you can also copy it out of the DataFrame and use it:
quars = pd.DatetimeIndex(df['Date1']).quarter
quardfs = list(df.groupby(quars))
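Each element of quardfs is a (quarter, DataFrame) pair, so you can iterate over it or turn it into a dict keyed by quarter, for example:

for quarter, frame in quardfs:
    print(quarter, len(frame))

by_quarter = {q: g for q, g in df2.groupby(df2.index.quarter)}
q4 = by_quarter[4]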
From the datetime object in the dataframe I created two new columns based on month and day.
data["time"] = pd.to_datetime(data["datetime"])
data['month']= data['time'].apply(lambda x: x.month)
data['day']= data['time'].apply(lambda x: x.day)
The resultant data had the correct month and day added to the specific columns.
Then I tried to filter it based on specific day
data = data[data['month']=='9']
data = data[data['day']=='2']
These values were visible in the dataframe before filtering.
This returns an empty dataframe. What did I do wrong?
Compare with the integers 9 and 2, without quotes:
data = data[(data['month']==9) & (data['day']==2)]
Or:
data = data[(data['time'].dt.month == 9) & (data['time'].dt.day == 2)]
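The original filter returns an empty frame because x.month and .dt.month both produce integers, and an integer never equals the string '9'. A quick check:

print(data['month'].dtype)            # int64
print((data['month'] == '9').any())   # False: ints never equal strings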
I've done a dataframe aggregation and I want to add a new column in which, if a row has a value > 0 in year 2020, it gets a 1, otherwise a 0.
This is my code; the head of the dataframe was posted as an image (omitted here):
df['year'] = pd.DatetimeIndex(df['TxnDate']).year # add column year
df['client'] = df['Customer'].str.split(' ').str[:3].str.join(' ') # add column with the first 3 words
Datedebut = df['year'].min()
Datefin = df['year'].max()
#print(df)
df1 = df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()
print(df1)
df1['nb2020'] = np.where(df1['year'] == 2020, 1, 0)
Printing df1 before the last line shows the unstacked table (the output was posted as an image, omitted here).
The last line errors with: KeyError: 'year'
Thanks.
When you performed the aggregation and unstacked (df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()), the values of the year column were expanded into columns, and those columns became a MultiIndex. You can inspect that by calling:
print(df1.columns)
And then you can select them.
Using the MultiIndex column
So to select the column which matches 2020, you can use:
df1.loc[:, df1.columns.get_level_values(2).isin({2020})]
You can then select the correct column and check whether 2020 has a non-zero value:
df1['nb2020'] = df1.loc[:,df1.columns.get_level_values('year').isin({2020})] > 0
If you would like 1 and 0 instead of booleans, you can convert to int (using astype).
Renaming the columns
If you find this a bit complicated, you might prefer to flatten the columns to a single index, using something like:
df1.columns = df1.columns.get_level_values('year')
Or
df1.columns = df1.columns.get_level_values(2)
And then
df1['nb2020'] = (df1[2020] > 0).astype(int)
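Putting it together on made-up data shaped like the question's (a sketch; client names and amounts are invented):

import pandas as pd

df = pd.DataFrame({'client': ['a', 'a', 'b'],
                   'year': [2019, 2020, 2019],
                   'Amount': [10.0, 5.0, 7.0]})
df1 = df.groupby(['client', 'year']).agg({'Amount': ['sum']}).unstack()
mask = df1.columns.get_level_values('year').isin({2020})
df1['nb2020'] = df1.loc[:, mask].gt(0).any(axis=1).astype(int)
# client a gets 1 (it has a 2020 amount > 0), client b gets 0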
Below is a sample pandas dataframe. I am trying to find the difference between the dates in the two rows (with the first row as the base):
PH_number date Type
H09879721 2018-05-01 AccountHolder
H09879731 2018-06-22 AccountHolder
If the difference between two dates is within 90 days, then those two rows should be added to a new pandas dataframe. The date column is of type object.
How can I do this?
Use .diff() (after converting the date column with pd.to_datetime, since it is currently object dtype):
df.date.diff()<=pd.Timedelta(90,'d')
0 False
1 True
Name: date, dtype: bool
Convert date column to datetime64[ns] data type using pd.to_datetime and then subtract as given:
df['date'] = pd.to_datetime(df['date'])

# if comparing with only the 1st row
mask = (df['date'] - df.loc[0, 'date']).dt.days <= 90
# alternative: mask = (df['date'] - df.loc[0, 'date']).dt.days.le(90)

# if comparing with the immediately preceding rows
mask = df['date'].diff().dt.days <= 90
# alternative: mask = df['date'].diff().dt.days.le(90)

df1 = df.loc[mask, :]  # gives you the required rows with all columns
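For the two sample rows from the question, the whole flow looks like this (a sketch using the base-row comparison):

import pandas as pd

df = pd.DataFrame({'PH_number': ['H09879721', 'H09879731'],
                   'date': ['2018-05-01', '2018-06-22'],
                   'Type': ['AccountHolder', 'AccountHolder']})
df['date'] = pd.to_datetime(df['date'])
mask = (df['date'] - df.loc[0, 'date']).dt.days.le(90)
df1 = df.loc[mask]  # keeps both rows: 2018-06-22 is 52 days after 2018-05-01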
I have two dataframes, A and B. A has a column called 'Symbol' with non-unique stock ticker symbols in random order, and the corresponding buy or sell quantities in another column called 'Shares'; it is indexed by non-negative integers. B, indexed by dates in the same date order as A and with the same number of rows, has A's ticker symbols as unique column names. I need to populate the rows of B with the purchase or sale amounts from the corresponding A.Shares values. I get an error when trying the code below. Alternatively, would it be possible to loop through the rows of A and match A's column values to B's column names, similar to a SQL join?
import pandas as pd
bCN = B.dtypes.index # list of column names in df B to be used for populating its stock quantity based on matching values from df A
A = pd.DataFrame({'Date': ['2011-01-14', '2011-01-19', '2011-01-19'],
                  'Symbol': ['AAPL', 'AAPL', 'IBM'],
                  'Order': ['BUY', 'SELL', 'BUY'],
                  'Shares': [1500, 1500, 4000]})  # example of A
B = pd.DataFrame({'AAPL': [0, 0, 0], 'IBM': [0, 0, 0]}, index=pd.date_range(start, end))  # example of B
Expected Result
B = pd.DataFrame({'AAPL': [1500, 0, -1500], 'IBM': [0, 0, 4000]}, index=pd.date_range(start, end))  # example of resultant B
Attempt
B = A.pivot(index='Date', columns='Symbol', values='Shares')
B = B.fillna(value=0)
B['Cash'] = pd.Series([0] * len(B.index), index=B.index)
for index, row in A.iterrows():
    if row['Order'] == 'SELL':
        B.loc[row, A['Symbol']] *= -1
First of all, I highly suggest you read how-to-make-good-reproducible-pandas-examples.
I think you could use pivot, such as:
B = A.pivot(index='Date', columns='Symbol', values='Shares')
Since images of dataframes are hard to copy and paste, I can't show you the exact result you would get with this method.
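That said, applied to the example A constructed above, a sketch that also folds the BUY/SELL sign into the shares first so that sells come out negative (the Signed helper column is invented here):

import numpy as np
import pandas as pd

A = pd.DataFrame({'Date': ['2011-01-14', '2011-01-19', '2011-01-19'],
                  'Symbol': ['AAPL', 'AAPL', 'IBM'],
                  'Order': ['BUY', 'SELL', 'BUY'],
                  'Shares': [1500, 1500, 4000]})
A['Signed'] = np.where(A['Order'].eq('SELL'), -A['Shares'], A['Shares'])
B = A.pivot(index='Date', columns='Symbol', values='Signed').fillna(0)
# B: one row per date, one column per symbol; SELLs negative, gaps filled with 0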
I'm working with futures timeseries where the trading day starts at 17:00:00 CT and ends at 15:15:00 CT the next day. To account for this, I make a change to the index; however, when pivoting the dataframe, this change is ignored.
Let's look at it with an example:
import datetime
import numpy as np
import pandas as pd

# Dummy data
rng = pd.date_range('1/1/2011', periods=5000, freq='min')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
df = pd.DataFrame(ts, columns=['ts'])
df_1 = df.resample('5min').last()
# Change index to account for futures hours
df_1.index = pd.to_datetime(df_1.index.values + np.where((df_1.index.time >= datetime.time(17)), pd.offsets.Day(1).nanos, 0))
# Pivoting df_1 and making some formatting changes
df_2 = pd.pivot_table(df_1, index=df_1.index.date, columns=df_1.index.time, values='ts')
df_2.columns = df_2.columns.map(lambda t: t.strftime('%H%M'))
df_2_cols = df_2.columns.tolist()
for i in range(len(df_2_cols)):
    if df_2_cols[i][0] == '0':
        df_2_cols[i] = df_2_cols[i][1:4]
df_2.columns = df_2_cols  # write the stripped labels back to the dataframe
After doing all this, the dataframe has the shape and format I want, but the first column, corresponding to the first timestamp of the day, is 00:00:00 instead of 17:00:00, as I intended with the index modification.
How can I fix this?
After pivoting, the columns get sorted chronologically, but you can reorder them. Say the columns are already formatted, so we look for '1700':
pos = np.nonzero(df_2.columns == '1700')[0][0]
(np.nonzero returns a tuple of arrays, hence the double [0]). Then:
new_cols = df_2.columns[pos:].append(df_2.columns[:pos])
df_2 = df_2.reindex(columns = new_cols)
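A quick self-contained check of the reordering, with a toy column set (a sketch):

import numpy as np
import pandas as pd

df_2 = pd.DataFrame([[1, 2, 3]], columns=['000', '1700', '2300'])
pos = np.nonzero(df_2.columns == '1700')[0][0]
new_cols = df_2.columns[pos:].append(df_2.columns[:pos])
df_2 = df_2.reindex(columns=new_cols)
print(df_2.columns.tolist())  # ['1700', '2300', '000']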