I extract data from a column using:
df_filtered = pd.DataFrame(df[1].apply(extractMMYY).tolist(), columns=['MM', 'YY'])
This returns a new DataFrame, but I need to put MM and YY back into the initial DataFrame df.
I have tried:
df(df[1].apply(extractMMYY).tolist(), columns=['MM', 'YY'])
Alternatively, I need a way to link the two datasets so that I can filter the first df by df_filtered.
It looks to me like you are trying to do
df[['MM', 'YY']] = df[1].apply(extractMMYY).tolist()
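For a self-contained illustration (the extractMMYY body and the sample values below are assumptions, since the original function isn't shown):

import pandas as pd

# Stand-in for the asker's extractMMYY; assumed to return an (MM, YY) pair.
def extractMMYY(value):
    mm, yy = str(value).split('/')
    return mm, yy

df = pd.DataFrame({1: ['05/23', '11/24']})

# Assign both new columns back onto the original df in one step.
df[['MM', 'YY']] = df[1].apply(extractMMYY).tolist()
print(df)  # df now has columns 1, MM, YY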
I would like to transform a dataframe based on another dataframe's transformed numbers.
Code
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 101, (100, 2)), columns=['M', 'A'])
max_min = 100
step = 5
df1['RnkGroup'] = df1.groupby(pd.cut(df1['M'],
    range(0, max_min, step)))['A'].transform('rank', ascending=False).round()
df1['RnkMeanGroup'] = df1.groupby(pd.cut(df1['M'],
    range(0, max_min + step, step)))['RnkGroup'].transform('mean').round()
Is it possible to transform a new df2 = pd.DataFrame(np.random.randint(0, 101, (100, 2)), columns=['M', 'A']) based on the previous one?
I need something like sklearn's fit_transform.
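There is no built-in fit_transform for this in pandas, but one way to mimic it, sketched below on top of the code above (the approach and names are my own, not an established API), is to treat the fixed bin edges plus df1's per-bin statistics as the "fitted" state and look them up for df2:

# "Fit": per-bin mean of RnkGroup on df1, using fixed bin edges.
bins = range(0, max_min + step, step)
fitted_means = df1.groupby(pd.cut(df1['M'], bins))['RnkGroup'].mean().round()

# "Transform": drop df2's M values into the same bins and look up df1's stats.
df2 = pd.DataFrame(np.random.randint(0, 101, (100, 2)), columns=['M', 'A'])
df2_bins = pd.cut(df2['M'], bins).astype(object)  # plain Interval values for the lookup
df2['RnkMeanGroup'] = df2_bins.map(fitted_means)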
I have a numeric NumPy array that I want to use as a condition/filter over column number 4 of a dataframe (df) to extract a subset (sale_data_sub). However, I am getting an empty sale_data_sub (just the column names, no rows) from this code:
sale_data_sub = df.loc[df[4].isin(sale_condition_arr)].values
sale_condition_arr is a NumPy array
df is the original dataframe with 100 columns
sale_data_sub is the desired sub-dataframe
Sorry that I didn't include a working sample.
The issue is that your df dataframe doesn't have headers assigned.
Try:

# give your dataframe a header:
df = df.set_axis([str(i) for i in range(len(df.columns))], axis='columns')

# then proceed with your usual work on df:
sale_data_sub = df.loc[df["4"].isin(sale_condition_arr)].values  # careful: it's df["4"], not df[4]
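As a quick, self-contained check of that fix (toy data assumed; note that .values returns a NumPy array, so omit it if you want a DataFrame back):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5))  # integer column labels, like a headerless read
sale_condition_arr = np.array([4, 14])

df = df.set_axis([str(i) for i in range(len(df.columns))], axis='columns')
sale_data_sub = df.loc[df["4"].isin(sale_condition_arr)]  # rows where column "4" is 4 or 14
print(sale_data_sub)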
Assume I have a data frame with columns Actual and Predicted.
I want to create two data frames such that, for any row, if column Actual equals column Predicted the row goes into one data frame, and otherwise it goes into the other.
For example, rows 0, 1, 2 go into a dataframe named correct_df and rows 245, 247 go into a dataframe named incorrect_df.
Use boolean indexing:
m = df['Actual'] == df['Predicted']
correct_df = df.loc[m]
incorrect_df = df.loc[~m]
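A quick check with assumed toy data:

import pandas as pd

df = pd.DataFrame({'Actual': [1, 0, 1, 1], 'Predicted': [1, 0, 0, 1]})
m = df['Actual'] == df['Predicted']
correct_df = df.loc[m]     # rows 0, 1, 3
incorrect_df = df.loc[~m]  # row 2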
You can use this:
df_cor = df.loc[df['Actual'] == df['Predicted']]
df_incor = df.loc[df['Actual'] != df['Predicted']]
And use reset_index if you want a new index.
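For example:

df_cor = df_cor.reset_index(drop=True)      # drop=True discards the old index
df_incor = df_incor.reset_index(drop=True)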
I have 2 dataframes of numerical data. Given a value from one of the columns in the second df, I would like to look up the index for that value in the first df. More specifically, I would like to create a third df that contains only index labels, using values from the second df to look up their coordinates in the first.
import pandas as pd

listso = [[21, 101], [22, 110], [25, 113], [24, 112], [21, 109],
          [28, 108], [30, 102], [26, 106], [25, 111], [24, 110]]
data = pd.DataFrame(listso, index=list('abcdefghij'), columns=list('AB'))
rollmax = data.rolling(center=False, window=5).max()
So for the third df, I hope to take the values from rollmax and figure out which row of data they appeared in. We can call this third df indexlookup.
For example, rollmax.loc['j','A'] = 30, so indexlookup.loc['j','A'] = 'g'.
Thanks!
You can build a Series with the indexing the other way around:
mapA = pd.Series(data.index, index=data.A)
Then mapA[rollmax.loc['j','A']] gives 'g'.
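Extending that idea to every column at once, a rough sketch (it assumes ties are acceptable to resolve by keeping the first occurrence, since duplicate values make the reverse lookup ambiguous):

indexlookup = rollmax.copy()
for col in data.columns:
    reverse = pd.Series(data.index, index=data[col])
    reverse = reverse[~reverse.index.duplicated(keep='first')]  # drop ambiguous duplicates
    indexlookup[col] = rollmax[col].map(reverse)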
I'm following a tutorial by Wes McKinney on using pandas/Python for trading backtesting (http://youtu.be/6h0IVlp_1l8).
After pd.read_csv(...) he uses the 'dt' (datetime) column as the index of the dataframe.
df.index = pd.to_datetime(df.pop('dt'))
However, my data has two separate columns, 'Date[G]' and 'Time[G]', and the data inside is something like 04-JAN-2013,00:00:00.000 (comma-separated).
How do I modify that line of code to do the same, i.e. merge the two columns within one data frame and then delete them? Or is there a way to do that during read_csv itself?
Thanks for all answers.
You should be able to concatenate the two columns using apply() and then use to_datetime().
To remove the columns from the dataframe use drop(), or just select the columns you need:
df['dt'] = pd.to_datetime(df.apply(lambda x: x['Date[G]'] + ' ' + x['Time[G]'], axis=1))
df = df.drop(['Date[G]', 'Time[G]'], axis=1)
# ..or
# df = df[['dt', ...]]
df.set_index('dt', inplace=True)
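If you would rather do it during read_csv itself, the dict form of parse_dates can combine the two columns into one ('trades.csv' is a placeholder path here, and this dict form is deprecated in recent pandas versions, so the post-read approach above is the more future-proof one):

import pandas as pd

# Combine 'Date[G]' and 'Time[G]' into a single parsed 'dt' column at read time.
df = pd.read_csv('trades.csv', parse_dates={'dt': ['Date[G]', 'Time[G]']})
df = df.set_index('dt')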