Assume I have the following data frame:
I want to create two data frames such that for any row if column Actual is equal to column Predicted then the value in both columns goes in one data frame otherwise both columns go in another data frame.
For example, rows 0, 1, 2 go in a dataframe named correct_df and rows 245, 247 go in a dataframe named incorrect_df.
Use boolean indexing:
m = df['Actual'] == df['Predicted']
correct_df = df.loc[m]
incorrect_df = df.loc[~m]
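A minimal sketch with made-up data (the Actual/Predicted column names come from the question; the values are assumptions):
import pandas as pd

df = pd.DataFrame({'Actual': [1, 0, 1, 1], 'Predicted': [1, 0, 0, 1]})
m = df['Actual'] == df['Predicted']
correct_df = df.loc[m]     # rows 0, 1, 3: prediction matched
incorrect_df = df.loc[~m]  # row 2: prediction did not match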
You can use this:
df_cor = df.loc[df['Actual'] == df['Predicted']]
df_incor = df.loc[df['Actual'] != df['Predicted']]
And use reset_index if you want a new index.
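For example (drop=True discards the old index instead of keeping it as a column):
df_cor = df_cor.reset_index(drop=True)
df_incor = df_incor.reset_index(drop=True)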
I am not sure how to build a data frame here, but I am looking for a way to take the data from multiple columns and combine it into one column. Not as a sum but as a joined value.
Ex. MB|Val|34567|W123 -> MB|Val|34567|W123|MB_Val_34567_W123.
What I have tried so far is creating a conditions variable that checks a particular column for the exact value in it:
conditions = [(Groupings_df['GroupingCriteria1'] == 'MB')]
then a values variable that would include what I want in the new column
values = ['MB_Val_34567_W123']
and lastly grouping it
Groupings_df['GroupingColumn'] = np.select(conditions,values)
This works for 1 row, but it would be inefficient to keep manually changing the number in the values variable (34567) over a df with thousands of rows.
IIUC, you want to create a new column as a concatenation of each row:
df = pd.DataFrame({'GC1': ['MB'], 'GC2': ['Val'], 'GC3': [34567], 'GC4': ['W123'],
'Dummy': [10], 'Other': ['Hello']})
df['GC'] = df.filter(like='GC').astype(str).apply(lambda x: '_'.join(x), axis=1)
print(df)
# Output
GC1 GC2 GC3 GC4 Dummy Other GC
0 MB Val 34567 W123 10 Hello MB_Val_34567_W123
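As a hedged aside, the same concatenation can be written with agg instead of apply:
df['GC'] = df.filter(like='GC').astype(str).agg('_'.join, axis=1)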
What I have is a list of DataFrames.
What is important to note is that the shapes of the DataFrames differ between 2 and 7 columns, and the columns are named by position from 0 up to the number of columns (e.g. df1 has 5 columns named 0,1,2,3,4; df2 has 4 columns named 0,1,2,3).
What I would like is to check whether any row in a column contains a certain string, and if so delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is the below, and I get an error that column 5 is not in axis (it only exists in some of the DataFrames):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs1:
    for col in df.columns:
        # If you are unsure about column types, cast the column as string:
        df[col] = df[col].astype(str)
        # Check if the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are already of type string, you can skip the df[col] = df[col].astype(str) cast.
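A hedged variant of the same idea that collects the matching columns first and drops them in a single call (same casting assumption):
to_drop = [col for col in df.columns if df[col].astype(str).str.contains("DEC").any()]
df.drop(columns=to_drop, inplace=True)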
You can write a custom function that checks whether a column matches the pattern, using pd.Series.str.contains with pd.Series.any:
def func(s):
    # True if any value in the column contains the pattern
    return s.str.contains('DEC').any()

# df.apply(func) returns one boolean per column, so ~df.apply(func)
# keeps only the columns without the pattern
list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
I would take another approach: concatenate the list into one data frame and then eliminate the columns where the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition was to eliminate any column containing "DEC":
df.mask(df == "DEC").dropna(axis=1, how="any")
Note that this matches cells exactly equal to "DEC" rather than substrings, and since the frames have different widths, the concat itself introduces NaN columns that dropna will remove as well.
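If you need substring matching over the concatenated frame, a hedged sketch (assuming every value can be cast to str):
mask = df.apply(lambda s: s.astype(str).str.contains("DEC").any())
df = df.loc[:, ~mask]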
I want to compare 2 csv files (A and B) and find the rows which are present in B but not in A, based only on specific columns.
I found a few answers, but they still do not give the result I expect.
Answer 1 :
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work: it handles a single column but not multiple columns.
Answer 2 :
df = pd.concat([old, new]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
final = df.reindex(idx)
This takes specific columns as input and also outputs only those columns. I want to print the whole record, not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [...]  # the names of the columns you want to compare on
new = pd.merge(A, B, how='right', on=columns)
# Pick any column from the first DataFrame that isn't in `columns`;
# after the merge it will usually carry an '_x' suffix.
col = new['some_column_x']
col = col.dropna()
new = new[~new['some_column_x'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
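As a hedged alternative sketch (not from the answers above), pandas' merge supports indicator=True, which flags where each row came from; rows of B with no match in A then keep all of B's columns:
only_in_B = (B.merge(A[columns].drop_duplicates(), on=columns, how='left', indicator=True)
             .query("_merge == 'left_only'")
             .drop(columns='_merge'))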
I have a main dataframe (df) with a Date column (non-index), a column 'VXX_Full' with values, and a 'signal' column.
I want to iterate through the signal column, and whenever it is 1, capture a slice (20 rows before, 40 rows after) of the 'VXX_Full' column and build a new dataframe with all the slices. I would like each column name of the new dataframe to be the row number of the original dataframe.
VXX_signal = pd.DataFrame(np.zeros((60, 0)))
counter = 1
for row in df.index:
    if df.loc[row, 'signal'] == 1:
        add_row = df.loc[row - 20:row + 20, 'VXX_Full']
        VXX_signal[counter] = add_row
        counter += 1
VXX_signal
It just doesn't seem to be working. It creates a dataframe, but the values are all NaN. The first slice at least appears to pull data from the main df, though not from the correct location; the remaining columns (there are 30 signals, so 30 columns are created) in the new df are all NaN.
Thanks in advance!
I'm not sure about your current code - but basically all you need is a list of ranges of indexes. If your index is linear, this would be something like:
indexes = list(df[df.signal==1].index)
ranges = [(i, list(range(i - 20, i + 21))) for i in indexes]  # create tuple (original index, range)
dfs = [df.loc[i[1]].copy().rename(
    columns={'VXX_Full': i[0]}).reset_index(drop=True) for i in ranges]

# EDIT: for only the VXX_Full column:
dfs = [df.loc[i[1]][['VXX_Full']].copy().rename(
    columns={'VXX_Full': i[0]}).reset_index(drop=True) for i in ranges]
# Here we take the -20:+20 slice of df, make a separate dataframe, then
# rename 'VXX_Full' to the original index value, and reset the index to give it a 0:40 index.
# The new index will be useful when putting all the columns next to each other.
So we made a list of indexes with signal == 1, turned it into a list of ranges and finally a list of dataframes with reset index.
Now we want to merge it all together:
from functools import reduce
merged_df = reduce(lambda left, right: pd.merge(
left, right, left_index=True, right_index=True), dfs)
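Since every frame in dfs shares the same 0:40 index after reset_index, a hedged one-line equivalent is:
merged_df = pd.concat(dfs, axis=1)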
I would build the resulting dataframe from a dictionary of lists:
resul = pd.DataFrame({i: df.loc[i - 20 if i >= 20 else 0:
                                i + 40 if i <= len(df) - 40 else len(df),
                                'VXX_Full'].values
                      for i in df.loc[df.signal == 1].index})
The trick is that .values extracts a numpy array with no associated index.
Beware: the above code assumes that the index of the original dataframe is just the row number; use reset_index first if it is different. Also note that a slice truncated at the edges yields a shorter array, and pd.DataFrame will not accept arrays of unequal length, so signals within 20 rows of the start or 40 rows of the end would need padding.
I have a MultiIndex DataFrame and I'm trying to select data in it based on certain criteria. So far so good. The problem is that once I have selected my data using .loc and pd.IndexSlice, the resulting DataFrame, which should logically have fewer rows and fewer elements in the first level of the MultiIndex, keeps exactly the same MultiIndex, with some of its keys referring to empty dataframes.
I've tried creating a completely new DataFrame with a new index, but the structure of my data set is complicated and there is not always the same number of elements in a given level, so it is not easy to create a DataFrame with the right shape to put the data in.
import numpy as np
import pandas as pd

np.random.seed(3)  # so my example is reproducible
idx = pd.IndexSlice

iterables = [['A', 'B', 'C'], [0, 1, 2], ['some', 'rdm', 'data']]
my_index = pd.MultiIndex.from_product(iterables,
                                      names=['first', 'second', 'third'])
my_columns = ['col1', 'col2', 'col3']
df1 = pd.DataFrame(data=np.random.randint(10, size=(len(my_index), len(my_columns))),
                   index=my_index,
                   columns=my_columns)

# OK, so let's say I want to keep only the elements in the first level of my
# index (["A","B","C"]) for which the total sum in column 3 is less than 35,
# for some reason
boolean_mask = (df1.groupby(level="first").col3.sum() < 35).tolist()
first_level_to_keep = df1.index.levels[0][boolean_mask].tolist()

# let's select the wanted data and put it in df2
df2 = df1.loc[idx[first_level_to_keep, :, :], :]
So far, everything is as expected
The problem is when I want to access the df2 index. I expected the following:
df2.index.levels[0].tolist() == ['B','C']
to be true, but this is what actually evaluates to True:
df2.index.levels[0].tolist() == ['A','B','C']
So my question is: is there a way to select data and get in return a DataFrame whose MultiIndex reflects what is actually in it? I find it weird to be able to select non-existing data in my df2.
I tried to put some images of the dataframes in question but I couldn't because I don't have enough «reputation»... sorry about that.
Thank you for your time!
Even if you delete the rows corresponding to a particular value in an index level, that value still exists in the level. You can reset the index and then set those columns back as the index to generate a MultiIndex with fresh level values.
df2 = df2.reset_index().set_index(['first','second','third'])
print(df2.index.levels[0].tolist() == ['B','C'])
True
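As a hedged aside (not part of the original answer), pandas also provides MultiIndex.remove_unused_levels, which drops the stale level values without the reset/set round-trip:
df2.index = df2.index.remove_unused_levels()
print(df2.index.levels[0].tolist() == ['B', 'C'])  # True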