Suppose I have df1:
dates = pd.date_range('20170101',periods=20)
df1 = pd.DataFrame(np.random.randint(10,size=(20,3)),index=dates,columns=['foo','bar','see'])
I would like to create df2 with the same shape, index and columns. I often find myself doing something like this:
df2 = pd.DataFrame(np.ones(df1.shape), index=df1.index, columns=df1.columns)
This is less than ideal. What's the pythonic way?
How about this:
df2 = df1.copy()
df2[:] = 1 # Or any other value, for that matter
The last line is not even necessary if all you want is to preserve the shape and the row/column headers.
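If all you need is an empty shell with the same shape, index and columns (values all NaN), reindex_like is another option worth knowing; a minimal sketch:
df2 = pd.DataFrame().reindex_like(df1)  # same index/columns as df1, values are NaN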
You can also use the dataframe method "where" which will allow you to keep data based on condition and preserve the shape/index of the original df.
dates = pd.date_range('20170101',periods=20)
df1 = pd.DataFrame(np.random.randint(10,size=(20,3)),index=dates,columns=['foo','bar','see'])
df2 = df1.where(df1['foo'] % 2 == 0, 9999)
df2
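For the opposite behaviour there is also DataFrame.mask, which replaces values where the condition is True rather than where it is False; a minimal sketch on the same frame:
df2 = df1.mask(df1['foo'] % 2 == 0, 9999)  # 9999 wherever foo is even, original values elsewhere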
I have a dataframe containing only duplicate "MainID" rows. One MainID may have multiple secondary IDs (SecID). I want to concatenate the values of SecID, joined by ':' in the SecID column, whenever there is a common MainID. What is the best way of achieving this? Yes, I know this is not best practice, but it's the structure the software wants.
I need to keep the df structure and the values in the rest of the df. They will always match the other duplicated row; only SecID will be different.
Current:
data={'MainID':['NHFPL0580','NHFPL0580','NHFPL0582','NHFPL0582'],'SecID':['G12345','G67890','G11223','G34455'], 'Other':['A','A','B','B']}
df=pd.DataFrame(data)
print(df)
MainID SecID Other
0 NHFPL0580 G12345 A
1 NHFPL0580 G67890 A
2 NHFPL0582 G11223 B
3 NHFPL0582 G34455 B
Intended Structure
MainID SecID Other
NHFPL0580 G12345:G67890 A
NHFPL0582 G11223:G34455 B
Try:
df.groupby('MainID').apply(lambda x: ':'.join(x.SecID))
The above code returns a pd.Series, and you can convert it to a DataFrame as #Guy suggested:
You need .reset_index(name='SecID') if you want it back as a DataFrame.
The solution to the edited question:
df = df.groupby(['MainID', 'Other']).apply(lambda x: ':'.join(x.SecID)).reset_index(name='SecID')
You can then change the column order
cols = df.columns.tolist()
df = df[[cols[i] for i in [0, 2, 1]]]
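For reference, the whole thing can also be written as one chained expression using agg, with the column order fixed explicitly at the end (a sketch, assuming the target order is MainID, SecID, Other):
df2 = (df.groupby(['MainID', 'Other'], as_index=False)
         .agg({'SecID': ':'.join})
         [['MainID', 'SecID', 'Other']])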
Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1:
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then drop any columns not in expected_cols into another df, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where the columns could have either more values than expected, or fewer. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then how to handle the issue where there are more columns than expected.
You can use set difference using -:
Assuming df1 having cols:
In [542]: df1_cols = df1.columns # ['type_of_fruit', 'name_of_fruit', 'price']
In [539]: expected_cols = ['name_of_fruit','price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(columns=unwanted_cols, inplace=True)
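Note that a set difference does not preserve the original column order; if that matters, a list-comprehension sketch does the same thing while keeping the order:
unwanted_cols = [c for c in df1.columns if c not in expected_cols]
df2 = df1[unwanted_cols]
df1 = df1.drop(columns=unwanted_cols)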
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data = [[1,2,3]],
columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
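Grouping along the columns axis is deprecated in recent pandas releases, so if that is a concern, a boolean column mask (a sketch, not from the original answer) gives the same split:
mask = df.columns.isin(expected_cols)
d = {True: df.loc[:, mask], False: df.loc[:, ~mask]}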
What is the most efficient way to check which rows differ between two pandas dataframes?
Imagine we have the following pandas dataframes, df1 and df2.
df1 = pd.DataFrame([['a','b'],['c','d'],['e','f']], columns=['First', 'Last'])
df2 = pd.DataFrame([['a','b'],['e','f'],['g','h']], columns=['First', 'Last'])
In this case, row index 0 of df1 would be ['a','b']; row index 1 of df1 would be ['c','d'], etc.
I want to know the most efficient way to find in which rows these dataframes differ.
In particular, although ['e','f'] appears in both dataframes, in df1 it is at index 2 and in df2 it is at index 1, and I would want my outcome to show this.
Something like diff(df1,df2) = [1,2]
I know I could loop through all the rows and check if df1.loc[i,:] == df2.loc[i,:] for i in range(len(df1)) but is there a more efficient way?
You may be looking for this:
df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
From https://stackoverflow.com/a/57812527/15179457.
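If the two frames share the same shape, index and columns (and contain no NaNs, since NaN never compares equal), a vectorized element-wise comparison gives the differing row labels directly; a minimal sketch:
diff_rows = df1.index[(df1 != df2).any(axis=1)].tolist()
# with the sample data above this gives [1, 2]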
I have a MultiIndex DataFrame and I'm trying to select data in it based on certain criteria, so far so good. The problem is that once I have selected my data using .loc and pd.IndexSlice, the resulting DataFrame, which should logically have fewer rows and fewer elements in the first level of the MultiIndex, keeps exactly the same MultiIndex, with some keys in it referring to empty DataFrames.
I've tried creating a completely new DataFrame with a new index, but the structure of my data set is complicated and there is not always the same number of elements in a given level, so it is not easy to create a DataFrame with the right shape in which I can put the data.
import numpy as np
import pandas as pd
np.random.seed(3) # so my example is reproducible
idx = pd.IndexSlice
iterables = [['A','B','C'], [0,1,2], ['some','rdm','data']]
my_index = pd.MultiIndex.from_product(iterables, names=['first','second','third'])
my_columns = ['col1','col2','col3']
df1 = pd.DataFrame(data=np.random.randint(10, size=(len(my_index), len(my_columns))),
                   index=my_index,
                   columns=my_columns)
# Ok, so let's say I want to keep only the elements in the first level of my index (['A','B','C'])
# for which the total sum in column 3 is less than 35, for some reason
boolean_mask = (df1.groupby(level = "first").col3.sum() < 35).tolist()
first_level_to_keep = df1.index.levels[0][boolean_mask].tolist()
# let's select the wanted data and put it in df2
df2 = df1.loc[idx[first_level_to_keep,:,:],:]
So far, everything is as expected
The problem is when I want to access the df2 index. I expected the following:
df2.index.levels[0].tolist() == ['B','C']
to be true. But this is what actually evaluates to True:
df2.index.levels[0].tolist() == ['A','B','C']
So my question is the following: is there a way to select data and get back a DataFrame whose MultiIndex reflects what it actually contains? Because I find it weird to be able to select non-existing data in my df2.
I tried to put some images of the dataframes in question but I couldn't because I don't have enough «reputation»... sorry about that.
Thank you for your time!
Even if you delete the rows corresponding to a particular value in an index level, that value still exists. You can reset the index and then set those columns back as an index in order to generate a MultiIndex with new level values.
df2 = df2.reset_index().set_index(['first','second','third'])
print(df2.index.levels[0].tolist() == ['B','C'])
True
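Alternatively, pandas has a dedicated method for this, MultiIndex.remove_unused_levels(), which avoids the reset/set round trip:
df2.index = df2.index.remove_unused_levels()
print(df2.index.levels[0].tolist() == ['B','C'])
True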
I have two data frames, df1 and df2. Both have the same number of rows but different columns.
I want to concat all columns of df1 and the 2nd and 3rd columns of df2.
df1 has 119 columns and df2 has 3, of which I want the 2nd & 3rd.
Code I am using is:
data_train_test = pd.concat([df1, df2.iloc[:, [2, 3]]], axis=1, ignore_index=False)
The error I am getting is:
ValueError: Shape of passed values is (121, 39880), indices imply (121, 28898)
My Analysis:
39880 - 28898 = 10982
df1 is a TF-IDF data frame made by concatenating two other data frames with 17916 + 10982 = 28898 rows.
Here is how I made df2:
frames = [data, prediction_data]
df2 = pd.concat(frames)
I am not able to find the exact reason for this problem. Can someone please help?
I think I solved it by resetting the index while creating df2.
frames = [data, prediction_data]
df2 = pd.concat(frames).reset_index()
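One caveat worth noting (not from the original answer): a plain reset_index() keeps the old index as an extra column; if you don't want that column in df2, drop it while resetting:
df2 = pd.concat(frames).reset_index(drop=True)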
I am not sure I understood your question correctly, but I think what you want to do is:
data_train_test = pd.concat([df1, df2[df2.columns[1:3]]], axis=1)
.iloc[] with a single indexer selects rows (the ith row of your dataframe); to take columns by position you would write df2.iloc[:, [1, 2]], so you don't really need it here, selecting by column labels works just as well.
import pandas as pd
df1 = pd.DataFrame(data={'a':[0]})
df2 = pd.DataFrame(data={'b1':[1], 'b2':[2], 'b3':[3]})
data_train_test = pd.concat([df1,df2[df2.columns[1:3]]], axis=1)
# or
data_train_test = pd.concat([df1,df2.loc[:,df2.columns[1:3]]], axis=1)
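For completeness, a sketch of the positional approach as well: positions are 0-based, so the 2nd and 3rd columns are .iloc[:, [1, 2]], and since df2 was built by concatenating two frames, resetting both indexes before the column-wise concat avoids the shape/indices mismatch:
data_train_test = pd.concat(
    [df1.reset_index(drop=True), df2.iloc[:, [1, 2]].reset_index(drop=True)],
    axis=1,
)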