Deleting multiple DataFrame columns in Pandas

Say I want to delete a set of adjacent columns in a DataFrame and my code looks something like this currently:
del df['1'], df['2'], df['3'], df['4'], df['5'], df['6']
This works, but I was wondering if there was a more efficient, compact, or aesthetically pleasing way to do it, such as:
del df['1','6']

I think you need drop; to select the columns to remove you can use range or numpy.arange:
import pandas as pd
import numpy as np

df = pd.DataFrame({'1': [1, 2, 3],
                   '2': [4, 5, 6],
                   '3': [7, 8, 9],
                   '4': [1, 3, 5],
                   '5': [7, 8, 9],
                   '6': [1, 3, 5],
                   '7': [5, 3, 6],
                   '8': [5, 3, 6],
                   '9': [7, 4, 3]})
print (df)
   1  2  3  4  5  6  7  8  9
0  1  4  7  1  7  1  5  5  7
1  2  5  8  3  8  3  3  3  4
2  3  6  9  5  9  5  6  6  3
print (np.arange(1,7))
[1 2 3 4 5 6]
print (range(1,7))
range(1, 7)
#convert string column names to int
df.columns = df.columns.astype(int)
df = df.drop(np.arange(1,7), axis=1)
#another solution with range
#df = df.drop(range(1,7), axis=1)
print (df)
   7  8  9
0  5  5  7
1  3  3  4
2  6  6  3

You can do this without modifying the columns, by passing a slice object to drop:
In [29]:
df.drop(df.columns[slice(df.columns.tolist().index('1'), df.columns.tolist().index('6') + 1)], axis=1)
Out[29]:
   7  8  9
0  5  5  7
1  3  3  4
2  6  6  3
This looks up the ordinal positions of the two endpoint labels, builds a slice object from them, and uses it to select the block of column labels to pass to drop.
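As a side note, a more compact variant of the same idea (a sketch, assuming the labels '1' and '6' both exist and the columns in between are the ones to remove) is to slice the column Index by label with loc and pass the result to drop:
import pandas as pd

df = pd.DataFrame({str(i): [0, 0, 0] for i in range(1, 10)})
# label slicing on columns is inclusive on both ends
df = df.drop(df.loc[:, '1':'6'].columns, axis=1)
print(df.columns.tolist())
['7', '8', '9']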

Related

Using the original index after slicing a df

Suppose I have a dataframe dataset as the following:
dataset = pd.DataFrame({'id': list('123456'),
                        'B': [4, 5, 4, 5, 5, 4],
                        'C': [7, 8, 9, 4, 2, 3]})
print(dataset)
id B C
0 1 4 7
1 2 5 8
2 3 4 9
3 4 5 4
4 5 5 2
5 6 4 3
Now I slice it using iloc and get
dataset = dataset.iloc[2:5]
id B C
2 3 4 9
3 4 5 4
4 5 5 2
Now I set id as the new index, because my project requires it, so I do
dataset.set_index("id", inplace=True)
print(dataset)
B C
id
3 4 9
4 5 4
5 5 2
I would like to select from the new dataset using iloc with the original index. So if I do dataset.iloc[3] I would like to see the first row. However, doing that throws an out-of-bounds error, while dataset.iloc[0] gives me the first row.
Is there any way I can preserve the original index? Thanks.
iloc selects by position, so you need to subtract the lower bound of the slice, n, and 1 more because the ids start at 1 while positions start at 0:
n = 2 # n is 2 since you slice 2:5
dataset.iloc[3-n-1]
Out[648]:
B 4
C 9
Name: 3, dtype: int64
In this case it is recommended to use loc instead of iloc:
dataset.index = dataset.index.astype('int')
dataset.loc[3]
>>>
B 4
C 9
Name: 3, dtype: int64
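A minimal end-to-end sketch combining the steps above (assuming the ids are unique, so each label maps to exactly one row):
import pandas as pd

dataset = pd.DataFrame({'id': list('123456'),
                        'B': [4, 5, 4, 5, 5, 4],
                        'C': [7, 8, 9, 4, 2, 3]})
dataset = dataset.iloc[2:5].set_index('id')
dataset.index = dataset.index.astype(int)  # the ids were strings
print(dataset.loc[3])  # label-based lookup, independent of the row's position
B    4
C    9
Name: 3, dtype: int64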

How do I subset columns in a Pandas dataframe based on criteria using a loop?

I have a Pandas dataframe called "bag" with columns called beans1, beans2, and beans3:
bag = pd.DataFrame({'beans1': [3,1,2,5,6,7], 'beans2': [2,2,1,1,5,6], 'beans3': [1,1,1,3,3,2]})
bag
Out[50]:
beans1 beans2 beans3
0 3 2 1
1 1 2 1
2 2 1 1
3 5 1 3
4 6 5 3
5 7 6 2
I want to use a loop to subset each column with observations greater than 1, so that I get:
beans1
0 3
2 2
3 5
4 6
5 7
beans2
0 2
1 2
4 5
5 6
beans3
3 3
4 3
5 2
The way to do it manually is:
beans1 = bag.loc[bag['beans1'] > 1, ['beans1']]
beans2 = bag.loc[bag['beans2'] > 1, ['beans2']]
beans3 = bag.loc[bag['beans3'] > 1, ['beans3']]
But I need to employ a loop, with something like:
for i in range(1,4):
    beans+str(i).loc[beans.loc[bag['beans'+i]>1,['beans'+str(i)]]
But it didn't work. I need a Python version of R's eval(parse(text="")).
Any help appreciated. Thanks much!
It is possible, but not recommended, with globals:
for i in range(1, 4):
    globals()['beans' + str(i)] = bag.loc[bag['beans' + str(i)] > 1, ['beans' + str(i)]]

# the same, looping over the column names directly:
for c in bag.columns:
    globals()[c] = bag.loc[bag[c] > 1, [c]]
print (beans1)
beans1
0 3
2 2
3 5
4 6
5 7
Better is to create a dictionary:
d = {c: bag.loc[bag[c]>1, [c]] for c in bag}
print (d['beans1'])
beans1
0 3
2 2
3 5
4 6
5 7
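You can then pull each subset out of the dictionary by name, as above, or loop over all of them:
for name, subset in d.items():
    print(name)
    print(subset)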

How to concat two python DataFrames where if the row already exists it doesn't add it, if not, append it

I'm pretty new to python.
I am trying to concat two dataframes (df1, df2) so that if a row already exists in df1 it is not added; if not, it is appended to df1.
I don't want to use .concat().drop_duplicates() because I don't want duplicates within the same DataFrame to be removed.
BackStory:
I have multiple CSV files that are exported from a piece of software in different locations. Once in a while I want to merge these into one file. The problem is that each exported file contains the same data as before along with the new records made in that period of time, so I need to check whether each record is already there, as I will be executing the same code each time I export the data.
For the sake of example:
import pandas as pd
main_df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[2,5,4,5],[9,8,7,6],[8,5,6,7]])
df1 = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[1,5,4,8],[7,3,5,7],[4,3,8,5],[4,3,8,5]])
main_df
0 1 2 3
0 1 2 3 4 --duplicates I want to include--
1 1 2 3 4 --duplicates I want to include--
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
df1
0 1 2 3
0 1 2 3 4 --duplicates I want to exclude--
1 1 2 3 4 --duplicates I want to exclude--
2 4 2 5 1 --duplicates I want to exclude--
3 2 4 1 5 --duplicates I want to exclude--
4 1 5 4 8
5 7 3 5 7
6 4 3 8 5 --duplicates I want to include--
7 4 3 8 5 --duplicates I want to include--
I need the end result to be
main_df (after code execution)
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
7 1 5 4 8
8 7 3 5 7
9 4 3 8 5
10 4 3 8 5
I hope I have explained my issue in a clear way. Thank you
Check for every row in df1 whether it exists in main_df using pandas apply, and turn that into a mask by negating it with the ~ operator. I like using functools partial to make explicit that we are comparing to main_df.
import pandas as pd
from functools import partial
main_df = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[2,5,4,5],
[9,8,7,6],
[8,5,6,7]
])
df1 = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[1,5,4,8],
[7,3,5,7],
[4,3,8,5],
[4,3,8,5]
])
def has_row(df, row):
    return (df == row).all(axis=1).any()
main_df_has_row = partial(has_row, main_df)
duplicate_rows = df1.apply(main_df_has_row, axis = 1)
df1_add = df1.loc[~duplicate_rows]
main_df = pd.concat([main_df, df1_add], ignore_index=True)  # ignore_index renumbers the result 0..10 as desired
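An alternative sketch (not from the original answer) gets the same result with merge and indicator=True, reusing the main_df and df1 defined above. Deduplicating main_df on the merge side only means matches cannot multiply rows, while duplicates inside df1 are kept because each df1 row appears once on the left:
merged = df1.merge(main_df.drop_duplicates(), how='left', indicator=True)
new_rows = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
main_df = pd.concat([main_df, new_rows], ignore_index=True)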

Create a new column and assign a value for each group using groupby

I want to create a new column called 'fold' and assign values to it depending on the group of quote_id. Let's say if 3 rows have the same quote_id they should all be assigned 1, and if the next 4 rows share another quote_id they should be assigned 2.
In short, it should assign a number to each particular group of quote_id.
I have been trying for a long time but I am not getting the expected results.
i = 1
def func(x):
    x['fold'] = i
    return x
in_df.groupby('quote_id').apply(func)
i = i + 1
My output should look like below.
quote_id fold
1300079-DE 1
1300079-DE 1
1300079-DE 1
1300185-DE 2
1300560-DE 3
1301011-DE 4
1301011-DE 4
1301011-DE 4
1301644-DE 5
1301907-DE 6
1301907-DE 6
1301907-DE 6
Call rank with method='dense':
In [10]:
df['fold'] = df['quote_id'].rank(method='dense')
df
Out[10]:
quote_id fold
0 1300079-DE 1
1 1300079-DE 1
2 1300079-DE 1
3 1300185-DE 2
4 1300560-DE 3
5 1301011-DE 4
6 1301011-DE 4
7 1301011-DE 4
8 1301644-DE 5
9 1301907-DE 6
10 1301907-DE 6
11 1301907-DE 6
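One caveat: rank returns floats, so cast the result if you want integer fold numbers. Another option (a sketch, not from the original answer) is pd.factorize, which numbers groups in order of first appearance and so matches dense rank when the data is already sorted by quote_id:
df['fold'] = df['quote_id'].rank(method='dense').astype(int)
# or, for data already sorted by quote_id:
df['fold'] = pd.factorize(df['quote_id'])[0] + 1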

Python: Applying a function to DataFrame taking input from the new calculated column

I'm facing a problem with applying a function to a DataFrame (to model a solar collector based on annual hourly weather data).
Suppose I have the following (simplified) DataFrame:
df2:
A B C
0 11 13 5
1 6 7 4
2 8 3 6
3 4 8 7
4 0 1 7
Now I have defined a function that takes each row as input to create a new column called D, but I also want the function to take the last calculated value of D as input (except of course for the first row, since no value of D has been calculated yet).
def Funct(x):
    D = x['A'] + x['B'] + x['C'] + (x-1)['D']
I know that the function above is not working, but it gives an idea of what I want.
So to summarise:
Create a function that creates a new column in the dataframe and takes the value of the new column one row above it as input
Can somebody help me?
Thanks in advance.
It sounds like you are calculating a cumulative sum. In that case, use cumsum:
In [45]: df['D'] = (df['A']+df['B']+df['C']).cumsum()
In [46]: df
Out[46]:
A B C D
0 11 13 5 29
1 6 7 4 46
2 8 3 6 63
3 4 8 7 82
4 0 1 7 90
[5 rows x 4 columns]
Are you looking for this?
You can use shift to align the previous row with current row and then you can do your operation.
In [7]: df
Out[7]:
a b
1 1 1
2 2 2
3 3 3
4 4 4
[4 rows x 2 columns]
In [8]: df['c'] = df['b'].shift(1)  # first row will be NaN
In [9]: df
Out[9]:
a b c
1 1 1 NaN
2 2 2 1
3 3 3 2
4 4 4 3
[4 rows x 3 columns]
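If the real recurrence is more complicated than a cumulative sum, so that neither cumsum nor shift covers it, a plain Python loop over the rows is a reasonable fallback. A sketch, assuming D for the first row is just A + B + C:
import pandas as pd

df = pd.DataFrame({'A': [11, 6, 8, 4, 0],
                   'B': [13, 7, 3, 8, 1],
                   'C': [5, 4, 6, 7, 1]})
values = []
prev_d = 0  # assumed starting value before the first row
for row in df.itertuples(index=False):
    prev_d = row.A + row.B + row.C + prev_d  # D[i] = A[i] + B[i] + C[i] + D[i-1]
    values.append(prev_d)
df['D'] = values
print(df)  # matches the cumsum result for this particular recurrence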
