Selecting columns - dataframes, pandas - Python

How do you select a column in a pandas DataFrame when the column name depends on a value we have located in another DataFrame? For example, suppose [1,2,3..] are column names of DataFrame 1 and [1,2,3..] are values of different cells in DataFrame 2. How do you select a column in DataFrame 1 by matching the column name with a cell value in DataFrame 2?

import pandas as pd

df1 = pd.DataFrame([list('abc')], index=[0], columns=[1, 2, 3])
df2 = pd.DataFrame(dict(A=[2, 3, 1]))

df1
   1  2  3
0  a  b  c

df2
   A
0  2
1  3
2  1

df1[df2.A]
   2  3  1
0  b  c  a
Response to a comment:
df1.loc[0, df2.loc[0, 'A']]
'b'
df1.at[0, df2.at[0, 'A']]
'b'
df1.get_value(0, df2.get_value(0, 'A'))  # get_value was removed in pandas 1.0; prefer .at
'b'
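If you need one looked-up value per row rather than a single scalar, here is a minimal sketch of my own (the df1 and picks data below are hypothetical, not from the question) that pairs row positions with column positions:

import numpy as np
import pandas as pd

# hypothetical data: one column to pick per row
df1 = pd.DataFrame([[10, 20, 30], [40, 50, 60]], columns=[1, 2, 3])
picks = pd.Series([2, 3])  # for row 0 take column 2, for row 1 take column 3

rows = np.arange(len(df1))
cols = df1.columns.get_indexer(picks)  # column labels -> positions
values = pd.Series(df1.to_numpy()[rows, cols], index=df1.index)
# 0    20
# 1    60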

Related

How to use the pandas 'isin' function to give actual values of the df row instead of a boolean expression?

I have two dataframes and I'm comparing their columns labeled 'B'. If the value of column B in df2 matches the value of column B in df1, I want to extract the value of column C from df2 and add it to a new column in df1.
I've tried the following. I know this checks whether there's a match of column B in both dataframes, but it returns boolean True/False values in the 'New' column. Is there a way to extract the value under column 'C' when there's a match and add it to the 'New' column in df1 instead of the boolean values?
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df1['New'] = df2['B'].isin(df1['B'])
import pandas as pd

df1 = pd.DataFrame({'B': ['a', 'b', 'f', 'd', 'h'], 'C': [1, 5, 777, 10, 3]})
df2 = pd.DataFrame({'B': ['k', 'l', 'f', 'j', 'h'], 'C': [0, 9, 555, 15, 1]})

# indexes of the rows of df2 whose 'B' value also appears in df1['B']
ind = df2[df2['B'].isin(df1['B'])].index
df1.loc[ind, 'new'] = df2.loc[ind, 'C']
df2
   B    C
0  k    0
1  l    9
2  f  555
3  j   15
4  h    1

Output df1
   B    C    new
0  a    1    NaN
1  b    5    NaN
2  f  777  555.0
3  d   10    NaN
4  h    3    1.0
Here ind holds the indexes of the rows of df2 where there are matches. Then loc is used: on the left the row indices, on the right the column names. Note that this writes the matched 'C' values at the same positional indexes in df1, which works here because the matching rows ('f' and 'h') sit at the same positions in both frames.
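When the matching rows do not happen to share positions, a map-based sketch of my own (assuming the values in df2['B'] are unique) keys on the 'B' values themselves:

import pandas as pd

df1 = pd.DataFrame({'B': ['a', 'b', 'f', 'd', 'h'], 'C': [1, 5, 777, 10, 3]})
df2 = pd.DataFrame({'B': ['k', 'l', 'f', 'j', 'h'], 'C': [0, 9, 555, 15, 1]})

# map each df1['B'] value to the df2['C'] value sharing the same 'B';
# values with no match become NaN automatically
df1['new'] = df1['B'].map(df2.set_index('B')['C'])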

How can I find and store how many columns it takes to reach a value greater than the first value in each row?

import pandas as pd
df = {'a': [3,4,5], 'b': [1,2,3], 'c': [4,3,3], 'd': [1,5,4], 'e': [9,4,6]}
df1 = pd.DataFrame(df, columns = ['a', 'b', 'c', 'd', 'e'])
dg = {'b': [2,3,4]}
df2 = pd.DataFrame(dg, columns = ['b'])
Original dataframe is df1. For each row, I want to find the first time a value is bigger than the value in the first column and store it in a new dataframe.
df1
   a  b  c  d  e
0  3  1  4  1  9
1  4  2  3  5  4
2  5  3  3  4  6
df2 is the resulting dataframe. For example, in the first row of df1 the first value is 3, and the first value bigger than 3 is 4 (column c); hence in the first row of df2 we store 2 (column c is two columns after column a). In the second row of df1 the first value is 4 and the first value bigger than 4 is 5 (column d); hence we store 3. In the third row the first value is 5 and the first value bigger than 5 is 6 (column e); hence we store 4.
df2
   b
0  2
1  3
2  4
I would appreciate the help.
In your case we can subtract the first column with sub; where the difference is greater than 0, take the first such column with idxmax and convert the labels to positions with get_indexer:
s = df1.columns.get_indexer(df1.drop(columns='a').sub(df1['a'], axis=0).gt(0).idxmax(axis=1))
# array([2, 3, 4])
df2 = pd.DataFrame({'b': s})
You can get the column names by comparing the entire DataFrame against the first column (aligned on the index), replacing False values with NaN, and applying first_valid_index row-wise, e.g.:
names = (
    df1.gt(df1.iloc[:, 0], axis=0)
       .replace(False, pd.NA)  # or use np.nan
       .apply(pd.Series.first_valid_index, axis=1)
)
That'll give you:
0 c
1 d
2 e
Then you can convert those to offsets:
offsets = df1.columns.get_indexer(names)
# array([2, 3, 4])
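A NumPy sketch of my own (not from either answer) avoids the row-wise apply; it assumes every row contains at least one value larger than its first entry, since argmax on an all-False row would return 0:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [3, 4, 5], 'b': [1, 2, 3], 'c': [4, 3, 3],
                    'd': [1, 5, 4], 'e': [9, 4, 6]})

# compare every column after 'a' against the first column
mask = df1.to_numpy()[:, 1:] > df1.to_numpy()[:, [0]]
offsets = mask.argmax(axis=1) + 1  # +1 restores the offset lost by dropping 'a'
# array([2, 3, 4])
df2 = pd.DataFrame({'b': offsets})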

How to split a dataframe based on a repeated column-name row

I have one Excel file whose dataframe has 20 rows. After a few rows, the column-name row appears again; I want to divide the dataframe at each column-name row.
Here is an example:
x
0
1
2
3
4
x
23
34
5
6
expected output is:
df1
x
0
1
2
3
4
df2
x
23
34
5
6
Considering your column name is col, you can first group the dataframe on a cumulative sum over the rows whose value equals 'x' (df['col'].eq('x').cumsum()). Then, for each group, create a dataframe taking its values from the second row of the group onward and its columns from the first row of the group, using df.iloc[], and save the results in a dictionary:
d = {f'df{i}': pd.DataFrame(g.iloc[1:].values, columns=g.iloc[0].values)
     for i, g in df.groupby(df['col'].eq('x').cumsum())}
print(d['df1'])
x
0 0
1 1
2 2
3 3
4 4
print(d['df2'])
x
0 23
1 34
2 5
3 6
Use df.index[df['x'] == 'x'] to find the row index where the column name appears again, then split the dataframe in two at that index:
df = pd.DataFrame(columns=['x'], data=[[0], [1], [2], [3], [4], ['x'], [23], [34], [5], [6]])
split_at = df.index[df['x'] == 'x'].tolist()[0]
df1 = df.iloc[:split_at]
df2 = df.iloc[split_at + 1:]
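A loop-based sketch of my own that generalizes this to any number of repeated header rows:

import pandas as pd

df = pd.DataFrame(columns=['x'], data=[[0], [1], [2], [3], [4], ['x'], [23], [34], [5], [6]])

# positions of every repeated header row, plus the end of the frame
breaks = [*df.index[df['x'] == 'x'], len(df)]
parts, start = [], 0
for b in breaks:
    if b > start:       # skip empty chunks
        parts.append(df.iloc[start:b])
    start = b + 1       # skip the header row itself
# parts[0] holds rows 0-4, parts[1] holds rows 23, 34, 5, 6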
You didn't mention whether this is just a sample of your dataset. If it is not, you can simply build the two frames directly:
import pandas as pd
df1 = pd.DataFrame({'df1': ['x', 0, 1, 2, 3, 4]})
df2 = pd.DataFrame({'df2': ['x', 23, 34, 5, 6]})
display(df1, df2)

Join an empty pandas DataFrame with a Multiindex DataFrame

I want to build a large pandas DataFrame in a loop. In the first iteration the DataFrame df1 is still empty. When I join df1 with df2, which has a MultiIndex on its columns, the MultiIndex gets flattened somehow.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=range(6))
df2 = pd.DataFrame(np.random.randn(6, 3),
                   columns=pd.MultiIndex.from_arrays((['A', 'A', 'A'],
                                                      ['a', 'b', 'c'])))
df1[df2.columns] = df2
df1
(A, a) (A, b) (A, c)
0 -0.673923 1.392369 1.848935
1 1.427368 0.042691 0.130962
2 -0.258589 0.216157 0.196101
3 -1.022283 1.312113 -0.770108
4 0.511127 -0.633477 -0.229149
5 -1.364237 0.713107 2.124274
I was hoping for a DataFrame with the MultiIndex intact like this:
A
a b c
0 -0.673923 1.392369 1.848935
1 1.427368 0.042691 0.130962
2 -0.258589 0.216157 0.196101
3 -1.022283 1.312113 -0.770108
4 0.511127 -0.633477 -0.229149
5 -1.364237 0.713107 2.124274
What am I doing wrong?
A MultiIndex will not always be recognized when assigning into a frame whose columns are a plain (flat) index, so create df1 with an empty MultiIndex for its columns up front:
df1 = pd.DataFrame(index=range(6),columns=pd.MultiIndex.from_arrays([[],[]]))
df1[df2.columns] = df2
df1
Out[697]:
A
a b c
0 -0.755397 0.574920 0.901570
1 -0.165472 -1.865715 1.583416
2 -0.403287 1.358329 0.706650
3 0.028019 1.432543 -0.586325
4 -0.414851 0.825253 0.745090
5 0.389917 0.940657 0.125837
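Since the original goal is to build a large DataFrame in a loop, another sketch of my own sidesteps the empty-frame assignment entirely: collect the pieces in a list and concatenate once at the end, which preserves MultiIndex columns (the block names below are hypothetical):

import numpy as np
import pandas as pd

pieces = []
for i in range(3):
    cols = pd.MultiIndex.from_arrays(([f'block{i}'] * 2, ['a', 'b']))
    pieces.append(pd.DataFrame(np.random.randn(6, 2), columns=cols))

result = pd.concat(pieces, axis=1)  # MultiIndex columns survive intact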

Replace single value in a pandas dataframe, when index is not known and values in column are unique

There is a question on SO titled "Set value for particular cell in pandas DataFrame", but it assumes the row index is known. How do I change a value if the row values in the relevant column are unique? Is using set_index, as below, the easiest option? Returning a list of index values with index.tolist() does not seem like a very elegant solution.
Here is my code:
import pandas as pd
## Fill the data frame.
df = pd.DataFrame([[1, 2], [3, 4]], columns=('a', 'b'))
print(df, end='\n\n')
## Just use the index, if we know it.
index = 0
df.set_value(index, 'a', 5)
print('set_value', df, sep='\n', end='\n\n')
## Define a new index column.
df.set_index('a', inplace=True)
print('set_index', df, sep='\n', end='\n\n')
## Use the index values of column A, which are known to us.
index = 3
df.set_value(index, 'b', 6)
print('set_value', df, sep='\n', end='\n\n')
## Reset the index.
df.reset_index(inplace = True)
print('reset_index', df, sep='\n')
Here is my output:
a b
0 1 2
1 3 4
set_value
a b
0 5 2
1 3 4
set_index
b
a
5 2
3 4
set_value
b
a
5 2
3 6
reset_index
a b
0 5 2
1 3 6
Regardless of the performance, you should be able to do this using loc with boolean indexing:
df = pd.DataFrame([[5, 2], [3, 4]], columns=('a', 'b'))
# modify value in column b where a is 3
df.loc[df.a == 3, 'b'] = 6
df
# a b
#0 5 2
#1 3 6
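Note that set_value was deprecated in pandas 0.21 and removed in 1.0. If you want a scalar write once the unique row is found, a small sketch of my own combines the boolean lookup with .at:

# find the single row where a == 3 (uniqueness assumed), then write a scalar
row = df.index[df['a'] == 3][0]
df.at[row, 'b'] = 6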
