Consider toy dataframes df1 and df2, where df2 is a subset of df1 (excludes the first row).
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'colA':[3.0,9,45,7],'colB':['A','B','C','D']})
df2 = df1[1:]
Now let's find the argmax of colA for each frame:
np.argmax(df1.colA) ## result is "2", which is what I expected
np.argmax(df2.colA) ## result is still "2", which is not what I expected. I expected "1"
If my matrix of interest is df2, how do I get around this indexing issue? Is this quirk related to pandas, numpy, or just Python memory?
I think it's due to the index. You could use reset_index when you assign df2:
df1 = pd.DataFrame({'colA':[3.0,9,45,7],'colB':['A','B','C','D']})
df2 = df1[1:].reset_index(drop=True)
In [464]: np.argmax(df1.colA)
Out[464]: 2
In [465]: np.argmax(df2.colA)
Out[465]: 1
I think it's better to use the argmax method instead of np.argmax:
In [467]: df2.colA.argmax()
Out[467]: 1
You need to reset the index of df2:
df2.reset_index(inplace=True, drop=True)
np.argmax(df2.colA)
>> 1
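A note on why this happens: np.argmax defers to the Series' own argmax method, and in older pandas versions that method behaved like idxmax, returning the index label (still 2 on the slice) rather than the position. So the quirk comes from pandas keeping df1's index on the slice, not from numpy or Python memory. If the label is actually what you want, idxmax gives it directly; for a position relative to df2, work on the underlying array. A minimal sketch (the positional call assumes a reasonably recent pandas):
df2 = df1[1:]                     # keeps the original index labels 1..3
df2.colA.idxmax()                 # -> 2, the index label of the maximum
df2.colA.to_numpy().argmax()      # -> 1, the position within df2, index-free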
I have 2 pandas dataframes with a different number of columns. df1 contains 40 rows x 23320 columns while df2 contains 40 rows x 1 column. All columns of df1 have to be multiplied by df2. But my result only contains either NaN values or an unchanged df1 (depending on what I try).
I don't get an error. This is python 2.7 and I have to use it.
Here is a picture of the 2 dataframes.
I tried the following code:
hnklnTnk = df7.mul(lndf)
or
hnklnTnk = df7 * lndf
I suspect that something could be wrong with the dataframes, because if I try df7.round(2) the output stays the same.
I cannot find documentation for the latest pandas version that supports Python 2.7, but this should work if your version has the transform method: df2.transform(lambda x: x * df1.values).
Full example for two DataFrames, one with many columns and the other with a single column; both have the same number of rows:
df1 = pd.DataFrame([10,20,30,40,50,60,70,80,90,100])
df2 = pd.DataFrame({
    'col1': [.1, .20, .30, .40, .50, .60, .70, .80, .90, 1],
    'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'col3': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})
df2 = df2.transform(lambda x: x * df1.values)
documentation on pd.DataFrame.transform: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html
df1.multiply(df2.values, axis=1)
or try:
df = pd.DataFrame(df1.values * df2.values, columns=df2.columns)
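Another option, assuming (as in the question's description) that df1 is the wide frame (40 x 23320), df2 is the single-column frame (40 x 1), and both share the same row index, is to squeeze df2 down to a Series and multiply along the rows; a minimal sketch:
result = df1.mul(df2.squeeze(), axis=0)   # df2.squeeze() is the lone column as a Series
# equivalently: df1.mul(df2.iloc[:, 0], axis=0)
This aligns on the row index and broadcasts the single column across all columns of df1.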
I have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array(([1, 2, 3], [1, 2, 3], [1, 2, 3], [4, 5, 6])),
                  columns=['one', 'two', 'three'])
# Below I am subsetting by rows and columns, but I want to have more than just one column.
# In this case, columns 'one' and 'two'.
small = df[df.one == 1].one
What is the alternative here?
You can use loc:
df = pd.DataFrame(np.array(([1, 2, 3], [1, 2, 3], [1, 2, 3], [4, 5, 6])),
                  columns=['one', 'two', 'three'])
small = df.loc[df.one == 1, ["one", "two"]]
#    one  two
# 0    1    2
# 1    1    2
# 2    1    2
The first argument to loc selects the wanted rows; the second selects the wanted columns. As demonstrated here, it accepts both boolean masks and lists of labels.
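loc also accepts a label slice for the columns, which is convenient when the wanted columns are adjacent; unlike positional slices, label slices in loc include both endpoints. A quick sketch on the same frame, giving the same result as above:
small = df.loc[df.one == 1, "one":"two"]          # label slice, inclusive of "one" and "two"
small = df.iloc[(df.one == 1).to_numpy(), 0:2]    # positional equivalent with a boolean array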
import pandas as pd
df1 = pd.DataFrame([{'a':1,'b':2}])
df2 = pd.DataFrame([{'x':100}])
df2['y'] = [df1]
# df2.loc[:, 'y'].shape == (1,)
# type(df2.iloc[:, 1][0]) == pandas.core.frame.DataFrame
I want to make a DataFrame a value in a column of an existing row. However, pandas wraps this DataFrame in a Series object, so I cannot access it with dot notation such as df2.y.a to get the value 1. Is there a way to prevent this, or is there some constraint on the object types allowed as DataFrame elements that makes it impossible?
The desired output is a df like:
x y
0 100 a b
0 1 2
and type(df2.y) == pd.DataFrame
You can combine two DataFrame objects along the columns axis, which I think achieves what you're trying to do. Let me know if this is what you're looking for.
import pandas as pd
df1 = pd.DataFrame([{'a':1,'b':2}])
df2 = pd.DataFrame([{'x':100}])
combined_df = pd.concat([df1, df2], axis=1)
print(combined_df)
   a  b    x
0  1  2  100
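If you do want to keep the nested frame instead of flattening it, the inner object is still reachable, just not purely through dot notation, because the column holds df1 as a single cell value. A sketch, assuming the assignment from the question (df2['y'] = [df1]) went through:
df2['y'] = [df1]            # object-dtype column whose single cell is df1
inner = df2['y'].iloc[0]    # pull the nested DataFrame back out of the cell
inner.a[0]                  # -> 1, the value the question is after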
Assume we have df and df_drop:
df = pd.DataFrame({'A': [1,2,3], 'B': [1,1,1]})
df_drop = df[df.A==df.B]
I want to delete df_drop from df without using the explicit conditions that were used when creating df_drop. I.e., I'm not after the solution df[df.A != df.B]; I would basically like to take df minus df_drop somehow. Hope this is clear enough, otherwise happy to elaborate!
You can merge both dataframes setting indicator=True and drop those rows where the indicator column is 'both':
out = pd.merge(df,df_drop, how='outer', indicator=True)
out[out._merge.ne('both')].drop(columns='_merge')
   A  B
1  2  1
2  3  1
Or as jon clements points out, if checking by index is enough, you could simply use:
df.drop(df_drop.index)
In this case, drop_duplicates works because the test criterion is the equality of entire rows.
More generally, you can use loc to find the rows that meet or do not meet the specified criteria.
a = np.random.randint(1, 50, 100)
b = np.random.randint(1, 50, 100)
df = pd.DataFrame({'a': a, 'b': b})
criteria = df.a > 2 * df.b
df.loc[criteria, :]
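And the complement, the rows that do not meet the criteria, is just the negated mask:
df.loc[~criteria, :]   # rows where a <= 2 * b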
Like this maybe:
In [1468]: pd.concat([df, df_drop]).drop_duplicates(keep=False)
Out[1468]:
   A  B
1  2  1
2  3  1
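Another way to express "df minus df_drop" without repeating the condition, assuming df_drop keeps its original index labels from df (which it does when it is built by boolean filtering), is to mask on the index; a small sketch:
df[~df.index.isin(df_drop.index)]
#    A  B
# 1  2  1
# 2  3  1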
I have a DataFrame (df1) with dimensions 2000 rows x 500 columns (excluding the index), and I want to divide each of its rows by another DataFrame (df2) with dimensions 1 row x 500 columns. Both have the same column headers. I tried:
df1.divide(df2) and
df1.divide(df2, axis='index') and multiple other solutions, and I always get a DataFrame with NaN values in every cell. What argument am I missing in the divide function?
In df1.divide(df2, ...), you need to provide the row of df2 itself (e.g. df2.iloc[0]) rather than the whole one-row DataFrame:
import pandas as pd
data1 = {"a": [1., 3., 5., 2.],
         "b": [4., 8., 3., 7.],
         "c": [5., 45., 67., 34]}
data2 = {"a": [4.],
         "b": [2.],
         "c": [11.]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1.div(df2.iloc[0], axis='columns')
Or you can use df1 / df2.values[0, :].
You can divide by the series i.e. the first row of df2:
In [11]: df = pd.DataFrame([[1., 2.], [3., 4.]], columns=['A', 'B'])
In [12]: df2 = pd.DataFrame([[5., 10.]], columns=['A', 'B'])
In [13]: df.div(df2)
Out[13]:
     A    B
0  0.2  0.2
1  NaN  NaN
In [14]: df.div(df2.iloc[0])
Out[14]:
     A    B
0  0.2  0.2
1  0.6  0.4
A small clarification just in case: the reason you got NaN everywhere, while Andy's first example (df.div(df2)) works for the first line, is that div tries to match indexes (and columns). In Andy's example, index 0 is found in both dataframes, so the division is done; index 1 is not, so a line of NaN is added. This behavior is even more obvious if you run the following (only the 't' row is divided):
df_a = pd.DataFrame(np.random.rand(3,5), index= ['x', 'y', 't'])
df_b = pd.DataFrame(np.random.rand(2,5), index= ['z','t'])
df_a.div(df_b)
So in your case, the index of the only row of df2 was apparently not present in df1. "Luckily", the column headers are the same in both dataframes, so when you slice out the first row you get a Series whose index is made up of the column headers of df2. This is what eventually allows the division to take place properly.
For a case with index and column matching:
df_a = pd.DataFrame(np.random.rand(3,5), index= ['x', 'y', 't'], columns = range(5))
df_b = pd.DataFrame(np.random.rand(2,5), index= ['z','t'], columns = [1,2,3,4,5])
df_a.div(df_b)
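Another spelling of the same fix for the original question (df1 of 2000 x 500, df2 of 1 x 500 with identical column headers) is to squeeze the one-row frame into a Series first, so the division aligns on the column labels; a minimal sketch:
df1.div(df2.squeeze(), axis='columns')   # same as df1.div(df2.iloc[0], axis='columns')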
If you want to divide every value in a column by a specific number, you could try:
df['column_name'] = df['column_name'].div(10000)
For me, this code divided each row of 'column_name' by 10,000.
To divide a single row (across one or multiple columns), you can do the following:
df.loc['index_value'] = df.loc['index_value'].div(10000)