pandas compare and select the smallest number from another dataframe - python

I have two dataframes.
df1
Out[162]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
df2
Out[194]:
A B
0 a 3
1 b 4
2 c 5
I wish to create a third column in df2 that maps df2['A'] to the matching column of df1 and finds the smallest number in that column that is greater than the number in df2['B']. For example, for df2['C'].loc[0], it should go to df1['a'] and search for the smallest number that's greater than df2['B'].loc[0], which should be 4.
I had something like df2['C'] = df2['A'].map( df1[df1 > df2['B']].min() ). But this doesn't work, as it doesn't compare each column against the corresponding row of df2['B']. Thanks.

Use apply for row-wise methods:
In [54]:
# create our data
import pandas as pd
df1 = pd.DataFrame({'a':list(range(12)), 'b':list(range(12)), 'c':list(range(12))})
df1
Out[54]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
[12 rows x 3 columns]
In [68]:
# create our 2nd dataframe, note I have deliberately used alternate values for column 'B'
df2 = pd.DataFrame({'A':list('abc'), 'B':[3,5,7]})
df2
Out[68]:
A B
0 a 3
1 b 5
2 c 7
[3 rows x 2 columns]
In [69]:
# apply row-wise function, must use axis=1 for row-wise
# .ix was removed in pandas 1.0, so select with .loc instead
df2['C'] = df2.apply(lambda row: df1.loc[df1[row['A']] > row['B'], row['A']].min(), axis=1)
df2
Out[69]:
A B C
0 a 3 4
1 b 5 6
2 c 7 8
[3 rows x 3 columns]
There is some example usage in the pandas docs
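For larger frames, the row-wise apply can be replaced by a broadcast comparison. This is only a minimal sketch, not part of the answer above: it assumes every label in df2['A'] is a column of df1 and that those columns are numeric. The names candidates, thresholds and masked are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': range(12), 'b': range(12), 'c': range(12)})
df2 = pd.DataFrame({'A': list('abc'), 'B': [3, 5, 7]})

# one column of df1 per row of df2, as a (len(df1), len(df2)) float array
candidates = df1[df2['A'].tolist()].to_numpy(dtype=float)
# thresholds broadcast across the columns of candidates
thresholds = df2['B'].to_numpy()
# keep only values strictly greater than the threshold, NaN elsewhere
masked = np.where(candidates > thresholds, candidates, np.nan)
# smallest remaining value per column (NaN, with a warning, if none qualifies)
df2['C'] = np.nanmin(masked, axis=0)
print(df2)
```

The result column is float rather than int, because NaN is used to mark values that do not qualify.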

Related

How to select special rows from a dataframe and escape others iteratively?

Let's assume there's a pandas DataFrame df_A, defined as follows:
df_A = pd.read_csv('A.csv') #read data
How can I build a new DataFrame df_B from df_A such that df_B repeatedly selects m rows and skips the next n rows of df_A?
Concrete example: df_B selects 5 rows of df_A and skips 3, selects the next 5 rows and skips 3 again, and so on.
We can try:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(zip(range(10), range(1, 11))), index=range(10))
s = pd.Series([True, False]).repeat(pd.Series({True : 5, False : 3}))
df[np.tile(s, int(np.ceil(len(df) / len(s))))[:len(df)]]
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9 10
1 1 2 3 4 5 6 7 8 9 10
2 1 2 3 4 5 6 7 8 9 10
3 1 2 3 4 5 6 7 8 9 10
4 1 2 3 4 5 6 7 8 9 10
8 1 2 3 4 5 6 7 8 9 10
9 1 2 3 4 5 6 7 8 9 10
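What you could do is iterate over the rows with df_A.iterrows(), collect the rows you want to keep, and then build df_B from them (DataFrame.append() was removed in pandas 2.0, and it never modified the frame in place).
Something like this:
m = 5  # number of rows to keep in each block
n = 3  # number of rows to skip after them
kept = []
for pos, (index, row) in enumerate(df_A.iterrows()):
    # pos % (m + n) is the position inside the current block of m + n rows
    if pos % (m + n) < m:
        kept.append(row)
df_B = pd.DataFrame(kept)
This is fine for small dataframes, but iterrows() gets slow as the dataframe grows.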
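The same "keep m, skip n" pattern can also be written without a loop, by computing each row's position inside its block of m + n rows. A minimal sketch, assuming a small stand-in frame in place of the CSV (the column name x and the variable pos_in_block are illustrative only):

```python
import numpy as np
import pandas as pd

m, n = 5, 3  # keep m rows, then skip n rows, repeatedly
df_A = pd.DataFrame({'x': range(10)})  # stand-in for pd.read_csv('A.csv')

# position of every row inside its block of m + n rows
pos_in_block = np.arange(len(df_A)) % (m + n)
# boolean mask: True for the first m rows of each block
df_B = df_A[pos_in_block < m]
print(df_B.index.tolist())
```

With 10 rows this keeps rows 0 to 4 and rows 8 and 9, which matches the output of the first answer.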

Column Names in Pandas (Python)

Python : Pandas : Data Frame : Column Names
I have a large number of columns, and the column names are also very long. I would like to look at a few columns and rows, but the view is limited by the width of the column names. How can I temporarily display a dataframe in Python without the column names (just the data)?
Convert the DataFrame to a NumPy array:
print (df.values)
You can also select the columns by position with iloc first:
print (df.iloc[:, 5:8].values)
Sample:
df = pd.DataFrame(np.random.randint(10, size=(3,10)))
print (df)
0 1 2 3 4 5 6 7 8 9
0 8 4 9 1 3 7 6 3 0 3
1 3 2 6 8 9 3 7 5 7 4
2 0 0 7 5 7 3 9 3 9 3
print (df.iloc[:, 5:8])
5 6 7
0 7 6 3
1 3 7 5
2 3 9 3
print (df.iloc[:, 5:8].values)
[[7 6 3]
[3 7 5]
[3 9 3]]
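If the row index should stay visible and only the header should be hidden, DataFrame.to_string(header=False) is another option. A minimal sketch with made-up long column names (the frame below is illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'a_very_long_column_name': [1, 2, 3],
    'another_very_long_column_name': [4, 5, 6],
})

# render the first two rows without the header line
text = df.iloc[:2].to_string(header=False)
print(text)
```

This only changes how the frame is printed; the column names of df itself are untouched.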

Add column from one dataframe to another, for values present in overlapping column

Here is an example of what I am trying to do:
In [46]: import numpy as np, pandas as pd
In [47]: df_3 = pd.DataFrame(np.arange(12).reshape(6,2), columns=["a", "z"])
In [48]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=["a", "b", "c"])
In [49]: df
Out[49]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
[4 rows x 3 columns]
In [50]: df_3
Out[50]:
a z
0 0 1 # present in df
1 2 3
2 4 5
3 6 7 # present in df
4 8 9
5 10 11
[6 rows x 2 columns]
I want to add column z to df, but I want the values to be added only for rows that match on column a. Where there is no match, I want a null value in its place.
My desired output would look like this:
In [52]: df["z"] = [1, np.nan, 7, np.nan]
In [53]: df
Out[53]:
a b c z
0 0 1 2 1
1 3 4 5 NaN
2 6 7 8 7
3 9 10 11 NaN
[4 rows x 4 columns]
I tried naive attempts, like
In [57]: df.merge(df_3, on=["a"])
Out[57]:
a b c z
0 0 1 2 1
1 6 7 8 7
[2 rows x 4 columns]
Which does not give me the result I am looking for.
Just merge on the 'a' column and make it a left merge with how='left':
In [72]:
df.merge(df_3, on='a', how='left')
Out[72]:
a b c z
0 0 1 2 1
1 3 4 5 NaN
2 6 7 8 7
3 9 10 11 NaN
The reason you got this result:
In [57]: df.merge(df_3, on=["a"])
Out[57]:
a b c z
0 0 1 2 1
1 6 7 8 7
[2 rows x 4 columns]
is because the default type of merge is 'inner' so values have to exist in both lhs and rhs, see the docs: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
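To see the difference between the two merge types directly, the indicator=True option of merge adds a _merge column that records where each row came from. A minimal sketch using the frames from the question (the variable names inner and left are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])
df_3 = pd.DataFrame(np.arange(12).reshape(6, 2), columns=['a', 'z'])

# default how='inner': only keys present in both frames survive
inner = df.merge(df_3, on='a')

# how='left': every row of df is kept, unmatched rows get NaN in 'z'
left = df.merge(df_3, on='a', how='left', indicator=True)
print(left)
```

Rows marked left_only in _merge are the ones the inner merge drops.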

Summing over a DataFrame with two conditions and multiple values

I have a DataFrame x with three columns;
a b c
1 1 10 4
2 5 6 5
3 4 6 5
4 2 11 9
5 1 2 10
... and a Series y of two values;
t
1 3
2 7
Now I'd like to get a DataFrame z with two columns;
t sum_c
1 3 18
2 7 13
... with t from y and sum_c the sum of c from x for all rows where t was larger than a and smaller than b.
Would anybody be able to help me with this?
Here is a possible solution based on the given condition (the expected results listed in your question don't quite line up with that condition):
In[99]: df1
Out[99]:
a b c
0 1 10 4
1 5 6 5
2 4 6 5
3 2 11 9
4 1 2 10
In[100]: df2
Out[100]:
t
0 3
1 5
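Then write a function to be passed to DataFrame.apply() later (.ix was removed in pandas 1.0, so this uses .loc and .iloc):
In[101]: def cond_sum(x):
    ...:     t = x.iloc[0]  # the value of t for this row of df2
    ...:     return df1.loc[(df1['a'] < t) & (df1['b'] > t), 'c'].sum()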
finally:
In[102]: df3 = df2.apply(cond_sum,axis=1)
In[103]: df3
Out[103]:
0 13
1 18
dtype: int64
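The same conditional sum can be computed without apply by broadcasting every t against every row of df1. This is a sketch under the same condition as the answer (a < t < b), reusing the answer's data; the names t, inside and sum_c are chosen for illustration:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5, 4, 2, 1],
                    'b': [10, 6, 6, 11, 2],
                    'c': [4, 5, 5, 9, 10]})
df2 = pd.DataFrame({'t': [3, 5]})

# column vector of t values, shape (len(df2), 1)
t = df2['t'].to_numpy()[:, None]
# inside[i, j] is True when a[j] < t[i] < b[j]; shape (len(df2), len(df1))
inside = (df1['a'].to_numpy() < t) & (t < df1['b'].to_numpy())
# sum c over the rows of df1 that satisfy the condition for each t
df2['sum_c'] = (inside * df1['c'].to_numpy()).sum(axis=1)
print(df2)
```

The intermediate mask has len(df2) × len(df1) entries, so this trades memory for speed.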

python pandas drop value from column 'B' if that value appears in column 'A'

This code appears to drop duplicates from 'A', but leaves 'B' untouched:
df1.drop_duplicates(['A', 'B'], inplace=True)
Edit: That actually drops nothing... What's going on here?
Code (boiled down):
import pandas
df1 = pandas.DataFrame({'A':[1,4,0,8,3,4,5,3,3,3,9,9],
'B':[5,5,7,4,2,0,0,0,0,0,0,0]})
print(df1)
df1.drop_duplicates(['A', 'B'], inplace=True)
print(df1)
Output:
$ python test.py
A B
0 1 5
1 4 5
2 0 7
3 8 4
4 3 2
5 4 0
6 5 0
7 3 0
8 3 0
9 3 0
10 9 0
11 9 0
[12 rows x 2 columns]
A B
0 1 5
1 4 5
2 0 7
3 8 4
4 3 2
5 4 0
6 5 0
7 3 0
10 9 0
[9 rows x 2 columns]
I think I see what's happening above since it was these asterisked rows dropped:
7 3 0
8 3 0*
9 3 0*
10 9 0
11 9 0*
But I still can't see how to remove duplicates in 'B' (or return unique values in 'B'). The two columns are actually coming from separate CSV files. Should I not join them into a single DataFrame? Is there a way to compare and drop duplicates if I don't?
Edit:
This is the output I'm looking for (asterisked values deleted, or plus-marked values to be returned):
A B
0 1 5*
1 4* 5*
2 0* 7+
3 8 4*
4 3* 2+
5 4* 0*
6 5* 0*
7 3* 0*
10 9 0*
[9 rows x 2 columns]
This works:
import pandas
df1 = pandas.DataFrame({'A':[1,4,0,8,3,4,5,3,3,3,9,9],
'B':[5,5,7,4,2,0,0,0,0,0,0,0]})
print(df1)
cln = df1.unstack().drop_duplicates()
cln.drop(['A'], inplace=True)
print(cln)
cln = cln.reset_index(drop=True)
print(cln)
Output:
$ python test.py
A B
0 1 5
1 4 5
2 0 7
3 8 4
4 3 2
5 4 0
6 5 0
7 3 0
8 3 0
9 3 0
10 9 0
11 9 0
[12 rows x 2 columns]
B 2 7
4 2
dtype: int64
0 7
1 2
dtype: int64
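An alternative that does not rely on unstacking is to filter 'B' with isin. A minimal sketch, assuming the goal is the unique values of 'B' that never appear in 'A' (the name only_in_B is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 4, 0, 8, 3, 4, 5, 3, 3, 3, 9, 9],
                    'B': [5, 5, 7, 4, 2, 0, 0, 0, 0, 0, 0, 0]})

# keep values of B that do not occur anywhere in A, then drop repeats
only_in_B = (df1.loc[~df1['B'].isin(df1['A']), 'B']
             .drop_duplicates()
             .reset_index(drop=True))
print(only_in_B)
```

Because isin only needs two Series, the same expression should work if the columns are read from separate CSV files and never joined into one DataFrame.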
