Summing over a DataFrame with two conditions and multiple values - python

I have a DataFrame x with three columns:
a b c
1 1 10 4
2 5 6 5
3 4 6 5
4 2 11 9
5 1 2 10
... and a Series y of two values:
t
1 3
2 7
Now I'd like to get a DataFrame z with two columns:
t sum_c
1 3 18
2 7 13
... with t from y and sum_c the sum of c from x over all rows where t is larger than a and smaller than b.
Would anybody be able to help me with this?

Here is a possible solution based on the given condition (note that the expected results listed in your question don't quite line up with that condition):
In[99]: df1
Out[99]:
a b c
0 1 10 4
1 5 6 5
2 4 6 5
3 2 11 9
4 1 2 10
In[100]: df2
Out[100]:
t
0 3
1 5
then write a function which would be used by pandas.apply() later:
In[101]: def cond_sum(x):
            # sum c over the rows of df1 where a < t < b, with t = x['t']
            t = x['t']
            return df1.loc[(df1['a'] < t) & (df1['b'] > t), 'c'].sum()
finally:
In[102]: df3 = df2.apply(cond_sum, axis=1)
In[103]: df3
Out[103]:
0 13
1 18
dtype: int64
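If df2 grows large, the row-wise apply can be slow. Here is a minimal vectorized sketch of the same condition using NumPy broadcasting, assuming df1 and df2 as constructed above:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5, 4, 2, 1],
                    'b': [10, 6, 6, 11, 2],
                    'c': [4, 5, 5, 9, 10]})
df2 = pd.DataFrame({'t': [3, 5]})

# mask[i, j] is True where a[j] < t[i] < b[j]
t = df2['t'].to_numpy()[:, None]
mask = (df1['a'].to_numpy() < t) & (df1['b'].to_numpy() > t)

# each row of the mask selects the c values that contribute to one t
df2['sum_c'] = (mask * df1['c'].to_numpy()).sum(axis=1)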

Related

Pandas how to output distinct values in column based on duplicate in another column

Here is an example:
import pandas as pd
df = pd.DataFrame({
    'product': ['1', '1', '1', '2', '2', '2', '3', '3', '3', '4', '4', '4', '5', '5', '5'],
    'value':   ['a', 'a', 'a', 'a', 'a', 'b', 'a', 'b', 'a', 'b', 'b', 'b', 'a', 'a', 'a']
})
product value
0 1 a
1 1 a
2 1 a
3 2 a
4 2 a
5 2 b
6 3 a
7 3 b
8 3 a
9 4 b
10 4 b
11 4 b
12 5 a
13 5 a
14 5 a
I need to output:
1 a
4 b
5 a
Because for these 'product' values all the 'value' entries are the same.
Sorry for my bad English.
I think you need this:
m = df.groupby('product')['value'].transform('nunique')
df.loc[m==1].drop_duplicates().reset_index(drop=True)
Output
product value
0 1 a
1 4 b
2 5 a
Details
df.groupby('product')['value'].transform('nunique') returns a series as below
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 2
9 1
10 1
11 1
12 1
13 1
14 1
where the numbers are the count of unique values within each product's group. Then we use df.loc to keep only the rows where this count is 1, i.e. the groups with a single unique value.
Then we drop duplicates, since you need only the group and its unique value.
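For reference, groupby().filter can express the same idea in one chain; a small sketch assuming the same df as above:

out = (df.groupby('product')
         .filter(lambda g: g['value'].nunique() == 1)
         .drop_duplicates()
         .reset_index(drop=True))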
If I understand your question correctly, this simple code works for you:
distinct_prod_df = df.drop_duplicates(['product'])
and gives:
product value
0 1 a
3 2 a
6 3 a
9 4 b
12 5 a
You can try this:
mask = df.groupby('product')['value'].transform(lambda x: x.nunique() == 1)
df = df[mask].drop_duplicates()

Filter pandas dataframe rows by multiple column values

I have a pandas dataframe containing rows with numbered columns:
1 2 3 4 5
a 0 0 0 0 1
b 1 1 2 1 9
c 2 2 2 2 2
d 5 5 5 5 5
e 8 9 9 9 9
How can I filter out the rows where a subset of columns are all above or below a certain value?
So, for example, I want to remove all rows where the values in columns 1 to 3 are not all > 3. In the above, that would leave me with only rows d and e.
The columns I am filtering and the value I am checking against are both arguments.
I've tried a few things, this is the closest I've gotten:
df[df[range(1,3)]>3]
Any ideas?
I used loc and all in this function:
def filt(df, cols, thresh):
    return df.loc[(df[cols] > thresh).all(axis=1)]
filt(df, [1, 2, 3], 3)
1 2 3 4 5
d 5 5 5 5 5
e 8 9 9 9 9
You can achieve this without using apply:
In [73]:
df[(df.iloc[:, 0:3] > 3).all(axis=1)]
Out[73]:
1 2 3 4 5
d 5 5 5 5 5
e 8 9 9 9 9
So this slices the df down to just the first 3 columns using iloc, compares against the scalar 3, and then calls all(axis=1) to create a boolean series that masks the index.
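The question also asks about the "all below" case; that is just the flipped comparison, e.g. this small variant of the filt function above:

def filt_below(df, cols, thresh):
    # keep rows where every value in cols is strictly below thresh
    return df.loc[(df[cols] < thresh).all(axis=1)]

filt_below(df, [1, 2, 3], 3)  # keeps rows a, b and c in the example above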

How to Pandas read_csv multiple records per line

(I'm a pandas n00b) I have some oddly formatted CSV data that resembles this:
i A B C
x y z x y z x y z
-------------------------------------
1 1 2 3 4 5 6 7 8 9
2 1 2 3 3 2 1 2 1 3
3 9 8 7 6 5 4 3 2 1
where A, B, C are categorical and the properties x, y, z are present for each. What I think I want to do (part of a larger split-apply-combine step) is to read the data with Pandas such that I have dimensionally homogeneous observations like this:
i id GRP x y z
-----------------------
1 1 A 1 2 3
2 1 B 4 5 6
3 1 C 7 8 9
4 2 A 1 2 3
5 2 B 3 2 1
6 2 C 2 1 3
7 3 A 9 8 7
8 3 B 6 5 4
9 3 C 3 2 1
So how best to accomplish this?
#1: I thought about reading the file using basic read_csv() options, then iterating/slicing/transposing to create another dataframe with the structure I want. But in my case the number of categories (A, B, C) and properties (x, y, z) is large and not known ahead of time. I'm also worried about memory issues when scaling to large datasets.
#2: I like the idea of setting the iterator param in read_csv() and then yielding multiple observations per line (any reason not to set chunksize=1?). At least I wouldn't be creating multiple dataframes this way.
What's the smarter way to do this?
First I constructed the sample dataframe like yours:
import numpy as np
import pandas as pd

column = pd.MultiIndex.from_product([['A', 'B', 'C'], ['x', 'y', 'z']])
df = pd.DataFrame(np.random.randint(1, 10, size=(3, 9)),
                  columns=column, index=[1, 2, 3])
print(df)
# A B C
# x y z x y z x y z
# 1 5 7 4 7 7 8 9 1 9
# 2 8 5 1 8 5 9 4 4 2
# 3 4 9 6 2 1 4 6 1 6
To get your desired output, reshape the dataframe using df.stack() and then reset the index:
df = df.stack(0).reset_index()
df.index += 1 # to make index begin from 1
print(df)
# level_0 level_1 x y z
# 1 1 A 5 7 4
# 2 1 B 7 7 8
# 3 1 C 9 1 9
# 4 2 A 8 5 1
# 5 2 B 8 5 9
# 6 2 C 4 4 2
# 7 3 A 4 9 6
# 8 3 B 2 1 4
# 9 3 C 6 1 6
Then you can just rename the columns as you want. Hope it helps.
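To get there straight from the file rather than from a constructed frame, read_csv can build the column MultiIndex itself. A sketch, assuming the file has the two header rows shown and whitespace separation ('data.csv' is a placeholder name):

import pandas as pd

# header=[0, 1] turns the A/B/C and x/y/z header lines into a column
# MultiIndex; index_col=0 uses the i column as the row index;
# skiprows=[2] drops the dashed rule line, if it really is in the file
df = pd.read_csv('data.csv', header=[0, 1], index_col=0,
                 sep=r'\s+', skiprows=[2])

out = df.stack(0).reset_index()
out.columns = ['id', 'GRP', 'x', 'y', 'z']
out.index += 1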

How do i find the minimum of two dataframe columns with multi-indices in python pandas?

I have two pandas DataFrames, df1 and df2. df1 has a multi-index:
A
instance index
a 0 10
1 11
2 7
b 0 8
1 9
2 13
The frame df2 has the same first-level index as df1:
B
instance
a 5
b 12
I want to do two things:
1) Assign the values in df2 to all the rows of df1:
A B
instance index
a 0 10 5
1 11 5
2 7 5
b 0 8 12
1 9 12
2 13 12
2) Create a dataframe object that represents the minimum of values in A and B without concatenating the two dataframes like above:
min(df1,df2):
min
instance index
a 0 5
1 5
2 5
b 0 8
1 9
2 12
For your first request, you can use DataFrame.join:
>>> df1.join(df2)
A B
instance index
a 0 10 5
1 11 5
2 7 5
b 0 8 12
1 9 12
2 13 12
For your second, you can simply call min(axis=1) on that object:
>>> df1.join(df2).min(axis=1).to_frame("min")
min
instance index
a 0 5
1 5
2 5
b 0 8
1 9
2 12
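If you want to skip the join entirely, the instance-level values can be broadcast onto df1's MultiIndex first. A sketch, assuming the index levels are named instance and index as shown:

import numpy as np

# reindex with level= repeats each instance's B value across that level
b = df2['B'].reindex(df1.index, level='instance')
np.minimum(df1['A'], b).to_frame('min')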

pandas compare and select the smallest number from another dataframe

I have two dataframes.
df1
Out[162]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
df2
Out[194]:
A B
0 a 3
1 b 4
2 c 5
I wish to create a 3rd column in df2 that maps df2['A'] to df1 and find the smallest number in df1 that's greater than the number in df2['B']. For example, for df2['C'].ix[0], it should go to df1['a'] and search for the smallest number that's greater than df2['B'].ix[0], which should be 4.
I had something like df2['C'] = df2['A'].map(df1[df1 > df2['B']].min()), but this doesn't work, as it doesn't look up the corresponding rows in df2['B']. Thanks.
Use apply for row-wise methods:
In [54]:
# create our data
import pandas as pd
df1 = pd.DataFrame({'a':list(range(12)), 'b':list(range(12)), 'c':list(range(12))})
df1
Out[54]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
[12 rows x 3 columns]
In [68]:
# create our 2nd dataframe, note I have deliberately used alternate values for column 'B'
df2 = pd.DataFrame({'A':list('abc'), 'B':[3,5,7]})
df2
Out[68]:
A B
0 a 3
1 b 5
2 c 7
[3 rows x 2 columns]
In [69]:
# apply row-wise function, must use axis=1 for row-wise
df2['C'] = df2.apply(lambda row: df1.loc[df1[row['A']] > row['B'], row['A']].min(), axis=1)
df2
Out[69]:
A B C
0 a 3 4
1 b 5 6
2 c 7 8
[3 rows x 3 columns]
There is some example usage of apply in the pandas docs.
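If the columns of df1 happen to be sorted ascending, as in this example, searchsorted avoids scanning every row. A sketch under that assumption (it also assumes each bound has some strictly larger value present, otherwise the position runs off the end):

# position of the first value strictly greater than each bound
df2['C'] = [df1[col].iloc[df1[col].searchsorted(bound, side='right')]
            for col, bound in zip(df2['A'], df2['B'])]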
