I want to process each N rows of a DataFrame separately.If my data has 15 row indexed from 0 to 14 I want to process rows from index 0 to 3 , 4 to 7, 8 to 11, 12 to 15
for example let's say for each 4 rows I want the sum(A) and the mean(B)
Index
A
B
0
4
4
1
7
9
2
9
3
3
0
4
4
7
9
5
9
2
6
3
0
7
7
4
8
7
2
9
1
6
The Resulted DataFrame should be
Index
A
B
0
20
5
1
26
3.75
2
8
4
TLDR: how to let DataFrame.apply takes multiple rows instead of a single row at a time
Use GroupBy.agg with integer division by 4 by index:
#default RangeIndex
df = df.groupby(df.index // 4).agg({'A':'sum', 'B':'mean'})
#any index
df = df.groupby(np.arange(len(df.index)) // 4).agg({'A':'sum', 'B':'mean'})
print (df)
A B
0 20 5.00
1 26 3.75
2 8 4.00
Related
I am dealing with the following dataframe:
p q
0 11 2
1 11 2
2 11 2
3 11 3
4 11 3
5 12 2
6 12 2
7 13 2
8 13 2
I want create a new column, say s, which starts with 0 and goes on. This new col is based on "p" column, whenever the p changes, the "s" should change too.
For the first 4 rows, the "p" = 11, so "s" column should have values 0 for this first 4 rows, and so on...
Below is the expectant df:
s p q
0 0 11 2
1 0 11 2
2 0 11 2
3 0 11 2
4 1 11 4
5 1 11 4
6 1 11 4
7 1 11 4
8 2 12 2
9 2 12 2
10 2 12 2
11 3 12 3
12 3 12 3
You need diff with cumsum (subtract one if you want the id to start from 0):
df["finalID"] = (df.ProjID.diff() != 0).cumsum()
df
Update, if you want to take both voyg_id and ProjID into consideration, you can use a OR condition on the two columns difference, so that whichever column changes, you get an increase in the final id.
df['final_id'] = ((df.voyg_id.diff() != 0) | (df.proj_id.diff() != 0)).cumsum()
df
What is the best way to fill missing values in dataframe with items from list?
For example:
pd.DataFrame([[1,2,3],[4,5],[7,8],[10,11,12],[13,14]])
0 1 2
0 1 2 3
1 4 5 NaN
2 7 8 NaN
3 10 11 12
4 13 14 NaN
list = [6, 9, 150]
to get some something like this:
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 15
this is actually a little tricky and a bit of a hack, if you know the column you want to fill the NaN values for then you can construct a df for that column with the indices of the missing values and pass the df to fillna:
In [33]:
fill = pd.DataFrame(index =df.index[df.isnull().any(axis=1)], data= [6, 9, 150],columns=[2])
df.fillna(fill)
Out[33]:
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 150
You can't pass a dict (my original answer) as the dict key values are the column values to match on and the scalar value will be used for all NaN values for that column which is not what you want:
In [40]:
l=[6, 9, 150]
df.fillna(dict(zip(df.index[df.isnull().any(axis=1)],l)))
Out[40]:
0 1 2
0 1 2 3
1 4 5 9
2 7 8 9
3 10 11 12
4 13 14 9
You can see that it has replaced all NaNs with 9 as it matched the missing NaN index value of 2 with column 2.
I have got two pandas Dataframes df1 and df2. df1 has a multi-index:
A
instance index
a 0 10
1 11
2 7
b 0 8
1 9
2 13
The frame df2 has the same first-level index as df1:
B
instance
a 5
b 12
I want to do two things:
1) Assign the values in df2 to the all the rows of df1
A B
instance index
a 0 10 5
1 11 5
2 7 5
b 0 8 12
1 9 12
2 13 12
2) Create a dataframe object that represents the minimum of values in A and B without concatenating the two dataframes like above:
min(df1,df2):
min
instance index
a 0 5
1 5
2 5
b 0 8
1 9
2 12
For your first request, you can use DataFrame.join:
>>> df1.join(df2)
A B
instance index
a 0 10 5
1 11 5
2 7 5
b 0 8 12
1 9 12
2 13 12
For your second, you can simply call min(axis=1) on that object:
>>> df1.join(df2).min(axis=1).to_frame("min")
min
instance index
a 0 5
1 5
2 5
b 0 8
1 9
2 12
I have a DataFrame x with three columns;
a b c
1 1 10 4
2 5 6 5
3 4 6 5
4 2 11 9
5 1 2 10
... and a Series y of two values;
t
1 3
2 7
Now I'd like to get a DataFrame z with two columns;
t sum_c
1 3 18
2 7 13
... with t from y and sum_c the sum of c from x for all rows where t was larger than a and smaller than b.
Would anybody be able to help me with this?
here is a possible solution based on the given condition (the expected results listed in ur question dont quite line up with the given condition):
In[99]: df1
Out[99]:
a b c
0 1 10 4
1 5 6 5
2 4 6 5
3 2 11 9
4 1 2 10
In[100]: df2
Out[100]:
t
0 3
1 5
then write a function which would be used by pandas.apply() later:
In[101]: def cond_sum(x):
return sum(df1['c'].ix[np.logical_and(df1['a']<x.ix[0],df1['b']>x.ix[0])])
finally:
In[102]: df3 = df2.apply(cond_sum,axis=1)
In[103]: df3
Out[103]:
0 13
1 18
dtype: int64
I have two dataframes.
df1
Out[162]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
df2
Out[194]:
A B
0 a 3
1 b 4
2 c 5
I wish to create a 3rd column in df2 that maps df2['A'] to df1 and find the smallest number in df1 that's greater than the number in df2['B']. For example, for df2['C'].ix[0], it should go to df1['a'] and search for the smallest number that's greater than df2['B'].ix[0], which should be 4.
I had something like df2['C'] = df2['A'].map( df1[df1 > df2['B']].min() ). But this doesn't work as it won't go to df2['B'] search for corresponding rows. Thanks.
Use apply for row-wise methods:
In [54]:
# create our data
import pandas as pd
df1 = pd.DataFrame({'a':list(range(12)), 'b':list(range(12)), 'c':list(range(12))})
df1
Out[54]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
[12 rows x 3 columns]
In [68]:
# create our 2nd dataframe, note I have deliberately used alternate values for column 'B'
df2 = pd.DataFrame({'A':list('abc'), 'B':[3,5,7]})
df2
Out[68]:
A B
0 a 3
1 b 5
2 c 7
[3 rows x 2 columns]
In [69]:
# apply row-wise function, must use axis=1 for row-wise
df2['C'] = df2.apply(lambda row: df1[row['A']].ix[df1[row.A] > row.B].min(), axis=1)
df2
Out[69]:
A B C
0 a 3 4
1 b 5 6
2 c 7 8
[3 rows x 3 columns]
There is some example usage in the pandas docs