Counting rows between a set of indices in Dataframes - python

I have a problem with counting the number of rows between two indices from another DataFrame. Let me explain with an example:
The index of DF2 is the reference vector, and I want to count the number of rows/entries in DF1 that fall between each pair of consecutive indices.
DF1
index  data
3      1
9      1
15     0
21     0
23     0
30     1
34     0

DF2
index  data
2      1
11     1
33     1
34     1
Now I want to count all rows that lie between each consecutive pair of indices in DF2.
The reference vector is the index vector of DF2: [2, 11, 33, 34]
Between index 2 and 11 of DF2 is index: 3 and 9 of DF1 -> result 2
Between index 11 and 33 of DF2 is index: 15, 21, 23, 30 of DF1 -> result 4
Between index 33 and 34 of DF2 is index: 34 of DF1 -> result 1
Therefore the result vector should be: [2, 4, 1]
I am really struggling with this, so I hope you can help me.

I would first build a dataframe giving the min and max indexes from df2:
import numpy as np
import pandas as pd

# pair each boundary with its predecessor; drop the first (wrapped-around) row
limits = pd.DataFrame({'mn': np.roll(df2.index, 1), 'mx': df2.index}).iloc[1:]
It gives:
   mn  mx
1   2  11
2  11  33
3  33  34
It is then easy to use a comprehension to get the expected list:
result = [len(df1[(i[0] <= df1.index) & (df1.index <= i[1])]) for i in limits.values]
and obtain as expected:
[2, 4, 1]
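If both indexes are sorted, the per-interval scan can also be replaced with a vectorized count. A minimal sketch using np.searchsorted, assuming the frames shown above and the same inclusive bounds as the comprehension:

import numpy as np
import pandas as pd

# Reconstruction of the example frames (assumption).
df1 = pd.DataFrame({'data': [1, 1, 0, 0, 0, 1, 0]},
                   index=[3, 9, 15, 21, 23, 30, 34])
bounds = np.asarray([2, 11, 33, 34])  # df2.index

# Rows with index <= the upper bound, minus rows with index < the lower
# bound, reproducing the inclusive i[0] <= idx <= i[1] test above.
right = np.searchsorted(df1.index, bounds, side='right')
left = np.searchsorted(df1.index, bounds, side='left')
result = (right[1:] - left[:-1]).tolist()  # [2, 4, 1]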

Related

Iterating over dataframe and get columns as new dataframes

I'm trying to create a set of dataframes from one big dataframe. These dataframes consist of the columns of the original dataframe, in this manner:
1st dataframe is the 1st column of the original one,
2nd dataframe is the 1st and 2nd columns of the original one,
and so on.
I use this code to iterate over the dataframe:
for i, data in enumerate(x):
    data = x.iloc[:, :i]
    print(data)
This works, but I also get an empty dataframe at the beginning and an index vector I don't need.
Any suggestions on how to remove those two?
Thanks
Since you are not using the enumerated values, only the index, you can skip enumerate and instead iterate over the range from 1 through the number of columns plus one, taking the slice df.iloc[:, :i] for each value of i. You can use a list comprehension to achieve this:
>>> [df.iloc[:, :i] for i in range(1, df.shape[1] + 1)]
[   A
 0  1
 1  2
 2  3,
    A  B
 0  1  2
 1  2  4
 2  3  6]
The equivalent traditional loop would look something like this:
for i in range(1, df.shape[1] + 1):
    print(df.iloc[:, :i])

   A
0  1
1  2
2  3
   A  B
0  1  2
1  2  4
2  3  6
You can also do something like this (iterating over a dataframe yields its column labels, so enumerate(df, start=1) pairs each label with the width of its prefix):
import numpy as np
import pandas as pd

data = {
    'col_1': np.random.randint(0, 10, 5),
    'col_2': np.random.randint(10, 20, 5),
    'col_3': np.random.randint(0, 10, 5),
    'col_4': np.random.randint(10, 20, 5),
}
df = pd.DataFrame(data)

# map each column label to the dataframe of all columns up to and including it
all_df = {col: df.iloc[:, :i] for i, col in enumerate(df, start=1)}
# For example we can print the last one
print(all_df['col_4'])
   col_1  col_2  col_3  col_4
0      1     13      5     10
1      8     16      1     18
2      6     11      5     18
3      3     11      1     10
4      7     14      8     12

Function in pandas to stack rows into columns by number of rows?

Suppose I have a heterogeneous dataframe:
    a   b   c   d
1   1   2   3   4
2   5   6   7   8
3   9  10  11  12
4  13  14  15  16
And I want to stack the rows like so:
   a         b          c          d
1  1,5,9,13  2,6,10,14  3,7,11,15  4,8,12,16
Etc...
All the references for groupby etc. seem to require some feature to group on; I just want to put x rows into columns, regardless of their content. Each row has a timestamp, and I am looking to group values by sample count, i.e. I want one row with all the values of x sample rows as columns.
I should end up with a dataframe that has x times the original number of columns and the original number of rows divided by x.
I'm sure there must be some simple method I'm missing here without a series of loops, etc.
If you need to join all values into strings, use:
df1 = df.astype(str).agg(','.join).to_frame().T
print(df1)
          a          b          c          d
0  1,5,9,13  2,6,10,14  3,7,11,15  4,8,12,16
Or if you need to create lists, use:
df2 = pd.DataFrame([[list(df[x]) for x in df]], columns=df.columns)
print(df2)
               a               b               c               d
0  [1, 5, 9, 13]  [2, 6, 10, 14]  [3, 7, 11, 15]  [4, 8, 12, 16]
If you need scalars with a MultiIndex (generated from the index and column labels), use:
df3 = df.unstack().to_frame().T
print(df3)
   a            b             c             d
   1  2  3   4  1  2   3   4  1  2   3   4  1  2   3   4
0  1  5  9  13  2  6  10  14  3  7  11  15  4  8  12  16
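The question actually asks to group by a sample count x rather than collapse the whole frame. A sketch generalizing the unstack idea, where x is a hypothetical block size and len(df) is assumed to be a multiple of x:

import numpy as np

x = 4                                         # rows per output row (hypothetical)
groups = np.arange(len(df)) // x              # block number of each row
pos = df.groupby(groups).cumcount()           # position of each row inside its block
df4 = df.set_index([groups, pos]).unstack()   # columns become (label, position) pairs

With the four-row example this yields the same single row as df3, with positions 0-3 instead of the original index labels in the column MultiIndex.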

Creating a new dataframe off of duplicate indexes

I'm working in pandas and I have a dataframe X
idx
0
1
2
3
4
I want to create a new dataframe with the following indexes from this list. There are duplicate indexes because I want some rows to repeat.
idx = [0,0,1,2,3,2,4]
My expected output is
idx
0
0
1
2
3
2
4
I can't use
X.iloc[idx]
because of the duplicated indexes.
Code I tried:
d = {'idx': [0, 1, 3, 4]}
df = pd.DataFrame(data=d)
idx = [0, 0, 1, 2, 3, 2, 4]
df.iloc[idx]  # errors here with IndexError: indices are out-of-bounds
What you want to do is weird, but here is one way to do it.
import pandas as pd

df = pd.DataFrame({'A': [11, 21, 31],
                   'B': [12, 22, 32],
                   'C': [13, 23, 33]},
                  index=['ONE', 'ONE', 'TWO'])
OUTPUT
      A   B   C
ONE  11  12  13
ONE  21  22  23
TWO  31  32  33
Read: pandas: Rename columns / index names (labels) of DataFrame
Your current dataframe df:
   idx
0    0
1    1
2    3
3    4
Now just use the reindex() method:
idx = [0, 0, 1, 2, 3, 2, 4]
df = df.reindex(idx)
Now if you print df you get (the column becomes float because label 4 does not exist in this toy frame, so reindex fills it with NaN):
   idx
0  0.0
0  0.0
1  1.0
2  3.0
3  4.0
2  3.0
4  NaN
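For the asker's original frame the labels 0 through 4 all exist, so there is nothing to fill and plain label-based selection already repeats rows. A minimal sketch, assuming X is the five-row frame from the question:

import pandas as pd

X = pd.DataFrame({'idx': [0, 1, 2, 3, 4]})  # reconstruction (assumption)
idx = [0, 0, 1, 2, 3, 2, 4]

# .loc selects by label and repeats rows for duplicate labels; the
# IndexError above came from 4 being out of positional bounds for the
# four-row toy frame, not from the duplicates themselves.
out = X.loc[idx]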

How to create a cumulative sum column in python if column value is greater than other value

I am working now on getting a cumulative sum column using pandas. However, this column must include the cumulative sum only when one column's value is greater than another column's value. Here's an example of my current data:
Index   A   B   C
0       1  20   3
1      10  15  11
2      20  12  25
3      30  18  32
4      40  32  17
5      50  12   4
Then I want to cumsum() column A where column B is greater than C; otherwise the value is zero. (Note from the expected output that the running sum restarts after each run where the condition fails.) The result column D in the original df should look like:
Index   A   B   C   D
0       1  20   3   1
1      10  15  11  11
2      20  12  25   0
3      30  18  32   0
4      40  32  17  40
5      50  12   4  90
I appreciate any support in advance.
df = pd.DataFrame({'A': {0: 1, 1: 10, 2: 20, 3: 30, 4: 40, 5: 50},
                   'B': {0: 20, 1: 15, 2: 12, 3: 18, 4: 32, 5: 12},
                   'C': {0: 3, 1: 11, 2: 25, 3: 32, 4: 17, 5: 4}})
Make a boolean Series for your condition and identify consecutive groups of True or False
b_gt_c = df.B > df.C
groups = b_gt_c.ne(b_gt_c.shift()).cumsum()
In [107]: b_gt_c
Out[107]:
0     True
1     True
2    False
3    False
4     True
5     True
dtype: bool

In [108]: groups
Out[108]:
0    1
1    1
2    2
3    2
4    3
5    3
dtype: int32
Group by those groups; multiply the cumsum of each group by the condition; assign the result to the new df column.
gb = df.groupby(groups)
for k, g in gb:
    df.loc[g.index, 'D'] = g['A'].cumsum() * b_gt_c[g.index]
In [109]: df
Out[109]:
    A   B   C     D
0   1  20   3   1.0
1  10  15  11  11.0
2  20  12  25   0.0
3  30  18  32   0.0
4  40  32  17  40.0
5  50  12   4  90.0
You could skip the for loop as well:
df['G'] = np.where(df.B.gt(df.C), df.A, np.nan)
group = df.B.gt(df.C).ne(df.B.gt(df.C).shift()).cumsum()
df['G'] = df.groupby(group).G.cumsum().fillna(0)
Identifying consecutive occurrence of values from SO Q&A: Grouping dataframe based on consecutive occurrence of values
There probably is a more elegant solution, but this also works.
We first create two dummy columns, x and x_shift.
df.x is a conditional column in which we retain the values of df.A where df.B > df.C.
df.x_shift shifts those values one row down and fills the resulting NaN with 0.
In the last step we conditionally add df.A and df.x_shift, then drop df.x and df.x_shift.
(Note that because x_shift carries only the single previous row, this reproduces the running sum only while runs of the condition are at most two rows long, as in this example.)
df['x'] = np.where(df.B > df.C, df.A, 0)
df['x_shift'] = df.x.shift(1).fillna(0)
df['D'] = np.where(df.B > df.C, df.A + df.x_shift, 0)
df = df.drop(['x', 'x_shift'], axis=1)
While it's a little barbaric, you could convert to numpy arrays and then write a simple loop that goes through the three arrays and compares values.
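A minimal sketch of that loop, assuming the df defined above; it keeps a running total that resets whenever the condition fails:

import numpy as np

A, B, C = df['A'].to_numpy(), df['B'].to_numpy(), df['C'].to_numpy()
D = np.zeros(len(A), dtype=A.dtype)
running = 0
for i in range(len(A)):
    if B[i] > C[i]:
        running += A[i]  # extend the current run's cumulative sum
    else:
        running = 0      # condition failed: reset the accumulator
    D[i] = running
df['D'] = D              # [1, 11, 0, 0, 40, 90]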

How to replace elements inside the list in series

I have a DataFrame like below,
df1
         col1
0          10
1  [5, 8, 11]
2          15
3          12
4          13
5          33
6    [12, 19]
Code to generate this df1:
df1 = pd.DataFrame({"col1":[10,[5,8,11],15,12,13,33,[12,19]]})
df2
   col1  col2
0    12     1
1    10     2
2     5     3
3    11    10
4     7     5
5    13     4
6     8     7
Code to generate this df2:
df2 = pd.DataFrame({"col1":[12,10,5,11,7,13,8],"col2":[1,2,3,10,5,4,7]})
I want to replace the elements in df1 with values from df2.
If the series contained only non-list (scalar) elements,
I could simply replace them with map:
df1['res'] = df1['col1'].map(df2.set_index('col1')["col2"].to_dict())
But this series contains a mix of lists and scalars.
How can I replace both the list elements and the scalar values in the series in an effective way?
Expected Output
         col1         res
0          10           2
1  [5, 8, 11]  [3, 7, 10]
2          15          15
3          12           1
4          13           4
5          33          33
6    [12, 19]     [1, 19]
Your series is of dtype object, as it contains int and list objects. This is inefficient for Pandas and means a vectorised solution won't be possible.
You can create a mapping dictionary and use pd.Series.apply. To account for list objects, you can catch TypeError. You meet this specific error for lists since they are not hashable, and therefore cannot be used as dictionary keys.
d = df2.set_index('col1')['col2'].to_dict()

def mapvals(x):
    try:
        return d.get(x, x)  # scalar: look it up, falling back to the value itself
    except TypeError:       # lists are unhashable and raise TypeError
        return [d.get(i, i) for i in x]

df1['res'] = df1['col1'].apply(mapvals)
print(df1)
         col1         res
0          10           2
1  [5, 8, 11]  [3, 7, 10]
2          15          15
3          12           1
4          13           4
5          33          33
6    [12, 19]     [1, 19]
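An equivalent sketch that branches on the value's type instead of catching TypeError, assuming the only non-scalar entries in col1 are lists:

def mapvals_typed(x):
    # Hypothetical variant of mapvals: explicit type check instead of try/except.
    if isinstance(x, list):
        return [d.get(i, i) for i in x]
    return d.get(x, x)

df1['res'] = df1['col1'].apply(mapvals_typed)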
