Creating a new dataframe off of duplicate indexes - python

I'm working in pandas and I have a dataframe X
idx
0
1
2
3
4
I want to create a new dataframe with the following indexes from this list. There are duplicate indexes because I want some rows to repeat.
idx = [0,0,1,2,3,2,4]
My expected output is
idx
0
0
1
2
3
2
4
I can't use
X.iloc[idx]
because of the duplicated indexes.
Code I tried:
d = {'idx': [0,1,3,4]}
df = pd.DataFrame(data=d)
idx = [0,0,1,2,3,2,4]
df.iloc[idx] # errors here with IndexError: indices are out-of-bounds

What you want to do is weird, but here is one way to do it.
import pandas as pd
df = pd.DataFrame({'A': [11, 21, 31],
                   'B': [12, 22, 32],
                   'C': [13, 23, 33]},
                  index=['ONE', 'ONE', 'TWO'])
Output:
A B C
ONE 11 12 13
ONE 21 22 23
TWO 31 32 33
Read: pandas: Rename columns / index names (labels) of DataFrame

Your current dataframe df:
idx
0 0
1 1
2 3
3 4
Now just use the reindex() method:
idx = [0,0,1,2,3,2,4]
df = df.reindex(idx)
Now if you print df, you get:
idx
0 0.0
0 0.0
1 1.0
2 3.0
3 4.0
2 3.0
4 NaN
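Putting the answer together as a runnable sketch (the 4-row frame from the question's attempted code is assumed):

```python
import pandas as pd

# The frame from the question: 4 rows, default integer index 0..3
df = pd.DataFrame({'idx': [0, 1, 3, 4]})

# Position 4 is out of bounds for .iloc (only positions 0..3 exist),
# but reindex looks up labels, so the missing label 4 simply becomes NaN
idx = [0, 0, 1, 2, 3, 2, 4]
result = df.reindex(idx)
print(result)
```

Note that reindex matches by index label, not by position, which is also why the column becomes float: the missing label introduces a NaN.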

Related

Using .loc[index] to set a row's value by index and add any additional columns

Is it possible to use .loc[] to set a row's value (using a series) as well as add any additional columns that might exist in this series? Perhaps there is a type of merge I am unaware of that could merge a series into a dataframe by index.
import pandas as pd
index = 1
series = pd.Series({'a':2, 'b':54, 'c':945})
df = pd.DataFrame({'a': {0: 1, 1: 2, 2: 3}, 'b': {0: 3, 1: 54, 2: 1}})
df.loc[index] = series
output:
a b
0 1 3
1 2 54
2 3 1
Desired output:
a b c
0 1 3
1 2 54 945
2 3 1
You can use the index of the series as the columns:
>>> df.loc[index, series.index] = series
>>> df
a b c
0 1 3 NaN
1 2 54 945.0
2 3 1 NaN
You can reindex the dataframe first:
df2 = df.reindex(columns=series.index)
df2.loc[index] = series
Alternatively, use combine_first:
df2 = df.combine_first(series.to_frame(name=index).T)
output:
a b c
0 1 3 NaN
1 2 54 945.0
2 3 1 NaN
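The reindex variant above, made self-contained and runnable:

```python
import pandas as pd

df = pd.DataFrame({'a': {0: 1, 1: 2, 2: 3}, 'b': {0: 3, 1: 54, 2: 1}})
series = pd.Series({'a': 2, 'b': 54, 'c': 945})
index = 1

# Reindex the columns to the series' index first (missing columns are
# created as NaN), then assign the whole row by label
df2 = df.reindex(columns=series.index)
df2.loc[index] = series
print(df2)
```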

Iterating over dataframe and get columns as new dataframes

I'm trying to create a set of dataframes from one big dataframe. These dataframes consist of the columns of the original dataframe in this manner:
1st dataframe is the 1st column of the original one,
2nd dataframe is the 1st and 2nd columns of the original one,
and so on.
I use this code to iterate over the dataframe:
for i, data in enumerate(x):
    data = x.iloc[:, :i]
    print(data)
This works but I also get an empty dataframe in the beginning and an index vector I don't need.
Any suggestions on how to remove those two? Thanks.
Since you only use the index value from enumerate and never the enumerated item itself, you can simply iterate over range(1, df.shape[1] + 1) and take the slice df.iloc[:, :i] for each value of i. A list comprehension achieves this:
>>> [df.iloc[:, :i] for i in range(1,df.shape[1]+1)]
[ A
0 1
1 2
2 3,
A B
0 1 2
1 2 4
2 3 6]
The equivalent traditional loop would look something like this:
for i in range(1, df.shape[1] + 1):
    print(df.iloc[:, :i])
A
0 1
1 2
2 3
A B
0 1 2
1 2 4
2 3 6
You can also do something like this:
import numpy as np
import pandas as pd

data = {
    'col_1': np.random.randint(0, 10, 5),
    'col_2': np.random.randint(10, 20, 5),
    'col_3': np.random.randint(0, 10, 5),
    'col_4': np.random.randint(10, 20, 5),
}
df = pd.DataFrame(data)
all_df = {col: df.iloc[:, :i] for i, col in enumerate(df, start=1)}
# For example we can print the last one
print(all_df['col_4'])
col_1 col_2 col_3 col_4
0 1 13 5 10
1 8 16 1 18
2 6 11 5 18
3 3 11 1 10
4 7 14 8 12

Counting rows between a set of indices in Dataframes

I have a problem with counting the number of rows between two indices from another DataFrame. Let me explain with an example:
The index of DF2 is the reference vector, and I want to count the number of rows/entries in DF1 that lie between each pair of consecutive indices.
DF1
index  data
3      1
9      1
15     0
21     0
23     0
30     1
34     0

DF2
index  data
2      1
11     1
33     1
34     1
Now I want to count all rows that lie between a couple of indices in DF 2.
The reference vector is the index vector of DF2: [2, 11, 33, 34]
Between index 2 and 11 of DF2 is index: 3 and 9 of DF1 -> result 2
Between index 11 and 33 of DF2 is index: 15, 21, 23, 30 of DF1 -> result 4
Between index 33 and 34 of DF2 is index: 34 of DF1 -> result 1
Therefore the result vector should be: [2, 4, 1]
I am really struggling, so I hope you can help me.
I would first build a dataframe giving the min and max indexes from df2:
limits = pd.DataFrame({'mn': np.roll(df2.index, 1), 'mx': df2.index}).iloc[1:]
It gives:
mn mx
1 2 11
2 11 33
3 33 34
It is then easy to use a comprehension to get the expected list:
result = [len(df1[(i[0]<=df1.index)&(df1.index<=i[1])]) for i in limits.values]
and obtain as expected:
[2, 4, 1]
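For reference, here is the whole approach as a self-contained sketch, with the example data from the question reconstructed:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'data': [1, 1, 0, 0, 0, 1, 0]},
                   index=[3, 9, 15, 21, 23, 30, 34])
df2 = pd.DataFrame({'data': [1, 1, 1, 1]}, index=[2, 11, 33, 34])

# Pair up consecutive df2 indices as (mn, mx) intervals; np.roll shifts
# the index by one, and iloc[1:] drops the wrapped-around first row
limits = pd.DataFrame({'mn': np.roll(df2.index, 1), 'mx': df2.index}).iloc[1:]

# Count df1 rows whose index falls inside each interval (inclusive bounds)
result = [((i[0] <= df1.index) & (df1.index <= i[1])).sum() for i in limits.values]
print(result)  # [2, 4, 1]
```

The bounds are inclusive on both sides, which is fine here because no df1 index coincides with an interior df2 index; if that could happen, one side should be made strict to avoid double counting.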

Flatten nested pandas dataframe columns

After some aggregation, my dataframe looks something like this
    A      B
       B_min B_max
0  11      3     6
1  22      1     2
2  33      4     4
How do I make the columns be A, B_min and B_max, without any nesting? Simple and standard. I've tried reindex_axis() and unstack(), but nothing worked.
Here is one way, but I wish there was an in-built way to do this.
import pandas as pd
df = pd.DataFrame({'A': [11, 11, 22, 22, 33, 33],
                   'B': [3, 6, 1, 2, 4, 4]})
g = df.groupby('A', as_index=False).agg({'B': ['min', 'max']})
g.columns = ['_'.join(col).strip() if col[1] else col[0] for col in g.columns.values]
# A B_min B_max
# 0 11 3 6
# 1 22 1 2
# 2 33 4 4
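If a newer pandas is available (0.25+), named aggregation produces flat columns directly, which may be the closest thing to a built-in way:

```python
import pandas as pd

df = pd.DataFrame({'A': [11, 11, 22, 22, 33, 33],
                   'B': [3, 6, 1, 2, 4, 4]})

# Named aggregation: each keyword becomes an output column,
# its value is a (input column, aggregation function) pair
g = df.groupby('A', as_index=False).agg(B_min=('B', 'min'), B_max=('B', 'max'))
print(g)
```

Because the output column names are given explicitly, no MultiIndex is ever created, so there is nothing to flatten afterwards.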

Dataframe is not updated when columns are passed to function using apply

I have two dataframes like this:
A B
a 1 10
b 2 11
c 3 12
d 4 13
A B
a 11 NaN
b NaN NaN
c NaN 20
d 16 30
They have identical column names and indices. My goal is to replace the NAs in df2 by the values of df1. Currently, I do this like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': range(1, 5), 'B': range(10, 14)}, index=list('abcd'))
df2 = pd.DataFrame({'A': [11, np.nan, np.nan, 16], 'B': [np.nan, np.nan, 20, 30]}, index=list('abcd'))
def repl_na(s, d):
    s[s.isnull().values] = d[s.isnull().values][s.name]
    return s
df2.apply(repl_na, args=(df1, ))
which gives me the desired output:
A B
a 11 10
b 2 11
c 3 20
d 16 30
My question is now how this could be accomplished if the indices of the dataframes are different (the column names are still the same, and the columns have the same length). So I would have a df2 like this (df1 is unchanged):
A B
0 11 NaN
1 NaN NaN
2 NaN 20
3 16 30
Then the above code does not work anymore since the indices of the dataframes are different. Could someone tell me how the line
s[s.isnull().values] = d[s.isnull().values][s.name]
has to be modified in order to get the same result as above?
You could temporarily change the indexes on df1 to be the same as df2 and just combine_first with df2:
df2.combine_first(df1.set_index(df2.index))
      A     B
0  11.0  10.0
1   2.0  11.0
2   3.0  20.0
3  16.0  30.0
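The combine_first approach as a self-contained sketch, using the mismatched-index df2 from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': range(1, 5), 'B': range(10, 14)}, index=list('abcd'))
df2 = pd.DataFrame({'A': [11, np.nan, np.nan, 16],
                    'B': [np.nan, np.nan, 20, 30]})  # default index 0..3

# Borrow df2's index for df1 so the rows align positionally,
# then keep df2's values and fill its NaNs from df1
result = df2.combine_first(df1.set_index(df2.index))
print(result)
```

The columns stay float because df2 contained NaN; a final .astype(int) would restore integers if no NaN remains.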
