Replicate rows in Pandas DataFrame based on condition

I have a Pandas DataFrame on which I need to replicate some of the rows based on the presence of a given list of values in certain columns. If a row contains one of these values in the specified columns, then I need to replicate that row.
df = pd.DataFrame({"User": [1, 2], "col_01": ["C", "A"], "col_02": ["A", "C"], "col_03": ["B", "B"], "Block": ["01", "03"]})
   User col_01 col_02 col_03 Block
0     1      C      A      B    01
1     2      A      C      B    03
values = ["C", "D"]
columns = ["col_01", "col_02", "col_03"]
rep_times = 3
Given these lists of values and columns, each row that contains either 'C' or 'D' in any of the columns 'col_01', 'col_02', or 'col_03' has to be repeated rep_times times, so the output table has to look like this:
   User col_01 col_02 col_03 Block
0     1      C      A      B    01
1     1      C      A      B    01
2     1      C      A      B    01
3     2      A      C      B    03
4     2      A      C      B    03
5     2      A      C      B    03
I tried something like the following, but it doesn't work and I don't know how to build the final table. Ideally it would be a one-line operation.
df2 = pd.DataFrame((pd.concat([row] * rep_times, axis=0, ignore_index=True)
                    if any(x in values for x in list(row[columns])) else row
                    for index, row in df.iterrows()), columns=df.columns)

import pandas as pd

First create a boolean mask that checks your condition using the isin() method:

mask = df[columns].isin(values).any(axis=1)
Then use reindex() to repeat the matching rows rep_times, and concatenate the rows that don't satisfy the condition back on (DataFrame.append was deprecated and then removed in pandas 2.0, so pd.concat is used here):

df = pd.concat([df.reindex(df[mask].index.repeat(rep_times)), df[~mask]])
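Putting it together on the sample frame (a minimal sketch of the two lines above; the reset_index at the end just renumbers the result):

import pandas as pd

df = pd.DataFrame({"User": [1, 2], "col_01": ["C", "A"], "col_02": ["A", "C"],
                   "col_03": ["B", "B"], "Block": ["01", "03"]})
values = ["C", "D"]
columns = ["col_01", "col_02", "col_03"]
rep_times = 3

mask = df[columns].isin(values).any(axis=1)  # True where any listed column holds C or D
out = pd.concat([df.reindex(df[mask].index.repeat(rep_times)),  # repeat the matches
                 df[~mask]])                                    # keep the rest once
print(out.reset_index(drop=True))

Both sample rows contain "C", so each is repeated three times and df[~mask] is empty here.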

Related

How to remove values that are not in a list from a pandas dataframe?

The following code creates a dataframe:

import pandas as pd

data = [['A', 'B', 'D'], ['A', 'D'], ['F', 'G', 'C', 'B', 'A']]
df = pd.DataFrame(data)
df
My goal is to remove the values from the dataframe that are not in the list below.
list_items = ['A','B','C']
My expected output keeps only the listed values in each row, compacted to the left.
I have tried traversing the values in loops and checking them one by one, but the dataframe is very large (9108 × 1616) and the list has over 130 items to check, so the code takes too long to run. Please suggest the most efficient way to achieve the expected output.
I don't think doing it in pandas is a good idea, as the columns don't matter here. It's easier to do it with lists, which you can convert back to a pandas dataframe at the end if you really need one.
# convert df to list of lists
data = df.values.tolist()
# filter each element of the list to contain only list_items values
data_filtered = [[el for el in l if el in list_items] for l in data]
# convert back to dataframe
df_filtered = pd.DataFrame(data_filtered)
print(df_filtered)
# 0 1 2
#0 A B None
#1 A None None
#2 C B A
Let us try not to use a for loop:
s = df.where(df.isin(list_items)).reset_index().melt('index').dropna()
s = (s.assign(Key=s.groupby('index').cumcount())
      .pivot(index='index', columns='Key', values='value'))
Key    0    1    2
index
0      A    B  NaN
1      A  NaN  NaN
2      C    B    A
Method two, not good for a large dataframe:
s = (df.where(df.isin(list_items))
       .T.apply(lambda x: sorted(x, key=pd.isnull))
       .T.dropna(thresh=1, axis=1))
   0    1    2
0  A    B  NaN
1  A  NaN  NaN
2  C    B    A

What's the fastest way to select values from columns based on keys in another column in pandas?

I need a fast way to extract the right values from a pandas dataframe:
Given a dataframe with (a lot of) data in several named columns, plus an additional column whose values contain only the names of the other columns, how do I select values from the data columns using the additional column as keys?
It's simple to do via an explicit loop, but that is extremely slow with something like .iterrows() directly on the DataFrame. Converting to numpy arrays first is faster, but still not fast enough. Can I combine pandas methods to do it even faster?
Example: This is the kind of DataFrame structure, where columns A and B contain data and column keys contains the keys to select from:
import pandas
df = pandas.DataFrame(
    {'A': [1, 2, 3, 4],
     'B': [5, 6, 7, 8],
     'keys': ['A', 'B', 'B', 'A']},
)
print(df)
output:
Out[1]:
   A  B keys
0  1  5    A
1  2  6    B
2  3  7    B
3  4  8    A
Now I need some fast code that returns a DataFrame like
Out[2]:
   val_keys
0         1
1         6
2         7
3         4
I was thinking something along the lines of this:
tmp = df.melt(id_vars=['keys'], value_vars=['A', 'B'])
out = tmp.loc[tmp['keys'] == tmp['variable']]
which produces:
Out[2]:
  keys variable  value
0    A        A      1
3    A        A      4
5    B        B      6
6    B        B      7
but doesn't have the right order or index. So it's not quite a solution.
Any suggestions?
See if either of these works for you (both need import numpy as np):
df['val_keys'] = np.where(df['keys'] == 'A', df['A'], df['B'])
or
df['val_keys'] = np.select([df['keys'] == 'A', df['keys'] == 'B'],
                           [df['A'], df['B']])
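If there are many data columns, spelling each one out in np.where/np.select gets unwieldy. A vectorized lookup with NumPy integer indexing generalizes (a sketch, assuming the data-bearing columns are exactly ['A', 'B']):

import numpy as np
import pandas as pd

data_cols = ['A', 'B']                                  # the data-bearing columns
col_idx = pd.Index(data_cols).get_indexer(df['keys'])   # column position for each row's key
df['val_keys'] = df[data_cols].to_numpy()[np.arange(len(df)), col_idx]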
No need to hard-code the column names for the code below!
def value(row):
    a = row.name         # the row's index label
    b = row['keys']      # the column name stored in 'keys'
    c = df.loc[a, b]     # look up the value at that (row, column) pair
    return c

df.apply(value, axis=1)
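To keep the result as a column, as in the desired output, assign it back (a small usage note):

df['val_keys'] = df.apply(value, axis=1)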
Have you tried filtering and then mapping?
df_A = df[df['keys'].isin(['A'])]
df_B = df[df['keys'].isin(['B'])]
A_dict = dict(zip(df_A['keys'], df_A['A']))
B_dict = dict(zip(df_B['keys'], df_B['B']))
df['val_keys'] = df['keys'].map(A_dict)
df['val_keys'] = df['keys'].map(B_dict).fillna(df['val_keys'])  # non-exhaustive mapping for the second one
Note that dict(zip(...)) keeps only the last value per key, so this reproduces the val_keys output only when each key corresponds to a single value.
If you want you can just retain that column as in your expected output by:
df = df[['val_keys']]
Hope this helps :))

How to modify data after replicating rows in Pandas?

I am trying to edit values after making duplicate rows in Pandas.
I want to edit only one column ("code"), but I see that since the rows are duplicates, editing will affect the original rows as well.
Is there any method to first create the duplicates and then modify the data only in the duplicates created?
import pandas as pd
df = pd.read_excel('so.xlsx')
a = df['code'] == 1234
b = df[a]
df=df.append(b)
print('\n\nafter replicate')
print(df)
Current output after making duplicates is as below:
  coun  code name
0    A   123   AR
1    F   123   AD
2    N     7   AR
3    I     0   AA
4    T    10   AS
2    N     7   AR
3    I     0   AA
Now I want to change values only in the duplicates created, in this case the bottom two rows. But I see that the indexes are duplicated as well.
You can avoid the duplicate indices by using the ignore_index argument (note that DataFrame.append was removed in pandas 2.0; pd.concat is the modern replacement):
df = pd.concat([df, b], ignore_index=True)
You may also find it easier to modify your data in b, before appending it to the frame.
import pandas as pd

df = pd.read_excel('so.xlsx')
a = df['code'] == 3
b = df[a].copy()           # copy so the original rows stay untouched
b.loc[2, 'region'] = 'N'   # modify only the duplicate-to-be, avoiding chained assignment
df = pd.concat([df, b], ignore_index=True)
print('\n\nafter replicate')
print(df)
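A compact alternative is to build the edited duplicates in one step with assign (a sketch; the 'region' column and the new value 'N' are taken from the answer above and may need adapting to your data):

b = df[df['code'] == 3].assign(region='N')  # the duplicates, already edited
df = pd.concat([df, b], ignore_index=True)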

How to return the index (or column) label of a value if the column (or index) is known in pandas

How can I return the label of a column if the row is known as well as the value?
I have a pandas dataframe with rows labelled "A", "B", "C" and columns labelled "X", "Y", "Z". Knowing that a value is in a given row (e.g. A), I want the column returned. Looking at the example, I want "X" returned when I know that the value 1 is in row "A". How can this be achieved?
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
d = pd.DataFrame(data, index=["A", "B", "C"], columns=["X", "Y", "Z"])
   X  Y  Z
A  1  2  3
B  4  5  6
C  7  8  9
If you know 1 is in row A, use loc and take the index of the matching entries:
s = d.loc['A'].eq(1)
s[s].index
which returns
Index(['X'], dtype='object')
If you know there is only one cell with value 1 in your row, then use .item()
>>> s[s].index.item()
'X'
You can use dot:
d.eq(1).dot(d.columns).loc[lambda x: x != '']
A    X
dtype: object
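To see why the dot trick works: the boolean frame matrix-multiplies against the column labels, and for object dtype NumPy falls back on Python's operators, where True * 'X' gives 'X' and False * 'X' gives '', so each row concatenates the labels of its True cells (intermediate output reconstructed from the example frame):

d.eq(1).dot(d.columns)

A    X
B
C
dtype: object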

Selecting a subset of values in python

I have a pandas dataframe, df, which contains a feature ('alpha') whose values are letters from {'A','B',...,'G'}.
I'd like to select from df all rows which belong to a subset of this feature, say {'A','B','C'}.
What's the most 'pythonic' way to do this?
I was thinking something along the lines of:
subset = {'A','B','C'}
df1 = df[df['alpha'] == subset]
...but this generates an error:
"need more than 0 values to unpack"
I think you want to use isin to test for membership. Example:
In [79]:
subset = {'a','b','c'}
df = pd.DataFrame({'a':list('abasbvggcgasgfdasgcdce')})
df[df['a'].isin(subset)]
Out[79]:
    a
0   a
1   b
2   a
4   b
8   c
10  a
15  a
18  c
20  c
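For the complement, rows whose value is not in the subset, negate the mask with ~ (a small usage note):

df[~df['a'].isin(subset)]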
