I have a dataframe and I'd like to insert a blank row as a separator whenever the value in the first column changes.
For example:
Column 1 Col2 Col3 Col4
A s b d
A s j k
A b d q
B b a d
C l k p
becomes:
Column 1 Col2 Col3 Col4
A s b d
A s j k
A b d q

B b a d

C l k p
because the value in Column 1 changed
The only way I have figured out how to do this is using VBA, as shown in the accepted answer here:
How to automatically insert a blank row after a group of data
But I need to do this in Python.
Any help would be really appreciated!
Create a helper DataFrame whose index values are the positions where the value last changes plus .5, join it to the original with concat, sort the index with sort_index, create a default index with reset_index, and finally remove the last (blank) row by position with iloc:
mask = df['Column 1'].ne(df['Column 1'].shift(-1))
df1 = pd.DataFrame('', index=mask.index[mask] + .5, columns=df.columns)
df = pd.concat([df, df1]).sort_index().reset_index(drop=True).iloc[:-1]
print(df)
Column 1 Col2 Col3 Col4
0 A s b d
1 A s j k
2 A b d q
3
4 B b a d
5
6 C l k p
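For reference, a minimal setup that reproduces the frame used above (a sketch; the column names and values are taken from the question):

import pandas as pd

df = pd.DataFrame({'Column 1': ['A', 'A', 'A', 'B', 'C'],
                   'Col2': ['s', 's', 'b', 'b', 'l'],
                   'Col3': ['b', 'j', 'd', 'a', 'k'],
                   'Col4': ['d', 'k', 'q', 'd', 'p']})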
Related: Insert Blank Row In Python Data frame when value in column changes?
I basically want the solution from that question to be applied to 2 blank rows. I've messed around with the solution but don't understand the code enough to alter it correctly.
You can do:
num_empty_rows = 2
df = (df.groupby('Col1', as_index=False)
        .apply(lambda g: pd.concat([g, pd.DataFrame(data=[[''] * len(df.columns)] * num_empty_rows,
                                                    columns=df.columns)]))
        .reset_index(drop=True)
        .iloc[:-num_empty_rows])
As you can see, a frame of num_empty_rows empty rows is appended after each group, and then reset_index is performed at the end. The final iloc[:-num_empty_rows] is optional; it removes the empty rows that end up after the last group.
Example input:
df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'C'],
'Col2':['s','s','b','b','l'],
'Col3':['b','j','d','a','k'],
'Col4':['d','k','q','d','p']
})
Output:
Col1 Col2 Col3 Col4
0 A s b d
1 A s j k
2 A b d q
3
4
5 B b a d
6
7
8 C l k p
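As a related sketch, the fractional-index trick from the first answer above can also be generalized to any number of blank rows (this assumes a default RangeIndex and the Col1 column name used here):

num_empty_rows = 2
mask = df['Col1'].ne(df['Col1'].shift(-1))        # True at the last row of each group
blanks = pd.DataFrame(
    '',
    index=[i + (j + 1) / (num_empty_rows + 1)     # fractional positions after each group
           for i in mask.index[mask]
           for j in range(num_empty_rows)],
    columns=df.columns)
df = (pd.concat([df, blanks])
        .sort_index()
        .reset_index(drop=True)
        .iloc[:-num_empty_rows])                  # drop the trailing blank rows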
I have a pandas dataframe and a dictionary whose values are lists. The values in the lists are unique across all the keys. I want to add a new column to my dataframe that holds, for each value in col1, the dictionary key whose list contains it. E.g. suppose I have a dataframe like this
import pandas as pd
d = {'a': 1, 'b': 2, 'c': 2, 'd': 4, 'e': 7}
df = pd.DataFrame.from_dict(d, orient='index', columns=['col2'])
df = df.reset_index().rename(columns={'index': 'col1'})
df
col1 col2
0 a 1
1 b 2
2 c 2
3 d 4
4 e 7
Now I also have a dictionary like this
my_dict = {'x':['a', 'c'], 'y':['b'], 'z':['d', 'e']}
I want the output like this
col1 col2 col3
0 a 1 x
1 b 2 y
2 c 2 x
3 d 4 z
4 e 7 z
Presently I am doing this by reversing the dictionary first, i.e. like this
my_dict_rev = {value:key for key in my_dict for value in my_dict[key]}
df['col3']= df['col1'].map(my_dict_rev)
df
But I am sure that there must be some direct method.
I know this is an old question but here are two other ways to do the same job. First convert my_dict to a Series object, then explode it. Then reverse the mapping and use map:
tmp = pd.Series(my_dict).explode()
df['col3'] = df['col1'].map(pd.Series(tmp.index, tmp))
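For illustration, the exploded Series maps each key to its individual list elements (shown here for the my_dict above):

print(tmp)
# x    a
# x    c
# y    b
# z    d
# z    e
# dtype: object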
Another option starts the same way, but uses merge instead of map:
df = df.merge(pd.Series(my_dict, name='col1').explode().rename_axis('col3').reset_index())
Output:
col1 col2 col3
0 a 1 x
1 b 2 y
2 c 2 x
3 d 4 z
4 e 7 z
I have a list and a dataframe such as:
the_list =['LjHH','Lhy_kd','Ljk']
COL1 COL2
A ADJJDUD878_Lhy_kd
B Y0_0099JJ_Ljk
C YTUUDBBDHHD
D POL0990E_LjHH'
And I would like to add a new COL3 column: whenever COL2 contains a match with a value in the_list, COL3 should hold the matching element of the_list.
Expected result:
COL1 COL2 COL3
A ADJJDUD878_Lhy_kd Lhy_kd
B Y0_0099JJ_Ljk Ljk
C YTUUDBBDHHD NA
D POL0990E_LjHH' LjHH
To get only the first matched value, use Series.str.extract with the list values joined by | into a regex alternation:
the_list = ['LjHH', 'Lhy_kd', 'Ljk']
df['COL3'] = df['COL2'].str.extract(f'({"|".join(the_list)})', expand=False)
print(df)
COL1 COL2 COL3
0 A ADJJDUD878_Lhy_kd Lhy_kd
1 B Y0_0099JJ_Ljk Ljk
2 C YTUUDBBDHHD NaN
3 D POL0990E_LjHH' LjHH
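If the entries of the_list could contain regex metacharacters, it is safer to escape them before building the pattern (a small precaution; not needed for the values shown here):

import re

pattern = '|'.join(map(re.escape, the_list))              # escape any regex metacharacters
df['COL3'] = df['COL2'].str.extract(f'({pattern})', expand=False)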
To get all matched values (there may be several per row) use Series.str.findall with Series.str.join, and finally replace empty strings with NaN:
import numpy as np

the_list = ['LjHH', 'Lhy_kd', 'Ljk']
df['COL3'] = df['COL2'].str.findall('|'.join(the_list)).str.join(',').replace('', np.nan)
print(df)
COL1 COL2 COL3
0 A ADJJDUD878_Lhy_kd Lhy_kd
1 B Y0_0099JJ_Ljk Ljk
2 C YTUUDBBDHHD NaN
3 D POL0990E_LjHH' LjHH
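For illustration, if a value contained two of the list entries, the findall variant would join all matches (a made-up example value, not from the data above):

example = pd.Series(['XX_LjHH_Ljk'])
print(example.str.findall('|'.join(the_list)).str.join(','))
# 0    LjHH,Ljk
# dtype: object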
I have two dataframes with the same number of columns but different numbers of rows.
df1
col1 col2
0 a 1,2,3,4
1 b 1,2,3
2 c 1
df2
col1 col2
0 b 1,3
1 c 1,2
2 d 1,2,3
3 e 1,2
df1 is the existing list, df2 is the updated list. The expected result is whatever is in df2 that was not previously in df1.
Expected result:
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
I've tried
mask = df1['col2'] != df2['col2']
but it doesn't work because the dataframes have different numbers of rows.
Use DataFrame.explode on the split values of column col2, then use DataFrame.merge with a right join and the indicator parameter, filter with boolean indexing for the right_only rows, and finally aggregate with join:
df11 = df1.assign(col2 = df1['col2'].str.split(',')).explode('col2')
df22 = df2.assign(col2 = df2['col2'].str.split(',')).explode('col2')
df = df11.merge(df22, indicator=True, how='right', on=['col1','col2'])
df = (df[df['_merge'].eq('right_only')]
.groupby('col1')['col2']
.agg(','.join)
.reset_index(name='col2'))
print (df)
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
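For completeness, a minimal setup that reproduces the two frames above (a sketch; col2 is assumed to hold comma-separated strings):

import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c'],
                    'col2': ['1,2,3,4', '1,2,3', '1']})
df2 = pd.DataFrame({'col1': ['b', 'c', 'd', 'e'],
                    'col2': ['1,3', '1,2', '1,2,3', '1,2']})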
I have a dataframe:
df
Col1 Col2 Col3
A B 5
C D 4
E F 1
I want to see only those rows which contribute to 90% of the total of Col3. In this case the expected output will be:
Col1 Col2 Col3
A B 5
C D 4
I tried the below but it doesn't work as expected:
df['col3'].value_counts(normalize=True) * 100
Is there any solution for the same?
Are you looking for this?
df = df[df.Col3 > 0]  # optionally remove 0 valued rows
df = df.sort_values(by='Col3', ascending=False).reset_index(drop=True)
totals = df.Col3.cumsum()
cutoff = totals[totals >= df.Col3.sum() * .9].idxmin()  # first row where the running sum reaches 90%
print(df[:cutoff + 1])
Output
Col1 Col2 Col3
0 A B 5
1 C D 4
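As a quick sanity check on the sample data: the running totals after sorting are 5, 9, 10, and 90% of the grand total (10) is 9, which is first reached at row index 1, so rows 0 and 1 are kept.

print(totals.tolist())   # [5, 9, 10]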
@RSM, when you say 90% of the data, do you want the calculation of 90% to always start from the top, or do you need it to be random?
import pandas as pd
import numpy as np
from io import StringIO
d = '''Col1 Col2 Col3
A B 5
C D 4
E F 1'''
df = pd.read_csv(StringIO(d), sep=r'\s+')
total_value = df['Col3'].sum()
target_value = 0.9 * total_value
df['Cumulative_Sum'] = df['Col3'].cumsum()
desired_df = df.loc[df['Cumulative_Sum'] <= target_value]
print(desired_df)
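With the sample data this keeps the rows whose running total stays within 90% of the sum (the A and C rows); the helper column can be dropped afterwards if it is not wanted:

print(desired_df.drop(columns='Cumulative_Sum'))
#   Col1 Col2  Col3
# 0    A    B     5
# 1    C    D     4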