I am using the Dash DataTable code to create the table in Plotly/Python. I would like to format the various rows in the Value column. For example, I would like to format Row[1] with a $ sign, and Row[2] with %. TIA
#Row  KPI   Value
0     AA        1
1     BB     $230
2     CC      54%
3     DD      5.6
4     EE   $54000
I have been looking into this issue as well. Unfortunately, I didn't succeed with anything built-in either. If you do in the future, please let me know.
However, the solution that I implemented was the following function to easily change the format of DataFrame elements to strings with the formatting I would like:
import pandas as pd

def dt_formatter(df: pd.DataFrame,
                 formatter: str,
                 slicer: pd.IndexSlice = None) -> pd.DataFrame:
    if slicer is None:
        # No slicer given: apply the format string to every column.
        for col in df.columns:
            df[col] = df[col].apply(formatter.format)
        return df
    else:
        # Format only the sliced subset, then write it back.
        dfs = df.loc[slicer].copy()
        for col in dfs.columns:
            dfs[col] = dfs[col].apply(formatter.format)
        df.loc[slicer] = dfs
        return df
and then using your regular slicing / filtering with your base dataframe df. Assuming your base df looks like this:
>>> df
#Row  KPI  Value
0     AA       1
1     BB     230
2     CC      54
3     DD     5.6
4     EE   54000
>>> df = dt_formatter(df, '{:.0f}%', pd.IndexSlice[df['#Row'] == 1, ['Value']])
>>> df
#Row  KPI  Value
0     AA       1
1     BB    230%
2     CC      54
3     DD     5.6
4     EE   54000
Using a different slicer and a different formatting string, you could "build" your DataFrame with such a helper function.
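For example, here is a minimal sketch reusing the helper above to answer the original question; the '${:,.0f}' and '{:.0f}%' format strings and the row selections are my own illustration:

# Dollar format on rows 1 and 4, percent format on row 2, starting from the numeric base df.
df = dt_formatter(df, '${:,.0f}', pd.IndexSlice[df['#Row'].isin([1, 4]), ['Value']])
df = dt_formatter(df, '{:.0f}%', pd.IndexSlice[df['#Row'] == 2, ['Value']])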
I have a dataframe:
import pandas as pd

d = {'page_number': [0,0,0,0,0,0,1,1,1,1],
     'text': ['aa','ii','cc','dd','ee','ff','gg','hh','ii','jj']}
df = pd.DataFrame(data=d)
df
page_number text
0 0 aa
1 0 ii
2 0 cc
3 0 dd
4 0 ee
5 0 ff
6 1 gg
7 1 hh
8 1 ii
9 1 jj
I want to spot the page_number where 'gg' appears. On the same page_number there can be many different substrings, but I'm interested in extracting the row number where 'ii' appears on the same page_number as 'gg' (I'm not interested in the other appearances of 'ii').
import numpy as np

idx = np.where(df['text'].str.contains(r'gg', na=True))[0][0]
won't necessarily help here, as it retrieves the row number of 'gg' but not its 'page_number'.
Many thanks
You first leave only 'ii' and 'gg' appearances:
df = df[df['text'].isin(['ii', 'gg'])]
Then, grouping by page number, we can assume that whenever the count is 2, both appear on the same page:
df2 = df.groupby('page_number').count()
df2[df2['text'] == 2]
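From there, a sketch of how you could pull the 'ii' row numbers out of the filtered frame (pages and ii_rows are names I'm introducing):

# Pages holding both 'gg' and 'ii', then the original row numbers of 'ii' on those pages.
pages = df2[df2['text'] == 2].index
ii_rows = df[df['page_number'].isin(pages) & (df['text'] == 'ii')].index
print(list(ii_rows))  # [8] for the sample data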
You can use pandas to retrieve column value on the basis of another column value. I hope this will retrieve what you are looking for.
df[df['text']=='gg']['page_number']
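Combining the two lookups, a sketch (page and ii_rows are illustrative names):

# Page where 'gg' appears, then the row numbers of 'ii' on that same page.
page = df.loc[df['text'] == 'gg', 'page_number'].iloc[0]
ii_rows = df.index[(df['page_number'] == page) & (df['text'] == 'ii')]
print(list(ii_rows))  # [8] for the sample data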
In case you have several 'gg's and 'ii's on any page:
This aggregates the text column to one boolean per page (the result is a one-column DataFrame):
df = df.groupby(by='page_number').agg(lambda x: 'gg' in x.values and 'ii' in x.values)
And this will get you the page numbers:
df[df.text].index
I have a dataframe with columns for levels of hierarchy (similar to a directory path). I am trying to keep only the records with the latest generation in the levels (the leaves of the hierarchy tree). I tried a couple of ways with transform and groupby but was unable to get the desired output.
Code
import numpy as np
import pandas as pd
df = pd.DataFrame({'lvl1':['aa','aa','aa','aa','bb','bb','bb','bb','cc','aa'],
'lvl2':[np.nan,'xx','xx','xx',np.nan,'yy','yy','zz',np.nan,'sa'],
'lvl3':[np.nan,np.nan,'ww','qq',np.nan,np.nan,'rr',np.nan,np.nan,'jj'],
'value':[12,4,7,22,76,0,18,47,10,2]})
result = pd.DataFrame({'lvl1':['aa','aa','bb','bb','cc','aa'],
'lvl2':['xx','xx','yy','zz',np.nan,'sa'],
'lvl3':['ww','qq','rr',np.nan,np.nan,'jj'],
'value':[7,22,18,47,10,2]})
Appreciate your help
If you need to filter the rows with the maximum number of unique level values, and for each smaller count keep only the last row, use:
s = df[["lvl1", "lvl2", "lvl3"]].nunique(axis=1)
# if you need to test the number of non-missing values instead, use count
# s = df[["lvl1", "lvl2", "lvl3"]].count(axis=1)
df = df[~s.duplicated(keep='last') | s.eq(s.max())]
print(df)
lvl1 lvl2 lvl3 value
2 aa xx ww 7
3 aa xx qq 22
6 bb yy rr 18
7 bb zz NaN 47
8 cc NaN NaN 10
9 aa sa jj 2
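To see why this works on the sample data, it may help to print the intermediate s computed above:

print(s.tolist())
# [1, 2, 3, 3, 1, 2, 3, 2, 1, 3]
# s.eq(s.max()) keeps the fully specified rows 2, 3, 6 and 9;
# ~s.duplicated(keep='last') additionally keeps rows 7 and 8, the last
# rows with counts 2 and 1 respectively.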
I have a cell containing 2 characters, and sometimes 3.
I need to format the cell like:
<2 spaces>XX<2 spaces>
and if it contains 3 characters:
<2 spaces>XXX<1 space>.
I use a new-style format
dx['C'] = dx['C'].map('{:^4s}'.format)
Note: dx['C'] is a column in a pandas table.
Given:
C
0 aaa
1 aa
Doing:
# Prepend one space, then center to width 6; the extra space biases the
# padding left: '  aa  ' for 2 characters, '  aaa ' for 3.
df.C = (' ' + df.C).str.center(6)
Output:
C
0 aaa
1 aa
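Equivalently, since the requirement is always two leading spaces padded out to width 6, a one-line sketch:

df.C = ('  ' + df.C).str.ljust(6)  # gives '  aa  ' and '  aaa '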
I have a dataframe
df = pd.DataFrame({'col1': [1,2,1,2], 'col2': ['aa bb cc', 'ee-ff-gg', 'hh ii kk', 'll-mm-nn']})
I want to:
Split col2 on the ' ' where col1==1
Split on the '-' where col1==2
Append this data to 3 new columns: (col20, col21, col22)
Ideally the code would look like this:
subdf=df.loc[df['col1']==1]
#list of columns to use
col_list=['col20', 'col21', 'col22']
#append to dataframe new columns from split function
subdf[col_list] = subdf.col2.str.split(' ', n=2, expand=True)
However, this hasn't worked.
I have tried using merge and join, however:
join doesn't work if the columns are already populated
merge doesn't work if they aren't.
I have also tried:
#subset dataframes
subdf=df.loc[df['col1']==1]
subdf2=df.loc[df['col1']==2]
#trying the join method, only works if columns aren't already present
subdf.join(subdf.col2.str.split(' ', n=2, expand=True).rename(columns={0: 'col20', 1: 'col21', 2: 'col22'}))
#merge doesn't work if columns aren't present
subdf2 = subdf2.merge(subdf2.col2.str.split('-', n=2, expand=True).rename(columns={0: 'col20', 1: 'col21', 2: 'col22'}))
subdf2
the error message when I run it:
subdf2 = subdf2.merge(subdf2.col2.str.split('-', n=2, expand=True).rename(columns={0: 'col20', 1: 'col21', 2: 'col22'}))
MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
EDIT: giving information after Mark's comment on regex
My original col1 was actually the regex combination I had used to extract col2 from some strings.
# the combinations I used to extract col2
combinations = [r'(\d+)[-](\d+)[-](\d+)[-](\d+)', r'(\d+)[-](\d+)[-](\d+)', ... ]
here is the original dataframe
col1 col2
(\d+)[-](\d+)[-](\d+)[-](\d+) 350-300-50-10
(\d+)[-](\d+)[-](\w+)(\d+) 150-180-G31
I then created a dictionary that connected every combination to what the split values of col2 represented:
filtermap = {r'(\d+)[-](\d+)[-](\w+)(\d+)': 'thickness temperature sample', r'(\d+)[-](\d+)[-](\d+)[-](\d+)': 'thickness temperature width height'}
with this filter I wanted to:
Subset the dataframe based on the regex combinations
use split on col2 to find the values corresponding to the combination using the filtermap (thickness, temperature, ...)
add these values to the new columns of the dataframe
col1 col2 thickness temperature width length sample
(\d+)[-](\d+)[-](\d+)[-](\d+) 350-300-50-10 350 300 50 10
(\d+)[-](\d+)[-](\w+)(\d+) 150-180-G31 150 180 G31
Since you mentioned regex, maybe you know of a way to do this directly?
EDIT 2: input/output
in the input there are strings like so:
'this is the first example string 350-300-50-10 ',
'this is the second example string 150-180-G31'
formats that are:
number-number-number-number (350-300-50-10) have this ordered information in them: thickness(350)-temperature(300)-width(50)-length(10)
number-number-letternumber (150-180-G31) have this ordered information in them: thickness-temperature-sample
desired output:
col2           thickness  temperature  width  length  sample
350-300-50-10  350        300          50     10      None
150-180-G31    150        180          None   None    G31
I used e.g.:
re.search(r'(\d+)[-](\d+)[-](\d+)[-](\d+)', s)
to find the col2 value in each string s.
You can use np.where to simplify this problem.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1,2,1,2],
'col2': ['aa bb cc', 'ee-ff-gg', 'hh ii kk', 'll-mm-nn']
})
temp = np.where(df['col1'] == 1, #a boolean array/series indicating where the values are equal to 1.
df['col2'].str.split(' '), #Use the output of this if True
df['col2'].str.split('-') #Else use this.
)
temp_df = pd.DataFrame(temp.tolist()) #create a new dataframe with the columns we need
#Output:
0 1 2
0 aa bb cc
1 ee ff gg
2 hh ii kk
3 ll mm nn
Now just assign the result back to the original df. You can use a concat or join, but a simple assignment suffices as well.
df[[f'col2_{i}' for i in temp_df.columns]] = temp_df
print(df)
col1 col2 col2_0 col2_1 col2_2
0 1 aa bb cc aa bb cc
1 2 ee-ff-gg ee ff gg
2 1 hh ii kk hh ii kk
3 2 ll-mm-nn ll mm nn
EDIT: To address more than two conditional splits
If you need more than two conditions, note that np.where is only designed for a binary selection. You can opt for a "custom" approach that works with as many splits as you like.
splits = [ ' ', '-', '---']
all_splits = pd.DataFrame({s:df['col2'].str.split(s).values for s in splits})
#Output:
- ---
0 [aa, bb, cc] [aa bb cc] [aa bb cc]
1 [ee-ff-gg] [ee, ff, gg] [ee-ff-gg]
2 [hh, ii, kk] [hh ii kk] [hh ii kk]
3 [ll-mm-nn] [ll, mm, nn] [ll-mm-nn]
First we split df['col2'] on all splits, without expanding. Now it's just a question of selecting the correct list based on the value of df['col1'].
We can use numpy's advanced indexing for this.
# Row i takes the split from column col1 - 1, i.e. col1 acts as a 1-based selector.
temp = all_splits.values[np.arange(len(df)), df['col1'] - 1]
After this point, the steps are the same as above, starting with creating temp_df.
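Spelled out, that is just the assignment step from earlier, repeated here for completeness:

temp_df = pd.DataFrame(temp.tolist())                 # expand the chosen lists into columns
df[[f'col2_{i}' for i in temp_df.columns]] = temp_df  # assign back to the original frame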
You are pretty close. To generate a column based on some condition, where is often handy; see the code below,
col2_exp1 = df.col2.str.split(' ',expand=True)
col2_exp2 = df.col2.str.split('-',expand=True)
col2_combine = (col2_exp1.where(df.col1.eq(1), col2_exp2)
                .rename(columns=lambda x: f'col2{x}'))
Finally,
df.join(col2_combine)
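For reference, on the sample frame from the question the joined result should look like this:

print(df.join(col2_combine))
#    col1      col2 col20 col21 col22
# 0     1  aa bb cc    aa    bb    cc
# 1     2  ee-ff-gg    ee    ff    gg
# 2     1  hh ii kk    hh    ii    kk
# 3     2  ll-mm-nn    ll    mm    nn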
I have a CSV file with 6 cols. I load it into memory and process it with some methods. My result is a data frame with 4 cols that looks like:
name number Allele Allele
aaa 111 A B
aab 112 A A
aac 113 A B
But now I get a CSV in another format (not Illumina) and I need to change it to the above.
I have a result:
name number Allele1 Allele2
aaa 111 A C
aab 112 A G
aac 113 G G
I know how to change the format, for example AG == AB, GG == AA, CC == AA (too), etc.
But is there a better way to do this than a for loop?
Let's say:
for line in range(len(dataframe)):
    if dataframe.Allele1[line] == 'A' and dataframe.Allele2[line] == 'G':
        dataframe.Allele1[line] = 'A'
        dataframe.Allele2[line] = 'B'
    elif ...:
        # etc.
I feel that this is not the best method to accomplish this task. Maybe there is a better way in pandas or just Python?
I need to change that format to the Illumina format because the database deals with Illumina.
And: in Illumina AA = AA, CC, GG; AB = AC, AG, AT, CT, GT; BB = CG, TT etc.
So if row[1] has A in col Allele1 and T in Allele2, the edited row will be: Allele1 = A, Allele2 = B.
The expected result is:
name number Allele1 Allele2
aaa 111 A B
aab 112 A B
aac 113 A A
In the result I MUST have 4 cols.
Have you tried using pandas.DataFrame.replace? For instance:
df['Allele1'] = df['Allele1'].replace(['GC', 'CC'], 'AA')
With that line you replace, in the column "Allele1", the values GC and CC with the one you are looking for, AA. You can apply that logic to all the substitutions you need, and if you want to do it on the whole dataframe, just don't specify the column; do instead something like:
df = df.replace(['GC', 'CC'], 'AA')
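Note that the Illumina code depends on the pair of alleles rather than on each column separately, so a per-column replace may not be enough. Here is a sketch that maps the concatenated pair using the mapping listed in the question (pairs, mapping and converted are my own names):

# Map each two-letter pair to its Illumina code, then split back into two columns.
pairs = df['Allele1'] + df['Allele2']  # e.g. 'AC', 'AG', 'GG'
mapping = {'AA': 'AA', 'CC': 'AA', 'GG': 'AA',
           'AC': 'AB', 'AG': 'AB', 'AT': 'AB', 'CT': 'AB', 'GT': 'AB',
           'CG': 'BB', 'TT': 'BB'}
converted = pairs.map(mapping)
df['Allele1'] = converted.str[0]
df['Allele2'] = converted.str[1]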
You can try this (to convert AG to AB):
mask = (df['Allele1'] == 'A') & (df['Allele2'] == 'G')
df.loc[mask, 'Allele1'] = 'A'
df.loc[mask, 'Allele2'] = 'B'