How to change values in two DataFrame columns in Python

I have a CSV file with 6 columns. I load it into memory and process it with some methods. My result is a DataFrame with 4 columns that looks like:
name number Allele Allele
aaa 111 A B
aab 112 A A
aac 113 A B
But now I got a CSV with another format (not Illumina) and I need to convert it to the one above.
I have a result:
name number Allele1 Allele2
aaa 111 A C
aab 112 A G
aac 113 G G
I know how to change the format, for example AG == AB, GG == AA, CC == AA (too), etc.
But is there a better way to do this than a for loop?
Let's say:
for line in range(len(dataframe)):
    if dataframe.Allele1[line] == 'A' and dataframe.Allele2[line] == 'G':
        dataframe.Allele1[line] = 'A'
        dataframe.Allele2[line] = 'B'
    elif ...:
        # etc.
I feel that this is not the best method to accomplish this task. Maybe there is a better way in pandas or plain Python?
I need to change that format to the Illumina format because the database deals with Illumina.
And: in Illumina, AA = AA, CC, GG; AB = AC, AG, AT, CT, GT; BB = CG, TT, etc.
So if row[1] has A in column Allele1 and T in Allele2, the edited row will be: Allele1 = A, Allele2 = B.
The expected result is:
name number Allele1 Allele2
aaa 111 A B
aab 112 A B
aac 113 A A
In the result I MUST have 4 columns.

Have you tried using pandas.DataFrame.replace? For instance:
df['Allele1'].replace(['GC', 'CC'], 'AA')
With that line you replace the values GC and CC in the column "Allele1" with the one you're looking for, AA. You can apply that logic to all the substitutions you need, and if you want to do it on the whole DataFrame, just don't specify the column; instead do something like:
df.replace(['GC', 'CC'], 'AA')
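replace also accepts a dict, which is handy when there are several substitutions. A quick sketch, assuming the genotype were stored as a single two-letter string per cell (the 'genotype' column name and the mapping below are only illustrative):
genotype_map = {'AG': 'AB', 'AC': 'AB', 'AT': 'AB',
                'GG': 'AA', 'CC': 'AA', 'TT': 'BB'}
# hypothetical column holding two-letter genotype strings
df['genotype'] = df['genotype'].replace(genotype_map)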

You can try this (to convert AG to AB):
df.loc[(df['Allele1'] == 'A') & (df['Allele2'] == 'G'), 'Allele1'] = 'A'
df.loc[(df['Allele1'] == 'A') & (df['Allele2'] == 'G'), 'Allele2'] = 'B'
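Not from the original answers, but a vectorized sketch for the two-column layout: combine the alleles into a pair, map the pair through a dict, and split it back. The mapping dict here is only an illustrative subset of the Illumina table:
import pandas as pd

df = pd.DataFrame({'name': ['aaa', 'aab', 'aac'],
                   'number': [111, 112, 113],
                   'Allele1': ['A', 'A', 'G'],
                   'Allele2': ['C', 'G', 'G']})

# pair -> Illumina code; extend to every combination you need
genotype_map = {'AA': 'AA', 'CC': 'AA', 'GG': 'AA',
                'AC': 'AB', 'AG': 'AB', 'AT': 'AB', 'CT': 'AB', 'GT': 'AB',
                'CG': 'BB', 'TT': 'BB'}

pair = df['Allele1'] + df['Allele2']   # 'AC', 'AG', 'GG'
mapped = pair.map(genotype_map)        # 'AB', 'AB', 'AA'
df['Allele1'] = mapped.str[0]
df['Allele2'] = mapped.str[1]
print(df)   # alleles become A/B, A/B, A/A; name and number stay untouched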

Related

I want to apply multiple filters and change a column value accordingly in pandas [Working now]

Suppose I have a dataframe like so:
Fil1 Fil2 A B C D
a crossdev radio com Act 1 23 324
b crossdev webapp radio Act 4 45 343
a Streaming webapp radio Act 3 23 566
a crossdev com Act 1 12 746
The Fil1 column in the actual file contains a really long name that I'm filtering on, but here I'm referring to it as just 'a'.
The code I'm using is --
df.loc[(df['Fil1'] == 'a') & (df['Fil2'].str.contains('com')) , 'C'] = 0
df.loc[(df['Fil1'] == 'a') & (df['Fil2'].str.contains('com')) , 'D'] = 0
df.loc[(df['Fil1'] == 'a') & (df['Fil2'].str.contains('com')) , 'A'] = 'Fail'
I'm outputting this df to Excel.
Desired Excel Output:
Fil1 Fil2 A B C D
a crossdev radio com Fail 1 0 0
b crossdev webapp radio Act 4 45 343
a Streaming webapp radio Act 3 23 566
a crossdev com Fail 1 0 0
My code is not giving me any error, but it is also not giving me the desired result.
Is there any other workaround?
The code is working! It had no error.
The value that I referred to here as 'a' was messy in my real dataset, and that is what caused the problem.
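As an aside (not from the thread, just a sketch using the same column names), the three assignments share one condition, so it can be computed once and reused:
mask = (df['Fil1'] == 'a') & (df['Fil2'].str.contains('com'))
df.loc[mask, ['C', 'D']] = 0   # zero out both numeric columns in one go
df.loc[mask, 'A'] = 'Fail'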

Formatting specific rows in Dash Datatable with %, $, etc

I am using Dash DataTable to create the table in Plotly/Python. I would like to format the various rows in the Value column. For example, I would like to format Row[1] with a $ sign and Row[2] with %. TIA
#Row KPI Value
0 AA 1
1 BB $230.
2 CC 54%
3 DD 5.6.
4 EE $54000
I have been looking into this issue as well. Unfortunately, I didn't succeed with anything built-in either. If you do in the future, please let me know.
However, the solution that I implemented was the following function, which easily changes DataFrame elements to strings with the formatting I would like:
def dt_formatter(df: pd.DataFrame,
                 formatter: str,
                 slicer=None) -> pd.DataFrame:
    # No slicer given: format every element of every column.
    if slicer is None:
        for col in df.columns:
            df[col] = df[col].apply(formatter.format)
        return df
    # Slicer given: format only the selected cells and write them back.
    dfs = df.loc[slicer].copy()
    if isinstance(dfs, pd.Series):
        df.loc[slicer] = dfs.apply(formatter.format)
    else:
        for col in dfs.columns:
            dfs[col] = dfs[col].apply(formatter.format)
        df.loc[slicer] = dfs
    return df
and then using your regular slicing/filtering with your base dataframe df. Assuming your base df looks like this:
>>> df
#Row KPI Value
0 AA 1
1 BB 230
2 CC 54
3 DD 5.6
4 EE 54000
>>> df = dt_formatter(df, '{:.0f}%', pd.IndexSlice[df['#Row'] == 1, 'Value'])
>>> df
#Row KPI Value
0 AA 1
1 BB 230%
2 CC 54
3 DD 5.6
4 EE 54000
Using a different slicer and a different formatting string, you could "build" your DataFrame with such a helper function.
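For example, the $ row could be handled with something like this sketch (the format string and the slicer are illustrative):
>>> df = dt_formatter(df, '${:,.0f}', pd.IndexSlice[df['#Row'] == 4, 'Value'])   # Value in row 4 becomes '$54,000'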

Conditionally populate new columns using split on another column

I have a dataframe
df = pd.DataFrame({'col1': [1,2,1,2], 'col2': ['aa bb cc', 'ee-ff-gg', 'hh ii kk', 'll-mm-nn']})
I want to:
Split col2 on the ' ' where col1==1
Split on the '-' where col1==2
Append this data to 3 new columns: (col20, col21, col22)
Ideally the code would look like this:
subdf=df.loc[df['col1']==1]
#list of columns to use
col_list=['col20', 'col21', 'col22']
#append to dataframe new columns from split function
subdf[col_list] = subdf.col2.str.split(' ', 2, expand=True)
However, this hasn't worked.
I have tried using merge and join, however:
join doesn't work if the columns are already populated
merge doesn't work if they aren't.
I have also tried:
#subset dataframes
subdf=df.loc[df['col1']==1]
subdf2=df.loc[df['col1']==2]
#trying the join method, only works if columns aren't already present
subdf.join(subdf.col2.str.split(' ', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'}))
#merge doesn't work if columns aren't present
subdf2=subdf2.merge(subdf2.col2.str.split('-', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'}))
subdf2
The error message when I run it:
subdf2=subdf2.merge(subdf2.col2.str.split('-', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'}))
MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
EDIT: giving information after Mark's comment on regex
My original col1 was actually the regex combination I had used to extract col2 from some strings.
#the combination I used to extract the col2
combinations= ['(\d+)[-](\d+)[-](\d+)[-](\d+)', '(\d+)[-](\d+)[-](\d+)'... ]
here is the original dataframe
col1 col2
(\d+)[-](\d+)[-](\d+)[-](\d+) 350-300-50-10
(\d+)[-](\d+)[-](\w+)(\d+) 150-180-G31
I then created a dictionary that connected every combination to what the split values of col2 represented:
filtermap={'(\d+)[-](\d+)[-](\w+)(\d+)': 'thickness temperature sample', '(\d+)[-](\d+)[-](\d+)[-](\d+)': 'thickness temperature width height' }
With this filter I wanted to:
Subset the dataframe based on the regex combinations
Use split on col2 to find the values corresponding to the combination using the filtermap (thickness, temperature, ...)
Add these values to new columns on the dataframe
col1 col2 thickness temperature width length sample
(\d+)[-](\d+)[-](\d+)[-](\d+) 350-300-50-10 350 300 50 10
(\d+)[-](\d+)[-](\w+)(\d+) 150-180-G31 150 180 G31
Since you mentioned regex, maybe you know of a way to do this directly?
EDIT 2: input-output
In the input there are strings like:
'this is the first example string 350-300-50-10 ',
'this is the second example string 150-180-G31'
Formats that are:
number-number-number-number (350-300-50-10) have this ordered information in them: thickness(350)-temperature(300)-width(50)-length(10)
number-number-letternumber (150-180-G31) have this ordered information in them: thickness-temperature-sample
desired output:
col2, thickness, temperature, width, length, sample
350-300-50-10 350 300 50 10 None
150-180-G31 150 180 None None G31
I used e.g.:
re.search(r'(\d+)[-](\d+)[-](\d+)[-](\d+)', string)
to find col2 in the strings.
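Not part of the original answers, but since the question asks about doing the extraction directly with regex, here is a minimal sketch using str.extract with named groups. The patterns are adapted from the combinations above (the sample group is written as a letter followed by digits so the two patterns don't overlap):
import pandas as pd

s = pd.Series(['this is the first example string 350-300-50-10 ',
               'this is the second example string 150-180-G31'])

four_part = s.str.extract(r'(?P<thickness>\d+)-(?P<temperature>\d+)-(?P<width>\d+)-(?P<length>\d+)')
three_part = s.str.extract(r'(?P<thickness>\d+)-(?P<temperature>\d+)-(?P<sample>[A-Za-z]\d+)')

# keep the 4-part match where it exists, otherwise fall back to the 3-part one
result = four_part.combine_first(three_part)
print(result)   # row 0: 350/300/50/10 with sample NaN; row 1: 150/180 with sample G31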
You can use np.where to simplify this problem.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1, 2, 1, 2],
                   'col2': ['aa bb cc', 'ee-ff-gg', 'hh ii kk', 'll-mm-nn']})
temp = np.where(df['col1'] == 1,            # boolean array/series: True where col1 equals 1
                df['col2'].str.split(' '),  # use this split where True
                df['col2'].str.split('-'))  # otherwise use this one
temp_df = pd.DataFrame(temp.tolist()) #create a new dataframe with the columns we need
#Output:
0 1 2
0 aa bb cc
1 ee ff gg
2 hh ii kk
3 ll mm nn
Now just assign the result back to the original df. You can use a concat or join, but a simple assignment suffices as well.
df[[f'col2_{i}' for i in temp_df.columns]] = temp_df
print(df)
col1 col2 col2_0 col2_1 col2_2
0 1 aa bb cc aa bb cc
1 2 ee-ff-gg ee ff gg
2 1 hh ii kk hh ii kk
3 2 ll-mm-nn ll mm nn
EDIT: To address more than two conditional splits
If you need more than two conditions, np.where was only designed for a binary selection. You can opt for a "custom" approach that works with as many splits as you like.
splits = [ ' ', '-', '---']
all_splits = pd.DataFrame({s:df['col2'].str.split(s).values for s in splits})
#Output:
- ---
0 [aa, bb, cc] [aa bb cc] [aa bb cc]
1 [ee-ff-gg] [ee, ff, gg] [ee-ff-gg]
2 [hh, ii, kk] [hh ii kk] [hh ii kk]
3 [ll-mm-nn] [ll, mm, nn] [ll-mm-nn]
First we split df['col2'] on all the separators, without expanding. Now it's just a question of selecting the correct list based on the value of df['col1'].
We can use numpy's advanced indexing for this.
temp = all_splits.values[np.arange(len(df)), df['col1']-1]
After this point, the steps are the same as above, starting with creating temp_df.
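For completeness, those remaining steps are (a sketch, identical to the binary case):
temp_df = pd.DataFrame(temp.tolist())                  # expand the chosen lists into columns
df[[f'col2_{i}' for i in temp_df.columns]] = temp_df   # assign back to the original frame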
You are pretty close. To generate a column based on some condition, where is often handy; see the code below.
col2_exp1 = df.col2.str.split(' ', expand=True)
col2_exp2 = df.col2.str.split('-', expand=True)
col2_combine = (col2_exp1.where(df.col1.eq(1), col2_exp2)
                         .rename(columns=lambda x: f'col2{x}'))
Finally,
df.join(col2_combine)
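which should give output along these lines (same example df as above):
  col1      col2 col20 col21 col22
0    1  aa bb cc    aa    bb    cc
1    2  ee-ff-gg    ee    ff    gg
2    1  hh ii kk    hh    ii    kk
3    2  ll-mm-nn    ll    mm    nn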

Comparing two excel file with pandas

I have two Excel files, A and B. A is the master copy, where the updated records of employee Name and Organization Name (Name and Org) are available. File B contains Name and Org columns with slightly older records, plus many other columns which we are not interested in.
Name Org
0 abc ddc systems
1 sdc ddc systems
2 csc ddd systems
3 rdc kbf org
4 rfc kbf org
I want to do two operations on this:
1) I want to compare Excel B (columns Name and Org) with Excel A (columns Name and Org) and update file B with all the missing entries of Name and their corresponding Org.
2) For all existing entries in file B (columns Name and Org), I would like to compare them with file A and update the Org column if any employee's organization has changed.
For 1), to find the new entries I tried the approach below (not sure if this approach is correct though); the output is a set of tuples, which I was not sure how to write back into the DataFrame.
diff = set(zip(new_df.Name, new_df.Org)) - set(zip(old_df.Name, old_df.Org))
Any help will be appreciated. Thanks.
If names are unique, just concatenate A and B, and drop duplicates. Assuming A and B are your DataFrames,
df = pd.concat([A, B]).drop_duplicates(subset=['Name'], keep='first')
Or,
A = A.set_index('Name')
B = B.set_index('Name')
idx = B.index.difference(A.index)
df = pd.concat([A, B.loc[idx]]).reset_index()
Both should be approximately the same in terms of performance.
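Operation 2) isn't covered above; a minimal sketch for it, assuming Name is unique in both files and A is the master copy:
A_idx = A.set_index('Name')
B_idx = B.set_index('Name')
B_idx.update(A_idx[['Org']])   # overwrite B's Org wherever the Name also exists in A
B = B_idx.reset_index()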
Solution:
diff=pd.DataFrame(list(set(zip(df['aa'], df['bb'])) - set(zip(df2['aa'], df2['bb']))),columns=df.columns)
print(diff.sort_values(by='aa').reset_index(drop=True))
Example:
import pandas as pd
aa = ['aa1', 'aa2', 'aa3', 'aa4', 'aa5']
bb = ['bb1', 'bb2', 'bb3', 'bb4','bb5']
nest = [aa, bb]
df = pd.DataFrame(nest, ['aa', 'bb']).T
df2 = pd.DataFrame(nest, ['aa', 'bb']).T
df2['aa']=df2['aa'].shift(2)
diff=pd.DataFrame(list(set(zip(df['aa'], df['bb'])) - set(zip(df2['aa'], df2['bb']))),columns=df.columns)
print(diff.sort_values(by='aa').reset_index(drop=True))
Output:
aa bb
0 aa1 bb1
1 aa2 bb2
2 aa3 bb3
3 aa4 bb4
4 aa5 bb5

df.apply -- Only on columns of certain type

In Pandas, is there an easy way to apply a function only on columns of a specific type?
In one example, I need to pre-process a dataframe with control characters before I save it to a csv file.
I currently do the following:
df[string_column] = df[string_column].apply(
    lambda x: x.encode('ascii', errors='ignore').decode()
               .replace('\n', ' ').replace('\t', ' '))
but this requires knowing what columns have strings.
What is an easy way to apply a function only on columns of a certain type?
Well, I think I would just make a list of the string columns based on the dtype (they will have object dtype). So something like the following:
>>> df = pd.read_csv(StringIO(data), header=0)
>>> print(df)
A B C D
0 1 a 6 ff
1 2 b 7 cc
2 3 c 8 dd
3 4 d 9 ee
4 5 e 10 gg
>>> print(df.dtypes)
A int64
B object
C int64
D object
And then you can get a list of object/str columns with something like the following:
>>> print(df.dtypes[df.dtypes == 'object'].index.tolist())
['B', 'D']
And now you can use that list with an apply or whatever.
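As a side note (not in the original answer), modern pandas also has select_dtypes, so a sketch like this picks the string columns without listing them by hand; the cleaning mirrors the question's lambda, with a decode() added for Python 3:
string_cols = df.select_dtypes(include='object').columns
for col in string_cols:
    df[col] = df[col].apply(
        lambda x: x.encode('ascii', errors='ignore').decode()
                   .replace('\n', ' ').replace('\t', ' '))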
