Dataframe/Row Indexing for Pandas - python

I was wondering how I could index the datasets so that a row number in df1 corresponds to a different row number in df2, e.g. row 1 in df1 = row 3 in df2.
What I would like (in this case: row 1 of 2011 = row 2 of 2016):
Rows 49:50 of 2011 (b1) and rows 51:52 of 2016 (bt) are the same item with different values in different years, but they are sliced differently because the item sits in a different cell in 2016.
I've been using pd.concat and pd.Series, but still no success.
# slicing 2011 data (total)
b1 = df1.iloc[49:50, 6:7]
m1 = df1.iloc[127:128, 6:7]
a1 = df1.iloc[84:85, 6:7]
data2011 = pd.concat([b1, m1, a1])
# slicing 2016 data (total)
bt = df2.iloc[51:52, 6:7]
mt = df2.iloc[129:130, 6:7]
at = df2.iloc[86:87, 6:7]
data2016 = pd.concat([bt, mt, at])
data20112016 = pd.concat([data2011, data2016])
print(data20112016)
Output I'm getting:
What I need to fix (in this case: row 49 = row 51, so 11849 in the left column and 13500 in the right column):
49 11849
127 22622
84 13658
51 13500
129 25281
86 18594
I would like to make a bar graph comparing b1 (2011) to bt (2016) and so on, meaning 49 = 51, 127 = 129, etc.
# Tot_x Tot_y
# 49=51 11849 13500
# 127=129 22622 25281
# 84=86 13658 18594
I hope this clears things up.
Thanks in advance.

If I understood your question correctly, here is a solution using merge:
df1 = pd.DataFrame([9337, 2953, 8184], index=[49, 127, 84], columns=['Tot'])
df2 = pd.DataFrame([13500, 25281, 18594], index=[51, 129, 86], columns=['Tot'])
total_df = (df1.reset_index()
.merge(df2.reset_index(), left_index=True, right_index=True))
And here is one using concat:
total_df = pd.concat([df1.reset_index(), df2.reset_index()], axis=1)
And here is the resulting bar plot:
total_df.index = total_df['index_x'].astype(str) + '=' + total_df['index_y'].astype(str)
total_df
# index_x Tot_x index_y Tot_y
# 49=51 49 9337 51 13500
# 127=129 127 2953 129 25281
# 84=86 84 8184 86 18594
(total_df.drop(['index_x', 'index_y'], axis=1)
.plot(kind='bar', rot=0))
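For reference, the same merge can be run with the totals from the question itself instead of the placeholder values used above; the rows line up positionally and both years' totals end up side by side under a "49=51"-style label:

```python
import pandas as pd

# the questioner's totals: 2011 on the left, 2016 on the right
data2011 = pd.DataFrame({'Tot': [11849, 22622, 13658]}, index=[49, 127, 84])
data2016 = pd.DataFrame({'Tot': [13500, 25281, 18594]}, index=[51, 129, 86])

# reset_index turns the row labels into a column, then the merge on the
# positional index pairs row 49 with 51, 127 with 129, 84 with 86
total = (data2011.reset_index()
                 .merge(data2016.reset_index(),
                        left_index=True, right_index=True))
total.index = total['index_x'].astype(str) + '=' + total['index_y'].astype(str)
```

`total` then holds Tot_x (2011) and Tot_y (2016), ready for the drop/plot call.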

Related

Python Data Transformation--EDA

Trying to transform my data ("lm" stands for "last month"); hopefully this makes sense. Here is how I have it:
import pandas as pd
df = pd.read_excel('data.xlsx') #reading data
output = []
grouped = df.groupby('txn_id')
for txn_id, group in grouped:
    avg_amt = group['avg_amount'].iloc[-1]
    min_amt = group['min_amount'].iloc[-1]
    lm_avg = group['avg_amount'].iloc[-6:-1]
    min_amt_list = group['min_amount'].iloc[-6:-1]
    output.append([txn_id, *lm_avg, min_amt, *min_amt_list])
# getting multiple rows for 1 txn_id, which is not expected
result_df = pd.DataFrame(output, columns=['txn_id', 'lm_avg', 'lm_avg-1', 'lm_avg-2',
                                          'lm_avg-3', 'lm_avg-4', 'lm_avg-5', 'min_am',
                                          'min_amt-1', 'min_amt-2', 'min_amt-3',
                                          'min_amt-4', 'min_amt-5'])
Use pivot_table:
# Rename columns before reshaping your dataframe with pivot_table
cols = df[::-1].groupby('TXN_ID').cumcount().astype(str)
out = (df.rename(columns={'AVG_Amount': 'lm_avg', 'MIN_AMOUNT': 'min_amnt'})
.pivot_table(index='TXN_ID', values=['lm_avg', 'min_amnt'], columns=cols))
# Flat columns name
out.columns = ['-'.join(i) if i[1] != '0' else i[0] for i in out.columns.to_flat_index()]
# Reset index
out = out.reset_index()
Output:
>>> out
TXN_ID lm_avg lm_avg-1 lm_avg-2 lm_avg-3 lm_avg-4 lm_avg-5 min_amnt min_amnt-1 min_amnt-2 min_amnt-3 min_amnt-4 min_amnt-5
0 1 578 688 589 877 556 78 400 31 20 500 300 30
1 2 578 688 589 877 556 78 400 31 20 0 0 90
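The answer above can be reproduced end to end on made-up data shaped like its output (the column names TXN_ID, AVG_Amount, MIN_AMOUNT are the answer's own convention; the values are invented to match its printout, oldest row first):

```python
import pandas as pd

# hypothetical 6-month history for two transaction ids, oldest row first
df = pd.DataFrame({
    'TXN_ID': [1] * 6 + [2] * 6,
    'AVG_Amount': [78, 556, 877, 589, 688, 578] * 2,
    'MIN_AMOUNT': [30, 300, 500, 20, 31, 400, 90, 0, 0, 20, 31, 400],
})

# number the rows within each id from most recent (0) back to oldest (5)
cols = df[::-1].groupby('TXN_ID').cumcount().astype(str)
out = (df.rename(columns={'AVG_Amount': 'lm_avg', 'MIN_AMOUNT': 'min_amnt'})
         .pivot_table(index='TXN_ID', values=['lm_avg', 'min_amnt'], columns=cols))
# flatten ('lm_avg', '1') -> 'lm_avg-1', keeping plain 'lm_avg' for the latest month
out.columns = ['-'.join(i) if i[1] != '0' else i[0] for i in out.columns.to_flat_index()]
out = out.reset_index()
```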

Renaming some part of columns of dataframe with values from another dataframe

I want to change part of each column name using values from another DataFrame.
There are some similar questions on Stack Overflow, but I need a more advanced version.
data1 = {
"ABC-123_afd": [420, 380, 390],
"LFK-402_ote": [50, 40, 45],
"BPM-299_qbm": [50, 40, 45],
}
data2 = {
"ID": ['ABC-123', 'LFK-402', 'BPM-299'],
"NewID": ['IQU', 'EUW', 'NMS']
}
data1_df=pd.DataFrame(data1)
# ABC-123_afd LFK-402_ote BPM-299_qbm
#0 420 50 50
#1 380 40 40
#2 390 45 45
data2_df=pd.DataFrame(data2)
# ID NewID
#0 ABC-123 IQU
#1 LFK-402 EUW
#2 BPM-299 NMS
I want to make the final result as below:
data_final_df
# IQU_afd EUW_ote NMS_qbm
#0 420 50 50
#1 380 40 40
#2 390 45 45
I tried the code in Renaming columns of dataframe with values from another dataframe.
It ran without error, but nothing changed. I think the column names in data1 are not perfectly matched by the values in data2.
How can I change part of the column names using another pandas DataFrame?
We could create a mapping from "ID" to "NewID" and use it to modify column names:
mapping = dict(zip(data2['ID'], data2['NewID']))
data1_df.columns = [mapping[x] + '_' + y for x, y in data1_df.columns.str.split('_')]
print(data1_df)
or
s = data1_df.columns.str.split('_')
data1_df.columns = s.str[0].map(mapping) + '_' + s.str[1]
or use the DataFrame data2_df:
s = data1_df.columns.str.split('_')
data1_df.columns = s.str[0].map(data2_df.set_index('ID')['NewID']) + '_' + s.str[1]
Output:
IQU_afd EUW_ote NMS_qbm
0 420 50 50
1 380 40 40
2 390 45 45
One option is to use replace:
mapping = dict(zip(data2['ID'], data2['NewID']))
s = pd.Series(data1_df.columns)
data1_df.columns = s.replace(regex = mapping)
data1_df
IQU_afd EUW_ote NMS_qbm
0 420 50 50
1 380 40 40
2 390 45 45
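As a quick end-to-end check, the map-based variant can be run against the sample data from the question (a sketch; the other variants above behave the same way):

```python
import pandas as pd

data1_df = pd.DataFrame({
    "ABC-123_afd": [420, 380, 390],
    "LFK-402_ote": [50, 40, 45],
    "BPM-299_qbm": [50, 40, 45],
})
data2_df = pd.DataFrame({
    "ID": ['ABC-123', 'LFK-402', 'BPM-299'],
    "NewID": ['IQU', 'EUW', 'NMS'],
})

# map the prefix before '_' through ID -> NewID, keep the suffix
mapping = dict(zip(data2_df['ID'], data2_df['NewID']))
s = data1_df.columns.str.split('_')
data1_df.columns = s.str[0].map(mapping) + '_' + s.str[1]
```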

Use lambda with pandas to calculate a new column conditional on existing column

I need to create a new column in a pandas DataFrame which is calculated as the ratio of 2 existing columns in the DataFrame. However, the denominator in the ratio calculation will change based on the value of a string which is found in another column in the DataFrame.
Example. Sample dataset :
import pandas as pd
df = pd.DataFrame(data={'hand': ['left', 'left', 'both', 'both'],
                        'exp_force': [25, 28, 82, 84],
                        'left_max': [38, 38, 38, 38],
                        'both_max': [90, 90, 90, 90]})
I need to create a new DataFrame column df['ratio'] based on the condition of df['hand'].
If df['hand']=='left' then df['ratio'] = df['exp_force'] / df['left_max']
If df['hand']=='both' then df['ratio'] = df['exp_force'] / df['both_max']
You can use np.where():
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'hand': ['left', 'left', 'both', 'both'],
                        'exp_force': [25, 28, 82, 84],
                        'left_max': [38, 38, 38, 38],
                        'both_max': [90, 90, 90, 90]})
df['ratio'] = np.where(df['hand'] == 'left',
                       df['exp_force'] / df['left_max'],
                       df['exp_force'] / df['both_max'])
df
Out[42]:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
Alternatively, in a real-life scenario with lots of conditions and results, you can use np.select() so that you don't have to keep repeating np.where() statements (as I did a lot in my older code). It's better to use np.select in these situations:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'hand': ['left', 'left', 'both', 'both'],
                        'exp_force': [25, 28, 82, 84],
                        'left_max': [38, 38, 38, 38],
                        'both_max': [90, 90, 90, 90]})
c1 = (df['hand'] == 'left')
c2 = (df['hand'] == 'both')
r1 = df['exp_force'] / df['left_max']
r2 = df['exp_force'] / df['both_max']
conditions = [c1, c2]
results = [r1, r2]
df['ratio'] = np.select(conditions, results)
df
Out[430]:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
Enumerate:
for i, e in enumerate(df['hand']):
    if e == 'left':
        df.at[i, 'ratio'] = df.at[i, 'exp_force'] / df.at[i, 'left_max']
    if e == 'both':
        df.at[i, 'ratio'] = df.at[i, 'exp_force'] / df.at[i, 'both_max']
df
Output:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
You can use the apply() method of your dataframe:
df['ratio'] = df.apply(
    lambda x: x['exp_force'] / x['left_max'] if x['hand'] == 'left'
    else x['exp_force'] / x['both_max'],
    axis=1
)
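If more hand categories get added later, a more generic sketch (my own variation, not from the answers above) builds each row's denominator column name from the hand value and looks it up row by row, so no list of conditions is needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'hand': ['left', 'left', 'both', 'both'],
                        'exp_force': [25, 28, 82, 84],
                        'left_max': [38, 38, 38, 38],
                        'both_max': [90, 90, 90, 90]})

# '<hand>_max' names the denominator column for each row
denom_cols = df['hand'] + '_max'
denom = np.array([df.at[i, c] for i, c in denom_cols.items()], dtype=float)
df['ratio'] = df['exp_force'] / denom
```

This assumes every value in 'hand' has a matching '<hand>_max' column.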

Subtracting values from two pivot tables stored in two dataframes

I have two tables:
df1:[1 rows x 23 columns]
1C 1E 1F 1H 1K ... 2M 2P 2S 2U 2W
total 1057 334 3609 3762 1393 ... 328 1611 1426 87 118
df2:[1 rows x 137 columns]
1CA 1CB 1CC 1CF 1CJ 1CS ... 2UB 2UJ 2WB 2WC 2WF 2WJ
total 11 381 111 20 527 2 ... 47 34 79 2 1 36
I need to subtract values between the two tables,
like 1C-1CF, 1E-1EF, 1F-1FF, and so on.
I.e. I only need to subtract the columns ending with F in sheet2.
Answer: 1C = 1C - 1CF = 1037
How is this possible using Python code?
Note:
Some columns of df1 have no 'F' counterpart in df2.
df1:
['1C', '1E', '1F', '1H', '1K', '1M', '1N', '1P', '1Q', '1R', '1S', '1U', '1W', '2C', '2E', '2F', '2H', '2K', '2M', '2P', '2S', '2U', '2W']
df2:
['1CA', '1CB', '1CC', '1CF', '1CJ', '1CS', '1CU', '1EA', '1EB', '1EC', '1EF', '1EJ', '1ES', '1FA', '1FB', '1FC', '1FF', '1FJ', '1FS', '1FT', '1FU', '1HA', '1HB', '1HC', '1HF', '1HJ', '1HS', '1HT', '1HU', '1KA', '1KB', '1KC', '1KF', '1KJ', '1KS', '1KU', '1MA', '1MB', '1MC', '1MF', '1MJ', '1MS', '1MU', '1NA', '1NB', '1NC', '1NF', '1NJ', '1PA', '1PB', '1PC', '1PF', '1PJ', '1PS', '1PT', '1PU', '1QA', '1QB', '1QC', '1QF', '1QJ', '1RA', '1RB', '1RC', '1RF', '1RJ', '1SA', '1SB', '1SC', '1SF', '1SJ', '1SS', '1ST', '1SU', '1UA', '1UB', '1UC', '1UF', '1UJ', '1US', '1UU', '1WA', '1WB', '1WC', '1WF', '1WJ', '1WS', '1WU', '2CA', '2CB', '2CC', '2CJ', '2CS', '2EA', '2EB', '2EJ', '2FA', '2FB', '2FC', '2FJ', '2FU', '2HB', '2HC', '2HF', '2HJ', '2HU', '2KA', '2KB', '2KC', '2KF', '2KJ', '2KU', '2MA', '2MB', '2MC', '2MF', '2MJ', '2MS', '2MT', '2PA', '2PB', '2PC', '2PF', '2PJ', '2PU', '2SA', '2SB', '2SC', '2SF', '2SJ', '2UA', '2UB', '2UJ', '2WB', '2WC', '2WF', '2WJ']
sheet1_columns = sheet1.columns.tolist()
sheet2_expected_columns = ['%sF' % c for c in sheet1_columns]
common_columns = list(set(sheet2_expected_columns).intersection(set(sheet2.columns.tolist())))
columns_dict = {c: '%sF' % c for c in sheet1_columns}
sheet1_with_new_columns_names = sheet1.rename(columns=columns_dict)
sheet1_restriction = sheet1_with_new_columns_names[common_columns]
sheet2_restriction = sheet2[common_columns]
result = sheet1_restriction - sheet2_restriction
Can you test this?
You can try this:
sheet2 = sheet2.filter(regex=(".*F$")) # Leave only 'F' columns in sheet2
sheet2.columns = [i[:-1] for i in sheet2.columns] # Remove 'F' in the end for column-wise substraction
result = sheet1 - sheet2 # Substract values
result[result.isnull()] = sheet1 # Leave sheet1 values if there's no appropriate 'F' column in sheet2
Note: It leaves the value of sheet1 untouched if there's no appropriate columns with 'F' in sheet2.
I created your dataframes like so:
sheet1 = pd.DataFrame({'1C': [1057], '1E': [334], '1F': [3609], '2F': [3609]})
sheet2 = pd.DataFrame({'1CA': [11], '1CB': [381], '1CC': [111], '1CF': [20], '1EF': [10], '1FF': [15]})
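Run against those two sample frames, the answer's four lines give 1C = 1057 - 20 = 1037 (the value the question expects), while 2F, which has no F counterpart in sheet2, keeps its sheet1 value:

```python
import pandas as pd

sheet1 = pd.DataFrame({'1C': [1057], '1E': [334], '1F': [3609], '2F': [3609]})
sheet2 = pd.DataFrame({'1CA': [11], '1CB': [381], '1CC': [111],
                       '1CF': [20], '1EF': [10], '1FF': [15]})

sheet2 = sheet2.filter(regex=".*F$")               # keep only the 'F' columns
sheet2.columns = [c[:-1] for c in sheet2.columns]  # drop the trailing F
result = sheet1 - sheet2                           # aligned column-wise subtraction
result[result.isnull()] = sheet1                   # restore columns with no F match
```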
Solution
Step 1: filter the columns in df2 which have suffix F:
cols = df2.columns[df2.columns.isin([col+'F' for col in df1.columns])]
cols
Index(['1AF', '1GF'], dtype='object')
Step 2: Use string operations on cols to select the matching df1 columns, subtract the df2 values, and assign the result back to df1:
df1.loc[:,cols.str[:-1]] = df1[cols.str[:-1]].values - df2[cols].values
df1
1A 1B 1C 1D 1E 1F 1G 1H 1I 1J
total 70 72 90 46 30 56 10 51 95 34
Values for 1A: 82-12 = 70 and values for 1G: 34-24=10.
Setup:
df1 = pd.DataFrame(np.random.randint(30,100, size=(1,10)), columns=list('ABCDEFGHIJ'))
df1.columns = ['1'+col for col in df1.columns]
df1.index = ['total']
df1
1A 1B 1C 1D 1E 1F 1G 1H 1I 1J
total 82 72 90 46 30 56 34 51 95 34
df2 = pd.DataFrame(np.random.randint(10,30, size=(1,7)), columns=list('ABFGHIJ'))
df2.index = ['total']
df2.columns = ['1'+col for col in df2.columns]
df2.columns = [col+'D' for col in df2.columns]
df2.rename(columns={'1AD':'1AF','1GD':'1GF'},inplace=True)
df2
1AF 1BD 1FD 1GF 1HD 1ID 1JD
total 12 29 29 24 10 12 17
You can try something like:
result_df = df1.join(df2)
for col in df1.columns:
    if col + 'F' in df2.columns:
        result_df[col] = result_df[col] - result_df[col + 'F']

Annotating one dataFrame using another dataFrame, if there is matching value in column

I have two DataFrames. I want to look for matching values between col1 in DataFrame1 and col1 in DataFrame2, and print all the columns from DataFrame1 with the additional columns from DataFrame2.
For Example
I have tried following ,
data = 'file_1'
Up = pd.read_csv(data, sep='\t', index_col=0)  # pd.DataFrame.from_csv is removed in recent pandas
Up = Up.reset_index(drop=False)
Up.head()
Gene_id baseMean log2FoldChange lfcSE stat pvalue padj
0 ENSG.16 176.275036 0.9475260059 0.4310373793 2.1982455617 0.0279316115 0.198658
1 ENSG.10 80.199435 0.4349592748 0.2691551416 1.6160169639 0.1060906455 0.369578
2 ENSG.15 1649.400749 -0.0215428237 0.1285061198 -0.1676404495 0.8668661474 0.947548
3 ENSG.10 25507.767530 0.5145516695 0.2473335499 2.0803957642 0.0374892475 0.229378
4 ENSG.12 70.122885 -0.2612483888 0.2593848667 -1.00718439
and the second dataframe is,
mydata = 'file_2'
annon = pd.read_csv(mydata, sep='\t', index_col=0)
annon = annon.reset_index(drop=False)
annon.head()
Gene_id sam_1 sam2 sam3 sam4 sam5 sam6 sam7 sam8 sam9 sam10 sam11
0 ENSG.16 404 55 33 39 102 43 193 244 600 174 120
1 ENSG.10 58 89 110 69 64 48 61 81 98 75 119
2 ENSG.15 1536 1246 2540 1751 1850 2137 1460 1362 2158 1367 1320
3 ENSG.10 28508 23073 19982 13821 20355 28835 26875 25632 27131 30991 29351
4 ENSG.12 87 81 121 67 98 47 37 59 68 44 81
and following is what i tried so far,
x=pd.merge(Up[['Gene_id' , 'log2FoldChange ', 'pvalue ' , 'padj']] , annon , on = 'Gene_id')
x.head()
Gene_id log2FoldChange pvalue padj sam_1 sam2 sam3 sam4 sam5 sam6 sam7 sam8 sam9 sam10 sam11
It's just giving me the header of the file and nothing else.
So I looked in file1 (Up) for one row value, like the following. This is what I am getting:
print(Up.loc[Up['Gene_id'] =='ENSG.16'])
Empty DataFrame
Columns: [Gene_id, baseMean , log2FoldChange , lfcSE , stat , pvalue , padj]
Index: []
But in fact it is not empty; the dataframe Up does have values.
Any solutions would be great!
pd.merge(df1[['Gene_Id' , 'log2FoldChange', 'pvalue' , 'padj']] , df2 , left_on='Gene_Id' , right_on= 'Gene_id')
You can then easily drop Gene_id if you want.
Hope this helps. Let me know if it works.
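A likely cause of the symptoms in the question (an empty merge result and an empty .loc lookup even though the value is visibly there) is trailing whitespace in the column names and key values; the printout `Columns: [Gene_id, baseMean , log2FoldChange , ...]` hints at padded labels. A sketch of the fix, on tiny made-up frames that reproduce the symptom:

```python
import pandas as pd

# hypothetical frames with padded column labels and key values
Up = pd.DataFrame({'Gene_id ': ['ENSG.16 '], 'log2FoldChange ': [0.9475],
                   'pvalue ': [0.0279], 'padj': [0.198658]})
annon = pd.DataFrame({'Gene_id': ['ENSG.16'], 'sam_1': [404]})

# strip whitespace from both the column labels and the merge keys
Up.columns = Up.columns.str.strip()
Up['Gene_id'] = Up['Gene_id'].str.strip()

x = pd.merge(Up[['Gene_id', 'log2FoldChange', 'pvalue', 'padj']], annon, on='Gene_id')
```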
import pandas as pd
# creating test Dataframe1
df = pd.DataFrame(['ENSG1', 162.315169869338, 0.920583258294463, 0.260406974056691, 3.53517128959092, 0.000407510906151687, 0.0176112964515702])
df=df.T
# important thing is make column 0 as its index
df.index=df[0]
print(df)
# creating test Dataframe2
df2 = pd.DataFrame(['ENSG1', 404, 55, 33, 39, 102, 43, 193, 244, 600, 174, 120])
df2=df2.T
# important thing is make column 0 as its index
df2.index=df2[0]
print(df2)
# concatenate both frames using axis=1 (join='outer' or 'inner' as per your need)
x = pd.concat([df,df2],axis=1,join='outer')
print(x)
