Dataframe/Row Indexing for Pandas - python

I was wondering how I could index the datasets so that a row number in df1 corresponds to a different row number in df2, e.g. row 1 in df1 = row 3 in df2.
What I would like (in this case: row 1 of 2011 = row 2 of 2016):
Rows 49:50 of 2011 (b1) and rows 51:52 of 2016 (bt) are the same item with different values in different years, but they are sliced differently because the item sits in a different cell in 2016.
I've been using pd.concat and pd.Series, but still no success.
# slicing 2011 data (total)
b1 = df1.iloc[49:50, 6:7]
m1 = df1.iloc[127:128, 6:7]
a1 = df1.iloc[84:85, 6:7]
data2011 = pd.concat([b1, m1, a1])
# slicing 2016 data (total)
bt = df2.iloc[51:52, 6:7]
mt = df2.iloc[129:130, 6:7]
at = df2.iloc[86:87, 6:7]
data2016 = pd.concat([bt, mt, at])
data20112016 = pd.concat([data2011, data2016])
print(data20112016)
Output I'm getting:
What I need to fix (in this case: row 49 = row 51, so 11849 in the left column and 13500 in the right column):
49 11849
127 22622
84 13658
51 13500
129 25281
86 18594
I would like to make a bar graph comparing b1 (2011) to bt (2016) and so on, meaning 49 = 51, 127 = 129, etc.
# Tot_x Tot_y
# 49=51 11849 13500
# 127=129 22622 25281
# 84=86 13658 18594
I hope this clears things up.
Thanks in advance.

If I understood your question correctly, here is a solution using merge:
df1 = pd.DataFrame([9337, 2953, 8184], index=[49, 127, 84], columns=['Tot'])
df2 = pd.DataFrame([13500, 25281, 18594], index=[51, 129, 86], columns=['Tot'])
total_df = (df1.reset_index()
.merge(df2.reset_index(), left_index=True, right_index=True))
And here is one using concat:
total_df = pd.concat([df1.reset_index(), df2.reset_index()], axis=1)
And here is the resulting bar plot:
total_df.index = total_df['index_x'].astype(str) + '=' + total_df['index_y'].astype(str)
total_df
# index_x Tot_x index_y Tot_y
# 49=51 49 9337 51 13500
# 127=129 127 2953 129 25281
# 84=86 84 8184 86 18594
(total_df.drop(['index_x', 'index_y'], axis=1)
.plot(kind='bar', rot=0))
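For reference, the same merge can be run with the totals from the question itself instead of the placeholder values used above; the rows line up positionally and both years' totals end up side by side under a "49=51"-style label:

```python
import pandas as pd

# the questioner's totals: 2011 on the left, 2016 on the right
data2011 = pd.DataFrame({'Tot': [11849, 22622, 13658]}, index=[49, 127, 84])
data2016 = pd.DataFrame({'Tot': [13500, 25281, 18594]}, index=[51, 129, 86])

# reset_index turns the row labels into a column, then the merge on the
# positional index pairs row 49 with 51, 127 with 129, 84 with 86
total = (data2011.reset_index()
                 .merge(data2016.reset_index(),
                        left_index=True, right_index=True))
total.index = total['index_x'].astype(str) + '=' + total['index_y'].astype(str)
```

`total` then holds Tot_x (2011) and Tot_y (2016), ready for the drop/plot call.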

Related

Python Data Transformation--EDA

Trying to transform my data ("lm" stands for "last month"); hopefully this makes sense. Here is how I have it:
import pandas as pd
df = pd.read_excel('data.xlsx') #reading data
output = []
grouped = df.groupby('txn_id')
for txn_id, group in grouped:
    avg_amt = group['avg_amount'].iloc[-1]
    min_amt = group['min_amount'].iloc[-1]
    lm_avg = group['avg_amount'].iloc[-6:-1]
    min_amt_list = group['min_amount'].iloc[-6:-1]
    output.append([txn_id, *lm_avg, min_amt, *min_amt_list])
# getting multiple rows for 1 txn_id, which is not expected
result_df = pd.DataFrame(output, columns=['txn_id', 'lm_avg', 'lm_avg-1', 'lm_avg-2',
                                          'lm_avg-3', 'lm_avg-4', 'lm_avg-5', 'min_am',
                                          'min_amt-1', 'min_amt-2', 'min_amt-3',
                                          'min_amt-4', 'min_amt-5'])
Use pivot_table:
# Rename columns before reshaping your dataframe with pivot_table
cols = df[::-1].groupby('TXN_ID').cumcount().astype(str)
out = (df.rename(columns={'AVG_Amount': 'lm_avg', 'MIN_AMOUNT': 'min_amnt'})
.pivot_table(index='TXN_ID', values=['lm_avg', 'min_amnt'], columns=cols))
# Flat columns name
out.columns = ['-'.join(i) if i[1] != '0' else i[0] for i in out.columns.to_flat_index()]
# Reset index
out = out.reset_index()
Output:
>>> out
TXN_ID lm_avg lm_avg-1 lm_avg-2 lm_avg-3 lm_avg-4 lm_avg-5 min_amnt min_amnt-1 min_amnt-2 min_amnt-3 min_amnt-4 min_amnt-5
0 1 578 688 589 877 556 78 400 31 20 500 300 30
1 2 578 688 589 877 556 78 400 31 20 0 0 90
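The answer above can be reproduced end to end on made-up data shaped like its output (the column names TXN_ID, AVG_Amount, MIN_AMOUNT are the answer's own convention; the values are invented to match its printout, oldest row first):

```python
import pandas as pd

# hypothetical 6-month history for two transaction ids, oldest row first
df = pd.DataFrame({
    'TXN_ID': [1] * 6 + [2] * 6,
    'AVG_Amount': [78, 556, 877, 589, 688, 578] * 2,
    'MIN_AMOUNT': [30, 300, 500, 20, 31, 400, 90, 0, 0, 20, 31, 400],
})

# number the rows within each id from most recent (0) back to oldest (5)
cols = df[::-1].groupby('TXN_ID').cumcount().astype(str)
out = (df.rename(columns={'AVG_Amount': 'lm_avg', 'MIN_AMOUNT': 'min_amnt'})
         .pivot_table(index='TXN_ID', values=['lm_avg', 'min_amnt'], columns=cols))
# flatten ('lm_avg', '1') -> 'lm_avg-1', keeping plain 'lm_avg' for the latest month
out.columns = ['-'.join(i) if i[1] != '0' else i[0] for i in out.columns.to_flat_index()]
out = out.reset_index()
```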

Renaming some part of columns of dataframe with values from another dataframe

I want to change part of each column name using values from another DataFrame.
There are some similar questions on Stack Overflow, but I need a more advanced version.
data1 = {
"ABC-123_afd": [420, 380, 390],
"LFK-402_ote": [50, 40, 45],
"BPM-299_qbm": [50, 40, 45],
}
data2 = {
"ID": ['ABC-123', 'LFK-402', 'BPM-299'],
"NewID": ['IQU', 'EUW', 'NMS']
}
data1_df=pd.DataFrame(data1)
# ABC-123_afd LFK-402_ote BPM-299_qbm
#0 420 50 50
#1 380 40 40
#2 390 45 45
data2_df=pd.DataFrame(data2)
# ID NewID
#0 ABC-123 IQU
#1 LFK-402 EUW
#2 BPM-299 NMS
I want to make the final result as below:
data_final_df
# IQU_afd EUW_ote NMS_qbm
#0 420 50 50
#1 380 40 40
#2 390 45 45
I tried the code in Renaming columns of dataframe with values from another dataframe.
It ran without error, but nothing changed. I think the column names in data1 are not perfectly matched by the values in data2.
How can I change part of the column names using another pandas DataFrame?
We could create a mapping from "ID" to "NewID" and use it to modify column names:
mapping = dict(zip(data2['ID'], data2['NewID']))
data1_df.columns = [mapping[x] + '_' + y for x, y in data1_df.columns.str.split('_')]
print(data1_df)
or
s = data1_df.columns.str.split('_')
data1_df.columns = s.str[0].map(mapping) + '_' + s.str[1]
or use the DataFrame data2_df:
s = data1_df.columns.str.split('_')
data1_df.columns = s.str[0].map(data2_df.set_index('ID')['NewID']) + '_' + s.str[1]
Output:
IQU_afd EUW_ote NMS_qbm
0 420 50 50
1 380 40 40
2 390 45 45
One option is to use replace:
mapping = dict(zip(data2['ID'], data2['NewID']))
s = pd.Series(data1_df.columns)
data1_df.columns = s.replace(regex = mapping)
data1_df
IQU_afd EUW_ote NMS_qbm
0 420 50 50
1 380 40 40
2 390 45 45
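As a quick end-to-end check, the map-based variant can be run against the sample data from the question (a sketch; the other variants above behave the same way):

```python
import pandas as pd

data1_df = pd.DataFrame({
    "ABC-123_afd": [420, 380, 390],
    "LFK-402_ote": [50, 40, 45],
    "BPM-299_qbm": [50, 40, 45],
})
data2_df = pd.DataFrame({
    "ID": ['ABC-123', 'LFK-402', 'BPM-299'],
    "NewID": ['IQU', 'EUW', 'NMS'],
})

# map the prefix before '_' through ID -> NewID, keep the suffix
mapping = dict(zip(data2_df['ID'], data2_df['NewID']))
s = data1_df.columns.str.split('_')
data1_df.columns = s.str[0].map(mapping) + '_' + s.str[1]
```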

Use lambda with pandas to calculate a new column conditional on existing column

I need to create a new column in a pandas DataFrame which is calculated as the ratio of 2 existing columns in the DataFrame. However, the denominator in the ratio calculation will change based on the value of a string which is found in another column in the DataFrame.
Example. Sample dataset :
import pandas as pd
df = pd.DataFrame(data={'hand': ['left', 'left', 'both', 'both'],
                        'exp_force': [25, 28, 82, 84],
                        'left_max': [38, 38, 38, 38],
                        'both_max': [90, 90, 90, 90]})
I need to create a new DataFrame column df['ratio'] based on the condition of df['hand'].
If df['hand']=='left' then df['ratio'] = df['exp_force'] / df['left_max']
If df['hand']=='both' then df['ratio'] = df['exp_force'] / df['both_max']
You can use np.where():
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'hand': ['left', 'left', 'both', 'both'],
                        'exp_force': [25, 28, 82, 84],
                        'left_max': [38, 38, 38, 38],
                        'both_max': [90, 90, 90, 90]})
df['ratio'] = np.where(df['hand'] == 'left',
                       df['exp_force'] / df['left_max'],
                       df['exp_force'] / df['both_max'])
df
Out[42]:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
Alternatively, in a real-life scenario with lots of conditions and results, you can use np.select() so that you don't have to keep repeating np.where() statements (as I did a lot in my older code). It's better to use np.select in these situations:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'hand': ['left', 'left', 'both', 'both'],
                        'exp_force': [25, 28, 82, 84],
                        'left_max': [38, 38, 38, 38],
                        'both_max': [90, 90, 90, 90]})
c1 = (df['hand'] == 'left')
c2 = (df['hand'] == 'both')
r1 = df['exp_force'] / df['left_max']
r2 = df['exp_force'] / df['both_max']
conditions = [c1, c2]
results = [r1, r2]
df['ratio'] = np.select(conditions, results)
df
Out[430]:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
Enumerate:
for i, e in enumerate(df['hand']):
    if e == 'left':
        df.at[i, 'ratio'] = df.at[i, 'exp_force'] / df.at[i, 'left_max']
    if e == 'both':
        df.at[i, 'ratio'] = df.at[i, 'exp_force'] / df.at[i, 'both_max']
df
Output:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
You can use the apply() method of your dataframe:
df['ratio'] = df.apply(
    lambda x: x['exp_force'] / x['left_max'] if x['hand'] == 'left'
    else x['exp_force'] / x['both_max'],
    axis=1
)
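If more hand categories get added later, a more generic sketch (my own variation, not from the answers above) builds each row's denominator column name from the hand value and looks it up row by row, so no list of conditions is needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'hand': ['left', 'left', 'both', 'both'],
                        'exp_force': [25, 28, 82, 84],
                        'left_max': [38, 38, 38, 38],
                        'both_max': [90, 90, 90, 90]})

# '<hand>_max' names the denominator column for each row
denom_cols = df['hand'] + '_max'
denom = np.array([df.at[i, c] for i, c in denom_cols.items()], dtype=float)
df['ratio'] = df['exp_force'] / denom
```

This assumes every value in 'hand' has a matching '<hand>_max' column.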

Subtracting values from two pivot tables stored in two dataframes

I have two tables:
df1:[1 rows x 23 columns]
1C 1E 1F 1H 1K ... 2M 2P 2S 2U 2W
total 1057 334 3609 3762 1393 ... 328 1611 1426 87 118
df2:[1 rows x 137 columns]
1CA 1CB 1CC 1CF 1CJ 1CS ... 2UB 2UJ 2WB 2WC 2WF 2WJ
total 11 381 111 20 527 2 ... 47 34 79 2 1 36
I need to subtract values between the two tables,
like 1C-1CF, 1E-1EF, 1F-1FF, and so on.
I.e. I only need to subtract the columns ending with F in sheet2.
Answer: 1C = 1C - 1CF = 1037
How is this possible using Python code?
Note:
Some columns of df1 have no 'F' counterpart in df2.
df1:
['1C', '1E', '1F', '1H', '1K', '1M', '1N', '1P', '1Q', '1R', '1S', '1U', '1W', '2C', '2E', '2F', '2H', '2K', '2M', '2P', '2S', '2U', '2W']
df2:
['1CA', '1CB', '1CC', '1CF', '1CJ', '1CS', '1CU', '1EA', '1EB', '1EC', '1EF', '1EJ', '1ES', '1FA', '1FB', '1FC', '1FF', '1FJ', '1FS', '1FT', '1FU', '1HA', '1HB', '1HC', '1HF', '1HJ', '1HS', '1HT', '1HU', '1KA', '1KB', '1KC', '1KF', '1KJ', '1KS', '1KU', '1MA', '1MB', '1MC', '1MF', '1MJ', '1MS', '1MU', '1NA', '1NB', '1NC', '1NF', '1NJ', '1PA', '1PB', '1PC', '1PF', '1PJ', '1PS', '1PT', '1PU', '1QA', '1QB', '1QC', '1QF', '1QJ', '1RA', '1RB', '1RC', '1RF', '1RJ', '1SA', '1SB', '1SC', '1SF', '1SJ', '1SS', '1ST', '1SU', '1UA', '1UB', '1UC', '1UF', '1UJ', '1US', '1UU', '1WA', '1WB', '1WC', '1WF', '1WJ', '1WS', '1WU', '2CA', '2CB', '2CC', '2CJ', '2CS', '2EA', '2EB', '2EJ', '2FA', '2FB', '2FC', '2FJ', '2FU', '2HB', '2HC', '2HF', '2HJ', '2HU', '2KA', '2KB', '2KC', '2KF', '2KJ', '2KU', '2MA', '2MB', '2MC', '2MF', '2MJ', '2MS', '2MT', '2PA', '2PB', '2PC', '2PF', '2PJ', '2PU', '2SA', '2SB', '2SC', '2SF', '2SJ', '2UA', '2UB', '2UJ', '2WB', '2WC', '2WF', '2WJ']
sheet1_columns = sheet1.columns.tolist()
sheet2_expected_columns = ['%sF' % c for c in sheet1_columns]
common_columns = list(set(sheet2_expected_columns).intersection(set(sheet2.columns.tolist())))
columns_dict = {c: '%sF' % c for c in sheet1_columns}
sheet1_with_new_columns_names = sheet1.rename(columns=columns_dict)
sheet1_restriction = sheet1_with_new_columns_names[common_columns]
sheet2_restriction = sheet2[common_columns]
result = sheet1_restriction - sheet2_restriction
Can you test this?
You can try this:
sheet2 = sheet2.filter(regex=(".*F$")) # Leave only 'F' columns in sheet2
sheet2.columns = [i[:-1] for i in sheet2.columns] # Remove 'F' in the end for column-wise substraction
result = sheet1 - sheet2 # Substract values
result[result.isnull()] = sheet1 # Leave sheet1 values if there's no appropriate 'F' column in sheet2
Note: It leaves the value of sheet1 untouched if there's no appropriate columns with 'F' in sheet2.
I created your dataframes like so:
sheet1 = pd.DataFrame({'1C': [1057], '1E': [334], '1F': [3609], '2F': [3609]})
sheet2 = pd.DataFrame({'1CA': [11], '1CB': [381], '1CC': [111], '1CF': [20], '1EF': [10], '1FF': [15]})
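Run against those two sample frames, the answer's four lines give 1C = 1057 - 20 = 1037 (the value the question expects), while 2F, which has no F counterpart in sheet2, keeps its sheet1 value:

```python
import pandas as pd

sheet1 = pd.DataFrame({'1C': [1057], '1E': [334], '1F': [3609], '2F': [3609]})
sheet2 = pd.DataFrame({'1CA': [11], '1CB': [381], '1CC': [111],
                       '1CF': [20], '1EF': [10], '1FF': [15]})

sheet2 = sheet2.filter(regex=".*F$")               # keep only the 'F' columns
sheet2.columns = [c[:-1] for c in sheet2.columns]  # drop the trailing F
result = sheet1 - sheet2                           # aligned column-wise subtraction
result[result.isnull()] = sheet1                   # restore columns with no F match
```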
Solution
Step 1: filter the columns in df2 which have suffix F:
cols = df2.columns[df2.columns.isin([col+'F' for col in df1.columns])]
cols
Index(['1AF', '1GF'], dtype='object')
Step 2: Use string operations on cols to select the matching df1 columns, subtract the df2 values, and assign the result back to df1:
df1.loc[:,cols.str[:-1]] = df1[cols.str[:-1]].values - df2[cols].values
df1
1A 1B 1C 1D 1E 1F 1G 1H 1I 1J
total 70 72 90 46 30 56 10 51 95 34
Values for 1A: 82-12 = 70 and values for 1G: 34-24=10.
Setup:
df1 = pd.DataFrame(np.random.randint(30,100, size=(1,10)), columns=list('ABCDEFGHIJ'))
df1.columns = ['1'+col for col in df1.columns]
df1.index = ['total']
df1
1A 1B 1C 1D 1E 1F 1G 1H 1I 1J
total 82 72 90 46 30 56 34 51 95 34
df2 = pd.DataFrame(np.random.randint(10,30, size=(1,7)), columns=list('ABFGHIJ'))
df2.index = ['total']
df2.columns = ['1'+col for col in df2.columns]
df2.columns = [col+'D' for col in df2.columns]
df2.rename(columns={'1AD':'1AF','1GD':'1GF'},inplace=True)
df2
1AF 1BD 1FD 1GF 1HD 1ID 1JD
total 12 29 29 24 10 12 17
You can try something like:
result_df = df1.join(df2)
for col in df1.columns:
    if col + 'F' in df2.columns:
        result_df[col] = result_df[col] - result_df[col + 'F']

Annotating one dataFrame using another dataFrame, if there is matching value in column

I have two DataFrames. I want to look for matching values between col1 in DataFrame1 and col1 in DataFrame2, and print all the columns from DataFrame1 with the additional columns from DataFrame2.
For Example
I have tried following ,
data = 'file_1'
Up = pd.read_csv(data, sep='\t', index_col=0)  # pd.DataFrame.from_csv is removed in recent pandas
Up = Up.reset_index(drop=False)
Up.head()
Gene_id baseMean log2FoldChange lfcSE stat pvalue padj
0 ENSG.16 176.275036 0.9475260059 0.4310373793 2.1982455617 0.0279316115 0.198658
1 ENSG.10 80.199435 0.4349592748 0.2691551416 1.6160169639 0.1060906455 0.369578
2 ENSG.15 1649.400749 -0.0215428237 0.1285061198 -0.1676404495 0.8668661474 0.947548
3 ENSG.10 25507.767530 0.5145516695 0.2473335499 2.0803957642 0.0374892475 0.229378
4 ENSG.12 70.122885 -0.2612483888 0.2593848667 -1.00718439
and the second dataframe is,
mydata = 'file_2'
annon = pd.read_csv(mydata, sep='\t', index_col=0)
annon = annon.reset_index(drop=False)
annon.head()
Gene_id sam_1 sam2 sam3 sam4 sam5 sam6 sam7 sam8 sam9 sam10 sam11
0 ENSG.16 404 55 33 39 102 43 193 244 600 174 120
1 ENSG.10 58 89 110 69 64 48 61 81 98 75 119
2 ENSG.15 1536 1246 2540 1751 1850 2137 1460 1362 2158 1367 1320
3 ENSG.10 28508 23073 19982 13821 20355 28835 26875 25632 27131 30991 29351
4 ENSG.12 87 81 121 67 98 47 37 59 68 44 81
and following is what i tried so far,
x=pd.merge(Up[['Gene_id' , 'log2FoldChange ', 'pvalue ' , 'padj']] , annon , on = 'Gene_id')
x.head()
Gene_id log2FoldChange pvalue padj sam_1 sam2 sam3 sam4 sam5 sam6 sam7 sam8 sam9 sam10 sam11
It's just giving me the header of the file and nothing else.
So I looked in file1 (Up) for one row value, like the following. This is what I am getting:
print(Up.loc[Up['Gene_id'] =='ENSG.16'])
Empty DataFrame
Columns: [Gene_id, baseMean , log2FoldChange , lfcSE , stat , pvalue , padj]
Index: []
But in fact it is not empty; the dataframe Up does have values.
Any solutions would be great!
pd.merge(df1[['Gene_Id' , 'log2FoldChange', 'pvalue' , 'padj']] , df2 , left_on='Gene_Id' , right_on= 'Gene_id')
You can then easily drop Gene_id if you want.
Hope this helps. Let me know if it works.
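A likely cause of the symptoms in the question (an empty merge result and an empty .loc lookup even though the value is visibly there) is trailing whitespace in the column names and key values; the printout `Columns: [Gene_id, baseMean , log2FoldChange , ...]` hints at padded labels. A sketch of the fix, on tiny made-up frames that reproduce the symptom:

```python
import pandas as pd

# hypothetical frames with padded column labels and key values
Up = pd.DataFrame({'Gene_id ': ['ENSG.16 '], 'log2FoldChange ': [0.9475],
                   'pvalue ': [0.0279], 'padj': [0.198658]})
annon = pd.DataFrame({'Gene_id': ['ENSG.16'], 'sam_1': [404]})

# strip whitespace from both the column labels and the merge keys
Up.columns = Up.columns.str.strip()
Up['Gene_id'] = Up['Gene_id'].str.strip()

x = pd.merge(Up[['Gene_id', 'log2FoldChange', 'pvalue', 'padj']], annon, on='Gene_id')
```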
import pandas as pd
# creating test Dataframe1
df = pd.DataFrame(['ENSG1', 162.315169869338, 0.920583258294463, 0.260406974056691, 3.53517128959092, 0.000407510906151687, 0.0176112964515702])
df=df.T
# important thing is make column 0 as its index
df.index=df[0]
print(df)
# creating test Dataframe2
df2 = pd.DataFrame(['ENSG1', 404, 55, 33, 39, 102, 43, 193, 244, 600, 174, 120])
df2=df2.T
# important thing is make column 0 as its index
df2.index=df2[0]
print(df2)
# concatenate both frames using axis=1 (join='outer' or 'inner' as per your need)
x = pd.concat([df,df2],axis=1,join='outer')
print(x)
