Subtracting values from two pivot tables stored in two dataframes - python

I have two tables:
df1:[1 rows x 23 columns]
1C 1E 1F 1H 1K ... 2M 2P 2S 2U 2W
total 1057 334 3609 3762 1393 ... 328 1611 1426 87 118
df2:[1 rows x 137 columns]
1CA 1CB 1CC 1CF 1CJ 1CS ... 2UB 2UJ 2WB 2WC 2WF 2WJ
total 11 381 111 20 527 2 ... 47 34 79 2 1 36
I need to subtract values between the two tables,
like 1C-1CF, 1E-1EF, 1F-1FF and so on.
I.e. from each column in df1 I need to subtract only the df2 column that ends with F.
Expected answer: 1C = 1C - 1CF = 1057 - 20 = 1037
How can I do this with Python code?
Note:
Some columns in `df1` have no corresponding `F` column in `df2`.
df1:
['1C', '1E', '1F', '1H', '1K', '1M', '1N', '1P', '1Q', '1R', '1S', '1U', '1W', '2C', '2E', '2F', '2H', '2K', '2M', '2P', '2S', '2U', '2W']
df2:
['1CA', '1CB', '1CC', '1CF', '1CJ', '1CS', '1CU', '1EA', '1EB', '1EC', '1EF', '1EJ', '1ES', '1FA', '1FB', '1FC', '1FF', '1FJ', '1FS', '1FT', '1FU', '1HA', '1HB', '1HC', '1HF', '1HJ', '1HS', '1HT', '1HU', '1KA', '1KB', '1KC', '1KF', '1KJ', '1KS', '1KU', '1MA', '1MB', '1MC', '1MF', '1MJ', '1MS', '1MU', '1NA', '1NB', '1NC', '1NF', '1NJ', '1PA', '1PB', '1PC', '1PF', '1PJ', '1PS', '1PT', '1PU', '1QA', '1QB', '1QC', '1QF', '1QJ', '1RA', '1RB', '1RC', '1RF', '1RJ', '1SA', '1SB', '1SC', '1SF', '1SJ', '1SS', '1ST', '1SU', '1UA', '1UB', '1UC', '1UF', '1UJ', '1US', '1UU', '1WA', '1WB', '1WC', '1WF', '1WJ', '1WS', '1WU', '2CA', '2CB', '2CC', '2CJ', '2CS', '2EA', '2EB', '2EJ', '2FA', '2FB', '2FC', '2FJ', '2FU', '2HB', '2HC', '2HF', '2HJ', '2HU', '2KA', '2KB', '2KC', '2KF', '2KJ', '2KU', '2MA', '2MB', '2MC', '2MF', '2MJ', '2MS', '2MT', '2PA', '2PB', '2PC', '2PF', '2PJ', '2PU', '2SA', '2SB', '2SC', '2SF', '2SJ', '2UA', '2UB', '2UJ', '2WB', '2WC', '2WF', '2WJ']

sheet1_columns = sheet1.columns.tolist()
sheet2_expected_columns = ['%sF' % c for c in sheet1_columns]
common_columns = list(set(sheet2_expected_columns).intersection(set(sheet2.columns.tolist())))
columns_dict = {c: '%sF' % c for c in sheet1_columns}
sheet1_with_new_column_names = sheet1.rename(columns=columns_dict)
sheet1_restriction = sheet1_with_new_column_names[common_columns]
sheet2_restriction = sheet2[common_columns]
result = sheet1_restriction - sheet2_restriction
Can you test this?

You can try this:
sheet2 = sheet2.filter(regex=(".*F$"))  # Leave only 'F' columns in sheet2
sheet2.columns = [i[:-1] for i in sheet2.columns]  # Remove the trailing 'F' for column-wise subtraction
result = sheet1 - sheet2  # Subtract values
result[result.isnull()] = sheet1  # Leave sheet1 values if there's no matching 'F' column in sheet2
Note: It leaves the value of sheet1 untouched if there is no matching 'F' column in sheet2.
I created your dataframes like so:
sheet1 = pd.DataFrame({'1C': [1057], '1E': [334], '1F': [3609], '2F': [3609]})
sheet2 = pd.DataFrame({'1CA': [11], '1CB': [381], '1CC': [111], '1CF': [20], '1EF': [10], '1FF': [15]})
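Putting the whole answer together as a runnable sketch (numbers taken from the question; I use `fillna` for the last step, which has the same effect as the boolean assignment above):

```python
import pandas as pd

sheet1 = pd.DataFrame({'1C': [1057], '1E': [334], '1F': [3609], '2F': [3609]})
sheet2 = pd.DataFrame({'1CA': [11], '1CB': [381], '1CC': [111],
                       '1CF': [20], '1EF': [10], '1FF': [15]})

f_cols = sheet2.filter(regex=r'F$')                 # keep only the ...F columns
f_cols.columns = [c[:-1] for c in f_cols.columns]   # '1CF' -> '1C'

result = sheet1 - f_cols          # aligned subtraction; NaN where sheet2 has no F column
result = result.fillna(sheet1)    # keep the sheet1 value in that case (e.g. '2F')
print(result)
```

This prints 1C=1037, 1E=324, 1F=3594 and leaves 2F at 3609.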

Solution
Step 1: Filter the columns in df2 that carry an F suffix matching a df1 column:
cols = df2.columns[df2.columns.isin([col+'F' for col in df1.columns])]
cols
Index(['1AF', '1GF'], dtype='object')
Step 2: Strip the F suffix from cols to select the matching df1 columns, subtract the df2 values, and assign the result back to df1:
df1.loc[:,cols.str[:-1]] = df1[cols.str[:-1]].values - df2[cols].values
df1
1A 1B 1C 1D 1E 1F 1G 1H 1I 1J
total 70 72 90 46 30 56 10 51 95 34
Check: 1A = 82 - 12 = 70 and 1G = 34 - 24 = 10.
Setup:
df1 = pd.DataFrame(np.random.randint(30,100, size=(1,10)), columns=list('ABCDEFGHIJ'))
df1.columns = ['1'+col for col in df1.columns]
df1.index = ['total']
df1
1A 1B 1C 1D 1E 1F 1G 1H 1I 1J
total 82 72 90 46 30 56 34 51 95 34
df2 = pd.DataFrame(np.random.randint(10,30, size=(1,7)), columns=list('ABFGHIJ'))
df2.index = ['total']
df2.columns = ['1'+col for col in df2.columns]
df2.columns = [col+'D' for col in df2.columns]
df2.rename(columns={'1AD':'1AF','1GD':'1GF'},inplace=True)
df2
1AF 1BD 1FD 1GF 1HD 1ID 1JD
total 12 29 29 24 10 12 17

You can try something like:
result_df = df1.join(df2)
for col in df1.columns:
    if col + 'F' in df2.columns:
        result_df[col] = result_df[col] - result_df[col + 'F']
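A runnable version of that loop on toy data (the column names here are just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'1C': [1057], '1E': [334]}, index=['total'])
df2 = pd.DataFrame({'1CF': [20], '1CA': [5]}, index=['total'])

result_df = df1.join(df2)
for col in df1.columns:
    if col + 'F' in df2.columns:
        result_df[col] = result_df[col] - result_df[col + 'F']

print(result_df[df1.columns])   # 1C becomes 1037; 1E stays 334 (no '1EF' in df2)
```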


Is there any example of Python code that generates a table of numbers from the ranges in the first table?

In my first table I have columns: indeks, il, start and stop. The last two define a range. I need to list (in a new table) all numbers in the range from start to stop, but also save indeks and the other values belonging to the range.
This table shows what kind of data I have (sample):
ID  Indeks  Start  Stop  il
0   A1      1      3     25
1   B1      31     55    5
2   C1      36     900   865
3   D1      900    2500  20
...
And this is the table I want to get:
Indeks  Start  Stop  il   kod
A1      1      3     25   1
A1      1      3     25   2
A1      1      3     25   3
B1      31     55    5    31
B1      31     55    5    32
B1      31     55    5    33
...
B1      31     55    5    53
B1      31     55    5    54
B1      31     55    5    55
C1      36     900   865  36
C1      36     900   865  37
C1      36     900   865  38
...
C1      36     900   865  898
C1      36     900   865  899
C1      36     900   865  900
...
EDIT:
lidy = pd.read_excel('path')
lid = pd.DataFrame(lidy)
output = []
for i in range(0, len(lid)):
    for j in range(lid.iloc[i, 1], lid.iloc[i, 2] + 1):
        y = (lid.iloc[i, 0], j)
        output.append(y)
print(output)
OR
lidy = pd.read_excel('path')
lid = pd.DataFrame(lidy)
for i in range(0, len(lid)):
    for j in range(lid.iloc[i, 1], lid.iloc[i, 2] + 1):
        y = (lid.iloc[i, 0], j)
        print(y)
Two options:
(1 - preferred) Use pandas (with openpyxl as the engine): the Excel file I'm using is named data.xlsx, and sheet Sheet1 contains your data. Then this
import pandas as pd

df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
df["kod"] = df[["Start", "Stop"]].apply(
    lambda row: range(row.iat[0], row.iat[1] + 1), axis=1
)
df = df.iloc[:, 1:].explode("kod", ignore_index=True)
with pd.ExcelWriter("data.xlsx", mode="a", if_sheet_exists="replace") as writer:
    df.to_excel(writer, sheet_name="Sheet2", index=False)
should produce the required output in sheet Sheet2. The work is done by putting the required range()s in the new column kod, and then .explode()-ing it.
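The range/explode step also works without the Excel round-trip; a small in-memory sketch using the sample rows:

```python
import pandas as pd

df = pd.DataFrame({'Indeks': ['A1', 'B1'],
                   'Start': [1, 31],
                   'Stop': [3, 33],
                   'il': [25, 5]})

# One range object per row, then one output row per number in the range
df['kod'] = df[['Start', 'Stop']].apply(
    lambda row: range(row.iat[0], row.iat[1] + 1), axis=1
)
out = df.explode('kod', ignore_index=True)
print(out)   # A1 rows with kod 1..3, B1 rows with kod 31..33
```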
(2) Use only openpyxl:
from openpyxl import load_workbook

wb = load_workbook(filename="data.xlsx")
ws = wb["Sheet1"]
rows = ws.iter_rows(values_only=True)
# Reading the required column names
data = [list(next(rows)[1:]) + ["kod"]]
for row in rows:
    # Read the input data (a row)
    base = list(row[1:])
    # Create the new data by iterating over the given range
    data.extend(base + [n] for n in range(base[1], base[2] + 1))
if "Sheet2" in wb.sheetnames:
    del wb["Sheet2"]
ws_new = wb.create_sheet(title="Sheet2")
for row in data:
    ws_new.append(row)
wb.save("data.xlsx")

Create columns from a .log/.txt file - python

I am importing a .log file with pandas that looks like this
10:30:03:8600 Rx 1 0x014 9 B5 45 5B 81 95 02 50 01 0x6E (Enhanced)
10:30:04:0280 Rx 1 0x015 8 77 B9 60 AE 8C 47 E6 20 0x3A (Enhanced)
...
[93 rows x 1 columns]
So apparently everything is in one column.
What I want to do is:
Split the 1 column that I have into each column that is separated by each space " " and add a header.
For this I have tried:
df = pd.read_csv('df.log',
                 delimiter=' ',
                 names=['Time', 'Tx/Rx', 'ID', 'Temp', 'Pressure', ...])
I want to be able to read the values from B5 to 01 from the 1st row. So, after I split the one column into more columns I plan to use .iloc like df.iloc[5:12] for all the rows.
I want it to look like this, so I can easily read the data:
'ID' 'Temp', 'Pressure' ...
B5 45 5B ...
77 B9 60 ...
I have a single line solution for you:
import pandas as pd
df = pd.read_csv("df.log", delimiter=" ", names=['Time', 'Tx/Rx', 'ID', 'Temp', 'Pressure', ...])
Note that the delimiter in this case is a whitespace.
If you want to get rid of the index when printing it:
print(df.to_string(index=False))
Please try it out and tell me if it works for you :)
What you can try is applying str.split and then converting the column of lists into individual columns.
>>> df1 = df.loc[:,0].apply(lambda x: x.split())
>>> df1
0 [10:30:03:8600, Rx, 1, 0x014, 9, B5, 45, 5B, 8...
1 [10:30:04:0280, Rx, 1, 0x015, 8, 77, B9, 60, A...
Name: 0, dtype: object
>>> pd.DataFrame(df1.values.tolist(), columns=list("ABCDEFGHIJKLMNO"))
A B C D E F G H I J K L M N O
0 10:30:03:8600 Rx 1 0x014 9 B5 45 5B 81 95 02 50 01 0x6E (Enhanced)
1 10:30:04:0280 Rx 1 0x015 8 77 B9 60 AE 8C 47 E6 20 0x3A (Enhanced)
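`str.split(expand=True)` does the list-to-columns step in one call; a self-contained sketch with the two sample lines (columns 5 to 12 hold the B5..01 payload):

```python
import pandas as pd

lines = ['10:30:03:8600 Rx 1 0x014 9 B5 45 5B 81 95 02 50 01 0x6E (Enhanced)',
         '10:30:04:0280 Rx 1 0x015 8 77 B9 60 AE 8C 47 E6 20 0x3A (Enhanced)']
df = pd.DataFrame(lines)

wide = df[0].str.split(expand=True)   # 15 columns, one per whitespace-separated token
payload = wide.iloc[:, 5:13]          # the eight data bytes of each row
print(payload)
```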

Dataframe/Row Indexing for Pandas

I was wondering how I could index datasets so that a row number from df1 corresponds to a different row number in df2, e.g. row 1 in df1 = row 3 in df2.
What I would like. (In this case: row 1 2011 = row 2 2016)
Rows 49:50 of b1 in 2011 are the same item as rows 51:52 of bt in 2016 (same item, different value in different years), but it is sliced differently because it sits in a different cell in 2016.
I've been using pd.concat and pd.Series but still no success.
# slicing 2011 data (total)
b1 = df1.iloc[49:50, 6:7]
m1 = df1.iloc[127:128, 6:7]
a1 = df1.iloc[84:85, 6:7]
data2011 = pd.concat([b1, m1, a1])
# slicing 2016 data (total)
bt = df2.iloc[51:52, 6:7]
mt = df2.iloc[129:130, 6:7]
at = df2.iloc[86:87, 6:7]
data2016 = pd.concat([bt, mt, at])
data20112016 = pd.concat([data2011, data2016])
print(data20112016)
Output I'm getting:
What I need to fix. (In this case : row 49 = row 51, so 11849 in the left column and 13500 in the right coloumn)
49 11849
127 22622
84 13658
51 13500
129 25281
86 18594
I would like to do a bar graph comparing b1 2011 to bt 2016 and so on, meaning 49 = 51, 127 = 129, etc.
# Tot_x Tot_y
# 49=51 11849 13500
# 127=129 22622 25281
# 84=86 13658 18594
I hope this clear things up.
Thanks in advance.
If I understood your question correctly, here is a solution using merge:
df1 = pd.DataFrame([9337, 2953, 8184], index=[49, 127, 84], columns=['Tot'])
df2 = pd.DataFrame([13500, 25281, 18594], index=[51, 129, 86], columns=['Tot'])
total_df = (df1.reset_index()
               .merge(df2.reset_index(), left_index=True, right_index=True))
And here is one using concat:
total_df = pd.concat([df1.reset_index(), df2.reset_index()], axis=1)
And here is how to build the combined index and the resulting barplot:
total_df.index = total_df['index_x'].astype(str) + '=' + total_df['index_y'].astype(str)
total_df
# index_x Tot_x index_y Tot_y
# 49=51 49 9337 51 13500
# 127=129 127 2953 129 25281
# 84=86 84 8184 86 18594
(total_df.drop(['index_x', 'index_y'], axis=1)
.plot(kind='bar', rot=0))

pandas: ValueError: Can only compare identically-labeled Series objects

I have 2 csv files as below. I want to find whether an individual performance (in df1) is above/below the class average (in df2), using some compare function after finding their values.
df1:
Name Class Test1 Test2 Test3
John 9A 75 83 77
David 9B 65 67 55
Peter 9A 85 90 88
Tom 9C 74 92 78
df2:
Class Test1 Test2 Test3
9A 80 82 84
9B 84 75 77
9C 75 78 80
Here's my method, feel free to correct/guide me if I'm wrong. I first find the Class of an individual in df1, e.g., John is 9A, then return the other columns such as Test1 or Test2 in df2 based on 9A
target_class = df1.loc[df1['Name'] == 'John', 'Class']
print(target_class)
>>>>9A
Test1_avg = df2.loc[df2['Class'] == target_class, 'Test1']
# ideally it should return 80
And I got this ValueError: Can only compare identically-labeled Series objects
Or simply, how would you compare John's Test1 in df1 vs Class 9A's Test1 in df2? Is there any easier method than mine? Thanks for your help!
Update: I'll then use a compare function like this to return a score if it fulfills the criteria
def comparison(a, b):
    return 2 if a > b else 1 if a == b else -1
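The ValueError comes from comparing a one-element Series against a differently-indexed Series; pulling the scalar out with `.iloc[0]` avoids it. A sketch on the sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Peter'], 'Class': ['9A', '9A'],
                    'Test1': [75, 85]})
df2 = pd.DataFrame({'Class': ['9A', '9B'], 'Test1': [80, 84]})

# .iloc[0] turns the one-element selection into a plain scalar
target_class = df1.loc[df1['Name'] == 'John', 'Class'].iloc[0]
test1_avg = df2.loc[df2['Class'] == target_class, 'Test1'].iloc[0]

def comparison(a, b):
    return 2 if a > b else 1 if a == b else -1

print(comparison(75, test1_avg))   # John's 75 vs the 9A average 80 -> -1
```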
This is one way via pandas.merge.
# rename df2 columns
df2 = df2.rename(columns={'Test'+str(x): 'AvgTest'+str(x) for x in range(1, 4)})
# left merge df1 on df2
res = pd.merge(df1, df2, how='left', on=['Class'])
# calculate comparison results
comparison = pd.DataFrame(res.loc[:, res.columns.str.startswith('Test')].values >=
                          res.loc[:, res.columns.str.startswith('AvgTest')].values,
                          columns=['Comp' + str(x) for x in range(1, 4)])
# join results to dataframe
res = res.join(comparison)
print(res)
# Name Class Test1 Test2 Test3 AvgTest1 AvgTest2 AvgTest3 Comp1 \
# 0 John 9A 75 83 77 80 82 84 False
# 1 David 9B 65 67 55 84 75 77 False
# 2 Peter 9A 85 90 88 80 82 84 True
# 3 Tom 9C 74 92 78 75 78 80 False
# Comp2 Comp3
# 0 True False
# 1 False False
# 2 True True
# 3 True False

Annotating one dataFrame using another dataFrame, if there is matching value in column

I have two DataFrames. I want to first look for matching values between col1 of DataFrame1 and col1 of DataFrame2, and then print out all the columns from DataFrame1 with additional columns from DataFrame2.
For example, I have tried the following:
data = 'file_1'
Up = pd.DataFrame.from_csv(data, sep='\t')
Up = Up.reset_index(drop=False)
Up.head()
Gene_id baseMean log2FoldChange lfcSE stat pvalue padj
0 ENSG.16 176.275036 0.9475260059 0.4310373793 2.1982455617 0.0279316115 0.198658
1 ENSG.10 80.199435 0.4349592748 0.2691551416 1.6160169639 0.1060906455 0.369578
2 ENSG.15 1649.400749 -0.0215428237 0.1285061198 -0.1676404495 0.8668661474 0.947548
3 ENSG.10 25507.767530 0.5145516695 0.2473335499 2.0803957642 0.0374892475 0.229378
4 ENSG.12 70.122885 -0.2612483888 0.2593848667 -1.00718439
and the second dataframe is,
mydata = 'file_2'
annon = pd.DataFrame.from_csv(mydata, sep='\t')
annon = annon.reset_index(drop=False)
annon.head()
Gene_id sam_1 sam2 sam3 sam4 sam5 sam6 sam7 sam8 sam9 sam10 sam11
0 ENSG.16 404 55 33 39 102 43 193 244 600 174 120
1 ENSG.10 58 89 110 69 64 48 61 81 98 75 119
2 ENSG.15 1536 1246 2540 1751 1850 2137 1460 1362 2158 1367 1320
3 ENSG.10 28508 23073 19982 13821 20355 28835 26875 25632 27131 30991 29351
4 ENSG.12 87 81 121 67 98 47 37 59 68 44 81
and following is what i tried so far,
x=pd.merge(Up[['Gene_id' , 'log2FoldChange ', 'pvalue ' , 'padj']] , annon , on = 'Gene_id')
x.head()
Gene_id log2FoldChange pvalue padj sam_1 sam2 sam3 sam4 sam5 sam6 sam7 sam8 sam9 sam10 sam11
It's just giving me the header of the file and nothing else.
So I looked at file1 (Up) for one row value, like the following.
This is what I am getting:
print(Up.loc[Up['Gene_id'] =='ENSG.16'])
Empty DataFrame
Columns: [Gene_id, baseMean , log2FoldChange , lfcSE , stat , pvalue , padj]
Index: []
But in fact it is not empty; that row has values in dataframe Up.
Any solutions would be great!
pd.merge(df1[['Gene_Id' , 'log2FoldChange', 'pvalue' , 'padj']] , df2 , left_on='Gene_Id' , right_on= 'Gene_id')
you can then easily drop Gene_id if you want
Hope this helps you.
Let me know if it works.
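The empty merge and the empty `.loc` result in the question both look like stray whitespace in the tab-separated file (note the padded names such as `'log2FoldChange '` in the column list). Stripping it first should make the lookups work; a sketch on toy data:

```python
import pandas as pd

# Toy frame with the kind of padded headers/values a tab-separated file can produce
up = pd.DataFrame({'Gene_id ': ['ENSG.16 ', 'ENSG.10 '],
                   'log2FoldChange ': [0.94, 0.43]})

up.columns = up.columns.str.strip()         # 'Gene_id ' -> 'Gene_id'
up['Gene_id'] = up['Gene_id'].str.strip()   # 'ENSG.16 ' -> 'ENSG.16'

print(up.loc[up['Gene_id'] == 'ENSG.16'])   # no longer empty
```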
import pandas as pd
# creating test Dataframe1
df = pd.DataFrame(['ENSG1', 162.315169869338, 0.920583258294463, 0.260406974056691, 3.53517128959092, 0.000407510906151687, 0.0176112964515702])
df=df.T
# the important thing is to make column 0 its index
df.index=df[0]
print(df)
# creating test Dataframe2
df2 = pd.DataFrame(['ENSG1', 404, 55, 33, 39, 102, 43, 193, 244, 600, 174, 120])
df2=df2.T
# the important thing is to make column 0 its index
df2.index=df2[0]
print(df2)
# concatenate both frames using axis=1 (outer or inner as per your need)
x = pd.concat([df,df2],axis=1,join='outer')
print(x)
