This is a follow-up to a previous question here:
combinations of two columns
I'm trying to take one dataframe and create another with all possible combinations of 3 columns together and the difference between the corresponding values, i.e. on the 11-apr row, column ABC should be (2*B - A - C) = 0, then ABD = (2*B - A - D) = 0, and so on.
E.g., starting with:
Dt A B C D
11-apr 1 1 1 1
10-apr 2 3 1 2
How do I get a new frame that looks like the output below?
I think you need:
from itertools import combinations

# all 3-column combinations of the value columns ('Dt' is the index)
cc = list(combinations(df.columns, 3))
# for each combination (a, b, c), compute 2*b - c - a
df = pd.concat([df[c[1]].mul(2).sub(df[c[2]]).sub(df[c[0]]) for c in cc], axis=1, keys=cc)
df.columns = df.columns.map(''.join)
print (df)
ABC ABD ACD BCD
Dt
11-apr 0 0 0 0
10-apr 3 2 -2 -3
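For reference, a minimal sketch to construct the sample frame above (the question doesn't show this step, so the construction is assumed):
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [1, 3], 'C': [1, 1], 'D': [1, 2]},
                  index=pd.Index(['11-apr', '10-apr'], name='Dt'))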
I currently have a dataframe of countries by series, with values ranging from 0 to 25.
I want to sort the df so that the highest values appear in the top left (first), while the lowest appear in the bottom right (last).
FROM
A B C D ...
USA 4 0 10 16
CHN 2 3 13 22
UK 2 1 8 14
...
TO
D C A B ...
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
...
In this, the column with the highest values is now first, and the same is true with the index.
I have considered reindexing, but this loses the 'Countries' Index.
D C A B ...
0 22 13 2 3
1 16 10 4 0
2 14 8 2 1
...
I have thought about creating a new column and row that hold the mean or sum of values for each respective column/row, but is this the most efficient way?
How would I then sort the DF after I have the new rows/columns?
Is there a way to reindex using...
df_mv.reindex(df_mv.mean().sort_values(ascending=False).index, axis=1)  # or .sum()
... that would allow me to keep the country index, and simply sort it accordingly?
Thanks for any and all advice or assistance.
EDIT
The intended result organizes columns AND rows from largest to smallest.
Regarding the first row of the A and B columns in the intended output, these are supposed to be 2, 3 respectively. This is because the intended result interprets the A column as greater than the B column in both sum and mean (even though either sum or mean can be considered for the 'value' of a row/column).
By saying the higher numbers would be in the top left, while the lower ones would be in the bottom right, I simply meant this as a general trend for the resulting df. It is the columns and rows as whole however, that are the intended focus. I apologize for the confusion.
You could use:
# order rows by their row-wise max, columns by their column-wise max
rows_index = df.max(axis=1).sort_values(ascending=False).index
col_index = df.max().sort_values(ascending=False).index
new_df = df.loc[rows_index, col_index]
print(new_df)
D C A B
CHN 22 13 2 3
USA 16 10 4 0
UK 14 8 2 1
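The reindex approach from the question also works and keeps the country index; a sketch ranking rows and columns by their means (swap .mean() for .sum() if preferred), which reproduces the output above:
df = df.reindex(df.mean(axis=1).sort_values(ascending=False).index)  # rows
df = df.reindex(df.mean().sort_values(ascending=False).index, axis=1)  # columns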
Use .T to transpose rows to columns and vice versa:
df = df.sort_values(df.max().idxmax(), ascending=False)  # sort rows by the column that holds the overall max
df = df.T
df = df.sort_values(df.columns[0], ascending=False).T  # sort columns by the (now first) row, then transpose back
Result:
>>> df
D C B A
CHN 22 13 3 2
USA 16 10 0 4
UK 14 8 1 2
Here's another way, this time without transposing but using axis=1 as an argument:
df = df.sort_values(df.max().idxmax(), ascending=False)
df = df.sort_values(df.index[0], axis=1, ascending=False)
Using numpy (note that sorting within each row discards the original column alignment, so the row labels have to be reordered along with the data and the column labels become nominal):
arr = df.to_numpy()
order = arr.max(axis=1).argsort()[::-1]  # row order by descending row max
arr = np.sort(arr[order, :], axis=1)[:, ::-1]  # sort each row descending
df1 = pd.DataFrame(arr, index=df.index[order], columns=df.columns)
print(df1)
Output:
A B C D
CHN 22 13 3 2
USA 16 10 4 0
UK 14 8 2 1
If I have the following dataframe:
A B C D E
1 1 2 0 1 0
2 0 0 0 1 -1
3 1 1 3 -5 2
4 -3 4 2 6 0
5 2 4 1 9 -1
T 1 2 2 4 1
The last row holds my threshold values for each column. I want to count, for each column, how many values are below that column's threshold, using pandas.
Desired output is:
A B C D E
Count 2 2 3 3 4
But I need a general solution, not one written for these specific columns, because I have a large dataset and cannot spell out a column name for each of them in the code.
Could you please help me with this?
Select all rows except the last by positional indexing, compare with DataFrame.lt against the last row, then sum and convert the resulting Series to a one-row DataFrame with Series.to_frame plus a transpose with DataFrame.T:
df = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame('count').T
print (df)
A B C D E
count 2 2 3 3 4
Numpy alternative with DataFrame constructor:
arr = df.values
df = pd.DataFrame([np.sum(arr[:-1] < arr[-1], axis=0)], columns=df.columns, index=['count'])
print (df)
A B C D E
count 2 2 3 3 4
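If the threshold row is labeled 'T' as in the sample, an equivalent label-based variant (a sketch):
thresh = df.loc['T']  # select the threshold row by its label
df1 = df.drop('T').lt(thresh).sum().to_frame('count').T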
I have a dataframe df with the shape (4573,64) that I'm trying to pivot. The last column is an 'id' with two possible string values 'old' and 'new'. I would like to set the first 63 columns as index and then have the 'id' column across the top with values being the count of 'old' or 'new' for each index row.
I've created a list object, named cols, out of the column labels that I want as the index.
I tried the following:
df.pivot(index=cols, columns='id')['id']
This gives an error: 'all arrays must be same length'.
I also tried the following to see if I could get a sum, but no luck either:
pd.pivot_table(df, index=cols, values=['id'], aggfunc=np.sum)
Any ideas greatly appreciated.
I found a thread online about a possible bug in pandas 0.23.0 where pandas.pivot_table() will not accept a MultiIndex as long as it contains NaNs (link to GitHub in the comments). My workaround was to do
df.fillna('empty', inplace=True)
then the solution below:
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
as proposed by jezrael works as intended, hence the accepted answer.
I believe you need to convert the column names to a list and then aggregate the size with unstack:
df = pd.DataFrame({'B':[4,4,4,5,5,4],
                   'C':[1,1,9,4,2,3],
                   'D':[1,1,5,7,1,0],
                   'E':[0,0,6,9,2,4],
                   'id':list('aaabbb')})
print (df)
B C D E id
0 4 1 1 0 a
1 4 1 1 0 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
cols = df.columns.tolist()
cols.remove('id')  # group on every column except 'id', which becomes the counted key
df1 = df.groupby(cols + ['id']).size().unstack(fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
Solution with pivot_table:
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
This is my code below; I've carried out a groupby and count on a large dataframe:
group3=charity.groupby(['Split', 'B4']).size()
group3=group3.reset_index()
Output:
Split B4 0
0 CRuk No 193
1 CRuk Yes 7
2 LLR No 184
3 LLR Yes 15
4 MUK No 188
5 MUK Yes 12
6 MCUK No 186
7 MCUK Yes 14
The code below does not work:
group3=group3.rename(columns={"0": "count1"})
group3 #does not work
I want to rename the new 0 column created by the count so that I can add other columns to the dataframe for a chi-square test.
I'm also wondering how I can run the count on multiple columns in addition to the B4 column above.
Use the name parameter in reset_index:
group3=charity.groupby(['Split', 'B4']).size().reset_index(name='count1')
Or rename with the integer 0 as the key; the column created by size() is labeled with the integer 0, not the string "0", which is why the rename above did not work:
group3 = group3.rename(columns={0: "count1"})
Sample:
charity = pd.DataFrame({'B4':list('abbbbb'),
                        'Split':list('aaabbb')})
print (charity)
B4 Split
0 a a
1 b a
2 b a
3 b b
4 b b
5 b b
group3=charity.groupby(['Split', 'B4']).size().reset_index(name='count1')
print (group3)
Split B4 count1
0 a a 1
1 a b 2
2 b b 3
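For the second part of the question, counting on multiple columns just means adding them to the groupby keys; a sketch assuming a hypothetical extra column 'B5':
group3 = charity.groupby(['Split', 'B4', 'B5']).size().reset_index(name='count1')  # 'B5' is hypothetical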
I searched a lot for an answer; the closest question was Compare 2 columns of 2 different pandas dataframes, if the same insert 1 into the other in Python, but the answer to that person's particular problem was a simple merge, which doesn't answer the question in a general way.
I have two large dataframes, df1 (typically about 10 million rows), and df2 (about 130 million rows). I need to update values in three columns of df1 with values from three columns of df2, based on two df1 columns matching two df2 columns. It is imperative that the order of df1 remains unchanged, and that only rows with matching values get updated.
This is what the dataframes look like:
df1
chr snp x pos a1 a2
1 1-10020 0 10020 G A
1 1-10056 0 10056 C G
1 1-10108 0 10108 C G
1 1-10109 0 10109 C G
1 1-10139 0 10139 C T
Note that it's not always the case that the value of "snp" is chr-pos; it can take many other values with no link to any of the columns (like rs1234, indel-6032, etc.).
df2
ID CHR STOP OCHR OSTOP
rs376643643 1 10040 1 10020
rs373328635 1 10066 1 10056
rs62651026 1 10208 1 10108
rs376007522 1 10209 1 10109
rs368469931 3 30247 1 10139
I need to update ['snp', 'chr', 'pos'] in df1 with df2[['ID', 'OCHR', 'OSTOP']] only when df1[['chr', 'pos']] matches df2[['OCHR', 'OSTOP']]
So in this case, after the update, df1 would look like:
chr snp x pos a1 a2
1 rs376643643 0 10040 G A
1 rs373328635 0 10066 C G
1 rs62651026 0 10208 C G
1 rs376007522 0 10209 C G
3 rs368469931 0 30247 C T
I have used merge as a workaround:
df1 = pd.merge(df1, df2, how='left', left_on=["chr", "pos"], right_on=["OCHR", "OSTOP"],
left_index=False, right_index=False, sort=False)
and then
df1.loc[~df1.OCHR.isnull(), ["snp", "chr", "pos"]] = df1.loc[~df1.OCHR.isnull(), ["ID", "CHR", "STOP"]].values
and then remove the extra columns.
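The removal step being something like (a sketch naming the merge's extra columns explicitly):
df1 = df1.drop(['ID', 'CHR', 'STOP', 'OCHR', 'OSTOP'], axis=1)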
Yes, it works, but what would be a way to do this directly, by comparing the values of both dataframes? I just don't know how to formulate it, and I couldn't find an answer anywhere; I guess a general answer to this could be useful.
I tried this, but it doesn't work:
df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP), ["snp", "chr", "pos"]] = df2.loc[df2[['OCHR', 'OSTOP']] == df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP), ["chr", "pos"]], ['ID', 'CHR', 'STOP']].values
Thanks,
Stephane
You can use the update function (requires setting the matching criteria to index). I've modified your sample data to allow some mismatch.
# your data
# =====================
# df1 pos is modified from 10020 to 10010
print(df1)
chr snp x pos a1 a2
0 1 1-10020 0 10010 G A
1 1 1-10056 0 10056 C G
2 1 1-10108 0 10108 C G
3 1 1-10109 0 10109 C G
4 1 1-10139 0 10139 C T
print(df2)
ID CHR STOP OCHR OSTOP
0 rs376643643 1 10040 1 10020
1 rs373328635 1 10066 1 10056
2 rs62651026 1 10208 1 10108
3 rs376007522 1 10209 1 10109
4 rs368469931 3 30247 1 10139
# processing
# ==========================
# set matching columns to multi-level index
x1 = df1.set_index(['chr', 'pos'])['snp']
x2 = df2.set_index(['OCHR', 'OSTOP'])['ID']
# call update function, this is inplace
x1.update(x2)
# replace the values in original df1
df1['snp'] = x1.values
print(df1)
chr snp x pos a1 a2
0 1 1-10020 0 10010 G A
1 1 rs373328635 0 10056 C G
2 1 rs62651026 0 10108 C G
3 1 rs376007522 0 10109 C G
4 1 rs368469931 0 10139 C T
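The question asks to update 'chr' and 'pos' as well; here is a sketch extending the same update idea to all three columns (it assumes the (OCHR, OSTOP) pairs in df2 are unique, and note that update's NaN-based alignment may upcast integer columns to float):
# align df2's replacement columns with df1's names, keyed on the matching pair
m = df2.set_index(['OCHR', 'OSTOP'])[['ID', 'CHR', 'STOP']]
m.columns = ['snp', 'chr', 'pos']
# drop=False keeps 'chr'/'pos' as columns so they can be overwritten too
tmp = df1.set_index(['chr', 'pos'], drop=False)[['snp', 'chr', 'pos']]
tmp.update(m)  # in place, matched on the (chr, pos) index
df1[['snp', 'chr', 'pos']] = tmp.to_numpy()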
Start by renaming the columns you want to merge on in df2:
df2.rename(columns={'OCHR':'chr','OSTOP':'pos'},inplace=True)
Now merge on these columns
df_merged = pd.merge(df1, df2, how='inner', on=['chr', 'pos']) # you might have to preserve the df1 index at this stage, not sure
Next, build the update frame:
updater = df_merged[['ID', 'CHR', 'STOP']]  # this will be your update frame
updater.rename(columns={'ID': 'snp', 'CHR': 'chr', 'STOP': 'pos'}, inplace=True)  # rename columns to match df1
Finally, update (see the DataFrame.update documentation):
df1.update(updater)  # updates in place
# chr snp x pos a1 a2
#0 1 rs376643643 0 10040 G A
#1 1 rs373328635 0 10066 C G
#2 1 rs62651026 0 10208 C G
#3 1 rs376007522 0 10209 C G
#4 3 rs368469931 0 30247 C T
update works by matching index/columns, so you might have to carry df1's index along through the merge, then do updater = updater.reindex(df1.index) before df1.update(updater).
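A sketch of that index-preserving variant (with df2 already renamed as above; assuming df1 starts with an unnamed, unique index, so reset_index creates a column called 'index'):
# carry df1's original index through the merge, then restore it
df_merged = pd.merge(df1.reset_index(), df2, how='inner',
                     on=['chr', 'pos']).set_index('index')
updater = df_merged[['ID', 'CHR', 'STOP']]
updater.columns = ['snp', 'chr', 'pos']  # match df1's column names
df1.update(updater)  # rows align on the preserved index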