Rows not sorted after concat - python

In the code below I'm attempting to reorder a portion of a dataframe and then join it back with another portion. The sort itself works, but when I run the last line the result comes back in the original, unsorted order. Can anyone help with this?
Code
copied = frame[frame['PLAYVAL'].isin([3,4])].copy()
copied_col = copied['PLAY_EVT']
copied = copied.drop(columns=['PLAY_EVT'],axis=1)
copied = copied.sort_values(['TIME_ELAPSED','SHOTVAL'],ascending=[True,True]).copy()
result = pd.concat([copied_col,copied],axis=1)
Frame
   PLAY_EVT  TIME_ELAPSED  INFO  SHOTVAL
0         1           132  1of2        2
1         2           132  2of2        3
2         3           342  3of3        6
3         4           342  2of3        5
4         5           342  1of3        4
5         6           786  2of2        3
6         7           786  1of2        2
Expected Outcome
   PLAY_EVT  TIME_ELAPSED  INFO  SHOTVAL
0         1           132  1of2        2
1         2           132  2of2        3
2         3           342  1of3        4
3         4           342  2of3        5
4         5           342  3of3        6
5         6           786  1of2        2
6         7           786  2of2        3

Figured it out. It was the indexes: pd.concat(axis=1) aligns rows on their index labels, so the sorted rows were being matched back to their original positions. Resetting the index on both pieces fixes it.
copied = frame[frame['PLAYVAL'].isin([3,4])]
copied_col = copied['PLAY_EVT'].reset_index(drop=True)
copied = copied.drop(columns=['PLAY_EVT'],axis=1)
copied = copied.sort_values(['TIME_ELAPSED','SHOTVAL'],ascending=[True,True]).reset_index(drop=True)
result = pd.merge(copied_col, copied, left_index=True, right_index=True)
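For the same reason, the original pd.concat call also works once both pieces carry a fresh 0..n-1 index, because concat(axis=1) aligns rows on index labels rather than position. A minimal sketch of that variant (same assumed frame and column names as above):
copied = frame[frame['PLAYVAL'].isin([3,4])]
copied_col = copied['PLAY_EVT'].reset_index(drop=True)    # 0..n-1 in the original order
copied = copied.drop(columns=['PLAY_EVT'])
copied = copied.sort_values(['TIME_ELAPSED','SHOTVAL']).reset_index(drop=True)    # 0..n-1 in sorted order
result = pd.concat([copied_col, copied], axis=1)    # the two indexes now line up positionally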

Related

Count consecutive numbers from a column of a dataframe in Python

I have a dataframe that has segments of consecutive values appearing in column a (the value in column b does not matter):
import pandas as pd
import numpy as np
np.random.seed(150)
df = pd.DataFrame(data={'a':[1,2,3,4,5,15,16,17,18,203,204,205],'b':np.random.randint(50000,size=(12))})
>>> df
a b
0 1 27066
1 2 28155
2 3 49177
3 4 496
4 5 2354
5 15 23292
6 16 9358
7 17 19036
8 18 29946
9 203 39785
10 204 15843
11 205 21917
I would like to add a column c whose values count up sequentially within each run of consecutive values in column a, as shown below:
a b c
1 27066 1
2 28155 2
3 49177 3
4 496 4
5 2354 5
15 23292 1
16 9358 2
17 19036 3
18 29946 4
203 39785 1
204 15843 2
205 21917 3
How to do this?
One solution:
df["c"] = (s := df["a"] - np.arange(len(df))).groupby(s).cumcount() + 1
print(df)
Output
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
The original idea comes from an old recipe in the Python docs for grouping runs of consecutive numbers by differencing with a counter.
The walrus operator (:=, assignment expressions) requires Python 3.8+; on earlier versions you can instead do:
s = df["a"] - np.arange(len(df))
df["c"] = s.groupby(s).cumcount() + 1
print(df)
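To see why grouping by s works: subtracting a running counter collapses every consecutive run in column a to a single constant value, so cumcount restarts at each run. For the sample df above:
s = df["a"] - np.arange(len(df))
print(s.tolist())
# [1, 1, 1, 1, 1, 10, 10, 10, 10, 194, 194, 194]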
A simple alternative is to flag the rows that continue a consecutive run, take a cumsum to get a running count, and then subtract whatever was accumulated before each run started:
a = df['a'].add(1).shift(1).eq(df['a'])    # True where the previous value + 1 equals the current value
df['c'] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int) + 1    # restart the count at each run
df
Result:
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
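The same consecutive-run flag can also be turned into explicit group ids and handed to groupby, which some readers may find easier to follow; a sketch that should give the same result:
a = df['a'].add(1).shift(1).eq(df['a'])    # True where the row continues a consecutive run
grp = (~a).cumsum()                        # a new group id starts at every run break
df['c'] = df.groupby(grp).cumcount() + 1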

How to transpose and merge two dataframes with pandas - obtaining a key error

I have a data frame (file1.txt) like this:
identifer 1 2 3
Fact1 494 43 3
Fact2 383 32 5
Fact3 384 23 5
Fact4 382 21 7
And another data frame (file2.txt) like this:
Sample Char1 Char2 Char3
1 4 5 5
2 5 2 4
3 5 6 2
4 2 4 4
the output should look like this:
Sample Fact1 Fact2 Fact3 Char1 Char2 Char3
1 494 383 384 4 5 5
2 43 32 5 5 2 4
3 384 23 5 5 6 2
I wrote:
#to transpose Table1
df = pd.read_csv('file1.txt', sep='\t',header=0)
df2 = df.T
#To read in table 2
df3 = pd.read_csv('file2.txt', sep='\t',header=0)
df4 = df2.merge(df3,left_on = 'identifier',right_on='Sample',how='inner')
print(df4)
And I'm getting the error: 'KeyError: identifier'
When I print the columns of df2 (the transposed data set), I can see that the columns are just the first row of data rather than the header, and that the identifier row is listed as the last row of the transposed matrix. Could someone explain to me how to transpose and merge these data frames? I was trying to follow an SO answer that said to .set_index() and then transpose, but when I do df2 = df.set_index('identifier').T I get the same error. Following another SO suggestion, I changed from merge to join and did df2.join(df3.set_index['Sample'], on='identifier'), but then I get a different error ('method' object is not subscriptable), so I'm stuck and would appreciate insight.
A slightly modified version of the previous answer:
out = (
    df1.set_index('identifer').T.join(df2.astype({'Sample': str}).set_index('Sample'))
       .rename_axis('Sample').reset_index().astype({'Sample': int})
)
print(out)
# Output:
Sample Fact1 Fact2 Fact3 Fact4 Char1 Char2 Char3
0 1 494 383 384 382 4 5 5
1 2 43 32 23 21 5 2 4
2 3 3 5 5 7 5 6 2
Note: you need to cast the Sample column to string because the columns of file1 (which become the index after transposing) are read in as strings. To perform the join, both indexes must have the same dtype.
You can also set the index first:
df1 = df1.set_index('identifer')
df1.columns = df1.columns.astype(float)
out = df1.T.join(df2.set_index('Sample'))#.reset_index()
Out[82]:
Fact1 Fact2 Fact3 Fact4 Char1 Char2 Char3
1.0 494 383 384 382 4 5 5
2.0 43 32 23 21 5 2 4
3.0 3 5 5 7 5 6 2
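If Sample is also needed back as an integer column, as in the first answer's output, something along these lines should work:
out = out.rename_axis('Sample').reset_index().astype({'Sample': int})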

Pandas or other Python application to generate a column with 1 to n value based on other two columns with rules

Hope I can explain the question properly.
In basic terms, imagine the df below:
print(df)
year id
1 16100
1 150
1 150
2 66
2 370
2 370
2 530
3 41
3 43
3 61
I need df.seq to be a 1-to-n counter that keeps counting up while the year stays the same and restarts when it changes.
df.seq2 should stay at n, instead of moving to n+1, when the id in the row above is identical.
As an Excel-style formula it would be something like
df.seq2 = IF(A2=A1,IF(B2=B1,F1,F1+1),1)
which would make the desired output seq and seq2 below:
year id seq seq2
1 16100 1 1
1 150 2 2
1 150 3 2
2 66 1 1
2 370 2 2
2 370 3 2
2 530 4 3
3 41 1 1
3 43 2 2
3 61 3 3
I did test a couple of things (assuming I've already generated df.seq):
comb_df['match'] = comb_df.year.eq(comb_df.year.shift())
comb_df['match2'] = comb_df.id.eq(comb_df.id.shift())
comb_df["seq2"] = np.where((comb_df["match"].shift(+1) == True) & (comb_df["match2"].shift(+1) == True), comb_df["seq"] - 1, comb_df["seq2"])
But the problem is that this doesn't really work out when there are multiple duplicates in a row, etc.
Perhaps the issue can't be solved in a purely vectorized numpy way and I'd have to iterate over the rows?
There are 2-3 million rows, so performance will be an issue if the solution is very slow.
I need to generate both df.seq and df.seq2.
Any ideas would be extremely helpful!
We can do groupby with cumcount and factorize
df['seq'] = df.groupby('year').cumcount()+1
df['seq2'] = df.groupby('year')['id'].transform(lambda x : x.factorize()[0]+1)
df
Out[852]:
year id seq seq2
0 1 16100 1 1
1 1 150 2 2
2 1 150 3 2
3 2 66 1 1
4 2 370 2 2
5 2 370 3 2
6 2 530 4 3
7 3 41 1 1
8 3 43 2 2
9 3 61 3 3
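If duplicate ids always sit on consecutive rows, as the Excel formula assumes, the Python-level lambda can be avoided entirely, which may matter at 2-3 million rows. A sketch under that assumption:
df['seq'] = df.groupby('year').cumcount() + 1
new_id = df['year'].ne(df['year'].shift()) | df['id'].ne(df['id'].shift())    # row starts a new (year, id) run
df['seq2'] = new_id.astype(int).groupby(df['year']).cumsum()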

Most efficient way to pass data from one pandas DataFrame to another

I'm trying to find a more efficient way of transferring information from one DataFrame to another by iterating rows. I have 2 DataFrames, one containing unique values called 'id' in a column and a value called 'region' in another column:
from pandas import DataFrame

dfkey = DataFrame({'id':[1122,3344,3467,1289,7397,1209,5678,1792,1928,4262,9242],
'region': [1,2,3,4,5,6,7,8,9,10,11]})
id region
0 1122 1
1 3344 2
2 3467 3
3 1289 4
4 7397 5
5 1209 6
6 5678 7
7 1792 8
8 1928 9
9 4262 10
10 9242 11
...the other DataFrame contains these same ids, but now sometimes repeated and without any order:
df2 = DataFrame({'id':[1792,1122,3344,1122,3467,1289,7397,1209,5678],
'other': [3,2,3,4,3,5,7,3,1]})
id other
0 1792 3
1 1122 2
2 3344 3
3 1122 4
4 3467 3
5 1289 5
6 7397 7
7 1209 3
8 5678 1
I want to use the dfkey DataFrame as a key to input the region of each id in the df2 DataFrame. I already found a way to do this with iterrows(), but it involves nested loops:
df2['region']=0
for i, rowk in dfkey.iterrows():
    for j, rowd in df2.iterrows():
        if rowk['id'] == rowd['id']:
            rowd['region'] = rowk['region']
id other region
0 1792 3 8
1 1122 2 1
2 3344 3 2
3 1122 4 1
4 3467 3 3
5 1289 5 4
6 7397 7 5
7 1209 3 6
8 5678 1 7
The actual dfkey I have has 43K rows and the df2 600K rows. The code has been running for an hour now so I'm wondering if there's a more efficient way of doing this...
pandas.merge could be another solution.
newdf = pandas.merge(df2, dfkey, on='id')
In [22]: newdf
Out[22]:
id other region
0 1792 3 8
1 1122 2 1
2 1122 4 1
3 3344 3 2
4 3467 3 3
5 1289 5 4
6 7397 7 5
7 1209 3 6
8 5678 1 7
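Note that an inner merge can reorder the rows (compare with the original df2 above); if the original row order matters, a left merge should preserve it:
newdf = pandas.merge(df2, dfkey, on='id', how='left')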
I would use map() method:
In [268]: df2['region'] = df2['id'].map(dfkey.set_index('id').region)
In [269]: df2
Out[269]:
id other region
0 1792 3 8
1 1122 2 1
2 3344 3 2
3 1122 4 1
4 3467 3 3
5 1289 5 4
6 7397 7 5
7 1209 3 6
8 5678 1 7
Timing for 900K rows df2 DF:
In [272]: df2 = pd.concat([df2] * 10**5, ignore_index=True)
In [273]: df2.shape
Out[273]: (900000, 3)
In [274]: dfkey.shape
Out[274]: (11, 2)
In [275]: %timeit df2['id'].map(dfkey.set_index('id').region)
10 loops, best of 3: 176 ms per loop

how to sort a column and group them on pandas?

I am new to pandas. I am trying to sort a column and group the rows by their numbers.
df = pd.read_csv("12Patients150526 mutations-ORIGINAL.txt", sep="\t", header=0)
samp=df["SAMPLE"]
samp
Out[3]:
0 11
1 2
2 9
3 1
4 8
5 2
6 1
7 3
8 10
9 4
10 5
..
53157 12
53158 3
53159 2
53160 10
53161 2
53162 3
53163 4
53164 11
53165 12
53166 11
Name: SAMPLE, dtype: int64
#sorting
grp=df.sort(samp)
This code does not work. Can somebody help me with my problem, please?
How can I sort and group them by their numbers?
To sort df based on a particular column, use df.sort() and pass the column name as a parameter (in pandas 0.17+ this became df.sort_values()).
import pandas as pd
import numpy as np
# data
# ===========================
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1,10,1000), columns=['SAMPLE'])
df
SAMPLE
0 6
1 1
2 4
3 4
4 8
5 4
6 6
7 3
.. ...
992 3
993 2
994 1
995 2
996 7
997 4
998 5
999 4
[1000 rows x 1 columns]
# sort
# ======================
df.sort('SAMPLE')
SAMPLE
310 1
710 1
935 1
463 1
462 1
136 1
141 1
144 1
.. ...
174 9
392 9
386 9
382 9
178 9
772 9
890 9
307 9
[1000 rows x 1 columns]
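The question also asks how to group the rows by their numbers. What the grouping should produce isn't specified, but if the goal is to handle all rows sharing a SAMPLE value together, groupby on the sorted frame is the usual tool (sort_values is the modern replacement for df.sort). A small sketch using the df built above:
df_sorted = df.sort_values('SAMPLE')    # modern equivalent of df.sort('SAMPLE')
for value, group in df_sorted.groupby('SAMPLE'):
    print(value, len(group))            # number of rows for each SAMPLE value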
