Join two columns in Pandas, even when both of them are null - python

I have a dataset with a routing list:
| order  | point | city | boxes | pallets   |
| ------ | ----- | ---- | ----- | --------- |
| o12345 | 1     | X    | b0    | p0,p1     |
| o12345 | 2     | Y    | -     | p2,p3,p4  |
| o12345 | 3     | Z    | b1    | -         |
| o34567 | 1     | Q    | -     | -         |
| o34567 | 2     | W    | b2,b3 | p5,p6     |
| o34567 | 3     | E    | -     | p7        |
| o34567 | 4     | R    | b4,b5 | p8,p9,p10 |
How can I join the columns "boxes" and "pallets" into a "cargo" column that lists both the boxes and the pallets, and then explode that column to get each value in a separate row?
import pandas as pd
df=pd.read_excel('example.xlsx')
df['cargo'] = df['pallets']+','+ df['boxes']
print(df)
But it doesn't work with null values :(
First I expect to get the combined "cargo" column, and then to explode only "cargo" so that each value ends up in its own row.

Here is an approach using df.explode()
# Join the two columns, skipping empty values (this assumes the blanks are empty strings)
df['cargo'] = (df[['boxes', 'pallets']]
               .apply(lambda x: ','.join([i for i in x if i]), axis=1))
df = df.drop(['boxes', 'pallets'], axis=1)
print(df)
order point city cargo
0 o12345 1 X b0,p0,p1
1 o12345 2 Y p2,p3,p4
2 o12345 3 Z b1
3 o34567 1 Q
4 o34567 2 W b2,b3,p5,p6
5 o34567 3 E p7
6 o34567 4 R b4,b5,p8,p9,p10
# Split into lists and explode so each cargo item gets its own row
df['cargo'] = df['cargo'].str.split(',')
df = (df.explode('cargo').sort_values(by=['order', 'point']))
print(df)
order point city cargo
0 o12345 1 X b0
0 o12345 1 X p0
0 o12345 1 X p1
1 o12345 2 Y p2
1 o12345 2 Y p3
1 o12345 2 Y p4
2 o12345 3 Z b1
3 o34567 1 Q
4 o34567 2 W b2
4 o34567 2 W b3
4 o34567 2 W p5
4 o34567 2 W p6
5 o34567 3 E p7
6 o34567 4 R b4
6 o34567 4 R b5
6 o34567 4 R p8
6 o34567 4 R p9
6 o34567 4 R p10
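As a follow-up: if the empty cells are read as NaN rather than empty strings (for example when the '-' placeholders are treated as missing values), the truthiness check inside the apply above will not drop them (NaN is truthy) and the join will fail on non-strings. A minimal NaN-safe sketch; the na_values argument and the pd.notna filter are assumptions, not part of the original answer:
import pandas as pd

df = pd.read_excel('example.xlsx', na_values='-')   # read the '-' placeholders as NaN
# Join the two columns, skipping missing values explicitly with pd.notna
df['cargo'] = (df[['boxes', 'pallets']]
               .apply(lambda x: ','.join(str(i) for i in x if pd.notna(i)), axis=1))
df = (df.drop(['boxes', 'pallets'], axis=1)
        .assign(cargo=lambda d: d['cargo'].str.split(','))
        .explode('cargo')
        .sort_values(by=['order', 'point']))
print(df)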

Related

pandas: Aggregate on one column and count based on two columns

Suppose I have the following dataframe:
fid prefix target_text
0 f1 p1 t1
1 f1 p1 t2
2 f1 p2 t1
3 f1 p2 t2
4 f1 p3 t1
5 f1 p3 t3
6 f1 p3 t4
7 f2 p1 t1
8 f2 p1 t2
9 f2 p2 t2
10 f2 p2 t1
If I group them by fid and prefix and count unique target_text I have:
>>> num_targets = df.groupby(['fid','prefix'])['target_text'].transform('nunique')
0 2
1 2
2 2
3 2
4 3
5 3
6 3
7 2
8 2
9 2
10 2
Now I want to group them by 'fid' only, but for each fid print the number of distinct [prefix, target_text] pairs.
I expect:
num_targets
f1 7
f2 4
However, if I group the dataframe by fid, how can I count the distinct [prefix, target_text] pairs?
If you need uniqueness across both columns, the output is different:
s = (df['target_text'] + '_' + df['prefix']).groupby(df['fid']).nunique()
print (s)
fid
f1 7
f2 4
dtype: int64
s = df.drop_duplicates(['fid','prefix','target_text'])['fid'].value_counts()
print (s)
f1 7
f2 4
Name: fid, dtype: int64
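The same count can also be returned as a named Series matching the expected num_targets output above (a sketch of the second approach with a rename; the name num_targets is taken from the question):
num_targets = (df.drop_duplicates(['fid', 'prefix', 'target_text'])
                 .groupby('fid')
                 .size()
                 .rename('num_targets'))
print(num_targets)
fid
f1    7
f2    4
Name: num_targets, dtype: int64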

Replace value with the value of nearest neighbor in Pandas dataframe

I have a problem with getting the nearest values for some rows in a pandas dataframe and filling another column with values from those rows.
Data sample I have:
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 100
A A5 250 3
A A6 250 100
B B1 0 1
B B2 30 2
The thing is, wherever match_v is equal to 100, I need to replace that 100 with the value from the row whose r_value is closest to the r_value of the origin row (the one where match_v equals 100), but only within the group (grouped by id).
Expected output
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 2
A A5 250 3
A A6 250 3
B B1 0 1
B B2 30 2
I have tried creating lead and lag columns with shift and then finding differences, but that doesn't work well and somehow messes up values that were already good.
I haven't tried anything else because I really don't have any other idea.
Any help or hint is welcome, and if you need any additional info, I'm here.
Thanks in advance.
This is more like a merge_asof problem:
s = df.loc[df.match_v != 100]                      # rows that already have a valid match_v
# merge_asof needs the 'on' key sorted; by='id' keeps the match within each group
s = pd.merge_asof(df.sort_values('r_value'), s.sort_values('r_value'),
                  on='r_value', by='id', direction='nearest')
# map the nearest valid match_v back onto the original rows via su_id
df['match_v'] = df['su_id'].map(s.set_index('su_id_x')['match_v_y'])
df
Out[231]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
Here is another way using numpy broadcasting, built to speed up the calculation:
import numpy as np

l = []
for x, y in df.groupby('id'):
    s1 = y.r_value.values
    # pairwise absolute distances between the group's r_values
    s = abs(s1 - s1[:, None]).astype(float)
    # mask the diagonal and lower triangle so each position only matches an earlier row
    s[np.tril_indices(s.shape[0], 0)] = 999999
    s = s.argmin(0)
    s2 = y.match_v.values
    l.append(s2[s][s2 == 100])
df.loc[df.match_v == 100, 'match_v'] = np.concatenate(l)
df
Out[264]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
You could define a custom function which does the calculation and substitution, and then use it with groupby and apply.
def mysubstitution(x):
    for i in x.index[x['match_v'] == 100]:
        diff = (x['r_value'] - (x['r_value'].iloc[i])).abs()
        exclude = x.index.isin([i])
        closer_idx = diff[~exclude].idxmin()
        x['match_v'].iloc[i] = x['match_v'].iloc[closer_idx]
    return x

ddf = df.groupby('id').apply(mysubstitution)
ddf is:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
Assuming there is always at least one valid value within the group before a 100 is encountered, a single pass that remembers the last seen valid match_v per id also works (in this sample the last seen value happens to coincide with the nearest r_value):
m = dict()
for i in range(len(df)):
    if df.loc[i, "match_v"] == 100:
        df.loc[i, "match_v"] = m[df.loc[i, "id"]]
    else:
        m[df.loc[i, "id"]] = df.loc[i, "match_v"]

Group rows where columns have values within range in pandas df

I have a pandas df:
number sample chrom1 start chrom2 end
1 s1 1 0 2 1500
2 s1 2 10 2 50
19 s2 3 3098318 3 3125700
19 s3 3 3098720 3 3125870
20 s4 3 3125694 3 3126976
20 s1 3 3125694 3 3126976
20 s1 3 3125695 3 3126976
20 s5 3 3125700 3 3126976
21 s3 3 3125870 3 3134920
22 s2 3 3126976 3 3135039
24 s5 3 17286051 3 17311472
25 s2 3 17286052 3 17294628
26 s4 3 17286052 3 17311472
26 s1 3 17286052 3 17311472
27 s3 3 17286405 3 17294550
28 s4 3 17293197 3 17294628
28 s1 3 17293197 3 17294628
28 s5 3 17293199 3 17294628
29 s2 3 17294628 3 17311472
I am trying to group lines that have different numbers, but where the start is within +/- 10 AND the end is also within +/- 10 on the same chromosomes.
In this example I want to find these two lines:
24 s5 3 17286051 3 17311472
26 s4 3 17286052 3 17311472
Both have the same chrom1 [3] and chrom2 [3], and the start and end values are within +/- 10 of each other; I want to group them under the same number:
24 s5 3 17286051 3 17311472
24 s4 3 17286052 3 17311472 # Change the number to the first seen in this series
Here's what I'm trying:
import pandas as pd
from collections import defaultdict

def parse_vars(inFile):
    df = pd.read_csv(inFile, delimiter="\t")
    df = df[['number', 'chrom1', 'start', 'chrom2', 'end']]
    vars = {}
    seen_l = defaultdict(lambda: defaultdict(dict))  # To track the `starts`
    seen_r = defaultdict(lambda: defaultdict(dict))  # To track the `ends`
    for index in df.index:
        event = df.loc[index, 'number']
        c1 = df.loc[index, 'chrom1']
        b1 = int(df.loc[index, 'start'])
        c2 = df.loc[index, 'chrom2']
        b2 = int(df.loc[index, 'end'])
        print([event, c1, b1, c2, b2])
        vars[event] = [c1, b1, c2, b2]
        # Iterate over windows +/- 10
        for i, j in zip(range(b1 - 10, b1 + 10), range(b2 - 10, b2 + 10)):
            # if:
            #   i in seen_l[c1] AND
            #   j in seen_r[c2] AND
            #   the 'number' for these two instances is the same:
            if i in seen_l[c1] and j in seen_r[c2] and seen_l[c1][i] == seen_r[c2][j]:
                print(seen_l[c1][i], seen_r[c2][j])
                if seen_l[c1][i] != event:
                    print("Seen: %s %s in event %s %s" % (event, [c1, b1, c2, b2], seen_l[c1][i], vars[seen_l[c1][i]]))
        seen_l[c1][b1] = event
        seen_r[c2][b2] = event
The problem I'm having is that seen_l[3][17286052] exists in both numbers 25 and 26, and since their respective seen_r events (seen_r[3][17294628] = 25, seen_r[3][17311472] = 26) are not equal, I am unable to join these lines together.
Is there a way that I can use a list of start values as the nested key for seen_l dict?
Interval overlaps are easy in pyranges. Most of the code below is to separate out the starts and ends into two different dfs. Then these are joined based on an interval overlap of +-10:
from io import StringIO
import pandas as pd
import pyranges as pr
c = """number sample chrom1 start chrom2 end
1 s1 1 0 2 1500
2 s1 2 10 2 50
19 s2 3 3098318 3 3125700
19 s3 3 3098720 3 3125870
20 s4 3 3125694 3 3126976
20 s1 3 3125694 3 3126976
20 s1 3 3125695 3 3126976
20 s5 3 3125700 3 3126976
21 s3 3 3125870 3 3134920
22 s2 3 3126976 3 3135039
24 s5 3 17286051 3 17311472
25 s2 3 17286052 3 17294628
26 s4 3 17286052 3 17311472
26 s1 3 17286052 3 17311472
27 s3 3 17286405 3 17294550
28 s4 3 17293197 3 17294628
28 s1 3 17293197 3 17294628
28 s5 3 17293199 3 17294628
29 s2 3 17294628 3 17311472"""
df = pd.read_table(StringIO(c), sep=r"\s+")
df1 = df[["chrom1", "start", "number", "sample"]]
df1.insert(2, "end", df.start + 1)
df2 = df[["chrom2", "end", "number", "sample"]]
df2.insert(2, "start", df.end - 1)
names = ["Chromosome", "Start", "End", "number", "sample"]
df1.columns = names
df2.columns = names
gr1, gr2 = pr.PyRanges(df1), pr.PyRanges(df2)
j = gr1.join(gr2, slack=10)
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# | Chromosome | Start | End | number | sample | Start_b | End_b | number_b | sample_b |
# | (category) | (int32) | (int32) | (int64) | (object) | (int32) | (int32) | (int64) | (object) |
# |--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------|
# | 3 | 3125694 | 3125695 | 20 | s4 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125694 | 3125695 | 20 | s1 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125695 | 3125696 | 20 | s1 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125700 | 3125701 | 20 | s5 | 3125700 | 3125699 | 19 | s2 |
# | ... | ... | ... | ... | ... | ... | ... | ... | ... |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 25 | s2 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s5 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s1 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s4 |
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# Unstranded PyRanges object has 13 rows and 9 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
# to get the data as a pandas df:
jdf = j.df

How to add rows with the average values of selected rows into a data frame

I have a pandas dataframe that looks like this:
The data is made of 3 copies, as in the first column. Each of these copies contains the same elements, i.e. they have 2 sequences each, which are in turn made up of 3 different types: A, R2 and R3.
Copy sequence type ntv
1 1 A 0.45
1 1 R2 0.878
1 1 R3 1.234
1 2 A -7.890
1 2 R2 2.345
1 2 R3 -0.871
2 1 A -0.098
2 1 R2 -0.007
2 1 R3 9.089
2 2 A 1.567
2 2 R2 -0.764
2 2 R3 17.908
3 1 A 4.980
3 1 R2 2.34
3 1 R3 1.280
3 2 A -9.189
3 2 R2 -7.09
3 2 R3 -0.009
I would like to create a data frame that looks like the one below, such that for each sequence in the same copy, the average of R2 and R3 is given on a new line as type 'R'. What I mean is: in copy 1, for example, how can I find the mean value of R2 and R3 for each of the sequences?
Copy sequence type ntv
1 1 A 0.45
1 1 R2 0.878
1 1 R3 1.234
1 1 R 1.056
1 2 A -7.890
1 2 R2 2.345
1 2 R3 -0.871
1 2 R 0.737
2 1 A -0.098
2 1 R2 -0.007
2 1 R3 9.089
2 1 R 4.541
2 2 A 1.567
2 2 R2 -0.764
2 2 R3 17.908
2 2 R 8.572
3 1 A 4.980
3 1 R2 2.34
3 1 R3 1.280
3 1 R 1.81
3 2 A -9.189
3 2 R2 -7.09
3 2 R3 -0.009
3 2 R -3.549
Here is the code that I have so far:
avg_type = [(('R2','R3'),'R')]
for i in set(df['Copy']):
    cp = df[df['Copy'] == i]
    for i in set(df['sequence']):
        seq = df[df['sequence'] == i]
        for oldname, newname in avg_type:
            avg = seq.loc[seq['type'].isin(oldname)]
            if len(avg) > 1:
                newrow = avg.loc[avg.index[0]]
                newrow['ntv'] = avg['ntv'].mean()
                newrow['type'] = newname
                df.loc[-1] = newrow
                df.index += 1
I have only managed to somehow figure out how to find the average of R2 and R3 per sequence (in other words I get 2 values instead of 6), but even then the new rows are not placed where I want them.
How can I extend my selection criteria to consider the 'Copy' number as well? I would appreciate any help or directions on how to go about it using pandas or python in general. Thanks in advance!
Try this:
In [68]: df.append(
...: df[df['type'].isin(['R2','R3'])]
...: .groupby(['Copy','sequence'], as_index=False)
...: ['ntv'].mean()
...: .assign(type='R')) \
...: .sort_values(['Copy','sequence'])[df.columns]
...:
Out[68]:
Copy sequence type ntv
0 1 1 A 0.4500
1 1 1 R2 0.8780
2 1 1 R3 1.2340
0 1 1 R 1.0560
3 1 2 A -7.8900
4 1 2 R2 2.3450
5 1 2 R3 -0.8710
1 1 2 R 0.7370
6 2 1 A -0.0980
7 2 1 R2 -0.0070
.. ... ... ... ...
11 2 2 R3 17.9080
3 2 2 R 8.5720
12 3 1 A 4.9800
13 3 1 R2 2.3400
14 3 1 R3 1.2800
4 3 1 R 1.8100
15 3 2 A -9.1890
16 3 2 R2 -7.0900
17 3 2 R3 -0.0090
5 3 2 R -3.5495
[24 rows x 4 columns]
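Note that DataFrame.append was removed in pandas 2.0, so on current pandas the same logic can be written with pd.concat instead (a minimal sketch, everything else unchanged):
r = (df[df['type'].isin(['R2', 'R3'])]
       .groupby(['Copy', 'sequence'], as_index=False)['ntv'].mean()
       .assign(type='R'))
out = pd.concat([df, r]).sort_values(['Copy', 'sequence'], kind='mergesort')[df.columns]
# kind='mergesort' is a stable sort, so the new 'R' rows stay after the original R2/R3 rows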
This will also generate the type "R" values; you could then append, sort, and assign type "R" as in MaxU's answer above.
df.loc[df["type"] != "A"].groupby(["Copy", "sequence"], as_index=False)["ntv"].mean()

How to select rows which match a certain row

I have a dataframe below
A B
a0 1
b0 1
c0 2
a1 3
b1 4
b2 3
First, wherever df.A starts with "a", I would like to cut df:
df[df.A.str.startswith("a")]
A B
a0 1
a1 3
That is, I would like to cut df into blocks like below.
sub1
A B
a0 1
b0 1
c0 2
sub2
A B
a1 3
b1 4
b2 3
Then I would like to extract the rows whose column B value matches the B value of the row whose column A starts with "a":
sub1
A B
a0 1
b0 1
sub2
A B
a1 3
b2 3
Then append them:
result
A B
a0 1
b0 1
a1 3
b2 3
How can I cut and append df like this?
I tried the cut method but it didn't work well.
I think you can use where to mask the B values to NaN wherever A does not start with "a", and then forward fill them with ffill:
Notice that for ffill to work, the value starting with "a" has to come first in each group.
print (df.B.where(df.A.str.startswith("a")))
0 1.0
1 NaN
2 NaN
3 3.0
4 NaN
5 NaN
Name: B, dtype: float64
print (df.B.where(df.A.str.startswith("a")).ffill())
0 1.0
1 1.0
2 1.0
3 3.0
4 3.0
5 3.0
Name: B, dtype: float64
df = df[df.B == df.B.where(df.A.str.startswith("a")).ffill()]
print (df)
A B
0 a0 1
1 b0 1
3 a1 3
5 b2 3
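An equivalent sketch with groupby/transform, under the same assumption that an "a" row opens each block: build group labels from a running count of the "a" rows and keep the rows whose B equals the first B of their group.
key = df.A.str.startswith("a").cumsum()
print(df[df.B.eq(df.B.groupby(key).transform("first"))])
# keeps rows a0, b0, a1 and b2, as in the result above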
