Row values from a range based on another row in Python

My two sample df are as below.
df1
Column1
1
2
3
4
5
6
7
8
9
10
11
12
13
df2
Column1 Column2
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
9 I
10 J
What I want is to merge the two dataframes on df1's Column1, which is simple enough. But if a value is not found in df2, I want to fall back to a range lookup.
e.g. if the merge produces NaN, it should then check whether the value is between 11 and 13 and return "C", between 14 and 18 return "D", and between 19 and 25 return "E".

You can use merge and then fill the resulting NaNs with fillna().
import pandas as pd

df1 = pd.DataFrame({'Column1': range(1, 26)})
df2 = pd.DataFrame({'Column1': range(1, 11),
                    'Column2': ['A','B','C','D','E','F','G','H','I','J']})
df1 = df1.merge(df2, on=['Column1'], how='left')
fill_dict = {11: 'C', 12: 'C', 13: 'C',
             14: 'D', 15: 'D', 16: 'D', 17: 'D', 18: 'D',
             19: 'E', 20: 'E', 21: 'E', 22: 'E', 23: 'E', 24: 'E', 25: 'E'}
# Fill only the NaNs left by the merge, using the range mapping
df1['Column2'] = df1['Column2'].fillna(df1['Column1'].map(fill_dict))
print(df1)
Output:
Column1 Column2
0 1 A
1 2 B
2 3 C
3 4 D
4 5 E
5 6 F
6 7 G
7 8 H
8 9 I
9 10 J
10 11 C
11 12 C
12 13 C
13 14 D
14 15 D
15 16 D
16 17 D
17 18 D
18 19 E
19 20 E
20 21 E
21 22 E
22 23 E
23 24 E
24 25 E
EDIT 1:
If you want to build the fill_dict dictionary from ranges, you can use dict.fromkeys():
fill_dict = dict.fromkeys(range(11,14),'C')
fill_dict.update(dict.fromkeys(range(14,19),'D'))
fill_dict.update(dict.fromkeys(range(19,26),'E'))
Or you can use a list comprehension to build fill_dict:
fill_dict = dict([(i, 'C') for i in range(11, 14)] +
                 [(i, 'D') for i in range(14, 19)] +
                 [(i, 'E') for i in range(19, 26)])
EDIT 2:
Based on our chat conversation, please try this:
Instead of creating the dict from a range of ints, since your data has float values I thought of using np.arange(), but identifying the correct key with the right decimal precision was a bit problematic. So I wrote a small generator function to produce the keys. I'm sure this is not efficient in terms of performance, but it gets the job done; there should be a more effective solution for this.
import pandas as pd
import decimal

def gen_float_range(start, stop, step):
    # use Decimal (built from str, so the step stays exact) to avoid float accumulation error
    while start < stop:
        yield float(start)
        start += decimal.Decimal(str(step))

base1 = pd.DataFrame({'HS CODE': [5004.0000, 5005.0000, 5006.0000, 5007.1000, 5007.2000,
                                  6115.950, 6115.950, 6115.960, 6115.960, 6115.950]})
base2 = pd.DataFrame({'HS CODE': [5004.0000, 5005.0000, 5006.0000, 5007.1000, 5007.2000],
                      '%age': 0.4})
base1 = base1.merge(base2, on=['HS CODE'], how='left')
fill_dict = dict.fromkeys(gen_float_range(6110, 6121, 0.0001), 0.06)
# base1['%age'] = base1.replace({'HS CODE': fill_dict})
base1['%age'] = base1['%age'].fillna(base1['HS CODE'].map(fill_dict))
print(base1)
Output:
HS CODE %age
0 5004.00 0.4
1 5005.00 0.4
2 5006.00 0.4
3 5007.10 0.4
4 5007.20 0.4
5 6115.95 0.06
6 6115.95 0.06
7 6115.96 0.06
8 6115.96 0.06
9 6115.95 0.06
You have to build fill_dict for each of the different ranges and update your fill_dict using the start and stop values; the step should be how the codes increment. Based on the data you shared I assumed a step of 0.0001, but that makes the dict very large. You can look at reducing the step to 0.1 or 0.01 based on your requirement.
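If the fallback ranges are contiguous, you can avoid enumerating every possible key at all. A minimal sketch of an interval-based alternative using pd.cut (my own suggestion, shown on the integer ranges from the original question rather than the HS-code data):
import pandas as pd

df1 = pd.DataFrame({'Column1': range(1, 26)})
df2 = pd.DataFrame({'Column1': range(1, 11),
                    'Column2': list('ABCDEFGHIJ')})
df1 = df1.merge(df2, on='Column1', how='left')
# Bin the unmatched values into the fallback ranges instead of enumerating
# every key: (10, 13] -> C, (13, 18] -> D, (18, 25] -> E
fallback = pd.cut(df1['Column1'], bins=[10, 13, 18, 25], labels=['C', 'D', 'E'])
df1['Column2'] = df1['Column2'].fillna(fallback.astype(object))
The same idea carries over to the float HS codes by choosing float bin edges (e.g. bins=[6110, 6121]), which avoids building a dict with roughly 110,000 keys.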

Merge with a left join, and then refill the missing values accordingly.
UPDATE:
df1 = df1.merge(df2, on=['Column1'], how='left')
fill_dict = {11: 'A', 12: 'A', ...}
df1['Column2'] = df1['Column2'].fillna(df1['Column1'].map(fill_dict))

Related

How to efficiently reorder rows based on condition?

My dataframe:
df = pd.DataFrame({'col_1': [10, 20, 10, 20, 10, 10, 20, 20],
                   'col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
5 10 f
6 20 g
7 20 h
I don't want consecutive rows with col_1 = 10; instead, the row below a repeated 10 should move up by one (in this case index 6 should become index 5 and vice versa), so the order is always 10, 20, 10, 20, ...
My current solution:
for idx, row in df.iterrows():
    if row['col_1'] == 10 and df.iloc[idx + 1]['col_1'] != 20:
        df = df.rename({idx + 1: idx + 2, idx + 2: idx + 1})
        df = df.sort_index()
df
gives me:
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
5 20 g
6 10 f
7 20 h
which is what I want, but it is very slow (2.34 s for a dataframe with just over 8000 rows).
Is there a way to avoid the loop here?
Thanks
You can use a custom key in sort_values with groupby.cumcount:
df.sort_values(by='col_1', kind='stable', key=lambda s: df.groupby(s).cumcount())
Output:
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
6 20 g
5 10 f
7 20 h
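To see why this works, here is the intermediate key the lambda produces (an illustration I'm adding, not part of the original answer). Grouping col_1 by its own values and taking cumcount numbers the occurrences of each value, and the stable sort then pairs the i-th 10 with the i-th 20:
import pandas as pd

df = pd.DataFrame({'col_1': [10, 20, 10, 20, 10, 10, 20, 20],
                   'col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
# Occurrence counter per value: the i-th 10 and the i-th 20 both get key i
key = df.groupby(df['col_1']).cumcount()
print(key.tolist())  # [0, 0, 1, 1, 2, 3, 2, 3]
# Sorting stably by this key yields the alternating 10, 20, 10, 20, ... order
out = df.sort_values(by='col_1', kind='stable',
                     key=lambda s: df.groupby(s).cumcount())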

How to improve performance of dataframe slices matching?

I need to improve the performance of the following dataframe slices matching.
What I need to do is find the matching trips between 2 dataframes, according to the sequence column values with order conserved.
My 2 dataframes:
>>>df1
trips sequence
0 11 a
1 11 d
2 21 d
3 21 a
4 31 a
5 31 b
6 31 c
>>>df2
trips sequence
0 12 a
1 12 d
2 22 c
3 22 b
4 22 a
5 32 a
6 32 d
Expected output:
['11 match 12']
This is the code I'm using:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31], 'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})
df2 = pd.DataFrame({'trips': [12, 12, 22, 22, 22, 32, 32], 'sequence': ['a', 'd', 'c', 'b', 'a', 'a', 'd']})
route_match = []
for trip1 in df1['trips'].drop_duplicates():
    for trip2 in df2['trips'].drop_duplicates():
        route1 = df1[df1['trips'] == trip1]['sequence']
        route2 = df2[df2['trips'] == trip2]['sequence']
        if np.array_equal(route1.values, route2.values):
            route_match.append(str(trip1) + ' match ' + str(trip2))
            break
        else:
            continue
Despite working, this is very slow and inefficient, as my real dataframes are much longer.
Any suggestions?
You can aggregate each trip's sequence as a tuple with groupby.agg, then merge the two outputs to identify the identical routes:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple),
               on='sequence')
output:
trips_x sequence trips_y
0 11 (a, d) 12
1 11 (a, d) 32
If you only want the first match, drop_duplicates the output of df2 aggregation to prevent unnecessary merging:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple)
                  .drop_duplicates(subset='sequence'),
               on='sequence')
output:
trips_x sequence trips_y
0 11 (a, d) 12
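If you need the exact list-of-strings format from the question, a possible follow-up on the merged frame above (my addition, assuming the deduplicated merge so each df1 trip appears at most once):
route_match = (out['trips_x'].astype(str) + ' match ' + out['trips_y'].astype(str)).tolist()
print(route_match)  # ['11 match 12']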

Groupby, apply function to each row with shift, and create new column

I want to group by id, apply a function to the data, and create a new column with the results. It seems there must be a faster/more efficient way to do this than to pass the data to the function, make the changes, and return the data. Here is an example.
Example
import pandas as pd

dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'], 'x': [4, 8, 12, 25, 30, 50]})

def my_func(data):
    data['diff'] = data['x'] - data['x'].shift(1, fill_value=data['x'].iat[0])
    return data

dat.groupby('id').apply(my_func)
Output
> print(dat)
id x diff
0 a 4 0
1 a 8 4
2 a 12 4
3 b 25 0
4 b 30 5
5 b 50 20
Is there a more efficient way to do this?
You can use groupby().diff() for this and afterwards fill the NaN with zero, like the following:
dat['diff'] = dat.groupby('id').x.diff().fillna(0)
print(dat)
id x diff
0 a 4 0.0
1 a 8 4.0
2 a 12 4.0
3 b 25 0.0
4 b 30 5.0
5 b 50 20.0
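If you want integer output to match the expected frame exactly, a small variation (my addition, not part of the original answer) is to cast after filling:
dat['diff'] = dat.groupby('id')['x'].diff().fillna(0).astype(int)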

get first and last values in a groupby

I have a dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, -1),
                  [['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
                   ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']],
                  ['X', 'Y'])
How do I get the first and last rows, grouped by the first level of the index?
I tried
df.groupby(level=0).agg(['first', 'last']).stack()
and got
X Y
a first 0 1
last 6 7
b first 8 9
last 12 13
c first 14 15
last 16 17
d first 18 19
last 18 19
This is so close to what I want. How can I preserve the level 1 index and get this instead:
X Y
a a 0 1
d 6 7
b e 8 9
g 12 13
c h 14 15
i 16 17
d j 18 19
j 18 19
Option 1
def first_last(df):
    return df.iloc[[0, -1]]

df.groupby(level=0, group_keys=False).apply(first_last)
Option 2 - only works if index is unique
idx = df.index.to_series().groupby(level=0).agg(['first', 'last']).stack()
df.loc[idx]
Option 3 - per notes below, this only makes sense when there are no NAs
I also abused the agg function. The code below works, but is far uglier.
df.reset_index(1).groupby(level=0).agg(['first', 'last']).stack() \
.set_index('level_1', append=True).reset_index(1, drop=True) \
.rename_axis([None, None])
Note
per @unutbu: agg(['first', 'last']) takes the first/last non-NA values.
I interpreted this to mean it must then be run column by column. Further, forcing the level-1 index to align may not even make sense.
Let's include another test
df = pd.DataFrame(np.arange(20).reshape(10, -1),
                  [list('aaaabbbccd'),
                   list('abcdefghij')],
                  list('XY'))

df.loc[tuple('aa'), 'X'] = np.nan

def first_last(df):
    return df.iloc[[0, -1]]

df.groupby(level=0, group_keys=False).apply(first_last)

df.reset_index(1).groupby(level=0).agg(['first', 'last']).stack() \
  .set_index('level_1', append=True).reset_index(1, drop=True) \
  .rename_axis([None, None])
Sure enough! This second solution is taking the first valid value in column X. It is now nonsensical to have forced that value to align with the index a.
This could be one of the easier solutions:
df.groupby(level = 0, as_index= False).nth([0,-1])
X Y
a a 0 1
d 6 7
b e 8 9
g 12 13
c h 14 15
i 16 17
d j 18 19
Hope this helps. (Y)
Please try this:
For last value: df.groupby('Column_name').nth(-1),
For first value: df.groupby('Column_name').nth(0)

Replicating rows in a pandas data frame by a column value [duplicate]

This question already has answers here:
How can I replicate rows of a Pandas DataFrame?
(10 answers)
Closed 11 months ago.
I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n' : [  1,   2,   3],
    'v' : [ 10,  13,   8]
})

what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'v' : [ 10,  13,  13,   8,   8,   8]
})
Is this possible?
You can use Index.repeat to get repeated index values based on the column then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
import numpy as np

df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
    f.id, f.n, f.v,
    'A',  1,   10,
    'B',  2,   13,
    'C',  3,    8
)

what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
Not the best solution, but I want to share this: you could also use DataFrame.reindex() together with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the index.
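One small usage note worth adding (my own observation, not from the answers above): with any of the repeat-based approaches, a value of 0 in n drops that row entirely, which matches tidyr's uncount semantics:
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C'], 'n': [1, 0, 3], 'v': [10, 13, 8]})
# Rows with n == 0 are repeated zero times, i.e. removed from the result
out = df.loc[df.index.repeat(df.n)].drop('n', axis=1).reset_index(drop=True)
print(out)
#   id   v
# 0  A  10
# 1  C   8
# 2  C   8
# 3  C   8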
