I have two pandas DataFrames, both containing the same categories but different 'id' columns. To illustrate, the first table looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': list(np.arange(1, 12)),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c'],
    'weight': list(np.random.randint(1, 5, 11))
})
df['weight_sum'] = df.groupby('category')['weight'].transform('sum')
df['p'] = df['weight'] / df['weight_sum']
Output:
id category weight weight_sum p
0 1 a 4 14 0.285714
1 2 a 4 14 0.285714
2 3 a 2 14 0.142857
3 4 a 4 14 0.285714
4 5 b 4 8 0.500000
5 6 b 4 8 0.500000
6 7 c 3 15 0.200000
7 8 c 4 15 0.266667
8 9 c 2 15 0.133333
9 10 c 4 15 0.266667
10 11 c 2 15 0.133333
The second contains only 'id' and 'category'.
What I'm trying to do is to create a third DataFrame that would inherit the id of the second DataFrame, plus three new columns holding ids from the first DataFrame - each selected based on the p column, which represents the id's weight within its category.
I've tried multiple solutions and was thinking of applying np.random.choice and .apply(), but couldn't figure out a way to make that work.
EDIT:
The desired output would be something like:
user_id id_1 id_2 id_3
0 2 3 1 2
1 3 2 2 3
2 4 1 3 1
Each id should be selected based on its probability and matching category (both DataFrames have this column), and the same id should not show up more than once for the same user_id.
IIUC, you want to select random IDs of the same category with weighted probabilities. For this you can construct a helper dataframe (dfg) and use apply:
df2 = pd.DataFrame({
'id': np.random.randint(1, 12, size=11),
'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c']})
dfg = df.groupby('category').agg(list)
df3 = df2.join(df2['category']
               .apply(lambda r: pd.Series(np.random.choice(dfg.loc[r, 'id'],
                                                           p=dfg.loc[r, 'p'],
                                                           size=3)))
               .add_prefix('id_')
               )
Output:
id category id_0 id_1 id_2
0 11 a 2 3 3
1 10 a 2 3 1
2 4 a 1 2 3
3 7 a 2 1 4
4 5 b 6 5 5
5 10 b 6 5 6
6 8 c 9 8 8
7 11 c 7 8 7
8 11 c 10 8 8
9 4 c 9 10 10
10 1 c 11 11 9
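Note that np.random.choice samples with replacement by default, which is why duplicates can appear within a row above. If each row must hold three distinct ids, as the desired output suggests, passing replace=False is a minimal fix - a sketch, assuming every category holds at least three ids (category 'b' above has only two, so sampling it would raise a ValueError):

df3 = df2.join(df2['category']
               .apply(lambda r: pd.Series(np.random.choice(dfg.loc[r, 'id'],
                                                           p=dfg.loc[r, 'p'],
                                                           size=3,
                                                           replace=False)))
               .add_prefix('id_')
               )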
I would like to transform a Pandas DataFrame of the following wide format
import pandas as pd

df = pd.DataFrame([['A', '1', '2', '3'], ['B', '4', '5', '6'], ['C', '7', '8', '9']],
                  columns=['ABC', 'def', 'ghi', 'jkl'])
df =
ABC def ghi jkl
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
into a long format, where the values from the first column still correspond to the values in the lower-case columns. The column names cannot be used as stub names. The names of the new columns are irrelevant and could be renamed later.
The output should look something like this:
df =
0 1
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
I am not sure how best to do this efficiently. Can this be done with wide_to_long()? Then I would not know how to deal with stub names. Ideally it would be an efficient one-liner that can be used on a large table.
Many thanks!!
Use DataFrame.melt with DataFrame.sort_index, then remove the variable column:
df1 = (df.melt("ABC", value_name='new', ignore_index=False)
         .sort_index(ignore_index=True)
         .drop('variable', axis=1)
       )
print (df1)
ABC new
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
If you need a more dynamic solution, take the first column name programmatically:
first = df.columns[0]
df1 = (df.melt(first, value_name='new', ignore_index=False)
         .sort_index(ignore_index=True)
         .drop('variable', axis=1))
You can use df.stack:
>>> df.set_index('ABC') \
       .stack() \
       .reset_index(level='ABC') \
       .reset_index(drop=True)
ABC 0
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
or use df.melt as suggested by @MustafaAydın:
>>> df.melt('ABC') \
       .sort_values('ABC') \
       .drop(columns='variable') \
       .reset_index(drop=True)
ABC value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
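To address the wide_to_long question directly: it requires stub names, but you can manufacture them by renaming the value columns to a common stub first. A sketch, where the val stub and the num suffix column are names introduced purely for illustration:

tmp = df.rename(columns={c: f'val{i}' for i, c in enumerate(df.columns[1:])})
df1 = (pd.wide_to_long(tmp, stubnames='val', i='ABC', j='num')
         .reset_index()
         .sort_values(['ABC', 'num'], ignore_index=True)
         .drop(columns='num'))

In practice melt or stack (above) is simpler; this only shows that wide_to_long can be coaxed into working when the real column names are unusable as stubs.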
I have the following sample DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({'Tag': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18, 21, 1, 2],
'Value': [1, 13, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
to which I add the percentile rank of Value using
df['Percent_value'] = df['Value'].rank(method='dense', pct=True)
and add the Order using pd.cut() with pre-defined percentage bins
percentage = np.array([10, 20, 50, 70, 100])/100
df['Order'] = pd.cut(df['Percent_value'], bins=np.insert(percentage, 0, 0), labels = [1,2,3,4,5])
which gives
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 2
5 B 9 3 0.230769 3
6 B 4 4 0.307692 3
7 C 13 5 0.384615 3
8 C 6 6 0.461538 3
9 C 18 7 0.538462 4
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 5
My Question
Now, instead of having a single percentage array (bins) for all Tags (groups), I have a separate percentage array for each Tag group, i.e., A, B and C. How can I apply df.groupby('Tag') and then pd.cut() using different percentage bins for each group, taken from the following dictionary? Is there a direct way that avoids the for loop I use below?
percentages = {'A': np.array([10, 20, 50, 70, 100])/100,
'B': np.array([20, 40, 60, 90, 100])/100,
'C': np.array([30, 50, 60, 80, 100])/100}
Desired outcome (Note: Order is now computed for each Tag using different bins):
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 1
5 B 9 3 0.230769 2
6 B 4 4 0.307692 2
7 C 13 5 0.384615 2
8 C 6 6 0.461538 2
9 C 18 7 0.538462 3
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 4
My Attempt
orders = []
for k, g in df.groupby(['Tag']):
    percentage = percentages[k]
    g['Order'] = pd.cut(g['Percent_value'], bins=np.insert(percentage, 0, 0), labels=[1, 2, 3, 4, 5])
    orders.append(g)
df_final = pd.concat(orders, axis=0, join='outer')
You can apply pd.cut within groupby:
df['Order'] = (df.groupby('Tag')
                 .apply(lambda x: pd.cut(x['Percent_value'],
                                         bins=np.insert(percentages[x.name], 0, 0),
                                         labels=[1, 2, 3, 4, 5]))
                 .reset_index(drop=True))
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 1
5 B 9 3 0.230769 2
6 B 4 4 0.307692 2
7 C 13 5 0.384615 2
8 C 6 6 0.461538 2
9 C 18 7 0.538462 3
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 4
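One caveat, not from the original answer: reset_index(drop=True) relies on the rows already being ordered by Tag, as they happen to be here. A sketch that aligns on the original index instead, so any row order works:

order = df.groupby('Tag', group_keys=False).apply(
    lambda x: pd.cut(x['Percent_value'],
                     bins=np.insert(percentages[x.name], 0, 0),
                     labels=[1, 2, 3, 4, 5]))
df['Order'] = order  # assignment aligns on the original index

With group_keys=False the apply result keeps the original row index rather than gaining a Tag level, so the assignment lines up even on unsorted frames.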
I am a geologist needing to clean up data.
I have a .csv file containing drilling intervals, that I imported as a pandas dataframe that looks like this:
hole_name from to interval_type
0 A 0 1 Gold
1 A 1 2 Gold
2 A 2 4 Inferred_fault
3 A 4 6 NaN
4 A 6 7 NaN
5 A 7 8 NaN
6 A 8 9 Inferred_fault
7 A 9 10 NaN
8 A 10 11 Inferred_fault
9 B2 11 12 Inferred_fault
10 B2 12 13 Inferred_fault
11 B2 13 14 NaN
For each individual "hole_name", I would like to group/merge the "from" and "to" range for consecutive intervals associated with the same "interval_type". The NaN values can be dropped, they are of no use to me (but I already know how to do this, so it is fine).
Based on the example above, I would like to get something like this:
hole_name from to interval_type
0 A 0 2 Gold
2 A 2 4 Inferred_fault
3 A 4 8 NaN
6 A 8 9 Inferred_fault
7 A 9 10 NaN
8 A 10 11 Inferred_fault
9 B2 11 13 Inferred_fault
11 B2 13 14 NaN
I have looked around and tried to use groupby or pyranges but cannot figure out how to do this...
Thanks a lot in advance for your help!
This should do the trick:
import pandas as pd
import numpy as np
from itertools import groupby
# create dataframe
data = {
'hole_name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
'from': [0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13],
'to': [1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14],
'interval_type': ['Gold', 'Gold', 'Inferred_fault', np.nan, np.nan, np.nan,
'Inferred_fault', np.nan, 'Inferred_fault', 'Inferred_fault',
'Inferred_fault', np.nan]
}
df = pd.DataFrame(data=data)
# create an auxiliary column that groups consecutive repeated values
grouped = [list(g) for k, g in groupby(list(zip(df.hole_name.tolist(), df.interval_type.tolist())))]
df['interval_type_id'] = np.repeat(range(len(grouped)),[len(x) for x in grouped])+1
# aggregate results
cols = df.columns[:-1]
vals = []
for idx, group in df.groupby(['interval_type_id', 'hole_name']):
    vals.append([group['hole_name'].iloc[0], group['from'].min(),
                 group['to'].max(), group['interval_type'].iloc[0]])
result = pd.DataFrame(data=vals, columns=cols)
result
result should be:
hole_name from to interval_type
A 0 2 Gold
A 2 4 Inferred_fault
A 4 8
A 8 9 Inferred_fault
A 9 10
A 10 11 Inferred_fault
B 11 13 Inferred_fault
B 13 14
EDIT: added hole_name to the groupby function.
You can first build an indicator column for grouping, then use agg to merge the subgroups and get from and to.
(
    df.assign(ind=df.interval_type.fillna(''))
      .assign(ind=lambda x: x.ind.ne(x.ind.shift(1).bfill()).cumsum())
      .groupby(['hole_name', 'ind'])
      .agg({'from': 'first', 'to': 'last', 'interval_type': 'first'})
      .reset_index()
      .drop(columns='ind')
)
hole_name from to interval_type
0 A 0 2 Gold
1 A 2 4 Inferred_fault
2 A 4 8 NaN
3 A 8 9 Inferred_fault
4 A 9 10 NaN
5 A 10 11 Inferred_fault
6 B 11 13 Inferred_fault
7 B 13 14 NaN
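A further sketch, not from either answer: building the change indicator from both columns makes a new hole always start a new group, even when the interval_type repeats across a hole boundary (fillna('') just makes the NaNs comparable):

it = df['interval_type'].fillna('')
change = it.ne(it.shift()) | df['hole_name'].ne(df['hole_name'].shift())
out = (df.groupby(change.cumsum())
         .agg({'hole_name': 'first', 'from': 'first', 'to': 'last',
               'interval_type': 'first'})
         .reset_index(drop=True))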
I have a data frame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['id'] = [1,1,1,2,2,3,3,3,3,4,4,5]
df['view'] = ['A', 'B', 'A', 'A','B', 'A', 'B', 'A', 'A','B', 'A', 'B']
df['value'] = np.random.random(12)
id view value
0 1 A 0.625781
1 1 B 0.330084
2 1 A 0.024532
3 2 A 0.154651
4 2 B 0.196960
5 3 A 0.393941
6 3 B 0.607217
7 3 A 0.422823
8 3 A 0.994323
9 4 B 0.366650
10 4 A 0.649585
11 5 B 0.513923
I now want to summarize, for each id and each view, the mean of 'value'.
Think of this as some ids have repeated observations for view, and I want to summarize them. For example, id 1 has two observations for A.
I tried
res = df.groupby(['id', 'view'])['value'].mean()
This is actually almost what I want, but pandas combines the id and view columns into an index, which I do not want.
id view
1 A 0.325157
B 0.330084
2 A 0.154651
B 0.196960
3 A 0.603696
B 0.607217
4 A 0.649585
B 0.366650
5 B 0.513923
Also, res.shape is (9,).
my desired output would be this:
id view value
1 A 0.325157
1 B 0.330084
2 A 0.154651
2 B 0.196960
3 A 0.603696
3 B 0.607217
4 A 0.649585
4 B 0.366650
5 B 0.513923
where the column names and dimensions are kept and where the id is repeated. Each id should have only 1 row for A and B.
How can I achieve this?
You need reset_index or the parameter as_index=False in groupby, because you get a MultiIndex, and by default the higher levels of the index are sparsified to make the console output a bit easier on the eyes:
np.random.seed(100)
df = pd.DataFrame()
df['id'] = [1,1,1,2,2,3,3,3,3,4,4,5]
df['view'] = ['A', 'B', 'A', 'A','B', 'A', 'B', 'A', 'A','B', 'A', 'B']
df['value'] = np.random.random(12)
print (df)
id view value
0 1 A 0.543405
1 1 B 0.278369
2 1 A 0.424518
3 2 A 0.844776
4 2 B 0.004719
5 3 A 0.121569
6 3 B 0.670749
7 3 A 0.825853
8 3 A 0.136707
9 4 B 0.575093
10 4 A 0.891322
11 5 B 0.209202
res = df.groupby(['id', 'view'])['value'].mean().reset_index()
print (res)
id view value
0 1 A 0.483961
1 1 B 0.278369
2 2 A 0.844776
3 2 B 0.004719
4 3 A 0.361376
5 3 B 0.670749
6 4 A 0.891322
7 4 B 0.575093
8 5 B 0.209202
res = df.groupby(['id', 'view'], as_index=False)['value'].mean()
print (res)
id view value
0 1 A 0.483961
1 1 B 0.278369
2 2 A 0.844776
3 2 B 0.004719
4 3 A 0.361376
5 3 B 0.670749
6 4 A 0.891322
7 4 B 0.575093
8 5 B 0.209202
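As an aside, not part of the original answer: on pandas >= 0.25 you can combine as_index=False with named aggregation to rename or add columns in the same step; the n_obs column below is an illustrative extra, not something the question asked for:

res = (df.groupby(['id', 'view'], as_index=False)
         .agg(value=('value', 'mean'),
              n_obs=('value', 'size')))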
I have the following DataFrame:
Date best a b c d
1990 a 5 4 7 2
1991 c 10 1 2 0
1992 d 2 1 4 12
1993 a 5 8 11 6
I would like to make a dataframe as follows:
Date best value
1990 a 5
1991 c 2
1992 d 12
1993 a 5
So for each row I am looking up a value using the column name stored in another column. For instance, the value for 1990 in the second df should look up "a" from the first df, and the second row should look up "c" (=2) from the first df.
Any ideas?
There is a built-in lookup function that can handle this type of situation (it looks up by row/column). I don't know how optimized it is, but it may be faster than the apply solution. (Note: DataFrame.lookup was deprecated in pandas 1.2.0 and removed in 2.0; see the later answers for replacements.)
In [9]: df['value'] = df.lookup(df.index, df['best'])
In [10]: df
Out[10]:
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5
You can create a lookup function and call apply on your dataframe row-wise; this isn't very efficient for large dfs though:
In [245]:
def lookup(x):
    return x[x.best]
df['value'] = df.apply(lambda row: lookup(row), axis=1)
df
Out[245]:
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5
You can do this using np.where as below; I think it will be more efficient:
import numpy as np
import pandas as pd
df = pd.DataFrame([['1990', 'a', 5, 4, 7, 2], ['1991', 'c', 10, 1, 2, 0], ['1992', 'd', 2, 1, 4, 12], ['1993', 'a', 5, 8, 11, 6]], columns=('Date', 'best', 'a', 'b', 'c', 'd'))
arr = df.best.values
cols = df.columns[2:]
for col in cols:
    arr2 = df[col].values
    arr = np.where(arr == col, arr2, arr)
df.drop(columns=cols, inplace=True)
df["values"] = arr
df
Result
Date best values
0 1990 a 5
1 1991 c 2
2 1992 d 12
3 1993 a 5
lookup is deprecated since version 1.2.0. With melt you can 'unpivot' columns to the row axis, where the column names are stored by default in the column variable and their values in value. query returns only the rows where the columns best and variable are equal. drop and sort_values are used to match your requested format.
df_new = (
    df.melt(id_vars=['Date', 'best'], value_vars=['a', 'b', 'c', 'd'])
      .query('best == variable')
      .drop('variable', axis=1)
      .sort_values('Date')
)
Output:
Date best value
0 1990 a 5
9 1991 c 2
14 1992 d 12
3 1993 a 5
A simple solution that uses a mapper dictionary:
vals = df[['a','b','c','d']].to_dict('list')
mapper = {k: vals[v][k] for k,v in zip(df.index, df['best'])}
df['value'] = df.index.map(mapper).to_numpy()
Output:
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5
Look up values by index/column labels with factorize and reindex, because DataFrame.lookup is deprecated since version 1.2.0:
idx, cols = pd.factorize(df['best'])
df['value'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print (df)
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5
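A usage note on the factorize/reindex approach, sketched under the assumption that best may contain labels with no matching column: reindex then inserts an all-NaN column, so the lookup yields NaN instead of raising. Here 'z' is a hypothetical label:

df2 = df.assign(best=['a', 'c', 'z', 'a'])  # 'z' has no matching column
idx, cols = pd.factorize(df2['best'])
df2['value'] = df2.reindex(cols, axis=1).to_numpy()[np.arange(len(df2)), idx]
# the 'z' row gets NaN in 'value'; the rest match the lookup as before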