I have a dataframe of length-interval data (from boreholes) which looks something like this:
df
Out[46]:
from to min intensity
0 0 10 py 2
1 5 15 cpy 3.5
2 14 27 spy 0.7
I need to pivot this data, but also break it on the least common length interval; resulting in the 'min' column as column headers, and the values being the 'intensity'. The output would look like this:
df.somefunc(index=['from','to'], columns='min', values='intensity', fill_value=0)
Out[47]:
from to py cpy spy
0 0 5 2 0 0
1 5 10 2 3.5 0
2 10 14 0 3.5 0
3 14 15 0 3.5 0.7
4 15 27 0 0 0.7
so basically the "From" and "To" describe non-overlapping intervals down a borehole, where the intervals have been split by the least common denominator - as you can see the "py" interval from the original table has been split, the first (0-5m) into py:2, cpy:0 and the second (5-10m) into py:2, cpy:3.5.
The result from just a basic pivot_table function is this:
pd.pivot_table(df, values='intensity', index=['from', 'to'], columns="min", aggfunc="first", fill_value=0)
Out[48]:
min cpy py spy
from to
0 10 0 2 0
5 15 3.5 0 0
14 27 0 0 0.7
which just treats the from and to columns combined as an index. An important point is that my output cannot have overlapping from and to values (i.e. the next 'from' value cannot be less than the previous 'to' value).
Is there an elegant way to accomplish this using Pandas? Thanks for the help!
I don't know of native interval arithmetic in Pandas, so you need to do it yourself.
Here is a way to do that, if I understand the boundary conditions correctly.
Note that this pairs every new interval against every original row, so it can create a huge table for big inputs.
import numpy as np
import pandas as pd

# make the new bounds
bounds = np.unique(np.hstack((df["from"], df["to"])))
df2 = pd.DataFrame({"from": bounds[:-1], "to": bounds[1:]})

# find which new intervals fall inside each original interval
isin = df.apply(
    lambda x: df2["from"].between(x["from"], x["to"] - 1)
              | df2["to"].between(x["from"] + 1, x["to"]),
    axis=1,
).T

# data: take the intensity where the new interval is covered, 0 elsewhere
data = np.where(isin, df.intensity, 0)

# result
df3 = pd.DataFrame(data,
                   index=pd.MultiIndex.from_arrays(df2.values.T),
                   columns=df["min"])
For the example data, this gives:
In [26]: df3
Out[26]:
min py cpy spy
0 5 2.0 0.0 0.0
5 10 2.0 3.5 0.0
10 14 0.0 3.5 0.0
14 15 0.0 3.5 0.7
15 27 0.0 0.0 0.7
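As an alternative sketch of the same idea (assuming pandas >= 1.2 for merge(how='cross'); variable names here are just illustrative), you can split on every breakpoint, keep the pieces contained in each original interval, and let pivot_table do the rest:
import numpy as np
import pandas as pd

df = pd.DataFrame({'from': [0, 5, 14], 'to': [10, 15, 27],
                   'min': ['py', 'cpy', 'spy'], 'intensity': [2, 3.5, 0.7]})

# every breakpoint becomes a boundary of a new, non-overlapping piece
bounds = np.unique(df[['from', 'to']].values)
pieces = pd.DataFrame({'from': bounds[:-1], 'to': bounds[1:]})

# pair every piece with every original row, keep only the contained pieces
paired = pieces.merge(df, how='cross', suffixes=('', '_orig'))
paired = paired[(paired['from'] >= paired['from_orig'])
                & (paired['to'] <= paired['to_orig'])]

out = (paired.pivot_table(index=['from', 'to'], columns='min',
                          values='intensity', fill_value=0)
             .reset_index())
The column order comes out alphabetical rather than in order of appearance, but the values match the expected output above.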
Related
I need to count, for each row, the number of immediately preceding consecutive rows whose value is less than the current value.
Below is a sample input and the expected result.
df = pd.DataFrame([10, 9, 8, 11, 10, 11, 13], columns=['value'])
df_result = pd.DataFrame({'value': [10, 9, 8, 11, 10, 11, 13],
                          'number of preceding consecutive rows less than current': [0, 0, 0, 3, 0, 1, 6]})
Is it possible to achieve this without a loop?
Otherwise a solution with a loop would also be fine.
A follow-up question
Could I do it with a groupby operation, for the following input?
df = pd.DataFrame([[10,0],[9,0],[7,0],[8,0],[11,1],[10,1],[11,1],[13,1]], columns=['value','group'])
The following raised an error:
df.groupby('group')['value'].expanding()
Assuming this input:
value
0 10
1 9
2 8
3 11
4 10
5 13
You can use cummax together with a custom expanding apply:
df['out'] = (df['value'].cummax().expanding()
               .apply(lambda s: s.lt(df.loc[s.index[-1], 'value']).sum())
            )
For the particular case of < comparison, you can use a much faster trick with numpy. If a value is greater than all previous values, then it is greater than n values where n is the rank:
m = df['value'].lt(df['value'].cummax())
df['out'] = np.where(m, 0, np.arange(len(df)))
Output:
value out
0 10 0.0
1 9 0.0
2 8 0.0
3 11 3.0
4 10 0.0
5 13 5.0
Update: consecutive values
df['out'] = (
    df['value'].expanding()
      # walk backwards from the previous row and count the run of smaller values
      .apply(lambda s: s.iloc[-2::-1].lt(s.iloc[-1]).cummin().sum())
)
Output:
value out
0 10 0.0
1 9 0.0
2 8 0.0
3 11 3.0
4 10 0.0
5 11 1.0
6 13 6.0
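For the groupby follow-up, one sketch (untested, assuming the same consecutive-run logic should apply independently within each group; the helper name is just illustrative) is to wrap the expanding computation in a groupby transform:
def consecutive_smaller(s):
    # for each row, count the run of immediately preceding values below it
    return s.expanding().apply(
        lambda w: w.iloc[-2::-1].lt(w.iloc[-1]).cummin().sum()
    )

df = pd.DataFrame([[10,0],[9,0],[7,0],[8,0],[11,1],[10,1],[11,1],[13,1]],
                  columns=['value', 'group'])
df['out'] = df.groupby('group')['value'].transform(consecutive_smaller)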
I have a data frame that contains a numeric column, plus a list of tuples and a list of strings.
The list of tuples holds the values to be added, where each position in that list corresponds to a value of the numeric column in the data frame. The list of strings holds the names of the columns to be added.
Example:
import pandas as pd
df = pd.DataFrame({'number':[0,0,1,1,2,2,3,3]})
# a list of keys and a list of tuples
keys = ['foo','bar']
combinations = [('99%',0.9),('99%',0.8),('1%',0.9),('1%',0.8)]
Expected output:
number foo bar
0 0 99% 0.9
1 0 99% 0.9
2 1 99% 0.8
3 1 99% 0.8
4 2 1% 0.9
5 2 1% 0.9
6 3 1% 0.8
7 3 1% 0.8
Original post
To get that output, you can just try
df2 = pd.DataFrame(combinations, columns = keys)
pd.concat([df, df2], axis=1)
which returns
number foo bar
0 0 99% 0.9
1 1 99% 0.8
2 2 1% 0.9
3 3 1% 0.8
Edit
Based on your new requirements, you can use the following
df.set_index('number', inplace=True)
df = df.merge(df2, left_index = True, right_index=True)
df = df.reset_index().rename(columns={'index':'number'})
This also works for different duplicates amounts, i.e.
df = pd.DataFrame({'number':[0,0,1,1,1,2,2,3,3,3]})
returns
number foo bar
0 0 99% 0.9
1 0 99% 0.9
2 1 99% 0.8
3 1 99% 0.8
4 1 99% 0.8
5 2 1% 0.9
6 2 1% 0.9
7 3 1% 0.8
8 3 1% 0.8
9 3 1% 0.8
You can use a list comprehension inside a for loop; I think it's a pretty fast and straightforward approach:
for i in range(len(keys)):
    df[keys[i]] = [x[i] for x in combinations]
Output:
number foo bar
0 0 99% 0.9
1 1 99% 0.8
2 2 1% 0.9
3 3 1% 0.8
I found one solution using:
df_new = pd.DataFrame()
for model_number, df_subset in df.groupby('number'):
    for key_idx, key in enumerate(keys):
        df_subset[key] = combinations[model_number][key_idx]
    df_new = df_new.append(df_subset)
But this seems pretty 'dirty' to me; are there better and more efficient solutions?
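As a sketch of a more concise option (assuming, as in the example, that the values in 'number' are 0-based positions into combinations; 'lookup' and 'out' are just illustrative names), you can build the lookup frame once and join it on the 'number' column, which also handles repeated numbers:
import pandas as pd

df = pd.DataFrame({'number': [0, 0, 1, 1, 2, 2, 3, 3]})
keys = ['foo', 'bar']
combinations = [('99%', 0.9), ('99%', 0.8), ('1%', 0.9), ('1%', 0.8)]

# one lookup row per distinct 'number', then join on that column
lookup = pd.DataFrame(combinations, columns=keys)
out = df.join(lookup, on='number')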
I have a nested numpy.ndarray of the following format (each of the sublists has the same size)
len(exp_data) # Timepoints
Out[205]: 42
len(exp_data[0])
Out[206]: 1
len(exp_data[0][0]) # Y_bins
Out[207]: 13
len(exp_data[0][0][0]) # X_bins
Out[208]: 43
type(exp_data[0][0][0][0])
Out[209]: numpy.float64
I want to move these into a pandas DataFrame with three index columns (numbered from 0 to N) and a final column holding the float value.
I could do this with a series of loops, but that seems like a very inefficient way of solving the problem.
In addition I would like to get rid of any nan values (not present in sample data). Do I do this after creating the df or is there a way to skip adding them in the first place?
NOTE: code below has been edited and I've added sample data
import random
import numpy as np
import pandas as pd
exp_data = [[[[random.random() for x in range(5)],
              [random.random() for x in range(5)],
              [random.random() for x in range(5)],
              ]]] * 5
exp_data[0][0][0][1] = np.nan

df = pd.DataFrame(columns=['Timepoint', 'Y_bin', 'X_bin', 'Values'])
for t, timepoint in enumerate(exp_data):
    for y, y_bin in enumerate(timepoint[0]):
        for x, x_bin in enumerate(y_bin):
            df.loc[len(df)] = [int(t), int(y), int(x), x_bin]
df = df.dropna().reset_index(drop=True)
The final format should be as follows (except I'd preferably like integers instead of floats in first 3 columns, but not essential; int(t) etc. doesn't do the trick)
df
Out[291]:
Timepoint Y_bin X_bin Values
0 0.0 0.0 0.0 0.095391
1 0.0 0.0 2.0 0.963608
2 0.0 0.0 3.0 0.855735
3 0.0 0.0 4.0 0.392637
4 0.0 1.0 0.0 0.555199
5 0.0 1.0 1.0 0.118981
6 0.0 1.0 2.0 0.201782
...
len(df) # has received a total of 75 (5*3*5) input values of which 5 are nan
Out[293]: 70
You can change the displayed float format by adding this line
pd.options.display.float_format = '{:,.0f}'.format
to the end of your code, like this:
df = pd.DataFrame(columns=['Timepoint', 'Y_bin', 'X_bin', 'Values'])
for t, timepoint in enumerate(exp_data):
    for y, y_bin in enumerate(timepoint[0]):
        for x, x_bin in enumerate(y_bin):
            df.loc[len(df)] = [t, y, x, x_bin]
df = df.dropna().reset_index(drop=True)
pd.options.display.float_format = '{:,.0f}'.format
df
Out[250]:
Timepoint Y_bin X_bin Values
0 0 4 10 -2
1 0 4 11 -1
2 0 4 12 -2
3 0 4 13 -2
4 0 4 14 -2
5 0 4 15 -2
6 0 4 16 -3
...
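If the concern is also the nested loops themselves, a vectorized sketch (assuming the nested lists are rectangular, as in the sample data above; 'arr', 'idx' and 'out' are just illustrative names) converts everything to an array, builds the index columns with a MultiIndex, and drops the NaNs at the end:
import numpy as np
import pandas as pd

arr = np.asarray(exp_data, dtype=float)   # shape (Timepoints, 1, Y_bins, X_bins)
arr = arr[:, 0, :, :]                     # drop the singleton middle axis

t, y, x = arr.shape
idx = pd.MultiIndex.from_product([range(t), range(y), range(x)],
                                 names=['Timepoint', 'Y_bin', 'X_bin'])
out = (pd.Series(arr.ravel(), index=idx, name='Values')
         .dropna()
         .reset_index())
Because the bin numbers are generated separately from the float values, the first three columns stay integers.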
I have been struggling with finding a way to expand/clone observation rows based on a pre-determined number and a grouping variable (id). For context, here is an example data frame using pandas and numpy (python3).
df = pd.DataFrame([[1, 15], [2, 20]], columns = ['id', 'num'])
df
Out[54]:
id num
0 1 15
1 2 20
I want to expand/clone the rows by the number given in the "num" variable based on their ID group. In this case, I would want 15 rows for id = 1 and 20 rows for id = 2. This is probably an easy question, but I am struggling to make this work. I've been messing around with reindex and np.repeat, but the conceptual pieces are not fitting together for me.
In R, I used the expandRows function found in the splitstackshape package, which would look something like this:
library(splitstackshape)
df <- data.frame(id = c(1, 2), num = c(15, 20))
df
id num
1 1 15
2 2 20
df2 <- expandRows(df, "num", drop = FALSE)
df2
id num
1 1 15
1.1 1 15
1.2 1 15
1.3 1 15
1.4 1 15
1.5 1 15
1.6 1 15
1.7 1 15
1.8 1 15
1.9 1 15
1.10 1 15
1.11 1 15
1.12 1 15
1.13 1 15
1.14 1 15
2 2 20
2.1 2 20
2.2 2 20
2.3 2 20
2.4 2 20
2.5 2 20
2.6 2 20
2.7 2 20
2.8 2 20
2.9 2 20
2.10 2 20
2.11 2 20
2.12 2 20
2.13 2 20
2.14 2 20
2.15 2 20
2.16 2 20
2.17 2 20
2.18 2 20
2.19 2 20
Again, sorry if this is a stupid question and thanks in advance for any help.
I can't replicate your index, but I can replicate your values, using np.repeat, quite easily in fact.
v = df.values
df = pd.DataFrame(v.repeat(v[:, -1], axis=0), columns=df.columns)
If you want the exact index (although I can't see why you'd need to), you'd need a groupby operation -
def f(x):
    return x.astype(str) + '.' + np.arange(len(x)).astype(str)
idx = df.groupby('id').id.apply(f).values
Assign idx to df's index -
df.index = idx
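As another sketch (assuming 'num' holds non-negative integers), a common pandas idiom is to repeat the index labels and select with .loc, which keeps the original columns and dtypes:
import pandas as pd

df = pd.DataFrame([[1, 15], [2, 20]], columns=['id', 'num'])

# each row is selected df['num'] times
df2 = df.loc[df.index.repeat(df['num'])].reset_index(drop=True)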
I know that Pandas has a get_dummies function which you can use to convert categorical variables to dummy variables in a DataFrame. What I'm trying to do is slightly different.
I have a column containing percentage values from 0.0 to 100.0. I need to convert this to a column that has 1's for any value >= 10.0 and 0's for any value < 10.0. Is there a good way to do this repurposing get_dummy here or will I have to construct a loop to do it?
You can convert bools to ints directly:
(df.column_of_interest >= 10).astype(int)
I assume you're discussing pandas.get_dummies here, and I don't think that this is a use case for it. You are attempting to set two values on a boolean condition. One approach would be to get a boolean Series and take the integer representations for indicators, with
df['indicators'] = (df.percentages >= 10.).astype('int')
Demo
>>> df
percentages
0 70.176341
1 70.638246
2 55.078803
3 42.586290
4 73.340089
5 53.308670
6 3.059331
7 49.494812
8 10.379713
9 7.676286
10 55.023261
11 4.417545
12 51.744169
13 49.513638
14 39.189640
15 90.521703
16 29.696734
17 11.546118
18 5.737921
19 83.258049
>>> df['indicators'] = (df.percentages >= 10.).astype('int')
>>> df
percentages indicators
0 70.176341 1
1 70.638246 1
2 55.078803 1
3 42.586290 1
4 73.340089 1
5 53.308670 1
6 3.059331 0
7 49.494812 1
8 10.379713 1
9 7.676286 0
10 55.023261 1
11 4.417545 0
12 51.744169 1
13 49.513638 1
14 39.189640 1
15 90.521703 1
16 29.696734 1
17 11.546118 1
18 5.737921 0
19 83.258049 1
Let's assume you have a dataframe df, with a column Perc that contains your percentages:
import pandas as pd
import numpy as np

np.random.seed(111)
df = pd.DataFrame({"Perc": np.random.uniform(1, 100, 20)})
Now, you can easily form a new column by using a lambda function that recodes your percentages, like so:
df["Category"] = df.Perc.apply(lambda x: 0 if x < 10.0 else 1)