Suppose one has a dataframe created as such:
tdata = {('A', 50): [1, 2, 3, 4],
('A', 55): [5, 6, 7, 8],
('B', 10): [10, 20, 30, 40],
('B', 20): [50, 60, 70, 80],
('B', 50): [2, 4, 6, 8],
('B', 55): [10, 12, 14, 16]}
tdf = pd.DataFrame(tdata, index=range(0,4))
A B
50 55 10 20 50 55
0 1 5 10 50 2 10
1 2 6 20 60 4 12
2 3 7 30 70 6 14
3 4 8 40 80 8 16
How would one drop specific columns, say ('B', 10) and ('B', 20) from the dataframe?
Is there a way to drop the columns in one command such as tdf.drop(['B', [10,20]])? Note, I know that my example of the command is by no means close to what it should be, but I hope that it gets the gist across.
Is there a way to drop the columns through some logical expression? For example, say I want to drop columns having the sublevel indices less than 50 (again, the 10, 20 columns). Can I do some general command that would encompass column 'A', even though the 10,20 sublevel indices don't exist or must I specifically reference column 'B'?
You can pass a list of tuples to drop:
print (tdf.drop([('B',10), ('B',20)], axis=1))
A B
50 55 50 55
0 1 5 2 10
1 2 6 4 12
2 3 7 6 14
3 4 8 8 16
To remove columns based on the values of one level, build a boolean mask from that level:
mask = tdf.columns.get_level_values(1) >= 50
print (mask)
[ True True False False True True]
print (tdf.loc[:, mask])
A B
50 55 50 55
0 1 5 2 10
1 2 6 4 12
2 3 7 6 14
3 4 8 8 16
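The same mask can also feed drop, which may read closer to the "drop by logical expression" wording of the question. The following is a minimal sketch rather than the only way to do it; the names low, to_drop and trimmed are introduced here for illustration. Because the mask is computed over the whole second level, column 'A' is covered without being named.

```python
import pandas as pd

tdata = {('A', 50): [1, 2, 3, 4],
         ('A', 55): [5, 6, 7, 8],
         ('B', 10): [10, 20, 30, 40],
         ('B', 20): [50, 60, 70, 80],
         ('B', 50): [2, 4, 6, 8],
         ('B', 55): [10, 12, 14, 16]}
tdf = pd.DataFrame(tdata, index=range(4))

# Boolean mask over the second column level (position 1)
low = tdf.columns.get_level_values(1) < 50

# Indexing the column MultiIndex with the mask gives the labels to drop,
# converted to a plain list of tuples as in the drop example above
to_drop = tdf.columns[low].tolist()
trimmed = tdf.drop(columns=to_drop)

print(trimmed.columns.tolist())
# [('A', 50), ('A', 55), ('B', 50), ('B', 55)]
```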
If you need to drop by the labels of a single level, pass only those labels together with the level argument:
print (tdf.drop([50,55], axis=1, level=1))
B
10 20
0 10 50
1 20 60
2 30 70
3 40 80
I want to reshape the data so that the values in the index column become the columns
My Data frame:
Gender_Male Gender_Female Location_london Location_North Location_South
Cat
V 5 4 4 2 3
W 15 12 12 7 8
X 11 15 16 4 6
Y 22 18 21 9 9
Z 8 7 7 4 4
Desired Data frame:
Is there an easy way to do this? I also have 9 other categorical variables in my data set in addition to the Gender and Location variables. I have only included two variables to keep the example simple.
Code to create the example dataframe:
df1 = pd.DataFrame({
'Cat' : ['V','W', 'X', 'Y', 'Z'],
'Gender_Male' : [5, 15, 11, 22, 8],
'Gender_Female' : [4, 12, 15, 18, 7],
'Location_london': [4,12, 16, 21, 7],
'Location_North' : [2, 7, 4, 9, 4],
'Location_South' : [3, 8, 6, 9, 4]
}).set_index('Cat')
df1
You can transpose the dataframe and then split and set the new index:
Transpose
dft = df1.T
print(dft)
Cat V W X Y Z
Gender_Male 5 15 11 22 8
Gender_Female 4 12 15 18 7
Location_london 4 12 16 21 7
Location_North 2 7 4 9 4
Location_South 3 8 6 9 4
Split and set the new index
dft.index = dft.index.str.split('_', expand=True)
dft.columns.name = None
print(dft)
V W X Y Z
Gender Male 5 15 11 22 8
Female 4 12 15 18 7
Location london 4 12 16 21 7
North 2 7 4 9 4
South 3 8 6 9 4
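Once the index is a two-level MultiIndex, each original variable can be pulled out as its own block. The sketch below repeats the steps above in one runnable piece and then selects the 'Gender' rows; the variable name gender is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Cat': ['V', 'W', 'X', 'Y', 'Z'],
    'Gender_Male': [5, 15, 11, 22, 8],
    'Gender_Female': [4, 12, 15, 18, 7],
    'Location_london': [4, 12, 16, 21, 7],
    'Location_North': [2, 7, 4, 9, 4],
    'Location_South': [3, 8, 6, 9, 4],
}).set_index('Cat')

# Transpose, then split 'Gender_Male' into ('Gender', 'Male') and so on
dft = df1.T
dft.index = dft.index.str.split('_', expand=True)
dft.columns.name = None

# Selecting on the first level returns only that variable's categories
gender = dft.loc['Gender']
print(gender)
```

With the other nine categorical variables the same split should apply, assuming every column name contains exactly one underscore.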
I have this pandas series:
ts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
What I would like to get is a dataframe which contains another column with the sum of rows 0, 2, 4, 6 and for 1, 3, 5 and 7 (that means, one row is left out when creating the sum).
In this case, this means a new dataframe should look like this one:
index ts sum
0 1 16
1 2 20
2 3 16
3 4 20
4 5 16
5 6 20
6 7 16
7 8 20
How could I do this?
Group by the index modulo k, so that every k-th row falls into the same group:
k = 2
df = ts.to_frame('ts')
df['sum'] = df.groupby(ts.index % k)['ts'].transform('sum')
# if the index is not the default RangeIndex, group by row position instead
#df['sum'] = df.groupby(np.arange(len(ts)) % k)['ts'].transform('sum')
print (df)
ts sum
0 1 16
1 2 20
2 3 16
3 4 20
4 5 16
5 6 20
6 7 16
7 8 20
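For a series whose index is not 0, 1, 2, ..., the row position can stand in for the index. A small sketch, assuming a made-up string index ('a' to 'h') that is not part of the original question:

```python
import numpy as np
import pandas as pd

# Same values as in the question, but with a hypothetical non-numeric index
ts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=list('abcdefgh'))

k = 2
df = ts.to_frame('ts')

# Row position modulo k: 0, 1, 0, 1, ... so every k-th row shares a group
groups = np.arange(len(df)) % k
df['sum'] = df.groupby(groups)['ts'].transform('sum')

print(df)
```

Changing k to 3 would sum rows 0, 3, 6, then 1, 4, 7, then 2, 5 in the same way.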
I am a geologist needing to clean up data.
I have a .csv file containing drilling intervals, that I imported as a pandas dataframe that looks like this:
hole_name from to interval_type
0 A 0 1 Gold
1 A 1 2 Gold
2 A 2 4 Inferred_fault
3 A 4 6 NaN
4 A 6 7 NaN
5 A 7 8 NaN
6 A 8 9 Inferred_fault
7 A 9 10 NaN
8 A 10 11 Inferred_fault
9 B2 11 12 Inferred_fault
10 B2 12 13 Inferred_fault
11 B2 13 14 NaN
For each individual "hole_name", I would like to group/merge the "from" and "to" range for consecutive intervals associated with the same "interval_type". The NaN values can be dropped, they are of no use to me (but I already know how to do this, so it is fine).
Based on the example above, I would like to get something like this:
hole_name from to interval_type
0 A 0 2 Gold
2 A 2 4 Inferred_fault
3 A 4 8 NaN
6 A 8 9 Inferred_fault
7 A 9 10 NaN
8 A 10 11 Inferred_fault
9 B2 11 13 Inferred_fault
11 B2 13 14 NaN
I have looked around and tried to use groupby or pyranges, but I cannot figure out how to do this.
Thanks a lot in advance for your help!
This should do the trick:
import pandas as pd
import numpy as np
from itertools import groupby
# create dataframe
data = {
'hole_name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
'from': [0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13],
'to': [1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14],
'interval_type': ['Gold', 'Gold', 'Inferred_fault', np.nan, np.nan, np.nan,
'Inferred_fault', np.nan, 'Inferred_fault', 'Inferred_fault',
'Inferred_fault', np.nan]
}
df = pd.DataFrame(data=data)
# create an auxiliary column that numbers each run of consecutive repeated (hole_name, interval_type) values
grouped = [list(g) for k, g in groupby(list(zip(df.hole_name.tolist(), df.interval_type.tolist())))]
df['interval_type_id'] = np.repeat(range(len(grouped)),[len(x) for x in grouped])+1
# aggregate results
cols = df.columns[:-1]
vals = []
for idx, group in df.groupby(['interval_type_id', 'hole_name']):
    vals.append([group['hole_name'].iloc[0], group['from'].min(), group['to'].max(), group['interval_type'].iloc[0]])
result = pd.DataFrame(data=vals, columns=cols)
result
result should be:
hole_name from to interval_type
A 0 2 Gold
A 2 4 Inferred_fault
A 4 8
A 8 9 Inferred_fault
A 9 10
A 10 11 Inferred_fault
B 11 13 Inferred_fault
B 13 14
EDIT: added hole_name to the groupby function.
You can first build an indicator column for grouping. Then use agg to merge the sub groups to get from and to.
(
df.assign(ind=df.interval_type.fillna(''))
.assign(ind=lambda x: x.ind.ne(x.ind.shift(1).bfill()).cumsum())
.groupby(['hole_name', 'ind'])
.agg({'from':'first', 'to':'last', 'interval_type': 'first'})
.reset_index()
.drop(columns='ind')
)
hole_name from to interval_type
0 A 0 2 Gold
1 A 2 4 Inferred_fault
2 A 4 8 NaN
3 A 8 9 Inferred_fault
4 A 9 10 NaN
5 A 10 11 Inferred_fault
6 B 11 13 Inferred_fault
7 B 13 14 NaN
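A variation on the same shift/cumsum idea is to put the hole name into the run detection itself, so that a run also ends when the hole changes. This is a sketch on the sample data, not a tested solution for the full .csv; kind, new_run, run_id and merged are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'hole_name': ['A'] * 9 + ['B'] * 3,
    'from': [0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13],
    'to': [1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14],
    'interval_type': ['Gold', 'Gold', 'Inferred_fault', np.nan, np.nan, np.nan,
                      'Inferred_fault', np.nan, 'Inferred_fault', 'Inferred_fault',
                      'Inferred_fault', np.nan],
})

# NaN never equals NaN, so use a placeholder before comparing neighbours
kind = df['interval_type'].fillna('')

# A new run starts when the hole or the interval type differs from the previous row
new_run = (df['hole_name'] != df['hole_name'].shift()) | (kind != kind.shift())
run_id = new_run.cumsum()

merged = (
    df.groupby(run_id)
      .agg({'hole_name': 'first', 'from': 'min', 'to': 'max', 'interval_type': 'first'})
      .reset_index(drop=True)
)
print(merged)
```

The sketch assumes the rows are already sorted by hole and depth, as they are in the example.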
Is there an opposite function of pandas.DataFrame.droplevel where I can keep some levels of the multi-level index/columns using either the level name or index?
Example:
df = pd.DataFrame([
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]
], columns=['a','b','c','d']).set_index(['a','b','c']).T
a 1 5 9 13
b 2 6 10 14
c 3 7 11 15
d 4 8 12 16
Both the following commands can return the following dataframe:
df.droplevel(['a','b'], axis=1)
df.droplevel([0, 1], axis=1)
c 3 7 11 15
d 4 8 12 16
I am looking for a "keeplevel" command such that both the following commands can return the following dataframe:
df.keeplevel(['a','b'], axis=1)
df.keeplevel([0, 1], axis=1)
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
There is no keeplevel because it would be redundant: in a closed and well-defined set, when you define what you want to drop, you automatically define what you want to keep.
You can compute the levels to drop as whatever remains after dropping the levels you want to keep, and then pass those to droplevel.
def keeplevel(df, levels, axis=1):
    return df.droplevel(df.axes[axis].droplevel(levels).names, axis=axis)
>>> keeplevel(df, [0, 1])
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
Use a set difference to find the levels to drop:
df.droplevel(list(set(df.columns.names)-set(['a','b'])),axis=1)
Out[134]:
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
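The set difference works on level names only. To accept positions as well, the positions can be translated to names first. A minimal sketch, assuming the level names are unique and are not themselves integers; keep_levels is a hypothetical helper, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
], columns=['a', 'b', 'c', 'd']).set_index(['a', 'b', 'c']).T

def keep_levels(frame, keep, axis=1):
    """Hypothetical helper: drop every level of `axis` that is not in `keep`."""
    names = list(frame.axes[axis].names)
    # Translate integer positions into level names so both forms are accepted
    keep_names = [names[k] if isinstance(k, int) else k for k in keep]
    to_drop = [n for n in names if n not in keep_names]
    return frame.droplevel(to_drop, axis=axis)

by_name = keep_levels(df, ['a', 'b'])
by_pos = keep_levels(df, [0, 1])
print(by_name)
```

At least one level has to be dropped and at least one kept for this to be useful; other cases are not handled here.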
You can replace the Index object directly, which should be fast. Note that this modifies the dataframe in place.
def keep_level(df, keep, axis):
    idx = pd.MultiIndex.from_arrays([df.axes[axis].get_level_values(x) for x in keep])
    # assign the attribute directly; set_axis(..., inplace=True) was removed in pandas 2.0
    if axis in (0, 'index'):
        df.index = idx
    else:
        df.columns = idx
    return df
keep_level(df.copy(), ['a', 'b'], 1) # Copy to not modify original for illustration
#a 1 5 9 13
#b 2 6 10 14
#d 4 8 12 16
keep_level(df.copy(), [0, 1], 1)
#a 1 5 9 13
#b 2 6 10 14
#d 4 8 12 16
I have a data frame like the following:
values = random.sample(range(1, 101), 15)
df = pd.DataFrame({'x': [3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4], 'n': [100, 100, 100, 'reference', 'reference', 'reference', 500, 500, 500, 100, 100, 100, 'reference', 'reference', 'reference'], 'value': values})
The values labeled as 'reference' in the n column are reference values, which I will eventually plot against. To help with this, I need to make a data frame that has the reference values in a different column, so columns = ['x', 'n', 'value', 'value_reference']
Value reference is the reference value for all values of n as long as x is the same. Therefore, I want to make a data frame like the following:
desired_df = pd.DataFrame({'x': [3, 3, 3, 3, 3, 3, 4, 4, 4], 'n': [100, 100, 100, 500, 500, 500, 100, 100, 100], 'value': [values[i] for i in [0, 1, 2, 6, 7, 8, 9, 10, 11]], 'value_reference':[values[i] for i in [3, 4, 5, 3, 4, 5, 12, 13, 14]]})
I got the result here by hard coding exactly what I want to make a reproducible example. However, I am looking for the correct way of doing this operation.
How can this be done?
Thanks,
Jack
One way might be this:
df["tick"] = df.groupby(["x", "n"]).cumcount()
numbers = df.loc[df["n"] != "reference"]
ref = df.loc[df["n"] == "reference"]
ref = ref.drop("n", axis=1).rename(columns={"value": "reference"})
out = numbers.merge(ref).drop("tick", axis=1)
out = out.sort_values(["x", "n"])
which gives me
In [282]: out
Out[282]:
x n value reference
0 3 100 6 67
2 3 100 9 29
4 3 100 34 51
1 3 500 42 67
3 3 500 36 29
5 3 500 12 51
6 4 100 74 5
7 4 100 48 37
8 4 100 7 70
Step by step, first we add a tick column so we know which row of value matches with which row of reference:
In [290]: df
Out[290]:
x n value tick
0 3 100 6 0
1 3 100 9 1
2 3 100 34 2
3 3 reference 67 0
4 3 reference 29 1
5 3 reference 51 2
6 3 500 42 0
7 3 500 36 1
8 3 500 12 2
9 4 100 74 0
10 4 100 48 1
11 4 100 7 2
12 4 reference 5 0
13 4 reference 37 1
14 4 reference 70 2
Then we separate out the value and reference parts of the table:
In [291]: numbers = df.loc[df["n"] != "reference"]
...: ref = df.loc[df["n"] == "reference"]
...: ref = ref.drop("n", axis=1).rename(columns={"value": "reference"})
...:
...:
In [292]: numbers
Out[292]:
x n value tick
0 3 100 6 0
1 3 100 9 1
2 3 100 34 2
6 3 500 42 0
7 3 500 36 1
8 3 500 12 2
9 4 100 74 0
10 4 100 48 1
11 4 100 7 2
In [293]: ref
Out[293]:
x reference tick
3 3 67 0
4 3 29 1
5 3 51 2
12 4 5 0
13 4 37 1
14 4 70 2
and then we merge, where the merge will align on the shared columns, which are "x" and "tick". A sort to clean things up and we're done.
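Putting the steps together with the join keys spelled out may make the alignment easier to follow. The sketch below uses the fixed values from the sample output above in place of random.sample so the result is reproducible; is_ref is a name introduced here, and sorting by "tick" before dropping it is an extra step to keep the row order deterministic.

```python
import pandas as pd

# Fixed values copied from the sample output above instead of random.sample
values = [6, 9, 34, 67, 29, 51, 42, 36, 12, 74, 48, 7, 5, 37, 70]
df = pd.DataFrame({
    'x': [3] * 9 + [4] * 6,
    'n': [100] * 3 + ['reference'] * 3 + [500] * 3 + [100] * 3 + ['reference'] * 3,
    'value': values,
})

# Position of each row within its (x, n) group
df['tick'] = df.groupby(['x', 'n'], sort=False).cumcount()

is_ref = df['n'] == 'reference'
numbers = df.loc[~is_ref]
ref = df.loc[is_ref].drop(columns='n').rename(columns={'value': 'reference'})

# Each non-reference row picks up the reference row with the same x and tick
out = (
    numbers.merge(ref, on=['x', 'tick'], how='inner')
           .sort_values(['x', 'n', 'tick'])
           .drop(columns='tick')
           .reset_index(drop=True)
)
print(out)
```

This relies on every (x, n) group having the same number of rows as the reference group for that x; rows without a matching tick would be dropped by the inner merge.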