I have some data imported from a CSV; to create something similar I used this:
data = pd.DataFrame([[1,0,2,3,4,5],[0,1,2,3,4,5],[1,1,2,3,4,5],[0,0,2,3,4,5]], columns=['split','sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
means = data.groupby(['split','sex']).mean()
so the dataframe looks something like this:
           group0Low  group0High  group1Low  group1High
split sex
0     0            2           3          4           5
      1            2           3          4           5
1     0            2           3          4           5
      1            2           3          4           5
You'll notice that each column actually contains two variables (the group number and the Low/High level). (It was set up this way for running repeated measures ANOVA in SPSS.)
I want to split the columns up so I can also group by "group", like this (I actually screwed up the order of the numbers, but hopefully the idea is clear):
                  low  high
split sex group
0     0   0        95   265
          1       123    54
      1   0       120   220
          1        98   111
1     0   0       150   190
          1       211   300
      1   0       139    86
          1       132   250
How do I achieve this?
The first trick is to gather the columns into a single column using stack:
In [6]: means
Out[6]:
           group0Low  group0High  group1Low  group1High
split sex
0     0            2           3          4           5
      1            2           3          4           5
1     0            2           3          4           5
      1            2           3          4           5
In [13]: stacked = means.stack().reset_index(level=2)
In [14]: stacked.columns = ['group_level', 'mean']
In [15]: stacked.head(2)
Out[15]:
           group_level  mean
split sex
0     0     group0Low      2
      0    group0High      3
Now we can do whatever string operations we want on group_level using pd.Series.str as follows:
In [18]: stacked['group'] = stacked.group_level.str[:6]
In [21]: stacked['level'] = stacked.group_level.str[6:]
In [22]: stacked.head(2)
Out[22]:
           group_level  mean   group level
split sex
0     0     group0Low      2  group0   Low
      0    group0High      3  group0  High
Now you're in business and you can do whatever you want. For example, sum each group/level:
In [31]: stacked.groupby(['group', 'level']).sum()
Out[31]:
              mean
group  level
group0 High     12
       Low       8
group1 High     20
       Low      16
How do I group by everything?
If you want to group by split, sex, group and level you can do:
In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
Out[113]:
                         mean
split sex group  level
0     0   group0 High       3
                 Low        2
          group1 High       5
                 Low        4
What if the split is not always at location 6?
This SO answer shows how to do the splitting more intelligently.
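For instance, here is a minimal sketch using str.extract with a regex, assuming every label is "group" plus digits followed by the level name:
parts = stacked['group_level'].str.extract(r'(group\d+)(\w+)')
stacked['group'] = parts[0]
stacked['level'] = parts[1]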
This can be done by first constructing a multi-level index on the column names and then reshaping the dataframe with stack.
import pandas as pd
import numpy as np
# some artificial data
# ==================================
multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(50,300, (4,4)), columns='group0Low group0High group1Low group1High'.split(), index=multi_index)
df
           group0Low  group0High  group1Low  group1High
split sex
0     0          222          97        167         242
      1          117         245        153          59
1     0          261          71        292          86
      1          137         120        266         138
# processing
# ==============================
level_group = np.where(df.columns.str.contains('0'), 0, 1)
# output: array([0, 0, 1, 1])
level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
# output: array(['low', 'high', 'low', 'high'], dtype='<U4')
multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
df.columns = multi_level_columns
df.stack(level='group')
val              high  low
split sex group
0     0   0        97  222
          1       242  167
      1   0       245  117
          1        59  153
1     0   0        71  261
          1        86  292
      1   0       120  137
          1       138  266
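If you prefer the original low/high column order over the alphabetical order that stack produces, you can reindex the columns afterwards:
df.stack(level='group')[['low', 'high']]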
Related
I have the following dataframe from a database download that I cleaned up a bit. Unfortunately, some single numbers were split across two columns (see row 9). I'm trying to merge the two columns but exclude the zero values.
   city          crashes  crashes_1  total_crashes
1  ABERDEEN          710          0            710
2  ACHERS LODGE        1          0              1
3  ACME                1          0              1
4  ADVANCE            55          0             55
5  AFTON               2          0              2
6  AHOSKIE           393          0            393
7  AKERS CENTER        1          0              1
8  ALAMANCE           50          0             50
9  ALBEMARLE           1         58             59
So for row 9 I want to end with:
9  ALBEMARLE           1         58            158
I tried a few snippets but nothing seems to work:
df['total_crashes'] = df['crashes'].astype(str).str.zfill(0) + df['crashes_1'].astype(str).str.zfill(0)
df['total_crashes'] = df['total_crashes'].astype(str).replace('\0', '', regex=True)
df['total_crashes'] = df['total_crashes'].apply(lambda x: ''.join(x[x!=0]))
df['total_crashes'] = df['total_crashes'].str.cat(df['total_crashes'], x[x!=0])
df['total_crashes'] = df.drop[0].sum(axis=1)
Thanks for any help.
You can use a conditional replacement with where:
df['total_crashes'] = df['crashes'].astype(str) + df['crashes_1'].astype(str).where(df['crashes_1'] != 0, "")
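Note that the concatenation yields strings; if total_crashes should remain numeric, convert it back afterwards:
df['total_crashes'] = df['total_crashes'].astype(int)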
I have a wide dataframe I want to be able to reshape.
There are some columns I want to preserve. I have been exploring melt and wide_to_long, but I'm not sure they are what I need.
Imagine I have some columns named: 'id', 'classroom', 'city'
And other columns called: 'alumn_x_subject_y_mark', 'alumn_x_subject_y_name', 'alumn_x_subject_y_teacher'
And x and y are the product of [range(20), range(10)].
I would like to end with a df that has columns: id, classroom, city, alumn, subject, mark, name, teacher
With all the original 20*10 columns converted to rows.
An empty dataframe with that structure can be generated this way:
import pandas as pd
import itertools
vals = list(itertools.product(*[range(20), range(10)]))
pd.DataFrame(columns=['id', 'classroom', 'city']
             + ['alumn_{0}_subject_{1}_mark'.format(x, y) for x, y in vals]
             + ['alumn_{0}_subject_{1}_name'.format(x, y) for x, y in vals]
             + ['alumn_{0}_subject_{1}_teacher'.format(x, y) for x, y in vals],
             dtype=object)
I'm not building this dataframe but receiving it from a file; that's why it has so many columns and I cannot change that.
If you had only 2 parameters to extract, wide_to_long would work (see the sketch at the end of this answer).
Here you have 3, so you can perform a manual reshape with a MultiIndex:
regex = r'alumn_(\d+)_subject_(\d+)_(.*)'
out = (df
.set_index(['id', 'classroom', 'city'])
.pipe(lambda d: d.set_axis(pd.MultiIndex
.from_frame(d.columns.str.extract(regex),
names=['alumn', 'subject', None]
),
axis=1))
.stack(['alumn', 'subject'])
.reset_index()
)
output:
Empty DataFrame
Columns: [id, classroom, city, alumn, subject, mark, name, teacher]
Index: []
output with a single row (after df.loc[0] = range(df.shape[1])):
     id classroom city alumn subject mark name teacher
0     0         1    2     0       0    3  203     403
1     0         1    2     0       1    4  204     404
2     0         1    2     0       2    5  205     405
3     0         1    2     0       3    6  206     406
4     0         1    2     0       4    7  207     407
..   ..       ...  ...   ...     ...  ...  ...     ...
195   0         1    2     9       5   98  298     498
196   0         1    2     9       6   99  299     499
197   0         1    2     9       7  100  300     500
198   0         1    2     9       8  101  301     501
199   0         1    2     9       9  102  302     502

[200 rows x 8 columns]
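For comparison, here is a minimal sketch of the two-parameter case mentioned above, where wide_to_long alone is enough (the column layout here is hypothetical):
import pandas as pd

df2 = pd.DataFrame([[1, 5, 7, 'ann', 'bob']],
                   columns=['id', 'mark_0', 'mark_1', 'name_0', 'name_1'])

# one stub per attribute; the numeric suffix becomes the "alumn" level
pd.wide_to_long(df2, stubnames=['mark', 'name'], i='id', j='alumn', sep='_')
#           mark name
# id alumn
# 1  0         5  ann
#    1         7  bob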
I have a pandas dataframe:
 id  colA  colB  colC
194     1     0     1
194     1     1     0
194     2     1     3
195     1     1     2
195     0     1     0
197     1     1     2
I would like to calculate the occurrence of each value, grouped by id. In my case, the expected result is:
 id  countOfValue0  countOfValue1  countOfValue2  countOfValue3
194              2              3              1              1
195              1              2              1              0
197              0              1              1              0
If a value appears more than once in the same row, it should be counted only once for that row (this is why, for id=194, the count of value 1 is 3).
I thought about separating the data into 3 dataframes grouped by id-colA, id-colB, id-colC, something like df.groupby(['id', 'colA']), but I can't find a proper way to calculate those dataframe values based on id. There is probably a more efficient way of doing this.
Try:
res=df.set_index("id", append=True).stack()\
.reset_index(level=0).reset_index(level=1,drop=True)\
.drop_duplicates().assign(_dummy=1)\
.rename(columns={0: "countOfValue"})\
.pivot_table(index="id", columns="countOfValue", values="_dummy", aggfunc="sum")\
.fillna(0).astype(int)
res=res.add_prefix("countOfValue")
res.columns.name = None
Outputs:
     countOfValue0  ...  countOfValue3
id                  ...
194              2  ...              1
195              1  ...              0
197              0  ...              0
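An alternative sketch without the dummy column: melt to long form, keep each value at most once per original row, then cross-tabulate id against value:
long = df.reset_index().melt(id_vars=['index', 'id'])
long = long.drop_duplicates(subset=['index', 'value'])  # distinct values per row
res = pd.crosstab(long['id'], long['value']).add_prefix('countOfValue')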
I need a count of non-zero variables in pairs of rows.
I have a dataframe that lists the density of each species found at several sampling points. I need to know the total number of species found at each pair of sampling points. Here is an example of my data:
>>> import pandas as pd
>>> df = pd.DataFrame({'ID':[111,222,333,444],'minnow':[1,3,5,4],'trout':[2,0,0,3],'bass':[0,1,3,0],'gar':[0,1,0,0]})
>>> df
    ID  bass  gar  minnow  trout
0  111     0    0       1      2
1  222     1    1       3      0
2  333     3    0       5      0
3  444     0    0       4      3
I will pair the rows by ID number, so the pair (111,222) should return a total of 4, while the pair (111,333) should return a total of 3. I know I can get a sum of non-zeros for each row, but if I add those totals for each pair I will be double counting some of the species.
Here's an approach with NumPy -
In [35]: df
Out[35]:
    ID  bass  gar  minnow  trout
0  111     0    0       1      2
1  222     1    1       3      0
2  333     3    0       5      0
3  444     0    0       4      3
In [36]: a = df.iloc[:,1:].values!=0
In [37]: r,c = np.triu_indices(df.shape[0],1)
In [38]: l = df.ID
In [39]: pd.DataFrame(np.column_stack((l[r], l[c], (a[r] | a[c]).sum(1))))
Out[39]:
     0    1  2
0  111  222  4
1  111  333  3
2  111  444  2
3  222  333  3
4  222  444  4
5  333  444  3
You can do this using iloc for slicing and NumPy:
np.sum((df.iloc[[0, 1], 1:]!=0).any(axis=0))
Here df.iloc[[0, 1], 1:] gives you the first two rows with the ID column excluded, and the NumPy sum counts the number of species that are non-zero in at least one of the selected rows. You can change the row positions in df.iloc[[0, 1], 1:] to select any combination of rows.
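To apply it to every pair, you could loop over the row combinations (using the df from the question):
from itertools import combinations

for i, j in combinations(range(len(df)), 2):
    total = (df.iloc[[i, j], 1:] != 0).any(axis=0).sum()
    print(df.ID.iloc[i], df.ID.iloc[j], total)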
If the rows are sorted so that the two groups occur one after another, you could do
import pandas as pd
import numpy as np
x = np.random.randint(0,2,(10,3))
df = pd.DataFrame(x)
pair_a = df.loc[::2].reset_index(drop = True)
pair_b = df.loc[1::2].reset_index(drop = True)
paired = pd.concat([pair_a,pair_b],axis = 1)
Then find where paired is non-zero.
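For example, a species counts once per pair if either member of the pair has a non-zero entry for it; a minimal sketch using pair_a and pair_b from above (they share the same columns and index after the reset):
pair_counts = ((pair_a != 0) | (pair_b != 0)).sum(axis=1)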
I have a pandas dataframe as shown below. What I'm trying to do is partition (or group) by BlockID, LineID and WordID, and then within each group use the current WordStartX minus the previous (WordStartX + WordWidth) to derive another column, e.g. WordDistance, to indicate the distance between this word and the previous word.
This post, Row operations within a group of a pandas dataframe, is very helpful, but in my case multiple columns are involved (WordStartX and WordWidth).
   BlockID  LineID  WordID  WordStartX  WordWidth  WordDistance
0        0       0       0         275        150  0
1        0       0       1         431         96  431-(275+150)=6
2        0       0       2         642         90  642-(431+96)=115
3        0       0       3         746        104  746-(642+90)=14
4        1       0       0         273         69  ...
5        1       0       1         352        151  ...
6        1       0       2         510         92
7        1       0       3         647         90
8        1       0       4         752        105
The diff() and shift() functions are usually helpful for calculations that refer to previous or next rows:
df['WordDistance'] = (df.groupby(['BlockID', 'LineID'])
.apply(lambda g: g['WordStartX'].diff() - g['WordWidth'].shift()).fillna(0).values)
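An equivalent formulation without apply, shifting each word's end position within its group (a sketch):
prev_end = (df['WordStartX'] + df['WordWidth']).groupby([df['BlockID'], df['LineID']]).shift()
df['WordDistance'] = (df['WordStartX'] - prev_end).fillna(0)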