Can I combine these tables via pandas.crosstab? - python

I have three Series (each the result of .mean()) like this:
A 533.9
B 691.9
C 611.5
D 557.8
I want to concatenate them to three columns like this
all X Y
A 533.9 558.0 509.8
B 691.9 613.2 770.6
C 611.5 618.4 604.6
D 557.8 591.0 524.6
My MWE below works, but I wonder whether I can use .crosstab() or another, simpler pandas function for that.
The initial data frame:
group A B C D
0 X 844 908 310 477
1 X 757 504 729 865
2 X 420 281 898 260
3 X 258 755 683 805
4 X 511 618 472 548
5 Y 404 250 100 14
6 Y 783 909 434 719
7 Y 303 982 610 398
8 Y 476 810 913 824
9 Y 583 902 966 668
And this is the MWE using dict and pandas.concat() to solve the problem.
#!/usr/bin/env python3
import random as rd
import pandas as pd
import statistics
rd.seed(0)
df = pd.DataFrame({
    'group': ['X'] * 5 + ['Y'] * 5,
    'A': rd.choices(range(1000), k=10),
    'B': rd.choices(range(1000), k=10),
    'C': rd.choices(range(1000), k=10),
    'D': rd.choices(range(1000), k=10),
})
cols = list('ABCD')
result = {
    'all': df.loc[:, cols].mean(),
    'X': df.loc[df.group.eq('X'), cols].mean(),
    'Y': df.loc[df.group.eq('Y'), cols].mean()
}
tab = pd.concat(result, axis=1)
print(tab)

You can do this with melt, then pivot_table:
out = df.melt('group').pivot_table(
index = 'variable',
columns = 'group',
values = 'value',
aggfunc = 'mean',
margins = True).drop(['All'])
Out[207]:
group X Y All
variable
A 558.0 509.8 533.9
B 613.2 770.6 691.9
C 618.4 604.6 611.5
D 591.0 524.6 557.8

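Solution (averaging the two group means gives the overall mean only because both groups have the same number of rows):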
res = df.groupby('group').mean().T
res['all'] = (res.X + res.Y) / 2
print(res)
Output
group X Y all
A 558.0 509.8 533.9
B 613.2 770.6 691.9
C 618.4 604.6 611.5
D 591.0 524.6 557.8
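
The groupby idea above can also be written so that the overall mean is taken from the raw rows and therefore does not depend on the groups being the same size. This is a minimal sketch on a small made-up frame (the data and the unequal group sizes are assumptions chosen for illustration, not the question's data):

```python
import pandas as pd

# Small illustrative frame (hypothetical data, groups of unequal size)
df = pd.DataFrame({
    "group": ["X", "X", "X", "Y", "Y"],
    "A": [1, 2, 3, 10, 20],
    "B": [4, 5, 6, 30, 40],
})
cols = ["A", "B"]

# Per-group means, transposed so the original columns become the rows
tab = df.groupby("group")[cols].mean().T
# Overall mean computed from the raw rows, not from the group means
tab.insert(0, "all", df[cols].mean())
tab.columns.name = None
print(tab)
```

With unequal groups, (X + Y) / 2 would differ from the "all" column here, which is why the overall mean is computed separately.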

Related

How to simplify data.table logic and make it doable in pandas?

I have a dataframe with multiple columns of numerical values. I want to add a new column that compares the values of the other columns and assigns the name of the column holding the maximum value as a label. I already understand the logic in R, but I wonder how to do this easily in Python. Can anyone point out how to add such a column, where the values of multiple columns are compared row by row and the name of the column with the max value is assigned? Any idea?
reproducible example
This is a working reproducible example in R:
library(data.table)
df <- data.frame(a = sample(seq(1:10), size=10), b = sample(LETTERS[1:10], size=10), cnt=sample(seq(1:100), size=5),
RECENT_MOV= sample(seq(1:1000), size = 10),
RETIRED= sample(seq(1:200), size = 10),
SERV_EMPL= sample(seq(1:500), size = 10),
SUB_BUS=sample(seq(1:2000), size = 10),
WORK_HOME=sample(seq(1:1200), size = 10)
)
dt <- as.data.table(df)
write.csv(dt, "sample.csv")
label = c("RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME")
df$category <- NA_character_
df[, row_ind:= 1:nrow(df)]
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
current output is:
> dput(dt)
structure(list(a = c(5L, 10L, 1L, 6L, 7L, 3L, 2L, 8L, 4L, 9L),
b = c("E", "A", "D", "H", "J", "F", "G", "I", "C", "B"),
cnt = c(13L, 88L, 45L, 92L, 70L, 13L, 88L, 45L, 92L, 70L),
RECENT_MOV = c(70L, 195L, 620L, 572L, 354L, 648L, 798L, 657L,
233L, 672L), RETIRED = c(189L, 195L, 191L, 88L, 148L, 186L,
39L, 78L, 158L, 55L), SERV_EMPL = c(65L, 151L, 415L, 383L,
255L, 207L, 210L, 470L, 181L, 188L), SUB_BUS = c(894L, 829L,
1798L, 502L, 897L, 1461L, 744L, 1991L, 260L, 1697L), WORK_HOME = c(553L,
739L, 454L, 137L, 435L, 1042L, 316L, 697L, 517L, 1158L),
category = c("SUB_BUS", "SUB_BUS", "SUB_BUS", "RECENT_MOV",
"SUB_BUS", "SUB_BUS", "RECENT_MOV", "SUB_BUS", "WORK_HOME",
"SUB_BUS"), row_ind = 1:10), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000015a64b61ef0>)
my current python attempt
import pandas as pd
df=pd.read_csv("sample.csv", index_col=None, header=0)
label = ["RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME"]
df['category'] = pd.NA
df['row_ind'] = range(1, len(df) + 1)
However, I have trouble writing this line in a pythonic way:
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
Basically, this line creates a new column called category: it compares the columns listed in label and, for each row, assigns the name of the column with the max value as the value of category. How can I do this easily in Python?
logic translation:
df[cnt > 2, category := names(which.max(.SD[, label, with = FALSE])), by = row_ind]
This line says: first filter on the cnt column where cnt > 2, then compare the values of df[["RECENT_MOV", "RETIRED", "SERV_EMPL", "SUB_BUS","WORK_HOME"]] row-wise, pick the column with the highest value, and assign that column's name to df['category'] (i.e. df['category'] = col_name_with_highest_value_in_each_row).
desirable output
This is the desired output that I want to produce in Python:
a b cnt RECENT_MOV RETIRED SERV_EMPL SUB_BUS WORK_HOME category row_ind
1 5 E 13 70 189 65 894 553 SUB_BUS 1
2 10 A 88 195 195 151 829 739 SUB_BUS 2
3 1 D 45 620 191 415 1798 454 SUB_BUS 3
4 6 H 92 572 88 383 502 137 RECENT_MOV 4
5 7 J 70 354 148 255 897 435 SUB_BUS 5
6 3 F 13 648 186 207 1461 1042 SUB_BUS 6
7 2 G 88 798 39 210 744 316 RECENT_MOV 7
8 8 I 45 657 78 470 1991 697 SUB_BUS 8
9 4 C 92 233 158 181 260 517 WORK_HOME 9
10 9 B 70 672 55 188 1697 1158 SUB_BUS 10
This is actually really simple with pandas. Have a list of the columns to search in, and then use idxmax with axis=1:
# Filter out rows where `cnt` is less than or equal to 2
df = df[df['cnt'] > 2]
# Determine category for each row
search_cols = ['RECENT_MOV', 'RETIRED', 'SERV_EMPL', 'SUB_BUS', 'WORK_HOME']
df['category'] = df[search_cols].idxmax(axis=1)
# Assign row indexes
df['row_ind'] = df.index
Output:
>>> df
a b cnt RECENT_MOV RETIRED SERV_EMPL SUB_BUS WORK_HOME category row_ind
1 1 C 76 452 62 55 115 247 RECENT_MOV 1
2 7 E 14 50 165 337 1165 810 SUB_BUS 2
3 2 A 46 523 167 423 784 707 SUB_BUS 3
4 3 H 3 38 144 473 745 437 SUB_BUS 4
5 5 I 59 743 127 261 351 190 RECENT_MOV 5
6 8 J 76 143 49 470 1612 935 SUB_BUS 6
7 4 D 14 818 101 418 1919 314 SUB_BUS 7
8 6 F 46 714 9 446 1432 938 SUB_BUS 8
9 10 B 3 585 160 14 107 489 RECENT_MOV 9
10 9 G 59 814 73 449 937 287 SUB_BUS 10
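
Note that the R line keeps every row and only fills category where cnt > 2, whereas the answer above drops the other rows. A minimal sketch that keeps all rows is shown below; the three-row frame is made up for illustration, and using .where(mask) to blank out the non-matching rows is one possible choice rather than the answerer's method:

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values; the second row fails cnt > 2)
df = pd.DataFrame({
    "cnt": [13, 1, 45],
    "RECENT_MOV": [70, 195, 620],
    "RETIRED": [189, 195, 191],
    "SUB_BUS": [894, 829, 1798],
})
label = ["RECENT_MOV", "RETIRED", "SUB_BUS"]

mask = df["cnt"] > 2
# Name of the column with the row-wise maximum; rows failing the filter become missing
df["category"] = df[label].idxmax(axis=1).where(mask)
# 1-based row counter, like 1:nrow(df) in R
df["row_ind"] = range(1, len(df) + 1)
print(df)
```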

Sampling from static data set to create dataframe, ignore index in Python

I am trying to create some random samples (of a given size) from a static dataframe. The goal is to create multiple columns for each sample (and each sample drawn is the same size). I'm expecting to see multiple columns of the same length (i.e. sample size) in the fully sampled dataframe, but maybe append isn't the right way to go. Here is the code:
import numpy as np
import pandas as pd
# create sample dataframe
target_df = pd.DataFrame(np.arange(1000))
target_df.columns = ['pl']
# create the sampler:
sample_num = 5
sample_len = 10
df_max_row = len(target_df) - sample_len
sampled_df = pd.DataFrame()
for i in range(sample_num):
    rndm_start = np.random.choice(df_max_row, 1)[0]
    rndm_end = rndm_start + sample_len
    slicer = target_df.iloc[rndm_start:rndm_end]['pl']
    # note: DataFrame.append was removed in pandas 2.0, so this call needs an older pandas
    sampled_df = sampled_df.append(slicer, ignore_index=True)
sampled_df = sampled_df.T
The output of this is shown in the pic below. The red line shows the index I want to remove.
The desired output is shown below that. How do I make this happen?
Thanks!
I would create new column using
sampled_df[i] = slicer.reset_index(drop=True)
Optionally, I would use str(i) as the column name, because later it is simpler to select a column by string than by number.
import pandas as pd
import random
target_df = pd.DataFrame({'pl': range(1000)})
# create the sampler:
sample_num = 5
sample_len = 10
df_max_row = len(target_df) - sample_len
sampled_df = pd.DataFrame()
for i in range(1, sample_num+1):
    start = random.randint(0, df_max_row)
    end = start + sample_len
    slicer = target_df[start:end]['pl']
    sampled_df[str(i)] = slicer.reset_index(drop=True)
sampled_df.index += 1
print(sampled_df)
Result:
1 2 3 4 5
1 735 396 646 534 769
2 736 397 647 535 770
3 737 398 648 536 771
4 738 399 649 537 772
5 739 400 650 538 773
6 740 401 651 539 774
7 741 402 652 540 775
8 742 403 653 541 776
9 743 404 654 542 777
10 744 405 655 543 778
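
The same contiguous-window idea can be sketched without growing the frame inside a loop, by collecting the slices and joining them once with pd.concat. The seeded generator and the variable name starts are assumptions made for this illustration:

```python
import numpy as np
import pandas as pd

target_df = pd.DataFrame({"pl": np.arange(1000)})
sample_num, sample_len = 5, 10

rng = np.random.default_rng(0)  # the seed is an arbitrary choice for reproducibility
# One random start per sample; the upper bound keeps every window inside the frame
starts = rng.integers(0, len(target_df) - sample_len + 1, size=sample_num)

# Reset each slice's index so the columns line up on 0..sample_len-1
sampled_df = pd.concat(
    [target_df["pl"].iloc[s:s + sample_len].reset_index(drop=True) for s in starts],
    axis=1,
    keys=[str(i) for i in range(1, sample_num + 1)],
)
sampled_df.index += 1
print(sampled_df)
```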
But to get truly random values, I would first shuffle the column
np.random.shuffle(target_df['pl'])
and then there is no need to pick a random start.
shuffle modifies the original column in place, so its result can't be assigned to a new variable.
This way values are not repeated across the samples.
import pandas as pd
#import numpy as np
import random
target_df = pd.DataFrame({'pl': range(1000)})
# create the sampler:
sample_num = 5
sample_len = 10
sampled_df = pd.DataFrame()
#np.random.shuffle(target_df['pl'])
random.shuffle(target_df['pl'])
for i in range(1, sample_num+1):
    start = i * sample_len
    end = start + sample_len
    slicer = target_df[start:end]['pl']
    sampled_df[str(i)] = slicer.reset_index(drop=True)
sampled_df.index += 1
print(sampled_df)
Result:
1 2 3 4 5
1 638 331 171 989 170
2 22 643 47 136 764
3 969 455 211 763 194
4 859 384 174 552 566
5 221 829 62 926 414
6 4 895 951 967 381
7 758 688 594 876 873
8 757 691 825 693 707
9 235 353 34 699 121
10 447 81 36 682 251
If values may repeat across samples (each sample is drawn independently), you could use
sampled_df[str(i)] = target_df['pl'].sample(n=sample_len, ignore_index=True)
import pandas as pd
target_df = pd.DataFrame({'pl': range(1000)})
# create the sampler:
sample_num = 5
sample_len = 10
sampled_df = pd.DataFrame()
for i in range(1, sample_num+1):
    sampled_df[str(i)] = target_df['pl'].sample(n=sample_len, ignore_index=True)
sampled_df.index += 1
print(sampled_df)
EDIT
You can also take the shuffled values as a NumPy array, use reshape, and convert the result back to a DataFrame with many columns. Afterwards you can select just the columns you need.
import pandas as pd
import random
target_df = pd.DataFrame({'pl': range(1000)})
# create the sampler:
sample_num = 5
sample_len = 10
random.shuffle(target_df['pl'])
sampled_df = pd.DataFrame(target_df['pl'].values.reshape([sample_len,-1]))
sampled_df = sampled_df.iloc[:, 0:sample_num]
sampled_df.index += 1
print(sampled_df)
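
A variant of the reshape approach that avoids shuffling a DataFrame column in place (whose behaviour can depend on the pandas version) is sketched below. It swaps in NumPy's Generator.permutation, which returns a shuffled copy; the seed and the intermediate names shuffled and block are assumptions for this example:

```python
import numpy as np
import pandas as pd

target_df = pd.DataFrame({"pl": range(1000)})
sample_num = 5
sample_len = 10

rng = np.random.default_rng(0)  # the seed is an arbitrary choice
# permutation returns a shuffled copy, so target_df itself is left untouched
shuffled = rng.permutation(target_df["pl"].to_numpy())

# Take the first sample_num * sample_len values and lay them out one sample per column
block = shuffled[: sample_num * sample_len].reshape(sample_num, sample_len).T
sampled_df = pd.DataFrame(block, columns=[str(i) for i in range(1, sample_num + 1)])
sampled_df.index += 1
print(sampled_df)
```

Because all values come from one permutation, no value appears in more than one sample.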

Create list of unique values in pandas dataframe -- error: TypeError: '<' not supported between instances of 'int' and 'str'

I have a dataset that looks like this df:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'Name': ['a','b','c','d'],
                        '1/1/2001': ['1/1/2015',0,0,'1/1/2015'],
                        '2/20/2002': ['2/20/2002','2/20/2002','2/20/2002',0],
                        '3/15/2015': [0,0,0,'3/15/2015']}); df
df[df == 0] = np.nan
col = ['1/1/2001','2/20/2002','3/15/2015']
df.loc[:,col] = df.loc[:,col].bfill(axis=1)
df = df.fillna(value=0)
df
Name 1/1/2001 2/20/2002 3/15/2015
0 a 1/1/2015 2/20/2002 0
1 b 2/20/2002 2/20/2002 0
2 c 2/20/2002 2/20/2002 0
3 d 1/1/2015 3/15/2015 3/15/2015
And I want to return a dataframe that just has the unique values per row, so it could look like:
Name x_ x_2
0 a 1/1/2015 2/20/2002
1 b 2/20/2002 0
2 c 2/20/2002 0
3 d 1/1/2015 3/15/2015
But when I try to groupby with the following code:
df.groupby(['Name'])[col].apply(lambda x: list(np.unique(x)))
I get the long error:
TypeError Traceback (most recent call last)
<ipython-input-155-a3f3c8a3e6e5> in <module>
14 df
15
---> 16 df.groupby(['Name'])[col].apply(lambda x: list(np.unique(x)))
17
18
~/miniconda3/envs/planting/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
735
736 with _group_selection_context(self):
--> 737 return self._python_apply_general(f)
738
739 return result
~/miniconda3/envs/planting/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
740
741 def _python_apply_general(self, f):
--> 742 keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
743
744 return self._wrap_applied_output(
~/miniconda3/envs/planting/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
235 # group might be modified
236 group_axes = _get_axes(group)
--> 237 res = f(group)
238 if not _is_indexed_like(res, group_axes):
239 mutated = True
<ipython-input-155-a3f3c8a3e6e5> in <lambda>(x)
14 df
15
---> 16 df.groupby(['Name'])[col].apply(lambda x: list(np.unique(x)))
17
18
<__array_function__ internals> in unique(*args, **kwargs)
~/miniconda3/envs/planting/lib/python3.7/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
260 ar = np.asanyarray(ar)
261 if axis is None:
--> 262 ret = _unique1d(ar, return_index, return_inverse, return_counts)
263 return _unpack_tuple(ret)
264
~/miniconda3/envs/planting/lib/python3.7/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
308 aux = ar[perm]
309 else:
--> 310 ar.sort()
311 aux = ar
312 mask = np.empty(aux.shape, dtype=np.bool_)
TypeError: '<' not supported between instances of 'int' and 'str'
Perhaps the error is related to the fact that the dates are strings. If it's helpful they could be converted to datetime objects.
This can be done by melt, then pivot
s=df.mask(df==0).melt('Name').drop_duplicates(['Name','value']).dropna()
s['row']=s.groupby('Name').cumcount()+1
s.pivot(index='Name',columns='row',values='value')
Out[76]:
row 1 2
Name
a 1/1/2015 2/20/2002
b 2/20/2002 NaN
c 2/20/2002 NaN
d 1/1/2015 3/15/2015
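
As the traceback shows, np.unique sorts its input (ar.sort()), and Python 3 cannot order the integer 0 against the date strings, hence the TypeError. A minimal sketch that deduplicates each row without sorting is given below; it uses dict.fromkeys to keep first-seen order, and the output column prefix x_ is only borrowed from the question's desired layout:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "1/1/2001": ["1/1/2015", 0, 0, "1/1/2015"],
    "2/20/2002": ["2/20/2002", "2/20/2002", "2/20/2002", 0],
    "3/15/2015": [0, 0, 0, "3/15/2015"],
})
col = ["1/1/2001", "2/20/2002", "3/15/2015"]

# For each row: drop the 0 placeholders, then remove repeats in first-seen order.
# No comparison between int and str is ever made, so nothing needs to be sortable.
rows = [list(dict.fromkeys(v for v in vals if v != 0)) for vals in df[col].to_numpy()]

# Ragged rows are padded with missing values by the DataFrame constructor
out = pd.DataFrame(rows, index=df["Name"]).add_prefix("x_")
print(out)
```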
how about:
df.T.drop_duplicates(keep='first').T
output:
1/1/2001 2/20/2002 3/15/2015 Name
0 1/1/2015 2/20/2002 0 a
1 0 2/20/2002 0 b
2 0 2/20/2002 0 c
3 1/1/2015 0 3/15/2015 d
EDIT:
This solution is referring to the first version of the question, little needs to be done to apply it to the last version.

How to split several columns' data in pandas?

I have a dataframe which looks like this:
df = pd.DataFrame({'hard': [['525', '21']], 'soft': [['1525', '221']], 'set': [['5245', '271']], 'purch': [['925', '201']], \
'mont': [['555', '621']], 'gest': [['536', '251']], 'memo': [['825', '241']], 'raw': [['532', '210']]})
df
Out:
gest hard memo mont purch raw set soft
0 [536, 251] [525, 21] [825, 241] [555, 621] [925, 201] [532, 210] [5245, 271] [1525, 221]
I need to split all of the columns like this:
df1 = pd.DataFrame()
df1['gest_pos'] = df.gest.str[0].astype(int)
df1['gest_size'] = df.gest.str[1].astype(int)
df1['hard_pos'] = df.hard.str[0].astype(int)
df1['hard_size'] = df.hard.str[1].astype(int)
df1
gest_pos gest_size hard_pos hard_size
0 536 251 525 21
I have more than 70 columns, and my method takes a lot of space and time. Is there an easier way to do this job?
Thanks!
Different approach:
df2 = pd.DataFrame()
for column in df:
    df2['{}_pos'.format(column)] = df[column].str[0].astype(int)
    df2['{}_size'.format(column)] = df[column].str[1].astype(int)
print(df2)
You can use nested list comprehension with flattening and then create new DataFrame by constructor:
L = [[y for x in z for y in x] for z in df.values.tolist()]
#if want filter first 2 values per each list
#L = [[y for x in z for y in x[:2]] for z in df.values.tolist()]
#https://stackoverflow.com/a/45122198/2901002
def mygen(lst):
    for item in lst:
        yield item + '_pos'
        yield item + '_size'
df = pd.DataFrame(L, columns = list(mygen(df.columns))).astype(int)
print (df)
hard_pos hard_size soft_pos soft_size set_pos set_size purch_pos purch_size \
0 525 21 1525 221 5245 271 925 201
mont_pos mont_size gest_pos gest_size memo_pos memo_size raw_pos raw_size
0 555 621 536 251 825 241 532 210
You can use NumPy operations to construct your list of columns and flatten out your series of lists:
import numpy as np
from itertools import chain
# create column label array
cols = np.repeat(df.columns, 2).values
cols[::2] += '_pos'
cols[1::2] += '_size'
# create data array
arr = np.array([list(chain.from_iterable(i)) for i in df.values]).astype(int)
# combine with pd.DataFrame constructor
res = pd.DataFrame(arr, columns=cols)
Result:
print(res)
gest_pos gest_size hard_pos hard_size memo_pos memo_size mont_pos \
0 536 251 525 21 825 241 555
mont_size purch_pos purch_size raw_pos raw_size set_pos set_size \
0 621 925 201 532 210 5245 271
soft_pos soft_size
0 1525 221
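
Another way to express the same split, sketched here on a two-column subset of the question's frame, is to expand each list column into its own small DataFrame and join the pieces with pd.concat. The intermediate name parts and the sub-column labels pos and size are assumptions taken from the desired column names:

```python
import pandas as pd

# Two of the question's columns, enough to show the pattern
df = pd.DataFrame({"hard": [["525", "21"]], "soft": [["1525", "221"]]})

# Expand every list column into a (pos, size) pair of integer columns
parts = {
    c: pd.DataFrame(df[c].tolist(), index=df.index, columns=["pos", "size"]).astype(int)
    for c in df.columns
}

# Concatenating a dict side by side gives (column, part) MultiIndex labels; flatten them
res = pd.concat(parts, axis=1)
res.columns = [f"{c}_{s}" for c, s in res.columns]
print(res)
```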

Pandas Merge Error: MemoryError

Problem:
I'm trying to merge two relatively small datasets, but the merge raises a MemoryError. I have two datasets of aggregated country trade data that I'm trying to merge on the keys year and country, so the rows must line up on those keys. Unfortunately, this rules out concat and its performance benefits, as seen in the answer to this question: MemoryError on large merges with pandas in Python.
Here's the setup:
The attempted merge:
df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
Basic data structure:
i:
Year Reporter_Code Trade_Flow_Code Partner_Code Classification Commodity Code Quantity Unit Code Supplementary Quantity Netweight (kg) Value Estimation Code
0 2003 381 2 36 H2 070951 8 1274 1274 13810 0
1 2003 381 2 36 H2 070930 8 17150 17150 30626 0
2 2003 381 2 36 H2 0709 8 20493 20493 635840 0
3 2003 381 1 36 H2 0507 8 5200 5200 27619 0
4 2003 381 1 36 H2 050400 8 56439 56439 683104 0
df:
mporter cod CC ComTrade_CC Distance_miles
0 110 215 215 757 428.989
1 110 215 215 757 428.989
2 110 215 215 757 428.989
3 110 215 215 757 428.989
4 110 215 215 757 428.989
Error Traceback:
MemoryError Traceback (most recent call last)
<ipython-input-10-8d6e9fb45de6> in <module>()
1 for i in c_list:
----> 2 df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_index=right_index, sort=sort, suffixes=suffixes,
37 copy=copy)
---> 38 return op.get_result()
39 if __debug__:
40 merge.__doc__ = _merge_doc % '\nleft : DataFrame'
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
193 copy=self.copy)
194
--> 195 result_data = join_op.get_result()
196 result = DataFrame(result_data)
197
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
693 if klass in mapping:
694 klass_blocks.extend((unit, b) for b in mapping[klass])
--> 695 res_blk = self._get_merged_block(klass_blocks)
696
697 # if we have a unique result index, need to clear the _ref_locs
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge)
706 def _get_merged_block(self, to_merge):
707 if len(to_merge) > 1:
--> 708 return self._merge_blocks(to_merge)
709 else:
710 unit, block = to_merge[0]
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks)
728 # Should use Fortran order??
729 block_dtype = _get_block_dtype([x[1] for x in merge_chunks])
--> 730 out = np.empty(out_shape, dtype=block_dtype)
731
732 sofar = 0
MemoryError:
Thanks for your thoughts!
In case anyone coming across this question still has similar trouble with merge, you can probably get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (i.e. df = df.set_index(['A','B'])), and then using concat to join them.
UPDATE
Example:
df1 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'C':[3, 4]})
df2 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'D':[7, 8]})
both = pd.concat([df1.set_index(['A','B']), df2.set_index(['A','B'])], axis=1).reset_index()
df1
A B C
0 1 2 3
1 2 3 4
df2
A B D
0 1 2 7
1 2 3 8
both
A B C D
0 1 2 3 7
1 2 3 4 8
I haven't benchmarked the performance of this approach, but it didn't get the memory error and worked for my applications.
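
One possible cause of a MemoryError when merging small frames is repeated key combinations on both sides, because every matching pair of rows appears in the result. The sketch below estimates the merged row count before merging; the frames, the column names cc and Partner, and the helper variables are all made up for illustration and are not the question's data:

```python
import pandas as pd

# Hypothetical frames where the same key combination repeats on both sides
left = pd.DataFrame({"year": [2003] * 4, "cc": [36] * 4, "dist": [1.0] * 4})
right = pd.DataFrame({"Year": [2003] * 3, "Partner": [36] * 3, "value": [10, 20, 30]})

# Count how often each key combination occurs on each side
left_counts = left.groupby(["year", "cc"]).size()
right_counts = right.groupby(["Year", "Partner"]).size()
right_counts.index.names = left_counts.index.names  # align the key names

# An inner merge yields (left count * right count) rows per shared key
expected_rows = int((left_counts * right_counts).dropna().sum())
merged = left.merge(right, left_on=["year", "cc"], right_on=["Year", "Partner"])
print(expected_rows, len(merged))
```

If the estimate is far larger than either input, deduplicating or aggregating one side first may help; merge also accepts a validate argument (for example validate="many_to_one") that raises an error when the keys are not unique where expected.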
