I have a dataframe which looks like this:
df = pd.DataFrame({'hard': [['525', '21']], 'soft': [['1525', '221']], 'set': [['5245', '271']], 'purch': [['925', '201']],
                   'mont': [['555', '621']], 'gest': [['536', '251']], 'memo': [['825', '241']], 'raw': [['532', '210']]})
df
Out:
gest hard memo mont purch raw set soft
0 [536, 251] [525, 21] [825, 241] [555, 621] [925, 201] [532, 210] [5245, 271] [1525, 221]
I need to split all of the columns like this:
df1 = pd.DataFrame()
df1['gest_pos'] = df.gest.str[0].astype(int)
df1['gest_size'] = df.gest.str[1].astype(int)
df1['hard_pos'] = df.hard.str[0].astype(int)
df1['hard_size'] = df.hard.str[1].astype(int)
df1
gest_pos gest_size hard_pos hard_size
0 536 251 525 21
I have more than 70 columns, and my method takes a lot of space and time. Is there an easier way to do this?
Thanks!
Different approach:
df2 = pd.DataFrame()
for column in df:
    df2['{}_pos'.format(column)] = df[column].str[0].astype(int)
    df2['{}_size'.format(column)] = df[column].str[1].astype(int)
print(df2)
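The same loop can also be condensed into a single pd.concat over a dict comprehension. A minimal sketch with the same semantics as the loop above (note that very old pandas versions may sort the dict keys instead of preserving insertion order):
import pandas as pd

# One new column per (source column, part) pair; the dict keys become the column names
df2 = pd.concat(
    {f'{col}_{part}': df[col].str[i].astype(int)
     for col in df.columns
     for i, part in enumerate(('pos', 'size'))},
    axis=1
)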
You can use a nested list comprehension with flattening and then create a new DataFrame with the constructor:
L = [[y for x in z for y in x] for z in df.values.tolist()]
#if want filter first 2 values per each list
#L = [[y for x in z for y in x[:2]] for z in df.values.tolist()]
#https://stackoverflow.com/a/45122198/2901002
def mygen(lst):
    for item in lst:
        yield item + '_pos'
        yield item + '_size'

df = pd.DataFrame(L, columns=list(mygen(df.columns))).astype(int)
print(df)
hard_pos hard_size soft_pos soft_size set_pos set_size purch_pos purch_size \
0 525 21 1525 221 5245 271 925 201
mont_pos mont_size gest_pos gest_size memo_pos memo_size raw_pos raw_size
0 555 621 536 251 825 241 532 210
You can use NumPy operations to construct your list of columns and flatten out your series of lists:
import numpy as np
from itertools import chain
# create column label array
cols = np.repeat(df.columns, 2).values
cols[::2] += '_pos'
cols[1::2] += '_size'
# create data array
arr = np.array([list(chain.from_iterable(i)) for i in df.values]).astype(int)
# combine with pd.DataFrame constructor
res = pd.DataFrame(arr, columns=cols)
Result:
print(res)
gest_pos gest_size hard_pos hard_size memo_pos memo_size mont_pos \
0 536 251 525 21 825 241 555
mont_size purch_pos purch_size raw_pos raw_size set_pos set_size \
0 621 925 201 532 210 5245 271
soft_pos soft_size
0 1525 221
I'm trying to transform my data from one shape to another (lm stands for "last month"). Hopefully this makes sense; here is how I have it:
import pandas as pd

df = pd.read_excel('data.xlsx')  # reading data
output = []
grouped = df.groupby('txn_id')
for txn_id, group in grouped:
    avg_amt = group['avg_amount'].iloc[-1]
    min_amt = group['min_amount'].iloc[-1]
    lm_avg = group['avg_amount'].iloc[-6:-1]
    min_amt_list = group['min_amount'].iloc[-6:-1]
    output.append([txn_id, *lm_avg, min_amt, *min_amt_list])

# I am getting multiple rows for one txn_id, which is not expected
result_df = pd.DataFrame(output, columns=['txn_id', 'lm_avg', 'lm_avg-1', 'lm_avg-2', 'lm_avg-3', 'lm_avg-4', 'lm_avg-5', 'min_am', 'min_amt-1', 'min_amt-2', 'min_amt-3', 'min_amt-4', 'min_amt-5'])
Use pivot_table:
# Number the rows within each TXN_ID from last to first, so the most recent row gets '0'
cols = df[::-1].groupby('TXN_ID').cumcount().astype(str)
# Rename columns before reshaping the dataframe with pivot_table
out = (df.rename(columns={'AVG_Amount': 'lm_avg', 'MIN_AMOUNT': 'min_amnt'})
.pivot_table(index='TXN_ID', values=['lm_avg', 'min_amnt'], columns=cols))
# Flatten the column names
out.columns = ['-'.join(i) if i[1] != '0' else i[0] for i in out.columns.to_flat_index()]
# Reset index
out = out.reset_index()
Output:
>>> out
TXN_ID lm_avg lm_avg-1 lm_avg-2 lm_avg-3 lm_avg-4 lm_avg-5 min_amnt min_amnt-1 min_amnt-2 min_amnt-3 min_amnt-4 min_amnt-5
0 1 578 688 589 877 556 78 400 31 20 500 300 30
1 2 578 688 589 877 556 78 400 31 20 0 0 90
I have three data frames (each the result of .mean()) like this:
A 533.9
B 691.9
C 611.5
D 557.8
I want to concatenate them into three columns like this:
all X Y
A 533.9 558.0 509.8
B 691.9 613.2 770.6
C 611.5 618.4 604.6
D 557.8 591.0 524.6
My MWE below does work, but I wonder if I can use pd.crosstab() or another fancier, easier pandas function for this.
The initial data frame:
group A B C D
0 X 844 908 310 477
1 X 757 504 729 865
2 X 420 281 898 260
3 X 258 755 683 805
4 X 511 618 472 548
5 Y 404 250 100 14
6 Y 783 909 434 719
7 Y 303 982 610 398
8 Y 476 810 913 824
9 Y 583 902 966 668
And this is the MWE using a dict and pandas.concat() to solve the problem.
#!/usr/bin/env python3
import random as rd
import pandas as pd
import statistics
rd.seed(0)
df = pd.DataFrame({
    'group': ['X'] * 5 + ['Y'] * 5,
    'A': rd.choices(range(1000), k=10),
    'B': rd.choices(range(1000), k=10),
    'C': rd.choices(range(1000), k=10),
    'D': rd.choices(range(1000), k=10),
})
cols = list('ABCD')
result = {
    'all': df.loc[:, cols].mean(),
    'X': df.loc[df.group.eq('X'), cols].mean(),
    'Y': df.loc[df.group.eq('Y'), cols].mean()
}
tab = pd.concat(result, axis=1)
print(tab)
You can do this with melt then pivot_table:
out = df.melt('group').pivot_table(
    index='variable',
    columns='group',
    values='value',
    aggfunc='mean',
    margins=True).drop(['All'])
Out[207]:
group X Y All
variable
A 558.0 509.8 533.9
B 613.2 770.6 691.9
C 618.4 604.6 611.5
D 591.0 524.6 557.8
Another solution (note that the simple average of X and Y equals the overall mean only because both groups have the same size):
res = df.groupby('group').mean().T
res['all'] = (res.X + res.Y) / 2
print(res)
Output
group X Y all
A 558.0 509.8 533.9
B 613.2 770.6 691.9
C 618.4 604.6 611.5
D 591.0 524.6 557.8
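Since the question explicitly asks about crosstab: pd.crosstab also accepts values, aggfunc, and margins, so the melt-based answer above can be rewritten with it. A sketch (keeping the margins column as 'all' and dropping the margins row):
long = df.melt('group')
out = pd.crosstab(index=long['variable'], columns=long['group'],
                  values=long['value'], aggfunc='mean',
                  margins=True, margins_name='all').drop('all')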
I have two pandas dataframes. The first one contains some data that I want to multiply with the second dataframe, which is a reference table.
So in my example I want a new column in df1 for every column in my reference table, where each row is the sum of the products of the df1 values and the reference values.
Like this (Index 205368421 with R21 17): (1205 * 0.526499) + (7562 * 0.003115) + (1332 * 0.000267) ≈ 658
In Excel VBA I iterated through both tables, but it took very long. I've read that pandas handles this much better without iterating.
df1 = pd.DataFrame({'Index': ['205368421', '206321177', '202574796', '200212811', '204376114'],
                    'L1.09A': [1205, 1253, 1852, 1452, 1653],
                    'L1.10A': [7562, 7400, 5700, 4586, 4393],
                    'L1.10C': [1332, 0, 700, 1180, 290]})
df2 = pd.DataFrame({'WorkerID': ['L1.09A', 'L1.10A', 'L1.10C'],
                    'R21 17': [0.526499, 0.003115, 0.000267],
                    'R21 26': [0.458956, 0, 0.001819]})
Index L1.09A L1.10A L1.10C
205368421 1205 7562 1332
206321177 1253 7400 0
202574796 1852 5700 700
200212811 1452 4586 1180
204376114 1653 4393 290
WorkerID R21 17 R21 26
L1.09A 0.526499 0.458956
L1.10A 0.003115 0
L1.10C 0.000267 0.001819
I want this:
Index L1.09A L1.10A L1.10C R21 17 R21 26
205368421 1205 7562 1332 658 555
206321177 1253 7400 0 683 575
202574796 1852 5700 700 993 851
200212811 1452 4586 1180 779 669
204376114 1653 4393 290 884 759
I would be okay with some hints. Someone told me this might be matrix multiplication, so .dot() could be helpful. Is this the right direction?
Edit:
I have now done the following:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df_multiplied = df1_sorted @ df2_sorted
This works with my example dataframes, but not with my real dataframes.
My real ones have these dimensions: df1_sorted (10429, 69) and df2_sorted (69, 18).
It should work, but my df_multiplied is full of NaN.
Alright, I did it!
I had to replace all NaN with 0.
So the final solution is:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df1_sorted = df1_sorted.fillna(0)
df2_sorted = df2_sorted.fillna(0)
df_multiplied = df1_sorted @ df2_sorted
I have a pandas dataframe with data for all 24 hours of the day for a whole month, with the following fields:
(df1): date, hour, mid, rid, percentage, total
I need to create a second dataframe from it with the following fields:
(df2): date, hour, mid, rid, hour_total
Here hour_total is to be calculated as follows:
If, for a combination of (date, mid, rid) in df1, the count of records where df1.percentage is 0 is 24, then hour_total = df1.total / 24; otherwise hour_total = (df1.percentage / 100) * df1.total.
For example, if dataframe 1 is as below (the count of records for the group (date, mid, rid) where perc is 0 is 24):
date,hour,mid,rid,perc,total
2019-10-31,0,2,0,0,3170.87
2019-10-31,1,2,0,0,3170.87
2019-10-31,2,2,0,0,3170.87
2019-10-31,3,2,0,0,3170.87
2019-10-31,4,2,0,0,3170.87
.
.
2019-10-31,23,2,0,0,3170.87
Then dataframe 2 should be: (hour_total = df1.total/24)
date,hour,mid,rid,hour_total
2019-10-31,0,2,0,132.12
2019-10-31,1,2,0,132.12
2019-10-31,2,2,0,132.12
2019-10-31,3,2,0,132.12
2019-10-31,4,2,0,132.12
.
.
2019-10-31,23,2,0,132.12
How can I accomplish this?
You can try the apply function.
For example:
import numpy as np
import pandas as pd
from datetime import datetime

a = np.random.randint(100, 200, size=5)
b = np.random.randint(100, 200, size=5)
c = [datetime.now() for x in range(100) if x % 20 == 0]
df1 = pd.DataFrame({'Time': c, 'A': a, 'B': b})
The above data frame looks like this:
Time A B
0 2019-10-24 20:37:38.907058 158 190
1 2019-10-24 20:37:38.907058 161 127
2 2019-10-24 20:37:38.908056 100 100
3 2019-10-24 20:37:38.908056 163 164
4 2019-10-24 20:37:38.908056 121 159
Now suppose we want to compute a new column whose value depends on the values of the other columns.
You can define a function which does this computation:
def func(x):
    # with axis=1, x is one row of the frame (a Series)
    t = x['Time']  # unused in this example
    a = x['A']
    b = x['B']
    return a + b
And apply this function to the data frame:
df1["new_col"] = df1.apply(func, axis=1)
Which would yield the following result.
Time A B new_col
0 2019-10-24 20:37:38.907058 158 190 348
1 2019-10-24 20:37:38.907058 161 127 288
2 2019-10-24 20:37:38.908056 100 100 200
3 2019-10-24 20:37:38.908056 163 164 327
4 2019-10-24 20:37:38.908056 121 159 280
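Applied to the original question, though, the condition depends on the whole (date, mid, rid) group rather than on a single row, so a groupby/transform is a better fit than a row-wise apply. A minimal sketch, assuming df1 has the columns date, hour, mid, rid, percentage, total described above:
import numpy as np

# Count, per (date, mid, rid) group, how many records have percentage == 0
zero_count = (df1['percentage'].eq(0)
                  .groupby([df1['date'], df1['mid'], df1['rid']])
                  .transform('sum'))

df2 = df1[['date', 'hour', 'mid', 'rid']].copy()
df2['hour_total'] = np.where(zero_count.eq(24),
                             df1['total'] / 24,
                             df1['percentage'] / 100 * df1['total'])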
Based on a selection ds of a dataframe d with:
{'x': d.x, 'y': d.y, 'a': d.a, 'b': d.b, 'c': d.c, 'n': d.n}
It has n rows, with an index ranging from 0 to n-1. The column n is needed because ds is a selection and the original row indices must be kept for a later query.
How do you efficiently compute the difference between each pair of rows (e.g. a_0, a_1, etc.) for each column (a, b, c) without losing the row information (e.g. a new column with the indices of the rows that were used)?
MWE
Sample selection ds:
x y a b c n
554.607085 400.971878 9789 4151 6837 146
512.231450 405.469524 8796 3811 6596 225
570.427284 694.369140 1608 2019 2097 291
Desired output:
dist: the euclidean distance, math.hypot(x2 - x1, y2 - y1)
da, db, dc: e.g. for da: np.abs(a1 - a2)
ns: a string with both n values of the rows used
The result would look like:
dist da db dc ns
42.61365102824963 993 340 241 146-225
293.82347069813255 8181 2132 4740 146-291
.. .. .. .. 225-291
You can use itertools.combinations() to generate the pairs:
Read data first:
import pandas as pd
from io import StringIO
import numpy as np
text = """ x y a b c n
554.607085 400.971878 9789 4151 6837 146
512.231450 405.469524 8796 3811 6596 225
570.427284 694.369140 1608 2019 2097 291"""
df = pd.read_csv(StringIO(text), delim_whitespace=True)
Create the index and calculate the results:
from itertools import combinations
index = np.array(list(combinations(range(df.shape[0]), 2)))
df1, df2 = [df.iloc[idx].reset_index(drop=True) for idx in index.T]
res = pd.concat([
    np.hypot(df1.x - df2.x, df1.y - df2.y),
    df1[["a", "b", "c"]] - df2[["a", "b", "c"]],
    df1.n.astype(str) + "-" + df2.n.astype(str)
], axis=1)
res.columns = ["dist", "da", "db", "dc", "ns"]
res
The output:
dist da db dc ns
0 42.613651 993 340 241 146-225
1 293.823471 8181 2132 4740 146-291
2 294.702805 7188 1792 4499 225-291
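One detail versus the spec in the question: da was defined as np.abs(a1 - a2), while the concat above keeps signed differences (they just happen to be positive here). If you need absolute values, take .abs() on that block:
(df1[["a", "b", "c"]] - df2[["a", "b", "c"]]).abs()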
This approach makes good use of Pandas and the underlying numpy capabilities, but the matrix manipulations are a little hard to keep track of:
import pandas as pd, numpy as np
ds = pd.DataFrame(
    [
        [554.607085, 400.971878, 9789, 4151, 6837, 146],
        [512.231450, 405.469524, 8796, 3811, 6596, 225],
        [570.427284, 694.369140, 1608, 2019, 2097, 291]
    ],
    columns=['x', 'y', 'a', 'b', 'c', 'n']
)
def concat_str(*arrays):
    result = arrays[0]
    for arr in arrays[1:]:
        result = np.core.defchararray.add(result, arr)
    return result
# Make a panel with one item for each column, with a square data frame for
# each item, showing the differences between all row pairs.
# This creates perpendicular matrices of values based on the underlying numpy arrays;
# then numpy broadcasts them along the missing axis when calculating the differences
p = pd.Panel(
    (ds.values[np.newaxis, :, :] - ds.values[:, np.newaxis, :]).transpose(),
    items=['d' + c for c in ds.columns], major_axis=ds.index, minor_axis=ds.index
)
# calculate euclidian distance
p['dist'] = np.hypot(p['dx'], p['dy'])
# create strings showing row relationships
p['ns'] = concat_str(ds['n'].values.astype(str)[:,np.newaxis], '-', ds['n'].values.astype(str)[np.newaxis,:])
# remove unneeded items
del p['dx'], p['dy'], p['dn']
# convert to frame
diffs = p.to_frame().reindex_axis(['dist', 'da', 'db', 'dc', 'ns'], axis=1)
diffs
This gives:
dist da db dc ns
major minor
0 0 0.000000 0 0 0 146-146
1 42.613651 993 340 241 146-225
2 293.823471 8181 2132 4740 146-291
1 0 42.613651 -993 -340 -241 225-146
1 0.000000 0 0 0 225-225
2 294.702805 7188 1792 4499 225-291
2 0 293.823471 -8181 -2132 -4740 291-146
1 294.702805 -7188 -1792 -4499 291-225
2 0.000000 0 0 0 291-291
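Note that pd.Panel was removed in pandas 1.0, so the code above only runs on old versions. The same all-pairs frame can be built on current pandas with broadcasting plus a MultiIndex; a sketch reusing the ds defined above:
import numpy as np
import pandas as pd

# Pairwise differences via broadcasting: entry [i, j, k] is ds.iloc[i, k] - ds.iloc[j, k]
diff = ds.values[:, np.newaxis, :] - ds.values[np.newaxis, :, :]

idx = pd.MultiIndex.from_product([ds.index, ds.index], names=['major', 'minor'])
out = pd.DataFrame(diff.reshape(-1, ds.shape[1]), index=idx,
                   columns=['d' + c for c in ds.columns])

out['dist'] = np.hypot(out['dx'], out['dy'])
ns = ds['n'].astype(str).values
out['ns'] = (ns[:, None] + '-' + ns[None, :]).ravel()

# Keep the same columns as the Panel-based result, with integer differences
out = out[['dist', 'da', 'db', 'dc', 'ns']]
out[['da', 'db', 'dc']] = out[['da', 'db', 'dc']].astype(int)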