Python Data Transformation (EDA) in pandas

I'm trying to transform my data ("lm" stands for "last month"; hopefully this makes sense). Here is how I have it:
import pandas as pd

df = pd.read_excel('data.xlsx')  # read the data
output = []
grouped = df.groupby('txn_id')
for txn_id, group in grouped:
    avg_amt = group['avg_amount'].iloc[-1]
    min_amt = group['min_amount'].iloc[-1]
    lm_avg = group['avg_amount'].iloc[-6:-1]
    min_amt_list = group['min_amount'].iloc[-6:-1]
    # latest values included so the row length matches the 13 column names
    output.append([txn_id, avg_amt, *lm_avg, min_amt, *min_amt_list])
result_df = pd.DataFrame(output, columns=['txn_id', 'lm_avg', 'lm_avg-1', 'lm_avg-2',
                                          'lm_avg-3', 'lm_avg-4', 'lm_avg-5', 'min_amt',
                                          'min_amt-1', 'min_amt-2', 'min_amt-3',
                                          'min_amt-4', 'min_amt-5'])
# I am getting multiple rows for one txn_id, which is not expected.

Use pivot_table:
# Number rows within each TXN_ID from most recent (0) to oldest,
# rename the columns, then reshape with pivot_table
cols = df[::-1].groupby('TXN_ID').cumcount().astype(str)
out = (df.rename(columns={'AVG_Amount': 'lm_avg', 'MIN_AMOUNT': 'min_amnt'})
         .pivot_table(index='TXN_ID', values=['lm_avg', 'min_amnt'], columns=cols))
# Flatten the column MultiIndex
out.columns = ['-'.join(i) if i[1] != '0' else i[0] for i in out.columns.to_flat_index()]
# Reset the index
out = out.reset_index()
Output:
>>> out
TXN_ID lm_avg lm_avg-1 lm_avg-2 lm_avg-3 lm_avg-4 lm_avg-5 min_amnt min_amnt-1 min_amnt-2 min_amnt-3 min_amnt-4 min_amnt-5
0 1 578 688 589 877 556 78 400 31 20 500 300 30
1 2 578 688 589 877 556 78 400 31 20 0 0 90
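A minimal, self-contained sketch of the same cumcount + pivot_table idea on toy data. The column names and the row order (oldest first, most recent last within each TXN_ID) are assumptions, since the question never shows the raw input:

```python
import pandas as pd

# Assumed toy input: two TXN_IDs, three monthly rows each,
# oldest first and most recent last within each group
df = pd.DataFrame({
    'TXN_ID': [1, 1, 1, 2, 2, 2],
    'AVG_Amount': [589, 688, 578, 877, 556, 78],
    'MIN_AMOUNT': [20, 31, 400, 500, 300, 30],
})

# Walking the frame in reverse, cumcount numbers rows within each
# TXN_ID from most recent (0) to oldest
cols = df[::-1].groupby('TXN_ID').cumcount().astype(str)

out = (df.rename(columns={'AVG_Amount': 'lm_avg', 'MIN_AMOUNT': 'min_amnt'})
         .pivot_table(index='TXN_ID', values=['lm_avg', 'min_amnt'], columns=cols))

# Flatten the (name, position) column MultiIndex: 'lm_avg', 'lm_avg-1', ...
out.columns = ['-'.join(i) if i[1] != '0' else i[0]
               for i in out.columns.to_flat_index()]
out = out.reset_index()
print(out)
```

One row comes out per TXN_ID, with the most recent month in the unsuffixed columns and older months in the `-1`, `-2`, ... columns.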

Related

How to calculate value of column in dataframe based on value/count of other columns in the dataframe in python?

I have a pandas dataframe which has data of 24 hours of the day for a whole month with the following fields:
(df1):- date,hour,mid,rid,percentage,total
I need to create 2nd dataframe using this dataframe with the following fields:
(df2) :- date, hour,mid,rid,hour_total
Here hour_total is to be calculated as below:
If, for a combination of (date, mid, rid) from dataframe 1, the count of records where df1.percentage is 0 equals 24, then hour_total = df1.total / 24; otherwise hour_total = (df1.percentage / 100) * df1.total.
For example, if dataframe 1 is as below (the count of records for the group of (date, mid, rid) where perc is 0 is 24):
date,hour,mid,rid,perc,total
2019-10-31,0,2, 0,0,3170.87
2019-10-31,1,2,0,0,3170.87
2019-10-31,2,2,0,0,3170.87
2019-10-31,3,2,0,0,3170.87
2019-10-31,4,2,0,0,3170.87
.
.
2019-10-31,23,2,0,0,3170.87
Then dataframe 2 should be: (hour_total = df1.total/24)
date,hour,mid,rid,hour_total
2019-10-31,0,2,0,132.12
2019-10-31,1,4,0,132.12
2019-10-31,2,13,0,132.12
2019-10-31,3,17,0,132.12
2019-10-31,4,7,0,132.12
.
.
2019-10-31,23,27,0,132.12
How can I accomplish this?
You can try the apply function.
For example:
import numpy as np
import pandas as pd
from datetime import datetime

a = np.random.randint(100, 200, size=5)
b = np.random.randint(100, 200, size=5)
c = [datetime.now() for x in range(100) if x % 20 == 0]
df1 = pd.DataFrame({'Time': c, 'A': a, 'B': b})
The above data frame looks like this:
Time A B
0 2019-10-24 20:37:38.907058 158 190
1 2019-10-24 20:37:38.907058 161 127
2 2019-10-24 20:37:38.908056 100 100
3 2019-10-24 20:37:38.908056 163 164
4 2019-10-24 20:37:38.908056 121 159
Now suppose we want to compute a new column whose value depends on the values of the other columns.
You can define a function which does this computation:
def func(x):
    t = x[0]  # Time (unused here)
    a = x[1]  # A
    b = x[2]  # B
    return a + b
And apply this function to the data frame
df1["new_col"] = df1.apply(func, axis=1)
Which would yield the following result.
Time A B new_col
0 2019-10-24 20:37:38.907058 158 190 348
1 2019-10-24 20:37:38.907058 161 127 288
2 2019-10-24 20:37:38.908056 100 100 200
3 2019-10-24 20:37:38.908056 163 164 327
4 2019-10-24 20:37:38.908056 121 159 280
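The example above only shows the mechanics of apply. For the hour_total question itself, a vectorized sketch with groupby().transform is possible; column names are taken from the question, the toy input is made up, and "count of zero-percentage rows is 24" is simplified to "every row in the group has percentage 0":

```python
import numpy as np
import pandas as pd

# Assumed toy input: two (date, mid, rid) groups; in the first every row
# has percentage == 0, in the second the percentages are non-zero
df1 = pd.DataFrame({
    'date': ['2019-10-31'] * 4,
    'hour': [0, 1, 0, 1],
    'mid': [2, 2, 3, 3],
    'rid': [0, 0, 0, 0],
    'percentage': [0, 0, 50, 25],
    'total': [3170.87, 3170.87, 200.0, 200.0],
})

# True for every row of a group whose percentage column is all zeros
all_zero = (df1.groupby(['date', 'mid', 'rid'])['percentage']
               .transform(lambda s: (s == 0).all()))

df1['hour_total'] = np.where(all_zero,
                             df1['total'] / 24,
                             df1['percentage'] / 100 * df1['total'])
df2 = df1[['date', 'hour', 'mid', 'rid', 'hour_total']]
print(df2)
```

transform broadcasts the per-group boolean back to every row, so np.where can pick the formula row by row without an explicit loop.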

Entering values at each index of dataframe

I have a pandas dataframe in which I am storing information about different objects in a video.
For each frame of the video I'm saving the positions of the objects in a dataframe with columns 'x', 'y' and 'particle', with the frame number in the index:
x y particle
frame
0 588 840 0
0 260 598 1
0 297 1245 2
0 303 409 3
0 307 517 4
This works fine but I want to save information about each frame of the video, e.g. the temperature at each frame.
I'm currently doing this by creating a series with the values for each frame, with the frame number in the index, and then adding the series to the dataframe:
prop = pd.Series(temperature_values,
                 index=pd.Index(np.arange(len(temperature_values)), name='frame'))
df['temperature'] = prop
This works but produces duplicates of the data in every row of the column:
x y particle temperature
frame
0 588 840 0 12
0 260 598 1 12
0 297 1245 2 12
0 303 409 3 12
0 307 517 4 12
Is there any way of saving this information without duplicates in the current dataframe, so that when I access the temperature column I just get back the original series I created?
If there isn't, my plan is either to deal with the duplicates using drop_duplicates, or to create a second dataframe with just the per-frame data, which I can then merge into my first dataframe; but I'd like to avoid that if possible.
Here is the current code, with Jupyter outputs formatted as best I can:
import pandas as pd
import numpy as np
df = pd.DataFrame()
frames = list(range(5))
for f in frames:
    x = np.random.randint(10, 100, size=10)
    y = np.random.randint(10, 100, size=10)
    particle = np.arange(10)
    data = {
        'x': x,
        'y': y,
        'particle': particle,
        'frame': f}
    df_to_append = pd.DataFrame(data)
    df = df.append(df_to_append)
print(df.head())
Output:
x y particle frame
0 61 97 0 0
1 49 73 1 0
2 48 72 2 0
3 59 37 3 0
4 39 64 4 0
Input
df = df.set_index('frame')
print(df.head())
Output
x y particle
frame
0 61 97 0
0 49 73 1
0 48 72 2
0 59 37 3
0 39 64 4
Input:
example_data = [10*f for f in frames]
# Current method
prop = pd.Series(example_data, index=pd.Index(np.arange(len(example_data)), name='frame'))
df['data1'] = prop
print(df.head())
print(df.tail())
Output:
x y particle data1
frame
0 61 97 0 0
0 49 73 1 0
0 48 72 2 0
0 59 37 3 0
0 39 64 4 0
x y particle data1
frame
4 25 93 5 40
4 28 17 6 40
4 39 15 7 40
4 28 47 8 40
4 12 56 9 40
Input:
# Proposed method
df['data2'] = example_data
Output:
ValueError Traceback (most recent call last)
<ipython-input-12-e41b12bbe1cd> in <module>
1 # Proposed method
----> 2 df['data2'] = example_data
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
3368 else:
3369 # set column
-> 3370 self._set_item(key, value)
3371
3372 def _setitem_slice(self, key, value):
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
3443
3444 self._ensure_valid_index(value)
-> 3445 value = self._sanitize_column(key, value)
3446 NDFrame._set_item(self, key, value)
3447
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3628
3629 # turn me into an ndarray
-> 3630 value = sanitize_index(value, self.index, copy=False)
3631 if not isinstance(value, (np.ndarray, Index)):
3632 if isinstance(value, list) and len(value) > 0:
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/internals/construction.py in sanitize_index(data, index, copy)
517
518 if len(data) != len(index):
--> 519 raise ValueError('Length of values does not match length of index')
520
521 if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
I am afraid you cannot: all columns in a DataFrame share the same index, so they are required to have the same length. Coming from the database world, I also try to avoid indexes with duplicate values as much as possible.
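A common workaround, sketched here with assumed names and toy values, is to keep the per-frame metadata in its own frame-indexed Series and broadcast it with join only when a combined view is actually needed; the original Series stays duplicate-free:

```python
import pandas as pd

# Per-particle data indexed by frame (duplicate frame labels expected here)
particles = pd.DataFrame({
    'frame': [0, 0, 1, 1],
    'x': [588, 260, 297, 303],
    'y': [840, 598, 1245, 409],
    'particle': [0, 1, 0, 1],
}).set_index('frame')

# Per-frame metadata: exactly one value per frame, no duplicates
frame_meta = pd.Series([12.0, 12.5],
                       index=pd.Index([0, 1], name='frame'),
                       name='temperature')

# Broadcast only on demand; duplicates appear in `combined`,
# never in `frame_meta` itself
combined = particles.join(frame_meta)
print(combined)
```

Reading the temperature back is then just `frame_meta`, one value per frame, rather than a deduplicated column.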

How to split several column's data in pandas?

I have a dataframe which looks like this:
df = pd.DataFrame({'hard': [['525', '21']], 'soft': [['1525', '221']], 'set': [['5245', '271']],
                   'purch': [['925', '201']], 'mont': [['555', '621']], 'gest': [['536', '251']],
                   'memo': [['825', '241']], 'raw': [['532', '210']]})
df
Out:
gest hard memo mont purch raw set soft
0 [536, 251] [525, 21] [825, 241] [555, 621] [925, 201] [532, 210] [5245, 271] [1525, 221]
I should split all of the columns like this:
df1 = pd.DataFrame()
df1['gest_pos'] = df.gest.str[0].astype(int)
df1['gest_size'] = df.gest.str[1].astype(int)
df1['hard_pos'] = df.hard.str[0].astype(int)
df1['hard_size'] = df.hard.str[1].astype(int)
df1
gest_pos gest_size hard_pos hard_size
0 536 251 525 21
I have more than 70 columns, and my method takes a lot of space and time. Is there an easier way to do this?
Thanks!
Different approach:
df2 = pd.DataFrame()
for column in df:
    df2['{}_pos'.format(column)] = df[column].str[0].astype(int)
    df2['{}_size'.format(column)] = df[column].str[1].astype(int)
print(df2)
You can use nested list comprehension with flattening and then create new DataFrame by constructor:
L = [[y for x in z for y in x] for z in df.values.tolist()]
#if want filter first 2 values per each list
#L = [[y for x in z for y in x[:2]] for z in df.values.tolist()]
#https://stackoverflow.com/a/45122198/2901002
def mygen(lst):
    for item in lst:
        yield item + '_pos'
        yield item + '_size'
df = pd.DataFrame(L, columns = list(mygen(df.columns))).astype(int)
print (df)
hard_pos hard_size soft_pos soft_size set_pos set_size purch_pos purch_size \
0 525 21 1525 221 5245 271 925 201
mont_pos mont_size gest_pos gest_size memo_pos memo_size raw_pos raw_size
0 555 621 536 251 825 241 532 210
You can use NumPy operations to construct your list of columns and flatten out your series of lists:
import numpy as np
from itertools import chain
# create column label array
cols = np.repeat(df.columns, 2).values
cols[::2] += '_pos'
cols[1::2] += '_size'
# create data array
arr = np.array([list(chain.from_iterable(i)) for i in df.values]).astype(int)
# combine with pd.DataFrame constructor
res = pd.DataFrame(arr, columns=cols)
Result:
print(res)
gest_pos gest_size hard_pos hard_size memo_pos memo_size mont_pos \
0 536 251 525 21 825 241 555
mont_size purch_pos purch_size raw_pos raw_size set_pos set_size \
0 621 925 201 532 210 5245 271
soft_pos soft_size
0 1525 221
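Yet another way to express the same reshaping is to build one small two-column frame per original column and concatenate them; a sketch on a reduced version of the assumed input:

```python
import pandas as pd

# Reduced toy input in the same shape as the question's dataframe
df = pd.DataFrame({'hard': [['525', '21']], 'soft': [['1525', '221']]})

# One two-column frame per original column, then glue them side by side
parts = [pd.DataFrame(df[col].tolist(), index=df.index,
                      columns=[col + '_pos', col + '_size']).astype(int)
         for col in df.columns]
res = pd.concat(parts, axis=1)
print(res)
```

This keeps the column order of the original frame, unlike approaches that sort the labels.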

Dataframe/Row Indexing for Pandas

I was wondering how I could index datasets so that a row number from df1 corresponds to a different row number in df2, e.g. row 1 in df1 = row 3 in df2.
What I would like (in this case: row 1 of 2011 = row 2 of 2016):
Rows 49:50 of 2011 (b1) are the same item as rows 51:52 of 2016 (bt), just with a different value in a different year; it is sliced differently because it sits in a different cell in 2016.
I've been using pd.concat and pd.Series but still no success.
# slicing 2011 data (total)
b1 = df1.iloc[49:50, 6:7]
m1 = df1.iloc[127:128, 6:7]
a1 = df1.iloc[84:85, 6:7]
data2011 = pd.concat([b1, m1, a1])
# slicing 2016 data (total)
bt = df2.iloc[51:52, 6:7]
mt = df2.iloc[129:130, 6:7]
at = df2.iloc[86:87, 6:7]
data2016 = pd.concat([bt, mt, at])
data20112016 = pd.concat([data2011, data2016])
print(data20112016)
Output I'm getting:
What I need to fix (in this case: row 49 = row 51, i.e. 11849 in the left column and 13500 in the right column):
49 11849
127 22622
84 13658
51 13500
129 25281
86 18594
I would like to do a bar graph comparing b1 2011 to bt 2016 and so on, meaning 49 = 51, 127 = 129, etc.
# Tot_x Tot_y
# 49=51 11849 13500
# 127=129 22622 25281
# 84=86 13658 18594
I hope this clears things up.
Thanks in advance.
If I understood your question correctly, here is solution using merge:
df1 = pd.DataFrame([9337, 2953, 8184], index=[49, 127, 84], columns=['Tot'])
df2 = pd.DataFrame([13500, 25281, 18594], index=[51, 129, 86], columns=['Tot'])
total_df = (df1.reset_index()
.merge(df2.reset_index(), left_index=True, right_index=True))
And here is the same using concat:
total_df = pd.concat([df1.reset_index(), df2.reset_index()], axis=1)
And here is how to label and draw the resulting barplot:
total_df.index = total_df['index_x'].astype(str) + '=' + total_df['index_y'].astype(str)
total_df
# index_x Tot_x index_y Tot_y
# 49=51 49 9337 51 13500
# 127=129 127 2953 129 25281
# 84=86 84 8184 86 18594
(total_df.drop(['index_x', 'index_y'], axis=1)
.plot(kind='bar', rot=0))

How to merge pandas dataset onto itself based on a condition and a groupby

To best illustrate, consider the following SQL illustration:
Table StockPrices; BarSeqId is a sequential number, where each increment is information from the next minute of trading.
The goal to achieve in pandas data frame is to transform this data:
StockPrice BarSeqId LongProfitTarget
105 0 109
100 1 105
103 2 107
103 3 108
104 4 110
105 5 113
into this data:
StockPrice BarSeqId LongProfitTarget TargetHitBarSeqId
106 0 109 Nan
100 1 105 3
103 2 107 5
105 3 108 Nan
104 4 110 Nan
107 5 113 Nan
i.e. to create a new column which describes the soonest future time-frame at which the current row's price target will be hit.
Here is how it could be achieved in SQL:
SELECT S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget,
min(S2.BarSeqId) as TargetHitBarSeqId
FROM StockPrices S1
left outer join StockPrices S2 on S1.BarSeqId<s2.BarSeqId and
S2.StockPrice>=S1.LongProfitTarget
GROUP BY S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget
I would like the answer to be of the form:
someDataFrame['TargetHitBarSeqId'] = <pandas expression here>
assuming someDataFrame already has the columns StockPrice, BarSeqId and LongProfitTarget.
(Data edited to illustrate the case.) So in the second row the result should be
100 1 105 3
and NOT
100 1 105 0
since 3, not 0, occurs after 1.
It is important that the BarSeqId in question occurs in the future (greater than the current BarSeqId).
import numpy as np
import pandas as pd

df = pd.DataFrame({'StockPrice': [105, 100, 103, 105, 104, 107],
                   'BarSeqId': [0, 1, 2, 3, 4, 5],
                   'LongProfitTarget': [109, 105, 107, 108, 110, 113]})

def get_barseqid(longProfitTarget):
    try:
        idx = df.StockPrice[df.StockPrice >= longProfitTarget].index[0]
        return df.iloc[idx].BarSeqId
    except IndexError:
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget']), axis=1)
Here's one solution:
import pandas as pd
import numpy as np
df = <your input data frame>
def get_barseqid(longProfitTarget):
    try:
        idx = df.StockPrice[df.StockPrice >= longProfitTarget].index[0]
        return df.iloc[idx].BarSeqId
    except IndexError:
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget']), axis=1)
Output:
StockPrice BarSeqId LongProfitTarget TargetHitBarSeqId
0 100 1 105 3.0
1 103 2 107 5.0
2 105 3 108 NaN
3 104 4 110 NaN
4 107 5 113 NaN
import numpy as np
import pandas as pd

df = pd.DataFrame({'StockPrice': [105, 100, 103, 105, 104, 107],
                   'BarSeqId': [0, 1, 2, 3, 4, 5],
                   'LongProfitTarget': [109, 105, 107, 108, 110, 113]})

def get_barseqid(longProfitTarget, barseq):
    try:
        idx = df[(df.StockPrice >= longProfitTarget) & (df.BarSeqId > barseq)].index[0]
        return df.iloc[idx].BarSeqId
    except IndexError:
        return np.nan

df['TargetHitBarSeqId'] = df.apply(
    lambda row: get_barseqid(row['LongProfitTarget'], row['BarSeqId']), axis=1)
df
The key misunderstanding for me was the need to use the & operator instead of a plain 'or'.
Assuming data is manageable, consider a cross join followed by filter and groupby, which would replicate the SQL query:
cdf = pd.merge(df.assign(key=1), df.assign(key=1), on='key', suffixes=['','_'])\
.query('(BarSeqId < BarSeqId_) & (LongProfitTarget <= StockPrice_)')\
.groupby(['StockPrice', 'BarSeqId', 'LongProfitTarget'])['BarSeqId_'].min()
print(cdf)
# StockPrice BarSeqId LongProfitTarget
# 100 1 105 3
# 103 2 107 5
# Name: BarSeqId_, dtype: int64
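On pandas 1.2+, the same cross join can be written without the key=1 trick via merge(how='cross'); a sketch using the question's sample data, with a final left merge to restore rows whose target is never hit:

```python
import pandas as pd

df = pd.DataFrame({'StockPrice': [105, 100, 103, 105, 104, 107],
                   'BarSeqId': [0, 1, 2, 3, 4, 5],
                   'LongProfitTarget': [109, 105, 107, 108, 110, 113]})

# Cross join every bar with every other bar, keep only future bars that
# reach the target, then take the earliest such bar per original row
hits = (df.merge(df, how='cross', suffixes=['', '_'])
          .query('(BarSeqId < BarSeqId_) & (LongProfitTarget <= StockPrice_)')
          .groupby(['StockPrice', 'BarSeqId', 'LongProfitTarget'])['BarSeqId_']
          .min()
          .rename('TargetHitBarSeqId')
          .reset_index())

# Left merge so rows whose target is never hit come back as NaN
out = df.merge(hits, on=['StockPrice', 'BarSeqId', 'LongProfitTarget'], how='left')
print(out)
```

Like the cross-join version above, this is O(n^2) in rows, so it is only practical while the data stays manageable.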
