Advanced Pivot Table in Pandas

Advanced Pivot Table in Pandas - python

I am trying to optimize some table transformation scripts in Python Pandas, which I am trying to feed with huge data sets (above 50k rows). I wrote a script that iterates through every index and parses values into a new data frame (see example below), but I am experiencing performance issues. Is there any pandas function, that could get the same results without iterating?
Example code:
from datetime import datetime
import pandas as pd
date1 = datetime(2019,1,1)
date2 = datetime(2019,1,2)
df = pd.DataFrame({"ID": [1,1,2,2,3,3],
"date": [date1,date2,date1,date2,date1,date2],
"x": [1,2,3,4,5,6],
"y": ["a","a","b","b","c","c"]})
new_df = pd.DataFrame()
for i in df.index:
new_df.at[df.at[i, "ID"], "y"] = df.at[i, "y"]
if df.at[i, "date"] == datetime(2019,1,1):
new_df.at[df.at[i, "ID"], "x1"] = df.at[i, "x"]
elif df.at[i, "date"] == datetime(2019,1,2):
new_df.at[df.at[i, "ID"], "x2"] = df.at[i, "x"]
output:
ID date x y
0 1 2019-01-01 1 a
1 1 2019-01-02 2 a
2 2 2019-01-01 3 b
3 2 2019-01-02 4 b
4 3 2019-01-01 5 c
5 3 2019-01-02 6 c
y x1 x2
1 a 1.0 2.0
2 b 3.0 4.0
3 c 5.0 6.0
The transformation basically groups the rows by the "ID" column and gets the "x1" values from the rows with date 2019-01-01, and the "x2" values from the rows with date 2019-01-02. The "y" value is the same within the same "ID". "ID" columns become the new indexes.
I'd appreciate any advice on this matter.

Using pivot_tables will get what you are looking for:
result = df.pivot_table(index=['ID', 'y'], columns='date', values='x')
result.rename(columns={date1: 'x1', date2: 'x2'}).reset_index('y')

Related

Resampling timeseries dataframe with multi-index

Generate data:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(freq=f'{FREQ}T',start='2020-10-01',periods=(12)*24))
df['col1'] = np.random.normal(size = df.shape[0])
df['col2'] = np.random.random_integers(1, 100, size= df.shape[0])
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq=f'{FREQ}T',start='2020-10-01',periods=(12)*24))
df2['col1'] = np.random.normal(size = df2.shape[0])
df2['col2'] = np.random.random_integers(1, 50, size= df2.shape[0])
df2['uid'] = 2
df3=pd.concat([df, df2]).reset_index()
df3=df3.set_index(['index','uid'])
I am trying to resample the data to 30min intervals and assign how to aggregate the data for each uid and each column individually. I have many columns and I need to assign whether if I want the mean, median, std, max, min, for each column. Since there are duplicate timestamps I need to do this operation for each user, that's why I try to set the multiindex and do the following:
df3.groupby(pd.Grouper(freq='30Min',closed='right',label='right')).agg({
"col1": "max", "col2": "min", 'uid':'max'})
but I get the following error
ValueError: MultiIndex has no single backing array. Use
'MultiIndex.to_numpy()' to get a NumPy array of tuples.
How can I do this operation?

You have to specify the level name when you use pd.Grouper on index:
out = (df3.groupby([pd.Grouper(level='index', freq='30T', closed='right', label='right'), 'uid'])
.agg({"col1": "max", "col2": "min"}))
print(out)
# Output
col1 col2
index uid
2020-10-01 00:00:00 1 -0.222489 77
2 -1.490019 22
2020-10-01 00:30:00 1 1.556801 16
2 0.580076 1
2020-10-01 01:00:00 1 0.745477 12
... ... ...
2020-10-02 23:00:00 2 0.272276 13
2020-10-02 23:30:00 1 0.378779 20
2 0.786048 5
2020-10-03 00:00:00 1 1.716791 20
2 1.438454 5
[194 rows x 2 columns]

How to deduct the values from pandas (same column)?

I am trying to manipulate excel sheet data to automate a process on excel(not a developer) in order to delete the value of the last row from the first then the value of the last -1 from the second and so on, my data is similar to the below
Code Col1 Col2
0 A 1.7653 56.2
1 B 1 Nan
2 C Nan 5
3 D 34.4 0
and i have to deduct the last last column from the first, then then last -1 from the second and so on until i meet them in the middle(assuming that we'll only be having even numbers of rows), i already solved the issue of getting rid of columns having strings so my output pandas looks like this
Col1 Col2
0 1.7653 56.2
1 1 Nan
2 Nan 5
3 34.4 0
now i need to deduct the values so the new panda frame to be created will look like this:
the values below are found after the deductions
Col1 Col2
0 -32.2347 56.2
1 1 -5
I was able to delete it per 1 value but not iteratively no matter how many rows i have and create a pandas half the rows of the first with the same columns as output
Nan will be treated as 0 and the actual dataset has hundreds of columns and rows that can change
code:
import pandas as pd
import datetime
# Create a dataframe
df = pd.read_excel(r'file.xls', sheet_name='sheet1')
for col, dt in df.dtypes.items():
if dt == object:
df = df.drop(col, 1)
i=0
for col in df.dtypes.items():
while i < len(df)/2:
df[i] = df[i] - df[len(df) - i]
i++

An approach could be the following:
import pandas as pd
import numpy as np
df = pd.DataFrame([["A", 1.7653, 56.2], ["B", 1, np.nan], ["C", np.nan, 5], ["D", 34.4, 0]], columns=["Code", "Col1", "Col2"], )
del df["Code"]
df.fillna(0, inplace=True)
s = df.shape[0] // 2
differences = pd.DataFrame([df.iloc[i] - df.iloc[df.shape[0]-i-1] for i in range(s)])
print(differences)
OUTPUT
Col1 Col2
0 -32.6347 56.2
1 1.0000 -5.0
FOLLOW UP
Reading the comments, I understand that the subtraction logic you want to apply is the following:
Normal subtraction if both numbers are not nan
If one of the numbers is nan, then swap the nan with 0
If both numbers are nan, a `nan is returned
I don't know if there is a function which works like that out of the box, hence I have created a custom_sub.
In avoidance of doubt, this is the file I am using
grid.txt
,Code,Col1,Col2
0,A,1.7653,56.2
1,B,1,
2,C,,5
3,D,34.4,0
The code:
import pandas as pd
import numpy as np
df = pd.read_csv("grid.txt", sep=",",index_col=[0])
del df["Code"]
def custom_sub(x1, x2):
if np.isnan(x1) or np.isnan(x2):
if np.isnan(x1) and np.isnan(x2):
return np.nan
else:
return -x2 if np.isnan(x1) else x1
else:
return x1 - x2
s = df.shape[0] // 2
differences = pd.DataFrame([df.iloc[i].combine(df.iloc[df.shape[0]-i-1], custom_sub) for i in range(s)])
print(differences)

Adding new columns to Pandas Data Frame which the length of new column value is bigger than length of index

I'm in a trouble with adding a new column to a pandas dataframe when the length of new column value is bigger than length of index.
Data may like this :
import pandas as pd
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
So, you see, length of this df's index is 3.
And next I wanna add a new column , code may like this two ways below:
df["new_col"] = [1,2,3,4]
It'll raise an error : Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
I will just get values[1,2,3] in my data frame df.
(The count of new column values can't out of the max index).
Now, what I want just like :
Is there a better way ?
Looking forward to your answer,thanks!

Use DataFrame.join with change Series name and right join:
#if not default index
#df = df.reset_index(drop=True)
df = df.join(pd.Series([1,2,3,4]).rename('new_col'), how='right')
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Another idea is add reindex by new s.index:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index)
df["new_col"] = s
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
s = pd.Series([1,2,3,4])
df = df.reindex(s.index).assign(new_col = s)

df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
new_col = pd.Series([1,2,3,4])
df = pd.concat([df,new_col],axis=1)
print(df)
bar zoo 0
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4

How to combine two pandas dataframes value by value

I have 2 dataframes - players (only has playerid) and dates (only has date). I want new dataframe which will contain for each player each date. In my case, players df contains about 2600 rows and date df has 1100 rows. I used 2 for loops to do this, but it is really slow, is there a way to do it faster via some function? thx
my loop:
player_elo = pd.DataFrame(columns = ['PlayerID','Date'])
for row in players.itertuples():
idx = row.Index
pl = players.at[idx,'PlayerID']
for i in dates.itertuples():
idd = row.Index
dt = dates.at[idd, 0]
new = {'PlayerID': [pl], 'Date': [dt]}
new = pd.DataFrame(new)
player_elo = player_elo.append(new)

If you have a key that is repeated for each df, you can come up with the cartesian product you are looking for using pd.merge().
import pandas as pd
players = pd.DataFrame([['A'], ['B'], ['C']], columns=['PlayerID'])
dates = pd.DataFrame([['12/12/2012'],['12/13/2012'],['12/14/2012']], columns=['Date'])
dates['Date'] = pd.to_datetime(dates['Date'])
players['key'] = 1
dates['key'] = 1
print(pd.merge(players, dates,on='key')[['PlayerID', 'Date']])
Output
PlayerID Date
0 A 2012-12-12
1 A 2012-12-13
2 A 2012-12-14
3 B 2012-12-12
4 B 2012-12-13
5 B 2012-12-14
6 C 2012-12-12
7 C 2012-12-13
8 C 2012-12-14

pandas dataframe drop columns by number of nan

I have a dataframe with some columns containing nan. I'd like to drop those columns with certain number of nan. For example, in the following code, I'd like to drop any column with 2 or more nan. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement it?
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan
print dff

There is a thresh param for dropna, you just need to pass the length of your df - the number of NaN values you want as your threshold:
In [13]:
dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
A B
0 0.517199 -0.806304
1 -0.643074 0.229602
2 0.656728 0.535155
3 NaN -0.162345
4 -0.309663 -0.783539
5 1.244725 -0.274514
6 -0.254232 NaN
7 -1.242430 0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above will drop any column that does not meet the criteria of the length of the df (number of rows) - 2 as the number of non-Na values.

You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
A B
0 -0.819004 0.919190
1 0.922164 0.088111
2 0.188150 0.847099
3 NaN -0.053563
4 1.327250 -0.376076
5 3.724980 0.292757
6 -0.319342 NaN
7 -1.051529 0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026

Here is a possible solution:
s = dff.isnull().apply(sum, axis=0) # count the number of nan in each column
print s
A 1
B 1
C 3
dtype: int64
for col in dff:
if s[col] >= 2:
del dff[col]
Or
for c in dff:
if sum(dff[c].isnull()) >= 2:
dff.drop(c, axis=1, inplace=True)

I recommend the drop-method. This is an alternative solution:
dff.drop(dff.loc[:,len(dff) - dff.isnull().sum() <2], axis=1)

Say you have to drop columns having more than 70% null values.
data.drop(data.loc[:,list((100*(data.isnull().sum()/len(data.index))>70))].columns, 1)

You can do this through another approach as well like below for dropping columns having certain number of na values:
df = df.drop( columns= [x for x in df if df[x].isna().sum() > 5 ])
For dropping columns having certain percentage of na values :
df = df.drop(columns= [x for x in df if round((df[x].isna().sum()/len(df)*100),2) > 20 ])

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Advanced Pivot Table in Pandas - python

Using pivot_tables will get what you are looking for: result = df.pivot_table(index=['ID', 'y'], columns='date', values='x') result.rename(columns={date1: 'x1', date2: 'x2'}).reset_index('y')

Related

Resampling timeseries dataframe with multi-index

How to deduct the values from pandas (same column)?

Adding new columns to Pandas Data Frame which the length of new column value is bigger than length of index

How to combine two pandas dataframes value by value

pandas dataframe drop columns by number of nan

Categories

Resources