delete consecutive rows conditionally pandas - python

I have a df with columns (A, B, C, D, E). I want to:
1) Compare consecutive rows
2) If the absolute difference between consecutive E values is <= 1 AND the absolute difference between consecutive C values is > 7, then delete the row with the lower C value.
Sample Data:
A B C D E
0 94.5 4.3 26.0 79.0 NaN
1 34.0 8.8 23.0 58.0 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 98.0 8.2 13.0 193.7 5.5
5 20.5 9.6 17.0 157.3 5.3
6 32.9 5.4 24.5 45.9 79.8
Desired result:
A B C D E
0 94.5 4.3 26.0 79.0 NaN
1 34.0 8.8 23.0 58.0 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 32.9 5.4 24.5 45.9 79.8
Row 4 was deleted when compared with row 3. Row 5 then became row 4 and was also deleted when compared with row 3.
This code returns a boolean Series (not a df with values) and does not satisfy all of the conditions:
df = (abs(df.E.diff(-1)) <= 1) & (abs(df.C.diff(-1)) > 7.)
The result of the code:
0 False
1 False
2 False
3 True
4 False
5 False
6 False
dtype: bool
Any help appreciated.

Using shift() to compare the rows, and a while loop to iterate until no new change happens:
while True:
    rows = len(df)
    df = df[~((abs(df.E - df.E.shift(1)) <= 1) & (abs(df.C - df.C.shift(1)) > 7))]
    df.reset_index(inplace=True, drop=True)
    if rows == len(df):
        break
It produces the desired output:
A B C D E
0 94.5 4.3 26.0 79.00 NaN
1 34.0 8.8 23.0 58.00 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 32.9 5.4 24.5 45.90 79.8
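Note that the shift-based filter always drops the later row of an offending pair, which happens to coincide with the lower-C row in this sample. If you need to enforce the "delete the row with the lower C value" rule literally, a rough sketch along the same lines (a variation, not the posted answer; it assumes df again holds the original data and that only adjacent rows are compared) could be:
import pandas as pd

while True:
    n = len(df)
    # flag the second row of each offending pair
    pair = (df.E.diff().abs() <= 1) & (df.C.diff().abs() > 7)
    # within a flagged pair, mark whichever row has the lower C value
    drop_curr = pair & (df.C < df.C.shift(1))                                # current row is lower
    drop_prev = (pair & (df.C.shift(1) < df.C)).shift(-1, fill_value=False)  # previous row is lower
    df = df[~(drop_curr | drop_prev)].reset_index(drop=True)
    if len(df) == n:
        break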


correlation matrix with group-by and sort

I am trying to calculate a correlation matrix with groupby and sort. I have 100 companies from 11 industries. I would like to group by industry, sort by total assets (atq), and then calculate the correlation of data.pr_multi with this order. However, when I sort and group by, it reverts and calculates in alphabetical order.
A sample of the data:
index  datafqtr  tic             pr_multi      atq  industry
0      2018Q1    A                    NaN   8698.0         4
1      2018Q2    A    -0.0856845728151735   8784.0         4
2      2018Q3    A     0.0035103320774146   8349.0         4
3      2018Q4    A    -0.0157732687260246   8541.0         4
4      2018Q1    AAL                  NaN  53280.0         5
5      2018Q2    AAL  -0.2694380292532717  52622.0         5
The code I use:
data1=data18.sort_values(['atq'],ascending=False).groupby('industry').head()
df = data1.pivot_table('pr_multi', ['datafqtr'], 'tic')
# calculate correlation matrix using inbuilt pandas function
correlation_matrix = df.corr()
correlation_matrix.head()
IIUC, you want to calculate the correlation between the order produced by the groupby/sort and the pr_multi column. Use:
data1=data18.groupby('industry')['atq'].apply(lambda x: x.sort_values(ascending=False))
np.corrcoef(data1.reset_index()['level_1'], data18['pr_multi'].astype(float).fillna(0))
Output:
array([[ 1. , -0.44754795],
[-0.44754795, 1. ]])
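For completeness, here is the same snippet with imports and the intermediate step named (a sketch; data18 is assumed to be the full quarterly dataset sampled above):
import numpy as np
import pandas as pd

# sort atq within each industry; the result keeps the original row labels in a MultiIndex
data1 = data18.groupby('industry')['atq'].apply(lambda x: x.sort_values(ascending=False))
# 'level_1' holds those original row labels in the new (industry, descending atq) order
order = data1.reset_index()['level_1']
print(np.corrcoef(order, data18['pr_multi'].astype(float).fillna(0)))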
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
df.groupby('name')[['col1','col2']].corr() # you can put as many desired columns here
Output:
               col1      col2
name
a    col1  1.000000  0.974467
     col2  0.974467  1.000000
b    col1  1.000000  0.975120
     col2  0.975120  1.000000
The data is like this:
name col1 col2
0 a 13.7 7.8
1 a -14.7 -9.7
2 a -3.4 -0.6
3 a 7.4 3.3
4 a -5.3 -1.9
5 a -8.3 -2.3
6 a 8.9 3.7
7 a 10.0 7.9
8 a 1.8 -0.4
9 a 6.7 3.1
10 a 17.4 9.9
11 a 8.9 7.7
12 a -3.1 -1.5
13 a -12.2 -7.9
14 a 7.6 4.9
15 a 4.2 2.3
16 a -15.3 -5.6
17 a 9.9 6.7
18 a 11.0 5.2
19 a 5.7 5.1
20 a -0.3 -0.6
21 a -15.0 -8.7
22 a -10.6 -5.7
23 a -16.0 -9.1
24 b 16.7 8.5
25 b 9.2 8.2
26 b 4.7 3.4
27 b -16.7 -8.7
28 b -4.8 -1.5
29 b -2.6 -2.2
30 b 16.3 9.5
31 b 15.8 9.8
32 b -10.8 -7.3
33 b -5.4 -3.4
34 b -6.0 -1.8
35 b 1.9 -0.6
36 b 6.3 6.1
37 b -14.7 -8.0
38 b -16.1 -9.7
39 b -10.5 -8.0
40 b 4.9 1.0
41 b 11.1 4.5
42 b -14.8 -8.5
43 b -0.2 -2.8
44 b 6.3 1.7
45 b -14.1 -8.7
46 b 13.8 8.9
47 b -6.2 -3.0
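If you only need the single col1/col2 correlation per group rather than the full matrix, a small variant (just a sketch, assuming the same name, col1 and col2 columns) is:
import pandas as pd

df = pd.read_csv('data.csv')  # assumed to contain the name, col1 and col2 columns shown above
# one correlation value per group instead of a full matrix
per_group = df.groupby('name').apply(lambda g: g['col1'].corr(g['col2']))
print(per_group)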

Remove NaN from each column and rearranging it with python pandas/numpy

I have a similar issue to my previous question:
Remove zero from each column and rearranging it with python pandas/numpy
But in this case, I need to remove NaN. I have tried many solutions, including modifying solutions from my previous post:
a = a[a!=np.nan].reshape(-1,3)
but it gave me a weird result (NaN compares unequal to everything, even itself, so a != np.nan is True for every element and the mask keeps the NaNs too).
Here is my initial matrix from Dataframe :
A B C D E F
nan nan nan 0.0 27.7 nan
nan nan nan 5.0 27.5 nan
nan nan nan 10.0 27.4 nan
0.0 29.8 nan nan nan nan
5.0 29.9 nan nan nan nan
10.0 30.0 nan nan nan nan
nan nan 0.0 28.6 nan nan
nan nan 5.0 28.6 nan nan
nan nan 10.0 28.5 nan nan
nan nan 15.0 28.4 nan nan
nan nan 20.0 28.3 nan nan
nan nan 25.0 28.2 nan nan
And I expect to have result like this :
A B
0.0 27.7
5.0 27.5
10.0 27.4
0.0 29.8
5.0 29.9
10.0 30.0
0.0 28.6
5.0 28.6
10.0 28.5
15.0 28.4
20.0 28.3
25.0 28.2
Solution:
Given the input dataframe a:
a.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')
This compacts each column independently (so values from different rows can end up side by side) and then drops any column that still contains NaN.
Example:
import numpy as np
import pandas as pd
a = pd.DataFrame({'A':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                  'B':[np.nan,np.nan,np.nan,np.nan,np.nan,4],
                  'C':[7,np.nan,9,np.nan,2,np.nan],
                  'D':[1,3,np.nan,7,np.nan,np.nan],
                  'E':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})
print (a)
A B C D E
0 NaN NaN 7.0 1.0 NaN
1 NaN NaN NaN 3.0 NaN
2 NaN NaN 9.0 NaN NaN
3 NaN NaN NaN 7.0 NaN
4 NaN NaN 2.0 NaN NaN
5 NaN 4.0 NaN NaN NaN
a_new = a.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')
print(a_new)
C D
0 7.0 1.0
1 9.0 3.0
2 2.0 7.0
Use np.isnan to test for missing values, with ~ to invert the mask, provided there are always exactly 2 non-missing values per row:
a = df.to_numpy()
df = pd.DataFrame(a[~np.isnan(a)].reshape(-1,2))
print (df)
0 1
0 0.0 27.7
1 5.0 27.5
2 10.0 27.4
3 0.0 29.8
4 5.0 29.9
5 10.0 30.0
6 0.0 28.6
7 5.0 28.6
8 10.0 28.5
9 15.0 28.4
10 20.0 28.3
11 25.0 28.2
Another idea is to use the justify function and then drop the all-NaN columns (a sketch of justify is included after the output below):
df1 = (pd.DataFrame(justify(a, invalid_val=np.nan),
                    columns=df.columns).dropna(how='all', axis=1))
print (df1)
A B
0 0.0 27.7
1 5.0 27.5
2 10.0 27.4
3 0.0 29.8
4 5.0 29.9
5 10.0 30.0
6 0.0 28.6
7 5.0 28.6
8 10.0 28.5
9 15.0 28.4
10 20.0 28.3
11 25.0 28.2
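The justify helper used above is not defined in this answer; a commonly shared NumPy implementation (reproduced here as an assumed sketch) looks like this:
import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    """Push the valid values of a 2D array to one side, filling the rest with invalid_val."""
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)          # False values sort before True
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=float)   # float so NaN can be stored
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

It is called as justify(a, invalid_val=np.nan), where a = df.to_numpy().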
EDIT: timings, with the sample data repeated 1000 times:
df = pd.concat([df] * 1000, ignore_index=True)
a = df.to_numpy()
print (a.shape)
(12000, 6)
In [168]: %timeit df.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')
8.06 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [172]: %%timeit
     ...: a = df.to_numpy()
     ...: pd.DataFrame(a[~np.isnan(a)].reshape(-1,2))
422 µs ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [173]: %timeit pd.DataFrame(justify(a, invalid_val=np.nan),columns=df.columns).dropna(how='all', axis=1)
2.88 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

exporting to table not aligned

I'm trying to scrape a table from this link:
https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc
When scraping the table, the names and stat categories align, but the numbers themselves don't.
import csv
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(
    requests.get("https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc", timeout=30).text,
    'lxml')

def scrape_data(url):
    # the categories of stats (first row)
    ct = soup.find_all('tr', class_="Table__TR Table__even")
    # player's stats table (the names and numbers)
    st = soup.find_all('tr', class_="Table__TR Table__TR--sm Table__even")
    header = [th.text.rstrip() for th in ct[1].find_all('th')]
    with open('s espn.csv', 'w') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        for row in st[1:]:
            data = [th.text.rstrip() for th in row.find_all('td')]
            writer.writerow(data)

scrape_data(soup)
Screenshot of the misaligned output: https://imgur.com/UFHC8wf
That's because those are under 2 separate table tags: one table for the names, then another table for the stats. You'll need to merge them.
Use pandas. It is a lot easier for parsing tables (it actually uses BeautifulSoup under the hood). pd.read_html() will return a list of dataframes; I just merged them together. Then you can also manipulate the tables any way you need.
import pandas as pd
url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
dfs = pd.read_html(url)
df = dfs[0].join(dfs[1])
df[['Name','Team']] = df['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)
df.to_csv('s espn.csv', index=False)
Output:
print (df.head(10).to_string())
RK Name Team POS GP MIN PTS FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% REB AST STL BLK TO DD2 TD3 PER
0 1 Trae Young ATL PG 2 36.5 38.5 13.5 23.0 58.7 5.5 10.0 55.0 6.0 8.0 75.0 7.0 9.0 1.5 0.0 5.5 0 0 40.95
1 2 Kyrie Irving BKN PG 3 34.7 37.7 12.0 26.3 45.6 4.7 11.3 41.2 9.0 9.7 93.1 5.7 6.3 1.7 0.7 2.0 0 0 37.84
2 3 Karl-Anthony Towns MIN C 3 33.7 32.0 10.7 20.3 52.5 5.0 9.7 51.7 5.7 9.0 63.0 13.3 5.0 3.0 2.0 2.3 3 0 40.60
3 4 Damian Lillard POR PG 3 36.7 31.7 10.7 20.3 52.5 2.3 8.0 29.2 8.0 9.0 88.9 4.0 6.0 1.7 0.3 2.7 0 0 32.86
4 5 Giannis Antetokounmpo MIL PF 2 32.5 29.5 11.5 19.0 60.5 1.0 5.0 20.0 5.5 10.0 55.0 15.0 10.0 2.0 1.5 5.5 2 1 34.55
5 6 Luka Doncic DAL SF 3 36.3 29.3 10.0 20.0 50.0 3.0 9.7 31.0 6.3 8.0 79.2 10.3 7.3 2.3 0.0 4.3 2 1 29.45
6 7 Pascal Siakam TOR PF 3 34.0 28.7 9.7 21.0 46.0 2.7 5.7 47.1 6.7 7.0 95.2 10.7 3.7 0.3 0.3 4.3 1 0 24.56
7 8 Brandon Ingra mNO SF 3 35.0 27.3 10.7 20.3 52.5 3.3 6.3 52.6 2.7 3.3 80.0 9.3 4.3 0.3 1.7 2.7 1 0 25.72
8 9 Kristaps Porzingis DAL PF 3 31.0 26.3 8.7 18.7 46.4 3.0 7.3 40.9 6.0 8.3 72.0 5.7 3.3 0.3 2.7 2.3 0 0 27.15
9 10 Russell Westbrook HOU PG 2 33.0 26.0 8.0 17.0 47.1 2.0 5.0 40.0 8.0 10.5 76.2 13.0 10.0 1.5 0.5 3.5 2 1 30.96
...
Explained line by line:
dfs = pd.read_html(url)
This will return a list of dataframes. Essentially it parses every <table> tag within the html. When you do this with the given URL, you will see it returns 2 dataframes:
The dataframe at index position 0 has the ranking and player name. Looking at it, I notice the text is the player name followed by the team abbreviation (the last few characters of the string).
So dfs[0]['Team'] = dfs[0]['Name'].str[-3:] creates a column called 'Team' holding the last 3 characters of each Name string, and dfs[0]['Name'] = dfs[0]['Name'].str[:-3] keeps everything up to the last 3 characters, essentially splitting the Name column into 2 columns.
df = dfs[0].merge(dfs[1], left_index=True, right_index=True)
The last part then takes those 2 dataframes at index positions 0 and 1 (stored in the dfs list) and merges them on the index values.
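For reference, here is the string-slicing variant described in the explanation (a sketch; the slice assumes every team abbreviation is exactly 3 characters, whereas the regex in the posted code splits on the trailing run of capital letters and so also handles 2-letter abbreviations):
import pandas as pd

url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
dfs = pd.read_html(url)                  # one dataframe per <table> tag: [names, stats]

names = dfs[0].copy()
names['Team'] = names['Name'].str[-3:]   # last 3 characters as the team abbreviation
names['Name'] = names['Name'].str[:-3]   # the rest as the player name

df = names.merge(dfs[1], left_index=True, right_index=True)
print(df.head())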

Function on dataframe rows to reduce duplicate pairs Python

I've got a dataframe that looks like:
0 1 2 3 4 5 6 7 8 9 10 11
12 13 13 13.4 13.4 12.4 12.4 16 0 0 0 0
14 12.2 12.2 13.4 13.4 12.6 12.6 19 5 5 6.7 6.7
.
.
.
Each 'layer'/row has duplicate pairs that I want to reduce.
The one problem is that there are repeating 0s as well, so I can't simply drop duplicates per row or the rows will end up with different lengths.
I would like a function that I can apply to all rows of this dataframe to get this:
0 1 2 3 4 5 6
12 13 13.4 12.4 16 0 0
14 12.2 13.4 12.6 19 5 6.7
.
.
.
Is there a simple function I could write to do this?
Method 1 using transpose
As mentioned by Yuca in the comments:
df = df.T.drop_duplicates().T
df.columns = range(len(df.columns))
print(df)
0 1 2 3 4 5 6
0 12.0 13.0 13.4 12.4 16.0 0.0 0.0
1 14.0 12.2 13.4 12.6 19.0 5.0 6.7
Method 2 using a list comprehension with even column positions
We can build a list of the even column positions (0, 2, 4, ...) and select those columns by index. Note this assumes the pairs run uninterrupted from the first column, so here it silently drops the unpaired 16/19 column:
idxcols = [x-1 for x in range(len(df.columns)) if x % 2]
df = df.iloc[:, idxcols]
df.columns = range(len(df.columns))
print(df)
0 1 2 3 4 5
0 12 13.0 13.4 12.4 0 0.0
1 14 12.2 13.4 12.6 5 6.7
In your case
l = [sorted(set(x), key=x.index) for x in df.values.tolist()]  # drop duplicates per row, keeping first-seen order
newdf = pd.DataFrame(l).ffill(axis=1)
newdf
Out[177]:
0 1 2 3 4 5 6
0 12.0 13.0 13.4 12.4 16.0 0.0 0.0
1 14.0 12.2 13.4 12.6 19.0 5.0 6.7
You can use functools.reduce to sequentially concatenate columns to your output DataFrame if the next column is not equal to the last column added:
from functools import reduce
output_df = reduce(
    lambda d, c: d if (d.iloc[:, -1] == df[c]).all() else pd.concat([d, df[c]], axis=1),
    df.columns[1:],
    df[df.columns[0]].to_frame()
)
print(output_df)
# 0 1 3 5 7 8 10
#0 12 13.0 13.4 12.4 16 0 0.0
#1 14 12.2 13.4 12.6 19 5 6.7
This method also maintains the column names of the columns which were picked, if that's important.
Assuming this is your input df:
print(df)
# 0 1 2 3 4 5 6 7 8 9 10 11
#0 12 13.0 13.0 13.4 13.4 12.4 12.4 16 0 0 0.0 0.0
#1 14 12.2 12.2 13.4 13.4 12.6 12.6 19 5 5 6.7 6.7

Merging pandas dataframes: empty columns created in left

I have several datasets which I am trying to merge into one. Below, I created fictitious, simpler, smaller datasets to test the method, and it worked perfectly fine.
import pandas as pd

examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
                           'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
                           'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
                           'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30,40,50,60],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20,30,40,50,60],'T4':[12.0,12.2,12.4,13.2,14.1]})
logs = [log1, log2]
result = examplelog.copy()
for i in logs:
    result = result.merge(i, how='left', on='Depth')
print(result)
The result is, as expected:
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN 12.0
2 30 11.5 11.8 28.8 12.1 12.2
3 40 12.0 12.2 37.7 12.6 12.4
4 50 12.3 12.4 46.6 13.7 13.2
5 60 12.6 12.7 55.5 14.0 14.1
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Happy with the result, I applied this method to my actual data, but for T3 and T4 the resulting dataframe contained only empty columns (all values were NaN). I suspect that the problem lies with floating-point numbers, because my datasets were created on different machines by different software. Although "Depth" has a precision of two decimal places in all of the files, I am afraid it may not be exactly 20.05 in both of them: one file might hold 20.049999999999999 while the other holds 20.05000000000001. Then the merge function will not match them, as shown in the following example:
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
                           'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
                           'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
                           'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30.05,40.05,50.05,60.05],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20.01,30.01,40.01,50.01,60.01],'T4':[12.0,12.2,12.4,13.2,14.1]})
logs = [log1, log2]
result = examplelog.copy()
for i in logs:
    result = result.merge(i, how='left', on='Depth')
print(result)
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN NaN
2 30 11.5 11.8 28.8 NaN NaN
3 40 12.0 12.2 37.7 NaN NaN
4 50 12.3 12.4 46.6 NaN NaN
5 60 12.6 12.7 55.5 NaN NaN
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Do you know how to fix this?
Thanks!
Round the Depth values to the appropriate precision:
for df in [examplelog, log1, log2]:
    df['Depth'] = df['Depth'].round(1)
import numpy as np
import pandas as pd

examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
                           'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
                           'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
                           'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30.05,40.05,50.05,60.05],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20.01,30.01,40.01,50.01,60.01],
                     'T4':[12.0,12.2,12.4,13.2,14.1]})

for df in [examplelog, log1, log2]:
    df['Depth'] = df['Depth'].round(1)

logs = [log1, log2]
result = examplelog.copy()
for i in logs:
    result = result.merge(i, how='left', on='Depth')
print(result)
yields
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN 12.0
2 30 11.5 11.8 28.8 12.1 12.2
3 40 12.0 12.2 37.7 12.6 12.4
4 50 12.3 12.4 46.6 13.7 13.2
5 60 12.6 12.7 55.5 14.0 14.1
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Per the comments, rounding does not appear to work for the OP on the actual
data. To debug the problem, find some rows which should merge:
subframes = []
for frame in [examplelog, log2]:
    mask = (frame['Depth'] < 20.051) & (frame['Depth'] >= 20.0)
    subframes.append(frame.loc[mask])
Then post the output of:
for frame in subframes:
    print(frame.to_dict('list'))
    print(frame.info())  # shows the dtypes of the columns
This might give us the info we need to reproduce the problem.
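If the rounded keys still refuse to match exactly, another option (a suggestion, not part of the original answer) is a tolerance-based join with pd.merge_asof, which pairs each left Depth with the nearest right Depth within a given tolerance:
import pandas as pd

# both frames must be sorted by the join key; Depth is cast to float so the key dtypes match
result = examplelog.assign(Depth=examplelog['Depth'].astype(float)).sort_values('Depth')
for log in [log1, log2]:
    right = log.assign(Depth=log['Depth'].astype(float)).sort_values('Depth')
    result = pd.merge_asof(result, right, on='Depth', direction='nearest', tolerance=0.1)
print(result)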
