Joining pandas dataframes by column headers


I have two data frames (Actual and Targets) with the following headers:
print (df1)
WorkWeek Area Actual
0 202001 South 5
1 202001 North 5
2 202001 West 6
3 202001 East 8
4 202002 South 7
5 202002 North 9
6 202002 West 6
7 202002 East 3
8 202003 South 5
9 202003 North 85
10 202003 West 5
11 202003 East 11
12 202004 South 2
13 202004 North 2
14 202004 West 2
15 202004 East 2
print (df2)
WorkWeek South North West East
0 202001 60 90 70 80
1 202002 60 90 70 80
2 202003 60 90 70 80
3 202004 60 90 70 80
I want a joined data frame (Actual_vs_Targets), matched on WorkWeek and Area.
What should I do if I later want to add more areas?
Thank you!

Use DataFrame.melt with DataFrame.merge:
df22 = df2.melt('WorkWeek', var_name='Area', value_name='Target')
df = df1.merge(df22, on=['WorkWeek','Area'], how='left')
Or DataFrame.stack with DataFrame.join:
df22 = df2.set_index('WorkWeek').stack().rename_axis(['WorkWeek','Area']).rename('Target')
df = df1.join(df22, on=['WorkWeek','Area'])
print (df)
WorkWeek Area Actual Target
0 202001 South 5 60
1 202001 North 5 90
2 202001 West 6 70
3 202001 East 8 80
4 202002 South 7 60
5 202002 North 9 90
6 202002 West 6 70
7 202002 East 3 80
8 202003 South 5 60
9 202003 North 85 90
10 202003 West 5 70
11 202003 East 11 80
12 202004 South 2 60
13 202004 North 2 90
14 202004 West 2 70
15 202004 East 2 80
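On the question of adding more areas: a minimal sketch, assuming each new area simply arrives as an extra column in the targets frame. The "Central" area and all values below are hypothetical and not part of the original data. Because melt unpivots every column other than WorkWeek, the code itself should not need to change.

```python
import pandas as pd

# Hypothetical data: "Central" is an extra area that is not in the original question.
actual = pd.DataFrame({
    "WorkWeek": [202001, 202001, 202002, 202002],
    "Area": ["South", "Central", "South", "Central"],
    "Actual": [5, 4, 7, 6],
})
targets = pd.DataFrame({
    "WorkWeek": [202001, 202002],
    "South": [60, 60],
    "Central": [50, 55],
})

# melt() turns every non-id column into (Area, Target) rows,
# so a newly added area column is picked up automatically.
targets_long = targets.melt("WorkWeek", var_name="Area", value_name="Target")

# A left merge keeps every row of `actual`, even if a target is missing.
result = actual.merge(targets_long, on=["WorkWeek", "Area"], how="left")
print(result)
```

With how='left', an area that has actuals but no target column would show up with a missing Target rather than being dropped.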

Related

How can I turn this into a DataFrame?

I am new to Python, and was trying to run a basic web scraper. My code looks like this
import requests
import pandas as pd
x = requests.get('https://www.baseball-reference.com/players/p/penaje02.shtml')
dfs = pd.read_html(x.content)
print(dfs)
df = pd.DataFrame(dfs)
When I print dfs, it looks like this. I only want the second table.
[ Year Age Tm Lg G PA AB \
0 2018 20 HOU-min A- 36 156 136
1 2019 21 HOU-min A,A+ 109 473 409
2 2021 23 HOU-min AAA,Rk 37 160 145
3 2022 24 HOU AL 136 558 521
4 1 Yr 1 Yr 1 Yr 1 Yr 136 558 521
5 162 Game Avg. 162 Game Avg. 162 Game Avg. 162 Game Avg. 162 665 621
R H 2B ... OPS OPS+ TB GDP HBP SH SF IBB Pos \
0 22 34 5 ... 0.649 NaN 42 0 1 0 1 0 NaN
1 72 124 21 ... 0.825 NaN 180 4 11 0 6 0 NaN
2 25 43 5 ... 0.942 NaN 84 0 7 0 0 0 NaN
3 72 132 20 ... 0.715 101.0 222 5 6 1 5 0 *6/H
4 72 132 20 ... 0.715 101.0 222 5 6 1 5 0 NaN
5 86 157 24 ... 0.715 101.0 264 6 7 1 6 0 NaN
Awards
0 TRC · NYPL
1 DAV,FAY · MIDW,CARL
2 SKT,AST · AAAW,FCL
3 GG
4 NaN
5 NaN
[6 rows x 30 columns]]
However, I end up with the error Must pass 2-d input. shape=(1, 6, 30) after my last line. I have tried using df=dfs[1], but got the error list index out of range. Is there any way I can turn dfs from a list into a DataFrame?

What do you mean you only want the second table? There's only one table, it's 6 rows and 30 columns. The backslashes show up when whatever you're trying to print to isn't wide enough to contain the dataframe without line wrapping. Here's the dataframe printed in a wider terminal:
The pd.read_html() function returns a List[DataFrame] so you first need to grab your dataframe from the list, and then you can subset it to get the columns you care about:
df = dfs[0]
columns = ['R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB', 'Pos']
print(df[columns])
Output:
R H 2B 3B HR RBI SB CS BB SO BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB Pos
0 22 34 5 0 1 10 3 0 18 19 0.250 0.340 0.309 0.649 NaN 42 0 1 0 1 0 NaN
1 72 124 21 7 7 54 20 10 47 90 0.303 0.385 0.440 0.825 NaN 180 4 11 0 6 0 NaN
2 25 43 5 3 10 21 6 1 8 41 0.297 0.363 0.579 0.942 NaN 84 0 7 0 0 0 NaN
3 72 132 20 2 22 63 11 2 22 135 0.253 0.289 0.426 0.715 101.0 222 5 6 1 5 0 *6/H
4 72 132 20 2 22 63 11 2 22 135 0.253 0.289 0.426 0.715 101.0 222 5 6 1 5 0 NaN
5 86 157 24 2 26 75 13 2 26 161 0.253 0.289 0.426 0.715 101.0 264 6 7 1 6 0 NaN
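To see the two points above without hitting the website, here is a small sketch. The list below is a made-up stand-in for what pd.read_html() returns (one table with 30 columns); the column names and values are placeholders, not the real data.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html(); values are made up for illustration.
dfs = [pd.DataFrame({f"col{i}": range(3) for i in range(30)})]

# The list holds a single table, so index 0 is the only valid index.
df = dfs[0]

# to_string() does not wrap columns by default, so no backslashes appear.
text = df.to_string()
print(text)
```

Passing the list itself to pd.DataFrame is what triggers the "Must pass 2-d input" error, since a list containing one 6x30 table is three-dimensional.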

In Selenium see if the 'a' anchor tag contains the text I want, and then extract multiple td's of text in the same row

For my Python code, I have been trying to scrape data from NCAAF Stats. I am having trouble extracting the text of the td elements after I check whether the anchor tag 'a' contains the text I want. I want to find the team's number of TDs, points, and PPG. I can successfully find the school by its text in Selenium, but after that I am unable to extract the information I want. Here is what I have coded so far.
from selenium import webdriver
driver = webdriver.Chrome('C:\\Users\\Carl\\Downloads\\chromedriver.exe')
driver.get('https://www.ncaa.com/stats/football/fbs/current/team/27')
# I plan to make a while or for loop later, that is why I used f strings
team = "Coastal Carolina"
first = driver.find_element_by_xpath(f'//a[text()="{team}"]')
# This was the way another similarly asked question was answered but did not work
#tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text
# This grabs data from the very first row of data... not the one I want
tds = first.find_element_by_xpath('//following-sibling::td[4]').text
total_points = first.find_element_by_xpath('//following-sibling::td[10]').text
ppg = first.find_element_by_xpath('//following-sibling::td[11]').text
print(tds, total_points, ppg)
driver.quit()
I have tried to look around for a similarly asked question and was able to find this snippet
tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text
Unfortunately, it did not help me much. The html structure looks like this. I appreciate any help, and thank you!

No need to use Selenium, the page isn't dynamic. Just use pandas to parse the table for you:
import pandas as pd
url = 'https://www.ncaa.com/stats/football/fbs/current/team/27'
df = pd.read_html(url)[0]
print(df)
Output:
Rank Team G TDs PAT 2PT Def Pts FG Saf Pts PPG
0 1 Ohio St. 6 39 39 0 0 6 0 291.0 48.5
1 2 Pittsburgh 6 40 36 0 0 4 1 290.0 48.3
2 3 Coastal Carolina 7 43 42 0 0 6 1 320.0 45.7
3 4 Alabama 7 41 40 1 0 9 0 315.0 45.0
4 5 Ole Miss 6 35 30 1 0 6 1 262.0 43.7
5 6 Cincinnati 6 36 34 1 0 3 0 261.0 43.5
6 7 Oklahoma 7 35 34 1 1 17 0 299.0 42.7
7 - SMU 7 40 36 1 0 7 0 299.0 42.7
8 9 Texas 7 38 37 0 0 8 1 291.0 41.6
9 10 Western Ky. 6 31 27 1 0 10 0 245.0 40.8
10 11 Tennessee 7 36 36 0 0 7 1 275.0 39.3
11 12 Wake Forest 6 28 24 2 0 12 0 232.0 38.7
12 13 UTSA 7 33 33 0 0 13 0 270.0 38.6
13 14 Michigan 6 28 25 1 0 12 0 231.0 38.5
14 15 Georgia 7 34 33 0 0 10 1 269.0 38.4
15 16 Baylor 7 35 35 0 0 7 1 268.0 38.3
16 17 Houston 6 30 28 0 0 5 0 223.0 37.2
17 - TCU 6 29 28 0 0 7 0 223.0 37.2
18 19 Marshall 7 34 33 0 0 7 0 258.0 36.9
19 - North Carolina 7 34 32 2 0 6 0 258.0 36.9
20 21 Nevada 6 26 24 1 0 12 0 218.0 36.3
21 22 Virginia 7 31 29 2 0 10 2 253.0 36.1
22 23 Fresno St. 7 32 27 1 0 10 0 251.0 35.9
23 - Memphis 7 33 26 3 0 7 0 251.0 35.9
24 25 Texas Tech 7 32 31 0 0 9 0 250.0 35.7
25 26 Auburn 7 29 28 1 0 12 1 242.0 34.6
26 27 Florida 7 33 29 1 0 4 0 241.0 34.4
27 - Missouri 7 31 31 0 0 8 0 241.0 34.4
28 29 Liberty 7 33 29 1 0 3 1 240.0 34.3
29 - Michigan St. 7 30 30 0 0 10 0 240.0 34.3
30 31 UCF 6 28 26 0 0 3 1 205.0 34.2
31 32 Oregon St. 6 27 27 0 0 5 0 204.0 34.0
32 33 Oregon 6 26 26 0 0 7 0 203.0 33.8
33 34 Iowa St. 6 23 22 0 0 14 0 202.0 33.7
34 35 UCLA 7 30 28 0 0 9 0 235.0 33.6
35 36 San Diego St. 6 25 24 1 0 7 0 197.0 32.8
36 37 LSU 7 29 29 0 0 8 0 227.0 32.4
37 38 Louisville 6 24 23 0 0 9 0 194.0 32.3
38 - Miami (FL) 6 24 22 1 0 8 1 194.0 32.3
39 - NC State 6 25 24 0 0 6 1 194.0 32.3
40 41 Southern California 6 22 19 3 0 12 0 193.0 32.2
41 42 Tulane 7 31 23 4 0 2 0 223.0 31.9
42 43 Arizona St. 7 30 25 2 0 4 0 221.0 31.6
43 44 Utah 6 25 22 1 0 5 0 189.0 31.5
44 45 Air Force 7 29 27 1 0 5 1 220.0 31.4
45 46 App State 7 27 24 0 0 11 0 219.0 31.3
46 47 Arkansas 7 27 25 0 0 10 0 217.0 31.0
47 - Army West Point 6 25 22 0 0 4 1 186.0 31.0
48 - Notre Dame 6 23 20 2 0 8 0 186.0 31.0
49 - Western Mich. 7 28 25 0 0 8 0 217.0 31.0
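Once the table is a DataFrame, the row for one team can be selected by value instead of by XPath. The sketch below rebuilds a few rows from the output above by hand so that it runs offline; it assumes the parsed columns are named Team, TDs, Pts and PPG exactly as printed.

```python
import pandas as pd

# A few rows copied from the output above; in practice df would come from
# pd.read_html(url)[0]. Column names are assumed to match the printed header.
df = pd.DataFrame({
    "Rank": ["1", "2", "3"],
    "Team": ["Ohio St.", "Pittsburgh", "Coastal Carolina"],
    "TDs": [39, 40, 43],
    "Pts": [291.0, 290.0, 320.0],
    "PPG": [48.5, 48.3, 45.7],
})

team = "Coastal Carolina"

# Boolean mask picks the team's row; the column list picks the three stats.
row = df.loc[df["Team"] == team, ["TDs", "Pts", "PPG"]].iloc[0]
tds, total_points, ppg = row["TDs"], row["Pts"], row["PPG"]
print(tds, total_points, ppg)
```

The same lookup can sit inside a loop over several team names, which was the asker's stated plan.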

Conditional filling of column based on string

I have a dataset in which I need to either fill or drop rows conditionally, but so far I have been unsuccessful.
Idx Fruits Days Name
0 60 20
1 15 85.5
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Some of the cells are empty. I know I can fill them with fillna or a regex, or drop them.
However, I only want to handle the leading empty cells, up to the first row where a string appears, either by dropping those rows or by filling them with "."
Like below
Idx Fruits Days Name
0 60 20 .
1 15 85.5 .
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
and
Idx Fruits Days Name
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Is this possible using pandas, or with a loop?

You can try this:
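import numpy as np
# Treat empty strings as missing so that ffill() can see the gaps
df['Name'] = df['Name'].replace('', np.nan)
# Rows before the first name are still NaN after a forward fill; set those to '.'
df['Name'] = df['Name'].where(df['Name'].ffill().notna(), '.')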
print(df)
Idx Fruits Days Name
0 0 60 20.0 .
1 1 15 85.5 .
2 2 10 62.0 Peter
3 3 40 90.0 Maria
4 4 5 10.2
5 5 92 66.0
6 6 65 87.0 John
7 7 50 1.0 Eric
8 8 50 0.0 Maria
9 9 80 87.0 John
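The second desired output (dropping the leading rows instead of filling them) can reuse the same forward-fill mask. This is a sketch, assuming as above that the empty cells are empty strings; the frame is rebuilt from the question's table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Fruits": [60, 15, 10, 40, 5, 92, 65, 50, 50, 80],
    "Days": [20, 85.5, 62, 90, 10.2, 66, 87, 1, 0, 87],
    "Name": ["", "", "Peter", "Maria", "", "", "John", "Eric", "Maria", "John"],
})

# Assumption: empty cells are empty strings, as in the answer above.
names = df["Name"].replace("", np.nan)

# After a forward fill, only the rows before the first name are still missing.
has_started = names.ffill().notna()

# Keep everything from the first named row onward; later blanks are untouched.
trimmed = df[has_started]
print(trimmed)
```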

Comparing Two Data Frames in python

I have two data frames. I have to compare the two data frames and get the position of the unmatched data using python.
Note:
The First column will always not be unique.
Data Frame 1:
0 1 2 3 4
0 1 Dhoni 24 Kota 60000.0
1 2 Raina 90 Delhi 41500.0
2 3 Kholi 67 Ahmedabad 20000.0
3 4 Ashwin 45 Bhopal 8500.0
4 5 Watson 64 Mumbai 6500.0
5 6 KL Rahul 19 Indore 4500.0
6 7 Hardik 24 Bengaluru 1000.0
Data Frame 2
0 1 2 3 4
0 3 Kholi 67 Ahmedabad 20000.0
1 7 Hardik 24 Bengaluru 1000.0
2 4 Ashwin 45 Bhopal 8500.0
3 2 Raina 90 Delhi 41500.0
4 6 KL Rahul 19 Chennai 4500.0
5 1 Dhoni 24 Kota 60000.0
6 5 Watson 64 Mumbai 6500.0
I expect the output of (3,5)-(Indore - Chennai).

import pandas as pd
df1=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Indore'],'D':[6000.0,41500.0,4500.0]})
df2=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Chennai'],'D':[6000.0,41500.0,4500.0]})
# Tag each row with the frame it came from
df1['df']='df1'
df2['df']='df2'
# Rows present in both frames are duplicates on A-D; keep=False drops every copy, leaving only the differences
df=pd.concat([df1,df2],sort=False).drop_duplicates(subset=['A','B','C','D'],keep=False)
print(df)
A B C D df
2 KL Rahul 67 Indore 4500.0 df1
2 KL Rahul 67 Chennai 4500.0 df2
I added the df column to show which data frame each differing row comes from.
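If the position of the mismatch is needed, as the question asks, one possible approach is to line up the rows by a key and compare cell by cell. The sketch below uses the question's data and assumes column 0 uniquely identifies each row in both frames; if that column is not unique (the question's note may mean this), the alignment step would need a different key.

```python
import pandas as pd

df1 = pd.DataFrame([
    [1, "Dhoni", 24, "Kota", 60000.0],
    [2, "Raina", 90, "Delhi", 41500.0],
    [3, "Kholi", 67, "Ahmedabad", 20000.0],
    [4, "Ashwin", 45, "Bhopal", 8500.0],
    [5, "Watson", 64, "Mumbai", 6500.0],
    [6, "KL Rahul", 19, "Indore", 4500.0],
    [7, "Hardik", 24, "Bengaluru", 1000.0],
])
df2 = pd.DataFrame([
    [3, "Kholi", 67, "Ahmedabad", 20000.0],
    [7, "Hardik", 24, "Bengaluru", 1000.0],
    [4, "Ashwin", 45, "Bhopal", 8500.0],
    [2, "Raina", 90, "Delhi", 41500.0],
    [6, "KL Rahul", 19, "Chennai", 4500.0],
    [1, "Dhoni", 24, "Kota", 60000.0],
    [5, "Watson", 64, "Mumbai", 6500.0],
])

# Assumption: column 0 is a unique key present in both frames.
left = df1.set_index(0)
right = df2.set_index(0).loc[left.index]  # reorder df2's rows to match df1

# Boolean array of differing cells, then their (row, column) positions.
diff = left.ne(right).to_numpy()
mismatches = [
    (int(left.columns[c]), int(r), left.iat[r, c], right.iat[r, c])
    for r, c in zip(*diff.nonzero())
]
print(mismatches)
```

Each tuple holds the column label, the row position in the first frame, and the two differing values, which for this data corresponds to the (3,5)-(Indore - Chennai) result the question expects.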

How to perform conditional updation of column values in Pandas DataFrame?

I have the dataframe below. Is there any way to perform a conditional addition of column values in pandas?
emp_id emp_name City months_worked default_sal total_sal jan feb mar apr may jun
111 aaa pune 2 90 NaN 4 5 5 54 3 2
222 bbb pune 1 70 NaN 5 4 4 8 3 4
333 ccc mumbai 2 NaN NaN 9 3 4 8 4 3
444 ddd hyd 4 NaN NaN 3 8 6 4 2 7
What I want to achieve:
If City == pune, default_sal should be copied into total_sal. For example, for emp_id 111, total_sal should be 90.
If City != pune, total_sal should depend on months_worked. For example, for emp_id 333, months_worked = 2, so total_sal should be the sum of the jan and feb values, which is 9 + 3 = 12.
Desired output:
emp_id emp_name City months_worked default_sal total_sal jan feb mar apr may jun
111 aaa pune 2 90 90 4 5 5 54 3 2
222 bbb pune 1 70 70 5 4 4 8 3 4
333 ccc mumbai 2 NaN 12 9 3 4 8 4 3
444 ddd hyd 4 NaN 21 3 8 6 4 2 7

Use np.where after creating a helper series:
s1=pd.Series([df.iloc[x,6:y+6].sum() for x,y in enumerate(df.months_worked)],index=df.index)
np.where(df.City=='pune',df.default_sal,s1 )
Out[429]: array([90., 70., 12., 21.])
# To store the result in the existing column:
#df['total_sal']=np.where(df.City=='pune',df.default_sal,s1 )
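For reference, here is a self-contained sketch that rebuilds the question's frame and computes the helper values without a Python-level loop. It swaps the list comprehension for a row-wise cumulative sum, and it assumes months_worked is always at least 1 and never exceeds the number of month columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "emp_id": [111, 222, 333, 444],
    "emp_name": ["aaa", "bbb", "ccc", "ddd"],
    "City": ["pune", "pune", "mumbai", "hyd"],
    "months_worked": [2, 1, 2, 4],
    "default_sal": [90, 70, np.nan, np.nan],
    "total_sal": [np.nan] * 4,
    "jan": [4, 5, 9, 3], "feb": [5, 4, 3, 8], "mar": [5, 4, 4, 6],
    "apr": [54, 8, 8, 4], "may": [3, 3, 4, 2], "jun": [2, 4, 3, 7],
})
month_cols = ["jan", "feb", "mar", "apr", "may", "jun"]

# Running totals per row: position k holds the sum of the first k+1 months.
running = df[month_cols].cumsum(axis=1).to_numpy()

# For each row, take the running total at position months_worked - 1
# (assumes 1 <= months_worked <= number of month columns).
first_n_sum = running[np.arange(len(df)), df["months_worked"].to_numpy() - 1]

# pune rows take default_sal; all other rows take the sum of their first N months.
df["total_sal"] = np.where(df["City"] == "pune", df["default_sal"], first_n_sum)
print(df[["emp_id", "City", "months_worked", "total_sal"]])
```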
