Copy column data into rows below in Python pandas
I am new to Python and need to do some data management. I have a large CSV dataset in which the same set of column names repeats every few columns. I need to move each repeated column set below the end of the rows of the first set. As shown in the sample below, I want to cut and paste the data for each ID (03]01]17, 03]01]16, 03]01]15, and so on). I have attached the sample data and the required format.
,Day,Time,Q,V,N,Unnamed: 5,Q.1,V.1,N.1,Unnamed: 9,Q.2,V.2,N.2,Unnamed: 13,Q.3,V.3,N.3
0,,,03]01]17,,,,03]01]16,,,,03]01]15,,,,03]01]14,,
1,,,,,,,,,,,,,,,,,
2,2019N11,00:00-00:05,48,80.2,2.3,,65,78.8,2.8,,67,78.6,2.9, ,71,84.3,2.6
3,,00:05-00:10,87,75.1,4.2,,102,77.2,4.8,,98,76.2,4.7, ,94,83.9,4.4
4,,00:10-00:15,56,78.0,2.2,,62,81.2,2.3,,66,77.2,2.7, ,70,81.3,2.6
5,,00:15-00:20,62,73.6,2.7,,79,76.9,3.3,,82,74.5,3.5, ,78,84.1,2.8
6,,00:20-00:25,69,75.6,3.0,,84,77.4,3.6,,81,75.4,3.5, ,81,83.0,3.2
7,,00:25-00:30,65,76.0,2.6,,69,77.2,2.7,,75,76.1,3.2, ,72,84.4,2.7
8,,00:30-00:35,62,77.9,2.6,,77,79.4,3.2,,82,77.9,3.4, ,83,86.1,3.1
9,,00:35-00:40,63,80.0,2.2,,82,76.6,3.2,,79,78.7,3.2, ,74,86.0,2.6
10,,00:40-00:45,52,79.5,2.0,,66,81.2,2.2,,69,78.9,2.5, ,74,85.0,2.6
11,,00:45-00:50,59,73.9,2.6,,73,78.9,2.9,,76,76.7,3.0, ,73,84.3,2.6
12,,00:50-00:55,67,77.4,2.8,,89,78.0,3.4,,87,74.9,3.4, ,90,82.6,3.1
13,,00:55-01:00,49,74.2,1.9,,75,76.6,2.8,,78,73.5,3.0, ,75,82.9,2.6
dfsample = pd.read_clipboard(sep=',')
dfsample
## Required format
,ID,Day,Time,Q,V,N
0,03]01]17,2019N11,00:00-00:05,48,80.2,2.3
1,,,00:05-00:10,87,75.1,4.2
2,,,00:10-00:15,56,78.0,2.2
3,,,00:15-00:20,62,73.6,2.7
4,,,00:20-00:25,69,75.6,3.0
5,,,00:25-00:30,65,76.0,2.6
6,,,00:30-00:35,62,77.9,2.6
7,,,00:35-00:40,63,80.0,2.2
8,,,00:40-00:45,52,79.5,2.0
9,,,00:45-00:50,59,73.9,2.6
10,,,00:50-00:55,67,77.4,2.8
11,,,00:55-01:00,49,74.2,1.9
12,03]01]16,2019N11,00:00-00:05,65,78.8,2.8
13,,,00:05-00:10,102,77.2,4.8
14,,,00:10-00:15,62,81.2,2.3
15,,,00:15-00:20,79,76.9,3.3
16,,,00:20-00:25,84,77.4,3.6
17,,,00:25-00:30,69,77.2,2.7
18,,,00:30-00:35,77,79.4,3.2
19,,,00:35-00:40,82,76.6,3.2
20,,,00:40-00:45,66,81.2,2.2
21,,,00:45-00:50,73,78.9,2.9
22,,,00:50-00:55,89,78.0,3.4
23,,,00:55-01:00,75,76.6,2.8
24,03]01]15,2019N11,00:00-00:05,67,78.6,2.9
25,,,00:05-00:10,98,76.2,4.7
26,,,00:10-00:15,66,77.2,2.7
27,,,00:15-00:20,82,74.5,3.5
28,,,00:20-00:25,81,75.4,3.5
29,,,00:25-00:30,75,76.1,3.2
30,,,00:30-00:35,82,77.9,3.4
31,,,00:35-00:40,79,78.7,3.2
32,,,00:40-00:45,69,78.9,2.5
33,,,00:45-00:50,76,76.7,3.0
34,,,00:50-00:55,87,74.9,3.4
35,,,00:55-01:00,78,73.5,3.0
36,03]01]14,2019N11,00:00-00:05,71,84.3,2.6
37,,,00:05-00:10,94,83.9,4.4
38,,,00:10-00:15,70,81.3,2.6
39,,,00:15-00:20,78,84.1,2.8
40,,,00:20-00:25,81,83.0,3.2
41,,,00:25-00:30,72,84.4,2.7
42,,,00:30-00:35,83,86.1,3.1
43,,,00:35-00:40,74,86.0,2.6
44,,,00:40-00:45,74,85.0,2.6
45,,,00:45-00:50,73,84.3,2.6
46,,,00:50-00:55,90,82.6,3.1
47,,,00:55-01:00,75,82.9,2.6
dfrequired = pd.read_clipboard(sep=',')
dfrequired
Please try this:
import pandas as pd
import numpy as np

df = pd.read_csv('file.csv')
df = df.drop('Unnamed: 0', axis=1)

# Day label, first ID and the Time column from the raw layout
DAY = df.iloc[2, 0]
ID = df.iloc[0, 2]
TIME = df.iloc[2:, 1]

result_df = pd.DataFrame()
i = 0
for n in range(2, df.shape[1], 4):  # a Q/V/N block starts every 4 columns
    if i == 0:
        first_col = n
        last_col = n + 3
        temp_df = df.iloc[:, first_col:last_col]
        temp_df = temp_df.iloc[2:, :]  # drop the ID row and the blank row
        temp_df.insert(0, 'ID', np.nan)
        temp_df.iloc[0, 0] = ID
        temp_df.insert(1, 'Day', np.nan)
        temp_df.iloc[0, 1] = DAY
        temp_df.insert(2, 'Time', np.nan)
        temp_df.iloc[:, 2] = TIME
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        result_df = pd.concat([result_df, temp_df])
        i += 1
    else:
        first_col = n
        last_col = n + 3
        temp_df = df.iloc[:, first_col:last_col]
        ID = temp_df.iloc[0, 0]  # this block's ID sits in its first row
        temp_df = temp_df.iloc[2:, :]
        temp_df.insert(0, 'ID', np.nan)
        temp_df.iloc[0, 0] = ID
        temp_df.insert(1, 'Day', np.nan)
        temp_df.iloc[0, 1] = DAY
        temp_df.insert(2, 'Time', np.nan)
        temp_df.iloc[:, 2] = TIME
        temp_df.columns = result_df.columns
        result_df = pd.concat([result_df, temp_df])
result_df = result_df.reset_index(drop=True)
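As a side note (not part of the original answer): the same block-by-block reshape can be written more compactly by collecting the pieces in a list and concatenating once at the end, which avoids repeatedly copying the growing frame. A minimal sketch under the same assumptions about the layout (ID in the first data row, measurements from the third row on); unlike the required format, it repeats the ID and Day on every row:

import pandas as pd

df = pd.read_csv('file.csv').drop('Unnamed: 0', axis=1)
day = df.iloc[2, 0]                            # '2019N11'
time = df.iloc[2:, 1].reset_index(drop=True)   # the shared Time column

blocks = []
for n in range(2, df.shape[1], 4):             # a Q/V/N block every 4 columns
    block = df.iloc[2:, n:n + 3].reset_index(drop=True)
    block.columns = ['Q', 'V', 'N']
    block.insert(0, 'ID', df.iloc[0, n])       # the block's ID sits in row 0
    block.insert(1, 'Day', day)
    block.insert(2, 'Time', time)
    blocks.append(block)

result_df = pd.concat(blocks, ignore_index=True)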
OK, there we go. First of all, I created a random sample file that looks just like yours:
(link here: https://drive.google.com/file/d/121li6T5OfSlZ12-HrxyP_Thd4NIuu1jK/view?usp=sharing)
Then I loaded it as a dataframe:
import pandas as pd
import numpy as np
df = pd.read_csv('database.csv')
My approach was to create lists and group them as a new dataframe:
# Column positions of every Q column (Q, Q.1, Q.2, ...)
Q_index = [i for i in range(len(df.columns)) if 'Q' in df.columns[i]]

# One ID followed by 11 NaNs per block (12 time rows per ID); ffill fills them below
id_list = [i for sub_list in [[id] + list(np.ones(11) * np.nan) for id in df.iloc[0].dropna()] for i in sub_list]
day_list = df.loc[2, 'Day']
time_list = (df['Time'][2:]).tolist() * len(Q_index)

# Flatten the Q, V and N blocks (V and N sit right after each Q column)
Q_list = [i for sub_list in [df.iloc[2:, i].tolist() for i in Q_index] for i in sub_list]
V_list = [i for sub_list in [df.iloc[2:, i + 1].tolist() for i in Q_index] for i in sub_list]
N_list = [i for sub_list in [df.iloc[2:, i + 2].tolist() for i in Q_index] for i in sub_list]

result_df = pd.DataFrame({'ID':   id_list,
                          'Day':  day_list,
                          'Time': time_list,
                          'Q':    Q_list,
                          'V':    V_list,
                          'N':    N_list}).ffill()  # fillna(method='ffill') is deprecated
Output:
result_df
ID Day Time Q V N
0 3]01]17 2019N11 00:00-00:05 91.9 70.0 3.0
1 3]01]17 2019N11 00:05-00:10 92.7 80.1 4.0
2 3]01]17 2019N11 00:10-00:15 68.3 86.8 3.2
3 3]01]17 2019N11 00:15-00:20 40.2 74.5 4.4
4 3]01]17 2019N11 00:20-00:25 81.4 74.3 3.3
5 3]01]17 2019N11 00:25-00:30 45.2 85.0 4.8
6 3]01]17 2019N11 00:30-00:35 92.3 82.3 3.6
7 3]01]17 2019N11 00:35-00:40 78.7 81.2 3.0
8 3]01]17 2019N11 00:40-00:45 88.8 86.2 2.0
9 3]01]17 2019N11 00:45-00:50 75.4 79.9 4.5
10 3]01]17 2019N11 00:50-00:55 53.0 73.6 3.2
11 3]01]17 2019N11 00:55-01:00 58.9 82.7 4.4
12 3]01]16 2019N11 00:00-00:05 62.9 77.1 3.1
13 3]01]16 2019N11 00:05-00:10 52.2 78.7 2.0
14 3]01]16 2019N11 00:10-00:15 52.0 79.0 4.7
15 3]01]16 2019N11 00:15-00:20 77.6 85.3 4.4
16 3]01]16 2019N11 00:20-00:25 57.8 84.0 5.0
17 3]01]16 2019N11 00:25-00:30 47.9 77.0 3.1
18 3]01]16 2019N11 00:30-00:35 62.4 84.5 3.2
19 3]01]16 2019N11 00:35-00:40 84.5 83.4 5.0
20 3]01]16 2019N11 00:40-00:45 56.6 88.6 2.5
21 3]01]16 2019N11 00:45-00:50 47.9 84.7 4.8
22 3]01]16 2019N11 00:50-00:55 92.5 77.8 3.7
23 3]01]16 2019N11 00:55-01:00 60.6 75.0 4.5
24 3]01]15 2019N11 00:00-00:05 51.8 86.3 4.4
25 3]01]15 2019N11 00:05-00:10 52.9 83.6 5.0
26 3]01]15 2019N11 00:10-00:15 52.5 85.4 3.4
27 3]01]15 2019N11 00:15-00:20 46.1 81.2 2.3
28 3]01]15 2019N11 00:20-00:25 65.1 70.9 4.7
29 3]01]15 2019N11 00:25-00:30 65.2 77.6 2.6
30 3]01]15 2019N11 00:30-00:35 67.1 84.2 4.1
31 3]01]15 2019N11 00:35-00:40 42.2 82.2 3.3
32 3]01]15 2019N11 00:40-00:45 71.5 79.8 2.4
33 3]01]15 2019N11 00:45-00:50 65.1 72.3 2.9
34 3]01]15 2019N11 00:50-00:55 86.0 80.3 3.9
35 3]01]15 2019N11 00:55-01:00 92.8 85.9 4.1
36 3]01]14 2019N11 00:00-00:05 53.2 82.4 3.1
37 3]01]14 2019N11 00:05-00:10 98.0 76.0 3.5
38 3]01]14 2019N11 00:10-00:15 58.9 88.3 4.4
39 3]01]14 2019N11 00:15-00:20 95.3 85.1 3.2
40 3]01]14 2019N11 00:20-00:25 45.7 74.0 3.5
41 3]01]14 2019N11 00:25-00:30 48.6 89.7 4.8
42 3]01]14 2019N11 00:30-00:35 94.6 79.5 2.1
43 3]01]14 2019N11 00:35-00:40 71.8 73.0 3.8
44 3]01]14 2019N11 00:40-00:45 92.5 83.1 2.0
45 3]01]14 2019N11 00:45-00:50 70.3 79.4 4.2
46 3]01]14 2019N11 00:50-00:55 83.6 82.6 2.8
47 3]01]14 2019N11 00:55-01:00 56.2 89.1 2.6
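One caveat with the list-based answer above (my note, not the answerer's): the hard-coded 11 in id_list ties the code to exactly twelve time rows per ID. If the number of rows per block can vary, the block length can be derived from the data instead, for example:

n_rows = len(df) - 2  # data rows per ID block (everything below the ID and blank rows)
id_list = [i for sub_list in [[id] + [np.nan] * (n_rows - 1) for id in df.iloc[0].dropna()]
           for i in sub_list]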
Related
How to use beautifulsoup to scrape a certain table and turn into pandas dataframe?
How would I use bs4 to get the "Per Game Stats" table on here and turn it into a pandas dataframe? I have already tried:

url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
page = requests.get(url)
page
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

and am stuck from there. Thanks.
Use pd.read_html:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table', id='per_game-team')
df = pd.read_html(str(table))[0]

The table you want has the id 'per_game-team'. Use the inspector from your browser's developer tools to find it.

Output:

>>> df.head(10)
     Rk                     Team   G     MP  ...  BLK   TOV    PF    PTS
0   1.0         Milwaukee Bucks*  72  240.7  ...  4.6  13.8  17.3  120.1
1   2.0           Brooklyn Nets*  72  241.7  ...  5.3  13.5  19.0  118.6
2   3.0      Washington Wizards*  72  241.7  ...  4.1  14.4  21.6  116.6
3   4.0               Utah Jazz*  72  241.0  ...  5.2  14.2  18.5  116.4
4   5.0  Portland Trail Blazers*  72  240.3  ...  5.0  11.1  18.9  116.1
5   6.0            Phoenix Suns*  72  242.8  ...  4.3  12.5  19.1  115.3
6   7.0           Indiana Pacers  72  242.4  ...  6.4  13.5  20.2  115.3
7   8.0          Denver Nuggets*  72  242.8  ...  4.5  13.5  19.1  115.1
8   9.0     New Orleans Pelicans  72  242.1  ...  4.4  14.6  18.0  114.6
9  10.0    Los Angeles Clippers*  72  240.0  ...  4.1  13.2  19.2  114.0

[10 rows x 25 columns]
pandas's .read_html() is the way to go here (it uses BeautifulSoup under the hood). And since it already incorporates requests, you can actually simplify the solution Corral provided to simply:

import pandas as pd

url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
df = pd.read_html(url, attrs={'id': 'per_game-team'})[0]

But since you are specifically asking how to convert to a dataframe with bs4, I'll provide that solution too. The basic logic/steps are:

1. Get the table tag.
2. From the table object, get the header names from the <th> tags under the <thead> tag.
3. Iterate through the rows (<tr> tags) and get the <td> content from each row.

Code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

table = soup.find('table', {'id': 'per_game-team'})
headers = [x.text for x in table.find('thead').find_all('th')]

data = []
table_body_rows = table.find('tbody').find_all('tr')
for row in table_body_rows:
    rank = [row.find('th').text]
    row_data = rank + [x.text for x in row.find_all('td')]
    data.append(row_data)

df = pd.DataFrame(data, columns=headers)

Output:

print(df)
    Rk                     Team   G     MP    FG  ...  STL  BLK   TOV    PF    PTS
0    1         Milwaukee Bucks*  72  240.7  44.7  ...  8.1  4.6  13.8  17.3  120.1
1    2           Brooklyn Nets*  72  241.7  43.1  ...  6.7  5.3  13.5  19.0  118.6
2    3      Washington Wizards*  72  241.7  43.2  ...  7.3  4.1  14.4  21.6  116.6
3    4               Utah Jazz*  72  241.0  41.3  ...  6.6  5.2  14.2  18.5  116.4
4    5  Portland Trail Blazers*  72  240.3  41.3  ...  6.9  5.0  11.1  18.9  116.1
5    6            Phoenix Suns*  72  242.8  43.3  ...  7.2  4.3  12.5  19.1  115.3
6    7           Indiana Pacers  72  242.4  43.3  ...  8.5  6.4  13.5  20.2  115.3
7    8          Denver Nuggets*  72  242.8  43.3  ...  8.1  4.5  13.5  19.1  115.1
8    9     New Orleans Pelicans  72  242.1  42.5  ...  7.6  4.4  14.6  18.0  114.6
9   10    Los Angeles Clippers*  72  240.0  41.8  ...  7.1  4.1  13.2  19.2  114.0
10  11           Atlanta Hawks*  72  241.7  40.8  ...  7.0  4.8  13.2  19.3  113.7
11  12         Sacramento Kings  72  240.3  42.6  ...  7.5  5.0  13.4  19.4  113.7
12  13    Golden State Warriors  72  240.3  41.3  ...  8.2  4.8  15.0  21.2  113.7
13  14      Philadelphia 76ers*  72  242.1  41.4  ...  9.1  6.2  14.4  20.2  113.6
14  15       Memphis Grizzlies*  72  241.7  42.8  ...  9.1  5.1  13.3  18.7  113.3
15  16          Boston Celtics*  72  241.4  41.5  ...  7.7  5.3  14.1  20.4  112.6
16  17        Dallas Mavericks*  72  240.3  41.1  ...  6.3  4.3  12.1  19.4  112.4
17  18   Minnesota Timberwolves  72  241.7  40.7  ...  8.8  5.5  14.3  20.9  112.1
18  19          Toronto Raptors  72  240.3  39.7  ...  8.6  5.4  13.2  21.2  111.3
19  20        San Antonio Spurs  72  242.8  41.9  ...  7.0  5.1  11.4  18.0  111.1
20  21            Chicago Bulls  72  241.4  42.2  ...  6.7  4.2  15.1  18.9  110.7
21  22      Los Angeles Lakers*  72  242.4  40.6  ...  7.8  5.4  15.2  19.1  109.5
22  23        Charlotte Hornets  72  241.0  39.9  ...  7.8  4.8  14.8  18.0  109.5
23  24          Houston Rockets  72  240.3  39.3  ...  7.6  5.0  14.7  19.5  108.8
24  25              Miami Heat*  72  241.4  39.2  ...  7.9  4.0  14.1  18.9  108.1
25  26         New York Knicks*  72  242.1  39.4  ...  7.0  5.1  12.9  20.5  107.0
26  27          Detroit Pistons  72  242.1  38.7  ...  7.4  5.2  14.9  20.5  106.6
27  28    Oklahoma City Thunder  72  241.0  38.8  ...  7.0  4.4  16.1  18.1  105.0
28  29            Orlando Magic  72  240.7  38.3  ...  6.9  4.4  12.8  17.2  104.0
29  30      Cleveland Cavaliers  72  242.1  38.6  ...  7.8  4.5  15.5  18.2  103.8

[30 rows x 25 columns]
Transforming yearwise data using pandas
I have a dataframe that looks like this:

            Temp
Date
1981-01-01  20.7
1981-01-02  17.9
1981-01-03  18.8
1981-01-04  14.6
1981-01-05  15.8
...          ...
1981-12-27  15.5
1981-12-28  13.3
1981-12-29  15.6
1981-12-30  15.2
1981-12-31  17.4

365 rows × 1 columns

And I want to transform it so that it looks like:

     1981  1982  1983  1984  1985  1986  1987  1988  1989  1990
0    20.7  17.0  18.4  19.5  13.3  12.9  12.3  15.3  14.3  14.8
1    17.9  15.0  15.0  17.1  15.2  13.8  13.8  14.3  17.4  13.3
2    18.8  13.5  10.9  17.1  13.1  10.6  15.3  13.5  18.5  15.6
3    14.6  15.2  11.4  12.0  12.7  12.6  15.6  15.0  16.8  14.5
4    15.8  13.0  14.8  11.0  14.6  13.7  16.2  13.6  11.5  14.3
..    ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
360  15.5  15.3  13.9  12.2  11.5  14.6  16.2   9.5  13.3  14.0
361  13.3  16.3  11.1  12.0  10.8  14.2  14.2  12.9  11.7  13.6
362  15.6  15.8  16.1  12.6  12.0  13.2  14.3  12.9  10.4  13.5
363  15.2  17.7  20.4  16.0  16.3  11.7  13.3  14.8  14.4  15.7
364  17.4  16.3  18.0  16.4  14.4  17.2  16.7  14.1  12.7  13.0

My attempt:

groups = df.groupby(df.index.year)
keys = groups.groups.keys()
years = pd.DataFrame()
for key in keys:
    years[key] = groups.get_group(key)['Temp'].values

Question: The above code is giving me my desired output, but is there a more efficient way of transforming this? As I can't post the whole data (there are 3650 rows in the dataframe), you can download the CSV file (60.6 kB) for testing from here.
Try grabbing the year and dayofyear from the index, then pivoting:

import pandas as pd
import numpy as np

# Create Random Data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1982-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape), index=dr, columns=['Temp'])

# Get Year and Day of Year
df['year'] = df.index.year
df['day'] = df.index.dayofyear

# Pivot
p = df.pivot(index='day', columns='year', values='Temp')
print(p)

p:

year  1981  1982
day
1       38    85
2       51    70
3       76    61
4       71    47
5       44    76
..     ...   ...
361     23    22
362     42    64
363     84    22
364     26    56
365     67    73

Run-Time via Timeit:

import timeit

setup = '''
import pandas as pd
import numpy as np

# Create Random Data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1983-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape), index=dr, columns=['Temp'])'''

pivot = '''
df['year'] = df.index.year
df['day'] = df.index.dayofyear
p = df.pivot(index='day', columns='year', values='Temp')'''

groupby_for = '''
groups = df.groupby(df.index.year)
keys = groups.groups.keys()
years = pd.DataFrame()
for key in keys:
    years[key] = groups.get_group(key)['Temp'].values'''

if __name__ == '__main__':
    print("Pivot")
    print(timeit.timeit(setup=setup, stmt=pivot, number=1000))
    print("Groupby For")
    print(timeit.timeit(setup=setup, stmt=groupby_for, number=1000))

Pivot
1.598973
Groupby For
2.3967995999999996

*Additional note: the groupby-for option will not work for leap years, as it cannot handle 1984 being 366 days instead of 365. Pivot will work regardless.
Opening and Closing Tags are Removed from html When Using BeautifulSoup
I am running into an issue when using BeautifulSoup to scrape data off of www.basketball-reference.com. I've used BeautifulSoup on Bballreference before, so I am a little stumped as to what is happening (granted, I am a pretty huge noob, so please bear with me). I am trying to scrape team season stats off of https://www.basketball-reference.com/leagues/NBA_2020.html and am running into trouble from the very start:

from bs4 import BeautifulSoup
import requests

web_response = requests.get('https://www.basketball-reference.com/leagues/NBA_2020.html').text
soup = BeautifulSoup(web_response, 'lxml')
table = soup.find('table', id='team-stats-per_game')
print(table)

This shows that the finding of the table in question was unsuccessful, even though I can clearly locate that tag when inspecting the web page. Okay... no biggie so far (usually these errors are on my end), so I instead just print out the whole soup:

soup = BeautifulSoup(web_response, 'lxml')
print(soup)

I copy and paste that into https://codebeautify.org/htmlviewer/ to get a better view than from the terminal, and I see that it does not look how I would expect. Essentially the meta tags are fine, but everything else appears to have lost its opening and closing tags, turning my soup into an actual soup... Again, no biggie (I'm still pretty sure it is something I am doing), so I go and grab the HTML from a simple blog site, print it, and paste it into codebeautify, and lo and behold it looks normal. Now I have a suspicion that something is occurring on basketball-reference's side that is obscuring my ability to even grab the HTML. My question is this: what exactly is going on here? I am assuming there's an 80% chance it is still me, but the 20% is not so sure at this point. Can someone point out what I am doing wrong, or how to grab the HTML?
The data is stored within the page, but inside an HTML comment. To parse it, you can do, for example:

import requests
from bs4 import BeautifulSoup, Comment

web_response = requests.get('https://www.basketball-reference.com/leagues/NBA_2020.html').text

soup = BeautifulSoup(web_response, 'lxml')
table = soup.find('table', id='team-stats-per_game')

# find the comment section where the data is stored
for idx, c in enumerate(soup.select_one('div#all_team-stats-per_game').contents):
    if isinstance(c, Comment):
        break

# load the data from the comment:
soup2 = BeautifulSoup(soup.select_one('div#all_team-stats-per_game').contents[idx], 'html.parser')

# print data:
for tr in soup2.select('tr:has(td)'):
    tds = tr.select('td')
    for td in tds:
        print(td.get_text(strip=True), end='\t')
    print()

Prints:

Dallas Mavericks 67 241.5 41.6 90.0 .462 15.3 41.5 .369 26.3 48.5 .542 17.9 23.1 .773 10.6 36.4 47.0 24.5 6.3 5.0 12.8 19.0 116.4
Milwaukee Bucks* 65 240.8 43.5 91.2 .477 13.7 38.6 .356 29.8 52.6 .567 17.8 24.0 .742 9.5 42.2 51.7 25.9 7.4 6.0 14.9 19.2 118.6
Houston Rockets 64 241.2 41.1 90.7 .454 15.4 44.3 .348 25.7 46.4 .554 20.5 26.0 .787 10.4 34.6 44.9 21.5 8.5 5.1 14.7 21.6 118.1
Portland Trail Blazers 66 240.8 41.9 90.9 .461 12.6 33.8 .372 29.3 57.1 .513 17.3 21.7 .798 10.1 35.4 45.5 20.2 6.1 6.2 13.0 21.4 113.6
Atlanta Hawks 67 243.0 40.6 90.6 .449 12.0 36.1 .333 28.6 54.5 .525 18.5 23.4 .790 9.9 33.4 43.3 24.0 7.8 5.1 16.2 23.1 111.8
New Orleans Pelicans 64 242.3 42.6 92.2 .462 14.0 37.6 .372 28.6 54.6 .525 16.9 23.2 .729 11.2 35.8 47.0 27.0 7.6 5.1 16.2 21.0 116.2
Los Angeles Clippers 64 241.2 41.6 89.7 .464 12.2 33.2 .366 29.5 56.5 .522 20.8 26.2 .792 11.0 37.0 48.0 23.8 7.1 5.0 14.8 22.0 116.2
Washington Wizards 64 241.2 41.9 91.0 .461 12.3 33.1 .372 29.6 57.9 .511 19.5 24.8 .787 10.1 31.6 41.7 25.3 8.1 4.3 14.1 22.6 115.6
Memphis Grizzlies 65 240.4 42.8 91.0 .470 10.9 31.1 .352 31.8 59.9 .531 16.2 21.3 .761 10.4 36.3 46.7 27.0 8.0 5.6 15.3 20.8 112.6
Phoenix Suns 65 241.2 40.8 87.8 .464 11.2 31.7 .353 29.6 56.1 .527 19.8 24.0 .826 9.8 33.3 43.1 27.2 7.8 4.0 15.1 22.1 112.6
Miami Heat 65 243.5 39.6 84.4 .470 13.4 34.8 .383 26.3 49.6 .530 19.5 25.1 .778 8.5 36.0 44.5 26.0 7.4 4.5 14.9 20.4 112.2
Minnesota Timberwolves 64 243.1 40.4 91.6 .441 13.3 39.7 .336 27.1 52.0 .521 19.1 25.4 .753 10.5 34.3 44.8 23.8 8.7 5.7 15.3 21.4 113.3
Boston Celtics* 64 242.0 41.2 89.6 .459 12.4 34.2 .363 28.8 55.4 .519 18.3 22.8 .801 10.7 35.3 46.0 22.8 8.3 5.6 13.6 21.4 113.0
Toronto Raptors* 64 241.6 40.6 88.5 .458 13.8 37.0 .371 26.8 51.5 .521 18.1 22.6 .800 9.7 35.5 45.2 25.4 8.8 4.9 14.4 21.5 113.0
Los Angeles Lakers* 63 240.8 42.9 88.6 .485 11.2 31.4 .355 31.8 57.1 .556 17.3 23.7 .730 10.6 35.5 46.1 25.9 8.6 6.8 15.1 20.6 114.3
Denver Nuggets 65 242.3 41.8 88.9 .471 10.9 30.4 .358 31.0 58.5 .529 15.9 20.5 .775 10.8 33.5 44.3 26.5 8.1 4.6 13.7 20.0 110.4
San Antonio Spurs 63 242.8 42.0 89.5 .470 10.7 28.7 .371 31.4 60.8 .517 18.4 22.8 .809 8.8 35.6 44.4 24.5 7.2 5.5 12.3 19.2 113.2
Philadelphia 76ers 65 241.2 40.8 87.7 .465 11.4 31.6 .362 29.4 56.1 .523 16.6 22.1 .752 10.4 35.1 45.5 25.9 8.2 5.4 14.2 20.6 109.6
Indiana Pacers 65 241.5 42.2 88.4 .477 10.0 27.5 .363 32.2 60.9 .529 15.1 19.1 .787 8.8 34.0 42.8 25.9 7.2 5.1 13.1 19.6 109.3
Utah Jazz 64 240.4 40.1 84.6 .475 13.2 34.4 .383 27.0 50.2 .537 17.6 22.8 .772 8.8 36.3 45.1 22.2 5.9 4.0 14.9 20.0 111.0
Oklahoma City Thunder 64 241.6 40.3 85.1 .473 10.4 29.3 .355 29.9 55.8 .536 19.8 24.8 .797 8.1 34.6 42.7 21.9 7.6 5.0 13.5 18.8 110.8
Brooklyn Nets 64 243.1 40.0 90.0 .444 12.9 37.9 .340 27.1 52.2 .519 18.0 24.1 .744 10.8 37.6 48.5 24.0 6.5 4.6 15.5 20.7 110.8
Detroit Pistons 66 241.9 39.3 85.7 .459 12.0 32.7 .367 27.3 53.0 .515 16.6 22.4 .743 9.8 32.0 41.7 24.1 7.4 4.5 15.3 19.7 107.2
New York Knicks 66 241.9 40.0 89.3 .447 9.6 28.4 .337 30.4 61.0 .499 16.3 23.5 .694 12.0 34.5 46.5 22.1 7.6 4.7 14.3 22.2 105.8
Sacramento Kings 64 242.3 40.4 87.8 .459 12.6 34.7 .364 27.7 53.2 .522 15.6 20.3 .769 9.6 32.9 42.5 23.4 7.6 4.2 14.4 21.9 109.0
Cleveland Cavaliers 65 241.9 40.3 87.9 .458 11.2 31.8 .351 29.1 56.1 .519 15.1 19.9 .758 10.8 33.4 44.2 23.1 6.9 3.2 16.5 18.3 106.9
Chicago Bulls 65 241.2 39.6 88.6 .447 12.2 35.1 .348 27.4 53.5 .511 15.5 20.5 .755 10.5 31.4 41.9 23.2 10.0 4.1 15.5 21.8 106.8
Orlando Magic 65 240.4 39.2 88.8 .442 10.9 32.0 .341 28.3 56.8 .498 17.0 22.1 .770 10.4 34.2 44.5 24.0 8.4 5.7 12.6 17.6 106.4
Golden State Warriors 65 241.9 38.6 88.2 .438 10.4 31.3 .334 28.2 56.9 .495 18.7 23.2 .803 10.0 32.9 42.8 25.6 8.2 4.6 14.9 20.1 106.3
Charlotte Hornets 65 242.3 37.3 85.9 .434 12.1 34.3 .352 25.2 51.6 .489 16.2 21.6 .748 11.0 31.8 42.8 23.8 6.6 4.1 14.6 18.8 102.9
League Average 65 241.7 40.8 88.8 .460 12.1 33.9 .357 28.7 54.9 .523 17.7 22.9 .771 10.1 34.7 44.9 24.3 7.7 4.9 14.5 20.6 111.4
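A shorter variant of the same idea (my sketch, not from the original answer): since the table markup is only hidden inside an HTML comment, you can strip the comment markers and let pandas parse the table directly. This assumes the commented-out markup is well-formed once un-commented:

import requests
import pandas as pd

html = requests.get('https://www.basketball-reference.com/leagues/NBA_2020.html').text
# un-comment the hidden markup, then parse the table by its id
df = pd.read_html(html.replace('<!--', '').replace('-->', ''),
                  attrs={'id': 'team-stats-per_game'})[0]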
Not able to read txt file without comma separator in pandas python
CODE:

import pandas
df = pandas.read_csv('biharpopulation.txt', delim_whitespace=True)
df.columns = ['SlNo', 'District', 'Total', 'Male', 'Female',
              'Total', 'Male', 'Female', 'SC', 'ST', 'SC', 'ST']

DATA:

SlNo District Total Male Female Total Male Female SC ST SC ST
1 Patna 729988 386991 342997 9236 5352 3884 15.5 0.2 38.6 68.7
2 Nalanda 473786 248246 225540 970 524 446 20.2 0.0 29.4 29.8
3 Bhojpur 343598 181372 162226 8337 4457 3880 15.3 0.4 39.1 46.7
4 Buxar 198014 104761 93253 8428 4573 3855 14.1 0.6 37.9 44.6
5 Rohtas 444333 233512 210821 25663 13479 12184 18.1 1.0 41.3 30.0
6 Kaimur 286291 151031 135260 35662 18639 17023 22.2 2.8 40.5 38.6
7 Gaya 1029675 529230 500445 2945 1526 1419 29.6 0.1 26.3 49.1
8 Jehanabad 174738 90485 84253 1019 530 489 18.9 0.07 32.6 32.4
9 Arawal 11479 57677 53802 294 179 115 18.8 0.04
10 Nawada 435975 223929 212046 2158 1123 1035 24.1 0.1 22.4 20.5
11 Aurangabad 472766 244761 228005 1640 865 775 23.5 0.1 35.7 49.7
Saran
12 Saran 389933 199772 190161 6667 3384 3283 12 0.2 33.6 48.5
13 Siwan 309013 153558 155455 13822 6856 6966 11.4 0.5 35.6 44.0
14 Gopalganj 267250 134796 132454 6157 2984 3173 12.4 0.3 32.1 37.8
15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.1 28.9 50.4
16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1
18 Sitamarhi 315646 166607 149039 1786 952 834 11.8 0.1 22.1 31.4
19 Sheohar 74391 39405 34986 64 35 29 14.4 0.0 16.9 38.8
20 Vaishali 562123 292711 269412 3068 1595 1473 20.7 0.1 29.4 29.9
21 Darbhanga 511125 266236 244889 841 467 374 15.5 0.0 24.7 49.5
22 Madhubani 481922 248774 233148 1260 647 613 13.5 0.0 22.2 35.8
23 Samastipur 628838 325101 303737 3362 2724 638 18.5 0.1 25.1 22.0
24 Munger 150947 80031 70916 18060 9297 8763 13.3 1.6 42.6 37.3
25 Begusarai 341173 177897 163276 1505 823 682 14.5 0.1 31.4 78.6
26 Shekhapura 103732 54327 49405 211 115 96 19.7 0.0 25.2 45.6
27 Lakhisarai 126575 65781 60794 5636 2918 2718 15.8 0.7 26.8 12.9
28 Jamui 242710 124538 118172 67357 34689 32668 17.4 4.8 24.5 26.7
The issue is with these 2 lines:

16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1

If you can somehow remove the space between E. Champaran and W. Champaran, then you can do this:

df = pd.read_csv('test.csv', sep=r'\s+', skip_blank_lines=True, skipinitialspace=True)
print(df)

    SlNo     District    Total    Male  Female  Total.1  Male.1  Female.1    SC    ST  SC.1  ST.1
0      1        Patna   729988  386991  342997     9236    5352      3884  15.5  0.20  38.6  68.7
1      2      Nalanda   473786  248246  225540      970     524       446  20.2  0.00  29.4  29.8
2      3      Bhojpur   343598  181372  162226     8337    4457      3880  15.3  0.40  39.1  46.7
3      4        Buxar   198014  104761   93253     8428    4573      3855  14.1  0.60  37.9  44.6
4      5       Rohtas   444333  233512  210821    25663   13479     12184  18.1  1.00  41.3  30.0
5      6       Kaimur   286291  151031  135260    35662   18639     17023  22.2  2.80  40.5  38.6
6      7         Gaya  1029675  529230  500445     2945    1526      1419  29.6  0.10  26.3  49.1
7      8    Jehanabad   174738   90485   84253     1019     530       489  18.9  0.07  32.6  32.4
8      9       Arawal    11479   57677   53802      294     179       115  18.8  0.04   NaN   NaN
9     10       Nawada   435975  223929  212046     2158    1123      1035  24.1  0.10  22.4  20.5
10    11   Aurangabad   472766  244761  228005     1640     865       775  23.5  0.10  35.7  49.7
11    12        Saran   389933  199772  190161     6667    3384      3283  12.0  0.20  33.6  48.5
12    13        Siwan   309013  153558  155455    13822    6856      6966  11.4  0.50  35.6  44.0
13    14    Gopalganj   267250  134796  132454     6157    2984      3173  12.4  0.30  32.1  37.8
14    15  Muzaffarpur   594577  308894  285683     3472    1789      1683  15.9  0.10  28.9  50.4
15    16  E.Champaran   514119  270968  243151     4812    2518      2294  13.0  0.10  20.6  34.3
16    17  W.Champaran   434714  228057  206657    44912   23135     21777  14.3  1.50  22.3  24.1
17    18    Sitamarhi   315646  166607  149039     1786     952       834  11.8  0.10  22.1  31.4
18    19      Sheohar    74391   39405   34986       64      35        29  14.4  0.00  16.9  38.8
19    20     Vaishali   562123  292711  269412     3068    1595      1473  20.7  0.10  29.4  29.9
20    21    Darbhanga   511125  266236  244889      841     467       374  15.5  0.00  24.7  49.5
21    22    Madhubani   481922  248774  233148     1260     647       613  13.5  0.00  22.2  35.8
22    23   Samastipur   628838  325101  303737     3362    2724       638  18.5  0.10  25.1  22.0
23    24       Munger   150947   80031   70916    18060    9297      8763  13.3  1.60  42.6  37.3
24    25    Begusarai   341173  177897  163276     1505     823       682  14.5  0.10  31.4  78.6
25    26   Shekhapura   103732   54327   49405      211     115        96  19.7  0.00  25.2  45.6
26    27   Lakhisarai   126575   65781   60794     5636    2918      2718  15.8  0.70  26.8  12.9
27    28        Jamui   242710  124538  118172    67357   34689     32668  17.4  4.80  24.5  26.7
Your problem is that the CSV is whitespace-delimited, but some of your district names also have whitespace in them. Luckily, none of the district names contain '\t' characters, so we can fix this:

df = pandas.read_csv('biharpopulation.txt', delimiter='\t')
Pandas/Python: interpolation of multiple columns based on values specified for one reference column
df
Out[1]:
     PRES  HGHT  TEMP  DWPT  RELH   MIXR  DRCT  SKNT   THTA   THTE   THTV
0   978.0   345  17.0  16.5    97  12.22     0     0  292.0  326.8  294.1
1   977.0   354  17.8  16.7    93  12.39     1     0  292.9  328.3  295.1
2   970.0   416  23.4  15.4    61  11.47     4     2  299.1  332.9  301.2
3   963.0   479  24.0  14.0    54  10.54     8     3  300.4  331.6  302.3
4   948.7   610  23.0  13.4    55  10.28    15     6  300.7  331.2  302.5
5   925.0   830  21.4  12.4    56   9.87    20     5  301.2  330.6  303.0
6   916.0   914  20.7  11.7    56   9.51    20     4  301.3  329.7  303.0
7   884.0  1219  18.2   9.2    56   8.31    60     4  301.8  326.7  303.3
8   853.1  1524  15.7   6.7    55   7.24    35     3  302.2  324.1  303.5
9   850.0  1555  15.4   6.4    55   7.14    20     2  302.3  323.9  303.6
10  822.8  1829  13.3   5.6    60   6.98   300     4  302.9  324.0  304.1

How do I interpolate the values of all the columns at specified PRES (pressure) values, say PRES = [950, 900, 875]? Is there an elegant, pandas-native way to do this? The only way I can think of is to first insert empty (NaN) rows for each specified PRES value in a loop, then set PRES as the index and use pandas' native interpolate option:

df.interpolate(method='index', inplace=True)

Is there a more elegant solution?
Use your solution with no loop - reindex by the union of the original index values with the PRES list. This works only if all values are unique:

PRES = [950, 900, 875]
df = df.set_index('PRES')
df = df.reindex(df.index.union(PRES)).sort_index(ascending=False).interpolate(method='index')
print(df)

          HGHT  TEMP  DWPT  RELH   MIXR   DRCT  SKNT   THTA   THTE   THTV
978.0    345.0  17.0  16.5  97.0  12.22    0.0   0.0  292.0  326.8  294.1
977.0    354.0  17.8  16.7  93.0  12.39    1.0   0.0  292.9  328.3  295.1
970.0    416.0  23.4  15.4  61.0  11.47    4.0   2.0  299.1  332.9  301.2
963.0    479.0  24.0  14.0  54.0  10.54    8.0   3.0  300.4  331.6  302.3
950.0   1829.0  13.3   5.6  60.0   6.98  300.0   4.0  302.9  324.0  304.1
948.7    610.0  23.0  13.4  55.0  10.28   15.0   6.0  300.7  331.2  302.5
925.0    830.0  21.4  12.4  56.0   9.87   20.0   5.0  301.2  330.6  303.0
916.0    914.0  20.7  11.7  56.0   9.51   20.0   4.0  301.3  329.7  303.0
900.0   1829.0  13.3   5.6  60.0   6.98  300.0   4.0  302.9  324.0  304.1
884.0   1219.0  18.2   9.2  56.0   8.31   60.0   4.0  301.8  326.7  303.3
875.0   1829.0  13.3   5.6  60.0   6.98  300.0   4.0  302.9  324.0  304.1
853.1   1524.0  15.7   6.7  55.0   7.24   35.0   3.0  302.2  324.1  303.5
850.0   1555.0  15.4   6.4  55.0   7.14   20.0   2.0  302.3  323.9  303.6
822.8   1829.0  13.3   5.6  60.0   6.98  300.0   4.0  302.9  324.0  304.1

If the PRES values may not be unique, use concat with sort_index instead:

PRES = [950, 900, 875]
df = df.set_index('PRES')
df = (pd.concat([df, pd.DataFrame(index=PRES)])
        .sort_index(ascending=False)
        .interpolate(method='index'))
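If only the newly interpolated rows are needed afterwards, they can be selected back out by label (a small follow-up sketch, assuming the frame is still indexed by PRES):

interpolated = df.loc[PRES]  # just the rows for 950, 900 and 875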