We want to fetch data from our MySQL database using Python (SQLAlchemy), and we're saving the results in pandas DataFrames. So far we're receiving data, but the column names are not included; the columns are just numbered automatically. How can we include the true column names instead of just the numbers 0-5?
import pandas as pd
from pandas.io import sql
from sqlalchemy import create_engine
engine = create_engine("mysql://root:DTULab#123#localhost/Afgangsprojekt?host=localhost?port=3306")
conn = engine.connect()
result = conn.execute("SELECT * FROM Weather_Station").fetchall()
df = pd.DataFrame(result)
print(df)
Output prints the following:
0 1 2 3 4 5
0 0 2019-07-26 14:50:13 27.3 29.8 45.0 44.0
1 1 2019-07-26 15:00:13 26.9 28.3 44.0 48.0
2 2 2019-07-26 15:10:13 28.0 28.3 41.0 48.0
3 3 2019-07-26 15:20:13 27.8 28.3 39.0 48.0
4 4 2019-07-26 15:30:13 27.0 28.3 40.0 48.0
5 5 2019-07-26 15:40:13 26.8 28.3 42.0 48.0
6 6 2019-07-26 15:50:13 27.0 28.3 42.0 48.0
7 7 2019-07-26 16:00:14 26.8 27.2 42.0 41.0
8 8 2019-07-26 16:10:13 27.0 27.2 42.0 41.0
9 9 2019-07-26 16:20:13 26.8 27.2 43.0 41.0
10 10 2019-07-26 16:30:13 26.4 27.2 44.0 41.0
11 11 2019-07-26 16:40:13 27.1 27.2 42.0 41.0
12 12 2019-07-26 16:50:13 26.2 27.2 43.0 41.0
13 13 2019-07-26 17:00:14 25.6 26.6 44.0 43.0
14 14 2019-07-26 17:10:14 25.5 26.6 47.0 43.0
15 15 2019-07-26 17:20:14 25.3 26.6 49.0 43.0
16 16 2019-07-26 17:30:14 25.1 26.6 51.0 43.0
17 17 2019-07-26 17:40:14 25.6 26.6 52.0 43.0
18 18 2019-07-26 17:50:14 24.8 26.6 55.0 43.0
19 19 2019-07-26 18:00:14 24.4 25.2 57.0 51.0
20 20 2019-07-26 18:10:14 24.6 25.2 57.0 51.0
21 21 2019-07-26 18:20:14 24.4 25.2 58.0 51.0
22 22 2019-07-26 18:30:14 24.4 25.2 58.0 51.0
23 23 2019-07-26 18:40:14 24.8 25.2 57.0 51.0
24 24 2019-07-26 18:50:14 25.0 25.2 57.0 51.0
25 25 2019-07-26 19:00:15 24.9 24.7 57.0 57.0
26 26 2019-07-26 19:10:14 25.1 24.7 56.0 57.0
27 27 2019-07-26 19:20:14 25.4 24.7 49.0 57.0
28 28 2019-07-26 19:30:14 25.4 24.7 48.0 57.0
29 29 2019-07-26 19:40:13 25.4 24.7 48.0 57.0
.. ... ... ... ... ... ...
822 822 2019-08-01 07:30:13 13.7 14.0 94.0 94.0
823 823 2019-08-01 07:40:13 13.6 14.0 95.0 94.0
824 824 2019-08-01 07:50:13 13.6 14.0 97.0 94.0
825 825 2019-08-01 08:00:13 13.9 13.7 97.0 94.0
826 826 2019-08-01 08:10:13 13.8 13.7 94.0 94.0
827 827 2019-08-01 08:20:13 13.6 13.7 93.0 94.0
828 828 2019-08-01 08:30:14 13.6 13.7 92.0 94.0
829 829 2019-08-01 08:40:13 13.8 13.7 92.0 94.0
830 830 2019-08-01 08:50:13 14.0 13.7 91.0 94.0
831 831 2019-08-01 09:00:13 13.9 13.8 91.0 93.0
832 832 2019-08-01 09:10:13 13.9 13.8 90.0 93.0
833 833 2019-08-01 09:20:13 13.8 13.8 91.0 93.0
834 834 2019-08-01 09:30:13 13.6 13.8 93.0 93.0
835 835 2019-08-01 09:40:13 13.6 13.8 94.0 93.0
836 836 2019-08-01 09:50:13 13.6 13.8 94.0 93.0
837 837 2019-08-01 10:00:13 13.9 13.7 94.0 92.0
838 838 2019-08-01 10:10:13 13.9 13.7 95.0 92.0
839 839 2019-08-01 10:20:13 14.0 13.7 94.0 92.0
840 840 2019-08-01 10:30:13 14.3 13.7 95.0 92.0
841 841 2019-08-01 10:40:13 14.4 13.7 95.0 92.0
842 842 2019-08-01 10:50:13 14.6 13.7 94.0 92.0
843 843 2019-08-01 11:00:13 14.9 14.3 94.0 94.0
844 844 2019-08-01 11:10:14 15.0 14.3 93.0 94.0
845 845 2019-08-01 11:20:14 15.3 14.3 93.0 94.0
846 846 2019-08-01 11:30:14 15.5 14.3 92.0 94.0
847 847 2019-08-01 11:40:13 15.5 14.3 92.0 94.0
848 848 2019-08-01 11:50:13 15.4 14.3 85.0 94.0
849 849 2019-08-01 12:00:13 15.3 15.3 86.0 91.0
850 850 2019-08-01 12:10:13 15.3 15.3 86.0 91.0
851 851 2019-08-01 12:20:13 15.3 15.3 87.0 91.0
Try this.
To read: read_sql
To write: to_sql
import pandas as pd
from pandas.io import sql
from sqlalchemy import create_engine
engine = create_engine("mysql://root:DTULab#123#localhost/Afgangsprojekt?host=localhost?port=3306")
connection = engine.connect()
Query = "<Query Here>"
df = pd.read_sql(Query, connection)
print(df.head(50)) # For 50 Rows to be printed
You could try calling read_sql and passing it the connection, to read a SQL query or database table directly into a DataFrame: read_sql
import pandas as pd
from pandas.io import sql
from sqlalchemy import create_engine
engine = create_engine("mysql://root:DTULab#123#localhost/Afgangsprojekt?host=localhost?port=3306")
connection = engine.connect()
df = pd.read_sql("SELECT * FROM Weather_Station", connection)
print(df)
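If you'd rather keep the fetchall() approach, the result object itself carries the column names via keys(). A minimal self-contained sketch, using an in-memory SQLite database to stand in for the MySQL table (and text(), which SQLAlchemy 1.4+ requires for raw SQL strings):

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite database standing in for the MySQL Weather_Station table
engine = create_engine("sqlite://")
with engine.connect() as conn:
    conn.execute(text("CREATE TABLE Weather_Station (Id INTEGER, Temp FLOAT)"))
    conn.execute(text("INSERT INTO Weather_Station VALUES (0, 27.3), (1, 26.9)"))

    result = conn.execute(text("SELECT * FROM Weather_Station"))
    # result.keys() holds the real column names; pass them to the DataFrame
    df = pd.DataFrame(result.fetchall(), columns=list(result.keys()))
```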
Related
I wish to create a DataFrame where each row is one day, and the columns provide the date, hourly data, and maximum minimum of the day's data. Here is an example (I provide the input data further down in the question):
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.3 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
1 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 25.8 25.4 22.1 21.8 21.0 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
My input DataFrame has a row for each hour, with the date & time, mean, max, and min for each hour as its columns.
I wish to iterate through each day in the input DataFrame and do the following:
Check that there is a row for each hour of the day
Check that there is both maximum and minimum data for each hour of the day
If the conditions above are met, I wish to:
Add a row to the output DataFrame for the given date
Use the date to fill the 'Date_time' cell for the row
Transpose the hourly data to the hourly cells
Find the max of the hourly max data, and use it to fill the max cell for the row
Find the min of the hourly min data, and use it to fill the min cell for the row
Example daily input data follows.
Example 1
All hours for day available
Max & min available for each hour
Proceed to create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
0 2019-02-03 00:00:00 18.6 18.7 18.5
1 2019-02-03 01:00:00 18.6 18.7 18.5
2 2019-02-03 02:00:00 18.2 18.5 18.0
3 2019-02-03 03:00:00 18.0 18.0 17.9
4 2019-02-03 04:00:00 18.0 18.1 17.9
5 2019-02-03 05:00:00 18.3 18.4 18.1
6 2019-02-03 06:00:00 18.7 19.1 18.4
7 2019-02-03 07:00:00 20.1 21.3 19.1
8 2019-02-03 08:00:00 21.7 22.9 21.0
9 2019-02-03 09:00:00 23.2 23.9 22.8
10 2019-02-03 10:00:00 23.7 24.1 23.3
11 2019-02-03 11:00:00 24.6 25.5 24.0
12 2019-02-03 12:00:00 25.1 25.7 24.7
13 2019-02-03 13:00:00 24.5 25.0 24.2
14 2019-02-03 14:00:00 23.9 25.3 21.2
15 2019-02-03 15:00:00 19.6 21.2 18.8
16 2019-02-03 16:00:00 19.2 19.5 18.7
17 2019-02-03 17:00:00 19.8 19.9 19.4
18 2019-02-03 18:00:00 19.6 19.8 19.5
19 2019-02-03 19:00:00 19.3 19.4 19.1
20 2019-02-03 20:00:00 19.2 19.4 19.1
21 2019-02-03 21:00:00 19.3 19.4 18.9
22 2019-02-03 22:00:00 18.8 19.0 18.7
23 2019-02-03 23:00:00 19.0 19.1 18.9
Example 2
All hours for day available
Max & min available for each hour
NaN values for some Mean_temp entries
Proceed to create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
24 2019-02-04 00:00:00 18.9 19.0 18.9
25 2019-02-04 01:00:00 18.8 18.9 18.7
26 2019-02-04 02:00:00 18.6 18.8 18.4
27 2019-02-04 03:00:00 18.4 18.6 18.1
28 2019-02-04 04:00:00 18.7 18.9 18.4
29 2019-02-04 05:00:00 18.8 18.8 18.7
30 2019-02-04 06:00:00 19.0 19.3 18.8
31 2019-02-04 07:00:00 19.7 20.4 19.3
32 2019-02-04 08:00:00 21.4 22.8 20.3
33 2019-02-04 09:00:00 23.5 23.9 22.8
34 2019-02-04 10:00:00 25.7 23.6
35 2019-02-04 11:00:00 26.5 25.4
36 2019-02-04 12:00:00 27.1 26.1
37 2019-02-04 13:00:00 25.8 26.8 24.8
38 2019-02-04 14:00:00 25.4 27.8 23.7
39 2019-02-04 15:00:00 22.1 24.1 20.2
40 2019-02-04 16:00:00 21.8 22.6 20.2
41 2019-02-04 17:00:00 20.9 22.4 19.6
42 2019-02-04 18:00:00 18.9 19.6 18.6
43 2019-02-04 19:00:00 18.8 18.9 18.6
44 2019-02-04 20:00:00 18.9 19.0 18.8
45 2019-02-04 21:00:00 18.8 18.9 18.7
46 2019-02-04 22:00:00 18.8 18.9 18.7
47 2019-02-04 23:00:00 18.9 19.2 18.7
Example 3
Not all hours of the day are available
Do not create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
48 2019-02-05 00:00:00 19.2 19.3 19.0
49 2019-02-05 01:00:00 19.3 19.4 19.3
50 2019-02-05 02:00:00 19.3 19.4 19.2
51 2019-02-05 03:00:00 19.4 19.5 19.4
52 2019-02-05 04:00:00 19.5 19.6 19.3
53 2019-02-05 05:00:00 19.3 19.5 19.1
54 2019-02-05 06:00:00 20.1 20.6 19.2
55 2019-02-05 07:00:00 21.1 21.7 20.6
56 2019-02-05 08:00:00 22.3 23.2 21.7
57 2019-02-05 15:00:00 25.3 25.8 25.0
58 2019-02-05 16:00:00 25.8 26.0 25.2
59 2019-02-05 17:00:00 24.3 25.2 23.3
60 2019-02-05 18:00:00 22.5 23.3 22.1
61 2019-02-05 19:00:00 21.6 22.1 21.1
62 2019-02-05 20:00:00 21.1 21.3 20.9
63 2019-02-05 21:00:00 21.2 21.3 20.9
64 2019-02-05 22:00:00 20.9 21.0 20.6
65 2019-02-05 23:00:00 19.9 20.6 19.7
Example 4
All hours of the day are available
Max and/or min have at least one NaN value
Do not create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
66 2019-02-06 00:00:00 19.7 19.8 19.7
67 2019-02-06 01:00:00 19.6 19.7 19.3
68 2019-02-06 02:00:00 19.0 19.3 18.6
69 2019-02-06 03:00:00 18.5 18.6 18.4
70 2019-02-06 04:00:00 18.6 18.7 18.4
71 2019-02-06 05:00:00 18.5 18.6
72 2019-02-06 06:00:00 19.0 19.6 18.5
73 2019-02-06 07:00:00 20.3 21.2 19.6
74 2019-02-06 08:00:00 21.5 21.7 21.2
75 2019-02-06 09:00:00 21.4 22.3 20.9
76 2019-02-06 10:00:00 23.5 24.4 22.3
77 2019-02-06 11:00:00 24.7 25.4 24.3
78 2019-02-06 12:00:00 24.9 25.5 23.9
79 2019-02-06 13:00:00 23.4 24.0 22.9
80 2019-02-06 14:00:00 23.3 23.8 22.9
81 2019-02-06 15:00:00 24.4 23.7
82 2019-02-06 16:00:00 24.9 25.1 24.7
83 2019-02-06 17:00:00 24.4 24.9 23.8
84 2019-02-06 18:00:00 22.5 23.8 21.7
85 2019-02-06 19:00:00 20.8 21.8 19.6
86 2019-02-06 20:00:00 19.1 19.6 18.9
87 2019-02-06 21:00:00 19.0 19.1 18.9
88 2019-02-06 22:00:00 19.1 19.1 19.0
89 2019-02-06 23:00:00 19.1 19.1 19.0
Just to recap, the above inputs would create the following output:
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.3 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
1 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 25.8 25.4 22.1 21.8 21.0 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
I've had a really good think about this, and I can only come up with a horrible set of if statements that I know will be terribly slow and will take ages to write (apologies, this is due to me being bad at coding)!
Does anyone have any pointers to Pandas functions that could begin to deal with this problem efficiently?
You can use a groupby on the day of the Date_time column and build each row of final_df from each group, skipping a group whenever the Max_temp or Min_temp columns contain missing values, or whenever the group has fewer than 24 rows.
Note that I'm assuming your Date_time column is of type datetime64[ns]. If it isn't, first run: df['Date_time'] = pd.to_datetime(df['Date_time'])
import pandas as pd

all_hours = list(pd.date_range(start='1/1/22 00:00:00', end='1/1/22 23:00:00', freq='h').strftime('%H:%M'))
final_df = pd.DataFrame(columns=['Date_time'] + all_hours + ['Max', 'Min'])

## construct final_df by using a groupby on the day of the 'Date_time' column
for group, df_group in df.groupby(df['Date_time'].dt.date):
    ## skip the day if 'Max_temp' or 'Min_temp' contains NaN, or hours are missing
    if (df_group[['Max_temp', 'Min_temp']].isnull().sum().sum() == 0) & (len(df_group) == 24):
        ## create a dictionary for the new row of the final_df
        new_df_data = {'Date_time': group}
        new_df_data.update(dict(zip(all_hours, [[val] for val in df_group['Mean_temp']])))
        new_df_data['Max'], new_df_data['Min'] = df_group['Max_temp'].max(), df_group['Min_temp'].min()
        final_df = pd.concat([final_df, pd.DataFrame(new_df_data)])
Output:
>>> final_df
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.2 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
0 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 NaN NaN NaN 25.8 25.4 22.1 21.8 20.9 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
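A more vectorized variant of the same logic is also possible: groupby.filter drops incomplete days, then pivot_table spreads the hourly means into columns. A sketch on synthetic data (the original frame isn't reproduced here, so the values are made up: one complete day of 24 hourly rows and one incomplete day of 5 rows):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one complete day (24 hourly rows), one incomplete day (5 rows)
times = pd.date_range('2019-02-03', periods=24, freq='h').append(
    pd.date_range('2019-02-04', periods=5, freq='h'))
df = pd.DataFrame({'Date_time': times,
                   'Mean_temp': np.arange(29, dtype=float),
                   'Max_temp': np.arange(29, dtype=float) + 1,
                   'Min_temp': np.arange(29, dtype=float) - 1})

# keep only days with all 24 hours and no NaN in the max/min columns
ok = df.groupby(df['Date_time'].dt.date).filter(
    lambda g: len(g) == 24 and g[['Max_temp', 'Min_temp']].notna().all().all())

# spread the hourly means into columns, then append the daily extremes
day = ok['Date_time'].dt.date.rename('Date')
hour = ok['Date_time'].dt.strftime('%H:%M').rename('Hour')
out = ok.pivot_table(index=day, columns=hour, values='Mean_temp')
out['Max'] = ok.groupby(day)['Max_temp'].max()
out['Min'] = ok.groupby(day)['Min_temp'].min()
```

This avoids growing a DataFrame inside a loop, which gets slow as the number of days increases.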
I have a dataframe like below:
0 1 2 ... 62 63 64
795 89.0 92.0 89.0 ... 74.0 64.0 4.0
575 80.0 75.0 78.0 ... 70.0 68.0 3.0
1119 2694.0 2437.0 2227.0 ... 4004.0 4010.0 6.0
777 90.0 88.0 88.0 ... 71.0 67.0 4.0
506 82.0 73.0 77.0 ... 69.0 64.0 2.0
... ... ... ... ... ... ... ...
65 84.0 77.0 78.0 ... 78.0 80.0 0.0
1368 4021.0 3999.0 4064.0 ... 1.0 4094.0 8.0
1036 80.0 80.0 79.0 ... 73.0 66.0 5.0
1391 3894.0 3915.0 3973.0 ... 4.0 4090.0 8.0
345 81.0 74.0 75.0 ... 80.0 75.0 1.0
I want to divide all elements over 1000 in this dataframe by 100. So 4021.0 becomes 40.21, et cetera.
I've tried something like below:
for cols in df:
for rows in df[cols]:
print(df[cols][rows])
I get index out of bound errors. I'm just not sure how to properly iterate the way I'm looking for.
Loops are slow here, so it's better to use vectorized solutions - select the values greater than 1000 and divide:
df[df.gt(1000)] = df.div(100)
Or using DataFrame.mask:
df = df.mask(df.gt(1000), df.div(100))
print (df)
0 1 2 62 63 64
795 89.00 92.00 89.00 74.00 64.00 4.0
575 80.00 75.00 78.00 70.00 68.00 3.0
1119 26.94 24.37 22.27 40.04 40.10 6.0
777 90.00 88.00 88.00 71.00 67.00 4.0
506 82.00 73.00 77.00 69.00 64.00 2.0
65 84.00 77.00 78.00 78.00 80.00 0.0
1368 40.21 39.99 40.64 1.00 40.94 8.0
1036 80.00 80.00 79.00 73.00 66.00 5.0
1391 38.94 39.15 39.73 4.00 40.90 8.0
345 81.00 74.00 75.00 80.00 75.00 1.0
You can use the applymap function with a custom function:
def mapper_function(x):
    if x >= 1000:
        x = x / 100
    return x

df = df.applymap(mapper_function)
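For completeness, the same transform can be done with numpy.where, which avoids the per-element Python calls of applymap. A sketch on a tiny made-up frame:

```python
import numpy as np
import pandas as pd

# made-up two-row sample in place of the original frame
df = pd.DataFrame({'a': [89.0, 4021.0], 'b': [92.0, 3999.0]})

# keep values as-is unless they exceed 1000, in which case divide by 100
out = pd.DataFrame(np.where(df > 1000, df / 100, df),
                   index=df.index, columns=df.columns)
```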
CODE
import pandas
df = pandas.read_csv('biharpopulation.txt', delim_whitespace=True)
df.columns = ['SlNo','District','Total','Male','Female','Total','Male','Female','SC','ST','SC','ST']
DATA
SlNo District Total Male Female Total Male Female SC ST SC ST
1 Patna 729988 386991 342997 9236 5352 3884 15.5 0.2 38.6 68.7
2 Nalanda 473786 248246 225540 970 524 446 20.2 0.0 29.4 29.8
3 Bhojpur 343598 181372 162226 8337 4457 3880 15.3 0.4 39.1 46.7
4 Buxar 198014 104761 93253 8428 4573 3855 14.1 0.6 37.9 44.6
5 Rohtas 444333 233512 210821 25663 13479 12184 18.1 1.0 41.3 30.0
6 Kaimur 286291 151031 135260 35662 18639 17023 22.2 2.8 40.5 38.6
7 Gaya 1029675 529230 500445 2945 1526 1419 29.6 0.1 26.3 49.1
8 Jehanabad 174738 90485 84253 1019 530 489 18.9 0.07 32.6 32.4
9 Arawal 11479 57677 53802 294 179 115 18.8 0.04
10 Nawada 435975 223929 212046 2158 1123 1035 24.1 0.1 22.4 20.5
11 Aurangabad 472766 244761 228005 1640 865 775 23.5 0.1 35.7 49.7
Saran
12 Saran 389933 199772 190161 6667 3384 3283 12 0.2 33.6 48.5
13 Siwan 309013 153558 155455 13822 6856 6966 11.4 0.5 35.6 44.0
14 Gopalganj 267250 134796 132454 6157 2984 3173 12.4 0.3 32.1 37.8
15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.1 28.9 50.4
16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1
18 Sitamarhi 315646 166607 149039 1786 952 834 11.8 0.1 22.1 31.4
19 Sheohar 74391 39405 34986 64 35 29 14.4 0.0 16.9 38.8
20 Vaishali 562123 292711 269412 3068 1595 1473 20.7 0.1 29.4 29.9
21 Darbhanga 511125 266236 244889 841 467 374 15.5 0.0 24.7 49.5
22 Madhubani 481922 248774 233148 1260 647 613 13.5 0.0 22.2 35.8
23 Samastipur 628838 325101 303737 3362 2724 638 18.5 0.1 25.1 22.0
24 Munger 150947 80031 70916 18060 9297 8763 13.3 1.6 42.6 37.3
25 Begusarai 341173 177897 163276 1505 823 682 14.5 0.1 31.4 78.6
26 Shekhapura 103732 54327 49405 211 115 96 19.7 0.0 25.2 45.6
27 Lakhisarai 126575 65781 60794 5636 2918 2718 15.8 0.7 26.8 12.9
28 Jamui 242710 124538 118172 67357 34689 32668 17.4 4.8 24.5 26.7
The issue is with these 2 lines:
16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1
If you can somehow remove the space between E. Champaran and W. Champaran then you can do this:
df = pd.read_csv('test.csv', sep=r'\s+', skip_blank_lines=True, skipinitialspace=True)
print(df)
SlNo District Total Male Female Total.1 Male.1 Female.1 SC ST SC.1 ST.1
0 1 Patna 729988 386991 342997 9236 5352 3884 15.5 0.20 38.6 68.7
1 2 Nalanda 473786 248246 225540 970 524 446 20.2 0.00 29.4 29.8
2 3 Bhojpur 343598 181372 162226 8337 4457 3880 15.3 0.40 39.1 46.7
3 4 Buxar 198014 104761 93253 8428 4573 3855 14.1 0.60 37.9 44.6
4 5 Rohtas 444333 233512 210821 25663 13479 12184 18.1 1.00 41.3 30.0
5 6 Kaimur 286291 151031 135260 35662 18639 17023 22.2 2.80 40.5 38.6
6 7 Gaya 1029675 529230 500445 2945 1526 1419 29.6 0.10 26.3 49.1
7 8 Jehanabad 174738 90485 84253 1019 530 489 18.9 0.07 32.6 32.4
8 9 Arawal 11479 57677 53802 294 179 115 18.8 0.04 NaN NaN
9 10 Nawada 435975 223929 212046 2158 1123 1035 24.1 0.10 22.4 20.5
10 11 Aurangabad 472766 244761 228005 1640 865 775 23.5 0.10 35.7 49.7
11 12 Saran 389933 199772 190161 6667 3384 3283 12.0 0.20 33.6 48.5
12 13 Siwan 309013 153558 155455 13822 6856 6966 11.4 0.50 35.6 44.0
13 14 Gopalganj 267250 134796 132454 6157 2984 3173 12.4 0.30 32.1 37.8
14 15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.10 28.9 50.4
15 16 E.Champaran 514119 270968 243151 4812 2518 2294 13.0 0.10 20.6 34.3
16 17 W.Champaran 434714 228057 206657 44912 23135 21777 14.3 1.50 22.3 24.1
17 18 Sitamarhi 315646 166607 149039 1786 952 834 11.8 0.10 22.1 31.4
18 19 Sheohar 74391 39405 34986 64 35 29 14.4 0.00 16.9 38.8
19 20 Vaishali 562123 292711 269412 3068 1595 1473 20.7 0.10 29.4 29.9
20 21 Darbhanga 511125 266236 244889 841 467 374 15.5 0.00 24.7 49.5
21 22 Madhubani 481922 248774 233148 1260 647 613 13.5 0.00 22.2 35.8
22 23 Samastipur 628838 325101 303737 3362 2724 638 18.5 0.10 25.1 22.0
23 24 Munger 150947 80031 70916 18060 9297 8763 13.3 1.60 42.6 37.3
24 25 Begusarai 341173 177897 163276 1505 823 682 14.5 0.10 31.4 78.6
25 26 Shekhapura 103732 54327 49405 211 115 96 19.7 0.00 25.2 45.6
26 27 Lakhisarai 126575 65781 60794 5636 2918 2718 15.8 0.70 26.8 12.9
27 28 Jamui 242710 124538 118172 67357 34689 32668 17.4 4.80 24.5 26.7
Your problem is that the CSV is whitespace-delimited, but some of your district names also have whitespace in them. Luckily, none of the district names contain '\t' characters, so we can fix this:
df = pandas.read_csv('biharpopulation.txt', delimiter='\t')
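If editing the file by hand isn't an option, another possible approach is to collapse the space after the single-letter initials before parsing. A sketch with a made-up two-row excerpt standing in for biharpopulation.txt:

```python
import io
import re
import pandas as pd

# made-up excerpt standing in for biharpopulation.txt
raw = """SlNo District Total
16 E. Champaran 514119
17 W. Champaran 434714
"""

# join initials to the following word: 'E. Champaran' -> 'E.Champaran'
fixed = re.sub(r'\b([A-Z])\. (\w)', r'\1.\2', raw)
df = pd.read_csv(io.StringIO(fixed), sep=r'\s+')
```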
I'm trying to plot fantasy points from two players in every game since the start of the NBA season.
I've created a dataframe that has the lines of every player, every night, and I want to plot every date that each have played.
The two dataframes look as such.
kemba[['Date','FP']]
Date FP
Rk
260 10/23/2019 2.0
532 10/25/2019 28.0
754 10/26/2019 49.0
1390 10/30/2019 35.0
1628 11/1/2019 39.5
2178 11/5/2019 32.5
2463 11/7/2019 17.5
2800 11/9/2019 40.0
3103 11/11/2019 37.5
3410 11/13/2019 37.0
3699 11/15/2019 25.0
4001 11/17/2019 22.5
4186 11/18/2019 22.0
4494 11/20/2019 9.5
4750 11/22/2019 4.0
5637 11/27/2019 50.5
5904 11/29/2019 19.0
6193 12/1/2019 22.5
6677 12/4/2019 43.5
6975 12/6/2019 26.0
7454 12/9/2019 33.5
7769 12/11/2019 57.0
7861 12/12/2019 31.5
8614 12/18/2019 35.5
9071 12/20/2019 5.0
9289 12/22/2019 26.0
100 12/25/2019 23.0
ingram[['Date','FP']]
Date FP
Rk
22 10/22/2019 31.5
441 10/25/2019 37.5
646 10/26/2019 57.0
984 10/28/2019 41.5
1439 10/31/2019 30.0
1718 11/2/2019 10.5
1994 11/4/2019 59.0
2586 11/8/2019 30.0
2757 11/9/2019 31.5
4245 11/19/2019 30.5
4532 11/21/2019 38.5
4864 11/23/2019 40.5
5022 11/24/2019 32.5
5496 11/27/2019 22.0
5784 11/29/2019 43.0
6111 12/1/2019 31.0
6404 12/3/2019 40.0
6737 12/5/2019 27.0
7038 12/7/2019 18.0
7372 12/9/2019 38.5
7668 12/11/2019 29.0
7958 12/13/2019 38.0
8283 12/15/2019 32.5
8551 12/17/2019 24.0
8612 12/18/2019 48.0
8891 12/20/2019 30.5
102 12/23/2019 31.0
55 12/25/2019 46.5
The plotting code is as follows:
import matplotlib.pyplot as plt

# creating x & y for Ingram
ingram_fp=ingram['FP']
ingram_date=ingram['Date']
# creating x and y for Kemba
kemba_fp=kemba['FP']
kemba_date=kemba['Date']
fig=plt.figure()
plt.plot(kemba_date,kemba_fp,color='#FF5733',linewidth=1,marker='.',label='Walker')
plt.plot(ingram_date,ingram_fp,color='#33A7FF',marker='.',label='Ingram')
fig.autofmt_xdate()
plt.show()
When I do this, the line for Ingram is all over the place. Any idea on what went wrong?
This is the plot I get
It looks like Date might not be formatted as a date.
Modify your code as follows:
import pandas as pd
# creating x & y for Ingram
ingram_fp=ingram['FP']
ingram_date=pd.to_datetime(ingram['Date'])
# creating x and y for Kemba
kemba_fp=kemba['FP']
kemba_date=pd.to_datetime(kemba['Date'])
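Putting it together, a minimal end-to-end sketch. The frames here are tiny made-up stand-ins for the real kemba and ingram data, and the non-interactive Agg backend is selected so the sketch runs headless:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this sketch runs headless
import matplotlib.pyplot as plt

# tiny made-up frames standing in for the real kemba/ingram data
kemba = pd.DataFrame({'Date': ['10/23/2019', '11/1/2019', '12/25/2019'],
                      'FP': [2.0, 39.5, 23.0]})
ingram = pd.DataFrame({'Date': ['10/22/2019', '11/2/2019', '12/25/2019'],
                       'FP': [31.5, 10.5, 46.5]})

fig, ax = plt.subplots()
for frame, label in [(kemba, 'Walker'), (ingram, 'Ingram')]:
    dates = pd.to_datetime(frame['Date'])  # parse strings so the x-axis is chronological
    ax.plot(dates, frame['FP'], marker='.', label=label)
ax.legend()
fig.autofmt_xdate()
```

With string dates, matplotlib treats the x-values as unordered categories in row order; parsing them with pd.to_datetime puts every point at its true position on a time axis.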
df
Out[1]:
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
0 978.0 345 17.0 16.5 97 12.22 0 0 292.0 326.8 294.1
1 977.0 354 17.8 16.7 93 12.39 1 0 292.9 328.3 295.1
2 970.0 416 23.4 15.4 61 11.47 4 2 299.1 332.9 301.2
3 963.0 479 24.0 14.0 54 10.54 8 3 300.4 331.6 302.3
4 948.7 610 23.0 13.4 55 10.28 15 6 300.7 331.2 302.5
5 925.0 830 21.4 12.4 56 9.87 20 5 301.2 330.6 303.0
6 916.0 914 20.7 11.7 56 9.51 20 4 301.3 329.7 303.0
7 884.0 1219 18.2 9.2 56 8.31 60 4 301.8 326.7 303.3
8 853.1 1524 15.7 6.7 55 7.24 35 3 302.2 324.1 303.5
9 850.0 1555 15.4 6.4 55 7.14 20 2 302.3 323.9 303.6
10 822.8 1829 13.3 5.6 60 6.98 300 4 302.9 324.0 304.1
How do I interpolate the values of all the columns on specified PRES (pressure) values at say PRES=[950, 900, 875]? Is there an elegant pandas type of way to do this?
The only way I can think of doing this is to first insert an empty (NaN) row for each specified PRES value in a loop, then set PRES as the index and use pandas' native interpolate option:
df.interpolate(method='index', inplace=True)
Is there a more elegant solution?
Use your solution without a loop - reindex using the union of the original index values and the PRES list. Note this works only if all values are unique:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = df.reindex(df.index.union(PRES)).sort_index(ascending=False).interpolate(method='index')
print (df)
HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
978.0 345.0 17.0 16.5 97.0 12.22 0.0 0.0 292.0 326.8 294.1
977.0 354.0 17.8 16.7 93.0 12.39 1.0 0.0 292.9 328.3 295.1
970.0 416.0 23.4 15.4 61.0 11.47 4.0 2.0 299.1 332.9 301.2
963.0 479.0 24.0 14.0 54.0 10.54 8.0 3.0 300.4 331.6 302.3
950.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
948.7 610.0 23.0 13.4 55.0 10.28 15.0 6.0 300.7 331.2 302.5
925.0 830.0 21.4 12.4 56.0 9.87 20.0 5.0 301.2 330.6 303.0
916.0 914.0 20.7 11.7 56.0 9.51 20.0 4.0 301.3 329.7 303.0
900.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
884.0 1219.0 18.2 9.2 56.0 8.31 60.0 4.0 301.8 326.7 303.3
875.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
853.1 1524.0 15.7 6.7 55.0 7.24 35.0 3.0 302.2 324.1 303.5
850.0 1555.0 15.4 6.4 55.0 7.14 20.0 2.0 302.3 323.9 303.6
822.8 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
If the PRES column can contain non-unique values, use concat with sort_index instead:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = (pd.concat([df, pd.DataFrame(index=PRES)])
.sort_index(ascending=False)
.interpolate(method='index'))
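One caveat worth noting: interpolate(method='index') expects a monotonically increasing index, which may explain the flat repeated values at 950/900/875 in the output above. A minimal sketch (with a hypothetical two-column subset of the sounding data) that interpolates while ascending and only then flips back to descending pressure:

```python
import pandas as pd

# hypothetical two-column subset of the sounding data
df = pd.DataFrame({'PRES': [978.0, 963.0, 948.7],
                   'HGHT': [345.0, 479.0, 610.0]})
PRES = [950]

out = (df.set_index('PRES')
         .reindex(df['PRES'].tolist() + PRES)  # add the target pressure rows as NaN
         .sort_index()                         # ascending, as method='index' expects
         .interpolate(method='index')          # linear in the index values
         .sort_index(ascending=False))         # back to descending pressure
```

Here HGHT at PRES=950 lands between the 963 and 948.7 rows, weighted by distance in pressure, rather than being copied from a neighbouring row.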