I am new to Python/Bokeh/pandas.
I am able to plot a line graph in pandas/Bokeh using the parse_dates option.
However, I have come across a dataset (.csv) where the column 'Year/Ports' is in YYYY-YY form, like 1952-53, 1953-54, 1954-55, etc. My code below gives a blank graph in that case.
Do I have to extract only the YYYY and plot? That works, but I am sure that is not how the data is meant to be visualized.
If I extract only the YYYY using CSV or Notepad++ tools, there is no issue: the dates are read perfectly and I get a good, meaningful line graph.
#Total Cargo Handled at Mormugao Port from 1950-51 to 2019-20
import pandas as pd
from bokeh.plotting import figure,show
from bokeh.io import output_file
#read the CSV file shared by GOI
df = pd.read_csv("Cargo_Data_full.csv",parse_dates=["Year/Ports"])
# selecting rows based on condition
output_file("Cargo tracker.html")
f = figure(height=200,sizing_mode = 'scale_width',x_axis_type = 'datetime')
f.title.text = "Cargo Tracker"
f.xaxis.axis_label="Year/Ports"
f.yaxis.axis_label="Cargo handled"
f.line(df['Year/Ports'],df['OTHERS'])
show(f)
You can't use parse_dates in this case, since the format is not a valid datetime. You can use pandas string slicing to only keep the YYYY part.
df = pd.DataFrame({'Year/Ports':['1952-53', '1953-54', '1954-55'], 'val':[1,2,3]})
df['Year/Ports'] = df['Year/Ports'].str[:4]
print(df)
Year/Ports val
0 1952 1
1 1953 2
2 1954 3
From there you can turn it into a datetime if that makes sense for you.
df['Year/Ports'] = pd.to_datetime(df['Year/Ports'])
print(df)
Year/Ports val
0 1952-01-01 1
1 1953-01-01 2
2 1954-01-01 3
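Putting the two steps together for the original question's column (the sample values here are made up, not from the real CSV), a minimal sketch:

```python
import pandas as pd

# Sample data mimicking the YYYY-YY column from the question
df = pd.DataFrame({"Year/Ports": ["1952-53", "1953-54", "1954-55"],
                   "OTHERS": [10.0, 12.5, 11.2]})
# Keep only the starting year and parse it with an explicit format
df["Year/Ports"] = pd.to_datetime(df["Year/Ports"].str[:4], format="%Y")
print(df["Year/Ports"].dt.year.tolist())  # [1952, 1953, 1954]
```

With the column converted like this, the original f.line(df['Year/Ports'], df['OTHERS']) call should plot a proper line, since x_axis_type='datetime' was already set on the figure.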
I have a wireless radio readout that basically dumps all of the data into one column (column 'A') of a spreadsheet (.xlsx). Is there any way to parse the twenty-plus columns into a DataFrame for pandas? This is an example of the data that is in column A of the Excel file:
DSP ALLMSINFO:SECTORID=0,CARRIERID=0;
Belgium351G
+++ HUAWEI 2020-04-03 10:04:47 DST
O&M #4421590
%%/*35687*/DSP ALLMSINFO:SECTORID=0,CARRIERID=0;%%
RETCODE = 0 Operation succeeded
Display Information of All MSs-
------------------------------
Sector ID Carrier ID MSID MSSTATUS MSPWR(dBm) DLCINR(dB) ULCINR(dB) DLRSSI(dBm) ULRSSI(dBm) DLFEC ULFEC DLREPETITIONFATCTOR ULREPETITIONFATCTOR DLMIMOFLAG BENUM NRTPSNUM RTPSNUM ERTPSNUM UGSNUM UL PER for an MS(0.001) NI Value of the Band Where an MS Is Located(dBm) DL Traffic Rate for an MS(byte/s) UL Traffic Rate for an MS(byte/s)
0 0 0011-4D10-FFBA Enter -2 29 27 -56 -107 21 20 0 0 MIMO B 2 0 0 0 0 0 -134 158000 46000
0 0 501F-F63B-FB3B Enter 13 27 28 -68 -107 21 20 0 0 MIMO A 2 0 0 0 0 0 -134 12 8
Basically I just want to parse this data and have the table in a dataframe. Any help would be greatly appreciated.
You could try pandas read_excel:
df = pd.read_excel(filename, skiprows=9)
This assumes we want to ignore the first 9 rows that don't make up the dataframe. Docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
Load the excel file and split the column on the spaces.
A problem may occur with "DLMIMOFLAG" because it has a space in the data and this will cause it to be split over two columns. It's optional whether this is acceptable or if the columns are merged back together afterwards.
Add the header manually rather than load it, otherwise all the spaces in the header will confuse the loading & splitting routines.
import numpy as np
import pandas as pd
# Start on the first data row - row 10
# Make sure pandas knows that only data is being loaded by using
# header=None
df = pd.read_excel('radio.xlsx', skiprows=10, header=None)
This gives a dataframe that is only data, all held in one column.
To split these out, reference the first column with df.iloc[:,0], split it on spaces with str.split(), and convert the result to a list of lists with .values.tolist() so pandas can build one column per field.
Together this looks like:
df2 = pd.DataFrame(df.iloc[:,0].str.split().values.tolist())
Note the example given has an extra column because of the space in "DLMIMOFLAG" causing it to be split over two columns. This will be referred to as "DLMIMOFLAG_A" and "DLMIMOFLAG_B".
Now add on the column headers.
Optionally create a list first.
column_names = ["Sector ID", "Carrier ID", "MSID", "MSSTATUS", "MSPWR(dBm)", "DLCINR(dB)", "ULCINR(dB)",
"DLRSSI(dBm)", "ULRSSI(dBm)", "DLFEC", "ULFEC", "DLREPETITIONFATCTOR", "ULREPETITIONFATCTOR",
"DLMIMOFLAG_A", "DLMIMOFLAG_B", "BENUM", "NRTPSNUM", "RTPSNUM", "ERTPSNUM", "UGSNUM",
"UL PER for an MS(0.001)", "NI Value of the Band Where an MS Is Located(dBm)",
"DL Traffic Rate for an MS(byte/s)", "UL Traffic Rate for an MS(byte/s)",]
df2.columns = column_names
This gives the output as a full dataframe with column headers.
Sector ID Carrier ID MSID MSSTATUS
0 0 0011-4D10-FFBA Enter
0 0 501F-F63B-FB3B Enter
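If the two "DLMIMOFLAG" halves should be merged back together (as mentioned above, this is optional), one sketch using made-up sample values:

```python
import pandas as pd

# Tiny stand-in for the two columns produced when str.split()
# breaks "MIMO B" / "MIMO A" across a space
df2 = pd.DataFrame({"DLMIMOFLAG_A": ["MIMO", "MIMO"],
                    "DLMIMOFLAG_B": ["B", "A"]})
# Rejoin the halves with the space that str.split() removed
df2["DLMIMOFLAG"] = df2["DLMIMOFLAG_A"] + " " + df2["DLMIMOFLAG_B"]
df2 = df2.drop(columns=["DLMIMOFLAG_A", "DLMIMOFLAG_B"])
print(df2["DLMIMOFLAG"].tolist())  # ['MIMO B', 'MIMO A']
```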
I am currently trying to convert a CSV with Python 3 to a new format.
My later goal is to add some information to this file with pandas,
things like "is the date a weekday or a weekend day?".
To achieve this, however, I have to overcome the first hurdle.
I need to transform my CSV file from this:
date,hour,price
2018-10-01,0-1,59.53
2018-10-01,1-2,56.10
2018-10-01,2-3,51.41
2018-10-01,3-4,47.38
2018-10-01,4-5,47.59
2018-10-01,5-6,51.61
2018-10-01,6-7,69.13
2018-10-01,7-8,77.32
2018-10-01,8-9,84.97
2018-10-01,9-10,79.56
2018-10-01,10-11,73.70
2018-10-01,11-12,71.63
2018-10-01,12-13,63.15
2018-10-01,13-14,60.24
2018-10-01,14-15,56.18
2018-10-01,15-16,53.00
2018-10-01,16-17,53.37
2018-10-01,17-18,60.42
2018-10-01,18-19,69.93
2018-10-01,19-20,75.00
2018-10-01,20-21,65.83
2018-10-01,21-22,53.86
2018-10-01,22-23,46.46
2018-10-01,23-24,42.50
2018-10-02,0-1,45.10
2018-10-02,1-2,44.10
2018-10-02,2-3,44.06
2018-10-02,3-4,43.70
2018-10-02,4-5,44.29
2018-10-02,5-6,48.13
2018-10-02,6-7,57.70
2018-10-02,7-8,68.21
2018-10-02,8-9,70.36
2018-10-02,9-10,54.53
2018-10-02,10-11,48.49
2018-10-02,11-12,46.19
2018-10-02,12-13,44.15
2018-10-02,13-14,30.79
2018-10-02,14-15,27.75
2018-10-02,15-16,30.74
2018-10-02,16-17,26.77
2018-10-02,17-18,38.68
2018-10-02,18-19,48.52
2018-10-02,19-20,49.03
2018-10-02,20-21,45.43
2018-10-02,21-22,32.04
2018-10-02,22-23,26.22
2018-10-02,23-24,1.08
2018-10-03,0-1,2.13
2018-10-03,1-2,0.10
...
to this:
date,0-1,1-2,2-3,3-4,4-5,5-6,6-7,7-8,8-9,...,23-24
2018-10-01,59.53,56.10,51.41,47.38,47.59,51.61,69.13,77.32,84.97,...,42.50
2018-10-02,45.10,44.10,44.06,43.70,44.29,....
2018-10-03,2.13,0.10,....
...
I've tried a lot with pandas DataFrames, but I can't come up with a solution.
import numpy as np
import pandas as pd
df = pd.read_csv('file.csv')
df
date hour price
0 2018-10-01 0-1 59.53
1 2018-10-01 1-2 56.10
2 2018-10-01 2-3 51.41
3 2018-10-01 3-4 47.38
4 2018-10-01 4-5 47.59
5 2018-10-01 5-6 51.61
6 2018-10-01 6-7 69.13
7 2018-10-01 7-8 77.32
8 2018-10-01 8-9 84.97
The DataFrame should look like this, but I can't manage to fill it.
df = pd.DataFrame(df, index=['date'], columns=['date','0-1','1-2','2-3', '3-4', '4-5', '5-6', '6-7', '7-8', '8-9', '9-10', '10-11', '11-12', '12-13', '13-14', '14-15', '15-16', '16-17', '17-18', '18-19', '19-20', '20-21', '21-22', '22-23', '23-24'])
How would you solve this?
You can use pandas.DataFrame.unstack():
# pivot the dataframe with hour to the columns
df1 = df.set_index(['date','hour']).unstack(1)
# drop level-0 on columns
df1.columns = [ c[1] for c in df1.columns ]
# sort the column names by numeric order of hours (the number before '-')
df1 = df1.reindex(columns=sorted(df1.columns, key=lambda x: int(x.split('-')[0]))).reset_index()
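The same reshape can also be done in a single step with DataFrame.pivot (the sample rows below are made up from the question's data, not the full file):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2018-10-01"] * 3 + ["2018-10-02"] * 3,
    "hour": ["0-1", "1-2", "2-3"] * 2,
    "price": [59.53, 56.10, 51.41, 45.10, 44.10, 44.06],
})
# One column per unique hour value, one row per date
wide = df.pivot(index="date", columns="hour", values="price").reset_index()
wide.columns.name = None
print(wide)
```

With the full 24-hour file, the same hour-sorting step from above (sorting on the number before the '-') would still be needed, since pivot orders columns lexically.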
If I understand correctly, try using the index_col argument of pd.read_csv(), using integer labelling for the columns in the file:
df = pd.read_csv('file.csv', index_col=0)
read_csv docs here; don't be put off by the alarming number of keyword arguments, one of them will often do what you need!
You may need to parse the first two columns as a date, then add a column for weekend based on a condition on the result. See the parse_dates and infer_datetime_format keyword arguments.
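For the weekday/weekend flag mentioned in the question, Series.dt.dayofweek can be used once the date column is a datetime (the sample dates here are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2018-10-01", "2018-10-06", "2018-10-07"]})
df["date"] = pd.to_datetime(df["date"])
# dayofweek: Monday=0 ... Sunday=6, so >= 5 means Saturday or Sunday
df["is_weekend"] = df["date"].dt.dayofweek >= 5
print(df["is_weekend"].tolist())  # [False, True, True]
```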
I can't seem to remove the unnamed column and also the serial number from the CSV file. I've looked online and it says to use index_col = 0, but it's still not working.
Is there any other way of doing it?
The code is:
brics = pd.read_csv('brics.csv', index_col = 0)
and the output is:
Unnamed: 0.1 country capital area population
0 BR Brazil Brasilia 8.516 200.40
1 RU Russia Moscow 17.100 143.50
3 CH China Beijing 9.597 1357.00
4 SA South_Africa Pretoria 1.221 52.98
What I need is to remove the Unnamed: 0.1 column and also the serial number.
Thanks
Use this if you really want no index to be printed:
brics.drop(['Unnamed: 0.1'], axis=1, inplace=True)
print(brics.to_string(index=False))
If you don't want to lose the data in the unnamed column, simply rename it:
brics.rename(columns={'Unnamed: 0.1': 'something'}, inplace=True)
If you want this to be your index, add the line below after you rename the column (note that set_index returns a new DataFrame, so assign the result back):
brics = brics.set_index('something')
After this you can call the print function.
Hope this helps!
What you call a "serial number" is the index of your dataframe (your object's type is pandas.DataFrame), so you cannot simply remove it.
Read more about removing the index in that SO Post:
Removing index column in pandas
To remove your unnamed column, try:
brics = pd.read_csv('brics.csv', index_col=0)
brics = brics.drop(columns=['Unnamed: 0.1'])
brics
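As an aside, an "Unnamed: 0"-style column usually appears because the CSV was saved with its index included; writing it back with index=False avoids the column on the next round trip. A sketch with made-up inline data:

```python
import io
import pandas as pd

# A CSV that was written with its index included,
# which produces an unnamed first column on read
csv_text = ",country,capital\n0,Brazil,Brasilia\n1,Russia,Moscow\n"
brics = pd.read_csv(io.StringIO(csv_text), index_col=0)
# Writing it back without the index drops the unnamed column for good
out = brics.to_csv(index=False)
print(out)
```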
Is there a way to specify a DataFrame index (row) based on matching text inside the dataframe?
I am importing a text file from the internet located here every day into a python pandas DataFrame. I am parsing out just some of the data and doing calculations to give me the peak value for each day. The specific group of data I am needing to gather starts with the section headed "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW".
I need to specifically only use part of the data to do the calculations I need and I am able to manually specify which index line to start with, but daily this number could change due to text added to the top of the file by the authors.
Updated as of: 05-05-2016 1700 Constrained operations ARE expected in
the AEP, APS, BC, COMED, DOM,and PS zones on 05-06-2016. Constrained
operations ARE expected in the AEP, APS, BC, COMED, DOM,and PS zones
on 05-07-2016. The PS/ConEd 600/400 MW contract will be limited to
700MW on 05-06-16.
Is there a way to match text in the pandas DataFrame and specify the index of that match? Currently I am manually specifying the index I want to start with using the variable 'day' below on the 6th line. I would like this variable to hold the index (row) of the dataframe that includes the text I want to match.
The code below works but may stop working if the line number (index) changes:
def forecastload():
wb = load_workbook(filename = 'pjmactualload.xlsx')
ws = wb['PJM Load']
printRow = 13
#put this in iteration to pull 2 rows of data at a time (one for each day) for 7 days max
day = 239
while day < 251:
#pulls in first day only
data = pd.read_csv("http://oasis.pjm.com/doc/projload.txt", skiprows=day, delim_whitespace=True, header=None, nrows=2)
#sets data at HE 24 = to data that is in HE 13- so I can delete column 0 data to allow checking 'max'
data.at[1,13]= data.at[1,1]
#get date for printing it with max load later on
newDate = str(data.at[0,0])
#now delete first column to get rid of date data. date already saved as newDate
data = data.drop(0,1)
data = data.drop(1,1)
#pull out max value of day
#add index to this for iteration ie dayMax[x] = data.values.max()
dayMax = data.max().max()
dayMin = data.min().min()
#print date and max load for that date
actualMax = "Forecast Max"
actualMin = "Forecast Min"
dayMax = int(dayMax)
maxResults = [str(newDate),int(dayMax),actualMax,dayMin,actualMin]
d = 1
for items in maxResults:
ws.cell(row=printRow, column=d).value = items
d += 1
printRow += 1
#print maxResults
#l.writerows(maxResults)
day = day + 2
wb.save('pjmactualload.xlsx')
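To directly answer the matching question, the heading row can be located in pandas itself with a string-contains mask. A sketch (the raw file would first be read one line per row; the sample lines below are abbreviated stand-ins):

```python
import pandas as pd

# Stand-in for the raw file loaded with one line per row
lines = pd.Series([
    "Updated as of: 05-05-2016 1700 Constrained operations ...",
    "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW",
    "05/06/16 am 68640 66576 65295 ...",
])
# Boolean mask of lines containing the section heading
mask = lines.str.contains("RTO COMBINED", na=False)
day = int(lines.index[mask][0]) + 1  # data begins on the line after the heading
print(day)  # 2
```

The resulting index could then replace the hard-coded day = 239 as the skiprows starting point, so added header text no longer breaks the script.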
In this case I recommend using the command line to obtain a dataset that you can later read with pandas and process however you want.
To retrieve the data you can use curl and grep:
$ curl -s http://oasis.pjm.com/doc/projload.txt | grep -A 17 "RTO COMBINED HOUR ENDING INTEGRATED FORECAST" | tail -n +5
05/06/16 am 68640 66576 65295 65170 66106 70770 77926 83048 84949 85756 86131 86089
pm 85418 85285 84579 83762 83562 83289 82451 82460 84009 82771 78420 73258
05/07/16 am 66809 63994 62420 61640 61848 63403 65736 68489 71850 74183 75403 75529
pm 75186 74613 74072 73950 74386 74978 75135 75585 77414 76451 72529 67957
05/08/16 am 63583 60903 59317 58492 58421 59378 60780 62971 66289 68997 70436 71212
pm 71774 71841 71635 71831 72605 73876 74619 75848 78338 77121 72665 67763
05/09/16 am 63865 61729 60669 60651 62175 66796 74620 79930 81978 83140 84307 84778
pm 85112 85562 85568 85484 85766 85924 85487 85737 87366 84987 78666 72166
05/10/16 am 67581 64686 62968 62364 63400 67603 75311 80515 82655 84252 86078 87120
pm 88021 88990 89311 89477 89752 89860 89256 89327 90469 87730 81220 74449
05/11/16 am 70367 67044 65125 64265 65054 69060 76424 81785 84646 87097 89541 91276
pm 92646 93906 94593 94970 95321 95073 93897 93162 93615 90974 84335 77172
05/12/16 am 71345 67840 65837 64892 65600 69547 76853 82077 84796 87053 89135 90527
pm 91495 92351 92583 92473 92541 92053 90818 90241 90750 88135 81816 75042
Let's use the previous output (saved in the rto.txt file) to obtain more readable data using awk and sed:
$ awk '/^ [0-9]/{d=$1;print $0;next}{print d,$0}' rto.txt | sed 's/^ //;s/\s\+/,/g'
05/06/16,am,68640,66576,65295,65170,66106,70770,77926,83048,84949,85756,86131,86089
05/06/16,pm,85418,85285,84579,83762,83562,83289,82451,82460,84009,82771,78420,73258
05/07/16,am,66809,63994,62420,61640,61848,63403,65736,68489,71850,74183,75403,75529
05/07/16,pm,75186,74613,74072,73950,74386,74978,75135,75585,77414,76451,72529,67957
05/08/16,am,63583,60903,59317,58492,58421,59378,60780,62971,66289,68997,70436,71212
05/08/16,pm,71774,71841,71635,71831,72605,73876,74619,75848,78338,77121,72665,67763
05/09/16,am,63865,61729,60669,60651,62175,66796,74620,79930,81978,83140,84307,84778
05/09/16,pm,85112,85562,85568,85484,85766,85924,85487,85737,87366,84987,78666,72166
05/10/16,am,67581,64686,62968,62364,63400,67603,75311,80515,82655,84252,86078,87120
05/10/16,pm,88021,88990,89311,89477,89752,89860,89256,89327,90469,87730,81220,74449
05/11/16,am,70367,67044,65125,64265,65054,69060,76424,81785,84646,87097,89541,91276
05/11/16,pm,92646,93906,94593,94970,95321,95073,93897,93162,93615,90974,84335,77172
05/12/16,am,71345,67840,65837,64892,65600,69547,76853,82077,84796,87053,89135,90527
05/12/16,pm,91495,92351,92583,92473,92541,92053,90818,90241,90750,88135,81816,75042
Now read and reshape the above result (saved as rto2.txt) with pandas:
df = pd.read_csv("rto2.txt",names=["date","period"]+list(range(1,13)),index_col=[0,1])
df = df.stack().reset_index().rename(columns={"level_2":"hour",0:"value"})
df.index = pd.to_datetime(df.apply(lambda x: "{date} {hour}:00 {period}".format(**x),axis=1))
df.drop(["date", "hour", "period"], axis=1, inplace=True)
At this point you have a beautiful time series :)
In [10]: df.head()
Out[10]:
value
2016-05-06 01:00:00 68640
2016-05-06 02:00:00 66576
2016-05-06 03:00:00 65295
2016-05-06 04:00:00 65170
2016-05-06 05:00:00 66106
to obtain the statistics:
In[11]: df.groupby(df.index.date).agg([min,max])
Out[11]:
value
min max
2016-05-06 65170 86131
2016-05-07 61640 77414
2016-05-08 58421 78338
2016-05-09 60651 87366
2016-05-10 62364 90469
2016-05-11 64265 95321
2016-05-12 64892 92583
I hope this can help you.
Regards.
Here is how you can do what you are looking for, with sample code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10,4), columns=list('abcd'))
df.loc[df['a'] < 0.5, 'a'] = 1
You can refer to this documentation