I have a .csv file with data that looks something like this:
#file...out/houses.csv
#data...sun may 1 11:20:43 2011
#user...abs12
#host...(null)
#group...class=house
#property..change_per_hour
#limit...0
#interval..10000000
#timestamp,house_0,house_1,house_2,house_3,.....,house_1000
2010-07-01 00:00:00 EDT,1.2,1.3,1.4,1.5,........,9.72
2010-07-01 01:00:00 EDT,2.2,2.3,2.4,2.5,........,19.72
2010-07-01 02:00:00 EDT,3.2,3.3,3.4,3.5,........,29.72
2010-07-01 05:00:00 EDT,5.2,5.3,5.4,5.5,........,59.72
2010-07-01 06:00:00 EDT,6.2,,6.4,,..............,
...
I want to convert this and save it to a new .csv where the data looks like this:
#file...out/houses.csv
#data...sun may 1 11:20:43 2011
#user...abs12
#host...(null)
#group...class=house
#property..change_per_hour
#limit...0
#interval..10000000
#EntityName,2010-07-01 00:00:00 EDT,2010-07-01 01:00:00 EDT,2010-07-01 02:00:00 EDT,2010-07-01 05:00:00 EDT,2010-07-01 06:00:00 EDT
house_0,1.2,2.2,3.2,5.2,6.2,...
house_1,1.3,2.3,3.3,5.3,,...
house_2,1.4,2.4,3.4,5.4,6.4,...
house_3,1.5,2.5,3.5,5.5,,...
...
house_1000,9.72,19.72,29.72,59.72,
I tried to use pandas: I converted the data to a dictionary that looks like dtDict = {'house_0': {'datetimestamp_1': 'value_1', 'datetimestamp_2': 'value_2', ...}, ...}, but I am not able to build that dictionary and pass it to pandas.DataFrame(dtDict) to do the conversion. I don't have to use pandas (anything in Python is fine), but I thought pandas would be good for CSV manipulation. Any help?
Assuming the data is already in a pandas DataFrame, this works:
df = pd.DataFrame(
data=[[1, 3], [2, 5]],
index=[0, 1],
columns=['a', 'b']
)
Output:
>>> print(df)
a b
0 1 3
1 2 5
Then, transpose the dataframe:
>>> print(df.transpose())
0 1
a 1 2
b 3 5
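Building on that, here is a sketch of the full conversion for the file in the question. The paths, the count of metadata lines (8), and the '#timestamp' header name are taken from the sample above and are assumptions to adjust as needed:
import pandas as pd

in_path = 'out/houses.csv'        # assumed input path (from the question)
out_path = 'out/houses_wide.csv'  # assumed output path

# Keep the 8 "#..." metadata lines so they can be written back unchanged.
with open(in_path) as f:
    meta = [next(f) for _ in range(8)]

# The 9th line ("#timestamp,house_0,...") is the real header row.
df = pd.read_csv(in_path, skiprows=8)
df = df.rename(columns={'#timestamp': 'timestamp'}).set_index('timestamp')

# Transpose so each house becomes a row and each timestamp a column.
df_t = df.transpose()
df_t.index.name = '#EntityName'

with open(out_path, 'w') as f:
    f.writelines(meta)
    df_t.to_csv(f)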
I'm trying to read an Excel file into a data frame and I want to set the index later, so I don't want pandas to use column 0 for the index values.
By default (index_col=None), it shouldn't use column 0 for the index but I find that if there is no value in cell A1 of the worksheet it will.
Is there any way to over-ride this behaviour (I am loading many sheets that have no value in cell A1)?
This works as expected when test1.xlsx has the value "DATE" in cell A1:
In [19]: pd.read_excel('test1.xlsx')
Out[19]:
DATE A B C
0 2018-01-01 00:00:00 0.766895 1.142639 0.810603
1 2018-01-01 01:00:00 0.605812 0.890286 0.810603
2 2018-01-01 02:00:00 0.623123 1.053022 0.810603
3 2018-01-01 03:00:00 0.740577 1.505082 0.810603
4 2018-01-01 04:00:00 0.335573 -0.024649 0.810603
But when the worksheet has no value in cell A1, it automatically assigns column 0 values to the index:
In [20]: pd.read_excel('test2.xlsx', index_col=None)
Out[20]:
A B C
2018-01-01 00:00:00 0.766895 1.142639 0.810603
2018-01-01 01:00:00 0.605812 0.890286 0.810603
2018-01-01 02:00:00 0.623123 1.053022 0.810603
2018-01-01 03:00:00 0.740577 1.505082 0.810603
2018-01-01 04:00:00 0.335573 -0.024649 0.810603
This is not what I want.
Desired result: Same as first example (but with 'Unnamed' as the column label perhaps).
The documentation says:
index_col : int, list of int, default None.
Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column.
The issue that you're describing matches a known pandas bug. This bug was fixed in the recent pandas 0.24.0 release:
Bug Fixes
Bug in read_excel() in which index_col=None was not being respected and parsing index columns anyway (GH18792, GH20480)
You can also use index_col=0 instead of index_col=None.
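If upgrading isn't possible, one workaround sketch (my suggestion, not from the original answer) is to accept the index that read_excel builds and then move it back into the columns:
import pandas as pd

# Let pandas use column 0 as the index, then turn it back into a regular
# column; the 'index' label appears because the worksheet has no value in A1.
df = pd.read_excel('test2.xlsx', index_col=0).reset_index()
df = df.rename(columns={'index': 'Unnamed: 0'})  # optional, to mimic the usual label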
I was facing essentially the same issue for the last couple of days.
I have an Excel file whose first column header is blank, so when the file is read that column gets used as the index.
I tried many options, but the code below works by using skiprows instead of the header option. Interestingly, with skiprows the "Unnamed: 0" naming pattern is applied to columns that have no header, whereas with the header option it was not. We are using pandas version 0.20.1:
df = pd.read_excel("ABC.xlsx", dtype=str, sheetname='Supply', skiprows=6, usecols=mycols)
df.columns
Index([ 'Unnamed: 0', 2015-01-01 00:00:00, 2015-02-01 00:00:00,
2015-03-01 00:00:00, 2015-04-01 00:00:00, 2015-05-01 00:00:00,
2015-06-01 00:00:00, 2015-07-01 00:00:00, 2015-08-01 00:00:00,
2015-09-01 00:00:00,
...
],
dtype='object', length=120)
The documentation does not provide any more information on this, but the work-around above can save your day.
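As a small follow-up (my addition, assuming the "Unnamed: 0" label shown above), the placeholder column can simply be dropped once the file is loaded:
# axis=1 keeps this compatible with older pandas releases such as 0.20.x
df = df.drop('Unnamed: 0', axis=1)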
I have a dataframe which looks like this:
pressure mean pressure std
2016-03-01 00:00:00 615.686441 0.138287
2016-03-01 01:00:00 615.555000 0.067460
2016-03-01 02:00:00 615.220000 0.262840
2016-03-01 03:00:00 614.993333 0.138841
2016-03-01 04:00:00 615.075000 0.072778
2016-03-01 05:00:00 615.513333 0.162049
................
The first column is the index column.
I want to create a new dataframe containing only the rows at 3 AM and 3 PM,
so it will look like this:
pressure mean pressure std
2016-03-01 03:00:00 614.993333 0.138841
2016-03-01 15:00:00 616.613333 0.129493
2016-03-02 03:00:00 615.600000 0.068889
..................
Any ideas?
Thank you!
I couldn't load your data using pd.read_clipboard(), so I'm going to recreate some data:
df = pd.DataFrame(index=pd.date_range('2016-03-01', freq='H', periods=72),
data=np.random.random(size=(72,2)),
columns=['pressure', 'mean'])
Now your dataframe should have a DatetimeIndex. If not, you can use df.index = pd.to_datetime(df.index).
Then it's really easy using boolean indexing:
df.loc[(df.index.hour == 3) | (df.index.hour == 15)]
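A slightly more compact variant (my addition, not part of the original answer) uses Index.isin on the hour attribute:
# Keep only the rows whose hour-of-day is 3 or 15
df_3am_3pm = df[df.index.hour.isin([3, 15])]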
I have a question regarding how to filter results in the pd.read_hdf function. So here's the setup: I have a pandas dataframe (with an np.datetime64 index) which I put into an HDF5 file. There's nothing fancy going on here, so no use of hierarchy or anything (maybe I could incorporate it?). Here's an example:
Foo Bar
TIME
2014-07-14 12:02:00 0 0
2014-07-14 12:03:00 0 0
2014-07-14 12:04:00 0 0
2014-07-14 12:05:00 0 0
2014-07-14 12:06:00 0 0
2014-07-15 12:02:00 0 0
2014-07-15 12:03:00 0 0
2014-07-15 12:04:00 0 0
2014-07-15 12:05:00 0 0
2014-07-15 12:06:00 0 0
2014-07-16 12:02:00 0 0
2014-07-16 12:03:00 0 0
2014-07-16 12:04:00 0 0
2014-07-16 12:05:00 0 0
2014-07-16 12:06:00 0 0
Now I store this into a .h5 using the following command:
store = pd.HDFStore('qux.h5')
#generate df
store.append('data', df)
store.close()
Next, I'll have another process that accesses this data, and I would like to take date/time slices of it. So suppose I want dates between 2014-07-14 and 2014-07-15, and only for times between 12:02:00 and 12:04:00. Currently I am using the following command to retrieve this:
pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715').between_time(start_time=datetime.time(12,2), end_time=datetime.time(12,4))
As far as I'm aware (someone please correct me if I'm wrong here), the entire original dataset is not read into memory if I use 'where'. So in other words:
This:
pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715')
Is not the same as this:
pd.read_hdf('qux.h5', 'data')['20140714':'20140715']
While the end result is exactly the same, what's being done in the background is not. So my question is: is there a way to incorporate that time range filter (i.e. .between_time()) into my where statement? Or is there another way I should structure my HDF5 file? Maybe store a table for each day?
Thanks!
EDIT:
Regarding using hierarchy, I'm aware that the structure should be highly dependent on how I'll be using the data. However, suppose I define a table per date (e.g. 'df/date_20140714', 'df/date_20140715', ...). Again, I may be mistaken here, but with my example of querying a date/time range, I'd probably incur a performance penalty, since I'd need to read each table and then merge them to get a consolidated output, right?
Here's an example of selecting using a where mask:
In [50]: pd.set_option('max_rows',10)
In [51]: df = DataFrame(np.random.randn(1000,2),index=date_range('20130101',periods=1000,freq='H'))
In [52]: df
Out[52]:
0 1
2013-01-01 00:00:00 -0.467844 1.038375
2013-01-01 01:00:00 0.057419 0.914379
2013-01-01 02:00:00 -1.378131 0.187081
2013-01-01 03:00:00 0.398765 -0.122692
2013-01-01 04:00:00 0.847332 0.967856
... ... ...
2013-02-11 11:00:00 0.554420 0.777484
2013-02-11 12:00:00 -0.558041 1.833465
2013-02-11 13:00:00 -0.786312 0.501893
2013-02-11 14:00:00 -0.280538 0.680498
2013-02-11 15:00:00 1.533521 -1.992070
[1000 rows x 2 columns]
In [53]: store = pd.HDFStore('test.h5',mode='w')
In [54]: store.append('df',df)
In [55]: c = store.select_column('df','index')
In [56]: where = pd.DatetimeIndex(c).indexer_between_time('12:30','4:00')
In [57]: store.select('df',where=where)
Out[57]:
0 1
2013-01-01 00:00:00 -0.467844 1.038375
2013-01-01 01:00:00 0.057419 0.914379
2013-01-01 02:00:00 -1.378131 0.187081
2013-01-01 03:00:00 0.398765 -0.122692
2013-01-01 04:00:00 0.847332 0.967856
... ... ...
2013-02-11 03:00:00 0.902023 1.416775
2013-02-11 04:00:00 -1.455099 -0.766558
2013-02-11 13:00:00 -0.786312 0.501893
2013-02-11 14:00:00 -0.280538 0.680498
2013-02-11 15:00:00 1.533521 -1.992070
[664 rows x 2 columns]
In [58]: store.close()
A couple of points to note. This reads the entire index to start with. Usually that is not a burden; if it is, you can chunk-read it (provide start/stop, though it's a bit manual to do this at the moment). I don't believe the current select_column can accept a query either.
You could potentially iterate over the days (and do individual queries) if you have a gargantuan amount of data (tens of millions of rows that are also wide), which might be more efficient.
Recombining data is relatively cheap (via concat), so don't be afraid to sub-query (though doing this too much can drag performance as well).
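To fold the original question's date restriction into the same pattern, here is one possible sketch (the file name 'qux.h5', the key 'data', and the exact bounds follow the question; treat them as assumptions):
import datetime

import numpy as np
import pandas as pd

store = pd.HDFStore('qux.h5')

# Read only the index column, then compute integer row coordinates that
# satisfy both the time-of-day window and the date range.
idx = pd.DatetimeIndex(store.select_column('data', 'index'))

time_rows = idx.indexer_between_time(datetime.time(12, 2), datetime.time(12, 4))
date_rows = np.flatnonzero((idx >= '2014-07-14') & (idx < '2014-07-16'))
coords = np.intersect1d(time_rows, date_rows)

# select() accepts the integer coordinates directly as the where argument.
result = store.select('data', where=coords)
store.close()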
I want to select all rows with a particular index. My DataFrame look like this:
>>> df
Code
Patient Date
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
2 2001-1-17 22:00:00 z
2002-1-21 00:00:00 d
2003-1-21 00:00:00 a
2005-12-1 00:00:00 ba
Selecting one of the first (Patient) index works:
>>> df.loc[1]
Code
Patient Date
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
But selecting multiple of the first (Patient) index does not:
>>> df.loc[[1, 2]]
Code
Patient Date
1 2003-01-12 00:00:00 a
2 2001-1-17 22:00:00 z
However, I would like to get the entire dataframe (the result I would get if I passed [1, 1, 1, 2], i.e. the original dataframe).
When using a single index it works fine. For example:
>>> df.reset_index().set_index("Patient").loc[[1, 2]]
Date Code
Patient
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
2 2001-1-17 22:00:00 z
2002-1-21 00:00:00 d
2003-1-21 00:00:00 a
2005-12-1 00:00:00 ba
TL;DR Why do I have to repeat the index when using multiple indexes but not when I use a single index?
EDIT: Apparently it can be done with something like:
>>> df.loc[df.index.get_level_values("Patient").isin([1, 2])]
But this seems quite dirty to me. Is this the way, or is there another, better way?
For pandas version 0.14, the recommended way, according to the comment above, is:
df.loc[([1,2],),:]
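For reference, a minimal self-contained sketch (toy data modelled on the question; the values are made up) showing that selection:
import pandas as pd

df = pd.DataFrame(
    {"Code": ["a", "b", "ba", "z", "d", "a", "ba"]},
    index=pd.MultiIndex.from_tuples(
        [(1, "2003-01-12"), (1, "2003-02-13"), (1, "2003-02-14"),
         (2, "2001-01-17"), (2, "2002-01-21"), (2, "2003-01-21"), (2, "2005-12-01")],
        names=["Patient", "Date"],
    ),
)

# The (list,) tuple selects every row whose "Patient" level is 1 or 2,
# keeping all rows of the second level.
print(df.loc[([1, 2],), :])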
I have a dataframe with IDs and timestamps as a multi-index. The index is sorted by IDs and timestamps, and I want to pick the latest timestamp for each ID. For example:
IDs timestamp value
0 2010-10-30 1
2010-11-30 2
1 2000-01-01 300
2007-01-01 33
2010-01-01 400
2 2000-01-01 11
So basically the result I want is:
IDs timestamp value
0 2010-11-30 2
1 2010-01-01 400
2 2000-01-01 11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import numpy as np
import io
content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_table(content, header=0, sep=r'\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
Using reset_index followed by groupby:
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
timestamp value
IDs
0 2010-11-30 00:00:00 2
1 2010-01-01 00:00:00 400
2 2000-01-01 00:00:00 11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To avoid skipping NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
One can also use
df.groupby("IDs").tail(1)
This will take the last row for each label in the "IDs" level and will not ignore NaN values.
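For completeness, a quick usage sketch against the setup DataFrame built earlier (before any reset_index call):
# Keeps both index levels in the result, one row per ID.
print(df.groupby("IDs").tail(1))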