Pandas read_excel sometimes creates index even when index_col=None - python

I'm trying to read an Excel file into a DataFrame, and I want to set the index later, so I don't want pandas to use column 0 for the index values.
By default (index_col=None) it shouldn't use column 0 as the index, but I find that if cell A1 of the worksheet is empty, it does.
Is there any way to override this behaviour? (I am loading many sheets that have no value in cell A1.)
This works as expected when test1.xlsx has the value "DATE" in cell A1:
In [19]: pd.read_excel('test1.xlsx')
Out[19]:
DATE A B C
0 2018-01-01 00:00:00 0.766895 1.142639 0.810603
1 2018-01-01 01:00:00 0.605812 0.890286 0.810603
2 2018-01-01 02:00:00 0.623123 1.053022 0.810603
3 2018-01-01 03:00:00 0.740577 1.505082 0.810603
4 2018-01-01 04:00:00 0.335573 -0.024649 0.810603
But when the worksheet has no value in cell A1, it automatically assigns column 0 values to the index:
In [20]: pd.read_excel('test2.xlsx', index_col=None)
Out[20]:
A B C
2018-01-01 00:00:00 0.766895 1.142639 0.810603
2018-01-01 01:00:00 0.605812 0.890286 0.810603
2018-01-01 02:00:00 0.623123 1.053022 0.810603
2018-01-01 03:00:00 0.740577 1.505082 0.810603
2018-01-01 04:00:00 0.335573 -0.024649 0.810603
This is not what I want.
Desired result: Same as first example (but with 'Unnamed' as the column label perhaps).
The documentation says:
index_col : int, list of int, default None.
Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column.

The issue you're describing matches a known pandas bug, which was fixed in the pandas 0.24.0 release:
Bug Fixes
Bug in read_excel() in which index_col=None was not being respected and parsing index columns anyway (GH18792, GH20480)
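If you are stuck on an older pandas, a minimal workaround (a sketch, not part of the original answer) is to let the buggy parse happen and then move the index back into a regular column; the 'DATE' name below is just illustrative:
df = pd.read_excel('test2.xlsx')  # buggy versions parse column 0 as the index anyway
df = df.reset_index()  # moves the dates back into an ordinary column named 'index'
df = df.rename(columns={'index': 'DATE'})  # optional rename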

You can also use index_col=0 instead of index_col=None.

I have been facing essentially the same issue for the last couple of days.
I have an Excel file whose first column header is blank, so when it is read that column gets used as the index.
I tried many options, but the code below, using skiprows instead of the header option, works. Interestingly, skiprows uses the "Unnamed: 0" naming pattern for columns that do not have a header, whereas the header option did not. We are using pandas version 0.20.1:
df = pd.read_excel("ABC.xlsx", dtype=str, sheetname='Supply', skiprows=6, usecols=mycols)
df.columns
Index([ 'Unnamed: 0', 2015-01-01 00:00:00, 2015-02-01 00:00:00,
2015-03-01 00:00:00, 2015-04-01 00:00:00, 2015-05-01 00:00:00,
2015-06-01 00:00:00, 2015-07-01 00:00:00, 2015-08-01 00:00:00,
2015-09-01 00:00:00,
...
],
dtype='object', length=120)
The documentation does not provide any more detail on this, but the above workaround can save your day.
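As a follow-up, if the placeholder header gets in the way downstream, that column can simply be renamed (the target name here is illustrative):
df = df.rename(columns={'Unnamed: 0': 'DATE'})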

Related

Time Series Data Reformat

I am working on some code that will rearrange a time series. Currently I have a standard time series with three columns, headed [Date, Time, Value]. I want to reformat the DataFrame to index by date and use the time as the column header (i.e. 0:00, 1:00, ..., 23:00), with the values filling in the frame.
Here is the DataFrame I currently have.
Essentially I'd like to move the index to a single day per row and show the hours across the columns.
Thanks,
Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns and with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[0:10]
time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00
date
2019-01-01 0.732494 0.087657 0.930405 0.958965 0.531928 0.891228 0.664634 0.432684 0.009653 0.604878
2019-01-02 0.471386 0.575126 0.509707 0.715290 0.337983 0.618632 0.413530 0.849033 0.725556 0.186876
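One caveat worth adding (not from the original answer): pivot requires each (Date, Time) pair to be unique and raises a ValueError if there are duplicates. If your data may contain duplicate timestamps, pivot_table with an aggregation function is the usual fallback:
df = df.pivot_table(index='Date', columns='Time', values='Total', aggfunc='mean')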
You could try this: split the time part to get only the hour, and prefix it with hr.
df = pd.DataFrame([['2019-01-01', '00:00:00', -127.57],
                   ['2019-01-01', '01:00:00', -137.57],
                   ['2019-01-02', '00:00:00', -147.57]],
                  columns=['Date', 'Time', 'Totals'])
df['hours'] = df['Time'].apply(lambda x: 'hr' + str(int(x.split(':')[0])))
print(pd.pivot_table(df, values='Totals', index=['Date'], columns='hours'))
Output
hours hr0 hr1
Date
2019-01-01 -127.57 -137.57
2019-01-02 -147.57 NaN

Copy row to another dataframe

I have two DataFrames, each with a DatetimeIndex, and I would like to copy a column from one to the other. The DataFrames are:
variable: df
DateTime
2013-01-01 01:00:00 0.0
2013-01-01 02:00:00 0.0
2013-01-01 03:00:00 0.0
....
Freq: H, Length: 8759, dtype: float64
variable: consumption_year
Potência Ativa ... Costs
Datetime ...
2019-01-01 00:00:00 11.500000 ... 1.08874
2019-01-01 01:00:00 6.500000 ... 0.52016
2019-01-01 02:00:00 5.250000 ... 0.38183
2019-01-01 03:00:00 5.250000 ... 0.38183
[8760 rows x 5 columns]
here is my code:
mc.run_model(tmy_data)
df=round(mc.ac.fillna(0)/1000,3)
consumption_year['PVProduction'] = df.iloc[:,[1]] #1
consumption_year['PVProduction'] = df[:,1] #2
I am trying to copy the second column of df to a new column in the consumption_year DataFrame, but neither of those attempts worked. Looking at the indexes, I see three major differences:
year (2013 and 2019)
starting hour: 01:00 and 00:00
length: 8760 and 8759
Do I need to resolve those three differences first (making the datetime index of df match consumption_year's) before I can copy the data? If so, could you suggest a way to fix them?
These are the errors:
1: consumption_year['PVProduction'] = df.iloc[:,[1]]
raise IndexingError("Too many indexers")
pandas.core.indexing.IndexingError: Too many indexers
2: consumption_year['PVProduction'] = df[:,1]
raise ValueError("Can only tuple-index with a MultiIndex")
ValueError: Can only tuple-index with a MultiIndex
You can merge the two DataFrames together:
pd.merge(df, consumption_year, left_index=True, right_index=True, how='outer')
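Note also that df in the question is a Series (see the dtype: float64 output), which is why both two-axis indexers fail. If the goal is simply to paste the 2013 values alongside the 2019 rows positionally, ignoring the mismatched years, here is a sketch (the one-row length difference is padded with NaN):
# df has 8759 values, consumption_year has 8760 rows;
# align on the first 8759 labels and let the last row be NaN
consumption_year['PVProduction'] = pd.Series(
    df.to_numpy(), index=consumption_year.index[:len(df)]
)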

Pandas datetime columns problem and I don't know what I am missing

I am a Korean student, so please understand that my English is awkward.
I want to split a datetime column into separate columns: year, month, ..., second.
train = pd.read_csv('input/Train.csv')
The datetime column looks like this
(this is head(20), and I removed the other columns to make it easier to see):
datetime
0 2011-01-01 00:00:00
1 2011-01-01 01:00:00
2 2011-01-01 02:00:00
3 2011-01-01 03:00:00
4 2011-01-01 04:00:00
5 2011-01-01 05:00:00
6 2011-01-01 06:00:00
7 2011-01-01 07:00:00
8 2011-01-01 08:00:00
9 2011-01-01 09:00:00
10 2011-01-01 10:00:00
11 2011-01-01 11:00:00
12 2011-01-01 12:00:00
13 2011-01-01 13:00:00
14 2011-01-01 14:00:00
15 2011-01-01 15:00:00
16 2011-01-01 16:00:00
17 2011-01-01 17:00:00
18 2011-01-01 18:00:00
19 2011-01-01 19:00:00
Then I wrote this code to get each component as its own column (year, month, day, hour, minute, second):
train['year'] = train['datetime'].dt.year
train['month'] = train['datetime'].dt.month
train['day'] = train['datetime'].dt.day
train['hour'] = train['datetime'].dt.hour
train['minute'] = train['datetime'].dt.minute
train['second'] = train['datetime'].dt.second
and got this error:
AttributeError: Can only use .dt accessor with datetimelike values
please help me ㅠㅅㅠ
Note that by default read_csv is able to deduce the column type only for numeric and boolean columns. Unless explicitly told otherwise (e.g. by passing the converters or dtype parameters), all other input is left as strings, and the pandas dtype of such columns is object. That is exactly what occurred in your case.
So, since this column is of object type, you cannot invoke the .dt accessor on it; it works only on columns of datetime type.
Actually, in this case, you can take the following approach:
do not specify any conversion for this column (it will be parsed just as object);
after that, split the datetime column into its "parts" using str.split (all 6 columns with a single instruction);
set proper column names on the resulting DataFrame;
join it to the original DataFrame (then drop the working frame);
only now change the type of the original column.
To do it, you can run:
wrk = df['datetime'].str.split(r'[- :]', expand=True).astype(int)
wrk.columns = ['year', 'month', 'day', 'hour', 'minute', 'second']
df = df.join(wrk)
del wrk
df['datetime'] = pd.to_datetime(df['datetime'])
Note that I added astype(int). Otherwise these columns would be left as
object (actually string) type.
Or maybe the original column is not needed any more (as you have extracted all the date / time components)? In that case, drop the column instead of converting it.
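For comparison, here is a minimal sketch of the more direct route, converting first and then using the .dt accessor (assuming the strings are in a format pd.to_datetime can parse):
df['datetime'] = pd.to_datetime(df['datetime'])
df['year'] = df['datetime'].dt.year  # likewise .dt.month, .dt.day, .dt.hour, .dt.minute, .dt.second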
And one last hint: datetime is widely used as a type name (with various endings), so it is better to use some other name for the column, at least differing in case, e.g. DateTime.

Switching row header and column header in python

I have a .csv file that has data something like this:
#file...out/houses.csv
#data...sun may 1 11:20:43 2011
#user...abs12
#host...(null)
#group...class=house
#property..change_per_hour
#limit...0
#interval..10000000
#timestamp,house_0,house_1,house_2,house_3,.....,house_1000
2010-07-01 00:00:00 EDT,1.2,1.3,1.4,1.5,........,9.72
2010-07-01 01:00:00 EDT,2.2,2.3,2.4,2.5,........,19.72
2010-07-01 02:00:00 EDT,3.2,3.3,3.4,3.5,........,29.72
2010-07-01 05:00:00 EDT,5.2,5.3,5.4,5.5,........,59.72
2010-07-01 06:00:00 EDT,6.2,,6.4,,..............,
...
I want to convert this and save to a new .csv and the data should look like:
#file...out/houses.csv
#data...sun may 1 11:20:43 2011
#user...abs12
#host...(null)
#group...class=house
#property..change_per_hour
#limit...0
#interval..10000000
#EntityName,2010-07-01 00:00:00 EDT,2010-07-01 01:00:00 EDT,2010-07-01 02:00:00 EDT,2010-07-01 05:00:00 EDT,2010-07-01 06:00:00 EDT
house_0,1.2,2.2,3.2,5.2,6.2,...
house_1,1.3,2.3,3.3,5.3,,...
house_2,1.4,2.4,3.4,5.4,6.4,...
house_3,1.5,2.5,3.5,5.5,,...
...
house_1000,9.72,19.72,29.72,59.72,
I tried to use pandas: the idea was to convert to a dictionary that looks like dtDict = {'house_0': {'datetimestamp_1': 'value_1', 'datetimestamp_2': 'value_2', ...}, ...}, but I was not able to build that dictionary and pass it to pandas.DataFrame(dtDict) to do the conversion. I don't have to use pandas (anything in Python is fine), but I thought pandas was good for CSV manipulation. Any help?
Assuming it is in a pandas dataframe already, this works:
df = pd.DataFrame(
    data=[[1, 3], [2, 5]],
    index=[0, 1],
    columns=['a', 'b']
)
Output:
>>> print(df)
a b
0 1 3
1 2 5
Then, transpose the dataframe:
>>> print(df.transpose())
0 1
a 1 2
b 3 5
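Applied to the file from the question, a sketch (assuming the metadata block is always the first 8 lines; read_csv skips them here, so they would have to be re-written to the output separately, and the output file name is illustrative):
import pandas as pd
# skip the 8 '#...' metadata lines; the '#timestamp,house_0,...' line becomes the header
df = pd.read_csv('out/houses.csv', skiprows=8, index_col=0)
out = df.T  # houses become rows, timestamps become columns
out.index.name = '#EntityName'
out.to_csv('out/houses_transposed.csv')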

How to access last element of a multi-index dataframe

I have a DataFrame with IDs and timestamps as a MultiIndex. The index is sorted by ID and timestamp, and I want to pick the latest timestamp for each ID. For example:
IDs timestamp value
0 2010-10-30 1
2010-11-30 2
1 2000-01-01 300
2007-01-01 33
2010-01-01 400
2 2000-01-01 11
So basically the result I want is:
IDs timestamp value
0 2010-11-30 2
1 2010-01-01 400
2 2000-01-01 11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import numpy as np
import io
content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_table(content, header=0, sep=r'\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
Using reset_index followed by groupby:
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
timestamp value
IDs
0 2010-11-30 00:00:00 2
1 2010-01-01 00:00:00 400
2 2000-01-01 00:00:00 11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To not skip NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
One can also use
df.groupby("IDs").tail(1)
This will take the last row of each label in level "IDs" and will not ignore NaN values.
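With the sample data above, that call keeps the full MultiIndex and should yield (sketch):
              value
IDs timestamp
0   2010-11-30      2
1   2010-01-01    400
2   2000-01-01     11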
