Val ts year doy interpolat region_id
2000-02-18 NaN 950832000 2000 49 NaN 19987
2000-03-05 NaN 952214400 2000 65 NaN 19987
2000-03-21 NaN 953596800 2000 81 NaN 19987
2000-04-06 0.402539365 954979200 2000 97 NaN 19987
2000-04-22 0.54021746 956361600 2000 113 NaN 19987
The above dataframe has a datetime index. I resample it like so:
df = df.resample('D')
However, this resampling results in this dataframe:
ts year doy interpolat region_id
2000-01-01 1199180160 2008 1 1 19990
2000-01-02 NaN NaN NaN NaN NaN
2000-01-03 NaN NaN NaN NaN NaN
2000-01-04 NaN NaN NaN NaN NaN
2000-01-05 NaN NaN NaN NaN NaN
Why did the 'Val' column disappear? and all the other columns seem messed up too. See Linearly interpolate missing rows in pandas dataframe for an explanation of where the dataframe is coming from.
--EDIT
Based on #unutbu's questions:
df.reset_index().to_dict('list')
{'index': [Timestamp('2000-02-18 00:00:00'), Timestamp('2000-03-05 00:00:00'), Timestamp('2000-03-21 00:00:00'), ... '0.670709965', '0.631584375', '0.562112815', '0.50740686', '0.4447712', '0.47880806', nan, nan]}
-- EDIT: The csv file for the above data frame in its entirety is here:
https://www.dropbox.com/s/dp76hk6yfs6c1og/test.csv?dl=0
The Val columns will probably not have a numerical dtype for some reason, and all non-numerical (eg object dtype) columns are removed in resample.
To check, just look at df.info().
To convert it to a numerical columns, you can use astype(float) or the convert_objects (pd.to_numeric starting from v0.17).
Related
My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
Now as it can be seen the time has been bucketed by 5 min intervals and will look at all the values of variables and try to create columns for those columns for all buckets. The bucket has assumed the very first value of the time it had bucketed with.
in order to solve this, I have tried a couple of different solutions, but can't seem to find anything without constant errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1, because we want the second index column (0 would be the first)
Then, drop the level of the multi-index you just created to make it a little bit cleaner with .droplevel().
Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
I have a pandas data-frame with multiple occurrence of particular values. I want to either remove all the values that are duplicates or replace with NaN and finally get the name of column that has any number of unique values. Pandas drop_duplicates function only removes the rows that has duplicate value but I want to remove the values/cells in data-frame. Is there a solution for this?
Based on the input dataframe below, all the values except the first row of column "02" have duplicate occurrence in the dataframe, so column "02" is what I want. If the question is not clear please do let me know. Thanks.
DF:
02 03:10 03:02 03:02:09
0 6716 45355 45355 45355
1 4047 4047 7411 7411
2 945 2478 2478 945
Expected output:
col_with_unique_val = "02"
or
Expected output DF:
02 03:10 03:02 03:02:09
0 6716 NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
or
Expected output DF:
02
0 6716
Here is one way
df.mask(df.apply(pd.Series.duplicated,keep=False,axis=1))
02 03:10 03:02 03:02:09
0 6716.0 NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
df.mask(df.apply(pd.Series.duplicated,keep=False,axis=1)).stack().index.get_level_values(1)
Index(['02'], dtype='object')
stack, then check duplicated. where to make all non-uniques NaN
df1 = df.stack()
uniques = df1[~df1.duplicated(keep=False)].tolist()
df.where(df.isin(uniques))
# 02 03:10 03:02 03:02:09
#0 6716.0 NaN NaN NaN
#1 NaN NaN NaN NaN
#2 NaN NaN NaN NaN
df.isin(uniques).any().loc[lambda x: x].index
#Index(['02'], dtype='object')
TL:DR - how do I create a dataframe/series from one or more columns in an existing non-indexed dataframe based on the column(s) containing a specific piece of text?
Relatively new to Python and data analysis and (this is my first time posting a question on Stack Overflow but I've been hunting for an answer for a long time (and used to code regularly) and not having any success.
I have a dataframe import from an Excel file that doesn't have named/indexed columns. I am trying to successfully extract data from nearly 2000 of these files which all have slightly different columns of data (of course - why make it simple... or follow a template... or simply use something other than poorly formatted Excel spreadsheets...).
The original dataframe (from a poorly structured XLS file) looks a bit like this:
0 NaN RIGHT NaN
1 Date UCVA Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
3 4 5 6 7 8 9 \
0 NaN NaN NaN NaN NaN NaN NaN
1 Cyl Axis BSCVA Pentacam remarks K1 K2 K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
... 17 18 19 20 21 22 \
0 ... NaN NaN NaN NaN NaN NaN
1 ... BSCVA Pentacam remarks K1 K2 K2 back K max
2 ... 6/5 NaN NaN NaN NaN NaN
3 ... NaN NaN NaN NaN NaN NaN
4 ... NaN Pentacam 44.3 43.7 -6.2 45.5
5 ... 6/4-4 NaN NaN NaN NaN NaN
6 ... 6/5 NaN NaN NaN NaN NaN
I want to extract a set of dataframes/series that I can then combine back together to get a 'tidy' dataframe e.g.:
1 Date R-UCVA R-Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
1 R-Cyl R-Axis R-BSCVA R-Penta R-K1 R-K2 R-K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
etc. etc. so I'm trying to write some code that will pull a series of columns that I define by looking for the words "Date" or "UCVA" etc. etc. Then I plan to stitch them back together into a single dataframe with patient identifier as an extra column. And then cycle through all the XLS files, appending the whole lot to a single CSV file that I can then do useful stuff on (like put into an Access Database - yes, I know, but it has to be easy to use and already installed on an NHS computer - and statistical analysis).
Any suggestions? I hope that's enough information.
Thanks very much in advance.
Kind regards
Vicky
Here a something that will hopefully get you started.
I have prepared a text.xlsx file:
and I can read it as follows
path = 'text.xlsx'
df = pd.read_excel(path, header=[0,1])
# Deal with two levels of headers, here I just join them together crudely
df.columns = df.columns.map(lambda h: ' '.join(h))
# Slight hack because I messed with the column names
# I create two dataframes, one with the first column, one with the second column
df1 = df[[df.columns[0],df.columns[1]]]
df2 = df[[df.columns[0], df.columns[2]]]
# Stacking them on top of each other
result = pd.concat([df1, df2])
print(result)
#Merging them on the Date column
result = pd.merge(left=df1, right=df2, on=df1.columns[0])
print(result)
This gives the output
RIGHT Sph RIGHT UCVA Unnamed: 0_level_0 Date
0 NaN 6/38 2007-01-13 00:00:00
1 NaN 6/37 2009-11-05 00:00:00
2 NaN 9/56 2009-11-18 00:00:00
0 [-2.00] NaN 2007-01-13 00:00:00
1 NaN NaN 2009-11-05 00:00:00
2 NaN NaN 2009-11-18 00:00:00
and
Unnamed: 0_level_0 Date RIGHT UCVA RIGHT Sph
0 2007-01-13 00:00:00 6/38 [-2.00]
1 2009-11-05 00:00:00 6/37 NaN
2 2009-11-18 00:00:00 9/56 NaN
Some pointers:
How to merger two header rows? See this question and answer.
How to select pandas columns conditionally? See e.g. this or this
How to merge dataframes? There is a very good guide in the pandas doc
So I have a dataFrame:
Units fcast currerr curpercent fcastcum unitscum cumerrpercent
2013-09-01 3561 NaN NaN NaN NaN NaN NaN
2013-10-01 3480 NaN NaN NaN NaN NaN NaN
2013-11-01 3071 NaN NaN NaN NaN NaN NaN
2013-12-01 3234 NaN NaN NaN NaN NaN NaN
2014-01-01 2610 2706 -96 -3.678161 2706 2610 -3.678161
2014-02-01 NaN 3117 NaN NaN 5823 NaN NaN
2014-03-01 NaN 3943 NaN NaN 9766 NaN NaN
And I want to load a value, the index of the current month which is found by getting the last item that has "units" filled in, into a variable, "curr_month" that will have a number of uses (including text display and using as a slicing operator)
This is way ugly but almost works:
curr_month=mergederrs['Units'].dropna()
curr_month=curr_month[-1:].index
curr_month
But curr_month is
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01]
Length: 1, Freq: None, Timezone: None
Which is Unhashable, so this fails
mergederrs[curr_month:]
The docs are great for creating the DF but a bit sparse of getting individual items out!
I'd probably write
>>> df.Units.last_valid_index()
Timestamp('2014-01-01 00:00:00')
but a slight tweak on your approach should work too:
>>> df.Units.dropna().index[-1]
Timestamp('2014-01-01 00:00:00')
It's the difference between somelist[-1:] and somelist[-1].
[Note that I'm assuming that all of the nan values come at the end. If there are valids and then NaNs and then valids, and you want the last valid in the first group, that would be slightly different.]
I have a dictionary name date_dict keyed by datetime dates with values corresponding to integer counts of observations. I convert this to a sparse series/dataframe with censored observations that I would like to join or convert to a series/dataframe with continuous dates. The nasty list comprehension is my hack to get around the fact that pandas apparently won't automatically covert datetime date objects to an appropriate DateTime index.
df1 = pd.DataFrame(data=date_dict.values(),
index=[datetime.datetime.combine(i, datetime.time())
for i in date_dict.keys()],
columns=['Name'])
df1 = df1.sort(axis=0)
This example has 1258 observations and the DateTime index runs from 2003-06-24 to 2012-11-07.
df1.head()
Name
Date
2003-06-24 2
2003-08-13 1
2003-08-19 2
2003-08-22 1
2003-08-24 5
I can create an empty dataframe with a continuous DateTime index, but this introduces an unneeded column and seems clunky. I feel as though I'm missing a more elegant solution involving a join.
df2 = pd.DataFrame(data=None,columns=['Empty'],
index=pd.DateRange(min(date_dict.keys()),
max(date_dict.keys())))
df3 = df1.join(df2,how='right')
df3.head()
Name Empty
2003-06-24 2 NaN
2003-06-25 NaN NaN
2003-06-26 NaN NaN
2003-06-27 NaN NaN
2003-06-30 NaN NaN
Is there a simpler or more elegant way to fill a continuous dataframe from a sparse dataframe so that there is (1) a continuous index, (2) the NaNs are 0s, and (3) there is no left-over empty column in the dataframe?
Name
2003-06-24 2
2003-06-25 0
2003-06-26 0
2003-06-27 0
2003-06-30 0
You can just use reindex on a time series using your date range. Also it looks like you would be better off using a TimeSeries instead of a DataFrame (see documentation), although reindexing is also the correct method for adding missing index values to DataFrames as well.
For example, starting with:
date_index = pd.DatetimeIndex([pd.datetime(2003,6,24), pd.datetime(2003,8,13),
pd.datetime(2003,8,19), pd.datetime(2003,8,22), pd.datetime(2003,8,24)])
ts = pd.Series([2,1,2,1,5], index=date_index)
Gives you a time series like your example dataframe's head:
2003-06-24 2
2003-08-13 1
2003-08-19 2
2003-08-22 1
2003-08-24 5
Simply doing
ts.reindex(pd.date_range(min(date_index), max(date_index)))
then gives you a complete index, with NaNs for your missing values (you can use fillna if you want to fill the missing values with some other values - see here):
2003-06-24 2
2003-06-25 NaN
2003-06-26 NaN
2003-06-27 NaN
2003-06-28 NaN
2003-06-29 NaN
2003-06-30 NaN
2003-07-01 NaN
2003-07-02 NaN
2003-07-03 NaN
2003-07-04 NaN
2003-07-05 NaN
2003-07-06 NaN
2003-07-07 NaN
2003-07-08 NaN
2003-07-09 NaN
2003-07-10 NaN
2003-07-11 NaN
2003-07-12 NaN
2003-07-13 NaN
2003-07-14 NaN
2003-07-15 NaN
2003-07-16 NaN
2003-07-17 NaN
2003-07-18 NaN
2003-07-19 NaN
2003-07-20 NaN
2003-07-21 NaN
2003-07-22 NaN
2003-07-23 NaN
2003-07-24 NaN
2003-07-25 NaN
2003-07-26 NaN
2003-07-27 NaN
2003-07-28 NaN
2003-07-29 NaN
2003-07-30 NaN
2003-07-31 NaN
2003-08-01 NaN
2003-08-02 NaN
2003-08-03 NaN
2003-08-04 NaN
2003-08-05 NaN
2003-08-06 NaN
2003-08-07 NaN
2003-08-08 NaN
2003-08-09 NaN
2003-08-10 NaN
2003-08-11 NaN
2003-08-12 NaN
2003-08-13 1
2003-08-14 NaN
2003-08-15 NaN
2003-08-16 NaN
2003-08-17 NaN
2003-08-18 NaN
2003-08-19 2
2003-08-20 NaN
2003-08-21 NaN
2003-08-22 1
2003-08-23 NaN
2003-08-24 5
Freq: D, Length: 62