I have a pandas data frame where I know the start value is the same for all columns and the end value is 90% and 80% of the 100% end value. I need to interpolate between the starting and end values and follow the trend of the 100% value (which is not linear). On a side note: I attempted a linear interpolation between the start and end but that wouldn't work, it returned the same exact dataframe, no NaNs were filled.
100% 90% 80%
Date
2017-01-01 6.000000 6 6
2017-02-01 6.000000 NaN NaN
2017-03-01 6.000000 NaN NaN
2017-04-01 6.000000 NaN NaN
... ... ...
2027-08-01 24.666667 NaN NaN
2027-09-01 24.666667 NaN NaN
2027-10-01 24.666667 NaN NaN
2027-11-01 24.666667 NaN NaN
2027-12-01 24.666667 22.2 19.7333
Solution: My problem was that the data types were objects instead of floats. After converting to floats and setting the 100% as the index I used the following to achieve the desired result:
df = df.interpolate(method='piecewise_polynomial')
Related
I'm trying to find ways of analysing log data from a git repo. I've dumped the data to a file that looks like this:
"hash","email","date","subject"
"65319af6e","jbrockmendel#gmail.com","2020-11-28","REF-IntervalIndex._assert_can_do_setop-38112"
"0bf58d8a9","simonjayhawkins#gmail.com","2020-11-28","DOC-add-contibutors-to-1.2.0-release-notes-38132"
"d16df293c","45562402+rhshadrach#users.noreply.github.com","2020-11-28","TYP-Add-cast-to-ABC-Index-like-types-38043"
"2d661a899","jbrockmendel#gmail.com","2020-11-28","CLN-simplify-mask_missing-38127"
"ba2ae2f73","jbrockmendel#gmail.com","2020-11-28","CLN-remove-unreachable-in-core.dtypes-38128"
I am able to get rows that have a count more than:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
dataframe = pd.read_csv("git-log-2020.csv")
dataframe['date'] = pd.to_datetime(dataframe['date'])
grouped_dataframe = dataframe.groupby([pd.Grouper(key='date', freq='M'), "email"])[["subject"]].count()
# Select all the users that have contributed more than 20 times.
print(grouped_dataframe[grouped_dataframe['subject'] > 20])
But I would like to be able to find the following:
What is the top three users per month?
What are the total commits for each month?
what is the average number of commits per user / per month? What is the monthly average activity of the users?
My code and data can be found here: https://github.com/mooperd/git-analysis-pandas
Ta, Andrew
All answers using your sample data that I call df
Top three :
df.groupby(pd.Grouper(key='date', freq='1M')).apply(lambda d: d['email'].value_counts().sort_values(ascending = False).head(3))
produces
date email
2020-11-30 jbrockmendel#gmail.com 3
simonjayhawkins#gmail.com 1
45562402+rhshadrach#users.noreply.github.com 1
Name: email, dtype: int64
total commits per month
df.groupby(pd.Grouper(key='date', freq='M'))['subject'].count()
output:
date
2020-11-30 5
Freq: M, Name: subject, dtype: int64
as for average number of commits per user/per month not entirely clear what you want? For each user, the ratio of the number of commits for that user by the number of months in your dataset? or the number of months they made actual commits?. Would be useful to get sample output here
But I think the following transformation is useful, it produces a table of # of commits by email and month so you can take averages every which way
df2 = df.groupby([pd.Grouper(key='date', freq='M')])['email'].value_counts().unstack(level=0)
df2
On your real data it produces a table that is too large to reproduce here but it starts with
date 2020-01-31 2020-02-29 2020-03-31 2020-04-30 2020-05-31 2020-06-30 2020-07-31 2020-08-31 2020-09-30 2020-10-31 2020-11-30
email
10430241+xh2#users.noreply.github.com NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
12420863+danchev#users.noreply.github.com NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
12769364+tnwei#users.noreply.github.com NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN
12874561+jeet-parekh#users.noreply.github.com NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
as you can see it has a lot of NaNs that correspon to users not making any commits that month, which is useful when eg calculating averages over months with commits.For example
df2.mean(axis=1).sort_values(ascending = False)
produces average monthly commits for a user (sorted), and
df2.mean(axis=0).sort_values(ascending = False)
produces average monthly commits per month (sorted)
My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
Now as it can be seen the time has been bucketed by 5 min intervals and will look at all the values of variables and try to create columns for those columns for all buckets. The bucket has assumed the very first value of the time it had bucketed with.
in order to solve this, I have tried a couple of different solutions, but can't seem to find anything without constant errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1, because we want the second index column (0 would be the first)
Then, drop the level of the multi-index you just created to make it a little bit cleaner with .droplevel().
Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
TL:DR - how do I create a dataframe/series from one or more columns in an existing non-indexed dataframe based on the column(s) containing a specific piece of text?
Relatively new to Python and data analysis and (this is my first time posting a question on Stack Overflow but I've been hunting for an answer for a long time (and used to code regularly) and not having any success.
I have a dataframe import from an Excel file that doesn't have named/indexed columns. I am trying to successfully extract data from nearly 2000 of these files which all have slightly different columns of data (of course - why make it simple... or follow a template... or simply use something other than poorly formatted Excel spreadsheets...).
The original dataframe (from a poorly structured XLS file) looks a bit like this:
0 NaN RIGHT NaN
1 Date UCVA Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
3 4 5 6 7 8 9 \
0 NaN NaN NaN NaN NaN NaN NaN
1 Cyl Axis BSCVA Pentacam remarks K1 K2 K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
... 17 18 19 20 21 22 \
0 ... NaN NaN NaN NaN NaN NaN
1 ... BSCVA Pentacam remarks K1 K2 K2 back K max
2 ... 6/5 NaN NaN NaN NaN NaN
3 ... NaN NaN NaN NaN NaN NaN
4 ... NaN Pentacam 44.3 43.7 -6.2 45.5
5 ... 6/4-4 NaN NaN NaN NaN NaN
6 ... 6/5 NaN NaN NaN NaN NaN
I want to extract a set of dataframes/series that I can then combine back together to get a 'tidy' dataframe e.g.:
1 Date R-UCVA R-Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
1 R-Cyl R-Axis R-BSCVA R-Penta R-K1 R-K2 R-K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
etc. etc. so I'm trying to write some code that will pull a series of columns that I define by looking for the words "Date" or "UCVA" etc. etc. Then I plan to stitch them back together into a single dataframe with patient identifier as an extra column. And then cycle through all the XLS files, appending the whole lot to a single CSV file that I can then do useful stuff on (like put into an Access Database - yes, I know, but it has to be easy to use and already installed on an NHS computer - and statistical analysis).
Any suggestions? I hope that's enough information.
Thanks very much in advance.
Kind regards
Vicky
Here a something that will hopefully get you started.
I have prepared a text.xlsx file:
and I can read it as follows
path = 'text.xlsx'
df = pd.read_excel(path, header=[0,1])
# Deal with two levels of headers, here I just join them together crudely
df.columns = df.columns.map(lambda h: ' '.join(h))
# Slight hack because I messed with the column names
# I create two dataframes, one with the first column, one with the second column
df1 = df[[df.columns[0],df.columns[1]]]
df2 = df[[df.columns[0], df.columns[2]]]
# Stacking them on top of each other
result = pd.concat([df1, df2])
print(result)
#Merging them on the Date column
result = pd.merge(left=df1, right=df2, on=df1.columns[0])
print(result)
This gives the output
RIGHT Sph RIGHT UCVA Unnamed: 0_level_0 Date
0 NaN 6/38 2007-01-13 00:00:00
1 NaN 6/37 2009-11-05 00:00:00
2 NaN 9/56 2009-11-18 00:00:00
0 [-2.00] NaN 2007-01-13 00:00:00
1 NaN NaN 2009-11-05 00:00:00
2 NaN NaN 2009-11-18 00:00:00
and
Unnamed: 0_level_0 Date RIGHT UCVA RIGHT Sph
0 2007-01-13 00:00:00 6/38 [-2.00]
1 2009-11-05 00:00:00 6/37 NaN
2 2009-11-18 00:00:00 9/56 NaN
Some pointers:
How to merger two header rows? See this question and answer.
How to select pandas columns conditionally? See e.g. this or this
How to merge dataframes? There is a very good guide in the pandas doc
So I have a dataFrame:
Units fcast currerr curpercent fcastcum unitscum cumerrpercent
2013-09-01 3561 NaN NaN NaN NaN NaN NaN
2013-10-01 3480 NaN NaN NaN NaN NaN NaN
2013-11-01 3071 NaN NaN NaN NaN NaN NaN
2013-12-01 3234 NaN NaN NaN NaN NaN NaN
2014-01-01 2610 2706 -96 -3.678161 2706 2610 -3.678161
2014-02-01 NaN 3117 NaN NaN 5823 NaN NaN
2014-03-01 NaN 3943 NaN NaN 9766 NaN NaN
And I want to load a value, the index of the current month which is found by getting the last item that has "units" filled in, into a variable, "curr_month" that will have a number of uses (including text display and using as a slicing operator)
This is way ugly but almost works:
curr_month=mergederrs['Units'].dropna()
curr_month=curr_month[-1:].index
curr_month
But curr_month is
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01]
Length: 1, Freq: None, Timezone: None
Which is Unhashable, so this fails
mergederrs[curr_month:]
The docs are great for creating the DF but a bit sparse of getting individual items out!
I'd probably write
>>> df.Units.last_valid_index()
Timestamp('2014-01-01 00:00:00')
but a slight tweak on your approach should work too:
>>> df.Units.dropna().index[-1]
Timestamp('2014-01-01 00:00:00')
It's the difference between somelist[-1:] and somelist[-1].
[Note that I'm assuming that all of the nan values come at the end. If there are valids and then NaNs and then valids, and you want the last valid in the first group, that would be slightly different.]
I have a dictionary name date_dict keyed by datetime dates with values corresponding to integer counts of observations. I convert this to a sparse series/dataframe with censored observations that I would like to join or convert to a series/dataframe with continuous dates. The nasty list comprehension is my hack to get around the fact that pandas apparently won't automatically covert datetime date objects to an appropriate DateTime index.
df1 = pd.DataFrame(data=date_dict.values(),
index=[datetime.datetime.combine(i, datetime.time())
for i in date_dict.keys()],
columns=['Name'])
df1 = df1.sort(axis=0)
This example has 1258 observations and the DateTime index runs from 2003-06-24 to 2012-11-07.
df1.head()
Name
Date
2003-06-24 2
2003-08-13 1
2003-08-19 2
2003-08-22 1
2003-08-24 5
I can create an empty dataframe with a continuous DateTime index, but this introduces an unneeded column and seems clunky. I feel as though I'm missing a more elegant solution involving a join.
df2 = pd.DataFrame(data=None,columns=['Empty'],
index=pd.DateRange(min(date_dict.keys()),
max(date_dict.keys())))
df3 = df1.join(df2,how='right')
df3.head()
Name Empty
2003-06-24 2 NaN
2003-06-25 NaN NaN
2003-06-26 NaN NaN
2003-06-27 NaN NaN
2003-06-30 NaN NaN
Is there a simpler or more elegant way to fill a continuous dataframe from a sparse dataframe so that there is (1) a continuous index, (2) the NaNs are 0s, and (3) there is no left-over empty column in the dataframe?
Name
2003-06-24 2
2003-06-25 0
2003-06-26 0
2003-06-27 0
2003-06-30 0
You can just use reindex on a time series using your date range. Also it looks like you would be better off using a TimeSeries instead of a DataFrame (see documentation), although reindexing is also the correct method for adding missing index values to DataFrames as well.
For example, starting with:
date_index = pd.DatetimeIndex([pd.datetime(2003,6,24), pd.datetime(2003,8,13),
pd.datetime(2003,8,19), pd.datetime(2003,8,22), pd.datetime(2003,8,24)])
ts = pd.Series([2,1,2,1,5], index=date_index)
Gives you a time series like your example dataframe's head:
2003-06-24 2
2003-08-13 1
2003-08-19 2
2003-08-22 1
2003-08-24 5
Simply doing
ts.reindex(pd.date_range(min(date_index), max(date_index)))
then gives you a complete index, with NaNs for your missing values (you can use fillna if you want to fill the missing values with some other values - see here):
2003-06-24 2
2003-06-25 NaN
2003-06-26 NaN
2003-06-27 NaN
2003-06-28 NaN
2003-06-29 NaN
2003-06-30 NaN
2003-07-01 NaN
2003-07-02 NaN
2003-07-03 NaN
2003-07-04 NaN
2003-07-05 NaN
2003-07-06 NaN
2003-07-07 NaN
2003-07-08 NaN
2003-07-09 NaN
2003-07-10 NaN
2003-07-11 NaN
2003-07-12 NaN
2003-07-13 NaN
2003-07-14 NaN
2003-07-15 NaN
2003-07-16 NaN
2003-07-17 NaN
2003-07-18 NaN
2003-07-19 NaN
2003-07-20 NaN
2003-07-21 NaN
2003-07-22 NaN
2003-07-23 NaN
2003-07-24 NaN
2003-07-25 NaN
2003-07-26 NaN
2003-07-27 NaN
2003-07-28 NaN
2003-07-29 NaN
2003-07-30 NaN
2003-07-31 NaN
2003-08-01 NaN
2003-08-02 NaN
2003-08-03 NaN
2003-08-04 NaN
2003-08-05 NaN
2003-08-06 NaN
2003-08-07 NaN
2003-08-08 NaN
2003-08-09 NaN
2003-08-10 NaN
2003-08-11 NaN
2003-08-12 NaN
2003-08-13 1
2003-08-14 NaN
2003-08-15 NaN
2003-08-16 NaN
2003-08-17 NaN
2003-08-18 NaN
2003-08-19 2
2003-08-20 NaN
2003-08-21 NaN
2003-08-22 1
2003-08-23 NaN
2003-08-24 5
Freq: D, Length: 62