I have a pandas.DataFrame df defined as:
>>> df = pd.DataFrame([[1, 2, 2, 2, 3], [1, 2, 3, 3, 3], [1, 3, 2, 3, 5], [7, 9, 9, 3, 2]], columns=list("ABCDE"))
I want to achieve this type of table in HTML (with control over which cells I can merge).
I know it can be achieved by manipulating the table obtained from the df.to_html() function and using jQuery to expand the rowspans, but I'm asking about a pythonic way to do it, i.e. is there a way to obtain the merged table directly from some sort of pivot table / DataFrame?
I thought about temporarily setting the columns to merge as indexes in a multi-indexed DataFrame, but this approach is crude, to say the least.
It would be perfect if I had full control over which cells I can merge, based on, for example, the values of other cells in the same row.
UPDATE: I've managed to find a similar question with an extensive answer (see jpp's answer), which unfortunately rules out a simple solution to my problem. As silkworm suggested, for now the only options are delving into the to_html source or meddling with multi-indexing the DataFrame (sketched below).
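For reference, the multi-index workaround looks roughly like this; it only merges whole index columns (via rowspan for repeated adjacent values), not arbitrary cells:
df_merged = df.set_index(["A", "B"])
html = df_merged.to_html()  # repeated adjacent index values are rendered with rowspan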
Related
I have a seemingly complicated problem and a general idea of how to solve it, but I'm not sure it's the best way to go about it. I'll describe the scenario and would appreciate any help on how to break it down. I'm fairly new to Pandas, so please excuse my ignorance.
The Scenario
I have a CSV file that I import as a dataframe. The example I'm working through contains 2742 rows × 136 columns. The rows are variable but the columns are fixed. I have a set of 23 lookup tables (also CSV files), one per year and quarter (the range is 2020 3rd quarter back to 2015 1st quarter). The lookup files are named like PPRRVU203.csv, which contains the values for the 3rd quarter of 2020. Rows are matched against a lookup table by two columns ('Code' and 'Mod'), and I use three values associated with the match.
I am trying to filter sections of my dataframe, pull the correct values from the matching lookup file, merge them back into the subset, and then put the result back into the original dataframe.
Thoughts
I can probably abstract this and wrap it in a function, but I'm not sure how to place the values back in. My question, for those who understand Pandas better than I do: what is the best method to filter, replace the values, and write the file back out?
The straightforward solution would be to filter the original dataframe into 23 separate dataframes, do the merge on each one individually, then concat them into a new dataframe and output to CSV, as sketched below.
This seems highly inefficient, though.
I can post code, but I'm looking more for high-level thoughts.
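For concreteness, here is roughly what I mean; the 'LookupFile' column (naming the quarter file each row should be matched against) and the VAL1-VAL3 value columns are placeholders for my real names:
import pandas as pd

df = pd.read_csv("main.csv")

# split the frame by which lookup file each row needs, merge, and recombine
pieces = []
for fname, subset in df.groupby("LookupFile"):
    lookup = pd.read_csv(fname, usecols=["Code", "Mod", "VAL1", "VAL2", "VAL3"])
    pieces.append(subset.merge(lookup, on=["Code", "Mod"], how="left"))

result = pd.concat(pieces, ignore_index=True)
result.to_csv("output.csv", index=False)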
I'm not sure exactly what your DataFrame looks like, but the pandas query() method may prove useful for selecting the data:
name = df.query('columnname == "something"')
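For example, with the matching columns you describe (the values here are illustrative):
subset = df.query('Code == "12345" and Mod == "26"')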
I have data to be analysed for a project I'm working on, mostly handled with pandas at the moment since the data comes in from Excel.
I'm trying to merge some of these tables based on a column, which isn't the issue; the issue is that the tables have column names that are the same, looking something like below:
The columns that get reused are 10-30, 30-50, etc.
I want a higher-level index on the numbered columns, called something like "Percentages", "Real miles", etc., so that when I'm doing calculations later on it's easier to link up the relevant cells, and the result is more presentable at the end.
Right now I'm having difficulty producing this; the only place I've seen something akin to what I want is when people create dataframes from tuples/dictionaries, but given how large the final inputs in this project will be, I wouldn't know how to go about writing them in by hand.
I'm basically looking to have it look like below:
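Something along these lines is what I'm after (the group and bucket names are just examples):
import pandas as pd

# hypothetical layout after the merge: the same bucket labels repeat
df = pd.DataFrame([[0.4, 0.6, 120, 80]], columns=["10-30", "30-50", "10-30", "30-50"])

groups = ["Percentages"] * 2 + ["Real miles"] * 2
df.columns = pd.MultiIndex.from_arrays([groups, df.columns])

# now df["Percentages"]["10-30"] selects just that group's column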
I hope I understood your issue.
Using the merge function you can set a suffix for the columns coming from each of the dataframes, e.g.:
df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))
This way you can differentiate between the columns coming from each of your dataframes.
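A minimal example of what that produces:
import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar'], 'value': [1, 2]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar'], 'value': [3, 4]})

merged = df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))
# the overlapping 'value' columns come out as 'value_left' and 'value_right'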
I was asked to do some data manipulation on an Excel table whose header is heavily merged, as in the following picture...
And here is some of the data inside the table...
If I drop the first 17 rows of the header to get rid of the nonsense and reach the column names, the column names still aren't read correctly because of the merging, and I haven't been able to figure out a way to do it using pandas yet.
Any ideas?
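Here is roughly what I've been trying; the file name and the assumption of two merged header rows after the 17 preamble rows are mine:
import pandas as pd

df = pd.read_excel("report.xlsx", skiprows=17, header=[0, 1])

# merged header cells tend to surface as 'Unnamed: ...' placeholders;
# flatten the two levels into single clean names afterwards
df.columns = ["_".join(p for p in map(str, col) if not p.startswith("Unnamed")) for col in df.columns]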
I want to create a "presentation ready" Excel document with embedded pandas DataFrames plus additional data and formatting.
A typical document will include some titles and metadata, and several DataFrames with a sum row/column for each.
The DataFrame itself should be formatted.
The best thing I found was this, which explains how to use pandas with XlsxWriter.
The main problem is that there's no apparent method to get the exact location of the embedded DataFrame so as to add the summary row below it (the shape of the DataFrame is a good estimate, but it might not be exact when rendering complex DataFrames).
If there's a solution that relies on some kind of template rather than hard-coding, that would be even better.
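The closest I've got is pinning the location down myself with startrow and deriving the summary row from the frame's shape; a sketch, assuming a simple non-MultiIndex frame:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

with pd.ExcelWriter("report.xlsx", engine="xlsxwriter") as writer:
    startrow = 2  # leave room for a title above the frame
    df.to_excel(writer, sheet_name="Sheet1", startrow=startrow)

    worksheet = writer.sheets["Sheet1"]
    worksheet.write(0, 0, "My title")

    # one header row plus len(df) data rows sit below startrow
    total_row = startrow + len(df) + 1
    worksheet.write(total_row, 0, "Total")
    for i, col in enumerate(df.columns, start=1):
        worksheet.write(total_row, i, float(df[col].sum()))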
I need to do an apply on a dataframe using inputs from multiple rows. As a simple example, I can do the following if all the inputs are from a single row:
df['c'] = df[['a','b']].apply(lambda x: x['a'] + x['b'], axis=1)  # stand-in for the awesome stuff
# or
df['d'] = df[['b','c']].shift(1).apply(...) # to get the values from the previous row
However, if I need 'a' from the current row and 'b' from the previous row, is there a way to do that with apply? I could add a new 'bshift' column and then just use df[['a','bshift']], but it seems there must be a more direct way.
Related but separate: when accessing a specific value in the df, is there a way to combine label indexing with an integer offset? E.g. I know the label of the current row but need the row before. Something like df.at['labelIknow'-1, 'a'] (which of course doesn't work). This is for when I'm forced to iterate through rows. Thanks in advance.
Edit: some info on what I'm doing. I have a pandas store containing tables of OHLC bars (one table per security). When backtesting, I currently pull the full date range I need for a security into memory and resample it into a frequency that makes sense for the test at hand. Then I do some vectorized operations for things like trade entry signals. Finally I loop over the data from start to finish doing the actual backtest, e.g. checking for trade entry/exit, drawdown etc. This looping part is what I'm trying to speed up.
This should directly answer your question and let you use apply, although I'm not sure it's ultimately any better than a two-line solution. It does avoid creating extra variables at least.
import numpy as np

df['c'] = pd.concat([df['a'], df['a'].shift()], axis=1).apply(np.mean, axis=1)
That will put the mean of 'a' values from the current and previous rows into 'c', for example.
This isn't as general, but for simpler cases you can do something like this (continuing the mean example):
df['c'] = ( df['a'] + df['a'].shift() ) / 2
That is about 10x faster than the concat() method on my tiny example dataset. I imagine that's as fast as you could do it, if you can code it in that style.
You could also look into reshaping the data with stack() and hierarchical indexing. That would be a way to get all your variables into the same row but I think it will likely be more complicated than the concat method or just creating intermediate variables via shift().
For the first part, I don't think such a thing is possible. If you update with what you actually want to achieve, I can update this answer.
Also, looking at the second part, your data structure seems to rely an awful lot on the order of rows. This is typically not how you want to manage your data. Again, if you tell us what your overall goal is, we may be able to hint you towards a solution (and potentially a better way to structure the database).
Anyhow, one way to get the row before a given index label is:
df.loc[:'labelYouKnow'].iloc[-2]
Note that this is not optimal efficiency-wise, so you may want to improve your data structure to avoid the need for such lookups.
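Another option is to translate the label to an integer position first and offset from there:
pos = df.index.get_loc('labelYouKnow')
previous_row = df.iloc[pos - 1]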