Normalize heavily merged excel table - python

I was asked to do some data manipulation on an excel table with a header that's heavily merged, as in the following picture...
And here is some of the data inside the table...
If I drop the first 17 rows of the header to skip the nonsense and get to the column names, pandas still doesn't read the column names correctly because of the merged cells, and I haven't been able to figure out a way to do it using pandas yet.
Any ideas?

Related

Merging multiple excel files into a master file using python without any repeated values

I have multiple excel files with different columns and some of them have same columns with additional data added as additional columns. I created a masterfile which contain all the column headers from each excel file and now I want to export data from individual excel files into the masterfile. Ideally, each row representing all the information about one single item.
I tried merging and concatenating the files, but that adds all the data as new rows, so now I have some columns with repeated data while the additional data sits in different columns.
What I want now is to recognize the columns that are already present and fill in the new data instead of repeating all the columns, using python. I cannot share the data or the code, so I am looking for some help or an idea to get this done. Any help would be appreciated. Thanks in advance!
You are probably merging the wrong way.
I'm not sure about your masterfile; it doesn't sound very intuitive.
Make sure each row has a specific ID that identifies it.
Then always perform the merge with that id and the 'inner' merge type.
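A minimal sketch of merging on an ID, assuming two small made-up tables stand in for the Excel files (`item_id`, `name` and `price` are hypothetical column names). It also shows an outer merge for comparison, since an inner merge keeps only the IDs present in both files:

```python
import pandas as pd

# Hypothetical data standing in for two of the Excel files.
file_a = pd.DataFrame({"item_id": [1, 2, 3], "name": ["bolt", "nut", "washer"]})
file_b = pd.DataFrame({"item_id": [2, 3, 4], "price": [0.10, 0.05, 0.20]})

# Merging on the shared ID adds the new columns to the matching row
# instead of appending the data as extra rows.
inner = file_a.merge(file_b, on="item_id", how="inner")

# For comparison: an outer merge also keeps items found in only one file,
# leaving the missing columns empty (NaN) for those rows.
outer = file_a.merge(file_b, on="item_id", how="outer")
```

Which `how` value fits depends on whether items missing from one of the files should survive the merge.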

Pandas and complicated filtering and merge/join multiple sub-data frames

I have a seemingly complicated problem and I have a general idea of how I should solve it but I am not sure if it is the best way to go about it. I'll give the scenario and would appreciate any help on how to break this down. I'm fairly new with Pandas so please excuse my ignorance.
The Scenario
I have a CSV file that I import as a dataframe. The example I am working through contains 2742 rows × 136 columns. The number of rows varies but the columns are fixed. I have a set of 23 lookup tables (also CSV files) named per year and quarter (ranging from the 1st quarter of 2015 to the 3rd quarter of 2020). The lookup files are named like PPRRVU203.csv, which contains values from the 3rd quarter of 2020. Rows are matched to the lookup tables on two columns ('Code' and 'Mod'), and I use three values from the matching lookup row.
I am trying to filter sections of my data frame, pull the correct values from the matching lookup file, merge back into the original subset, and then replace into the original dataframe.
Thoughts
I can probably abstract this and wrap it in a function, but I am not sure how to place the results back into the original dataframe. My question, for those who understand Pandas better than I do: what is the best method to filter, replace the values, and write the file back out?
The straight forward solution would be to filter the original dataframe into 23 separate dataframes, then do the merge on each individual file, then concat into a new dataframe and output to CSV.
This seems highly inefficient?
I can post code but I am looking for more of any high-level thoughts?
I'm not sure exactly what your DataFrame looks like, but the DataFrame.query() method may prove useful for selecting the data.
name = df.query('columnname == "something"')
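As an alternative to filtering into 23 separate dataframes, one option is to stack the lookup tables into a single table and do one merge. The sketch below is only an illustration under assumptions: the `Period` column (holding the file suffix such as "203") and the `Value` column are hypothetical, and small in-memory tables stand in for the CSV files.

```python
import pandas as pd

# Hypothetical main data; "Period" is an assumed column matching the
# lookup file suffix (e.g. "203" for PPRRVU203.csv).
df = pd.DataFrame({
    "Period": ["203", "203", "151"],
    "Code": ["A1", "B2", "A1"],
    "Mod": ["26", "TC", "26"],
})

# Stand-ins for the lookup files; in practice each would come from
# something like pd.read_csv("PPRRVU203.csv"). "Value" is hypothetical.
lookups = {
    "203": pd.DataFrame({"Code": ["A1", "B2"], "Mod": ["26", "TC"], "Value": [1.5, 2.5]}),
    "151": pd.DataFrame({"Code": ["A1"], "Mod": ["26"], "Value": [0.9]}),
}

# Stack all lookup tables, tagging each row with the period it came from.
all_lookups = pd.concat(
    [table.assign(Period=period) for period, table in lookups.items()],
    ignore_index=True,
)

# A single left merge keeps every original row, in its original order,
# and pulls in the value from the matching period's lookup table.
result = df.merge(all_lookups, on=["Period", "Code", "Mod"], how="left")
```

The result can then be written out with `result.to_csv(...)`; rows without a match in any lookup table keep NaN in the looked-up columns.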

How to read two tables from single excel sheet using python?

please refer this image: Two tables in single excel sheet
I need dynamic python code which can read two tables from single excel sheet without specifying the header position. The number of columns and number of rows can change with time.
Please help!
It's a little hard for me to write the actual code for something like this without the excel file itself, but I can definitely tell you the strategy/steps for dealing with it. As you know, pandas treats the sheet as a single DataFrame. That means you should too. The trick is to not get fooled into thinking that this is truly structured data that works with the same logic as a structured table. Think of it less as cleaning structured data and more as telling a computer how to measure and cut a piece of paper. Instead of approaching it as two tables, think of it as one large DataFrame where rows fall into three categories:
Rows with nothing
Rows that you want to end up in the first table
Rows that you want to end up in the second table
The first thing to do is try and create a column that will sort the rows into those three groups. Looking at it, I would rely on the cells that say "information about table (1/2)". You can create a column that says 1 if the first column has "table 1", 2 if it has "table 2" and will be null otherwise. You may be worried about all of the actual table values having null values for this new column. Don't be yet.
Now, with the new column, you want to use the .ffill() method on the column. This will take all of the non-null values in the column and propagate them downwards to all available null values. At this point, all rows of the first table will have 1 for the column and the rows for the second table will have 2. We have the first major step out of the way.
Now, the first column should still have null values because you haven't done anything with it. Fortunately, the null values here only exist where the entire row is empty. Drop all rows with null values for the first column. At last, you should now be able to create two new DataFrames using Boolean masking.
e.g.: df1 = df.loc[df["filter"]==1].copy(deep=True)
You will still have the columns and headers to handle/clean up how you'd like, but at this point, it should be much easier for you to clean those up from a single table rather than two tables smashed together within a DataFrame.
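The steps above can be sketched roughly as follows. This is only an illustration: the small `raw` frame is made-up data standing in for the sheet as read with `header=None`, and the `extract` helper is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for pd.read_excel(path, header=None) on such a sheet.
raw = pd.DataFrame([
    ["information about table 1", None],
    ["name", "qty"],
    ["apple", 3],
    ["pear", 5],
    [None, None],
    ["information about table 2", None],
    ["city", "code"],
    ["Oslo", "NO"],
])

# Mark the rows that announce a table, then propagate the marker downwards.
raw["filter"] = np.nan
raw.loc[raw[0].str.contains("table 1", na=False), "filter"] = 1
raw.loc[raw[0].str.contains("table 2", na=False), "filter"] = 2
raw["filter"] = raw["filter"].ffill()

# Rows whose first column is still null are the completely empty ones.
raw = raw.dropna(subset=[0])

def extract(df, label):
    """Hypothetical helper: cut out one table and promote its header row."""
    block = df.loc[df["filter"] == label].drop(columns="filter")
    block = block.iloc[1:]                      # skip the marker row
    body = block.iloc[1:].reset_index(drop=True)
    body.columns = block.iloc[0].tolist()       # first remaining row is the header
    return body

table1 = extract(raw, 1)
table2 = extract(raw, 2)
```

Because the markers are found by content rather than by position, the same code keeps working when the number of rows in either table changes.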

Automation of splitting excel sheets based on column values with Python

Consider I have a huge excel sheet, with multiple columns and entries. However, there exists a particular column (COLUMN A) containing boolean values 0s and 1s. Now I wish to split my parent excel sheet into 2 sheets, based on the values of the COLUMN A. I already know that this can be done using VBA codes. However, I wanna try this on python.
My idea is that we can iterate through the said column values, and if a condition is satisfied, pick up the whole row and write it in a new sheet.
I am learning the language, can use numpy and pandas a bit to create linear regression models and the like. I'd like to work on this 'personal-project'. Would be glad if anyone would help me with this, provide a few hints or something to start with. Thank you.
How I would go about it:
Read the full excel sheet into a pandas dataframe
df = pd.read_excel("file_name.xlsx")
Filter the dataframe by the values in that column
df1 = df[df["COLUMN A"]==1]
df0 = df[df["COLUMN A"]==0]
Write those new dataframes to a new excel workbook, or to a new sheet in an existing workbook, using the pandas ExcelWriter: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelWriter.html
Don't forget to handle missing data in column A, if there is any.
I am just a student, so perhaps there are more efficient ways to do this, but I use pandas quite a bit in my undergraduate research and this is what I would do. Best of luck to you :)
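Put together, the approach might look like the sketch below. The function names, the sheet names and the sample data are made up for illustration; rows where COLUMN A is missing or holds anything other than 0 or 1 are collected separately so they are not silently lost.

```python
import pandas as pd

def split_by_flag(df, column="COLUMN A"):
    """Return (rows where column == 1, rows where column == 0, all other rows)."""
    ones = df[df[column] == 1]
    zeros = df[df[column] == 0]
    other = df[~df[column].isin([0, 1])]  # missing or unexpected values
    return ones, zeros, other

def write_split(df, out_path, column="COLUMN A"):
    """Write each group to its own sheet of one workbook."""
    ones, zeros, other = split_by_flag(df, column)
    with pd.ExcelWriter(out_path) as writer:
        ones.to_excel(writer, sheet_name="ones", index=False)
        zeros.to_excel(writer, sheet_name="zeros", index=False)
        if not other.empty:
            other.to_excel(writer, sheet_name="other", index=False)

# Small made-up example; the None simulates a missing flag.
sample = pd.DataFrame({"COLUMN A": [1, 0, None, 1], "value": ["a", "b", "c", "d"]})
ones, zeros, other = split_by_flag(sample)
```

Something like `write_split(pd.read_excel("file_name.xlsx"), "split.xlsx")` would then produce the workbook, assuming an Excel engine such as openpyxl is installed. Boolean filtering works on the whole column at once, so there is no need to iterate row by row.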

How to output html table with merged cells from pandas DataFrame

I have a pandas.DataFrame df as:
>>> df = pd.DataFrame([[1,2,2,2,3], [1,2,3,3,3],[1,3,2,3,5],[7,9,9,3,2]], columns=list("ABCDE"))
I want to achieve this type of table in html (with control of which cells I can merge)
I know that it can be achieved manipulating the table obtained from df.to_html() function and using jquery to expand the rowspans, yet I'm asking about the pythonic way to do it, i.e. is there a possible way of obtaining the merged table directly from some sort of pivot table / dataframe.
I thought about temporary setting the columns to merge as indexes in multi indexed data frame, however this approach is.. crude, to say the least.
It would be perfect if I had the full control of which cells can I merge, based on, for example, values of other cells in the same row.
UPDATE: I've managed to find a similar question with an extensive answer (see jpp's answer), which unfortunately rules out a simple solution to my problem. As silkworm suggested, for now the only options are to delve into the to_html source or to work with a multi-indexed dataframe.
