please refer this image: Two tables in single excel sheet
I need dynamic python code which can read two tables from single excel sheet without specifying the header position. The number of columns and number of rows can change with time.
Please help!
It's a little hard for me personally to write the actual code for something like this without the excel file itself, but I can definitely tell you the strategy/steps for dealing with it. As you know, pandas treats it as a single DataFrame. That means you should too. The trick is to not get fooled into thinking that this is truly structured data and works with identical logic to a structured table. Think of what you're doing to be less similar to cleaning structured data than it is telling a computer how to measure and cut a piece of paper. Instead of approaching it as two tables, think of it as a large DataFrame where rows fall into three categories:
Rows with nothing
Rows that you want to end up in the first table
Rows that you want to end up in the second table
The first thing to do is try and create a column that will sort the rows into those three groups. Looking at it, I would rely on the cells that say "information about table (1/2)". You can create a column that says 1 if the first column has "table 1", 2 if it has "table 2" and will be null otherwise. You may be worried about all of the actual table values having null values for this new column. Don't be yet.
Now, with the new column, you want to use the .ffill() method on the column. This will take all of the non-null values in the column and propagate them downwards to all available null values. At this point, all rows of the first table will have 1 for the column and the rows for the second table will have 2. We have the first major step out of the way.
Now, the first column should still have null values because you haven't done anything with it. Fortunately, the null values here only exist where the entire row is empty. Drop all rows with null values for the first column. At last, you should now be able to create two new DataFrames using Boolean masking.
e.g.: df1 = df.loc[df["filter"]==1].copy(deep=True)
You will still have the columns and headers to handle/clean up how you'd like, but at this point, it should be much easier for you to clean those up from a single table rather than two tables smashed together within a DataFrame.
Related
I have multiple excel files with different columns and some of them have same columns with additional data added as additional columns. I created a masterfile which contain all the column headers from each excel file and now I want to export data from individual excel files into the masterfile. Ideally, each row representing all the information about one single item.
I tried merging and concatenating the files, it adds all the data as new rows so, now I have some columns with repeated data but they also contain additional data in different columns.
What I want now is to recognize the columns that are already present and fill in the new data instead of repeating the all columns using python. I cannot share the data or the code so, looking for some help or idea to get this done. Any help would be appreciated, Thanks in advance!
You are probably merging the wrong way.
Not sure about your masterfile, sounds not very intuitive.
Make sure your rows have a specific ID that identifies it.
Then always perform the merge with that id and the 'inner' merge type.
I'm manually comparing two or three rows very similar using pandas. Is there a more automated way to do this? I would like a better method than using '=='.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html
See if this will satisfy your needs.
df['sales_diff'] = df['sales'].diff()
The above code snippet creates a new column in your data frame, which contains the difference between the previous row by default. You can screw around with the parameters (axis) to compare rows or columns and you can change (period) to compare to a specific row or column.
I have a seemingly complicated problem and I have a general idea of how I should solve it but I am not sure if it is the best way to go about it. I'll give the scenario and would appreciate any help on how to break this down. I'm fairly new with Pandas so please excuse my ignorance.
The Scenario
I have a CSV file that I import as a dataframe. My example I am working through contains 2742 rows × 136 columns. The rows are variable but the columns are set. I have a set of 23 lookup tables (also as CSV files) named per year, per quarter (range is 2020 3rd quarter - 2015 1st quarter) The lookup files are named as such: PPRRVU203.csv. So that contains values from the 3rd quarter of 2020. The lookup tables are matched by two columns ('Code' and 'Mod') and I use three values that are associated in the lookup.
I am trying to filter sections of my data frame, pull the correct values from the matching lookup file, merge back into the original subset, and then replace into the original dataframe.
Thoughts
I can probably abstract this and wrap in a function but not sure how I can place back in. My question, for those that understand Pandas better than myself, what is the best method to filter, replace the values, and write the file back out.
The straight forward solution would be to filter the original dataframe into 23 separate dataframes, then do the merge on each individual file, then concat into a new dataframe and output to CSV.
This seems highly inefficient?
I can post code but I am looking for more of any high-level thoughts?
Not sure exactly how your DataFrame looks like but Pandas.query() method will maybe prove useful for the selection of data.
name = df.query('columnname == "something"')
I am trying to turn the values in a column into separate columns with the value coming from another column, similar to this post, except dynamically. In other words, I'm trying to turn a table form long to wide format, similar to the functionality of spread in r or pivot in python.
Is there a way to pivot a table in athena dynamically -- without having to hard code the columns to pull?
No, there is no way to write a query that results in different number of columns depending on the data. The columns must be known before query execution starts.
I was asked to do some data manipulation on an excel table with a head that's heavily merged as in the following picture...
And here is the some of the data inside the table...
If I tried to drop the first 17 rows of the head to drop the nonsense and get to the column names it still wouldn't read the column names correctly due to current merge, and I couldn't seem to figure a way to do it using pandas yet.
Any ideas?