I'm manually comparing two or three very similar rows using pandas. Is there a more automated way to do this? I would like a better method than using '=='.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html
See if this will satisfy your needs.
df['sales_diff'] = df['sales'].diff()
The snippet above creates a new column in your DataFrame containing the difference between each row and the previous row (the default). With DataFrame.diff you can set the axis parameter to take differences down the rows or across the columns, and you can change the periods parameter to compare against a row or column further away.
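As a small illustration of the periods parameter, here is a sketch on made-up sales figures (the data and the sales_diff_2 column name are assumptions for the example, not from the question):

```python
import pandas as pd

# Made-up data purely for illustration
df = pd.DataFrame({"sales": [100, 120, 90, 150]})

# Difference from the immediately preceding row (periods=1 is the default)
df["sales_diff"] = df["sales"].diff()

# Difference from the row two positions earlier
df["sales_diff_2"] = df["sales"].diff(periods=2)

print(df)
```

The first rows of each new column are NaN because there is no earlier row to subtract.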
Related
I have two datasets that partially overlap. The overlapping part should have identical values in two columns. However, I suspect that's not always the case. I want to check this using pandas, but I run into a problem: since the dataframes are structured differently, their row indexes do not correspond. Moreover, corresponding rows have a different "Name" or "ID". Therefore, I wanted to match rows by matching values from three other columns that I am confident are the same: latitude, longitude and number of samples (I need all three because some rows are collected at the same location and some rows may have the same number of samples).
In short, I want to formulate a condition that requires three columns in a row from either dataframe to be equal, and then check the values of the columns that I suspect are different. Unfortunately, I have not been able to formulate this problem well enough to make google find me the correct function.
Many thanks!
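One possible way to express this is an inner merge on the three trusted columns, followed by a comparison of the suspect columns. The sketch below uses invented data; the column names n_samples, depth and temp are placeholders for whatever the real columns are called.

```python
import pandas as pd

# Columns assumed to be reliable in both datasets (names are placeholders)
keys = ["latitude", "longitude", "n_samples"]

a = pd.DataFrame({
    "ID": ["A1", "A2", "A3"],
    "latitude": [10.0, 10.0, 20.0],
    "longitude": [5.0, 5.0, 6.0],
    "n_samples": [3, 4, 7],
    "depth": [1.0, 2.0, 3.0],
    "temp": [15.0, 16.0, 17.0],
})
b = pd.DataFrame({
    "Name": ["x", "y"],
    "latitude": [10.0, 20.0],
    "longitude": [5.0, 6.0],
    "n_samples": [4, 7],
    "depth": [2.0, 3.5],
    "temp": [16.0, 17.0],
})

# Keep only rows whose three key columns match in both frames;
# non-key columns with the same name get the _a / _b suffixes
merged = a.merge(b, on=keys, how="inner", suffixes=("_a", "_b"))

# Rows where either suspect column disagrees between the two sources
mismatch = merged[
    (merged["depth_a"] != merged["depth_b"])
    | (merged["temp_a"] != merged["temp_b"])
]
print(mismatch)
```

Note that merging on floating-point coordinates requires the values to be exactly equal, and that missing values compare as unequal, so rounding the keys or handling NaN first may be necessary on real data.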
I have a dataframe whose columns are derived from summary statistics of a data set. During the process, it seems an index column is created that I can't get rid of. I describe the problem in detail in the screenshots below. Can you tell me how to remove what appears to be an index, Financial Year, without exporting the dataframe to Excel?
I should also note that my attempt to drop the index at the beginning doesn't work either, as shown below.
You could use the reset_index() method. More in: https://datagy.io/pandas-drop-index-column/
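The screenshots are not available here, so the following is only a guess at the situation: it assumes the Financial Year index came from a groupby, and the amount and total names are invented for the example.

```python
import pandas as pd

# Invented data; assumes the index arose from grouping by "Financial Year"
df = pd.DataFrame({
    "Financial Year": [2019, 2019, 2020],
    "amount": [1.0, 2.0, 3.0],
})
summary = df.groupby("Financial Year").agg(total=("amount", "sum"))

# Move the index back into an ordinary column
flat = summary.reset_index()

# Or discard the index values entirely
no_year = summary.reset_index(drop=True)

print(flat)
```

Passing as_index=False to groupby is another way to avoid creating the index in the first place.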
Here is a simplified version of my dataframe; the second picture is what I want to get: https://imgur.com/a/44QgR44
An explanation: the 20201001 values are dates in a number format, and I want to sum the values for each date for some Group and Name.
Here is my issue: I tried using df.groupby(by=['Credit','Equity','Bond']).sum() but it grouped everything, not only the ones in the list (there are many more in the original dataset which I don't want to group).
The second issue is that there are two things that should be grouped into a different row (Stock and Option), so I'm not sure how I could do that with pandas.
In Excel I got the result with a simple SUMIF function.
I have a seemingly complicated problem and I have a general idea of how I should solve it but I am not sure if it is the best way to go about it. I'll give the scenario and would appreciate any help on how to break this down. I'm fairly new with Pandas so please excuse my ignorance.
The Scenario
I have a CSV file that I import as a dataframe. The example I am working through contains 2742 rows × 136 columns. The number of rows varies but the columns are fixed. I have a set of 23 lookup tables (also as CSV files) named per year, per quarter (the range is 2020 3rd quarter to 2015 1st quarter). The lookup files are named like this: PPRRVU203.csv, which contains values from the 3rd quarter of 2020. The lookup tables are matched by two columns ('Code' and 'Mod') and I use three values that are associated in the lookup.
I am trying to filter sections of my data frame, pull the correct values from the matching lookup file, merge back into the original subset, and then replace into the original dataframe.
Thoughts
I can probably abstract this and wrap it in a function, but I'm not sure how I can place the results back in. My question, for those who understand Pandas better than I do: what is the best method to filter, replace the values, and write the file back out?
The straightforward solution would be to filter the original dataframe into 23 separate dataframes, then do the merge on each individual file, then concat into a new dataframe and output to CSV.
This seems highly inefficient?
I can post code, but I am looking more for high-level thoughts.
I'm not sure exactly what your DataFrame looks like, but the DataFrame.query() method may prove useful for selecting data.
name = df.query('columnname == "something"')
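As an alternative to splitting the dataframe into 23 pieces, one option is to stack all the lookup tables into a single table tagged with their period and do one merge. This is only a sketch under assumptions: the period column in the main dataframe, the rvu value column, and the sample codes are all hypothetical, and small in-memory frames stand in for the CSV files.

```python
import pandas as pd

# Hypothetical main table: "period" says which lookup file applies to each row
df = pd.DataFrame({
    "period": ["203", "203", "151"],
    "Code": ["99213", "99214", "99213"],
    "Mod": ["", "26", ""],
})

# Stand-ins for pd.read_csv(f"PPRRVU{period}.csv"); "rvu" is a made-up value column
lookups = {
    "203": pd.DataFrame({"Code": ["99213", "99214"], "Mod": ["", "26"], "rvu": [1.3, 1.9]}),
    "151": pd.DataFrame({"Code": ["99213"], "Mod": [""], "rvu": [0.97]}),
}

# Stack every lookup into one table, tagging each row with its period
all_lookups = pd.concat(
    [table.assign(period=period) for period, table in lookups.items()],
    ignore_index=True,
)

# A single left merge keeps every original row, in its original order
result = df.merge(all_lookups, on=["period", "Code", "Mod"], how="left")

# result.to_csv("output.csv", index=False)  # write out once at the end
print(result)
```

Rows with no match in any lookup end up with NaN in the value columns, which makes missing codes easy to spot.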
Please refer to this image: Two tables in a single Excel sheet
I need dynamic Python code that can read two tables from a single Excel sheet without specifying the header position. The number of columns and the number of rows can change over time.
Please help!
It's a little hard for me to write the actual code for something like this without the Excel file itself, but I can definitely tell you the strategy/steps for dealing with it. As you know, pandas treats it as a single DataFrame. That means you should too. The trick is to not get fooled into thinking that this is truly structured data that works with the same logic as a structured table. Think of what you're doing as less like cleaning structured data and more like telling a computer how to measure and cut a piece of paper. Instead of approaching it as two tables, think of it as a large DataFrame where rows fall into three categories:
Rows with nothing
Rows that you want to end up in the first table
Rows that you want to end up in the second table
The first thing to do is try and create a column that will sort the rows into those three groups. Looking at it, I would rely on the cells that say "information about table (1/2)". You can create a column that says 1 if the first column has "table 1", 2 if it has "table 2" and will be null otherwise. You may be worried about all of the actual table values having null values for this new column. Don't be yet.
Now, with the new column, you want to use the .ffill() method on the column. This will take all of the non-null values in the column and propagate them downwards to all available null values. At this point, all rows of the first table will have 1 for the column and the rows for the second table will have 2. We have the first major step out of the way.
Now, the first column should still have null values because you haven't done anything with it. Fortunately, the null values here only exist where the entire row is empty. Drop all rows with null values for the first column. At last, you should now be able to create two new DataFrames using Boolean masking.
e.g.: df1 = df.loc[df["filter"]==1].copy(deep=True)
You will still have the columns and headers to handle/clean up how you'd like, but at this point, it should be much easier for you to clean those up from a single table rather than two tables smashed together within a DataFrame.
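The steps above can be sketched roughly as follows. Since the Excel file isn't available, a small hand-built DataFrame stands in for what pd.read_excel(path, header=None) might return; the marker text, the cell values, and the "filter" column name are assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_excel(path, header=None): two tables stacked in one sheet
raw = pd.DataFrame([
    ["information about table 1", np.nan, np.nan],
    ["id", "value", np.nan],
    [1, 10, np.nan],
    [2, 20, np.nan],
    [np.nan, np.nan, np.nan],          # completely empty separator row
    ["information about table 2", np.nan, np.nan],
    ["name", "x", "y"],
    ["a", 1.5, 2.5],
])

# Step 1: mark the rows that announce each table; every other row stays null
first = raw[0].astype(str)
raw["filter"] = np.nan
raw.loc[first.str.contains("table 1", regex=False), "filter"] = 1
raw.loc[first.str.contains("table 2", regex=False), "filter"] = 2

# Step 2: propagate each marker downwards over the rows that follow it
raw["filter"] = raw["filter"].ffill()

# Step 3: drop the rows whose first column is empty
raw = raw.dropna(subset=[0])

# Step 4: split with Boolean masks
df1 = raw.loc[raw["filter"] == 1].copy(deep=True)
df2 = raw.loc[raw["filter"] == 2].copy(deep=True)

print(df1)
print(df2)
```

Each piece still contains its title row and header row, so the remaining cleanup (promoting the header row, dropping all-null columns) can be done on df1 and df2 separately.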