Basically, here is a simplified version of my dataframe, and the second picture is what I want to get: https://imgur.com/a/44QgR44
An explanation: the 20201001 values are dates in number format, and I want to sum up the values for each date for certain Group and Name combinations.
Here comes my issue: I tried using df.groupby(by=['Credit','Equity','Bond']).sum(), but it grouped everything up, not only the ones in the list (there are many more groups in the original dataset which I don't want to sum up).
The second issue is that two of the groups (Stock and Option) should be combined into a single row, and I'm not sure how to do that with pandas.
In Excel I got the result with a simple SUMIF function.
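A rough sketch of what the SUMIF-style grouping could look like in pandas. The data and the 'Group'/'Name' column names are assumptions, since the real layout is only visible in the linked image:

import pandas as pd

# Hypothetical data; the real column names are only shown in the linked image.
df = pd.DataFrame({
    "Group": ["Credit", "Equity", "Bond", "Stock", "Option", "FX"],
    "Name":  ["A", "A", "B", "B", "B", "C"],
    "20201001": [1, 2, 3, 4, 5, 6],
    "20201002": [7, 8, 9, 10, 11, 12],
})

keep = ["Credit", "Equity", "Bond"]
combine = {"Stock": "Stock+Option", "Option": "Stock+Option"}

# Keep only the groups of interest, fold Stock and Option into one label,
# then sum the date columns per (Group, Name) -- the pandas analogue of SUMIF.
subset = df[df["Group"].isin(keep + list(combine))].copy()
subset["Group"] = subset["Group"].replace(combine)
result = subset.groupby(["Group", "Name"]).sum(numeric_only=True)
print(result)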
I have a dataframe whose columns are derived from summary statistics of a data set. During the process it seems an index column is created that I can't get rid of. I describe the problem in detail in the screenshots below. Can you tell me how to remove the seeming index column Financial Year without exporting the dataframe to Excel?
I should also note that an attempt to drop the index at the beginning doesn't work either, as shown below.
You could use the reset_index() method. More here: https://datagy.io/pandas-drop-index-column/
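A minimal sketch, assuming the unwanted 'Financial Year' is sitting in the index rather than in a regular column:

import pandas as pd

# Hypothetical summary-statistics frame indexed by 'Financial Year'
stats = pd.DataFrame({"mean": [1.2, 1.5]},
                     index=pd.Index(["FY2019", "FY2020"], name="Financial Year"))

# drop=True discards the old index instead of turning it into a column
stats = stats.reset_index(drop=True)
print(stats)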
I am working on a program to go through tweets and predict whether the author falls into one of two categories. I want to get_dummies for whether or not a tweet contains any of the top 10 hashtags, or if it contains 'other'. (In the end I will probably be using the top 500 or so hashtags, not just 10; the data set is over 500,000 columns in total with over 50,000 unique hashtags.)
This is my first time using pandas, so apologies if my question is unclear, but I think what I'm expecting is that each row in the data set would be given a new column, one for each hashtag, and the value of that [row][column] pair would be 1 if the row contains that hashtag or 0 if it does not. There would also be a column for 'other', indicating the row has hashtags not in the top 10.
I already know how to determine the most frequently occurring hashtags in the column:
counts = df.hashtags.value_counts()
counts.nlargest(10)
I also understand how to get dummies; I just don't know how to avoid making one for every hashtag.
dummies = pd.get_dummies(df, columns=['hashtags'])
Please let me know if I could be clearer or provide more info. Appreciate the help!
I don't have time to generate data and work it all out, but I thought I'd give you this idea in case it might help you out.
The idea is to leverage .isin() to get the values that you need to build the dummies. Then leverage the power of the index to match to the source rows.
Something like:
pd.get_dummies(df.loc[df['hashtags'].isin(counts.nlargest(10).index)], columns=['hashtags'])
You will have to see if the indices will give you what you need.
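A fuller hedged sketch of that idea, using made-up data, that keeps every original row via the index and adds an 'other' flag for hashtags outside the top set:

import pandas as pd

# Made-up data; the real frame has a 'hashtags' column with ~50,000 unique values.
df = pd.DataFrame({"text": ["t1", "t2", "t3", "t4", "t5"],
                   "hashtags": ["#a", "#b", "#a", "#c", "#a"]})

counts = df["hashtags"].value_counts()
top = counts.nlargest(2).index              # nlargest(10) on the real data

# Dummies only for rows whose hashtag is in the top set
dummies = pd.get_dummies(df.loc[df["hashtags"].isin(top), "hashtags"])

# Reattach via the index; rows outside the top set come back as 0
dummies = dummies.reindex(df.index, fill_value=0)

# Flag anything not in the top set as 'other'
dummies["other"] = (~df["hashtags"].isin(top)).astype(int)

out = pd.concat([df, dummies], axis=1)
print(out)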
I'm manually comparing two or three very similar rows using pandas. Is there a more automated way to do this? I would like a better method than using '=='.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html
See if this will satisfy your needs.
df['sales_diff'] = df['sales'].diff()
The above code snippet creates a new column in your data frame containing the difference from the previous row by default. You can play around with the axis parameter to compare across rows or columns, and change periods to compare against a row or column further away.
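A quick runnable sketch of both knobs, using made-up sales numbers:

import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"sales": [100, 110, 105, 120]})

df["sales_diff"] = df["sales"].diff()             # difference from the previous row
df["sales_diff_2"] = df["sales"].diff(periods=2)  # difference from two rows back
print(df)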
I have a seemingly complicated problem and I have a general idea of how I should solve it, but I am not sure if it is the best way to go about it. I'll give the scenario and would appreciate any help on how to break this down. I'm fairly new to Pandas, so please excuse my ignorance.
The Scenario
I have a CSV file that I import as a dataframe. The example I am working through contains 2742 rows × 136 columns. The rows are variable but the columns are set. I have a set of 23 lookup tables (also CSV files) named per year and quarter (the range is 2020 3rd quarter back to 2015 1st quarter). The lookup files are named like this: PPRRVU203.csv, which contains values from the 3rd quarter of 2020. The lookup tables are matched on two columns ('Code' and 'Mod'), and I use three values that are associated in the lookup.
I am trying to filter sections of my data frame, pull the correct values from the matching lookup file, merge back into the original subset, and then replace into the original dataframe.
Thoughts
I can probably abstract this and wrap it in a function, but I'm not sure how to place the results back in. My question, for those who understand Pandas better than I do: what is the best method to filter, replace the values, and write the file back out?
The straightforward solution would be to filter the original dataframe into 23 separate dataframes, do the merge on each individual one, then concat into a new dataframe and output to CSV.
This seems highly inefficient, though.
I can post code, but I'm mostly looking for high-level thoughts.
Not sure exactly what your DataFrame looks like, but the pandas DataFrame.query() method may prove useful for selecting the data.
name = df.query('columnname == "something"')
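A hedged sketch of how that kind of filtering could feed the per-quarter merge described above. The 'Quarter' column, the value column names, and the quarter-to-filename mapping are all assumptions:

import pandas as pd

# Assumes the main frame has a 'Quarter' column like "20203" (2020 Q3) that maps
# onto a lookup file such as PPRRVU203.csv, and that both sides share 'Code'/'Mod'.
df = pd.read_csv("main.csv")

pieces = []
for quarter, subset in df.groupby("Quarter"):
    lookup = pd.read_csv(f"PPRRVU{str(quarter)[2:]}.csv")   # "20203" -> "203"
    merged = subset.merge(lookup[["Code", "Mod", "Value1", "Value2", "Value3"]],
                          on=["Code", "Mod"], how="left")
    pieces.append(merged)

result = pd.concat(pieces, ignore_index=True)
result.to_csv("output.csv", index=False)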
Please refer to this image: Two tables in a single Excel sheet
I need dynamic Python code that can read two tables from a single Excel sheet without specifying the header position. The number of columns and rows can change over time.
Please help!
It's a little hard for me personally to write the actual code for something like this without the Excel file itself, but I can definitely tell you the strategy/steps for dealing with it. As you know, pandas treats it as a single DataFrame. That means you should too. The trick is to not get fooled into thinking that this is truly structured data that works with the same logic as a structured table. Think of what you're doing as less like cleaning structured data and more like telling a computer how to measure and cut a piece of paper. Instead of approaching it as two tables, think of it as a large DataFrame where rows fall into three categories:
Rows with nothing
Rows that you want to end up in the first table
Rows that you want to end up in the second table
The first thing to do is try to create a column that will sort the rows into those three groups. Looking at it, I would rely on the cells that say "information about table (1/2)". You can create a column that is 1 if the first column contains "table 1", 2 if it contains "table 2", and null otherwise. You may be worried that all of the actual table values end up with null values for this new column. Don't be yet.
Now, with the new column, you want to use the .ffill() method on the column. This will take all of the non-null values in the column and propagate them downwards to all available null values. At this point, all rows of the first table will have 1 for the column and the rows for the second table will have 2. We have the first major step out of the way.
Now, the first column should still have null values because you haven't done anything with it. Fortunately, the null values here only exist where the entire row is empty. Drop all rows with null values for the first column. At last, you should now be able to create two new DataFrames using Boolean masking.
e.g.: df1 = df.loc[df["filter"]==1].copy(deep=True)
You will still have the columns and headers to handle/clean up how you'd like, but at this point, it should be much easier for you to clean those up from a single table rather than two tables smashed together within a DataFrame.
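A hedged sketch of those steps put together; the file name, the marker text, and the reliance on the first column are all assumptions about what the sheet looks like:

import pandas as pd
import numpy as np

# Read everything as one unstructured block, with no header row
raw = pd.read_excel("two_tables.xlsx", header=None)

first_col = raw[0].astype(str)

# Step 1: tag the marker rows, leave everything else null
raw["filter"] = np.where(first_col.str.contains("table 1", case=False), 1,
                np.where(first_col.str.contains("table 2", case=False), 2, np.nan))

# Step 2: propagate the tag down to the rows that belong to each table
raw["filter"] = raw["filter"].ffill()

# Step 3: drop the completely empty separator rows
raw = raw.dropna(subset=[0])

# Step 4: split into two tables with Boolean masks
df1 = raw.loc[raw["filter"] == 1].copy(deep=True)
df2 = raw.loc[raw["filter"] == 2].copy(deep=True)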