How can I select specific instances of a column in pandas? - python

I'm working on improving my client reporting for fundraising performance, and I put in specific reference codes to be able to track email performance. I've successfully loaded in my data frame, but I don't know how to select specific instances of a column.
For example, I'm trying to graph performance of 3 reference codes (SAN_20210811_GEN_FDR_X, SAN_20210808_GEN_ENG_X, and SAN_20210803_GEN_FDR_X). I loaded in my data frame:
import pandas as pd
data = pd.read_excel(r'C:\Users\Sandro\Downloads\SANdata.xlsx')
df = pd.DataFrame(data, columns=['Amount', 'Reference Code'])
print(df)
Is there a way to filter the specific instances of my reference codes to then get a sum for each?
I'm thinking something like:
df.loc[df['Reference Code'] == "SAN_20210811_GEN_FDR_X"]
But I can't seem to get this to work.

For a specific reference code:
df['Amount'].loc[df['Reference Code'] == "SAN_20210811_GEN_FDR_X"].sum()
For all reference codes:
df.groupby('Reference Code')['Amount'].sum()
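As a quick sanity check, here is a minimal sketch with made-up amounts (the reference codes come from the question; the numbers are illustrative only):
import pandas as pd
df = pd.DataFrame({
    'Amount': [10.0, 25.0, 5.0],
    'Reference Code': ['SAN_20210811_GEN_FDR_X', 'SAN_20210808_GEN_ENG_X', 'SAN_20210811_GEN_FDR_X'],
})
# Sum for one code: 10.0 + 5.0 = 15.0
print(df['Amount'].loc[df['Reference Code'] == 'SAN_20210811_GEN_FDR_X'].sum())
# Sum per code, returned as a Series indexed by Reference Code
print(df.groupby('Reference Code')['Amount'].sum())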

Related

Selecting row from a Pandas DataFrame based on constraints

I have several datasets that I import as CSV files and display in a pandas DataFrame. The CSV files contain info about Covid updates.
The datasets have several columns relating to this, for example "country_region", "last_update" & "confirmed".
Let's say I wanted to look up the confirmed cases of Covid for Germany.
I'm trying to write a function that will return a slice of the DataFrame that corresponds to those constraints to be able to display the match I'm looking for.
I need to do this in some generic way so I can provide any value from any column.
I wish I had some code to include but I'm stuck on how to even proceed.
Everything I find online only specifies for looking up values relating to a pre-defined value.
Something like this?
def filter(country_region_val, last_update_val, confirmed_val, df):
    mask = ((df['country_region'] == country_region_val)
            & (df['last_update'] == last_update_val)
            & (df['confirmed'] == confirmed_val))
    return df.loc[mask].reset_index(drop=True)
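Since the question asks for something generic enough to take any value from any column, here is a hedged sketch of a filter that accepts arbitrary column/value pairs as keyword arguments (the name filter_by and the example call are illustrative, not from the original post):
import pandas as pd
def filter_by(df, **constraints):
    # Start from an all-True mask and AND in one equality condition per column
    mask = pd.Series(True, index=df.index)
    for column, value in constraints.items():
        mask &= df[column] == value
    return df.loc[mask].reset_index(drop=True)
# e.g. filter_by(df, country_region='Germany', confirmed=1000)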

Automatically format pandas table without changing data

Is there a way to format the output of a particular column of a pandas DataFrame (e.g., as currency, with ${:,.2f}, or percentage, with {:,.2%}) without changing the data itself?
In this post I see that map can be used, but it changes the data to strings.
I also see that I can use .style.format (see here) to print a data frame with some formatting, but it returns a Styler object.
I would just like to change the default printout of the data frame itself, so that it always prints formatted as specified. (I suppose this means changing __repr__ or _repr_html_.) I'd assume there is a simple way of doing this in pandas, but I could not find it.
Any help would be greatly appreciated!
EDIT (for clarification): Suppose I have a data frame df:
df = pd.DataFrame({"Price": [1234.5, 3456.789], "Increase": [0.01234, 0.23456]})
I want the column Price to be formatted with "${:,.2f}" and column Increase to be formatted with "{:,.2%}" whenever I print df in a Jupyter notebook (with print or just running a cell ending in df).
I can use
df.style.format({"Price": "${:,.2f}", "Increase": "{:,.2%}"})
but I do not want to type that every time I print df.
I could also do
df["Price"] = df["Price"].map("${:,.2f}".format)
df["Increase"] = df["Increase"].map("{:,.2%}".format)
which does always print as I want (with print(df)), but this changes the columns from float64 to object, so I cannot manipulate the data frame anymore.
It is a natural feature request, but pandas cannot guess what your format is, and each time you create a Styler it has to be told of such decisions, since it is a separate object; it also does not dynamically update if you change your DataFrame.
The best you can do is create a generic print helper via a Styler:
import pandas as pd

def p(df):
    # Return a Styler with the per-column formats applied
    styler = df.style
    styler.format({"Price": "${:,.2f}", "Increase": "{:,.2%}"})
    return styler

df = pd.DataFrame([[1, 2], [3, 4]], columns=["Price", "Increase"])
p(df)
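As a hedged aside: if a global fallback is acceptable, pandas also has a display option that changes how every float prints. Note it applies one format to all float columns in all DataFrames, so it cannot reproduce the separate Price/Increase formats above:
import pandas as pd
# Every float in every printed DataFrame now renders with two decimals and thousands separators
pd.set_option('display.float_format', '{:,.2f}'.format)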

Looking for names that somewhat match

I am creating an app that loads a DataFrame from a spreadsheet. Sometimes the names are not the same as before (someone changed a column name slightly) and are just barely different. What would be the best way to "look" for these closely related column names in new spreadsheets when loading the df? If I ask python to look for "col_1" but in the user's spreadsheet the column is "col1", then python won't find it.
Example:
import pandas as pd
df = pd.read_excel('data.xlsx')
Here are the column names I am looking for. The rest of the columns load just fine, but the column names that are just barely different get skipped and their data never gets loaded. How can I make sure that if a name is close to the one python is looking for, it still gets loaded into the df?
Names I'm looking for:
'Water Consumption'
'Female Weight'
'Uniformity'
data.xlsx, incoming data column names that are slightly different:
'Water Consumed Actual'
'Body Weight Actual'
'Unif %'
The standard-library function difflib.get_close_matches can help you with column names that are slightly wrong. Using it with the usecols argument of pd.read_excel should get you most of the way there.
You can do something like:
import difflib
import pandas as pd
desired_columns = ['Water Consumption', 'Female Weight', 'Uniformity']
def column_checker(col_name):
    # Keep the column if it closely matches any desired name
    if difflib.get_close_matches(col_name, desired_columns):
        return True
    else:
        return False

df = pd.read_excel('data.xlsx', usecols=column_checker)
You can mess with the parameters of get_close_matches to make it more or less sensitive too.
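For instance, get_close_matches takes n (maximum number of matches to return) and cutoff (minimum similarity, default 0.6). With the names from the question, 'Unif %' scores only about 0.5 against 'Uniformity', below the default, so a lower cutoff is needed; a hedged sketch:
import difflib
desired_columns = ['Water Consumption', 'Female Weight', 'Uniformity']
incoming = ['Water Consumed Actual', 'Body Weight Actual', 'Unif %']
for name in incoming:
    # n=1 keeps only the single best match; cutoff=0.4 is more permissive than the 0.6 default
    print(name, '->', difflib.get_close_matches(name, desired_columns, n=1, cutoff=0.4))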

Unable to access data using pandas composite index

I'm trying to organise data using a pandas dataframe.
Given the structure of the data, it seems logical to use a composite index: 'league_id' and 'fixture_id'. I believe I have implemented this according to the examples in the docs; however, I am unable to access the data using the index.
My code can be found here:
https://repl.it/repls/OldCorruptRadius
I am very new to Pandas and programming in general, so any advice would be much appreciated! Thanks!
For multi-indexing, you would need to use the pandas MultiIndex API, which brings its own learning curve; thus, I would not recommend it for beginners. Link: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
The way that I use multi-indexing is only to display a final product to others (i.e. making it easy/pretty to view). Before applying the multi-index, filter on fixture_id and league_id while they are still ordinary columns:
df = pd.DataFrame(fixture, columns=features)
df[(df['fixture_id'] == 592) & (df['league_id'] == 524)]
This way, you are still targeting the same rows you would have reached had you multi-indexed the two columns.
If you have to use multi-indexing, try transposing the DataFrame with .T. This turns the index into columns and vice versa. For example, you can do something like this:
df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
df.T[524][592].loc['event_date']          # gets you the event_date row
df.T[524][592].loc['event_date'].iloc[0]  # gets you the first event_date value
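For reference, a MultiIndex can also be queried directly with .loc and an index tuple, which avoids the transpose. A minimal sketch; the column names follow the answer above, but the data is made up:
import pandas as pd
df = pd.DataFrame(
    {'league_id': [524, 524], 'fixture_id': [592, 593], 'event_date': ['2019-08-09', '2019-08-10']}
).set_index(['league_id', 'fixture_id'])
# Select event_date for league 524, fixture 592
print(df.loc[(524, 592), 'event_date'])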

Python bson: How to create list of ObjectIds in a New Column

I have a CSV for which I need to create a column of random, unique MongoDB ids in Python.
Here's my csv file:
import pandas as pd
df = pd.read_csv('file.csv', sep=';')
print(df)
Zone
Zone_1
Zone_2
UPDATE: I'm currently using this code to generate an ObjectId:
import bson
x = bson.objectid.ObjectId()
df['objectids'] = x
print(df)
Zone; objectids
Zone_1; 5bce2e42f6738f20cc12518d
Zone_2; 5bce2e42f6738f20cc12518d
How can I get the ObjectId to be unique for each row?
Hate to see you downvoted... Stack Overflow goes nuts with the duplicate-question nonsense, refuses to provide useful assistance, then downvotes you for having the courage to ask about something you don't know.
The referenced question clearly has nothing to do with ObjectIds, let alone adding them (or any other object not internal to NumPy or pandas) to data frames.
You probably need to use map.
This assumes "objectids" is not already a column in your frame and "Zone" is:
df['objectids'] = df['Zone'].map(lambda x: bson.objectid.ObjectId())
map is a super helpful (though slow) way to touch every record in a Series, and it is particularly handy as a first way to hook in external functions.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
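If the lambda that ignores its argument feels odd, a list comprehension of the right length works too; a hedged equivalent sketch using the Zone values from the question:
import bson
import pandas as pd
df = pd.DataFrame({'Zone': ['Zone_1', 'Zone_2']})
# One freshly generated ObjectId per row, so each id is unique
df['objectids'] = [bson.objectid.ObjectId() for _ in range(len(df))]
print(df)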
