Pandas dataframe - Difference between using loc and query - python

I have a dataframe, df.
I need to add a column Weekday to it, derived from the index.
What is the difference between using
df['Weekday']=df.index.weekday
and
df.loc[:,'Weekday'] = df.index.weekday

In this case, you won't have any issues.
But it is advisable to use the .loc functionality.
You can read about the difference in detail here.
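A minimal sketch of where the distinction matters, using hypothetical example data (the column names here are illustrative, not from the question): on a frame you created yourself both assignments behave the same, but when the frame is itself a slice of another one, chained assignment can trigger SettingWithCopyWarning while a single .loc call stays unambiguous.
import pandas as pd
df = pd.DataFrame(
    {"sales": [10, 20, 30]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)
# On a dataframe you own outright, both forms add the column identically.
df["Weekday"] = df.index.weekday
df.loc[:, "Weekday"] = df.index.weekday
# The difference shows up when the frame is a slice of another frame:
parent = df.copy()
subset = parent[parent["sales"] > 10]                 # may be a view or a copy
# subset["Weekday"] = 0                               # chained assignment: warning, may not reach parent
parent.loc[parent["sales"] > 10, "Weekday"] = 0       # explicit, always modifies parent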

Related

python pandas.loc not finding row name: KeyError

This is driving me crazy because it should be so simple and yet it's not working. It's a duplicate question and yet the answers from previous questions don't work.
My csv looks similar to this:
name,val1,val2,val3
ted,1,2,
bob,1,,
joe,,,4
I want to print the contents of row 'joe'. I use the line below, and PyCharm gives me a KeyError.
print(df.loc['joe'])
The problem with your logic is that you have not told pandas which column it should search for 'joe' in.
print(df.loc[df['name'] == 'joe'])
or
print(df[df['name'] == 'joe'])
Label-based lookup with .loc works directly only on the index.
If you just used pd.read_csv without specifying an index, pandas uses a numeric index by default. You can set name as the index if it is unique; then .loc will work:
df = df.set_index("name")
print(df.loc['joe'])
Another option, and the way .loc is usually used, is to name explicitly which column you are referring to:
print(df.loc[df["name"]=="joe"])
Note that the condition df["name"]=="joe" returns a Series with True/False for each row. df.loc[...] on that Series returns only the rows where the value is True, and therefore only the rows where name is "joe". Keep that in mind when you try more complex conditioning on your dataframe with .loc in the future.
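Putting both approaches together on the CSV from the question (a minimal sketch; io.StringIO is just a stand-in for the real file):
import io
import pandas as pd
csv = io.StringIO("name,val1,val2,val3\nted,1,2,\nbob,1,,\njoe,,,4\n")
df = pd.read_csv(csv)
# Option 1: boolean mask on the 'name' column.
print(df.loc[df["name"] == "joe"])
# Option 2: make 'name' the index so that label lookup works.
df = df.set_index("name")
print(df.loc["joe"])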

Unable to access data using pandas composite index

I'm trying to organise data using a pandas dataframe.
Given the structure of the data it seems logical to use a composite index; 'league_id' and 'fixture_id'. I believe I have implemented this according to the examples in the docs, however I am unable to access the data using the index.
My code can be found here:
https://repl.it/repls/OldCorruptRadius
I am very new to Pandas and programming in general, so any advice would be much appreciated! Thanks!
For multi-indexing, you would need to use the pandas MultiIndex API, which brings its own learning curve; thus, I would not recommend it for beginners. Link: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
The way I use multi-indexing is only to display a final product to others (i.e. to make it easy and pretty to view). Before any multi-indexing, filter on fixture_id and league_id as ordinary columns first:
df = pd.DataFrame(fixture, columns=features)
df[(df['fixture_id'] == 592) & (df['league_id'] == 524)]
This way, you are still targeting the same keys you would have used as the index had you gone through with multi-indexing the two columns.
If you have to use multi-indexing, try the transpose (.T) of a pandas DataFrame, which swaps the index and the columns. For example, you can do something like this:
df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
df.T[524][592].loc['event_date'] # gets you the row of `event_dates`
df.T[524][592].loc['event_date'].iloc[0] # gets you the first instance of event_dates
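For completeness, once the two columns are set as a MultiIndex, .loc also accepts a tuple of labels directly, without transposing. A minimal sketch with hypothetical fixture data (the real structure comes from the repl.it link above):
import pandas as pd
fixture = [
    {"league_id": 524, "fixture_id": 592, "event_date": "2019-08-09"},
    {"league_id": 524, "fixture_id": 593, "event_date": "2019-08-10"},
]
df = pd.DataFrame(fixture).set_index(["league_id", "fixture_id"])
print(df.loc[(524, 592)])                 # the whole row as a Series
print(df.loc[(524, 592), "event_date"])   # a single cell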

Groupby followed by a Series

df.Last_3mth_Avg.isnull().groupby([df['ShopID'],df['ProductID']]).sum().astype(int).reset_index(name='count')
The code above helps me see the number of null values by ShopID and ProductID. My question: df.Last_3mth_Avg.isnull() becomes a Series, so how can a groupby([df['ShopID'],df['ProductID']]) be used on it afterwards?
I use the solution from:
Pandas count null values in a groupby function
You should filter your df first:
df[df.Last_3mth_Avg.isnull()].groupby(['ShopID','ProductID']).agg('count')
There are two ways to use groupby:
The common way is to call it on the DataFrame, so you just mention the column names in the by= parameter.
The second way is to call it on a Series but pass equal-sized Series in the by= parameter. This is rarely used and helps when you want to do conversions on a specific column and use groupby in the same line.
So, the code line above should work.
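A minimal sketch of both forms, with hypothetical data shaped like the columns in the question:
import pandas as pd
df = pd.DataFrame({
    "ShopID": [1, 1, 2, 2],
    "ProductID": ["A", "B", "A", "B"],
    "Last_3mth_Avg": [10.0, None, None, 5.0],
})
# Way 1: group the filtered DataFrame and name the columns in by=.
counts1 = df[df["Last_3mth_Avg"].isnull()].groupby(["ShopID", "ProductID"]).size()
# Way 2: group a derived Series, passing equal-sized Series to by=.
counts2 = (df["Last_3mth_Avg"].isnull()
             .groupby([df["ShopID"], df["ProductID"]])
             .sum()
             .astype(int)
             .reset_index(name="count"))
print(counts2)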

How do I get a section of a Dataframe that meets a certain criteria?

I want to get a section of a dataframe that meets a certain requirement.
I want to do:
new_df = old_df[old_df.timevariable.date() == thisdateiwant]
Is there an efficient way to do this that works?
The issue here is the .date() part. I've done this before using the same syntax, but not with a modifier on the old_df part. For example, if old_df.timevariable is a datetime, then I could match it against a datetime with ==, but since I want a date, I need to modify each element in the dataframe, which the syntax doesn't like.
I know I could pull it all out and loop through with a bunch of variables, but I'm pretty sure that would be much slower. The first code snippet seemed to be the fastest way of doing this (like a SQL WHERE clause), although it doesn't seem to work if you need to modify the variable you're comparing (such as with .date()).
The old_df is about (900k, 15) in size, so I want something efficient. Currently, I'm just changing variables and reimporting from SQL, which takes 5-10 seconds for each date (thisdateiwant). I presume doing this in Python against the larger initial dataframe will be quicker. Typically it returns about 30k rows into new_df for each date.
What is the fastest way of doing this?
Edit
Happy to mark this as a duplicate, I got it working from some code in that other question (from #Pault).
basically did:
mask = old_db['timevariable'] >= thisdateiwant
mask2 = old_db['timevariable'] < thisdateiwant + pd.Timedelta(days=1)
new_db = old_db.loc[mask]
new_db = new_db.loc[mask2]
I don't think there's an easy way to apply both masks at the same time; it seemed to throw an error. It's nice and quick, so I'm happy.
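For what it's worth, the two conditions can usually be combined in a single .loc call if each comparison is wrapped in its own parentheses before joining them with & (leaving those out is a common cause of the error mentioned above). A minimal sketch with a hypothetical old_db:
import pandas as pd
old_db = pd.DataFrame({
    "timevariable": pd.to_datetime(["2018-05-08 22:00", "2018-05-09 09:00",
                                    "2018-05-09 18:00", "2018-05-10 01:00"]),
    "value": [1, 2, 3, 4],
})
thisdateiwant = pd.Timestamp("2018-05-09")
new_db = old_db.loc[(old_db["timevariable"] >= thisdateiwant) &
                    (old_db["timevariable"] < thisdateiwant + pd.Timedelta(days=1))]
print(new_db)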
If your column is truly a timestamp, then you can make use of the dt accessor.
new_df = old_df[old_df.timevariable.dt.floor('D') == '2018-05-09']
Otherwise, convert the target column to a timestamp using pd.to_datetime:
old_df['timevariable'] = pd.to_datetime(old_df['timevariable'])
Ranges of dates are supported more naturally without the dt accessor:
new_df = old_df[old_df.timevariable >= '2018-05-09']  # dates on or after May 9th, inclusive
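A self-contained version of the above, assuming timevariable holds full timestamps:
import pandas as pd
old_df = pd.DataFrame({
    "timevariable": pd.to_datetime(["2018-05-08 23:59", "2018-05-09 08:15",
                                    "2018-05-09 17:30"]),
    "value": [1, 2, 3],
})
# Single day via the dt accessor: floor each timestamp to midnight and compare.
new_df = old_df[old_df["timevariable"].dt.floor("D") == "2018-05-09"]
# Equivalent comparison against a plain date object.
new_df = old_df[old_df["timevariable"].dt.date == pd.Timestamp("2018-05-09").date()]
print(new_df)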

What is the difference between these two Python pandas dataframe commands?

Let's say I have an empty pandas dataframe.
import pandas as pd
m = pd.DataFrame(index=range(1,100), columns=range(1,100))
m = m.fillna(0)
What is the difference between the following two commands?
m[2][1]
m[2].ix[1] # This code actually allows you to edit the dataframe
Feel free to provide further reading if it would be helpful for future reference.
The short answer is that you probably shouldn't do either of these (see #EdChum's link for the reason):
m[2][1]
m[2].ix[1]
You should generally use a single ix, iloc, or loc command on the entire dataframe any time you want access by both row and column, not sequential column access followed by row access as you did here. For example,
m.iloc[1, 2]
Note that the 1 and 2 are reversed compared to your example because ix/iloc/loc all use the standard syntax of row then column. Your syntax is reversed because you are chaining: you first select a column from the dataframe (which is a Series) and then select a row from that Series.
In simple cases like yours it often won't matter much, but the usefulness of ix/iloc/loc is that they are designed to let you "edit" the dataframe in complex ways (i.e. set or assign values).
There is a really good explanation of ix/iloc/loc here:
pandas iloc vs ix vs loc explanation?
and also in standard pandas documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
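Note that .ix has since been deprecated and removed from current pandas, so .loc and .iloc are the ones to reach for today. A minimal sketch of the difference, using the frame from the question:
import pandas as pd
m = pd.DataFrame(0, index=range(1, 100), columns=range(1, 100))
# Chained indexing: select column 2 (a Series), then label 1 within it.
# Reading works, but assigning through the chain may hit a temporary copy
# and not propagate back to m (SettingWithCopyWarning).
value = m[2][1]
# A single indexer addresses row and column together; note row comes first.
value = m.loc[1, 2]    # by labels
value = m.iloc[0, 1]   # by integer positions
m.loc[1, 2] = 5        # assignment is unambiguous and always modifies m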
