What is the difference between these two Python pandas dataframe commands?

Let's say I have an empty pandas dataframe.
import pandas as pd
m = pd.DataFrame(index=range(1,100), columns=range(1,100))
m = m.fillna(0)
What is the difference between the following two commands?
m[2][1]
m[2].ix[1] # This code actually allows you to edit the dataframe
Feel free to provide further reading if it would be helpful for future reference.

The short answer is that you probably shouldn't do either of these (see @EdChum's link for the reason):
m[2][1]
m[2].ix[1]
You should generally use a single loc or iloc call on the entire dataframe any time you want access by both row and column, rather than a sequential column access followed by a row access as you did here. (ix did the same job in older pandas, but it was deprecated in pandas 0.20 and removed in 1.0.) For example,
m.loc[1, 2]
Note that the 1 and 2 are reversed compared to your example because ix/iloc/loc all use the standard order of row, then column. Your order is reversed because you are chaining: you first select a column from the dataframe (which gives a series) and then select a row from that series. loc is the direct equivalent here because your index and columns are labels starting at 1; the positional m.iloc[1, 2] would instead address the row labelled 2 and the column labelled 3.
In simple cases like yours it often won't matter too much, but the point of ix/iloc/loc is that they are designed to let you "edit" the dataframe (that is, set or assign values) reliably, even with complex selections.
There is a really good explanation of ix/iloc/loc here:
pandas iloc vs ix vs loc explanation?
and also in standard pandas documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
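As a rough illustration of the points above, here is a minimal sketch using a frame shaped like the one in the question. It is built directly with zeros instead of with fillna, and the variable names are only for illustration.

```python
import pandas as pd

# Same shape as the frame in the question: labels 1..99 on both axes.
m = pd.DataFrame(0, index=range(1, 100), columns=range(1, 100))

# Chained read: column label 2 first, then row label 1 of that series.
value_chained = m[2][1]

# Single-call read of the same cell: row label 1, column label 2.
value_loc = m.loc[1, 2]

# Single-call assignment always writes to the original frame.
m.loc[1, 2] = 5

# iloc counts positions from 0, so the same cell is position (0, 1).
same_cell_by_position = m.iloc[0, 1]

# m.iloc[1, 2] is a different cell: row label 2, column label 3.
other_cell = m.iloc[1, 2]
```

The assignment goes through loc on the whole frame, so there is no intermediate series that might be a copy.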

Related

python pandas.loc not finding row name: KeyError

This is driving me crazy because it should be so simple and yet it's not working. It's a duplicate question and yet the answers from previous questions don't work.
My csv looks similar to this:
name,val1,val2,val3
ted,1,2,
bob,1,,
joe,,,4
I want to print the contents of row 'joe'. I use the line below and pycharm gives me a KeyError.
print(df.loc['joe'])

The problem with your logic is that you have not told pandas which column to search for 'joe'. With a single label, .loc looks in the index, not in the name column.
print(df.loc[df['name'] == 'joe'])
or
print(df[df['name'] == 'joe'])

Using .loc directly with a label only works on the index.
If you just used pd.read_csv without specifying an index, pandas uses integers as the index by default. You can set name to be the index if it is unique. Note that set_index returns a new dataframe, so assign the result back. Then .loc will work:
df = df.set_index("name")
print(df.loc['joe'])
Another option, and the way .loc is usually used, is to state explicitly which column you are referring to:
print(df.loc[df["name"]=="joe"])
Note that the condition df["name"]=="joe" returns a series with True/False for each row. df.loc[...] on that series returns only the rows where the value is True, so it returns only the rows where name is "joe". Keep that in mind when you build more complex conditions on your dataframe with .loc in the future.
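Both approaches can be tried on the CSV from the question. The sketch below reads it from an in-memory string instead of a file, which is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

# The CSV from the question, read from a string instead of a file.
csv_text = "name,val1,val2,val3\nted,1,2,\nbob,1,,\njoe,,,4\n"
df = pd.read_csv(io.StringIO(csv_text))

# With the default integer index, 'joe' is not an index label.
try:
    df.loc["joe"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Option 1: boolean mask on the name column (returns a one-row dataframe).
by_mask = df.loc[df["name"] == "joe"]

# Option 2: make name the index, then look the label up (returns a series).
by_label = df.set_index("name").loc["joe"]
```

The mask version keeps the original integer index, while the set_index version returns the row as a series keyed by column name.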

Why does vaex change column names that contain a period?

When using vaex I came across an unexpected error NameError: name 'column_2_0' is not defined.
After some investigation I found that in my data source (HDF5 file) the column name causing problems is actually called column_2.0 and that vaex renames it to column_2_0 but when performing operations using column names I run into the error. Here is a simple example that reproduces this error:
import pandas as pd
import vaex
cols = ['abc_1', 'abc1', 'abc.1']
vals = list(range(0,len(cols)))
df = pd.DataFrame([vals], columns=cols)
dfv = vaex.from_pandas(df)
for col in dfv.column_names:
    dfv = dfv[dfv[col].notna()]
dfv.count()
...
NameError: name 'abc_1_1' is not defined
In this case it appears that vaex tries to rename abc.1 to abc_1 which is already taken so instead it ends up using abc_1_1.
I know that I can rename the column like dfv.rename('abc_1_1', 'abc_dot_1'), but (a) I'd need to introduce special logic for naming conflicts like in this example where the column name that vaex comes up with is already taken and (b) I'd rather not have to do this manually each time I have a column that contains a period.
I could also enforce all my column names from source data to never use a period but this seems like a stretch given that pandas and other sources where data might come from in general don't have this restriction.
What are some ideas to deal with this problem other than the two I mentioned above?

In Vaex the columns are in fact "Expressions". Expressions allow you to build a sort of computational graph behind the scenes as you do your regular dataframe operations. However, that requires the column names to be as "clean" as possible.
So column names like '2' or '2.5' are not allowed, since the expression system can interpret them as numbers rather than column names. Likewise, the expression system can interpret a column name like 'first-name' as df['first'] - df['name'].
To avoid this, vaex will smartly rename columns so that they can be used in the expression system. This is extremely complicated actually. So in your example above, you've found a case that has not been covered yet (isna/notna).
Btw, you can always access the original names via df.get_column_names(alias=True).
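If you would rather control the renaming yourself before the data reaches vaex, one option is to sanitize the names on the pandas side. The sketch below is not vaex's own renaming logic. safe_column_names is a hypothetical helper that maps each name to a valid Python identifier and resolves collisions with a numeric suffix, assuming the input names are unique.

```python
import keyword
import re


def safe_column_names(names):
    """Map each column name to a unique, valid Python identifier.

    Hypothetical helper, not part of vaex or pandas.
    """
    def is_valid(name):
        return name.isidentifier() and not keyword.iskeyword(name)

    # Names that are already clean keep priority over renamed ones.
    used = {name for name in names if is_valid(name)}
    mapping = {}
    for name in names:
        if is_valid(name):
            mapping[name] = name
            continue
        # Replace every non-word character, e.g. '.' or '-', with '_'.
        candidate = re.sub(r"\W", "_", name)
        if not candidate or candidate[0].isdigit():
            candidate = "_" + candidate
        base, suffix = candidate, 1
        # Add a numeric suffix until the name is both valid and unused.
        while candidate in used or not is_valid(candidate):
            candidate = f"{base}_{suffix}"
            suffix += 1
        used.add(candidate)
        mapping[name] = candidate
    return mapping


mapping = safe_column_names(["abc_1", "abc1", "abc.1"])
```

The resulting dictionary can be passed to df.rename(columns=mapping) in pandas before calling vaex.from_pandas, and it also serves as a record of the original names.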

Unable to access data using pandas composite index

I'm trying to organise data using a pandas dataframe.
Given the structure of the data it seems logical to use a composite index; 'league_id' and 'fixture_id'. I believe I have implemented this according to the examples in the docs, however I am unable to access the data using the index.
My code can be found here;
https://repl.it/repls/OldCorruptRadius
** I am very new to Pandas and programming in general, so any advice would be much appreciated! Thanks! **

For multi-indexing, you would need to use the pandas MultiIndex API, which brings its own learning curve; thus, I would not recommend it for beginners. Link: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
The way that I use multi-indexing is only to display a final product to others (i.e. making it easy/pretty to view). Before the multi-indexing, you filter the fixture_id and league_id as columns first:
df = pd.DataFrame(fixture, columns=features)
df[(df['fixture_id'] == 592) & (df['league_id'] == 524)]
This way, you are still targeting the same rows you would have targeted had you gone through with multi-indexing the two columns.
If you have to use multi-indexing, try the transpose attribute (.T) of a pandas DataFrame. This turns the index into columns and vice versa. For example, you can do something like this:
df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
df.T[524][592].loc['event_date'] # gets you the `event_date` row for league 524, fixture 592
df.T[524][592].loc['event_date'].iloc[0] # gets you the first `event_date` value
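For comparison, a MultiIndex can also be queried directly with a tuple in .loc, without transposing. The sketch below uses made-up fixture rows. Only the column names league_id, fixture_id and event_date come from the question.

```python
import pandas as pd

# Hypothetical fixture rows; the values are invented for illustration.
fixtures = pd.DataFrame({
    "league_id": [524, 524, 775],
    "fixture_id": [592, 593, 101],
    "event_date": ["2019-08-09", "2019-08-10", "2019-08-11"],
})

# Build the composite index and sort it so lookups are efficient.
indexed = fixtures.set_index(["league_id", "fixture_id"]).sort_index()

# A tuple addresses one (league_id, fixture_id) pair.
one_date = indexed.loc[(524, 592), "event_date"]

# xs selects every row for one level value, here all fixtures in league 524.
league_rows = indexed.xs(524, level="league_id")
```

Because each index pair is unique here, the tuple lookup returns a single value and not a series.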

Pandas dataframe - Difference between using loc and query

I have a dataframe, df, with a datetime index.
I need to add a column Weekday to it, which is obtained through the index.
What is the difference between using
df['Weekday']=df.index.weekday
and
df.loc[:,'Weekday'] = df.index.weekday

In this case, you won't have any issues with either form.
But it is advisable to use .loc.
You can read about the difference in detail here
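A quick check that both forms give the same column when df is an original dataframe and not a slice of another one. The dates and the value column below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three consecutive days starting on a Monday.
dates = pd.date_range("2024-01-01", periods=3, freq="D")
a = pd.DataFrame({"value": [10, 20, 30]}, index=dates)
b = a.copy()

# Plain column assignment.
a["Weekday"] = a.index.weekday

# The same assignment through .loc, selecting all rows.
b.loc[:, "Weekday"] = b.index.weekday
```

The two forms only start to matter when df is itself a slice of another dataframe, which is the situation SettingWithCopyWarning is about.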

Pandas SettingWithCopyWarning when using iloc

I'm trying to change values in my DataFrame after merging it with another DataFrame and coming across some issues (doesn't appear to be an issue prior to merging).
I am indexing and changing values in my DataFrame with:
df.iloc[0]['column'] = 1
Subsequently I've joined (left outer join) along both indexes using merge (I realize left.join(right) would work too). After this when I perform the same value assignment using iloc, I receive the following warning:
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Reviewing the linked document didn't clear this up for me, so: am I using an incorrect method of slicing with iloc? (Keep in mind that I require position-based slicing for the purpose of my code.)
I notice that df.ix[0,'column'] = 1 works, and similarly based on this page I can reference the column location with df.columns.get_loc('column') but on the surface this seems unnecessarily convoluted.
What's the difference between these methods under the hood, and what about merging causes the previous method (df.iloc[0]['column']) to break?

You are using chained indexing above ("df.iloc[0]['column'] = 1"). This is to be avoided, and it is what generates the SettingWithCopyWarning you are getting. The pandas docs are a bit complicated, but see "SettingWithCopy Warning with chained indexing" for the under-the-hood explanation of why this does not work.
Instead you should use df.loc[0, 'column'] = 1 (this selects by label, so it addresses the row whose index label is 0).
.loc is for "Access a group of rows and columns by label(s) or a boolean array."
.iloc is for "Purely integer-location based indexing for selection by position."
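The label versus position distinction is easiest to see on an index that is not in order. The frame below is a made-up example. Only the column name 'column' comes from the question.

```python
import pandas as pd

# Hypothetical frame whose index labels are deliberately out of order.
df = pd.DataFrame({"column": [0, 0, 0]}, index=[2, 0, 1])

# .loc uses labels: this writes to the row labelled 0 (the second row).
df.loc[0, "column"] = 1

# .iloc uses positions: this writes to the first row and first column.
df.iloc[0, 0] = 9
```

Each assignment is a single indexing call on df, so neither of them goes through an intermediate copy.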

It sucks, but the best solution I've come up with so far for updating a dataframe's column by position is to find the integer location of the column, then use .iloc for everything:
import numpy as np
column_i_loc = np.where(df.columns == 'column')[0][0]
df.iloc[0, column_i_loc] = 1
Note you could also disable the warning, but really, do not!
Also, if you get this warning when you were not trying to update some original DataFrame, then you forgot to make a copy, and you can end up with a nasty bug.
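The same idea can be written with df.columns.get_loc, which the question already mentions. A minimal sketch, assuming unique column names and using made-up data:

```python
import pandas as pd

# Hypothetical frame with a non-integer index, so position and label differ.
df = pd.DataFrame(
    {"other": [0, 0, 0], "column": [0, 0, 0]},
    index=["a", "b", "c"],
)

# Translate the column label to its integer position once.
col_pos = df.columns.get_loc("column")

# Purely positional assignment in a single .iloc call.
df.iloc[0, col_pos] = 1
```

With unique column names get_loc returns a plain integer, so it can be used directly inside .iloc.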
