How do I find the row # of a string index? - python

I have a dataframe where the indexes are not numbers but strings (specifically, name of countries) and they are all unique. Given the name of a country, how do I find its row number (the 'number' value of the index)?
I tried df[df.index == 'country_name'].index but this doesn't work.

We can use Index.get_indexer:
df.index.get_indexer(['Peru'])
array([3])
Or we can build a RangeIndex based on the size of the DataFrame then subset that instead:
pd.RangeIndex(len(df))[df.index == 'Peru']
Int64Index([3], dtype='int64')
Since we're only looking for a single label and the indexes are "all unique" we can also use Index.get_loc:
df.index.get_loc('Peru')
3
Sample DataFrame:
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5]
}, index=['Bahamas', 'Cameroon', 'Ecuador', 'Peru', 'Japan'])
df:

          A
Bahamas   1
Cameroon  2
Ecuador   3
Peru      4
Japan     5
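All three lookups above, run against the sample frame as a self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3, 4, 5]},
    index=['Bahamas', 'Cameroon', 'Ecuador', 'Peru', 'Japan'],
)

# get_indexer takes a list of labels and returns an array of positions
positions = df.index.get_indexer(['Peru'])

# get_loc takes a single label and returns a single position
loc = df.index.get_loc('Peru')

# RangeIndex built from the frame's length, subset by a boolean mask
range_pos = pd.RangeIndex(len(df))[df.index == 'Peru']

print(positions, loc, list(range_pos))   # [3] 3 [3]
```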

pd.Index.get_indexer
We can use pd.Index.get_indexer to get the integer position of a label.
idx = df.index.get_indexer(list_of_target_labels)
# If you only have a single label, we can use tuple unpacking here.
[idx] = df.index.get_indexer([country_name])
NB: pd.Index.get_indexer takes a list of labels and returns a NumPy array of integer positions from 0 to n - 1, indicating where each target label occurs in the index. Targets missing from the index are marked by -1.
np.where
You could also use np.where here.
idx = np.where(df.index == country_name)[0]
list.index
We could also use list.index after converting the pd.Index to a list with pd.Index.tolist:
idx = df.index.tolist().index(country_name)
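A runnable comparison of these approaches, reusing the sample frame from the earlier answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3, 4, 5]},
    index=['Bahamas', 'Cameroon', 'Ecuador', 'Peru', 'Japan'],
)

# tuple unpacking for a single label
[idx] = df.index.get_indexer(['Peru'])

# np.where returns a tuple of arrays; [0] grabs the row positions
idx_np = np.where(df.index == 'Peru')[0]

# plain list.index after converting the Index to a list
idx_list = df.index.tolist().index('Peru')

print(idx, idx_np, idx_list)   # 3 [3] 3
```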

Why not create the index with numbers instead of text? Your df can be sorted in many ways beyond the alphabetical, and then you lose track of the row count.
With a numbered index this wouldn't be a problem.
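One way to follow this suggestion, as a sketch (the column name 'country' is an assumption), is to move the country names into a regular column so the frame keeps a plain numeric index:

```python
import pandas as pd

# hypothetical frame with country names as the index
df = pd.DataFrame({'A': [1, 2]}, index=['Bahamas', 'Peru'])

# name the index, then move it into a regular column;
# the frame gets a 0..n-1 RangeIndex that survives later re-sorting
df = df.rename_axis('country').reset_index()
print(df)
```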

Related

Filtering columns in pandas by length

I have a column in a dataframe that contains IATA_Codes (abbreviations) for airports (such as LAX, SFO, ...). However, if I analyze the column values a little more (column.unique()), it turns out there are also 4-digit numbers in it.
How can I filter the column so that my DataFrame will only consist of rows containing a real airport code?
My idea was to filter by length (airport codes are always 3 characters long, while the numbers are always 4 digits), but I don't know how to implement this idea.
array(['LFT', 'HYS', 'ELP', 'DVL', 'ISP', 'BUR', 'DAB', 'DAY', 'GRK',
'GJT', 'BMI', 'LBE', 'ASE', 'RKS', 'GUM', 'TVC', 'ALO', 'IMT',
...
10170, 11577, 14709, 14711, 12255, 10165, 10918, 15401, 13970,
15497, 12265, 14254, 10581, 12016, 11503, 13459, 14222, 14025,
'10333', '14222', '14025', '13502', '15497', '12265'], dtype=object)
You can use .str.len to get the length of each value (after astype(str), since the column mixes strings and integers), and use that as a boolean mask with df.loc:
df = df.loc[df['IATA_Codes'].astype(str).str.len() == 3]
Another possibility is to use a lambda expression:
df[df['IATA_Codes'].apply(lambda x : len(str(x))==3)]['IATA_Codes'].unique()
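A small self-contained sketch of the row-filtering idea, with a few made-up sample values:

```python
import pandas as pd

# made-up sample mixing real 3-letter codes with stray numbers
df = pd.DataFrame({'IATA_Codes': ['LFT', 'HYS', 10170, '14222', 'ELP']})

# astype(str) makes the length test safe for mixed int/str values
mask = df['IATA_Codes'].astype(str).str.len() == 3
print(df.loc[mask, 'IATA_Codes'].tolist())   # ['LFT', 'HYS', 'ELP']
```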

Removing columns if values are not in an ascending order python

Given a data like so:
Symbol    One      Two
1         28.75    25.10
2         29.00    25.15
3         29.10    25.00
I want to drop any column whose values are not in ascending order (though I want to allow for gaps) across all rows. In this case, I want to drop column 'Two'. I tried the following code with no luck:
df.drop(df.columns[df.all(x <= y for x,y in zip(df, df[1:]))])
Thanks
Drop the columns that contain at least one (any) negative value (lt(0)) when their values are differenced by one lag (diff(1)), after NaNs are dropped (dropna):
columns_to_drop = [col for col in df.columns if df[col].diff(1).dropna().lt(0).any()]
df.drop(columns=columns_to_drop)
   Symbol    One
0       1  28.75
1       2  29.00
2       3  29.10
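The diff-based check as a runnable sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Symbol': [1, 2, 3],
                   'One': [28.75, 29.00, 29.10],
                   'Two': [25.10, 25.15, 25.00]})

# a column is out of order if any first difference is negative
columns_to_drop = [col for col in df.columns
                   if df[col].diff(1).dropna().lt(0).any()]
result = df.drop(columns=columns_to_drop)
print(result.columns.tolist())   # ['Symbol', 'One']
```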
An expression that works with gaps (NaN):
A.loc[:, ~(A.iloc[1:, :].reset_index(drop=True) < A.iloc[:-1, :].reset_index(drop=True)).any()]
Without gaps it would be equivalent to
A.loc[:, (A.iloc[1:, :].reset_index(drop=True) >= A.iloc[:-1, :].reset_index(drop=True)).all()]
Without loops to take better advantage of the framework for bigger dataframes.
A.iloc[1:, :] returns a dataframe without the first line
A.iloc[:-1, :] returns a dataframe without the last line
Slices of a DataFrame keep the indices of the corresponding rows, so the two slices have different indices; reset_index(drop=True) gives each side a fresh index counting [0, 1, ...], making the two sides of the inequality compatible. drop=True also prevents the old index from being inserted as an extra column, which would make the boolean mask unalignable with A's columns.
.any() (implicitly with axis=0) checks, for every column, whether any value is True; if so, some value was smaller than its predecessor.
A.loc[:, mask] selects the columns where mask is True and drops the columns where it is False.
The logic can be read as: no value is smaller than its predecessor, or equivalently, every value is greater than or equal to its predecessor.
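A runnable sketch of this slice-and-compare idea, keeping a column only when no value is smaller than the one before it:

```python
import pandas as pd

A = pd.DataFrame({'Symbol': [1, 2, 3],
                  'One': [28.75, 29.00, 29.10],
                  'Two': [25.10, 25.15, 25.00]})

# rows 1..n-1 and rows 0..n-2, realigned on a fresh 0..n-2 index
nxt = A.iloc[1:, :].reset_index(drop=True)
prv = A.iloc[:-1, :].reset_index(drop=True)

# keep a column only when no value is smaller than its predecessor
mask = ~(nxt < prv).any()
print(A.loc[:, mask].columns.tolist())   # ['Symbol', 'One']
```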
The only real logic in the code below is:
map(lambda i: list(df[i]) == sorted(list(df[i])), df.columns)
import pandas as pd

df = pd.DataFrame(
    {
        'Symbol': [1, 2, 3],
        'One': [28.75, 29.00, 29.10],
        'Two': [25.10, 25.15, 25.00],
    }
)
print(df.loc[:, list(map(lambda i: list(df[i]) == sorted(list(df[i])), df.columns))])

Python PANDAS: Applying a function to a dataframe, with arguments defined within dataframe

I have a dataframe with headers 'Category', 'Factor1', 'Factor2', 'Factor3', 'Factor4', 'UseFactorA', 'UseFactorB'.
The value of 'UseFactorA' and 'UseFactorB' are one of the strings ['Factor1', 'Factor2', 'Factor3', 'Factor4'], keyed based on the value in 'Category'.
I want to generate a column, 'Result', which equals dataframe[UseFactorA]/dataframe[UseFactorB]
Take the below dataframe as an example:
Category  Factor1  Factor2  Factor3  Factor4  UseFactorA  UseFactorB
A         1        2        5        8        'Factor1'   'Factor3'
B         2        7        4        2        'Factor3'   'Factor1'
The 'Result' series should be [.2, 2]
However, I cannot figure out how to feed the values of UseFactorA and UseFactorB into an indexer to make this happen; if the columns to use were fixed, I would just write
df['Result'] = df['Factor1']/df['Factor2']
However, when I try to give
df['Results'] = df[df['UseFactorA']]/df[df['UseFactorB']]
I get the error
ValueError: Wrong number of items passed 3842, placement implies 1
Is there a method for doing what I am trying here?
Probably not the prettiest solution (because of the iterrows), but what comes to mind is to iterate through the sets of factors and set the 'Result' value at each index:
for i, factors in df[['UseFactorA', 'UseFactorB']].iterrows():
    df.loc[i, 'Result'] = df.loc[i, factors['UseFactorA']] / df.loc[i, factors['UseFactorB']]
Edit:
Another option:
def factor_calc_for_row(row):
    factorA = row['UseFactorA']
    factorB = row['UseFactorB']
    return row[factorA] / row[factorB]
df['Result'] = df.apply(factor_calc_for_row, axis=1)
Here's the one liner:
df['Results'] = [df[df['UseFactorA'][x]][x]/df[df['UseFactorB'][x]][x] for x in range(len(df))]
How it works:
df[df['UseFactorA']]
returns a DataFrame,
df[df['UseFactorA'][x]]
returns a Series, and
df[df['UseFactorA'][x]][x]
pulls a single value from the Series.
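For larger frames, a vectorized alternative (a sketch, not from the answers above; it assumes the question's column names) avoids the Python-level loop by mapping each row's chosen column name to an integer position and using NumPy fancy indexing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Factor1': [1, 2], 'Factor2': [2, 7],
                   'Factor3': [5, 4], 'Factor4': [8, 2],
                   'UseFactorA': ['Factor1', 'Factor3'],
                   'UseFactorB': ['Factor3', 'Factor1']})

# translate each row's chosen column name into a column position,
# then pull one cell per row with (row, col) fancy indexing
rows = np.arange(len(df))
vals_a = df.to_numpy()[rows, df.columns.get_indexer(df['UseFactorA'])]
vals_b = df.to_numpy()[rows, df.columns.get_indexer(df['UseFactorB'])]
df['Result'] = vals_a / vals_b
print(df['Result'].tolist())   # [0.2, 2.0]
```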

retrieve all values in column exceeding 5 strings - pandas dataframe

I have a column in a dataframe - df - where all values should be strings of length 5, but due to an error in my code some have erroneous values and the string length is either below 5 or greater than 5. Is there a way to retrieve just these rows?
For your next question, please provide an example df and an expected output.
df = pd.DataFrame({'a' : [1, 2, 3], 'b' : ["jasdjdj", "abcde", "hmmamamam"]})
df[df.b.str.len() != 5]
#gives:
   a          b
0  1    jasdjdj
2  3  hmmamamam
How does this work for you? This will return a dataframe where the values meet the condition:
new_DF = your_df[your_df['COLUMN TO CHECK HERE'].str.len() != 5]
print(new_DF)
I think you're looking for a simple masking operation:
is_len_5 = lambda string: len(string) == 5  # named to avoid shadowing the built-in filter
mask = df[col_to_filter].apply(is_len_5)  # returns a boolean Series
new_df = df[mask].copy()  # create a new dataframe
You can apply the opposite condition to find items that aren't length 5 in your original dataframe.
For more details on df.apply() look here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

Remove empty lists in pandas series

I have a long series like the following:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
In [151]: series
Out[151]:
0    [(1, 2)]
1    [(3, 5)]
2          []
3    [(3, 5)]
dtype: object
I want to remove all entries with an empty list. For some reason, boolean indexing does not work.
The following tests both give the same error:
series == [[(1,2)]]
series == [(1,2)]
ValueError: Arrays were different lengths: 4 vs 1
This is very strange, because in the simple example below, indexing works just like above:
In [146]: pd.Series([1,2,3]) == [3]
Out[146]:
0    False
1    False
2     True
dtype: bool
P.S. ideally, I'd like to split the tuples in the series into a DataFrame of two columns also.
You could check to see if the lists are empty using str.len():
series.str.len() == 0
and then use this boolean series to remove the rows containing empty lists.
If each of your entries is a list containing a two-tuple (or else empty), you could create a two-column DataFrame by using the str accessor twice (once to select the first element of the list, then to access the elements of the tuple):
pd.DataFrame({'a': series.str[0].str[0], 'b': series.str[0].str[1]})
Missing entries default to NaN with this method.
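Both steps as a runnable sketch on the question's series:

```python
import pandas as pd

series = pd.Series([[(1, 2)], [(3, 5)], [], [(3, 5)]])

# str.len also works on list elements, so this drops the empty lists
non_empty = series[series.str.len() > 0]

# .str twice: first element of each list, then each element of that tuple
df = pd.DataFrame({'a': series.str[0].str[0], 'b': series.str[0].str[1]})
print(non_empty.index.tolist())   # [0, 1, 3]
print(df.dropna())
```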
Using the built in apply you can filter by the length of the list:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
series = series[series.apply(len) > 0]
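An equivalent filter (a sketch, not used in the answer above) relies on empty lists being falsy:

```python
import pandas as pd

series = pd.Series([[(1, 2)], [(3, 5)], [], [(3, 5)]])

# astype(bool) calls bool() on each element; empty lists are falsy
filtered = series[series.astype(bool)]
print(filtered.index.tolist())   # [0, 1, 3]
```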
Your series is in a bad state -- having a Series of lists of tuples of ints
buries the useful data, the ints, inside too many layers of containers.
However, to form the desired DataFrame, you could use
df = series.apply(lambda x: pd.Series(x[0]) if x else pd.Series(dtype=float)).dropna()
which yields
     0    1
0  1.0  2.0
1  3.0  5.0
3  3.0  5.0
A better way would be to avoid building the malformed series altogether and
form df directly from the data:
data = [[(1,2)],[(3,5)],[],[(3,5)]]
data = [pair for row in data for pair in row]
df = pd.DataFrame(data)
