I have a column in a dataframe df where every value should be a string of length 5, but due to an error in my code some values are erroneous and their string length is either below 5 or greater than 5. Is there a way to retrieve just these rows?
For your next question, please provide an example df and an expected output.
df = pd.DataFrame({'a' : [1, 2, 3], 'b' : ["jasdjdj", "abcde", "hmmamamam"]})
df[df.b.str.len() != 5]
# gives:
   a          b
0  1    jasdjdj
2  3  hmmamamam
How does this work for you? This will return a dataframe where values meet the condition.
new_DF= your_df[your_df['COLUMN TO CHECK HERE'].str.len() != 5]
print(new_DF)
I think you're looking for a simple masking operation:
is_length_5 = lambda string: len(string) == 5  # avoid shadowing the built-in filter()
mask = df[col_to_filter].apply(is_length_5)  # returns a boolean Series
new_df = df[mask].copy()  # create a new dataframe
You can apply an opposite filter to find items that aren't length 5 on your original dataframe.
For more details on df.apply() look here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
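Putting the two approaches together, here is a minimal runnable sketch (toy data; the column names are illustrative) showing that the vectorized `str.len()` filter and the inverted `apply` mask select the same rows:

```python
import pandas as pd

# Toy data: column "b" should hold 5-character strings
df = pd.DataFrame({"a": [1, 2, 3], "b": ["jasdjdj", "abcde", "hmmamamam"]})

# Vectorized: rows whose string length is not 5
bad = df[df["b"].str.len() != 5]

# Equivalent: build the length-5 mask with apply, then invert it
mask = df["b"].apply(lambda s: len(s) == 5)
same = df[~mask]

print(bad)
```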
I am writing a custom error message when 2 Pandas series are not equal and want to use '<' to point at the differences.
Here's the workflow for a failed equality:
Convert both lists to pandas Series: series1 = pd.Series(list1), series2 = pd.Series(list2)
Side-by-side comparison in a dataframe: table = pd.concat([series1, series2], axis=1)
Add column and index names: table.columns = ['...', '...'], table.index = ['...', '...']
Current output:
|Yours|Actual|
|---|---|
|1|1|
|2|2|
|4|3|
Desired output:
|Yours|Actual|-|
|---|---|---|
|1|1||
|2|2||
|4|3|<|
The naive solution is iterating through each list index and if it's not equal, appending '<' to another list then putting this list into pd.concat() but I am looking for a method using Pandas. For example,
error_series = '<' if (abs(yours - actual) >= 1).all(axis=None) else ''
Ideally it would append '<' to a list if the difference between the results is greater than the Margin of Error of 1, otherwise append nothing
Note: I had removed the tables because StackOverflow was being picky and not letting me post my question.
You can create the DF and give index and column names in one line:
import pandas as pd
list1 = [1,2,4]
list2 = [1,2,10]
df = pd.DataFrame(zip(list1, list2), columns=['Yours', 'Actual'])
Create a boolean mask to find the rows that have a too large difference:
margin_of_error = 1
mask = df.diff(axis=1)['Actual'].abs()>margin_of_error
Add a column to the DF and set the values of the mask as you want:
df['too_different'] = mask.map({True: '<', False: ''})
output:
   Yours  Actual too_different
0      1       1
1      2       2
2      4      10             <
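As a variant not shown above, numpy's `np.where` can build the marker column in a single step (same toy lists as in the answer; a sketch, assuming numpy is available):

```python
import numpy as np
import pandas as pd

list1 = [1, 2, 4]
list2 = [1, 2, 10]
df = pd.DataFrame(zip(list1, list2), columns=["Yours", "Actual"])

margin_of_error = 1
# '<' where the absolute difference exceeds the margin, '' elsewhere
df["too_different"] = np.where(
    (df["Yours"] - df["Actual"]).abs() > margin_of_error, "<", "")
print(df)
```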
or you can do something like this:
df = df.assign(diffr=df.apply(lambda x: '<'
                              if abs(x['Yours'] - x['Actual']) > margin_of_error
                              else '', axis=1))
print(df)
'''
   Yours  Actual diffr
0      1       1
1      2       2
2      4      10     <
'''
I have a dataframe where the indexes are not numbers but strings (specifically, name of countries) and they are all unique. Given the name of a country, how do I find its row number (the 'number' value of the index)?
I tried df[df.index == 'country_name'].index but this doesn't work.
We can use Index.get_indexer:
df.index.get_indexer(['Peru'])
array([3])
Or we can build a RangeIndex based on the size of the DataFrame then subset that instead:
pd.RangeIndex(len(df))[df.index == 'Peru']
Int64Index([3], dtype='int64')
Since we're only looking for a single label and the indexes are "all unique" we can also use Index.get_loc:
df.index.get_loc('Peru')
3
Sample DataFrame:
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
}, index=['Bahamas', 'Cameroon', 'Ecuador', 'Peru', 'Japan'])
df:
          A
Bahamas   1
Cameroon  2
Ecuador   3
Peru      4
Japan     5
pd.Index.get_indexer
We can use pd.Index.get_indexer to get integer index.
idx = df.index.get_indexer(list_of_target_labels)
# If you only have single label we can use tuple unpacking here.
[idx] = df.index.get_indexer([country_name])
NB: pd.Index.get_indexer takes a list and returns a list of integer positions (from 0 to n - 1) where the index matches the corresponding target values. Targets missing from the index are marked by -1.
np.where
You could also use np.where here.
import numpy as np

idx = np.where(df.index == country_name)[0]
list.index
We could also use list.index after converting the pd.Index to a list using pd.Index.tolist
idx = df.index.tolist().index(country_name)
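All four approaches above agree on the sample DataFrame; a quick runnable comparison (using the sample data from this answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4, 5]},
    index=["Bahamas", "Cameroon", "Ecuador", "Peru", "Japan"],
)

# Each approach resolves the label 'Peru' to integer position 3
loc = df.index.get_loc("Peru")
[idx] = df.index.get_indexer(["Peru"])
where_pos = np.where(df.index == "Peru")[0][0]
list_pos = df.index.tolist().index("Peru")
print(loc, idx, where_pos, list_pos)
```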
Why don't you make the index numeric instead of text? Your df can be sorted in many ways beyond alphabetical order, and then you lose track of the row count.
With a numeric index this wouldn't be a problem.
I am new to Pandas (but not to data science and Python). This question is not only about how to solve this specific problem but about how to handle problems like this the pandas way.
Please feel free to improve the title of this question, because I am not sure what the correct terms are here.
Here is my MWE
#!/usr/bin/env python3
import pandas as pd
data = {'A': [1, 2, 3, 3, 1, 4],
        'B': ['One', 'Two', 'Three', 'Three', 'Eins', 'Four']}
df = pd.DataFrame(data)
print(df)
Resulting in
   A      B
0  1    One
1  2    Two
2  3  Three
3  3  Three
4  1   Eins
5  4   Four
My assumption is that when the value in column A is 1, the value in column B is always One, and so on.
I want to prove that assumption.
Secondly, I assume that if my first assumption is incorrect, this is not an error but that there are valid (human) reasons for it; e.g. see row index 4, where the A value is paired with Eins (and not One) in column B.
Because of that I also need to see and explore the cases where my assumption is incorrect.
Update of the question:
This data is only an example. In real world I am not aware of the pairing of the two columns. Because of that solutions like this do not work in my case
df.loc[df['A'] == 1, 'B']
I do not know how many and which expressions are in column A.
I do not know how to do that with pandas. How would a pandas professional solve this?
My approach would be to use pure Python code with list(), set() and some iterations. ;)
You can filter your data frame this way:
df.loc[df['A'] == 1, 'B']
This gives you the values of B where A is 1. Next you can add an equals statement:
df.loc[df['A'] == 1, 'B'] == 'One'
Which results in a boolean series (True, False in this case). If you want to check if all are true, you add:
all(df.loc[df['A'] == 1, 'B'] == 'One')
And the answer is False because of the Eins.
EDIT
If you want to create a new column which says if your criterion is met (always the same value for B if A) then you can do this:
df['C'] = df['A'].map(df.groupby('A')['B'].nunique() < 2)
Which results in a bool column. groupby('A')['B'].nunique() counts the distinct B values for each A value; comparing that count with < 2 yields True where the pairing is unique, and map() transfers this per-group result back onto each row via its A value.
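A runnable sketch of that idea on the question's data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 3, 1, 4],
                   "B": ["One", "Two", "Three", "Three", "Eins", "Four"]})

# True where the A value maps to exactly one distinct B value
df["C"] = df["A"].map(df.groupby("A")["B"].nunique() < 2)
print(df)
```

Rows 0 and 4 get False, because A == 1 is paired with both "One" and "Eins".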
If the goal is to test whether there is only one unique B value per A and to return all rows that fail, use DataFrameGroupBy.nunique inside GroupBy.transform to broadcast the per-group count of unique values back to each row, then filter the rows where that count is not 1 (meaning there are 2 or more unique B values for that A):
df1 = df[df.groupby('A').B.transform('nunique').ne(1)]
print (df1)
A B
0 1 One
4 1 Eins
if df1.empty:
    print ('My assumption is good')
else:
    print ('My assumption is wrong')
    print (df1)
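As a small follow-up sketch (the `offending` variable is a hypothetical addition, not from the answer above), the same transform filter can also report which A values break the assumption:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 3, 1, 4],
                   "B": ["One", "Two", "Three", "Three", "Eins", "Four"]})

# Rows whose A value is paired with more than one distinct B value
df1 = df[df.groupby("A")["B"].transform("nunique").ne(1)]

# Hypothetical follow-up: list the offending A values themselves
offending = sorted(df1["A"].unique())
print(offending)
```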
I have a dataframe with multiple columns and rows. I want to locate rows that meet certain criteria on a subset of the columns AND, if a row meets those criteria, change the value of a different column in that same row.
I am prototyping with the following dataframe
df = pd.DataFrame([[1, 2], [4, 5], [5, 5], [5, 9], [55, 55]], columns=['max_speed', 'shield'])
df['frcst_stus'] = 'current'
df
which gives the following result:
   max_speed  shield frcst_stus
0          1       2    current
1          4       5    current
2          5       5    current
3          5       9    current
4         55      55    current
I want to change index row 2 to read 5, 5, 'hello' without changing the rest of the dataframe.
I can do the examples in the Pandas.loc documentation at setting values. I can set a row, a column, and rows matching a callable condition. But the call is on a single column or series. I want two.
And I have found a number of stackoverflow answers that answer the question using loc on a single column to set a value in a second column. That's not my issue. I want to search two columns worth of data.
The following allows me to get the row I want:
result = df[(df['shield'] == 5) & (df['max_speed'] == 5) & (df['frcst_stus'] == 'current')]
And I know that just changing the equal signs (== 'current') to (= 'current') gives me an error.
And when I select on two columns I can set the columns (see below), but both columns get set ('arghh'), and when I try to test the value of 'max_speed' I get a 'False is not in index' error.
df.loc[:, ['max_speed', 'frcst_stus']] = 'hello'
I also get errors from Python's boolean handling when I try to combine conditions. Frankly, I just don't understand the whole operator overloading yet.
If need to set different values to both columns by mask m:
m = (df['shield'] == 5) & (df['max_speed'] == 5) & (df['frcst_stus'] == 'current')
df.loc[m, ['max_speed', 'frcst_stus']] = [100, 'hello']
If need to set same values to both columns by mask m:
df.loc[m, ['max_speed', 'frcst_stus']] = 'hello'
If need to set only one column by mask m:
df.loc[m, 'frcst_stus'] = 'hello'
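A runnable end-to-end version using the question's prototype data, showing that only the matching row changes:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [4, 5], [5, 5], [5, 9], [55, 55]],
    columns=["max_speed", "shield"],
)
df["frcst_stus"] = "current"

# Boolean mask over several columns at once
m = (df["shield"] == 5) & (df["max_speed"] == 5) & (df["frcst_stus"] == "current")

# Set only the matching row's status; the rest of the frame is untouched
df.loc[m, "frcst_stus"] = "hello"
print(df)
```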
I have a dataframe with headers 'Category', 'Factor1', 'Factor2', 'Factor3', 'Factor4', 'UseFactorA', 'UseFactorB'.
The value of 'UseFactorA' and 'UseFactorB' are one of the strings ['Factor1', 'Factor2', 'Factor3', 'Factor4'], keyed based on the value in 'Category'.
I want to generate a column, 'Result', which equals dataframe[UseFactorA]/dataframe[UseFactorB]
Take the below dataframe as an example:
[Category] [Factor1] [Factor2] [Factor3] [Factor4] [UseFactorA] [UseFactorB]
A          1         2         5         8         'Factor1'    'Factor3'
B          2         7         4         2         'Factor3'    'Factor1'
The 'Result' series should be [0.2, 2] (Factor1/Factor3 = 1/5 for row A, Factor3/Factor1 = 4/2 for row B).
However, I cannot figure out how to feed the value of UseFactorA and UseFactorB into an index to make this happen--if the columns to use were fixed, I would just give
df['Result'] = df['Factor1']/df['Factor2']
However, when I try to give
df['Results'] = df[df['useFactorA']]/df[df['useFactorB']]
I get the error
ValueError: Wrong number of items passed 3842, placement implies 1
Is there a method for doing what I am trying here?
Probably not the prettiest solution (because of the iterrows), but what comes to mind is to iterate through the sets of factors and set the 'Result' value at each index:
for i, factors in df[['UseFactorA', 'UseFactorB']].iterrows():
    df.loc[i, 'Result'] = df.loc[i, factors['UseFactorA']] / df.loc[i, factors['UseFactorB']]
Edit:
Another option:
def factor_calc_for_row(row):
    factorA = row['UseFactorA']
    factorB = row['UseFactorB']
    return row[factorA] / row[factorB]

df['Result'] = df.apply(factor_calc_for_row, axis=1)
Here's the one liner:
df['Results'] = [df[df['UseFactorA'][x]][x]/df[df['UseFactorB'][x]][x] for x in range(len(df))]
How it works:
df[df['UseFactorA']]
returns a DataFrame,
df[df['UseFactorA'][x]]
returns a Series, and
df[df['UseFactorA'][x]][x]
pulls a single value from that Series.
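For larger frames, a vectorized alternative (a sketch on a toy frame, not from the answers above; pandas' old DataFrame.lookup method for this is deprecated) uses integer positions from Index.get_indexer with NumPy fancy indexing:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the question's layout
df = pd.DataFrame({
    "Factor1": [1, 2], "Factor2": [2, 7], "Factor3": [5, 4], "Factor4": [8, 2],
    "UseFactorA": ["Factor1", "Factor3"], "UseFactorB": ["Factor3", "Factor1"],
})

rows = np.arange(len(df))
vals = df.to_numpy()
# Per-row column positions named by UseFactorA / UseFactorB
a = vals[rows, df.columns.get_indexer(df["UseFactorA"])]
b = vals[rows, df.columns.get_indexer(df["UseFactorB"])]
df["Result"] = a / b
print(df["Result"].tolist())
```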