I have the following code:
export_file_name = 'output.csv'
export_df = pd.read_csv(export_file_name)
companies = export_df[export_df['title'] > ''].company_name.to_list()
I was wondering what the > operator does in this case?
export_df is a data frame, and export_df['title'] returns a Series of titles from that file. In Pandas, many operators are overloaded for Series, so, for example, when dealing with a Series:
export_df['title'] > ''
is equivalent to:
export_df['title'].gt('')
That returns a Series of boolean values in the same order: each non-empty title has True in the corresponding position, and each empty one has False.
Consequently, when you provide that sequence of boolean values as an index to the original data frame, it returns a new data frame that includes only the rows with True in the corresponding positions, i.e. those with non-empty titles.
This is an idiomatic way to filter data frame rows in Pandas.
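For example, here is a minimal sketch with made-up data (the actual CSV contents are not shown in the question):
import pandas as pd

export_df = pd.DataFrame({
    'title': ['CEO', '', 'CTO'],
    'company_name': ['Acme', 'Globex', 'Initech'],
})

mask = export_df['title'] > ''  # boolean Series: [True, False, True]
companies = export_df[mask].company_name.to_list()
print(companies)  # ['Acme', 'Initech']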
I have created a function for which the input is a pandas dataframe.
It should return the row-indices of the rows with a missing value.
It works for all the defined missingness values except when the cell is entirely empty, even though I tried to cover this case with the entry "" in the missing_values list.
What could be the issue here? Or is there even a more intuitive way to solve this in general?
def missing_values(x):
    df = x
    missing_values = ["NaN", "NAN", "NA", "Na", "n/a", "na", "--", "-", " ", "", "None", "0", "-inf"]  # common ways to indicate missingness
    observations = df.shape[0]  # number of observations (rows)
    variables = df.shape[1]  # number of variables (columns)
    row_index_list = []
    for n in range(0, variables):  # iterate over all variables
        column_list = []  # collects the values of one variable
        for i in range(0, observations):  # iterate over every observation of that variable
            column_list.append(df.iloc[i, n])  # and add the values to the list
        for i in range(0, len(column_list)):  # now for every value
            if column_list[i] in missing_values:  # check whether the value is a missing one
                row_index_list.append(i)  # and if yes, append the row index
    finished = list(set(row_index_list))  # set ensures each index appears only once even with multiple occurrences
    return finished
There might be spurious whitespace, so try adding strip() on this line:
if column_list[i].strip() in missing_values:  # check whether the value is a missing one
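Note that strip() assumes the cell holds a string; if a column contains non-string values (for example a real np.nan float), calling .strip() on it raises an AttributeError. A defensive variant is to convert first:
if str(column_list[i]).strip() in missing_values:  # str() guards against non-string cells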
Also a simpler way to get the indexes of rows containing missing_values is with isin() and any(axis=1):
x = x.replace(r'\s+', '', regex=True)
row_index_list = x[x.isin(missing_values).any(axis=1)].index
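A quick illustration on a tiny made-up frame (the column names are hypothetical):
import pandas as pd

x = pd.DataFrame({'a': ['1', 'NA', '3'], 'b': ['u', 'v', '-']})
missing_values = ['NaN', 'NA', '--', '-', ' ', '', 'None']

x = x.replace(r'\s+', '', regex=True)
row_index_list = x[x.isin(missing_values).any(axis=1)].index
print(list(row_index_list))  # [1, 2] -- the rows containing 'NA' and '-'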
When you import a file into Pandas with, for example, read_csv or read_excel, a literally missing value can only be represented as np.nan (or another null type from the numpy library).
(My mistake here earlier: I was tripped up by the fact that np.nan == np.nan evaluates to False.)
You can replace the np.nan values first with:
df = df.replace(np.nan, 'NaN')
and then your function can catch them.
Another way is to use isna() in pandas:
df.isna()
This returns a DataFrame of the same shape whose cells hold boolean values: True for each cell that is np.nan.
If you do df.isna().any(),
it returns a Series with True for every column that contains a null value.
If you want the row IDs instead, add the parameter axis=1 to any():
df.isna().any(axis=1)
This returns a Series with True for every row that contains an np.nan value.
Now you have boolean values that indicate which rows contain nulls. If you put these boolean values in a list and apply it to df.index, you get the index values of the rows containing nulls.
booleanlist = df.isna().any(axis=1).tolist()
null_row_id = df.index[booleanlist]
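Put together on a small made-up frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', None]})

booleanlist = df.isna().any(axis=1).tolist()  # [False, True, True]
null_row_id = df.index[booleanlist]
print(list(null_row_id))  # [1, 2]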
I need to drop all the rows if the value of a string in a column begins with # or Account.
I used a list comprehension and the .index method, but the list does not contain index numbers.
I need index numbers of all rows to be dropped.
x = [
    i.index for i in df['Cleared/Open Items Symbol']
    if isinstance(i, str) if i.startswith('#') or i.startswith('Account')
]
print(x)
Use str.contains, which is part of pandas. By default it interprets its pattern as a regular expression and returns True or False (type bool) for every row. Since you want to drop the matching rows, negate the mask with the tilde operator and select the remaining elements from your dataframe. This can be written as a simple one-liner.
dfnew = df[~df['columnName'].str.contains('^#|^Account')]
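A quick sketch with made-up data; if the column can contain NaN, passing na=False to str.contains keeps the mask purely boolean:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cleared/Open Items Symbol': ['#123', 'Account X', 'OK', np.nan]})

dfnew = df[~df['Cleared/Open Items Symbol'].str.contains('^#|^Account', na=False)]
print(dfnew)  # keeps the 'OK' row and the NaN row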
This should hopefully be a straightforward question but I'm new to Pandas.
I've got a DataFrame called RawData, and a list of permissible indexes called AllowedIndexes.
What I want to do is split the DataFrame into two new ones:
one DataFrame with only the indexes that appear in the AllowedIndexes list.
one DataFrame with only the indexes that don't appear in the AllowedIndexes list, for data cleaning purposes.
I've provided a simplified version of the actual data I'm using which in reality contains several series.
import pandas as pd
RawData = pd.DataFrame(
    {'Quality': ['#000000', '#FF0000', '#FFFFFF', '#PURRRR', '#123Z']},
    index=['Black', 'Red', 'White', 'Cat', 'Blcak'],
)
AllowedIndexes = ['Black','White','Yellow','Red']
Thanks!
.index gives the index label of each row of the RawData dataframe.
.isin() checks whether each element exists in the AllowedIndexes list.
allowed = RawData[(RawData.index.isin(AllowedIndexes))==True]
not_allowed = RawData[(RawData.index.isin(AllowedIndexes))==False]
Another way, without the explicit comparison against True or False:
allowed = RawData[RawData.index.isin(AllowedIndexes)]
not_allowed = RawData[~(RawData.index.isin(AllowedIndexes))]
In pandas, ~ is the logical NOT operator: it inverts a boolean mask.
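With the sample data above, this yields (output sketched for illustration):
print(allowed)
#        Quality
# Black  #000000
# Red    #FF0000
# White  #FFFFFF

print(not_allowed)
#        Quality
# Cat    #PURRRR
# Blcak  #123Z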
I'm attempting to use Pandas with a JSON object to flatten it, clean up the data, and write it to a relational database.
For any List objects, I want to convert them to a string and create a new column that has a count of the values in the original column.
I've gotten as far as a Series telling me which columns contained a list. Now I want to filter that Series down to the entries that are True. I feel like there should be a straightforward way to do this, but methods like filter only seem to work on the index.
Current:
d False
a.b True
Desired:
a.b True
My original code:
import pandas as _pd
data = {"a":{"b":['x','y','z']},"c":1,"d":None}
df = _pd.json_normalize(data).convert_dtypes()
ldf = df.select_dtypes(include=['object']).applymap(lambda x: isinstance(x,list)).max()
Any suggestions on how to easily filter this down to just the true values?
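One minimal sketch (a boolean Series can be used as a mask on itself):
true_only = ldf[ldf]  # keeps only the True entries
list_columns = ldf.index[ldf]  # or just the names of the columns that were True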
I have some data in a CSV that I want to run some analysis on, to check the quality of data. I have been using Pandas due to how easy it is to load data in from a CSV.
I was wondering what would be the most effective method for comparing all values in a series to see whether each exists within another list of values? I want to do this to check for errors in a CSV, and later I will use these values to try to clean the data. The data could potentially be very large.
For example.
I have a CSV that contains data on the suburbs that people have listed as where they live. Many of these have been entered manually and are prone to typos, misspellings, etc.
To check this I have a list which contains valid suburb names. I will iterate through each value in a series, compare it to each value in the list of valid suburbs, and then return all unique values which are not valid.
Read in values from csv
df = pd.read_csv("user_address")
Extract the Series I want to work with (Suburb), and get all unique strings from the series to reduce the number of comparisons I have to do
series = df['Suburb'].unique()
Iterate through each unique string to see if it matches any of the valid suburb names stored in a list
L = ......list of suburbs
for value in series:
    if value not in L:
        print(value)  # will use value for something more in reality
Return the strings which do not match any of the valid suburb names
The isin() method does this for you and is part of pandas. It compares a column to an array of values and returns True where a value in the data frame appears in the array and False where it does not.
values_not_in_array = df[~df.Suburb.isin(L)].Suburb
values_in_array = df[df.Suburb.isin(L)].Suburb
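A small sketch with made-up suburb data:
import pandas as pd

df = pd.DataFrame({'Suburb': ['Richmond', 'Richmnd', 'Carlton', 'carltn']})
L = ['Richmond', 'Carlton']  # list of valid suburbs

values_not_in_array = df[~df.Suburb.isin(L)].Suburb
print(values_not_in_array.unique())  # ['Richmnd' 'carltn']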