Filter Pandas Series by Boolean Value

I'm attempting to use Pandas with a JSON object to flatten it, clean up the data, and write it to a relational database.
For any List objects, I want to convert them to a string and create a new column that has a count of the values in the original column.
I've gotten as far as getting a Series indicating which columns contained a list. Now I want to filter that Series to get back only the columns that were true. I feel like there should be a straightforward way to filter this Series down to just the True items, but methods like filter only seem to work on the index.
Current:
d False
a.b True
Desired:
a.b True
My original code:
import pandas as _pd
data = {"a":{"b":['x','y','z']},"c":1,"d":None}
df = _pd.json_normalize(data).convert_dtypes()
ldf = df.select_dtypes(include=['object']).applymap(lambda x: isinstance(x,list)).max()
Any suggestions on how to easily filter this down to just the true values?
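One way to do this (a minimal sketch using a hand-built Series matching the output shown above): a boolean Series can be used to index itself, which keeps only the True entries.

```python
import pandas as pd

# Rebuild the Series from the question: column name -> whether it held a list
ldf = pd.Series({"d": False, "a.b": True})

# A boolean Series can index itself: keep only the True entries
true_only = ldf[ldf]
print(true_only)  # a.b    True

# If you only need the column names, take the index
list_columns = ldf[ldf].index.tolist()
print(list_columns)  # ['a.b']
```

The same boolean-indexing idiom works whether the Series came from `.max()`, `.any()`, or any other reduction that produces booleans.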

Related

How can I assign a list's elements to corresponding rows of a dataframe in pandas?

I have numbers in a List that should be assigned to certain rows of a dataframe consecutively.
List=[2,5,7,12….]
In my dataframe that looks similar to the below table, I need to do the following:
Each row whose frame_index is 1 should get the next element of List as its "sequence_number":
the first time Frame_Index == 1, assign the first element of List as Sequence_number;
the next time Frame_Index == 1, assign the second element of List, and so on.
So my goal is to achieve a new dataframe like this:
I don't know which functions to use. If I weren't using Python, I would write a for loop and check where frame_index == 1, but my dataset is large and I need a pythonic way to achieve this. I appreciate any help.
EDIT: I tried the following to fill with my List values to use fillna with ffill afterwards:
concatenated_df['Sequence_number'] = [List[i] for i in concatenated_df.index
                                      if (concatenated_df['Frame_Index'] == 1).any()]
But of course I'm getting a "list index out of range" error.
I think you could do that in two steps.
Add the column and fill it with your list where frame_index == 1.
Forward-fill the gaps with df.ffill() (the shortcut for the older df.fillna(method="ffill"), which is now deprecated).
import pandas as pd
df = pd.DataFrame({"frame_index": [1,2,3,4,1,2]})
sequence = [2,5]
df.loc[df["frame_index"] == 1, "sequence_number"] = sequence
df.ffill(inplace=True) # alias for df.fillna(method="ffill")
This leaves sequence_number as float64, because assigning into a partially empty column introduces NaN. That might be acceptable in your use case; if you want int64, you can cast the column after filling (or force the dtype when creating the column).
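A sketch of that cast, reusing the same toy frame as above:

```python
import pandas as pd

df = pd.DataFrame({"frame_index": [1, 2, 3, 4, 1, 2]})
sequence = [2, 5]

# Assigning into a partially empty column introduces NaN,
# so pandas stores sequence_number as float64
df.loc[df["frame_index"] == 1, "sequence_number"] = sequence
df = df.ffill()

# Cast back to integers once there are no NaNs left
df["sequence_number"] = df["sequence_number"].astype("int64")
print(df["sequence_number"].tolist())  # [2, 2, 2, 2, 5, 5]
```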

What does the `>` operator do in python?

I have the following code:
export_file_name = 'output.csv'
export_df = pd.read_csv(export_file_name)
companies = export_df[export_df['title'] > ''].company_name.to_list()
I was wondering what the > operator does in this case?
export_df is a data frame, and export_df['title'] returns a Series of titles from that file. In Pandas, many operators are overloaded for Series types, so, for example, when dealing with Series:
export_df['title'] > ''
is equivalent to:
export_df['title'].gt('')
That returns a series of boolean values in the same order: each non-empty title will have True on the corresponding position, and each empty will have False.
Consequently, when you provide that sequence of boolean values as an index to the original data frame, it will return a new data frame that includes only the rows with True on the corresponding positions, i.e. those with non-empty titles.
This is an idiomatic way to filter data frame rows in Pandas.
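A minimal self-contained sketch of that filtering (with made-up titles and company names, since the original CSV isn't shown):

```python
import pandas as pd

# Toy data standing in for the CSV; column names taken from the question
export_df = pd.DataFrame({
    "title": ["Engineer", "", "Analyst"],
    "company_name": ["Acme", "Globex", "Initech"],
})

mask = export_df["title"] > ""  # element-wise comparison -> boolean Series
print(mask.tolist())            # [True, False, True]

# Indexing with the mask keeps only rows with a non-empty title
companies = export_df[mask].company_name.to_list()
print(companies)                # ['Acme', 'Initech']
```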

Splitting a Pandas DataFrame based on whether the index name appears in a list

This should hopefully be a straightforward question but I'm new to Pandas.
I've got a DataFrame called RawData, and a list of permissible indexes called AllowedIndexes.
What I want to do is split the DataFrame into two new ones:
one DataFrame with only the indexes that appear in the AllowedIndexes list.
one DataFrame with only the indexes that don't appear in the AllowedIndexes list, for data cleaning purposes.
I've provided a simplified version of the actual data I'm using which in reality contains several series.
import pandas as pd
RawData = pd.DataFrame({'Quality':['#000000', '#FF0000', '#FFFFFF', '#PURRRR','#123Z']}, index = ['Black','Red','White', 'Cat','Blcak'])
AllowedIndexes = ['Black','White','Yellow','Red']
Thanks!
.index takes the index for each row of the RawData dataframe.
.isin() checks if the element exists in the AllowedIndexes list.
allowed = RawData[(RawData.index.isin(AllowedIndexes))==True]
not_allowed = RawData[(RawData.index.isin(AllowedIndexes))==False]
Another way without checking if True, or False:
allowed = RawData[RawData.index.isin(AllowedIndexes)]
not_allowed = RawData[~(RawData.index.isin(AllowedIndexes))]
~ is the element-wise "not" (negation) operator in pandas.
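Putting it together with the sample data from the question:

```python
import pandas as pd

RawData = pd.DataFrame(
    {"Quality": ["#000000", "#FF0000", "#FFFFFF", "#PURRRR", "#123Z"]},
    index=["Black", "Red", "White", "Cat", "Blcak"],
)
AllowedIndexes = ["Black", "White", "Yellow", "Red"]

# Boolean mask: True where the row's index is in the allowed list
mask = RawData.index.isin(AllowedIndexes)
allowed = RawData[mask]
not_allowed = RawData[~mask]

print(allowed.index.tolist())      # ['Black', 'Red', 'White']
print(not_allowed.index.tolist())  # ['Cat', 'Blcak']
```

Note that 'Yellow' appearing in AllowedIndexes but not in RawData is harmless: isin only tests the rows that exist.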

Select a subset of an object type cell in panda Dataframe

I'm trying to select a subset of an object-type column's cells with str.split(pat="'"):
dataset['pictures'].str.split(pat=",")
I want to get the numbers 40092 and 39097 and the two picture dates as two columns, ID and DATE, but as a result I get one column consisting of NaNs.
'pictures' column:
{"col1":"40092","picture_date":"2017-11-06"}
{"col1":"39097","picture_date":"2017-10-31"}
...
Here's what I understood from your question:
You have a pandas Dataframe with one of the columns containing json strings (or any other string that need to be parsed into multiple columns)
E.g.
df = pd.DataFrame({'pictures': [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}',
]})
You want to parse the two elements ('col1' and 'picture_date') into two separate columns for further processing (or perhaps just one of them)
Define a function for parsing the row:
import json

def parse_row(r):
    j = json.loads(r['pictures'])
    return j['col1'], j['picture_date']
And use Pandas DataFrame.apply() method as follows
df1 = df.apply(parse_row, axis=1, result_type='expand')
The result is a new dataframe with two columns - each containing the parsed data:
0 1
0 40092 2017-11-06
1 39097 2017-10-31
If you need just one column you can return a single element from parse_row (instead of a two element tuple in the above example) and just use df.apply(parse_row).
If the values are not in json format, just modify parse_row accordingly (Split, convert string to numbers, etc.)
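If the whole column really is JSON, another possible route (an alternative sketch, not the method above) is to parse each cell with json.loads and let pd.json_normalize expand the resulting dicts into named columns:

```python
import json
import pandas as pd

df = pd.DataFrame({"pictures": [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}',
]})

# Parse each JSON string into a dict, then expand the dicts into columns
parsed = pd.json_normalize(df["pictures"].apply(json.loads).tolist())
print(parsed["col1"].tolist())          # ['40092', '39097']
print(parsed["picture_date"].tolist())  # ['2017-11-06', '2017-10-31']
```

This keeps the original JSON keys as column names, so there is no need to rename positional columns afterwards.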
Thanks for the replies but I solved it by loading the 'pictures' column from the dataset into a list:
picturelist= dataset['pictures'].values.tolist()
Afterwards I created a dataframe from the list made from the pictures column and concatenated it with the original dataset without the pictures column:
two_new_columns = pd.DataFrame(picturelist)
new_dataset = pd.concat([dataset.drop(columns=['pictures']), two_new_columns], axis=1)

Filter pandas dataframe based on defined list of strings which is present in one column

I need to filter a cars pandas dataframe based on a list of strings that could appear (among other strings) in one of the columns.
So I have list of countries like this:
filterLocation = ['Germany','Austria','Slovenia']
I want to filter out all rows that contain any of these words in the Location column of the pandas dataframe.
I have this:
carsresult = cars.loc[~cars['adCarLocation'].isin(filterLocation)]
but this doesn't work for some reason.
Figured out the answer:
for country in filterLocation:
    carsML = carsML[~carsML['adCarLocation'].str.contains(country, na=False)]
na=False is necessary if you have missing values; otherwise you'll get:
TypeError: bad operand type for unary ~: float
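For context: isin didn't work because it tests exact equality, while the country names appear as substrings of the location strings. A loop-free alternative (a sketch with made-up location data) is to join the countries into a single regex alternation and call str.contains once:

```python
import pandas as pd

# Hypothetical sample data; column name taken from the question
cars = pd.DataFrame({"adCarLocation": [
    "Berlin, Germany", "Paris, France", None, "Vienna, Austria",
]})
filterLocation = ["Germany", "Austria", "Slovenia"]

# One regex alternation instead of a loop; na=False makes missing
# values count as "no match" so they survive the ~ negation
pattern = "|".join(filterLocation)
carsresult = cars[~cars["adCarLocation"].str.contains(pattern, na=False)]
print(carsresult["adCarLocation"].tolist())  # ['Paris, France', None]
```

If the country names could contain regex metacharacters, wrap each one in re.escape() before joining.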
