Given a column of “food” (apple, banana, carrot, donuts, egg, ...), I want to make a “category” column whose values correspond to each item in the “food” column.
For example, given the data below:
import pandas as pd
fruit =['apple', 'banana', 'orange']
veg =['carrot', 'onion']
meat=['chicken', 'pork', 'beef']
food = fruit + veg + meat
df = pd.DataFrame(food, columns=['food'])
df
When I write the code like this:
df[df['food']=='apple']['category']='fruit'
df[df['food']=='carrot']['category']='vegetable'
However, a SettingWithCopyWarning occurs when I write it this way.
What would be the best way to set this value?
You got a SettingWithCopyWarning from pandas because chained indexing like df[df['food']=='apple']['category'] = ... assigns to a temporary copy, so the new column never reaches df. You can resolve that in a few different ways:
# Use loc
df['category'] = None # Initialize an empty column
df.loc[df['food']=='apple', 'category'] = 'fruit'
df.loc[df['food']=='carrot', 'category'] = 'vegetable'
# Use map
df['category'] = df['food'].map({
'apple': 'fruit',
'carrot': 'vegetable'
})
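If you want every food to get a category (not just apple and carrot), one option, sketched here using the fruit/veg/meat lists from the question, is to build the full mapping first and pass it to map; anything missing from the mapping comes back as NaN:
category_map = {item: 'fruit' for item in fruit}
category_map.update({item: 'vegetable' for item in veg})
category_map.update({item: 'meat' for item in meat})
df['category'] = df['food'].map(category_map)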
Related
I have a pandas DataFrame where I listed items and categorised them:
col_name |col_group
-------------------------
id | Metadata
listing_url | Metadata
scrape_id | Metadata
name | Text
summary | Text
space | Text
To reproduce:
import pandas
df = pandas.DataFrame([
['id','metadata'],
['listing_url','metadata'],
['scrape_id','metadata'],
['name','Text'],
['summary','Text'],
['space','Text']],
columns=['col_name', 'col_group'])
Can you suggest how I can convert this dataframe to multiple lists based on "col_group":
Metadata = ['id', 'listing_url', 'scrape_id']
Text = ['name','summary','space']
This is to allow me to pass these lists of columns to pandas and drop columns.
I googled a lot and got stuck: all the answers are about converting lists to a DataFrame, not vice versa. Should I aim to convert into a dictionary, or a list of lists?
I have over 100 rows, belonging to 10 categories, so would like to avoid manual hard-coding.
I've tried this code:
import pandas
df = pandas.DataFrame([
[1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a'],
[2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b'],
[3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']],
columns=['id', 'listing_url', 'scrape_id', 'name', 'summary', 'space'])
print(df)
for row in df.iterrows():
    print(row[1].to_list())
which gives this output:
[1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a']
[2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b']
[3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']
You can use
for row in df[['name', 'summary', 'space']].iterrows():
to iterate over specific columns only.
Like this:
In [245]: res = df.groupby('col_group')['col_name'].apply(list)
In [248]: res.tolist()
Out[248]: [['id', 'listing_url', 'scrape_id'], ['name', 'summary', 'space']]
my_vars = df.groupby('col_group').agg(list)['col_name'].to_dict()
Output:
>>> my_vars
{'Text': ['name', 'summary', 'space'], 'metadata': ['id', 'listing_url', 'scrape_id']}
The recommended usage is simply my_vars['Text'] to access the Text list, and so on. If you must have these as distinct names, you can force them into your target scope, e.g. globals:
globals().update(df.groupby('col_group').agg(list)['col_name'].to_dict())
Result:
>>> Text
['name', 'summary', 'space']
>>> metadata
['id', 'listing_url', 'scrape_id']
However, I would advise against that, as you might unwittingly overwrite some of your other objects, or they might not end up in the scope you need (e.g. locals).
I want to initialize the dtypes of a DataFrame's columns to categorical types and specify each column's categories on its creation.
This way seems less efficient because I loop over animals twice:
col_name = pd.Categorical([a.name for a in animals], categories=['bird','cat','dog'])
col_food = pd.Categorical([a.food for a in animals], categories=['meat','veggies'])
df = pd.DataFrame({'Animal': col_name, 'Food': col_food})
This way seems more efficient because I loop over animals just once, but how can I specify the categorical columns' categories?
df = pd.DataFrame([{'Animal': a.name, 'Food': a.food} for a in animals],
dtype={'Animal': ???, 'Food': ???})
I also want to avoid creating the DataFrame first, then converting the columns' types to categorical.
Something like:
dtype={'Food': dtype('category', categories=['meat','veggies']), ...}
Since you didn't include your Animal class, I use a simple one that has name and food attributes.
import pandas as pd

class Animal():
    def __init__(self, name, food):
        self.name = name
        self.food = food

cat = Animal('cat', 'meat')
bird = Animal('bird', 'veggies')
dog = Animal('dog', 'meat')
animals = [cat, dog, bird, bird, dog, cat, cat, cat, dog, dog]

# dtype='category' makes every column categorical, with categories inferred from the data
df = pd.DataFrame([{'Animal': a.name, 'Food': a.food} for a in animals], dtype='category')
print(df.Animal.cat.categories)
print(df.Food.cat.categories)
And the output is:
Index(['bird', 'cat', 'dog'], dtype='object')
Index(['meat', 'veggies'], dtype='object')
I hope this is what you are looking for.
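If you also need to pin the categories explicitly (as the original question asks), one option, just a sketch and not the only way, is to build a CategoricalDtype per column and apply it with astype right after construction. It does add one conversion step, but still only one pass over animals:
animal_dtype = pd.CategoricalDtype(categories=['bird', 'cat', 'dog'])
food_dtype = pd.CategoricalDtype(categories=['meat', 'veggies'])
df = pd.DataFrame([{'Animal': a.name, 'Food': a.food} for a in animals])
df = df.astype({'Animal': animal_dtype, 'Food': food_dtype})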
I have an API that returns a single row of data as a Python dictionary. Most of the keys have a single value, but some of the keys have values that are lists (or even lists-of-lists or lists-of-dictionaries).
When I throw the dictionary into pd.DataFrame to try to convert it to a pandas DataFrame, it throws an "Arrays must be the same length" error. This is because it cannot process the keys which have multiple values (i.e. the keys whose values are lists).
How do I get pandas to treat the lists as 'single values'?
As a hypothetical example:
data = { 'building': 'White House', 'DC?': True,
'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia'] }
I want to turn it into a DataFrame like this:
ix building DC? occupants
0 'White House' True ['Barack', 'Michelle', 'Sasha', 'Malia']
If you pass the dict directly, pandas broadcasts the scalar values across the list; pass it inside a list (one dict per row) to keep the occupants list as a single cell:
In [11]: pd.DataFrame(data)
Out[11]:
DC? building occupants
0 True White House Barack
1 True White House Michelle
2 True White House Sasha
3 True White House Malia
In [12]: pd.DataFrame([data])
Out[12]:
DC? building occupants
0 True White House [Barack, Michelle, Sasha, Malia]
This turns out to be very trivial in the end
import pandas
data = {'building': 'White House', 'DC?': True, 'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia']}
df = pandas.DataFrame([data])
print(df)
Which results in:
DC? building occupants
0 True White House [Barack, Michelle, Sasha, Malia]
A solution for making a DataFrame from a dictionary of lists, where the keys become a sorted index and the column names are provided. Good for creating DataFrames from scraped HTML tables.
d = {'B': [10, 11], 'A': [20, 21]}
df = pd.DataFrame(list(d.values()), columns=['C1', 'C2'], index=list(d.keys())).sort_index()
df
C1 C2
A 20 21
B 10 11
Would it be acceptable if, instead of having one entry with a list of occupants, you had individual entries for each occupant? If so, you could just do:
n = len(data['occupants'])
for key, val in data.items():
    if key != 'occupants':
        data[key] = n * [val]
EDIT: Actually, I'm getting this behavior in pandas (i.e. just with pd.DataFrame(data)) even without this pre-processing. What version are you using?
I had a closely related problem, but my data structure was a multi-level dictionary with lists in the second level dictionary:
result = {'hamster': {'confidence': 1, 'ids': ['id1', 'id2']},
'zombie': {'confidence': 1, 'ids': ['id3']}}
When importing this with pd.DataFrame([result]), I end up with columns named hamster and zombie. The (for me) correct import would be to have these as row titles, and confidence and ids as column titles. To achieve this, I used pd.DataFrame.from_dict:
In [42]: pd.DataFrame.from_dict(result, orient="index")
Out[42]:
confidence ids
hamster 1 [id1, id2]
zombie 1 [id3]
This works for me with python 3.8 + pandas 1.2.3.
If you know the keys of the dictionary beforehand, why not first create an empty DataFrame and then keep adding rows?
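Growing a DataFrame row by row tends to be slow, though, so a common variant of that idea is to collect the row dicts in a plain list and build the frame once at the end. A minimal sketch, reusing the hypothetical occupants data from the question:
import pandas as pd

rows = []
rows.append({'building': 'White House', 'DC?': True,
             'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia']})
# ... append more row dicts here ...
df = pd.DataFrame(rows, columns=['building', 'DC?', 'occupants'])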
This should be an easy one, but because I am not so familiar with Python, I haven't quite figured out how it works.
I have the following csv file
name ; type
apple ; fruit
pear ; fruit
cucumber ; vegetable
cherry ; fruit
green beans ; vegetable
What I want to achieve is to list all distinct types with its corresponding name such as:
fruit: apple, pear, cherry
vegetable: cucumber, green beans
Reading it in with csv.DictReader, I can generate a list of dictionaries from that CSV file, saved in the variable alldata.
alldata =
[
{'name':'apple', 'type':'fruit'},
{'name':'pear', 'type':'fruit'},
...
]
Now I need a list of all distinct type values from alldata
types = ??? #it should contain [fruit, vegetable]
such that I can iterate over the list and extract my names corresponding to these types:
foreach type in types
list_of_names = ??? #extract all values of alldata["type"]==type and put them in a new list
print type + ': ' + list_of_names
Does anybody know how to achieve this?
You can use list comprehensions to solve this problem:
types = set(data['type'] for data in alldata)
list_of_names = [data['name'] for data in alldata if data['type'] == t]  # for a given type t from the loop
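Putting the two together, a minimal sketch assuming alldata is the list of dicts produced by csv.DictReader as described above:
types = set(data['type'] for data in alldata)
for t in types:
    names = [data['name'] for data in alldata if data['type'] == t]
    print(t + ': ' + ', '.join(names))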
A more general approach is to use itertools.groupby:
from itertools import groupby
food = [
{'name': 'apple', 'type': 'fruit'},
{'name': 'pear', 'type': 'fruit'},
{'name': 'parrot', 'type': 'vegetable'}]
for group, items in groupby(sorted(food, key=lambda x: x['type']), lambda x: x['type']):
    print(group, list(items))  # the group key and the items in that group
result is:
fruit [{'type': 'fruit', 'name': 'apple'}, {'type': 'fruit', 'name': 'pear'}]
vegetable [{'type': 'vegetable', 'name': 'parrot'}]
UPD: sort the data before groupby. Thanks @mgilson for pointing this out!
Make an iterator that returns consecutive keys and groups from the iterable. The key is a function computing a key value for each element. If not specified or is None, key defaults to an identity function and returns the element unchanged. Generally, the iterable needs to already be sorted on the same key function.
https://docs.python.org/2/library/itertools.html#itertools.groupby
Use a set:
types = {d['type'] for d in alldata}
I am experienced in R and new to Python Pandas. I am trying to index a DataFrame to retrieve rows that meet a set of several logical conditions - much like the "where" statement of SQL.
I know how to do this in R with dataframes (and with R's data.table package, which is more like a Pandas DataFrame than R's native dataframe).
Here's some sample code that constructs a DataFrame and a description of how I would like to index it. Is there an easy way to do this?
import pandas as pd
import numpy as np
# generate some data
mult = 10000
fruits = ['Apple', 'Banana', 'Kiwi', 'Grape', 'Orange', 'Strawberry']*mult
vegetables = ['Asparagus', 'Broccoli', 'Carrot', 'Lettuce', 'Rutabaga', 'Spinach']*mult
animals = ['Dog', 'Cat', 'Bird', 'Fish', 'Lion', 'Mouse']*mult
xValues = np.random.normal(loc=80, scale=2, size=6*mult)
yValues = np.random.normal(loc=79, scale=2, size=6*mult)
data = {'Fruit': fruits,
'Vegetable': vegetables,
'Animal': animals,
'xValue': xValues,
'yValue': yValues,}
df = pd.DataFrame(data)
# shuffle the columns to break structure of repeating fruits, vegetables, animals
np.random.shuffle(df.Fruit)
np.random.shuffle(df.Vegetable)
np.random.shuffle(df.Animal)
df.head(30)
# filter sets
fruitsInclude = ['Apple', 'Banana', 'Grape']
vegetablesExclude = ['Asparagus', 'Broccoli']
# subset1: All rows and columns where:
# (fruit in fruitsInclude) AND (Vegetable not in vegetablesExclude)
# subset2: All rows and columns where:
# (fruit in fruitsInclude) AND [(Vegetable not in vegetablesExclude) OR (Animal == 'Dog')]
# subset3: All rows and specific columns where above logical conditions are true.
All help and inputs welcomed and highly appreciated!
Thanks,
Randall
# subset1: All rows and columns where:
# (fruit in fruitsInclude) AND (Vegetable not in vegetablesExclude)
df.loc[df['Fruit'].isin(fruitsInclude) & ~df['Vegetable'].isin(vegetablesExclude)]

# subset2: All rows and columns where:
# (fruit in fruitsInclude) AND [(Vegetable not in vegetablesExclude) OR (Animal == 'Dog')]
df.loc[df['Fruit'].isin(fruitsInclude) & (~df['Vegetable'].isin(vegetablesExclude) | (df['Animal'] == 'Dog'))]

# subset3: Rows where the conditions above hold, restricted to specific columns
# (for example, only 'Fruit' and 'xValue'):
df.loc[df['Fruit'].isin(fruitsInclude) & ~df['Vegetable'].isin(vegetablesExclude) & (df['Animal'] == 'Dog'), ['Fruit', 'xValue']]
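If the boolean expressions get long, naming the masks first keeps things readable; a small sketch using the same conditions:
keep_fruit = df['Fruit'].isin(fruitsInclude)
drop_veg = df['Vegetable'].isin(vegetablesExclude)
is_dog = df['Animal'] == 'Dog'

subset1 = df.loc[keep_fruit & ~drop_veg]
subset2 = df.loc[keep_fruit & (~drop_veg | is_dog)]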