Index a Python Pandas DataFrame with multiple conditions (SQL-like where statement) - python

I am experienced in R and new to Python Pandas. I am trying to index a DataFrame to retrieve rows that meet a set of several logical conditions - much like the "where" statement of SQL.
I know how to do this in R with dataframes (and with R's data.table package, which is more like a Pandas DataFrame than R's native dataframe).
Here's some sample code that constructs a DataFrame and a description of how I would like to index it. Is there an easy way to do this?
import pandas as pd
import numpy as np
# generate some data
mult = 10000
fruits = ['Apple', 'Banana', 'Kiwi', 'Grape', 'Orange', 'Strawberry']*mult
vegetables = ['Asparagus', 'Broccoli', 'Carrot', 'Lettuce', 'Rutabaga', 'Spinach']*mult
animals = ['Dog', 'Cat', 'Bird', 'Fish', 'Lion', 'Mouse']*mult
xValues = np.random.normal(loc=80, scale=2, size=6*mult)
yValues = np.random.normal(loc=79, scale=2, size=6*mult)
data = {'Fruit': fruits,
        'Vegetable': vegetables,
        'Animal': animals,
        'xValue': xValues,
        'yValue': yValues}
df = pd.DataFrame(data)
# shuffle the columns to break the structure of repeating fruits, vegetables, animals
# (assign a shuffled copy back; calling np.random.shuffle on a Series is fragile across pandas versions)
df['Fruit'] = np.random.permutation(df['Fruit'])
df['Vegetable'] = np.random.permutation(df['Vegetable'])
df['Animal'] = np.random.permutation(df['Animal'])
df.head(30)
# filter sets
fruitsInclude = ['Apple', 'Banana', 'Grape']
vegetablesExclude = ['Asparagus', 'Broccoli']
# subset1: All rows and columns where:
# (Fruit in fruitsInclude) AND (Vegetable not in vegetablesExclude)
# subset2: All rows and columns where:
# (Fruit in fruitsInclude) AND [(Vegetable not in vegetablesExclude) OR (Animal == 'Dog')]
# subset3: All rows and specific columns where the above logical conditions are true.
All help and inputs welcomed and highly appreciated!
Thanks,
Randall

# subset1: All rows and columns where:
# (Fruit in fruitsInclude) AND (Vegetable not in vegetablesExclude)
df.loc[df['Fruit'].isin(fruitsInclude) & ~df['Vegetable'].isin(vegetablesExclude)]
# subset2: All rows and columns where:
# (Fruit in fruitsInclude) AND [(Vegetable not in vegetablesExclude) OR (Animal == 'Dog')]
df.loc[df['Fruit'].isin(fruitsInclude) & (~df['Vegetable'].isin(vegetablesExclude) | (df['Animal'] == 'Dog'))]
# subset3: All rows and specific columns where the above logical conditions are true.
df.loc[df['Fruit'].isin(fruitsInclude) & ~df['Vegetable'].isin(vegetablesExclude) & (df['Animal'] == 'Dog'), ['Fruit', 'xValue', 'yValue']]
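If you prefer something that reads even more like a SQL where clause, DataFrame.query accepts in / not in expressions and references local variables with an @ prefix. A minimal sketch, assuming the df and the filter lists defined above:
# query() strings read much like SQL where clauses
subset1 = df.query("Fruit in @fruitsInclude and Vegetable not in @vegetablesExclude")
subset2 = df.query("Fruit in @fruitsInclude and (Vegetable not in @vegetablesExclude or Animal == 'Dog')")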

Related

How to make a column of categorised group in pandas

Given a “food” column (apple, banana, carrot, donuts, egg, ...), I want to create a “category” column whose values correspond to each item in the “food” column.
For example, given the information below:
import pandas as pd
fruit =['apple', 'banana', 'orange']
veg =['carrot', 'onion']
meat=['chicken', 'pork', 'beef']
food = fruit + veg + meat
df = pd.DataFrame(food, columns=['food'])
df
When I write the code like this:
df[df['food']=='apple']['category']='fruit'
df[df['food']=='carrot']['category']='vegetable'
However, a SettingWithCopyWarning occurs when I write it this way.
What would be the best way to set this value?
You got a SettingWithCopyWarning because chained indexing such as df[mask]['category'] = ... assigns to a temporary copy rather than to df itself. You can resolve that in a few different ways:
# Use loc
df['category'] = None # Initialize an empty column
df.loc[df['food']=='apple', 'category'] = 'fruit'
df.loc[df['food']=='carrot', 'category'] = 'vegetable'
# Use map
df['category'] = df['food'].map({
    'apple': 'fruit',
    'carrot': 'vegetable'
})
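Another option, since the question already defines the fruit, veg, and meat lists, is to build one mapping dict from those lists so every food gets a category in a single map() call. A minimal sketch, assuming those lists and the df above:
# build one dict mapping each food item to its category, then map it onto the column
category_map = {item: 'fruit' for item in fruit}
category_map.update({item: 'vegetable' for item in veg})
category_map.update({item: 'meat' for item in meat})
df['category'] = df['food'].map(category_map)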

Updating/inserting a data table using Python

I would like some advice on how to update/insert new data into an already existing data table using Python/Databricks:
# Inserting and updating already existing data
# Original data
import pandas as pd
source_data = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']
               }
df1 = pd.DataFrame(source_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df1)
# New data
new_data = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']
            }
df2 = pd.DataFrame(new_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df2)
# What the updated table will look like
updated_data = {'Customer Number': ['1', '2', '3', '4'],
                'Colour': ['Blue', 'Blue', 'Green', 'Blue'],
                'Flow': ['Bad', 'Bad', 'Good', 'Bad']
                }
df3 = pd.DataFrame(updated_data, columns=['Customer Number', 'Colour', 'Flow'])
print(df3)
What you can see here is that the original data has three customers. I then get 'new_data', which contains an update to customer 1's data and new data for customer 4, who was not in the original data. Then if you look at 'updated_data' you can see what the final data should look like: customer 1's data has been updated and customer 4's data has been inserted.
Does anyone know where I should start with this? Which module I could use?
I’m not expecting someone to solve this in terms of developing, just need a nudge in the right direction.
Edit: the data source is .txt or CSV, the output is JSON, but as I load the data to Cosmos DB it’ll automatically convert so don’t worry too much about that.
Thanks
Current data frame structure and 'DataFrame.update'
With some preparation, you can use the pandas 'update' function.
First, the data frames must be indexed (this is often useful anyway).
Second, the source data frame must be extended by the new indices with dummy/NaN data so that it can be updated.
# set indices of original data frames
col = 'Customer Number'
df1.set_index(col, inplace=True)
df2.set_index(col, inplace=True)
df3.set_index(col, inplace=True)
# extend source data frame by new customer indices
df4 = df1.copy().reindex(index=df1.index.union(df2.index))
# update data
df4.update(df2)
# verify that the new approach yields the correct result
# (compare element-wise and reduce; plain all() on a DataFrame only iterates its column labels)
assert (df3 == df4).all().all()
Current data frame structure and 'pd.concat'
A slightly easier approach joins the data frames and removes duplicate
rows (and sorts by index if wanted). However, the temporary concatenation requires
more memory which may limit the size of the data frames.
df5 = pd.concat([df1, df2])
df5 = df5.loc[~df5.index.duplicated(keep='last')].sort_index()
assert (df3 == df5).all().all()
Alternative data structure
Given that 'Customer Number' is the crucial attribute of your data, you may also consider restructuring your original dictionaries like this:
{'1': ['Red', 'Good'], '2': ['Blue', 'Bad'], '3': ['Green', 'Good']}
Then updating your data simply corresponds to (re)setting the key of the source data with the new data. Typically, working directly on dictionaries is faster than using data frames.
# define function to restructure data, for demonstration purposes only
def restructure(data):
    # transpose original data
    # https://stackoverflow.com/a/6473724/5350621
    vals = data.values()
    rows = list(map(list, zip(*vals)))
    # create new restructured dictionary with customers as keys
    restructured = dict()
    for row in rows:
        restructured[row[0]] = row[1:]
    return restructured
# restructure data
source_restructured = restructure(source_data)
new_restructured = restructure(new_data)
# simply (re)set new keys
final_restructured = source_restructured.copy()
for key, val in new_restructured.items():
    final_restructured[key] = val
# convert to data frame and check results
df6 = pd.DataFrame(final_restructured, index=['Colour', 'Flow']).T
assert (df3 == df6).all().all()
PS: When setting 'df1 = pd.DataFrame(source_data, columns=[...])' you do not need the 'columns' argument because your dictionaries are nicely named and the keys are automatically taken as column names.
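For instance, a minimal sketch using the source_data dict from the question:
# the dict keys become the column names automatically
df1 = pd.DataFrame(source_data)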
You can use set intersection to find the Customer Numbers to update, and set difference to find the new Customer Numbers to add.
Then you can first update the rows of the initial data frame by iterating through the intersection of Customer Numbers, and then concatenate only the new rows of the data frame with the new values onto the initial data frame.
# shorthand for the column name, for clarity
cn = 'Customer Number'
# convert Customer Number values into integers to use sets
CusNum_df1 = [int(x) for x in df1[cn].values]
CusNum_df2 = [int(x) for x in df2[cn].values]
# find Customer Numbers to update and to add
CusNum_to_update = list(set(CusNum_df1).intersection(set(CusNum_df2)))
CusNum_to_add = list(set(CusNum_df2) - set(CusNum_df1))
# update rows in the initial data frame
for num in CusNum_to_update:
    index_initial = df1.loc[df1[cn] == str(num)].index[0]
    index_new = df2.loc[df2[cn] == str(num)].index[0]
    for col in df1.columns:
        df1.at[index_initial, col] = df2.loc[index_new, col]
# concatenate new rows to the initial data frame
for num in CusNum_to_add:
    df1 = pd.concat([df1, df2.loc[df2[cn] == str(num)]]).reset_index(drop=True)
Output:
  Customer Number Colour  Flow
0               1   Blue   Bad
1               2   Blue   Bad
2               3  Green  Good
3               4   Blue   Bad
There are many ways, but in terms of readability, I would prefer to do this.
import pandas as pd
dict_source = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']
               }
df_origin = pd.DataFrame.from_dict(dict_source)
dict_new = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']
            }
df_new = pd.DataFrame.from_dict(dict_new)
df_result = df_origin.copy()
df_result.set_index(['Customer Number'], inplace=True)
df_new.set_index(['Customer Number'], inplace=True)
df_result.update(df_new)  # update number 1
# handle number 4
df_result.reset_index(['Customer Number'], inplace=True)
df_new.reset_index(['Customer Number'], inplace=True)
df_result = df_result.merge(df_new, on=list(df_result), how='outer')
print(df_result)
  Customer Number Colour  Flow
0               1   Blue   Bad
1               2   Blue   Bad
2               3  Green  Good
3               4   Blue   Bad
You can use 'Customer Number' as index and use update method:
import pandas as pd
source_data = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', 'Good']
               }
df1 = pd.DataFrame(source_data, index=source_data['Customer Number'], columns=['Colour', 'Flow'])
print(df1)
# New data
new_data = {'Customer Number': ['1', '4'],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']
            }
df2 = pd.DataFrame(new_data, index=new_data['Customer Number'], columns=['Colour', 'Flow'])
print(df2)
df3 = df1.reindex(index=df1.index.union(df2.index))
df3.update(df2)
print(df3)
  Colour  Flow
1   Blue   Bad
2   Blue   Bad
3  Green  Good
4   Blue   Bad
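If you want 'Customer Number' back as an ordinary column after the update, a small follow-up sketch (assuming the df3 built above):
# name the index and turn it back into a regular column
df3.index.name = 'Customer Number'
df3 = df3.reset_index()
print(df3)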

How do I set the order of a grouped bar chart with Chartify?

How can users change the order of the grouped bars in the example below?
ch = chartify.Chart(blank_labels=True, x_axis_type='categorical')
ch.plot.bar(
    data_frame=quantity_by_fruit_and_country,
    categorical_columns=['fruit', 'country'],
    numeric_column='quantity')
ch.show('png')
The bar plot method has a categorical_order_by parameter that can be used to change the order. As specified in the documentation, set it equal to 'values' or 'labels' to sort by the corresponding dimension.
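For example, a minimal sketch (assuming the chart setup and the quantity_by_fruit_and_country data frame from the question) that sorts the grouped bars by their category labels:
ch = chartify.Chart(blank_labels=True, x_axis_type='categorical')
ch.plot.bar(
    data_frame=quantity_by_fruit_and_country,
    categorical_columns=['fruit', 'country'],
    numeric_column='quantity',
    categorical_order_by='labels')
ch.show('png')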
For a custom sort, you can pass a list of values to the categorical_order_by parameter. Since the bars are grouped by two dimensions, the list should contain tuples, as in the example below:
from itertools import product
outside_groups = ['Apple', 'Orange', 'Banana', 'Grape']
inner_groups = ['US', 'JP', 'BR', 'CA', 'GB']
sort_order = list(product(outside_groups, inner_groups))
# Plot the data
ch = chartify.Chart(blank_labels=True, x_axis_type='categorical')
ch.plot.bar(
    data_frame=quantity_by_fruit_and_country,
    categorical_columns=['fruit', 'country'],
    numeric_column='quantity',
    categorical_order_by=sort_order)
ch.show('png')

Pandas df to dictionary with values as python lists aggregated from a df column

I have a pandas df containing 'features' for stocks, which looks like this:
I am now trying to create a dictionary with each unique sector as a key and a Python list of the tickers for that sector as the value, so I end up with something that looks like this:
{'consumer_discretionary': ['AAP',
'AMZN',
'AN',
'AZO',
'BBBY',
'BBY',
'BWA',
'KMX',
'CCL',
'CBS',
'CHTR',
'CMG',
etc.
I could iterate over the pandas df rows to create the dictionary, but I prefer a more pythonic solution. Thus far, this code is a partial solution:
df.set_index('sector')['ticker'].to_dict()
Any feedback is appreciated.
UPDATE:
The solution by @wrwrwr
df.set_index('ticker').groupby('sector').groups
partially works, but it returns a pandas Index object as the value instead of a python list. Any ideas about how to convert it to a python list in the same line, without having to iterate over the dictionary?
Wouldn't f.set_index('ticker').groupby('sector').groups be what you want?
For example:
f = pd.DataFrame({
    'ticker': ('t1', 't2', 't3'),
    'sector': ('sa', 'sb', 'sb'),
    'name': ('n1', 'n2', 'n3')})
groups = f.set_index('ticker').groupby('sector').groups
# {'sa': Index(['t1']), 'sb': Index(['t2', 't3'])}
To ensure that they have the type you want:
{k: list(v) for k, v in f.set_index('ticker').groupby('sector').groups.items()}
or:
f.set_index('ticker').groupby('sector').apply(lambda g: list(g.index)).to_dict()

Get the indices of a dataframe to use on a list

I am trying to extract elements of a list based on the contents of a pandas dataframe. This is probably best explained through an example:
Say I have a list of lists called a
a = [['Lazy', 'Brown', 'Fox'], ['Jumps', 'Over'], ['Big', 'Blue', 'Sea']]
and a pandas dataframe called df in the form of
Name Group
A 1
B 1
C 2
I want to index list a based on the Group variable in df. So I would have a result
[['Lazy', 'Brown', 'Fox', 'Jumps', 'Over'], ['Big', 'Blue', 'Sea']]
Or something similar.
I am more used to coding in R, in which this process would be relatively straightforward - so I am hoping that is also the case in python, but I haven't found anything that will help me solve this problem in python yet.
You could express this as a groupby/agg operation:
import pandas as pd
a = [['Lazy', 'Brown', 'Fox'], ['Jumps', 'Over'], ['Big', 'Blue', 'Sea']]
df = pd.DataFrame({'Name':list('ABC'), 'Group':[1,1,2]})
df['a'] = a
print(df.groupby(['Group'])['a'].sum())
yields
Group
1 [Lazy, Brown, Fox, Jumps, Over]
2 [Big, Blue, Sea]
Name: a, dtype: object
Aggregation by summing works because the sum of two lists is a concatenated list:
In [322]: ['Lazy', 'Brown', 'Fox'] + ['Jumps', 'Over']
Out[322]: ['Lazy', 'Brown', 'Fox', 'Jumps', 'Over']
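Note that relying on sum() for list concatenation can be slow and, depending on the pandas version, may be rejected for object dtype. A hedged alternative sketch that flattens each group explicitly (assuming the df built above):
# apply() returning a list yields one concatenated list per group
print(df.groupby('Group')['a'].apply(lambda lists: [word for lst in lists for word in lst]))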
