Operations on multiple data frames in pandas - Python

I have several tables that look like this:
ID YY ZZ
2 97 826
2 78 489
4 47 751
4 110 322
6 67 554
6 88 714
code:
import pandas as pd

raw = {'ID': [2, 2, 4, 4, 6, 6],
       'YY': [97, 78, 47, 110, 67, 88],
       'ZZ': [826, 489, 751, 322, 554, 714]}
df = pd.DataFrame(raw)
For each of these dfs, I have to perform a number of operations: group by ID, extract the length (group size) and the mean of the column ZZ, and put the results in a new df that looks like this:
Cities length mean
Paris 0 0
Madrid 0 0
Berlin 0 0
Warsaw 0 0
London 0 0
code:
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
        'length': 0,
        'mean': 0}
df2 = pd.DataFrame(raw2)
I pulled out the size and the mean of the individual groups:
df_grouped = df.groupby('ID').ZZ.size()
df_grouped2 = df.groupby('ID').ZZ.mean()
The problem occurs when trying to transfer the results to the new table: it does not contain all the cities, and the results must be matched according to the appropriate key.
I tried to use a dictionary:
dic_cities = {"Paris":df_grouped.loc[2],
"Madrid":df_grouped.loc[4],
"Warsaw":df_grouped.loc[6],
"Berlin":df_grouped.loc[8],
"London":df_grouped.loc[10]}
Unfortunately, I'm receiving KeyError: 8
I have 19 dfs from which I have to extract this data, and the final tables have to look like this:
Cities length mean
Paris 2 657.5
Madrid 2 536.5
Berlin 0 0.0
Warsaw 2 634.0
London 0 0.0
Does anyone know how to deal with this using groupby and a dictionary, or know a better way to do it?

First, you should index df2 on 'Cities':
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2).set_index('Cities')
Then you should reverse your dictionary:
dic_cities = {2: "Paris",
              4: "Madrid",
              6: "Warsaw",
              8: "Berlin",
              10: "London"}
Once this is done, the processing is as simple as a groupby:
import numpy as np

for i, sub in df.groupby('ID'):
    df2.loc[dic_cities[i]] = sub.ZZ.agg([len, np.mean]).tolist()
Which gives for df2:
length mean
Cities
Paris 2.0 657.5
Madrid 2.0 536.5
Berlin 0.0 0.0
Warsaw 2.0 634.0
London 0.0 0.0
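For reference, the same result can be reached without the explicit loop. A minimal sketch, assuming the reversed dic_cities above and df2 indexed on 'Cities':
# aggregate once, relabel the ID index to city names, align onto df2
agg = df.groupby('ID').ZZ.agg(length='size', mean='mean')
agg.index = agg.index.map(dic_cities)  # 2 -> 'Paris', 4 -> 'Madrid', ...
df2.update(agg)  # cities with no matching ID keep their zeros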

See this:
import pandas as pd
# setup raw data
raw = {'ID': [2, 2, 4, 4, 6, 6,], 'YY': [97,78,47,110,67,88], 'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
# get mean values
mean_values = df.groupby('ID').mean()
# drop column
mean_values = mean_values.drop(['YY'], axis=1)
# get occurrence number
occurrence = df.groupby('ID').size()
# save data
result = pd.concat([occurrence, mean_values], axis=1, sort=False)
# rename columns
result.rename(columns={0:'length', 'ZZ':'mean'}, inplace=True)
# city data
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'], 'length': 0, 'mean': 0}
df2 = pd.DataFrame(raw2)
# rename indexes to the matching IDs
df2 = df2.rename(index={0: 2, 1: 4, 2: 8, 3: 6, 4: 10})
# merge data
df2['length'] = result['length']
df2['mean'] = result['mean']
Output:
Cities length mean
2 Paris 2.0 657.5
4 Madrid 2.0 536.5
8 Berlin NaN NaN
6 Warsaw 2.0 634.0
10 London NaN NaN
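To match the desired final table, the leftover NaNs can then be zero-filled (a small addition on top of the answer above):
df2[['length', 'mean']] = df2[['length', 'mean']].fillna(0)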

Related

How to Lookup from Different Dataframe into a Middle Column of First Dataframe

I have 2 DataFrames from 2 different csv files, and both files have, let's say, 5 columns each. I need to look up one column from the second DataFrame into the first DataFrame, so that the first DataFrame has 6 columns, matching rows by ID.
Example are as below:
import pandas as pd
data = [[6661, 'Lily', 21, 5000, 'USA'], [6672, 'Mark', 32, 32500, 'Canada'], [6631, 'Rose', 20, 1500, 'London'],
[6600, 'Jenifer', 42, 50000, 'London'], [6643, 'Sue', 27, 8000, 'Turkey']]
ds_main = pd.DataFrame(data, columns = ['ID', 'Name', 'Age', 'Income', 'Country'])
data2 = [[6672, 'Mark', 'Shirt', 8.5, 2], [6643, 'Sue', 'Scraft', 2.0, 5], [6661, 'Lily', 'Blouse', 11.9, 2],
[6600, 'Jenifer', 'Shirt', 9.8, 1], [6631, 'Rose', 'Pants', 4.5, 2]]
ds_rate = pd.DataFrame(data2, columns = ['ID', 'Name', 'Product', 'Rate', 'Quantity'])
I wanted to look up the 'Rate' from ds_rate into ds_main. However, I wanted the rate to be placed in the middle of the ds_main DataFrame.
The result should be as below:
I have tried using merge and insert, but was still unable to get the result that I wanted. Is there an easy way to do it?
You could use set_index + loc to reorder "Rate" according to the "ID" order in ds_main; then insert:
ds_main.insert(3, 'Rate', ds_rate.set_index('ID')['Rate'].loc[ds_main['ID']].reset_index(drop=True))
Output:
ID Name Age Rate Income Country
0 6661 Lily 21 11.9 5000 USA
1 6672 Mark 32 8.5 32500 Canada
2 6631 Rose 20 4.5 1500 London
3 6600 Jenifer 42 9.8 50000 London
4 6643 Sue 27 2.0 8000 Turkey
Assuming 'ID' is unique
ds_main.iloc[:, :3].merge(ds_rate[['ID', 'Rate']]).join(ds_main.iloc[:, 3:])
ID Name Age Rate Income Country
0 6661 Lily 21 11.9 5000 USA
1 6672 Mark 32 8.5 32500 Canada
2 6631 Rose 20 4.5 1500 London
3 6600 Jenifer 42 9.8 50000 London
4 6643 Sue 27 2.0 8000 Turkey
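A map-based variant of the first approach reads similarly and may be a useful sketch, assuming 'ID' is unique in ds_rate:
# build an ID -> Rate lookup keyed by index, then insert at position 3
rate_by_id = ds_rate.set_index('ID')['Rate']
ds_main.insert(3, 'Rate', ds_main['ID'].map(rate_by_id))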

How to eliminate rows in a dataframe by selecting a specific range for each column? - Pandas

I am working on a dataframe that displays information on property rentals in Brazil. This is a sample of the dataset:
import pandas as pd

data = {
    'city': ['São Paulo', 'Rio', 'Recife'],
    'area(m2)': [90, 120, 60],
    'Rooms': [3, 2, 4],
    'Bathrooms': [2, 3, 3],
    'animal': ['accept', 'do not accept', 'accept'],
    'rent($)': [2000, 3000, 800]}
df = pd.DataFrame(
    data,
    columns=['city', 'area(m2)', 'Rooms', 'Bathrooms', 'animal', 'rent($)'])
print(df)
This is how the sample looks:
city area(m2) Rooms Bathrooms animal rent($)
0 São Paulo 90 3 2 accept 2000
1 Rio 120 2 3 do not accept 3000
2 Recife 60 4 3 accept 800
I want to filter the dataset to select only the apartments that have at most 2 rooms and 2 bathrooms.
Do you know how I can do this?
Try with:
out = df.loc[(df.Rooms <= 2) & (df.Bathrooms <= 2)]
You can use the query() method:
out = df.query('Rooms <= 2 and Bathrooms <= 2')
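If the bounds live in variables, query() can reference them with @ (a usage note, not part of the original answer):
max_rooms, max_baths = 2, 2
out = df.query('Rooms <= @max_rooms and Bathrooms <= @max_baths')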
You can filter the values on the dataframe
import pandas as pd
data = {
    'city': ['São Paulo', 'Rio', 'Recife'],
    'area(m2)': [90, 120, 60],
    'Rooms': [3, 2, 4],
    'Bathrooms': [2, 3, 3],
    'animal': ['accept', 'do not accept', 'accept'],
    'rent($)': [2000, 3000, 800]}
df = pd.DataFrame(
    data,
    columns=['city', 'area(m2)', 'Rooms', 'Bathrooms', 'animal', 'rent($)'])
df_filtered = df[(df['Rooms'] <= 2) & (df['Bathrooms'] <= 2)]
print(df_filtered)
Returns an empty frame, since none of the three sample rows has both Rooms <= 2 and Bathrooms <= 2:
Empty DataFrame
Columns: [city, area(m2), Rooms, Bathrooms, animal, rent($)]
Index: []
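If you later need a two-sided range per column rather than a single upper bound, Series.between follows the same pattern (the 1-2 bounds here are purely illustrative):
df_range = df[df['Rooms'].between(1, 2) & df['Bathrooms'].between(1, 2)]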

Pandas label encoding column with default label for invalid row values

For a data frame I replaced a set of items in a column with a range of values as follows:
df['borough_num'] = df['Borough'].replace(regex=['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX'], value=[1, 2, 3, 4,5])
The issue is that I want to replace all the remaining elements in 'Borough', those not mentioned above, with the value 0. I also need to use regex because some of the data looks like '07 BRONX', which should also be replaced by 5, not 0.
To replace all other values by 0, you can do:
# create a map from borough name to code 1-5
new_values = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND', 'BRONX']
maps = dict(zip(new_values, range(1, len(new_values) + 1)))
# map the values, defaulting to 0
df['borough_num'] = df['Borough'].apply(lambda x: maps.get(x, 0))
Using map with fillna: any value not in the map dict will come back as NaN, then we just fillna:
df.Borough.map(dict(zip(['QUEENS', 'BRONX'],[1,2]))).fillna(0).astype(int)
0 1
1 2
2 2
3 0
Name: Borough, dtype: int32
I see you want to perform category encoding with some imposed order. I would recommend using pd.Categorical with ordered=True:
df = pd.DataFrame({
    'Borough': ['QUEENS', 'BRONX', 'MANHATTAN', 'BROOKLYN', 'INVALID']})
df
Borough
0 QUEENS
1 BRONX
2 MANHATTAN
3 BROOKLYN
4 INVALID
keys = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX']
df['borough_num'] = pd.Categorical(
    df['Borough'], categories=keys, ordered=True).codes + 1
df
Borough borough_num
0 QUEENS 3
1 BRONX 5
2 MANHATTAN 1
3 BROOKLYN 2
4 INVALID 0
pd.Categorical returns invalid strings as -1:
pd.Categorical(
df['Borough'], categories=keys, ordered=True).codes
array([ 2, 4, 0, 1, -1], dtype=int8)
This should be much faster than using replace anyway, but for reference, here is the equivalent with map and a defaultdict:
from collections import defaultdict

d = defaultdict(int)
d.update(dict(zip(keys, range(1, len(keys) + 1))))
df['borough_num'] = df['Borough'].map(d)
df
Borough borough_num
0 QUEENS 3
1 BRONX 5
2 MANHATTAN 1
3 BROOKLYN 2
4 INVALID 0
You can also use np.where:
Creating a dummy DataFrame
import numpy as np

df = pd.DataFrame({'Borough': ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND', 'BRONX', 'TEST']})
df
Borough
0 MANHATTAN
1 BROOKLYN
2 QUEENS
3 STATEN ISLAND
4 BRONX
5 TEST
Your Operation:
df['borough_num'] = df['Borough'].replace(regex=['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX'], value=[1, 2, 3, 4,5])
df
Borough borough_num
0 MANHATTAN 1
1 BROOKLYN 2
2 QUEENS 3
3 STATEN ISLAND 4
4 BRONX 5
5 TEST TEST
Replacing the values of borough_num with 0 where Borough is not in keys, using np.where:
keys = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND', 'BRONX']
df['borough_num'] = np.where(~df['Borough'].isin(keys), 0, df['borough_num'])
df
Borough borough_num
0 MANHATTAN 1
1 BROOKLYN 2
2 QUEENS 3
3 STATEN ISLAND 4
4 BRONX 5
5 TEST 0
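Note that none of the answers above handles the '07 BRONX' case from the question. One possible sketch, assuming the keys list defined above, matches each key as a substring with str.extract and maps everything else to 0:
# extract the first matching borough name; unmatched rows become NaN
pattern = '(' + '|'.join(keys) + ')'
matched = df['Borough'].str.extract(pattern, expand=False)
# map matches to 1-5, fill unmatched with 0
df['borough_num'] = matched.map({k: i + 1 for i, k in enumerate(keys)}).fillna(0).astype(int)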

How to encode a categorical variable (series) in the data frame in Python?

I have a dictionary of the following form:
{'CA': 'California', 'NV': 'Nevada', 'TX': 'Texas'}
I want to transform my data frame
{
    'state': ['California', 'California', 'Texas', 'Nevada', 'Texas'],
    'var': [100, 200, 300, 400, 500]
}
into
{
    'state': ['CA', 'CA', 'TX', 'NV', 'TX'],
    'var': [100, 200, 300, 400, 500]
}
What's the best way to do this?
If you reversed the keys and values in your dict then you can just use map:
# to swap the keys and values:
new_map = dict(zip(my_dict.values(), my_dict.keys()))
then call map:
df.state = df.state.map(new_map)
This assumes that your values are present in the map; any value that is missing will come back as NaN.
So create dataframe:
In [12]:
df = pd.DataFrame({
    'state': ['California', 'California', 'Texas', 'Nevada', 'Texas'],
    'var': [100, 200, 300, 400, 500]
})
df
Out[12]:
state var
0 California 100
1 California 200
2 Texas 300
3 Nevada 400
4 Texas 500
[5 rows x 2 columns]
your dict:
my_dict = {'CA': 'California', 'NV': 'Nevada', 'TX': 'Texas'}
reverse the keys and values
new_dict = dict(zip(my_dict.values(), my_dict.keys()))
now call map to perform the lookup and assign back to state:
In [13]:
df.state = df.state.map(new_dict)
df
Out[13]:
state var
0 CA 100
1 CA 200
2 TX 300
3 NV 400
4 TX 500
[5 rows x 2 columns]
If you are worried that some values may not exist, you can use get on the dict inside the lambda, so missing values are assigned None:
setup a new df with 'New York'
In [19]:
df = pd.DataFrame({
    'state': ['California', 'California', 'Texas', 'Nevada', 'Texas', 'New York'],
    'var': [100, 200, 300, 400, 500, 600]
})
df
Out[19]:
state var
0 California 100
1 California 200
2 Texas 300
3 Nevada 400
4 Texas 500
5 New York 600
[6 rows x 2 columns]
Now call get instead:
In [25]:
df.state = df.state.map(lambda x: new_dict.get(x))
df
Out[25]:
state var
0 CA 100
1 CA 200
2 TX 300
3 NV 400
4 TX 500
5 None 600
[6 rows x 2 columns]
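If you would rather keep the original label than get None for unmapped values, a small variant (not part of the original answer) fills back from the source column:
df.state = df.state.map(new_dict).fillna(df.state)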

ignoring hierarchical index during matrix operations

In the last statement of this routine I get a TypeError
from pandas import DataFrame, Series

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Missouri'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'items': [5, 12, 6, 45, 0]}
frame = DataFrame(data)

def summary_pivot(df, row=['state'], column=['year'], value=['items'], func=len):
    return df.pivot_table(value, rows=row, cols=column,
                          margins=True, aggfunc=func, fill_value=0)

test = summary_pivot(frame)
In [545]: test
Out[545]:
items
year 2000 2001 2002 All
state
Missouri 0 0 1 1
Nevada 0 1 0 1
Ohio 1 1 1 3
All 1 2 2 5
price = DataFrame(index=['Missouri', 'Ohio'], columns = ['price'], data = [200, 250])
In [546]: price
Out[546]:
price
Missouri 200
Ohio 250
test * price
TypeError: can only call with other hierarchical index objects
How can I get past this error, so I can multiply correctly the number of items in each state by the corresponding price?
In [659]: price = Series(index=['Missouri', 'Ohio'], data=[200, 250])
In [660]: test1 = test['items']
In [661]: test1.mul(price, axis='index')
Out[661]:
year 2000 2001 2002 All
All NaN NaN NaN NaN
Missouri 0 0 200 200
Nevada NaN NaN NaN NaN
Ohio 250 250 250 750
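For reference, rows=/cols= come from an old pandas API; in modern pandas the keywords are index=/columns=. A minimal sketch of the same computation against the modern signature, multiplying by a price Series directly:
import pandas as pd

# passing values as a string keeps flat (non-hierarchical) columns
test = frame.pivot_table('items', index='state', columns='year',
                         margins=True, aggfunc=len, fill_value=0)
price = pd.Series({'Missouri': 200, 'Ohio': 250})
test.mul(price, axis='index')  # states without a price become NaN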
