Divide DataFrame into 2 DataFrames using index - Python

I need to divide my DataFrame into 2 DataFrames based on their index:
Df1 with this index: [5, 15, 22, 23, 24]
Df2 with this index: [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
I am unable to find a solution. Any help would be appreciated.

If the input is a list of index values, you can use Index.isin with boolean indexing (this also works correctly if some values do not exist in the original index):
idx = [5, 15, 22, 23, 24]
mask = df.index.isin(idx)
df1 = df[mask]
df2 = df[~mask]
A solution with DataFrame.loc is also possible (the trailing : is not required), but then all values must exist in the original index:
L1 = [5, 15, 22, 23, 24]
L2 = [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20,
21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
df1 = df.loc[L1]
df2 = df.loc[L2]
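A small added sketch (not from the original answer): the complement can also be obtained with DataFrame.drop, which avoids writing out the long second list; errors='ignore' makes it tolerant of labels missing from the index:
idx = [5, 15, 22, 23, 24]
df1 = df.loc[df.index.isin(idx)]      # rows whose index label is in idx
df2 = df.drop(idx, errors='ignore')   # all remaining rows; missing labels are ignored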

You can use .loc:
df_1 = df.loc[[5, 15, 22, 23, 24], :]
df_2 = df.loc[[0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54], :]
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
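If you prefer not to type the long second list by hand, it can be derived from the first with Index.difference. A small added sketch (not part of the original answer), assuming the same df:
keep = [5, 15, 22, 23, 24]
rest = df.index.difference(keep)   # every remaining index label (returned sorted)
df_1 = df.loc[keep, :]
df_2 = df.loc[rest, :]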

Related

Converting Int64Index to list or accessing list of lists

I have a list of lists but because of the Int64Index I cannot access it. Is there a way to access individual values or make it into a normal list?
data_exp = pd.read_csv(path + '/exp.csv')
exp_list = []
for i in range(1, n + 1):
    check = data_exp.apply(lambda x: True if x['Set No.'] == i else False, axis=1)
    temp = [data_exp[check == True].index + 1]
    exp_list.append(temp)
    del temp
display(exp_list)
The for loop just sorts values based on a condition. The output is correct, but the format is problematic.
It gives me output as follows:
[[Int64Index([8, 11, 17, 20, 21, 27, 29, 36, 37, 38], dtype='int64')],
[Int64Index([1, 3, 7, 10, 14, 31, 33, 34, 35], dtype='int64')],
[Int64Index([5, 9, 12, 15, 19, 23, 25, 26, 28, 32], dtype='int64')],
[Int64Index([2, 4, 6, 13, 16, 18, 22, 24, 30, 39, 40], dtype='int64')]]
Thanks in advance
I'm not quite sure what you're doing to get the list of Int64Indexes, but you can access the numpy array underlying the index with the values property:
from pandas import Int64Index
l = [[Int64Index([8, 11, 17, 20, 21, 27, 29, 36, 37, 38], dtype='int64')],
[Int64Index([1, 3, 7, 10, 14, 31, 33, 34, 35], dtype='int64')],
[Int64Index([5, 9, 12, 15, 19, 23, 25, 26, 28, 32], dtype='int64')],
[Int64Index([2, 4, 6, 13, 16, 18, 22, 24, 30, 39, 40], dtype='int64')]]
print(l[0][0].values[0])
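If the goal is a plain Python list rather than a single element, Index.tolist() does the conversion. A minimal added sketch, reusing the same l as above:
plain_lists = [item[0].tolist() for item in l]   # each inner Int64Index becomes an ordinary list of ints
print(plain_lists[0])   # [8, 11, 17, 20, 21, 27, 29, 36, 37, 38]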

Multiply factors from nested lists

I am trying to multiply factors from the first nested list with the second nested list. The result I get is [0,0,0,0]. Help appreciated.
Faktorer = [[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10]]
res = []
for Faktorer[0][i] in Faktorer:
    for Faktorer[1][j] in Faktorer:
        res.append(i*j)
print(res)
Using a nested loop is right, but your syntax is all mixed up.
>>> Faktorer = [[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10]]
>>> res = []
>>> for i in Faktorer[0]:
...     for j in Faktorer[1]:
...         res.append(i * j)
...
>>> res
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
You can also do this as a list comprehension like this:
>>> [i * j for i in Faktorer[0] for j in Faktorer[1]]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
If Faktorer had an arbitrary number of sublists (making it impossible to do this with a fixed number of nested loops), or if you just didn't want to use nested loops, you could use product to generate the combinations of all the factors and reduce to multiply them:
>>> from functools import reduce
>>> from itertools import product
>>> [reduce(int.__mul__, f, 1) for f in product(*Faktorer)]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
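A minor added variant, not from the original answer: on Python 3.8+ math.prod can replace the reduce call and reads a little more directly:
>>> from itertools import product
>>> from math import prod
>>> [prod(f) for f in product(*Faktorer)]   # same output as the reduce version above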
What you described is a Kronecker product. If you are allowed to use libraries, you could use numpy.kron:
import numpy as np
res = list(np.kron(Faktorer[0], Faktorer[1]))
print(res)
OUTPUT:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
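As an added side note (not part of the original answer): for two 1-D inputs like these, numpy.outer followed by ravel gives the same flattened products and may read more directly:
import numpy as np
res = list(np.outer(Faktorer[0], Faktorer[1]).ravel())
print(res)   # same output as the np.kron version above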

Select values from two different datasets in Python

I am having trouble dealing with my 2 datasets; let me explain my problem.
I have 2 different datasets:
training_df = pd.read_csv('.../train.csv')
test_df = pd.read_csv('.../test.csv')
I have to take values from some columns of train.csv and other columns of test.csv. I tried it like this:
num_attrib = pd.DataFrame(training_df, columns=[0, 2, 3, 15, 16, 17, 18, 24, 32, 34, 35, 36, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 54, 57, 59, 60, 64, 65, 66, 67, 68, 69, 70, 71, 72])
cat_attrib = pd.DataFrame(training_df, columns=[1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 33, 37, 38, 39, 40, 51, 53, 55, 56, 58, 61, 62, 63, 73, 74])
num_attrib_test = pd.DataFrame(test_df, columns=[0, 2, 3, 15, 16, 17, 18, 24, 32, 34, 35, 36, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 54, 57, 59, 60, 64, 65, 66, 67, 68, 69, 70, 71, 72])
cat_attrib_test = pd.DataFrame(test_df, columns=[1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 33, 37, 38, 39, 40, 51, 53, 55, 56, 58, 61, 62, 63, 73, 74])
Both datasets have numerical and categorical data. I have to select and separate the categorical from the numerical columns for each dataset, but my approach is wrong.
I need this because I have to apply ColumnTransformer() to training_df and test_df.
Any suggestions?
Thank you so much.
You are looking for iloc. See the documentation for DataFrame.iloc.
num_attrib = training_df.iloc[:,[0,2,3,...,15]]
You can also slice:
#even columns
num_attrib = training_df.iloc[:, ::2]
#odd columns
num_attrib = training_df.iloc[:, 1::2]
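Since the stated goal is ColumnTransformer(), it is worth adding that it accepts integer column positions directly, so splitting the DataFrames by hand may not be necessary at all. A hedged sketch, assuming scikit-learn and using StandardScaler/OneHotEncoder as placeholder transformers (index lists shortened here):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = [0, 2, 3, 15, 16]   # positional indices of the numeric columns (shortened)
cat_cols = [1, 4, 5, 6, 7]     # positional indices of the categorical columns (shortened)

# Integer positions apply to both DataFrames as long as the column order matches.
preprocess = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])

X_train = preprocess.fit_transform(training_df)
X_test = preprocess.transform(test_df)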

Reading formatted array from file in Python

I have a file which contains some strings and then two formatted arrays. It looks something like this
megabuck
Hello world
[58, 50, 42, 34, 26, 18, 10, 2,
61, 53, 45, 37, 29, 21, 13, 5,
63, 55, 47, 39, 31, 23, 15, 7]
[57, 49, 41, 33, 25, 17, 9,
1, 58, 50, 42, 34, 26, 18,
14, 6, 61, 53, 45, 37, 29,
21, 13, 5, 28, 20, 12, 4]
I don't know the size of the arrays beforehand. The only thing I know is the delimiter for the arrays, which is []. What would be an elegant way to read the arrays?
I am a newbie in Python.
Use regex with re.findall.
Example:
import re
import ast

with open(filename) as infile:
    data = infile.read()

for i in re.findall(r"(\[.*?\])", data, flags=re.S):
    print(ast.literal_eval(i))
Output:
[58, 50, 42, 34, 26, 18, 10, 2, 61, 53, 45, 37, 29, 21, 13, 5, 63, 55, 47, 39, 31, 23, 15, 7]
[57, 49, 41, 33, 25, 17, 9, 1, 58, 50, 42, 34, 26, 18, 14, 6, 61, 53, 45, 37, 29, 21, 13, 5, 28, 20, 12, 4]
I wouldn't call it elegant, but it works:
ars = """
megabuck
Hello world
[58, 50, 42, 34, 26, 18, 10, 2,
61, 53, 45, 37, 29, 21, 13, 5,
63, 55, 47, 39, 31, 23, 15, 7]
[57, 49, 41, 33, 25, 17, 9,
1, 58, 50, 42, 34, 26, 18,
14, 6, 61, 53, 45, 37, 29,
21, 13, 5, 28, 20, 12, 4]
"""
arrays = []
for a in ars.split("["):
    if ']' in a:
        arrays.append([i.strip() for i in a.replace("]", '').split(',')])
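One added caveat: this second approach yields lists of strings, not integers; if numbers are wanted, a small conversion step can follow:
int_arrays = [[int(x) for x in arr] for arr in arrays]   # convert each element to int
print(int_arrays[0][:5])   # [58, 50, 42, 34, 26]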

seaborn heatmap color scheme based on row values

I have a dataframe, reproduced partly as such:
import pandas as pd
import numpy as np
tab = pd.DataFrame(np.array([[ 46, 39, 25, 29, 21, 12, 33, 32, 70, 109, 144, 158, 161,
184, 163, 113, 117, 82, 76, 88, 77, 76, 64, 35],
[ 39, 33, 29, 29, 26, 14, 25, 33, 60, 83, 126, 117, 111,
148, 141, 104, 92, 75, 78, 74, 63, 67, 52, 39],
[ 30, 27, 14, 11, 20, 17, 21, 31, 48, 62, 83, 78, 88,
90, 80, 67, 53, 61, 47, 54, 50, 48, 35, 26],
[ 30, 24, 19, 15, 17, 10, 12, 18, 34, 69, 88, 79, 109,
95, 89, 82, 53, 46, 53, 57, 39, 41, 26, 29],
[ 37, 31, 18, 12, 30, 13, 15, 19, 51, 61, 74, 81, 77,
100, 96, 74, 60, 57, 42, 48, 43, 40, 29, 25],
[ 14, 8, 14, 11, 13, 7, 9, 15, 42, 49, 50, 44, 53,
42, 31, 31, 30, 27, 33, 25, 27, 17, 20, 17],
[ 10, 15, 6, 10, 15, 11, 7, 18, 28, 43, 49, 37, 41,
33, 37, 32, 26, 28, 19, 24, 19, 19, 13, 18],
[ 9, 9, 8, 12, 7, 11, 4, 8, 14, 15, 23, 30, 29,
34, 25, 39, 22, 20, 15, 23, 12, 19, 14, 13],
[ 0, 3, 4, 1, 1, 0, 3, 4, 4, 5, 3, 5, 6,
7, 3, 3, 6, 4, 2, 3, 3, 2, 2, 2],
[ 3, 0, 1, 0, 0, 0, 1, 1, 4, 8, 2, 4, 7,
2, 2, 9, 3, 5, 1, 5, 2, 0, 4, 1]]), index =
['Stadsdeel Zuid', 'Stadsdeel West', 'Stadsdeel Nieuw-West',
'Stadsdeel Centrum', 'Stadsdeel Oost', 'Stadsdeel Noord',
'Wijk 00 Amstelveen', 'Stadsdeel Zuidoost', 'Wijk 00',
'Wijk 00 Aalsmeer'])
and I created a heatmap as follows:
import seaborn as sns

ax = sns.heatmap(tab, linewidths=.5, robust=True, annot_kws={'size': 14})
ax.tick_params(labelsize=14)
ax.figure.set_size_inches((12, 10))
I would like the values that anchor the colormap to be based on the min-max values per row, so that rows with lower values are also clearly visible. (In reality the table contains many more rows with low values that the heatmap barely shows color-wise.)
How can I achieve this?
I would normalize the tab rows by the maximum value in each row with:
tab_n = tab.div(tab.max(axis=1), axis=0)
where tab_n is the normalized tab, with values in the range [0, 1]. Hope that helps. Plotting tab_n should return a heatmap in which every row spans the full color range.
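For reference, a minimal added plotting sketch that follows the question's own settings (seaborn and matplotlib assumed):
import matplotlib.pyplot as plt
import seaborn as sns

tab_n = tab.div(tab.max(axis=1), axis=0)   # row-wise normalization from the answer

ax = sns.heatmap(tab_n, linewidths=.5, robust=True, annot_kws={'size': 14})
ax.tick_params(labelsize=14)
ax.figure.set_size_inches((12, 10))
plt.show()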
