Create python dataframe based on nested loop - python

I am new to Python pandas, so sorry if this question is very easy.
I have three lists:
A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']
I want to combine those lists and show the result in a data frame.
I have tried to use list.append:
new_list = []
for i in A:
    new_list.append(i)
    for j in M:
        new_list.append(j)
print(new_list)
['A', '1', '2', '3', 'B', '1', '2', '3', 'C', '1', '2', '3']
I am confused about how to get output like this (in a dataframe):

It seems you want every possible combination of one element from each list, i.e. the Cartesian product of the three lists. You can do this with itertools and pandas. itertools is part of the Python standard library:
import pandas as pd
import itertools
A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']
df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)
Output:
A M F
0 A 1 plus
1 A 1 minus
2 A 1 square
3 A 2 plus
4 A 2 minus
5 A 2 square
6 A 3 plus
7 A 3 minus
8 A 3 square
9 B 1 plus
10 B 1 minus
11 B 1 square
12 B 2 plus
13 B 2 minus
14 B 2 square
15 B 3 plus
16 B 3 minus
17 B 3 square
18 C 1 plus
19 C 1 minus
20 C 1 square
21 C 2 plus
22 C 2 minus
23 C 2 square
24 C 3 plus
25 C 3 minus
26 C 3 square

What you need is a Cartesian product of the three sets:
import pandas as pd
from itertools import product
pd.DataFrame(list(product(A,M,F)), columns=['A', 'M', 'F'])
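A pandas-only variant of the same Cartesian product is sketched below, assuming the three lists from the question. It uses pd.MultiIndex.from_product and flattens the index into ordinary columns; the variable name df is only illustrative.

```python
import pandas as pd

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']

# Build the Cartesian product as a MultiIndex, then turn its levels into columns.
df = pd.MultiIndex.from_product([A, M, F], names=['A', 'M', 'F']).to_frame(index=False)
print(df.head())
```

The row order should match itertools.product: the last list varies fastest.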

Related

Based on given number of bins distribute column data into equal average

I have a dataframe with the data below, and the maximum number of bins to distribute it into is 3:
x  count
a  2
b  3
c  5
d  7
e  9
The sum of count is 26, and we need to distribute it into 3 bins, which averages about 8.67 per bin, so each bin should have a count close to 8 or 9:
cluster_id  group
0           {e}
1           {d,a}
2           {c,b}
I was able to figure it out: I used a bin packing solution to work on it.
https://en.wikipedia.org/wiki/Bin_packing_problem
>>> import binpacking
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df['x']= ['a','b','c','d','e']
>>> df['count']=[2,3,5,7,9]
>>> df
x count
0 a 2
1 b 3
2 c 5
3 d 7
4 e 9
>>> map_exe_count = df.set_index('x').to_dict()['count']
>>> bins = 3  # maximum number of bins, as stated above
>>> groups = binpacking.to_constant_bin_number(map_exe_count, bins)
>>> exes_per_bin = [list(group.keys()) for group in groups if len(group.keys()) > 0]
>>> exes_per_bin
[['e'], ['d', 'a'], ['c', 'b']]
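For readers without the binpacking package, the idea can be sketched with a simple greedy heuristic: take items from largest to smallest and always put the next one into the currently lightest bin. The helper greedy_bins is a hypothetical name, and this is only an approximation of what a bin packing library does, not its actual algorithm.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c', 'd', 'e'], 'count': [2, 3, 5, 7, 9]})
bins = 3

def greedy_bins(weights, n_bins):
    """Hypothetical helper: assign each key to the currently lightest bin."""
    groups = [[] for _ in range(n_bins)]
    totals = [0] * n_bins
    # Largest items first, so the small ones can even out the bins at the end.
    for key, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        lightest = totals.index(min(totals))  # first bin with the smallest total
        groups[lightest].append(key)
        totals[lightest] += weight
    return groups, totals

groups, totals = greedy_bins(df.set_index('x')['count'].to_dict(), bins)
print(groups)  # [['e'], ['d', 'a'], ['c', 'b']]
print(totals)  # [9, 9, 8]
```

On this data the greedy pass gives the same grouping as shown above; on other inputs it may differ from the library's result.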

How to append a list of lists with different lengths

I am trying to use a list of lists to add rows to a dataframe.
The error is as follows:
IndexError: invalid index to scalar variable.
The code is below:
Total_List = [[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]]
Some_List = ['0', '1', '2', '3', '4']
first_row = {'A': [0], 'B': [0], 'C': [0]}
All_Rows = pd.DataFrame(first_row)
#Optimized_Trades
for i in range(len(Some_List)):
    for j in range(len(Some_List[i])):
        df_temp = { 'A': Total_List[i][j], 'B': Total_List[i][j], 'C': Total_List[i][j]}
        All_Rows = All_Rows.append(df_temp, ignore_index = True)
All_Trades = All_Trades[1:]
display(All_Trades)
Ideally, the final output would be:
1,4,7,10,13
2,5,8,11,14
3,6,9,12,15
IIUC, you want to add each of the first, second ... nth elements of each sublist as rows of the data frame, which is equivalent to the dataframe of the transpose of the list of lists.
You don't need a for loop to do this in python.
Using zip with unpacking operator
There are a few ways to do it; one is zip with the unpacking operator *, i.e. zip(*l), which pairs up the first, second, ... elements of each sublist:
l = [[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]]
lt = pd.DataFrame(zip(*l)) #<---
print(lt)
0 1 2 3 4
0 1 4 7 10 13
1 2 5 8 11 14
2 3 6 9 12 15
Using pandas transpose
A simpler way is to build the dataframe directly from the list of lists and transpose it with .T:
l = [[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]]
lt = pd.DataFrame(l).T #<---
print(lt)
0 1 2 3 4
0 1 4 7 10 13
1 2 5 8 11 14
2 3 6 9 12 15
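Since the title mentions lists of different lengths, note that zip stops at the shortest sublist. A minimal sketch for ragged input, assuming missing positions should become NaN, uses itertools.zip_longest instead; the sample list here is made up for illustration.

```python
from itertools import zip_longest

import pandas as pd

# Hypothetical ragged input: the sublists have 3, 2 and 1 elements.
l = [[1, 2, 3], [4, 5], [6]]

# zip_longest pads the shorter sublists with None, which pandas shows as NaN.
lt = pd.DataFrame(list(zip_longest(*l)))
print(lt)
```

pd.DataFrame(l).T pads in the same way, so either form should work for ragged input.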

substitute all numbers in a matrix with equivalent letters

There is a huge matrix whose elements are numbers in the range of 1 to 15. I want to transform the matrix to the one whose elements be letters such that 1 becomes "a", 2 becomes "b", and so on. As a simple example:
import pandas as pd
import numpy as np, numpy.random
numpy.random.seed(1)
A = pd.DataFrame (np.random.randint(1,16,10).reshape(2,5))
# A 0 1 2 3 4
# 0 6 12 13 9 10
# 1 12 6 1 1 2
The expected output is
# B 0 1 2 3 4
# 0 f l m i j
# 1 l f a a b
I can do it with a loop but for a huge matrix, it doesn't seem logical. There should be a more pythonic way to do it. In R, chartr is the function for such a replacement. For the numbers between 1 to 9, it works like this: chartr("123456789", "ABCDEFGHI", A). What is the equivalent in Python?
You can use chr:
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(1)
>>> df = pd.DataFrame(np.random.randint(1, 16, 10).reshape(2, 5))
>>> df
0 1 2 3 4
0 6 12 13 9 10
1 12 6 1 1 2
>>> df = df.applymap(lambda n: chr(n + 96))
>>> df
0 1 2 3 4
0 f l m i j
1 l f a a b
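This is one way. If possible, I would advise against using lambda with pandas apply or applymap, as these loop in Python and add overhead. Note that this version maps to uppercase letters; use string.ascii_lowercase instead for lowercase output.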
import pandas as pd
import numpy as np
import string
np.random.seed(1)
A = pd.DataFrame(np.random.randint(1,16,10).reshape(2,5))
# 0 1 2 3 4
# 0 6 12 13 9 10
# 1 12 6 1 1 2
d = dict(enumerate(string.ascii_uppercase, 1))
A_mapped = pd.DataFrame(np.vectorize(d.get)(A.values))
# 0 1 2 3 4
# 0 F L M I J
# 1 L F A A B
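Another option is to index a NumPy array of letters with the matrix itself, which avoids a Python-level function call per element. This is a minimal sketch, assuming every value is an integer between 1 and 26; the matrix is typed in by hand from the example above rather than generated randomly.

```python
import string

import numpy as np
import pandas as pd

A = pd.DataFrame([[6, 12, 13, 9, 10], [12, 6, 1, 1, 2]])

# letters[0] is 'a', so subtract 1 to turn the values 1..26 into positions 0..25.
letters = np.array(list(string.ascii_lowercase))
B = pd.DataFrame(letters[A.to_numpy() - 1], index=A.index, columns=A.columns)
print(B)
```

Values outside 1 to 26 would either raise an IndexError or wrap around silently (0 maps to 'z'), so validate the range first if that can happen.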

Remove rows having different consecutive values in dataframe using Pandas

I have the following dataframe:
import pandas as pd
df = pd.DataFrame({"A":['a', 's', 'd', 'f', 'g', 'h', 'j', 'k', 'l'], "M":[11,4,9,2,2,5,5,6,6]})
My goal is to remove every row whose value in column M differs from both the row before and the row after it.
Therefore rows 0, 1 and 2 should be removed, because the values of M are 11 != 4, 4 != 9 and 9 != 2. However, rows that share a value with an adjacent row must be kept: rows 3 and 4 must be kept because they both have value 2. The same reasoning applies to rows 5 and 6, which have value 5.
I was able to reach my goal by using the following lines of code:
l=[]
for i, row in df.iterrows():
    try:
        if df["M"].iloc[i]!=df["M"].iloc[i+1] and df["M"].iloc[i]!=df["M"].iloc[i-1]:
            l.append(i)
    except:
        pass
df = df.drop(df.index[l]).reset_index(drop=True)
Can you suggest a smarter and better way to achieve my goal? maybe by using some built-in pandas function?
Here is what the dataframe should look like:
Before:
A M
0 a 11 <----Must be removed
1 s 4 <----Must be removed
2 d 9 <----Must be removed
3 f 2
4 g 2
5 h 5
6 j 5
7 k 6
8 l 6
After
A M
0 f 2
1 g 2
2 h 5
3 j 5
4 k 6
5 l 6
Use boolean indexing with masks created by shift:
m = (df["M"].eq(df["M"].shift()) | df["M"].eq(df["M"].shift(-1)))
#alternative
#m = ~(df["M"].ne(df["M"].shift()) & df["M"].ne(df["M"].shift(-1)))
print (df[m])
A M
3 f 2
4 g 2
5 h 5
6 j 5
7 k 6
8 l 6
By using diff
df.loc[df.M.isin(df[df.M.diff()==0].M),:]
Out[140]:
A M
3 f 2
4 g 2
5 h 5
6 j 5
7 k 6
8 l 6
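Note that the previous one may not work: when M is 1,1,2,1,3,4, the isolated 1 is also kept, because isin matches by value. This version collects the index labels of the matching rows instead (assuming the default integer index):
m=df[df.M.diff()==0].index.values.tolist()
m.extend([x-1 for x in m])
df.loc[sorted(set(m))]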
Another nice answer from MaxU :
df.loc[df.M.diff().eq(0) | df.M.diff(-1).eq(0)]
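The same rule can also be written as "keep only runs of equal values longer than one row". The sketch below labels each run with a cumulative sum and filters on run length; run_id and run_len are illustrative names that do not come from the answers above.

```python
import pandas as pd

df = pd.DataFrame({"A": ['a', 's', 'd', 'f', 'g', 'h', 'j', 'k', 'l'],
                   "M": [11, 4, 9, 2, 2, 5, 5, 6, 6]})

# A new run starts wherever M differs from the previous row.
run_id = df["M"].ne(df["M"].shift()).cumsum()
# Length of the run each row belongs to.
run_len = df.groupby(run_id)["M"].transform("size")
result = df[run_len > 1]
print(result)

# The tricky case mentioned above: only the first two rows form a run.
tricky = pd.DataFrame({"M": [1, 1, 2, 1, 3, 4]})
tricky_runs = tricky["M"].ne(tricky["M"].shift()).cumsum()
tricky_result = tricky[tricky.groupby(tricky_runs)["M"].transform("size") > 1]
print(tricky_result)
```

Because it works on runs rather than values, a later isolated repeat of a kept value is still dropped.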

Replicating rows in a pandas data frame by a column value [duplicate]

I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
'id': ['A', 'B', 'B', 'C', 'C', 'C'],
'v' : [ 10, 13, 13, 8, 8, 8]
})
Is this possible?
You can use Index.repeat to get repeated index values based on the column, then select from the DataFrame (df here is the question's what_i_have):
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
f.id, f.n, f.v,
'A', 1, 10,
'B', 2, 13,
'C', 3, 8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
Not the best solution, but I want to share this: you could also use DataFrame.reindex() with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the .index.
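Putting the pieces together, the following self-contained sketch starts from the question's what_i_have and checks the result against what_i_want. It combines Index.repeat, drop and reset_index as described in the answers; the name result is illustrative.

```python
import pandas as pd

what_i_have = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n':  [1, 2, 3],
    'v':  [10, 13, 8],
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'v':  [10, 13, 13, 8, 8, 8],
})

result = (
    what_i_have
    .loc[what_i_have.index.repeat(what_i_have['n'])]  # repeat each row n times
    .drop(columns='n')                                # the counter is no longer needed
    .reset_index(drop=True)                           # replace repeated labels with 0..5
)
print(result.equals(what_i_want))
```

As noted above, .loc selects by label, so this assumes the index has no duplicate labels; otherwise use the .iloc variant.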
